# Deployment Flow Diagram

**Document Version:** 1.0
**Last Updated:** 2025-12-10

## End-to-End Deployment Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Bare-Metal to Running Cluster │
│ Timeline: ~60-90 minutes │
└─────────────────────────────────────────────────────────────────────────────┘

PHASE 1: PREPARATION (Day 0) Timeline: 2-4 hours
═══════════════════════════════════════════════════════════════════════════

┌─────────────────┐
│ Deploy PXE │ • Install DHCP, TFTP, HTTP servers
│ Server │ • Configure boot infrastructure
│ (T032.S2) │ • Validate services
└────────┬────────┘
│ Complete
v
┌─────────────────┐
│ Build Netboot │ • Build control-plane, worker, all-in-one images
│ Images │ • Copy bzImage + initrd to HTTP server
│ (T032.S3) │ • Verify accessibility
└────────┬────────┘
│ Complete
v
┌─────────────────┐
│ Generate TLS │ • Create CA certificate
│ Certificates │ • Generate per-node certificates
│ │ • Place in node secrets directories
└────────┬────────┘
│ Complete
v
┌─────────────────┐
│ Prepare Node │ • Create configuration.nix per node
│ Configurations │ • Define disko layouts
│ │ • Create cluster-config.json
└────────┬────────┘
│ Ready
v

PHASE 2: PXE BOOT (T+0 minutes) Timeline: 2-5 minutes
═══════════════════════════════════════════════════════════════════════════

┌──────────────┐
│ Power On │ Via BMC/IPMI or physical
│ Bare-Metal │ • Set boot device to PXE
│ Server │ • Power on
└──────┬───────┘
│
v
┌──────────────┐ ┌─────────────────────────────────────────────┐
│ BIOS/UEFI │────>│ Network Boot ROM │
│ POST │ │ • Sends DHCP DISCOVER │
└──────────────┘ │ • Receives IP address (10.0.100.50) │
│ • Receives TFTP server IP (next-server) │
│ • Receives boot filename (Option 67) │
└────────────────┬────────────────────────────┘
│
v
┌────────────────────────────────────────────┐
│ TFTP Download │
│ • Downloads undionly.kpxe (BIOS) or │
│ ipxe.efi (UEFI) │
│ • ~100 KB, ~5 seconds │
└────────────────┬───────────────────────────┘
│
v
┌────────────────────────────────────────────┐
│ iPXE Loads │
│ • Sends second DHCP request │
│ (with user-class=iPXE) │
│ • Receives HTTP boot script URL │
└────────────────┬───────────────────────────┘
│
v
┌────────────────────────────────────────────┐
│ HTTP Download boot.ipxe │
│ • Downloads boot script (~5 KB) │
│ • Executes script │
│ • Displays menu or auto-selects profile │
└────────────────┬───────────────────────────┘
│
v
┌────────────────────────────────────────────┐
│ HTTP Download Kernel + Initrd │
│ • Downloads bzImage (~10-30 MB) │
│ • Downloads initrd (~100-300 MB) │
│ • Total: 1-2 minutes on 1 Gbps link │
└────────────────┬───────────────────────────┘
│
v
┌────────────────────────────────────────────┐
│ Boot into NixOS Installer │
│ • Boots kernel from RAM │
│ • Mounts squashfs Nix store │
│ • Starts sshd on port 22 │
│ • Acquires DHCP lease again │
│ Timeline: ~30-60 seconds │
└────────────────┬───────────────────────────┘
│
v
┌────────────────┐
│ NixOS Installer │
│ Running in RAM │
│ SSH Ready │
└────────────────┘

PHASE 3: INSTALLATION (T+5 minutes) Timeline: 30-60 minutes
═══════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────┐
│ Provisioning Workstation │
│ (Human operator or automation system) │
└───────────────────────────────┬─────────────────────────────────────┘
│
v
┌─────────────────────────────┐
│ Execute nixos-anywhere │
│ --flake #node01 │
│ root@10.0.100.50 │
└──────────────┬──────────────┘
│
┌────────────────┴────────────────┐
│ SSH Connection Established │
│ • Transfers disko configuration│
│ • Transfers NixOS configuration│
│ • Transfers secrets │
└────────────────┬────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Step 1: Disk Partitioning (disko) │
│ • Detects disk (/dev/sda or /dev/nvme0n1) │
│ • Wipes existing partitions │
│ • Creates GPT partition table │
│ • Creates ESP (1 GB) and root partitions │
│ • Formats filesystems (vfat, ext4) │
│ • Mounts to /mnt │
│ Timeline: ~1-2 minutes │
└────────────────┬────────────────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Step 2: Build NixOS System │
│ • Evaluates flake configuration │
│ • Downloads packages from binary cache │
│ (cache.nixos.org or local cache) │
│ • Builds custom packages if needed │
│ • Creates system closure │
│ Timeline: ~10-30 minutes (depends on cache)│
└────────────────┬────────────────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Step 3: Install System to Disk │
│ • Copies Nix store to /mnt/nix/store │
│ • Creates /etc/nixos/configuration.nix │
│ • Copies secrets to /mnt/etc/nixos/secrets│
│ • Sets file permissions (0600 for keys) │
│ • Installs bootloader (GRUB or systemd-boot)│
│ Timeline: ~5-10 minutes │
└────────────────┬────────────────────────────┘
│
v
┌─────────────────────────────────────────────┐
│ Step 4: Finalize and Reboot │
│ • Unmounts filesystems │
│ • Syncs disk writes │
│ • Triggers reboot │
│ Timeline: ~10 seconds │
└────────────────┬────────────────────────────┘
│
v
┌───────────────┐
│ Server Reboots│
│ from Disk │
└───────────────┘

PHASE 4: FIRST BOOT (T+40 minutes) Timeline: 5-10 minutes
═══════════════════════════════════════════════════════════════════════════

┌──────────────┐
│ BIOS/UEFI │ • Boot from disk (no longer PXE)
│ POST │ • Loads GRUB or systemd-boot
└──────┬───────┘
│
v
┌──────────────────────────────────────────┐
│ GRUB/systemd-boot │
│ • Loads NixOS kernel from /boot │
│ • Loads initrd │
│ • Boots with init=/nix/store/.../init │
└──────┬───────────────────────────────────┘
│
v
┌──────────────────────────────────────────┐
│ NixOS Stage 1 (initrd) │
│ • Mounts root filesystem │
│ • Switches to stage 2 │
└──────┬───────────────────────────────────┘
│
v
┌──────────────────────────────────────────┐
│ NixOS Stage 2 (systemd) │
│ • Starts systemd as PID 1 │
│ • Mounts additional filesystems │
│ • Starts network services │
│ • Configures network interfaces │
│ (eth0: 10.0.100.50, eth1: 10.0.200.10)│
└──────┬───────────────────────────────────┘
│
v
┌──────────────────────────────────────────────────────────────┐
│ Service Startup (systemd targets) │
│ • multi-user.target │
│ └─ network-online.target │
│ └─ chainfire.service ───────────┐ │
│ └─ flaredb.service ──────────┼───────┐ │
│ └─ iam.service ───────────┼───────┼──────┐ │
│ └─ plasmavmc.service ───┼───────┼──────┼───┐ │
│ v v v v │
│ (Services start in dependency order) │
└──────────────────────────────────────────────────────────────┘
│
v
┌──────────────────────────────────────────────────────────────┐
│ First-Boot Automation (T032.S4) │
│ • chainfire-cluster-join.service starts │
│ └─ Waits for chainfire.service to be healthy │
│ └─ Reads /etc/nixos/secrets/cluster-config.json │
│ └─ If bootstrap=true: Cluster forms automatically │
│ └─ If bootstrap=false: POSTs to leader /admin/member/add │
│ └─ Creates marker file: .chainfire-joined │
│ • flaredb-cluster-join.service starts (after chainfire) │
│ • iam-initial-setup.service starts (after flaredb) │
│ Timeline: ~2-5 minutes │
└──────────────────────────────────────────────────────────────┘
│
v
┌──────────────────────────────────────────────────────────────┐
│ Cluster Health Validation │
│ • cluster-health-check.service runs │
│ └─ Checks Chainfire cluster has quorum │
│ └─ Checks FlareDB cluster has quorum │
│ └─ Checks IAM service is reachable │
│ └─ Checks all health endpoints return 200 OK │
│ Timeline: ~1-2 minutes │
└──────────────────────────────────────────────────────────────┘
│
v
┌──────────────────┐
│ RUNNING CLUSTER │ ✓ All services healthy
│ ✓ Raft quorum │ ✓ TLS enabled
│ ✓ API accessible│ ✓ Ready for workloads
└──────────────────┘

PHASE 5: VALIDATION (T+50 minutes) Timeline: 5 minutes
═══════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────┐
│ Operator Validation │
│ (Human operator or CI/CD pipeline) │
└────────────────────────────┬─────────────────────────────────┘
│
v
┌────────────────────────────────────┐
│ Check Cluster Membership │
│ curl -k https://node01:2379/... │
│ Expected: 3 members, 1 leader │
└────────────────┬───────────────────┘
│
v
┌────────────────────────────────────┐
│ Check Service Health │
│ curl -k https://node01:2379/health│
│ curl -k https://node01:2479/health│
│ curl -k https://node01:8080/health│
│ Expected: all return status=healthy│
└────────────────┬───────────────────┘
│
v
┌────────────────────────────────────┐
│ Test Write/Read │
│ PUT /v1/kv/test │
│ GET /v1/kv/test │
│ Expected: data replicated │
└────────────────┬───────────────────┘
│
v
┌────────────────────────────────────┐
│ DEPLOYMENT COMPLETE │
│ Cluster operational │
└────────────────────────────────────┘
```
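The first-boot join step in Phase 4 can be sketched as a small decision function. This is a minimal illustration, not the project's implementation: the helper name `plan_join`, the returned dictionaries, and the rule that an existing marker file means "skip" are assumptions. The config keys mirror the `cluster-config.json` example in this document, and the `/admin/member/add` path is taken from the diagram.

```python
import json


def plan_join(config: dict, already_joined: bool) -> dict:
    """Decide what a first-boot service would do (hypothetical helper)."""
    if already_joined:
        # Assumption: the .chainfire-joined marker makes later boots a no-op.
        return {"action": "skip"}
    if config["bootstrap"]:
        # Bootstrap nodes form the cluster on their own; no join request is sent.
        return {"action": "bootstrap"}
    # Non-bootstrap nodes ask the leader to add them as a member.
    return {
        "action": "join",
        "url": config["leader_url"].rstrip("/") + "/admin/member/add",
        "body": json.dumps(
            {"id": config["node_id"], "raft_addr": config["raft_addr"]}
        ),
    }


# Example values copied from the node04 cluster-config.json in this document.
node04 = {
    "node_id": "node04",
    "bootstrap": False,
    "leader_url": "https://node01.example.com:2379",
    "raft_addr": "10.0.200.13:2380",
}
plan = plan_join(node04, already_joined=False)
print(plan["action"], plan["url"])
```

Keeping the decision separate from the HTTP call makes the branch logic easy to check without a running cluster; a real service would then send `body` to `url` and write the marker file only after the request succeeds.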

## Multi-Node Bootstrap Flow

```
┌─────────────────────────────────────────────────────────────────────────┐
│ Simultaneous 3-Node Bootstrap (Recommended) │
└─────────────────────────────────────────────────────────────────────────┘

T+0: Power on all 3 nodes simultaneously
═══════════════════════════════════════════════════════════════════════════

Node01: 10.0.100.50 Node02: 10.0.100.51 Node03: 10.0.100.52
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ PXE Boot │ │ PXE Boot │ │ PXE Boot │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
v v v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Installer │ │ Installer │ │ Installer │
│ Ready │ │ Ready │ │ Ready │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘

T+5: Run nixos-anywhere in parallel
═══════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────────┐
│ Provisioning Workstation │
│ for node in node01 node02 node03; do │
│ nixos-anywhere --flake #$node root@<ip> & │
│ done │
│ wait │
└─────────────────────────────────────────────────────────────────────────┘
│ │ │
v v v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Install │ │ Install │ │ Install │
│ node01 │ │ node02 │ │ node03 │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ ~30-60 min │ ~30-60 min │ ~30-60 min
v v v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Reboot │ │ Reboot │ │ Reboot │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘

T+40: First boot and cluster formation
═══════════════════════════════════════════════════════════════════════════

│ │ │
v v v
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Chainfire │ │ Chainfire │ │ Chainfire │
│ starts │ │ starts │ │ starts │
│ (bootstrap) │ │ (bootstrap) │ │ (bootstrap) │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└────────────┬───────────────┴───────────────┬────────────┘
│ Raft leader election │
│ (typically <10 seconds) │
v v
┌──────────┐ ┌──────────┐
│ Leader │◄─────────────────│ Follower │
│ Elected │──────────────────│ │
└────┬─────┘ └──────────┘
│
v
┌─────────────────────┐
│ 3-Node Raft Cluster│
│ - node01: leader │
│ - node02: follower │
│ - node03: follower │
└─────────────────────┘

T+45: FlareDB and other services join
═══════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────────┐
│ All nodes: FlareDB, IAM, PlasmaVMC, ... start │
│ • FlareDB forms its own Raft cluster (depends on Chainfire) │
│ • IAM starts (depends on FlareDB) │
│ • Other services start in parallel │
└─────────────────────────────────────────────────────────────────────────┘

T+50: Cluster fully operational
═══════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────────┐
│ 3-Node Production Cluster │
│ • Chainfire: 3 members, quorum achieved │
│ • FlareDB: 3 members, quorum achieved │
│ • IAM: 3 instances (stateless, uses FlareDB backend) │
│ • All services healthy and accepting requests │
└─────────────────────────────────────────────────────────────────────────┘
```
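The parallel install at T+5 can also be driven from a script. The sketch below mirrors the `for ... &` / `wait` loop in the diagram; it is a hedged example, not the project's launch script. The helper names `build_command` and `install_all` are hypothetical, the node addresses are the ones shown above, and the flake location (`.`, the current directory) is an assumption because the diagram shows only `#node01`.

```python
import subprocess

# Install-time addresses as shown in the multi-node diagram.
NODES = {
    "node01": "10.0.100.50",
    "node02": "10.0.100.51",
    "node03": "10.0.100.52",
}


def build_command(node: str, ip: str, flake: str = ".") -> list:
    """Return the nixos-anywhere argv for one node.

    `flake` is an assumption: adjust it to wherever the flake actually lives.
    """
    return ["nixos-anywhere", "--flake", f"{flake}#{node}", f"root@{ip}"]


def install_all(nodes: dict, dry_run: bool = True) -> dict:
    """Start one install per node, then wait for all of them."""
    commands = {node: build_command(node, ip) for node, ip in nodes.items()}
    if dry_run:
        # Only show what would run; nothing is executed.
        for argv in commands.values():
            print(" ".join(argv))
        return {node: 0 for node in commands}
    # Launch every install first (like `&`), then collect exit codes (like `wait`).
    procs = {node: subprocess.Popen(argv) for node, argv in commands.items()}
    return {node: proc.wait() for node, proc in procs.items()}


results = install_all(NODES)  # dry run by default
```

Passing `dry_run=False` would start the three installs concurrently and return each exit code, so a failed node can be retried on its own.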

## Adding Node to Existing Cluster

```
┌─────────────────────────────────────────────────────────────────────────┐
│ Add Node04 to Running 3-Node Cluster │
└─────────────────────────────────────────────────────────────────────────┘

Existing Cluster (node01, node02, node03)
┌───────────────────────────────────────────────────────────┐
│ Chainfire: 3 members, leader=node01 │
│ FlareDB: 3 members, leader=node02 │
│ All services healthy │
└───────────────────────────────────────────────────────────┘

T+0: Prepare node04 configuration
═══════════════════════════════════════════════════════════════════════════

┌──────────────────────────────────────────────────────────────┐
│ Create configuration.nix with bootstrap=false │
│ cluster-config.json: │
│ { │
│ "node_id": "node04", │
│ "bootstrap": false, │
│ "leader_url": "https://node01.example.com:2379", │
│ "raft_addr": "10.0.200.13:2380" │
│ } │
└──────────────────────────────────────────────────────────────┘

T+5: Power on node04, PXE boot, install
═══════════════════════════════════════════════════════════════════════════

┌──────────────┐
│ node04 │
│ PXE Boot │ (same as bootstrap nodes)
└──────┬───────┘
│
v
┌──────────────┐
│ Installer │
│ Ready │
└──────┬───────┘
│
v
┌──────────────┐
│ nixos- │
│ anywhere │ nixos-anywhere --flake #node04 root@10.0.100.60
│ runs │
└──────┬───────┘
│ ~30-60 min
v
┌──────────────┐
│ Reboot │
└──────┬───────┘

T+40: First boot and cluster join
═══════════════════════════════════════════════════════════════════════════

│
v
┌──────────────────────────────────────────┐
│ node04 boots │
│ • Chainfire starts (no bootstrap) │
│ • First-boot service runs │
│ └─ Detects bootstrap=false │
│ └─ POSTs to node01:2379/admin/member/add│
│ {"id":"node04","raft_addr":"10.0.200.13:2380"}│
└──────────────────┬───────────────────────┘
│
v
┌──────────────────────────────────────────┐
│ Existing Cluster (node01=leader) │
│ • Receives join request │
│ • Validates node04 │
│ • Adds to Raft member list │
│ • Starts replicating to node04 │
└──────────────────┬───────────────────────┘
│
v
┌──────────────────────────────────────────┐
│ node04 becomes follower │
│ • Receives cluster state from leader │
│ • Starts participating in Raft │
│ • Accepts write replication │
└──────────────────────────────────────────┘

T+45: Cluster expanded to 4 nodes
═══════════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────────────┐
│ 4-Node Cluster │
│ • Chainfire: 4 members (node01=leader, node02-04=followers) │
│ • FlareDB: 4 members (similar join process) │
│ • Quorum: 3 of 4 (can tolerate 1 failure) │
└─────────────────────────────────────────────────────────────────────────┘
```
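The "3 of 4 (can tolerate 1 failure)" figure follows from Raft's majority rule. The short sketch below works through that arithmetic for a few cluster sizes; the function names are illustrative only. It shows that growing from 3 to 4 members raises the quorum size without raising the number of tolerated failures, while a fifth member raises the tolerance to 2.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


for n in (3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
# 3 members: quorum=2, tolerates=1
# 4 members: quorum=3, tolerates=1
# 5 members: quorum=3, tolerates=2
```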

---

**Document End**