id: T039
name: Production Deployment (Bare-Metal)
goal: Deploy the full PlasmaCloud stack to the target bare-metal environment using T032 provisioning tools and T036 learnings.
status: active
priority: P1
owner: peerA
depends_on: [T032, T036, T038]
blocks: []
context: |
  **MVP-Alpha Achieved: 12/12 components operational**

  **UPDATE 2025-12-12:** User approved VM-based deployment using QEMU + a VDE virtual network. This allows full production-deployment validation without waiting for physical hardware.

  With the application stack validated and the provisioning tools proven (T032/T036), we now execute the production deployment on QEMU VM infrastructure.

  **Prerequisites:**
  - T032 (COMPLETE): PXE boot infra, NixOS image builder, first-boot automation (17,201L)
  - T036 (PARTIAL SUCCESS): VM validation proved the infrastructure concepts
    - VDE networking validated L2 clustering
    - Custom netboot with SSH key auth validated zero-touch provisioning
    - Key learning: full NixOS required (nix-copy-closure needs nix-daemon)
  - T038 (COMPLETE): Build chain working, all services compile

  **VM Infrastructure:**
  - baremetal/vm-cluster/launch-node01-netboot.sh (node01)
  - baremetal/vm-cluster/launch-node02-netboot.sh (node02)
  - baremetal/vm-cluster/launch-node03-netboot.sh (node03)
  - VDE virtual network for L2 connectivity

  **Key Insight from T036:**
  - nix-copy-closure requires nix on the target → full NixOS deployment via nixos-anywhere
  - Custom netboot (minimal Linux) is insufficient for nix-built services
  - T032's nixos-anywhere approach is architecturally correct
acceptance:
  - All target bare-metal nodes provisioned with NixOS
  - ChainFire + FlareDB Raft clusters formed (3-node quorum)
  - IAM service operational on all control-plane nodes
  - All 12 services deployed and healthy
  - T029/T035 integration tests passing on live cluster
  - Production deployment documented in runbook
steps:
  - step: S1
    name: Hardware Readiness Verification
    done: Target bare-metal hardware accessible and ready for provisioning (verified by
      T032 completion)
    status: complete
    completed: 2025-12-12 04:15 JST
  - step: S2
    name: Bootstrap Infrastructure
    done: VDE switch + 3 QEMU VMs booted with SSH access
    status: complete
    completed: 2025-12-12 06:55 JST
    owner: peerB
    priority: P0
    started: 2025-12-12 06:50 JST
    notes: |
      **Decision (2025-12-12):** Option B (Direct Boot) selected for the QEMU+VDE VM deployment.

      **Implementation:**
      1. Started VDE switch using the nix package: /nix/store/.../vde2-2.3.3/bin/vde_switch
      2. Verified netboot artifacts: bzImage (14 MB), initrd (484 MB)
      3. Launched 3 QEMU VMs with direct kernel boot
      4. Verified SSH access on all 3 nodes (ports 2201/2202/2203)

      **Evidence:**
      - VDE switch running (PID 734637)
      - 3 QEMU processes active
      - SSH successful: `hostname` returns "nixos" on all nodes
      - Zero-touch access (SSH key baked into the netboot image)
    outputs:
      - path: /tmp/vde.sock
        note: VDE switch daemon socket
      - path: baremetal/vm-cluster/node01.qcow2
        note: node01 disk (SSH 2201, VNC :1, serial 4401)
      - path: baremetal/vm-cluster/node02.qcow2
        note: node02 disk (SSH 2202, VNC :2, serial 4402)
      - path: baremetal/vm-cluster/node03.qcow2
        note: node03 disk (SSH 2203, VNC :3, serial 4403)
  - step: S3
    name: NixOS Provisioning
    done: All nodes provisioned with base NixOS via nixos-anywhere
    status: in_progress
    started: 2025-12-12 06:57 JST
    owner: peerB
    priority: P0
    acceptance_gate: |
      All criteria must pass before S4:
      1. All 3 nodes boot from disk (not from the ISO)
      2. `nixos-version` returns 26.05+ on all nodes
      3. SSH accessible via ports 2201/2202/2203
      4. /etc/nixos/secrets/cluster-config.json exists on all nodes
      5.
         Static IPs configured (192.168.100.11/12/13 on eth0)
    verification_cmd: |
      for port in 2201 2202 2203; do
        ssh -p $port root@localhost 'nixos-version && ls /etc/nixos/secrets/cluster-config.json && ip addr show eth0 | grep 192.168.100'
      done
    notes: |
      **Current State (2025-12-18):**
      - VMs are still running from the ISO installer (`-boot d`), NOT from disk
      - NixOS configs are asymmetric (node01 has nightlight; node02/03 are missing it)
      - Secrets must be staged via --extra-files

      **Option A: nixos-anywhere (fresh install)**
      ```bash
      # Prepare secrets staging
      mkdir -p /tmp/node01-extra/etc/nixos/secrets
      cp docs/por/T036-vm-cluster-deployment/node01/secrets/* /tmp/node01-extra/etc/nixos/secrets/

      # Deploy
      nix run nixpkgs#nixos-anywhere -- --flake .#node01 --extra-files /tmp/node01-extra root@localhost -p 2201
      ```

      **Option B: Reboot from disk (if already installed)**
      1. Kill the current QEMU processes
      2. Use the launch-node0{1,2,3}-disk.sh scripts
      3. These boot with UEFI from disk (`-boot c`)

      Node configurations from T036:
      - docs/por/T036-vm-cluster-deployment/node01/
      - docs/por/T036-vm-cluster-deployment/node02/
      - docs/por/T036-vm-cluster-deployment/node03/
  - step: S4
    name: Service Deployment
    done: All 11 PlasmaCloud services deployed and running
    status: pending
    owner: peerB
    priority: P0
    acceptance_gate: |
      All criteria must pass before S5:
      1. `systemctl is-active` returns "active" for all 11 services on all 3 nodes
      2. Each service responds to gRPC reflection (`grpcurl -plaintext <host:port> list`)
      3.
         No service in a failed or restart-loop state
    verification_cmd: |
      for port in 2201 2202 2203; do
        ssh -p $port root@localhost 'systemctl list-units --state=running | grep -cE "chainfire|flaredb|iam|plasmavmc|prismnet|flashdns|fiberlb|lightningstor|k8shost|nightlight|creditservice"'
      done
      # Expected: 11 on each node (33 total)
    notes: |
      **Services (11 total, per node):**
      - chainfire-server (2379)
      - flaredb-server (2479)
      - iam-server (3000)
      - plasmavmc-server (4000)
      - prismnet-server (5000)
      - flashdns-server (6000)
      - fiberlb-server (7000)
      - lightningstor-server (8000)
      - k8shost-server (6443)
      - nightlight-server (9101)
      - creditservice-server (3010)

      Service deployment is part of the NixOS configuration applied in S3; this step verifies that all services started successfully.
  - step: S5
    name: Cluster Formation
    done: Raft clusters operational (ChainFire + FlareDB)
    status: pending
    owner: peerB
    priority: P0
    acceptance_gate: |
      All criteria must pass before S6:
      1. ChainFire: 3 nodes in the cluster, leader elected, all healthy
      2. FlareDB: 3 nodes joined, quorum formed (2/3 minimum)
      3. IAM: responds on all 3 nodes
      4. Write/read test passes across nodes (data replication verified)
    verification_cmd: |
      # ChainFire cluster check
      grpcurl -plaintext localhost:2379 chainfire.ClusterService/GetStatus

      # FlareDB cluster check
      grpcurl -plaintext localhost:2479 flaredb.AdminService/GetClusterStatus

      # IAM health check
      for port in 2201 2202 2203; do
        ssh -p $port root@localhost 'curl -s http://localhost:3000/health || echo FAIL'
      done
    notes: |
      **Verify cluster formation:**
      1. **ChainFire:**
         - 3 nodes joined
         - Leader elected
         - Health check passing
      2. **FlareDB:**
         - 3 nodes joined
         - Quorum formed
         - Read/write operations working
      3.
         **IAM:**
         - All nodes responding
         - Authentication working

      **Dependencies:** first-boot-automation uses cluster-config.json for its bootstrap/join logic
  - step: S6
    name: Integration Testing
    done: T029/T035 integration tests passing on live cluster
    status: pending
    owner: peerA
    priority: P0
    acceptance_gate: |
      T039 is complete when ALL of the following pass:
      1. Service health: 11 services × 3 nodes = 33 healthy endpoints
      2. IAM auth: token issue + validate flow works
      3. FlareDB: write on node01, read on node02 succeeds
      4. LightningSTOR: S3 bucket/object CRUD works
      5. FlashDNS: DNS record creation + query works
      6. NightLight: Prometheus targets up, metrics queryable
      7. Node failure: cluster survives one node stopping, and the node rejoins on restart
    success_criteria: |
      P0 (must pass): #1, #2, #3, #7
      P1 (should pass): #4, #5, #6
      P2 (nice to have): FiberLB, PrismNET, CreditService
    notes: |
      **Test Plan:** docs/por/T039-production-deployment/S6-integration-test-plan.md

      **Test Categories (in order):**
      1. Service health (11 services on 3 nodes)
      2. Cluster formation (ChainFire + FlareDB Raft)
      3. Cross-component (IAM auth, FlareDB storage, S3, DNS)
      4. NightLight metrics
      5. FiberLB load balancing (T051)
      6. PrismNET networking
      7. CreditService quota
      8. Node-failure resilience

      **If tests fail:**
      - Document the failures in the evidence section
      - Create a follow-up task for the fixes
      - Do not proceed to production traffic until all P0 failures are resolved
evidence: []
notes: |
  **T036 Learnings Applied:**
  - Use a full NixOS deployment (not minimal netboot)
  - nixos-anywhere is the proven deployment path
  - Custom netboot with SSH key auth for zero-touch access
  - VDE networking concepts map to real L2 switches

  **Risk Mitigations:**
  - Hardware validation before deployment (S1)
  - Staged deployment (node by node)
  - Integration testing before production traffic (S6)
  - Rollback plan: re-provision from scratch if needed
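The S3 and S4 acceptance gates above can be combined into one runnable check. The sketch below is a minimal, hedged version: the SSH ports (2201/2202/2203) and the 11-service list come from the step definitions, but the `run` helper and the `REAL` flag are conveniences invented here. By default it only prints the commands it would execute, so it can be reviewed safely before being pointed at the live VMs.

```shell
#!/usr/bin/env bash
# Consolidated S3/S4 gate check (sketch). Dry-run by default;
# set REAL=1 to execute the ssh commands against the cluster.
set -euo pipefail

NODES=(2201 2202 2203)
SERVICES=(chainfire flaredb iam plasmavmc prismnet flashdns fiberlb
          lightningstor k8shost nightlight creditservice)
DRYRUN_COUNT=0

run() {
  if [[ "${REAL:-0}" == "1" ]]; then
    "$@"                              # execute for real
  else
    echo "WOULD RUN: $*"              # dry-run: print only
    DRYRUN_COUNT=$((DRYRUN_COUNT + 1))
  fi
}

TOTAL=$(( ${#SERVICES[@]} * ${#NODES[@]} ))
echo "checking ${#SERVICES[@]} services on ${#NODES[@]} nodes ($TOTAL endpoints)"

for port in "${NODES[@]}"; do
  # S3 gate: NixOS version reported, secrets file present
  run ssh -p "$port" root@localhost \
      'nixos-version && test -f /etc/nixos/secrets/cluster-config.json'
  # S4 gate: every service unit active (no failed/restart-loop state)
  for svc in "${SERVICES[@]}"; do
    run ssh -p "$port" root@localhost "systemctl is-active ${svc}-server"
  done
done
```

In dry-run mode the script prints one line per planned command (36 in total: 3 S3 checks plus 33 service checks), which doubles as a quick sanity check of the 11 × 3 = 33 endpoint count used in S6 criterion 1.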
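S6 criterion 7 (node-failure resilience) can be scripted the same way. This is a hedged sketch only: the status RPC is the one from the S5 verification_cmd, the relaunch path follows the launch-node0{N}-disk.sh convention from S3, and the `run`/`REAL`/`SETTLE` helpers are invented here. Commands are printed, not executed, unless REAL=1.

```shell
#!/usr/bin/env bash
# S6 node-failure drill (sketch): stop node01, confirm 2/3 quorum holds,
# restart it, confirm rejoin. Dry-run by default; REAL=1 executes.
set -euo pipefail

DRYRUN_COUNT=0
run() {
  if [[ "${REAL:-0}" == "1" ]]; then
    "$@"
  else
    echo "WOULD RUN: $*"
    DRYRUN_COUNT=$((DRYRUN_COUNT + 1))
  fi
}

status() {  # ChainFire cluster status via a forwarded port (see S5)
  run grpcurl -plaintext localhost:2379 chainfire.ClusterService/GetStatus
}

echo "baseline (3/3 nodes expected)";        status
echo "stopping node01 (kill its QEMU)";      run pkill -f node01.qcow2
sleep "${SETTLE:-0}"                         # settle time; tune for real runs
echo "degraded (2/3, quorum must hold)";     status
echo "restarting node01 from disk";          run baremetal/vm-cluster/launch-node01-disk.sh
sleep "${SETTLE:-0}"
echo "after rejoin (3/3 nodes expected)";    status
```

A FlareDB variant of the same drill (port 2479, flaredb.AdminService/GetClusterStatus) would cover the second Raft cluster; results either way belong in the evidence section per the "if tests fail" policy.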