id: T039
name: Production Deployment (Bare-Metal)
goal: Deploy the full PlasmaCloud stack to the target bare-metal environment using T032 provisioning tools and T036 learnings.
status: active
priority: P1
owner: peerA
depends_on: [T032, T036, T038]
blocks: []
context: |
  **MVP-Alpha Achieved: 12/12 components operational**

  **UPDATE 2025-12-12:** User approved VM-based deployment using QEMU + a VDE virtual network. This allows full production deployment validation without waiting for physical hardware.

  With the application stack validated and the provisioning tools proven (T032/T036), we now execute production deployment to QEMU VM infrastructure.

  **Prerequisites:**
  - T032 (COMPLETE): PXE boot infra, NixOS image builder, first-boot automation (17,201L)
  - T036 (PARTIAL SUCCESS): VM validation proved the infrastructure concepts
    - VDE networking validated L2 clustering
    - Custom netboot with SSH key auth validated zero-touch provisioning
    - Key learning: full NixOS required (nix-copy-closure needs nix-daemon)
  - T038 (COMPLETE): build chain working, all services compile

  **VM Infrastructure:**
  - baremetal/vm-cluster/launch-node01-netboot.sh (node01)
  - baremetal/vm-cluster/launch-node02-netboot.sh (node02)
  - baremetal/vm-cluster/launch-node03-netboot.sh (node03)
  - VDE virtual network for L2 connectivity

  **Key Insight from T036:**
  - nix-copy-closure requires nix on the target → full NixOS deployment via nixos-anywhere
  - Custom netboot (minimal Linux) is insufficient for nix-built services
  - T032's nixos-anywhere approach is architecturally correct
acceptance:
  - All target bare-metal nodes provisioned with NixOS
  - ChainFire + FlareDB Raft clusters formed (3-node quorum)
  - IAM service operational on all control-plane nodes
  - All 12 services deployed and healthy
  - T029/T035 integration tests passing on the live cluster
  - Production deployment documented in a runbook
steps:
  - step: S1
    name: Hardware Readiness Verification
    done: Target bare-metal hardware accessible and ready for provisioning (verified by T032 completion)
    status: complete
    completed: 2025-12-12 04:15 JST
  - step: S2
    name: Bootstrap Infrastructure
    done: VDE switch + 3 QEMU VMs booted with SSH access
    status: complete
    completed: 2025-12-12 06:55 JST
    owner: peerB
    priority: P0
    started: 2025-12-12 06:50 JST
    notes: |
      **Decision (2025-12-12):** Option B (direct boot) selected for the QEMU+VDE VM deployment.

      **Implementation:**
      1. Started the VDE switch from the nix package: /nix/store/.../vde2-2.3.3/bin/vde_switch
      2. Verified netboot artifacts: bzImage (14MB), initrd (484MB)
      3. Launched 3 QEMU VMs with direct kernel boot
      4. Verified SSH access on all 3 nodes (ports 2201/2202/2203)

      **Evidence:**
      - VDE switch running (PID 734637)
      - 3 QEMU processes active
      - SSH successful: `hostname` returns "nixos" on all nodes
      - Zero-touch access (SSH key baked into the netboot image)
    outputs:
      - path: /tmp/vde.sock
        note: VDE switch daemon socket
      - path: baremetal/vm-cluster/node01.qcow2
        note: node01 disk (SSH 2201, VNC :1, serial 4401)
      - path: baremetal/vm-cluster/node02.qcow2
        note: node02 disk (SSH 2202, VNC :2, serial 4402)
      - path: baremetal/vm-cluster/node03.qcow2
        note: node03 disk (SSH 2203, VNC :3, serial 4403)
  - step: S3
    name: NixOS Provisioning
    done: All nodes provisioned with base NixOS via nixos-anywhere
    status: in_progress
    started: 2025-12-12 06:57 JST
    owner: peerB
    priority: P0
    notes: |
      **Approach:** nixos-anywhere with the T036 configurations.

      For each node:
      1. Boot into the installer environment (custom netboot or NixOS ISO)
      2. Verify SSH access
      3. Run nixos-anywhere with the node-specific configuration:
         ```
         nixos-anywhere --flake .#node01 root@
         ```
      4. Wait for the reboot and verify SSH access
      5. Confirm NixOS installed successfully

      Node configurations from T036 (adapt IPs for production):
      - docs/por/T036-vm-cluster-deployment/node01/
      - docs/por/T036-vm-cluster-deployment/node02/
      - docs/por/T036-vm-cluster-deployment/node03/
  - step: S4
    name: Service Deployment
    done: All 12 PlasmaCloud services deployed and running
    status: pending
    owner: peerB
    priority: P0
    notes: |
      Deploy services via NixOS modules (T024):
      - chainfire-server (cluster KVS)
      - flaredb-server (DBaaS KVS)
      - iam-server (aegis)
      - plasmavmc-server (VM infrastructure)
      - lightningstor-server (object storage)
      - flashdns-server (DNS)
      - fiberlb-server (load balancer)
      - prismnet-server (overlay networking) [renamed from novanet]
      - k8shost-server (K8s hosting)
      - nightlight-server (observability) [renamed from metricstor]
      - creditservice-server (quota/billing)

      Service deployment is part of the NixOS configuration in S3; this step verifies that all services started successfully.
  - step: S5
    name: Cluster Formation
    done: Raft clusters operational (ChainFire + FlareDB)
    status: pending
    owner: peerB
    priority: P0
    notes: |
      Verify cluster formation:
      1. ChainFire:
         - 3 nodes joined
         - Leader elected
         - Health check passing
      2. FlareDB:
         - 3 nodes joined
         - Quorum formed
         - Read/write operations working
      3. IAM:
         - All nodes responding
         - Authentication working
  - step: S6
    name: Integration Testing
    done: T029/T035 integration tests passing on the live cluster
    status: pending
    owner: peerA
    priority: P0
    notes: |
      **Test Plan:** docs/por/T039-production-deployment/S6-integration-test-plan.md

      Test categories:
      1. Service health (11 services on 3 nodes)
      2. Cluster formation (ChainFire + FlareDB Raft)
      3. Cross-component (IAM auth, FlareDB storage, S3, DNS)
      4. Nightlight metrics
      5. FiberLB load balancing (T051)
      6. PrismNET networking
      7. CreditService quota
      8. Node failure resilience

      If tests fail:
      - Document the failures
      - Create a follow-up task for fixes
      - Do not proceed to production traffic until resolved
evidence: []
notes: |
  **T036 Learnings Applied:**
  - Use full NixOS deployment (not the minimal netboot)
  - nixos-anywhere is the proven deployment path
  - Custom netboot with SSH key auth for zero-touch access
  - VDE networking concepts map to real L2 switches

  **Risk Mitigations:**
  - Hardware validation before deployment (S1)
  - Staged deployment (node-by-node)
  - Integration testing before production traffic (S6)
  - Rollback plan: re-provision from scratch if needed
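The S4 "all services started" gate can be sketched as a small smoke-check script. This is a minimal sketch, not part of the task file: the `localhost` ports 2201-2203 come from the S2 SSH forwards, the systemd unit names are assumed to match the T024 service names listed in S4 (only three are shown here), and `DRY_RUN=1` (the default below) only prints the commands rather than connecting.

```shell
#!/usr/bin/env sh
# Smoke check for S4 (services running) across the three VMs.
# Assumptions: SSH forwarded on localhost:2201-2203 (from S2) and
# systemd units named after the T024 modules. DRY_RUN=1 (default)
# prints each command instead of running it over SSH.
PORTS="2201 2202 2203"
SERVICES="chainfire-server flaredb-server iam-server"
DRY_RUN="${DRY_RUN:-1}"

for port in $PORTS; do
  for svc in $SERVICES; do
    cmd="ssh -p $port root@localhost systemctl is-active $svc"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$cmd"                                  # show what would run
    else
      $cmd || echo "FAIL: $svc on node at port $port"
    fi
  done
done
```

Extending `SERVICES` to the full service list and running with `DRY_RUN=0` would turn this into the S4 verification gate before moving on to the S5 cluster-formation checks.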