photoncloud-monorepo/docs/por/T039-production-deployment/task.yaml
centra bbc7282b33 feat(T039): Complete S2 Bootstrap Infrastructure
Deployed 3-node QEMU VM cluster for production validation:
- VDE switch started for L2 networking (/tmp/vde.sock)
- 3 VMs launched with custom netboot (SSH key baked in)
- Zero-touch SSH access verified on all nodes (ports 2201/2202/2203)
- Direct kernel boot eliminates PXE/ISO requirements

Next: S3 NixOS Provisioning via nixos-anywhere

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-12 06:55:32 +09:00


id: T039
name: Production Deployment (Bare-Metal)
goal: Deploy the full PlasmaCloud stack to the target bare-metal environment using the T032 provisioning tools and the T036 learnings.
status: active
priority: P0
owner: peerA
depends_on: [T032, T036, T038]
blocks: []
context: |
**MVP-Alpha Achieved: 12/12 components operational**
**UPDATE 2025-12-12:** User approved VM-based deployment using QEMU + VDE virtual network.
This allows full production deployment validation without waiting for physical hardware.
With the application stack validated and provisioning tools proven (T032/T036), we now
execute production deployment to QEMU VM infrastructure.
**Prerequisites:**
- T032 (COMPLETE): PXE boot infra, NixOS image builder, first-boot automation (17,201L)
- T036 (PARTIAL SUCCESS): VM validation proved infrastructure concepts
- VDE networking validated L2 clustering
- Custom netboot with SSH key auth validated zero-touch provisioning
- Key learning: Full NixOS required (nix-copy-closure needs nix-daemon)
- T038 (COMPLETE): Build chain working, all services compile
**VM Infrastructure:**
- baremetal/vm-cluster/launch-node01-netboot.sh (node01)
- baremetal/vm-cluster/launch-node02-netboot.sh (node02)
- baremetal/vm-cluster/launch-node03-netboot.sh (node03)
- VDE virtual network for L2 connectivity
**Key Insight from T036:**
- nix-copy-closure requires nix on target → full NixOS deployment via nixos-anywhere
- Custom netboot (minimal Linux) insufficient for nix-built services
- T032's nixos-anywhere approach is architecturally correct
acceptance:
- All target bare-metal nodes provisioned with NixOS
- ChainFire + FlareDB Raft clusters formed (3-node quorum)
- IAM service operational on all control-plane nodes
- All 12 services deployed and healthy
- T029/T035 integration tests passing on live cluster
- Production deployment documented in runbook
steps:
- step: S1
name: Hardware Readiness Verification
done: Target bare-metal hardware accessible and ready for provisioning (verified by T032 completion)
status: complete
completed: 2025-12-12 04:15 JST
- step: S2
name: Bootstrap Infrastructure
done: VDE switch + 3 QEMU VMs booted with SSH access
status: complete
completed: 2025-12-12 06:55 JST
owner: peerB
priority: P0
started: 2025-12-12 06:50 JST
notes: |
**Decision (2025-12-12):** Option B (Direct Boot) selected for QEMU+VDE VM deployment.
**Implementation:**
1. Started VDE switch using nix package: /nix/store/.../vde2-2.3.3/bin/vde_switch
2. Verified netboot artifacts: bzImage (14MB), initrd (484MB)
3. Launched 3 QEMU VMs with direct kernel boot
4. Verified SSH access on all 3 nodes (ports 2201/2202/2203)
**Evidence:**
- VDE switch running (PID 734637)
- 3 QEMU processes active
- SSH successful: `hostname` returns "nixos" on all nodes
- Zero-touch access (SSH key baked into netboot image)
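The SSH evidence above could be re-checked with a small script along these lines. This is a minimal sketch, not the procedure that was actually run: it assumes the three SSH ports are forwarded on `localhost`, that the login user is `root`, and that accepting new host keys automatically is acceptable for throwaway VMs. The helper names are hypothetical.

```python
import subprocess

# SSH ports as recorded in the S2 notes. The host ("localhost") and the
# user ("root") are assumptions, not taken from the launch scripts.
NODE_SSH_PORTS = {"node01": 2201, "node02": 2202, "node03": 2203}


def hostname_command(port, host="localhost", user="root"):
    """Build an ssh command that prints the remote hostname without prompting."""
    return [
        "ssh",
        "-p", str(port),
        "-o", "BatchMode=yes",                   # fail instead of asking for a password
        "-o", "StrictHostKeyChecking=accept-new",  # assumption: fine for disposable VMs
        "-o", "ConnectTimeout=5",
        f"{user}@{host}",
        "hostname",
    ]


def verify_nodes(expected="nixos", run=subprocess.run):
    """Return {node: bool}; True when ssh succeeds and `hostname` matches."""
    results = {}
    for node, port in NODE_SSH_PORTS.items():
        proc = run(hostname_command(port), capture_output=True, text=True)
        results[node] = proc.returncode == 0 and proc.stdout.strip() == expected
    return results
```

Calling `verify_nodes()` on the host that runs the VMs returns a per-node pass/fail map. The `run` parameter exists only so the check can be exercised without a live cluster.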
outputs:
- VDE switch daemon at /tmp/vde.sock
- node01: SSH port 2201, VNC :1, serial 4401
- node02: SSH port 2202, VNC :2, serial 4402
- node03: SSH port 2203, VNC :3, serial 4403
- step: S3
name: NixOS Provisioning
done: All nodes provisioned with base NixOS via nixos-anywhere
status: pending
owner: peerB
priority: P0
notes: |
For each node:
1. Boot into installer environment (custom netboot or NixOS ISO)
2. Verify SSH access
3. Run nixos-anywhere with node-specific configuration:
```
nixos-anywhere --flake .#node01 root@<node-ip>
```
4. Wait for reboot and verify SSH access
5. Confirm NixOS installed successfully
Node configurations from T036 (adapt IPs for production):
- docs/por/T036-vm-cluster-deployment/node01/
- docs/por/T036-vm-cluster-deployment/node02/
- docs/por/T036-vm-cluster-deployment/node03/
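For the QEMU cluster, the per-node invocation above might be generated as follows. This is a sketch under stated assumptions: the flake attributes are assumed to be named `node01` to `node03` (following the `.#node01` pattern), the nodes are assumed to be reached through the S2 port forwards on `localhost`, and `--ssh-port` is the nixos-anywhere option for a non-default SSH port as far as I know (confirm with `nixos-anywhere --help`). The function names are hypothetical.

```python
import shlex

# Node names follow the `.#node01` pattern from the notes; SSH ports are
# the S2 forwards. Targeting root@localhost is an assumption, since the
# notes only show root@<node-ip>.
NODES = {"node01": 2201, "node02": 2202, "node03": 2203}


def nixos_anywhere_command(node, ssh_port, host="localhost", flake_dir="."):
    """Build the nixos-anywhere argument list for a single node."""
    return [
        "nixos-anywhere",
        "--flake", f"{flake_dir}#{node}",
        "--ssh-port", str(ssh_port),
        f"root@{host}",
    ]


def provisioning_plan(nodes=NODES):
    """Return one shell-quoted command line per node, in node-name order."""
    return [
        shlex.join(nixos_anywhere_command(node, port))
        for node, port in sorted(nodes.items())
    ]
```

Printing the plan and running the commands one at a time keeps the rollout node-by-node, which matches the staged deployment listed under the risk mitigations.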
- step: S4
name: Service Deployment
done: All 12 PlasmaCloud services deployed and running
status: pending
owner: peerB
priority: P0
notes: |
Deploy services via NixOS modules (T024), including:
- chainfire-server (cluster KVS)
- flaredb-server (DBaaS KVS)
- iam-server (aegis)
- plasmavmc-server (VM infrastructure)
- lightningstor-server (object storage)
- flashdns-server (DNS)
- fiberlb-server (load balancer)
- novanet-server (overlay networking)
- k8shost-server (K8s hosting)
- metricstor-server (metrics)
Service deployment happens as part of the NixOS configuration applied in S3.
This step only verifies that every service started successfully.
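One way to do that verification is to run `systemctl is-active` on each node and parse its output. The sketch below assumes the systemd unit names match the service names listed above, which the task file does not confirm, and the helper names are hypothetical.

```python
# Service names copied from the S4 list. Treating them as systemd unit
# names is an assumption.
SERVICES = [
    "chainfire-server",
    "flaredb-server",
    "iam-server",
    "plasmavmc-server",
    "lightningstor-server",
    "flashdns-server",
    "fiberlb-server",
    "novanet-server",
    "k8shost-server",
    "metricstor-server",
]


def is_active_command(units=SERVICES):
    """Build a `systemctl is-active` call; it prints one state per unit."""
    return ["systemctl", "is-active", *units]


def parse_is_active(units, output):
    """Pair each unit with the state line printed for it."""
    states = output.split()
    if len(states) != len(units):
        raise ValueError(f"expected {len(units)} states, got {len(states)}")
    return dict(zip(units, states))


def failing_units(states):
    """Return the units whose state is anything other than 'active'."""
    return sorted(unit for unit, state in states.items() if state != "active")
```

The output is parsed rather than relying on the exit status because, as I understand the systemd documentation, `is-active` exits 0 as soon as at least one of the listed units is active.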
- step: S5
name: Cluster Formation
done: Raft clusters operational (ChainFire + FlareDB)
status: pending
owner: peerB
priority: P0
notes: |
Verify cluster formation:
1. ChainFire:
- 3 nodes joined
- Leader elected
- Health check passing
2. FlareDB:
- 3 nodes joined
- Quorum formed
- Read/write operations working
3. IAM:
- All nodes responding
- Authentication working
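The quorum criteria above follow from Raft's majority rule: a cluster of n voters needs floor(n/2) + 1 of them to elect a leader and commit writes, so a 3-node cluster needs 2 and tolerates 1 failure. A minimal sketch of that arithmetic follows. The shape of the status inputs (a list of joined members and a leader name) is hypothetical and not taken from the ChainFire or FlareDB APIs.

```python
def quorum_size(voters):
    """Smallest majority of a Raft cluster with `voters` voting members."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters):
    """How many voters can be lost while a majority still remains."""
    return voters - quorum_size(voters)


def has_quorum(members_up, expected_voters=3):
    """True when enough distinct members are up to form a majority."""
    return len(set(members_up)) >= quorum_size(expected_voters)


def fully_formed(members_up, leader, expected_voters=3):
    """S5 criterion: every expected node has joined and one of them leads."""
    joined = set(members_up)
    return len(joined) == expected_voters and leader in joined
```

For example, `has_quorum(["node01", "node02"])` is true because two of three voters are a majority, while `fully_formed(["node01", "node02"], "node01")` is false because S5 requires all three nodes to have joined.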
- step: S6
name: Integration Testing
done: T029/T035 integration tests passing on live cluster
status: pending
owner: peerA
priority: P0
notes: |
Run existing integration tests against production cluster:
- T029 practical application tests (VM+NovaNET, FlareDB+IAM, k8shost)
- T035 build validation tests
- Cross-component integration verification
If tests fail:
- Document failures
- Create follow-up task for fixes
- Do not proceed to production traffic until resolved
evidence: []
notes: |
**T036 Learnings Applied:**
- Use full NixOS deployment (not minimal netboot)
- nixos-anywhere is the proven deployment path
- Custom netboot with SSH key auth for zero-touch access
- VDE networking concepts map to real L2 switches
**Risk Mitigations:**
- Hardware validation before deployment (S1)
- Staged deployment (node-by-node)
- Integration testing before production traffic (S6)
- Rollback plan: Re-provision from scratch if needed