Includes all pending changes needed for nixos-anywhere:
- fiberlb: L7 policy, rule, certificate types
- deployer: New service for cluster management
- nix-nos: Generic network modules
- Various service updates and fixes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
192 lines
6.7 KiB
YAML
id: T039
name: Production Deployment (Bare-Metal)
goal: Deploy the full PlasmaCloud stack to the target bare-metal environment using T032 provisioning tools and T036 learnings.
status: active
priority: P1
owner: peerA
depends_on: [T032, T036, T038]
blocks: []

context: |
  **MVP-Alpha Achieved: 12/12 components operational**

  **UPDATE 2025-12-12:** User approved VM-based deployment using QEMU + VDE virtual network.
  This allows full production deployment validation without waiting for physical hardware.

  With the application stack validated and provisioning tools proven (T032/T036), we now
  execute production deployment to QEMU VM infrastructure.

  **Prerequisites:**
  - T032 (COMPLETE): PXE boot infra, NixOS image builder, first-boot automation (17,201L)
  - T036 (PARTIAL SUCCESS): VM validation proved infrastructure concepts
    - VDE networking validated L2 clustering
    - Custom netboot with SSH key auth validated zero-touch provisioning
    - Key learning: Full NixOS required (nix-copy-closure needs nix-daemon)
  - T038 (COMPLETE): Build chain working, all services compile

  **VM Infrastructure:**
  - baremetal/vm-cluster/launch-node01-netboot.sh (node01)
  - baremetal/vm-cluster/launch-node02-netboot.sh (node02)
  - baremetal/vm-cluster/launch-node03-netboot.sh (node03)
  - VDE virtual network for L2 connectivity

  **Key Insight from T036:**
  - nix-copy-closure requires nix on target → full NixOS deployment via nixos-anywhere
  - Custom netboot (minimal Linux) insufficient for nix-built services
  - T032's nixos-anywhere approach is architecturally correct

acceptance:
  - All target bare-metal nodes provisioned with NixOS
  - ChainFire + FlareDB Raft clusters formed (3-node quorum)
  - IAM service operational on all control-plane nodes
  - All 12 services deployed and healthy
  - T029/T035 integration tests passing on live cluster
  - Production deployment documented in runbook

steps:
  - step: S1
    name: Hardware Readiness Verification
    done: Target bare-metal hardware accessible and ready for provisioning (verified by T032 completion)
    status: complete
    completed: 2025-12-12 04:15 JST

  - step: S2
    name: Bootstrap Infrastructure
    done: VDE switch + 3 QEMU VMs booted with SSH access
    status: complete
    completed: 2025-12-12 06:55 JST
    owner: peerB
    priority: P0
    started: 2025-12-12 06:50 JST
    notes: |
      **Decision (2025-12-12):** Option B (Direct Boot) selected for QEMU+VDE VM deployment.

      **Implementation:**
      1. Started VDE switch using nix package: /nix/store/.../vde2-2.3.3/bin/vde_switch
      2. Verified netboot artifacts: bzImage (14MB), initrd (484MB)
      3. Launched 3 QEMU VMs with direct kernel boot
      4. Verified SSH access on all 3 nodes (ports 2201/2202/2203)

      **Evidence:**
      - VDE switch running (PID 734637)
      - 3 QEMU processes active
      - SSH successful: `hostname` returns "nixos" on all nodes
      - Zero-touch access (SSH key baked into netboot image)
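
      **Port layout (illustrative):**
      The host-side ports for this step follow a base-plus-index pattern
      (SSH 2201/2202/2203, with matching VNC displays and serial ports in the
      outputs of this step). The Python sketch below only restates that
      mapping; the helper name `node_ports` and the base values 2200 and 4400
      are assumptions inferred from the listed ports, not taken from the
      launch scripts.

      ```python
      def node_ports(index: int) -> dict:
          """Return the host-side ports for node number `index` (1-based)."""
          if index < 1:
              raise ValueError("node index is 1-based")
          return {
              "name": f"node{index:02d}",
              "ssh": 2200 + index,     # assumed base: 2201, 2202, 2203
              "vnc_display": index,    # VNC :1, :2, :3
              "serial": 4400 + index,  # assumed base: 4401, 4402, 4403
          }


      if __name__ == "__main__":
          for i in (1, 2, 3):
              print(node_ports(i))
      ```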

    outputs:
      - path: /tmp/vde.sock
        note: VDE switch daemon socket
      - path: baremetal/vm-cluster/node01.qcow2
        note: node01 disk (SSH 2201, VNC :1, serial 4401)
      - path: baremetal/vm-cluster/node02.qcow2
        note: node02 disk (SSH 2202, VNC :2, serial 4402)
      - path: baremetal/vm-cluster/node03.qcow2
        note: node03 disk (SSH 2203, VNC :3, serial 4403)

  - step: S3
    name: NixOS Provisioning
    done: All nodes provisioned with base NixOS via nixos-anywhere
    status: in_progress
    started: 2025-12-12 06:57 JST
    owner: peerB
    priority: P0
    notes: |
      **Approach:** nixos-anywhere with T036 configurations

      For each node:
      1. Boot into installer environment (custom netboot or NixOS ISO)
      2. Verify SSH access
      3. Run nixos-anywhere with node-specific configuration:
         ```
         nixos-anywhere --flake .#node01 root@<node-ip>
         ```
      4. Wait for reboot and verify SSH access
      5. Confirm NixOS installed successfully

      Node configurations from T036 (adapt IPs for production):
      - docs/por/T036-vm-cluster-deployment/node01/
      - docs/por/T036-vm-cluster-deployment/node02/
      - docs/por/T036-vm-cluster-deployment/node03/
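
      **Per-node command generation (illustrative):**
      The per-node procedure above repeats the same nixos-anywhere invocation
      with a different flake attribute and target host. The Python sketch
      below builds those command lines as a dry run and prints them without
      executing anything. The helper name `provision_command` is hypothetical,
      and the addresses are placeholders from the 192.0.2.0/24 documentation
      range, not the real node IPs.

      ```python
      import shlex


      def provision_command(node: str, host: str) -> list:
          """Build the argv for one node, mirroring the command form in step 3."""
          return ["nixos-anywhere", "--flake", f".#{node}", f"root@{host}"]


      # Placeholder addresses (assumption); substitute the real node IPs.
      NODES = {
          "node01": "192.0.2.11",
          "node02": "192.0.2.12",
          "node03": "192.0.2.13",
      }

      if __name__ == "__main__":
          for node, host in NODES.items():
              # Dry run: print a shell-quoted command line instead of running it.
              print(shlex.join(provision_command(node, host)))
      ```

      Provisioning one node at a time, and verifying SSH after each reboot,
      keeps the staged node-by-node rollout described in this task.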

  - step: S4
    name: Service Deployment
    done: All 12 PlasmaCloud services deployed and running
    status: pending
    owner: peerB
    priority: P0
    notes: |
      Deploy services via NixOS modules (T024):
      - chainfire-server (cluster KVS)
      - flaredb-server (DBaaS KVS)
      - iam-server (aegis)
      - plasmavmc-server (VM infrastructure)
      - lightningstor-server (object storage)
      - flashdns-server (DNS)
      - fiberlb-server (load balancer)
      - prismnet-server (overlay networking) [renamed from novanet]
      - k8shost-server (K8s hosting)
      - nightlight-server (observability) [renamed from metricstor]
      - creditservice-server (quota/billing)

      Service deployment is part of NixOS configuration in S3.
      This step verifies all services started successfully.

  - step: S5
    name: Cluster Formation
    done: Raft clusters operational (ChainFire + FlareDB)
    status: pending
    owner: peerB
    priority: P0
    notes: |
      Verify cluster formation:
      1. ChainFire:
         - 3 nodes joined
         - Leader elected
         - Health check passing

      2. FlareDB:
         - 3 nodes joined
         - Quorum formed
         - Read/write operations working

      3. IAM:
         - All nodes responding
         - Authentication working
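
      **Quorum arithmetic (illustrative):**
      For the "Quorum formed" checks above, a Raft cluster needs a strict
      majority of its voters, so a 3-node cluster needs 2 healthy nodes and
      tolerates the loss of one. The Python sketch below captures only that
      arithmetic; the function names are hypothetical and it does not query
      ChainFire or FlareDB.

      ```python
      def quorum_size(voters: int) -> int:
          """Smallest strict majority of `voters` Raft members."""
          if voters < 1:
              raise ValueError("a cluster needs at least one voter")
          return voters // 2 + 1


      def has_quorum(voters: int, healthy: int) -> bool:
          """True if `healthy` members are enough to elect a leader and commit."""
          return healthy >= quorum_size(voters)


      if __name__ == "__main__":
          for healthy in (3, 2, 1):
              print(healthy, "healthy of 3:", has_quorum(3, healthy))
      ```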

  - step: S6
    name: Integration Testing
    done: T029/T035 integration tests passing on live cluster
    status: pending
    owner: peerA
    priority: P0
    notes: |
      **Test Plan**: docs/por/T039-production-deployment/S6-integration-test-plan.md

      Test Categories:
      1. Service Health (11 services on 3 nodes)
      2. Cluster Formation (ChainFire + FlareDB Raft)
      3. Cross-Component (IAM auth, FlareDB storage, S3, DNS)
      4. Nightlight Metrics
      5. FiberLB Load Balancing (T051)
      6. PrismNET Networking
      7. CreditService Quota
      8. Node Failure Resilience

      If tests fail:
      - Document failures
      - Create follow-up task for fixes
      - Do not proceed to production traffic until resolved

evidence: []
notes: |
  **T036 Learnings Applied:**
  - Use full NixOS deployment (not minimal netboot)
  - nixos-anywhere is the proven deployment path
  - Custom netboot with SSH key auth for zero-touch access
  - VDE networking concepts map to real L2 switches

  **Risk Mitigations:**
  - Hardware validation before deployment (S1)
  - Staged deployment (node-by-node)
  - Integration testing before production traffic (S6)
  - Rollback plan: Re-provision from scratch if needed