photoncloud-monorepo/docs/por/T039-production-deployment/task.yaml
centra 54e3a16091 fix(nix): Align service ExecStart with actual binary CLI interfaces
- chainfire: Fix binary name (chainfire-server → chainfire)
- fiberlb: Use --grpc-addr instead of --port
- flaredb: Use --addr instead of --api-addr/--raft-addr
- flashdns: Add --grpc-addr and --dns-addr flags
- iam: Use --addr instead of --port/--data-dir
- k8shost: Add --iam-server-addr for dynamic IAM port connection
- lightningstor: Add --in-memory-metadata for ChainFire fallback
- plasmavmc: Add ChainFire service dependency and endpoint env var
- prismnet: Use --grpc-addr instead of --port

These fixes are required for T039 production deployment. The
plasmavmc change specifically fixes the ChainFire port mismatch
(was hardcoded 50051, now uses chainfire.port = 2379).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-18 22:58:40 +09:00

id: T039
name: Production Deployment (Bare-Metal)
goal: Deploy the full PlasmaCloud stack to the target bare-metal environment using T032 provisioning tools and T036 learnings.
status: active
priority: P1
owner: peerA
depends_on: [T032, T036, T038]
blocks: []
context: |
**MVP-Alpha Achieved: 12/12 components operational**
**UPDATE 2025-12-12:** User approved VM-based deployment using QEMU + VDE virtual network.
This allows full production deployment validation without waiting for physical hardware.
With the application stack validated and provisioning tools proven (T032/T036), we now
execute production deployment to QEMU VM infrastructure.
**Prerequisites:**
- T032 (COMPLETE): PXE boot infra, NixOS image builder, first-boot automation (17,201 lines)
- T036 (PARTIAL SUCCESS): VM validation proved infrastructure concepts
- VDE networking validated L2 clustering
- Custom netboot with SSH key auth validated zero-touch provisioning
- Key learning: Full NixOS required (nix-copy-closure needs nix-daemon)
- T038 (COMPLETE): Build chain working, all services compile
**VM Infrastructure:**
- baremetal/vm-cluster/launch-node01-netboot.sh (node01)
- baremetal/vm-cluster/launch-node02-netboot.sh (node02)
- baremetal/vm-cluster/launch-node03-netboot.sh (node03)
- VDE virtual network for L2 connectivity
**Key Insight from T036:**
- nix-copy-closure requires nix on target → full NixOS deployment via nixos-anywhere
- Custom netboot (minimal Linux) insufficient for nix-built services
- T032's nixos-anywhere approach is architecturally correct
acceptance:
- All target bare-metal nodes provisioned with NixOS
- ChainFire + FlareDB Raft clusters formed (3-node quorum)
- IAM service operational on all control-plane nodes
- All 12 services deployed and healthy
- T029/T035 integration tests passing on live cluster
- Production deployment documented in runbook
steps:
- step: S1
name: Hardware Readiness Verification
done: Target bare-metal hardware accessible and ready for provisioning (verified by T032 completion)
status: complete
completed: 2025-12-12 04:15 JST
- step: S2
name: Bootstrap Infrastructure
done: VDE switch + 3 QEMU VMs booted with SSH access
status: complete
completed: 2025-12-12 06:55 JST
owner: peerB
priority: P0
started: 2025-12-12 06:50 JST
notes: |
**Decision (2025-12-12):** Option B (Direct Boot) selected for QEMU+VDE VM deployment.
**Implementation:**
1. Started VDE switch using nix package: /nix/store/.../vde2-2.3.3/bin/vde_switch
2. Verified netboot artifacts: bzImage (14MB), initrd (484MB)
3. Launched 3 QEMU VMs with direct kernel boot
4. Verified SSH access on all 3 nodes (ports 2201/2202/2203)
**Evidence:**
- VDE switch running (PID 734637)
- 3 QEMU processes active
- SSH successful: `hostname` returns "nixos" on all nodes
- Zero-touch access (SSH key baked into netboot image)
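The four implementation steps above can be sketched roughly as follows. The QEMU flags, memory/CPU sizing, and file names are illustrative assumptions; the real commands live in the `baremetal/vm-cluster/launch-node0*-netboot.sh` scripts:

```shell
#!/usr/bin/env bash
# Sketch of the S2 bootstrap flow (assumed flags; see launch-node0*-netboot.sh
# for the authoritative versions).
set -euo pipefail

VDE_SOCK=/tmp/vde.sock

# Map node number -> host SSH forward port (2201/2202/2203 per the notes).
ssh_port() { echo $((2200 + $1)); }

if [ "${1:-}" = "run" ]; then
  # 1. Start the VDE switch (daemonized), backing the shared L2 segment.
  vde_switch --sock "$VDE_SOCK" --daemon

  # 2./3. Launch three VMs with direct kernel boot on the VDE network.
  for n in 1 2 3; do
    qemu-system-x86_64 \
      -m 4096 -smp 2 -enable-kvm \
      -kernel bzImage -initrd initrd \
      -drive file="node0${n}.qcow2",if=virtio \
      -nic vde,sock="$VDE_SOCK" \
      -nic user,hostfwd=tcp::"$(ssh_port "$n")"-:22 \
      -daemonize -display none
  done

  # 4. Verify zero-touch SSH access (key baked into the netboot image).
  for n in 1 2 3; do
    ssh -p "$(ssh_port "$n")" root@localhost hostname
  done
fi
```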
outputs:
- path: /tmp/vde.sock
note: VDE switch daemon socket
- path: baremetal/vm-cluster/node01.qcow2
note: node01 disk (SSH 2201, VNC :1, serial 4401)
- path: baremetal/vm-cluster/node02.qcow2
note: node02 disk (SSH 2202, VNC :2, serial 4402)
- path: baremetal/vm-cluster/node03.qcow2
note: node03 disk (SSH 2203, VNC :3, serial 4403)
- step: S3
name: NixOS Provisioning
done: All nodes provisioned with base NixOS via nixos-anywhere
status: in_progress
started: 2025-12-12 06:57 JST
owner: peerB
priority: P0
acceptance_gate: |
All criteria must pass before S4:
1. All 3 nodes boot from disk (not ISO)
2. `nixos-version` returns 26.05+ on all nodes
3. SSH accessible via ports 2201/2202/2203
4. /etc/nixos/secrets/cluster-config.json exists on all nodes
5. Static IPs configured (192.168.100.11/12/13 on eth0)
verification_cmd: |
for port in 2201 2202 2203; do
ssh -p $port root@localhost 'nixos-version && ls /etc/nixos/secrets/cluster-config.json && ip addr show eth0 | grep 192.168.100'
done
notes: |
**Current State (2025-12-18):**
- VMs running from the ISO installer (`-boot d`), NOT from disk
- NixOS configs are asymmetric (node01 includes nightlight; node02/03 do not)
- Secrets handling required via --extra-files
**Option A: nixos-anywhere (fresh install)**
```bash
# Prepare secrets staging
mkdir -p /tmp/node01-extra/etc/nixos/secrets
cp docs/por/T036-vm-cluster-deployment/node01/secrets/* /tmp/node01-extra/etc/nixos/secrets/
# Deploy
nix run nixpkgs#nixos-anywhere -- --flake .#node01 --extra-files /tmp/node01-extra root@localhost -p 2201
```
**Option B: Reboot from disk (if already installed)**
1. Kill current QEMU processes
2. Use launch-node0{1,2,3}-disk.sh scripts
3. These boot with UEFI from disk (-boot c)
Node configurations from T036:
- docs/por/T036-vm-cluster-deployment/node01/
- docs/por/T036-vm-cluster-deployment/node02/
- docs/por/T036-vm-cluster-deployment/node03/
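Option A generalizes straightforwardly to all three nodes. A hedged sketch, assuming the flake exposes attributes `node01`..`node03` and the T036 secrets layout shown above:

```shell
#!/usr/bin/env bash
# Sketch: Option A (nixos-anywhere fresh install) applied to all three nodes.
# Flake attribute names and the secrets layout are taken from the notes above;
# verify both before running, since nixos-anywhere wipes the target disk.
set -euo pipefail

node_port() { echo $((2200 + $1)); }

if [ "${1:-}" = "run" ]; then
  for n in 1 2 3; do
    node="node0${n}"
    staging="/tmp/${node}-extra"

    # Stage per-node secrets so nixos-anywhere copies them onto the target.
    mkdir -p "${staging}/etc/nixos/secrets"
    cp "docs/por/T036-vm-cluster-deployment/${node}/secrets/"* \
       "${staging}/etc/nixos/secrets/"

    # Fresh install over SSH via the per-node forward port.
    nix run nixpkgs#nixos-anywhere -- \
      --flake ".#${node}" \
      --extra-files "${staging}" \
      root@localhost -p "$(node_port "$n")"
  done
fi
```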
- step: S4
name: Service Deployment
done: All 11 PlasmaCloud services deployed and running
status: pending
owner: peerB
priority: P0
acceptance_gate: |
All criteria must pass before S5:
1. `systemctl is-active` returns "active" for all 11 services on all 3 nodes
2. Each service responds to gRPC reflection (`grpcurl -plaintext <node>:<port> list`)
3. No service in failed/restart loop state
verification_cmd: |
for port in 2201 2202 2203; do
ssh -p $port root@localhost 'systemctl list-units --state=running | grep -cE "chainfire|flaredb|iam|plasmavmc|prismnet|flashdns|fiberlb|lightningstor|k8shost|nightlight|creditservice"'
done
# Expected: 11 on each node (33 total)
notes: |
**Services (11 total, per node):**
- chainfire-server (2379)
- flaredb-server (2479)
- iam-server (3000)
- plasmavmc-server (4000)
- prismnet-server (5000)
- flashdns-server (6000)
- fiberlb-server (7000)
- lightningstor-server (8000)
- k8shost-server (6443)
- nightlight-server (9101)
- creditservice-server (3010)
Service deployment is part of NixOS configuration in S3.
This step verifies all services started successfully.
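The acceptance gate's per-service reflection check can be sketched as below, using the port map from the list above. Assumptions: grpcurl is available, every service actually enables gRPC reflection, and the static IPs from S3 are reachable; iam (3000) and nightlight (9101) may expose HTTP rather than gRPC, so treat the sweep as illustrative:

```shell
#!/usr/bin/env bash
# Sketch: gRPC reflection sweep across all 11 services on all 3 nodes.
set -euo pipefail

# Service -> port map, from the S4 notes.
svc_port() {
  case "$1" in
    chainfire)     echo 2379 ;;
    flaredb)       echo 2479 ;;
    iam)           echo 3000 ;;
    plasmavmc)     echo 4000 ;;
    prismnet)      echo 5000 ;;
    flashdns)      echo 6000 ;;
    fiberlb)       echo 7000 ;;
    lightningstor) echo 8000 ;;
    k8shost)       echo 6443 ;;
    nightlight)    echo 9101 ;;
    creditservice) echo 3010 ;;
    *) return 1 ;;
  esac
}

if [ "${1:-}" = "run" ]; then
  for node in 192.168.100.11 192.168.100.12 192.168.100.13; do
    for svc in chainfire flaredb iam plasmavmc prismnet flashdns \
               fiberlb lightningstor k8shost nightlight creditservice; do
      if grpcurl -plaintext "${node}:$(svc_port "$svc")" list >/dev/null 2>&1; then
        echo "OK   ${node} ${svc}"
      else
        echo "FAIL ${node} ${svc}"
      fi
    done
  done
fi
```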
- step: S5
name: Cluster Formation
done: Raft clusters operational (ChainFire + FlareDB)
status: pending
owner: peerB
priority: P0
acceptance_gate: |
All criteria must pass before S6:
1. ChainFire: 3 nodes in cluster, leader elected, all healthy
2. FlareDB: 3 nodes joined, quorum formed (2/3 min)
3. IAM: responds on all 3 nodes
4. Write/read test passes across nodes (data replication verified)
verification_cmd: |
# ChainFire cluster check
grpcurl -plaintext localhost:2379 chainfire.ClusterService/GetStatus
# FlareDB cluster check
grpcurl -plaintext localhost:2479 flaredb.AdminService/GetClusterStatus
# IAM health check
for port in 2201 2202 2203; do
ssh -p $port root@localhost 'curl -s http://localhost:3000/health || echo FAIL'
done
notes: |
**Verify cluster formation:**
1. **ChainFire:**
- 3 nodes joined
- Leader elected
- Health check passing
2. **FlareDB:**
- 3 nodes joined
- Quorum formed
- Read/write operations working
3. **IAM:**
- All nodes responding
- Authentication working
**Dependencies:** first-boot-automation uses cluster-config.json for bootstrap/join logic
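The write/read replication check (gate item 4) can be sketched as follows. The FlareDB RPC names `flaredb.KVService/Put` and `/Get` and the request shapes are placeholders, not confirmed API: substitute the real methods from the FlareDB proto. Node addressing assumes the SSH forwards from S2:

```shell
#!/usr/bin/env bash
# Sketch: write on node01, read on node02, to confirm Raft replication.
# RPC names below are HYPOTHETICAL placeholders for the real FlareDB API.
set -euo pipefail

key="t039-smoke-$(date +%s)"

if [ "${1:-}" = "run" ]; then
  # Write a marker key on node01.
  ssh -p 2201 root@localhost \
    "grpcurl -plaintext -d '{\"key\":\"${key}\",\"value\":\"ok\"}' \
       localhost:2479 flaredb.KVService/Put"

  # Read the same key back on node02.
  if ssh -p 2202 root@localhost \
       "grpcurl -plaintext -d '{\"key\":\"${key}\"}' \
          localhost:2479 flaredb.KVService/Get" | grep -q ok; then
    echo "replication OK"
  else
    echo "replication FAIL"
  fi
fi
```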
- step: S6
name: Integration Testing
done: T029/T035 integration tests passing on live cluster
status: pending
owner: peerA
priority: P0
acceptance_gate: |
T039 complete when ALL pass:
1. Service Health: 11 services × 3 nodes = 33 healthy endpoints
2. IAM Auth: token issue + validate flow works
3. FlareDB: write on node01, read on node02 succeeds
4. LightningSTOR: S3 bucket/object CRUD works
5. FlashDNS: DNS record creation + query works
6. NightLight: Prometheus targets up, metrics queryable
7. Node Failure: cluster survives 1 node stop, rejoins on restart
success_criteria: |
P0 (must pass): #1, #2, #3, #7
P1 (should pass): #4, #5, #6
P2 (nice to have): FiberLB, PrismNET, CreditService
notes: |
**Test Plan**: docs/por/T039-production-deployment/S6-integration-test-plan.md
**Test Categories (in order):**
1. Service Health (11 services on 3 nodes)
2. Cluster Formation (ChainFire + FlareDB Raft)
3. Cross-Component (IAM auth, FlareDB storage, S3, DNS)
4. Nightlight Metrics
5. FiberLB Load Balancing (T051)
6. PrismNET Networking
7. CreditService Quota
8. Node Failure Resilience
**If tests fail:**
- Document failures in evidence section
- Create follow-up task for fixes
- Do not proceed to production traffic until P0 resolved
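Gate item 7 (node-failure resilience) can be exercised with a drill along these lines. It reuses the `chainfire.ClusterService/GetStatus` RPC quoted in S5's verification_cmd; the exact service unit names and the 10-second rejoin window are assumptions:

```shell
#!/usr/bin/env bash
# Sketch: node-failure drill. Stops node03's Raft services, checks the
# remaining 2/3 quorum still answers, then restarts and checks rejoin.
set -euo pipefail

down_node=2203   # SSH forward for node03
up_node=2201     # SSH forward for node01

quorum_ok() {
  ssh -p "$1" root@localhost \
    'grpcurl -plaintext localhost:2379 chainfire.ClusterService/GetStatus' \
    >/dev/null 2>&1
}

if [ "${1:-}" = "run" ]; then
  # 1. Take one node's cluster services down.
  ssh -p "$down_node" root@localhost \
    'systemctl stop chainfire-server flaredb-server'

  # 2. The cluster must stay responsive with 2/3 members.
  quorum_ok "$up_node" && echo "survived node loss" || echo "FAIL: quorum lost"

  # 3. Restart and confirm the node rejoins (grace period is a guess).
  ssh -p "$down_node" root@localhost \
    'systemctl start chainfire-server flaredb-server'
  sleep 10
  quorum_ok "$down_node" && echo "rejoined" || echo "FAIL: rejoin"
fi
```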
evidence: []
notes: |
**T036 Learnings Applied:**
- Use full NixOS deployment (not minimal netboot)
- nixos-anywhere is the proven deployment path
- Custom netboot with SSH key auth for zero-touch access
- VDE networking concepts map to real L2 switches
**Risk Mitigations:**
- Hardware validation before deployment (S1)
- Staged deployment (node-by-node)
- Integration testing before production traffic (S6)
- Rollback plan: Re-provision from scratch if needed