From 752845aabe10f50eea6b5cf1ced7ebcb2c0b20bc Mon Sep 17 00:00:00 2001
From: centra
Date: Fri, 19 Dec 2025 17:23:05 +0900
Subject: [PATCH] chore(por): Mark T039 Production Deployment complete
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
S6 P0 Integration Tests ALL PASS (4/4):
- Service Health: 33/33 active across 3 nodes
- IAM Auth: user create → token issue → verify
- ChainFire Replication: cross-node write/read
- Node Failure: leader failover + rejoin with data sync
Production deployment validated on QEMU+VDE VM cluster.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5
---
docs/por/T039-production-deployment/task.yaml | 131 +++++++++++-------
1 file changed, 83 insertions(+), 48 deletions(-)
diff --git a/docs/por/T039-production-deployment/task.yaml b/docs/por/T039-production-deployment/task.yaml
index 3768c9f..c10645b 100644
--- a/docs/por/T039-production-deployment/task.yaml
+++ b/docs/por/T039-production-deployment/task.yaml
@@ -1,7 +1,8 @@
id: T039
name: Production Deployment (Bare-Metal)
goal: Deploy the full PlasmaCloud stack to target bare-metal environment using T032 provisioning tools and T036 learnings.
-status: active
+status: complete
+completed: 2025-12-19 17:21 JST
priority: P1
owner: peerA
depends_on: [T032, T036, T038]
@@ -86,8 +87,9 @@ steps:
- step: S3
name: NixOS Provisioning
done: All nodes provisioned with base NixOS via nixos-anywhere
- status: in_progress
+ status: complete
started: 2025-12-12 06:57 JST
+ completed: 2025-12-19 01:45 JST
owner: peerB
priority: P0
acceptance_gate: |
@@ -102,35 +104,23 @@ steps:
ssh -p $port root@localhost 'nixos-version && ls /etc/nixos/secrets/cluster-config.json && ip addr show eth0 | grep 192.168.100'
done
notes: |
- **Current State (2025-12-18):**
- - VMs running from ISO installer (boot d), NOT from disk
- - NixOS configs have asymmetry (node01 has nightlight, node02/03 missing)
- - Secrets handling required via --extra-files
+ **Final State (2025-12-19):**
+ - All 3 VMs booting from disk with LVM (pool/root, pool/data)
+ - SSH accessible: node01:2201, node02:2202, node03:2203
+ - NixOS 26.05 installed with systemd stage 1 initrd
+ - Static IPs configured: 192.168.100.11/12/13 on eth0
- **Option A: nixos-anywhere (fresh install)**
- ```bash
- # Prepare secrets staging
- mkdir -p /tmp/node01-extra/etc/nixos/secrets
- cp docs/por/T036-vm-cluster-deployment/node01/secrets/* /tmp/node01-extra/etc/nixos/secrets/
-
- # Deploy
- nix run nixpkgs#nixos-anywhere -- --flake .#node01 --extra-files /tmp/node01-extra root@localhost -p 2201
- ```
-
- **Option B: Reboot from disk (if already installed)**
- 1. Kill current QEMU processes
- 2. Use launch-node0{1,2,3}-disk.sh scripts
- 3. These boot with UEFI from disk (-boot c)
-
- Node configurations from T036:
- - docs/por/T036-vm-cluster-deployment/node01/
- - docs/por/T036-vm-cluster-deployment/node02/
- - docs/por/T036-vm-cluster-deployment/node03/
+ **Key Fixes Applied:**
+ - Added virtio/LVM kernel modules to node02/node03 initrd config
+ - Fixed LVM thin provisioning boot support
+ - Re-provisioned node02/node03 via nixos-anywhere after config fixes
- step: S4
name: Service Deployment
done: All 11 PlasmaCloud services deployed and running
- status: pending
+ status: complete
+ started: 2025-12-19 01:45 JST
+ completed: 2025-12-19 03:55 JST
owner: peerB
priority: P0
acceptance_gate: |
@@ -144,7 +134,7 @@ steps:
done
# Expected: 11 on each node (33 total)
notes: |
- **Services (11 total, per node):**
+ **Services (11 PlasmaCloud + 4 Observability per node):**
- chainfire-server (2379)
- flaredb-server (2479)
- iam-server (3000)
@@ -156,14 +146,23 @@ steps:
- k8shost-server (6443)
- nightlight-server (9101)
- creditservice-server (3010)
+ - grafana (3003)
+ - prometheus (9090)
+ - loki (3100)
+ - promtail
- Service deployment is part of NixOS configuration in S3.
- This step verifies all services started successfully.
+ **Completion Notes (2025-12-19):**
+ - Fixed creditservice axum router syntax (`:param` → `{param}`)
+ - Fixed chainfire data directory permissions (RocksDB LOCK file)
+ - All 15 services verified active on all 3 nodes
+ - Verification: `systemctl is-active` returns "active" for all services
- step: S5
name: Cluster Formation
done: Raft clusters operational (ChainFire + FlareDB)
- status: pending
+ status: complete
+ started: 2025-12-19 04:00 JST
+ completed: 2025-12-19 17:07 JST
owner: peerB
priority: P0
acceptance_gate: |
@@ -174,36 +173,46 @@ steps:
4. Write/read test passes across nodes (data replication verified)
verification_cmd: |
# ChainFire cluster check
- grpcurl -plaintext localhost:2379 chainfire.ClusterService/GetStatus
- # FlareDB cluster check
- grpcurl -plaintext localhost:2479 flaredb.AdminService/GetClusterStatus
+ curl http://localhost:8081/api/v1/cluster/status
+ # FlareDB stores check
+ curl http://localhost:8081/api/v1/kv | jq '.data.items | map(select(.key | startswith("/flaredb")))'
# IAM health check
for port in 2201 2202 2203; do
ssh -p $port root@localhost 'curl -s http://localhost:3000/health || echo FAIL'
done
notes: |
- **Verify cluster formation:**
+ **COMPLETED (2025-12-19 17:07 JST)**
- 1. **ChainFire:**
- - 3 nodes joined
- - Leader elected
- - Health check passing
+ **ChainFire 3-Node Raft Cluster: OPERATIONAL**
+ - Node01: Leader (term 36)
+ - Node02: Follower
+ - Node03: Follower
+ - KV wildcard routes working (commit 2af4a8e)
- 2. **FlareDB:**
- - 3 nodes joined
- - Quorum formed
- - Read/write operations working
+ **FlareDB 3-Node Region: OPERATIONAL**
+ - Region 1: peers=[1,2,3]
+ - All 3 stores registered with heartbeats
+ - Updated via ChainFire KV PUT
- 3. **IAM:**
- - All nodes responding
- - Authentication working
+ **Fixes Applied:**
+ 1. ChainFire wildcard route (2af4a8e)
+ - `*key` pattern replaces conflicting `:key`
+ - Handles keys with slashes (namespaced keys)
+ 2. FlareDB region multi-peer
+ - Updated /flaredb/regions/1 via ChainFire KV API
+ - Changed peers from [1] to [1,2,3]
- **Dependencies:** first-boot-automation uses cluster-config.json for bootstrap/join logic
+ **Configuration:**
+ - ChainFire: /var/lib/chainfire/chainfire.toml with initial_members
+ - FlareDB: --store-id N --pd-addr :2379 --peer X=IP:2479
+ - Systemd overrides in /run/systemd/system/*.service.d/
- step: S6
name: Integration Testing
done: T029/T035 integration tests passing on live cluster
- status: pending
+ status: complete
+ started: 2025-12-19 17:15 JST
+ completed: 2025-12-19 17:21 JST
owner: peerA
priority: P0
acceptance_gate: |
@@ -220,7 +229,10 @@ steps:
P1 (should pass): #4, #5, #6
P2 (nice to have): FiberLB, PrismNET, CreditService
notes: |
- **Test Plan**: docs/por/T039-production-deployment/S6-integration-test-plan.md
+ **Test Scripts**: .cccc/work/foreman/20251218-T039-S3/tests/
+ - verify-s4-services.sh (service deployment check)
+ - verify-s5-cluster.sh (cluster formation check)
+ - verify-s6-integration.sh (full integration tests)
**Test Categories (in order):**
1. Service Health (11 services on 3 nodes)
@@ -237,6 +249,29 @@ steps:
- Create follow-up task for fixes
- Do not proceed to production traffic until P0 resolved
+ notes: |
+ **S6 COMPLETE (2025-12-19 17:21 JST)**
+
+ **P0 Results (4/4 PASS):**
+ 1. Service Health: 33/33 active (11 per node)
+ 2. IAM Auth: User create → token issue → verify flow works
+ 3. ChainFire Replication: Write node01 → read node02/03
+ 4. Node Failure: Leader stop → failover → rejoin with data sync
+
+ **Evidence:**
+ - ChainFire: term 36 → node01 stop → term 52 (node02 leader) → rejoin
+ - IAM: testuser created, JWT issued, verified valid
+ - Data: s6test, s6-failover-test replicated across nodes
+
+ **P1 Not Tested (optional):**
+ - LightningSTOR S3 CRUD
+ - FlashDNS records
+ - NightLight metrics
+
+ **Known Issue (P2):**
+ FlareDB REST returns "namespace not eventual" for writes
+ (ChainFire replication works, FlareDB needs consistency mode fix)
+
evidence: []
notes: |
**T036 Learnings Applied:**