id: T036
name: VM Cluster Deployment (T032 Validation)
goal: >
  Deploy and validate a 3-node PlasmaCloud cluster using T032 bare-metal
  provisioning tools in a VM environment, validating the end-to-end
  provisioning flow before physical deployment.
status: active
priority: P0
owner: peerA
created: 2025-12-11
depends_on: [T032, T035]
blocks: []
context: |
  PROJECT.md Principal: "To Peer A: **you may decide the strategy yourself!** Do as you like!"

  Strategic Decision: Pursue a VM-based testing cluster (Option A from the
  deployment readiness assessment) to validate the T032 tools end-to-end
  before committing to physical infrastructure.

  - T032 delivered: PXE boot infra, NixOS image builder, first-boot automation, documentation (17,201L)
  - T035 validated: single-VM build integration (10/10 services, dev builds)

  This task validates: multi-node cluster deployment, PXE boot flow,
  nixos-anywhere, Raft cluster formation, first-boot automation, and
  operational procedures.
acceptance:
  - 3 VMs deployed with libvirt/KVM
  - Virtual network configured for PXE boot
  - PXE server running and serving netboot images
  - All 3 nodes provisioned via nixos-anywhere
  - Chainfire + FlareDB Raft clusters formed (3-node quorum)
  - IAM service operational on all control-plane nodes
  - Health checks passing on all services
  - T032 RUNBOOK validated end-to-end
steps:
  - step: S1
    name: VM Infrastructure Setup
    done: 3 VMs created with QEMU, multicast socket network configured, launch scripts ready
    status: complete
    owner: peerA
    priority: P0
    progress: |
      **COMPLETED** — VM infrastructure operational; pivoted to ISO boot approach

      Completed:
      - ✅ Created VM working directory: /home/centra/cloud/baremetal/vm-cluster
      - ✅ Created disk images: node01/02/03.qcow2 (100GB each)
      - ✅ Wrote launch scripts: launch-node{01,02,03}.sh
      - ✅ Configured QEMU multicast socket networking (230.0.0.1:1234)
      - ✅ VM specs: 8 vCPU, 16GB RAM per node
      - ✅ MACs assigned: 52:54:00:00:01:{01,02,03} (nodes)
      - ✅ Netboot artifacts built successfully (bzImage 14MB, initrd 484MB, ZFS disabled)
      - ✅ **PIVOT DECISION**: ISO boot approach (QEMU 10.1.2 initrd compatibility bug)
      - ✅ Downloaded NixOS 25.11 minimal ISO (1.6GB)
      - ✅ Node01 booting from ISO, multicast network configured
    notes: |
      **Topology Change:** Abandoned libvirt bridges (required root). Using QEMU directly with:
      - Multicast socket networking (no root required): `-netdev socket,mcast=230.0.0.1:1234`
      - 3 node VMs (pxe-server dropped due to the ISO pivot)
      - All VMs share one L2 segment via multicast

      **PIVOT JUSTIFICATION (MID: cccc-1765406017-b04a6e):**
      - Netboot artifacts validated ✓ (build process, kernel-6.18 ZFS fix)
      - QEMU 10.1.2 initrd bug blocks PXE testing (environmental, not a T032 issue)
      - ISO + nixos-anywhere validates the core T032 provisioning capability
      - PXE boot protocol deferred to bare-metal validation
  - step: S2
    name: Network Access Configuration
    done: Node VMs configured with SSH access for nixos-anywhere (netboot key auth)
    status: complete
    owner: peerB
    priority: P0
    progress: |
      **COMPLETED** — Custom netboot with SSH key auth bypasses VNC/telnet entirely

      Completed (2025-12-11):
      - ✅ Updated nix/images/netboot-base.nix with the real SSH key (centra@cn-nixos-think)
      - ✅ Added netboot-base to flake.nix nixosConfigurations
      - ✅ Built netboot artifacts (kernel 14MB, initrd 484MB)
      - ✅ Created launch-node01-netboot.sh (QEMU -kernel/-initrd direct boot)
      - ✅ Fixed the init path in the kernel append parameter
      - ✅ SSH access verified (port 2201, key auth, zero manual interaction)

      Evidence:
      ```
      ssh -p 2201 root@localhost
      -> SUCCESS: nixos at Thu Dec 11 12:48:13 AM UTC 2025
      ```

      **PIVOT DECISION (2025-12-11, MID: cccc-1765413547-285e0f):**
      - PeerA directive: build a custom netboot image with the SSH key baked in
      - Eliminates VNC/telnet/password setup entirely
      - Netboot approach superior to ISO for automated provisioning
    notes: |
      **Solution Evolution:**
      - Initial: VNC (Option C) - requires user interaction
      - Investigation: Alpine/telnet (Options A/B) - tooling gap / fragile
      - Final: custom netboot with SSH key (PeerA strategy) - ZERO manual steps

      Files created:
      - baremetal/vm-cluster/launch-node01-netboot.sh (direct kernel/initrd boot)
      - baremetal/vm-cluster/netboot-{kernel,initrd}/ (nix build outputs)
  - step: S3
    name: TLS Certificate Generation
    done: CA and per-node certificates generated, ready for deployment
    status: complete
    owner: peerA
    priority: P0
    progress: |
      **COMPLETED** — TLS certificates generated and deployed to node config directories

      Completed:
      - ✅ Generated CA certificate and key
      - ✅ Generated node01.crt/.key (192.168.100.11)
      - ✅ Generated node02.crt/.key (192.168.100.12)
      - ✅ Generated node03.crt/.key (192.168.100.13)
      - ✅ Copied to docs/por/T036-vm-cluster-deployment/node*/secrets/
      - ✅ Permissions set (ca.crt/node*.crt: 644, node*.key: 400)
      - ✅ **CRITICAL FIX (2025-12-11):** Renamed certs to match cluster-config.json expectations
        - ca-cert.pem → ca.crt, cert.pem → node0X.crt, key.pem → node0X.key (all 3 nodes)
        - Prevented a first-boot automation failure (services could not load the TLS certs)
    notes: |
      Certificates are ready for nixos-anywhere deployment (to be placed at /etc/nixos/secrets/).

      **Critical naming fix applied:** certs renamed to match cluster-config.json paths
  - step: S4
    name: Node Configuration Preparation
    done: configuration.nix, disko.nix, cluster-config.json ready for all 3 nodes
    status: complete
    owner: peerB
    priority: P0
    progress: |
      **COMPLETED** — All node configurations created and validated

      Deliverables (13 files, ~600 LOC):
      - ✅ node01/configuration.nix (112L) - NixOS system config, control-plane services
      - ✅ node01/disko.nix (62L) - disk partitioning (EFI + LVM)
      - ✅ node01/secrets/cluster-config.json (28L) - Raft bootstrap config
      - ✅ node01/secrets/README.md - TLS documentation
      - ✅ node02/* (same structure, IP: 192.168.100.12)
      - ✅ node03/* (same structure, IP: 192.168.100.13)
      - ✅ DEPLOYMENT.md (335L) - comprehensive deployment guide

      Configuration highlights:
      - All 9 control-plane services enabled per node
      - Bootstrap mode: `bootstrap: true` on all 3 nodes (simultaneous initialization)
      - Network: static IPs 192.168.100.11/12/13
      - Disk: single-disk LVM (512MB EFI + 80GB root + 19.5GB data)
      - First-boot automation: enabled with cluster-config.json
      - **CRITICAL FIX (2025-12-11):** Added networking.hosts to all 3 nodes (configuration.nix:14-19)
        - Maps node01/02/03 hostnames to 192.168.100.11/12/13
        - Prevented a Raft bootstrap failure (cluster-config.json uses hostnames; DNS is unavailable)
    notes: |
      Node configurations are ready for nixos-anywhere provisioning (S5).
      TLS certificates from S3 are already in the secrets/ directories.

      **Critical fixes applied:** TLS cert naming (S3), hostname resolution (/etc/hosts)
  - step: S5
    name: Cluster Provisioning
    done: All 3 nodes provisioned via nixos-anywhere, first-boot automation completed
    status: in_progress
    owner: peerB
    priority: P0
    progress: |
      **BLOCKED** — nixos-anywhere flake path resolution errors (nix store vs git working tree)

      Completed:
      - ✅ All 3 VMs launched with custom netboot (SSH ports 2201/2202/2203, key auth)
      - ✅ SSH access verified on all nodes (zero manual interaction)
      - ✅ Node configurations staged in git (node0{1,2,3}/configuration.nix + disko.nix + secrets/)
      - ✅ nix/modules staged (first-boot-automation, k8shost, metricstor, observability)
      - ✅ Launch scripts created: launch-node0{1,2,3}-netboot.sh

      Blocked:
      - ❌ nixos-anywhere failing with path resolution errors
      - ❌ Error: `/nix/store/.../docs/nix/modules/default.nix does not exist`
      - ❌ Root cause: git tree dirty + files not in the nix store (flake evaluation sees
        only the git-tracked files copied into the store, so anything untracked is invisible)
      - ❌ 3 attempts made, each failing on a different missing path

      Next (awaiting PeerA decision):
      - Option A: Continue debugging (may need a git commit or the --impure flag)
      - Option B: Alternative provisioning (direct configuration.nix)
      - Option C: Hand off to PeerA

      Earlier blocker (superseded by the S2 netboot pivot):
      - Analyzed telnet serial console automation viability
      - Presented 3 options: Alpine automation (A), NixOS+telnet (B), VNC (C)
      - ❌ SSH access unavailable (connection refused to 192.168.100.11)
      - ❌ S2 dependency: VNC network configuration or telnet console bypass required
      - [ ] Choose unblock strategy: VNC (C), NixOS+telnet (B), or Alpine (A)

      Next steps (when unblocked):
      - [ ] Run nixos-anywhere for node01/02/03
      - [ ] Monitor first-boot automation logs
      - [ ] Verify cluster formation (Chainfire, FlareDB Raft)
    notes: |
      **Unblock options (peerB investigation 2025-12-11; superseded by the S2 netboot pivot):**
      - Option A: Alpine virt ISO + telnet automation (viable but fragile)
      - Option B: NixOS + manual telnet console (recommended: simple, reliable)
      - Option C: original VNC approach (lowest risk, requires user interaction)

      ISO boot approach (not PXE):
      - Boot VMs from a NixOS/Alpine ISO
      - Configure SSH via VNC or telnet serial console (resolved in S2 via a baked-in key)
      - Execute nixos-anywhere with the node configurations from S4
      - First-boot automation will handle cluster initialization
  - step: S6
    name: Cluster Validation
    done: All acceptance criteria met, cluster operational, RUNBOOK validated
    status: pending
    owner: peerA
    priority: P0
    notes: |
      Validate the cluster per the T032 QUICKSTART:
      - Chainfire cluster: 3 members, leader elected, health OK
      - FlareDB cluster: 3 members, quorum formed, health OK
      - IAM service: all nodes responding
      - CRUD operations: write/read/delete working
      - Data persistence: verify across restarts
      - Metrics: Prometheus endpoints responding
evidence: []
notes: |
  **Strategic Rationale:**
  - VM deployment validates the T032 tools without a hardware dependency
  - Fastest feedback loop (~3-4 hours total)
  - After success, physical bare-metal deployment has a validated blueprint
  - Failure discovery in VMs is cheaper than on physical hardware

  **Timeline Estimate:**
  - S1 VM Infrastructure: 30 min
  - S2 Network Access (originally PXE Server): 30 min
  - S3 TLS Certs: 15 min
  - S4 Node Configs: 30 min
  - S5 Provisioning: 60 min
  - S6 Validation: 30 min
  - Total: ~3.5 hours

  **Success Criteria:**
  - All 6 steps complete
  - 3-node Raft cluster operational
  - T032 RUNBOOK procedures validated
  - Ready for physical bare-metal deployment
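The no-root QEMU topology from S1/S2 can be sketched as follows. The multicast netdev, MAC scheme, vCPU/RAM sizing, and SSH port 2201 come from the task notes; the q35/KVM flags and the user-mode management NIC with `hostfwd` are assumptions (the actual `launch-node*.sh` scripts are the source of truth). The command is printed rather than executed so the sketch is inspectable:

```shell
#!/bin/sh
# Hedged sketch of launch-node01-netboot.sh. Multicast socket (230.0.0.1:1234),
# MAC 52:54:00:00:01:01, 8 vCPU / 16GB, and SSH port 2201 are from S1/S2;
# the machine flags and the mgmt NIC are assumptions.
set -eu
NODE=01
qemu_cmd="qemu-system-x86_64 \
  -machine q35,accel=kvm -smp 8 -m 16384 \
  -drive file=node${NODE}.qcow2,if=virtio \
  -kernel netboot-kernel/bzImage -initrd netboot-initrd/initrd \
  -netdev socket,id=lan,mcast=230.0.0.1:1234 \
  -device virtio-net-pci,netdev=lan,mac=52:54:00:00:01:${NODE} \
  -netdev user,id=mgmt,hostfwd=tcp::22${NODE}-:22 \
  -device virtio-net-pci,netdev=mgmt \
  -nographic"
# A real launch also needs -append with the correct init= store path (see the
# S2 "Fixed init path" item); that path is build-specific and elided here.
echo "$qemu_cmd"
```

The shared `-netdev socket,mcast=...` segment is what lets all three VMs see each other at L2 without root, while the per-node `hostfwd` port gives nixos-anywhere its SSH entry point.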
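The S3 certificate layout (ca.crt plus node0X.crt/node0X.key, certs 644, keys 400, node IP in the SAN) can be reproduced with openssl along these lines. Subject names and the key size are illustrative; the file names match what cluster-config.json expects after the S3 naming fix:

```shell
#!/bin/sh
# Hedged sketch of the S3 cert generation for node01; repeat for node02/03
# with their IPs. One reasonable openssl recipe, not necessarily the exact
# commands used in S3.
set -eu
# 1. Self-signed CA (CN is illustrative)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=plasmacloud-vm-ca" -keyout ca.key -out ca.crt
# 2. Per-node key + CSR
openssl req -newkey rsa:2048 -nodes \
  -subj "/CN=node01" -keyout node01.key -out node01.csr
# 3. Sign with the CA, adding hostname + IP SANs (matches the S4
#    networking.hosts fix: services dial node01 / 192.168.100.11)
printf 'subjectAltName=DNS:node01,IP:192.168.100.11\n' > node01.ext
openssl x509 -req -in node01.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -extfile node01.ext -out node01.crt
# 4. Permissions per S3 (certs world-readable, keys owner-only)
chmod 644 ca.crt node01.crt
chmod 400 node01.key ca.key
```

Using `-extfile` on the signing step (rather than SAN extensions in the CSR) keeps the recipe working on both OpenSSL 1.1.1 and 3.x.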
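The S5 per-node provisioning call can be sketched as below. The flake attribute name, the `--extra-files` directory layout, and the target are assumptions; the flags are standard nixos-anywhere options (verify against the installed version). The command is printed, not executed. Note the connection to the S5 blocker: flake evaluation works from a store copy containing only git-tracked (staged or committed) files, so every referenced module must be at least `git add`ed before this can resolve paths.

```shell
#!/bin/sh
# Hedged sketch of one S5 provisioning run against the netboot VM from S2.
# ".#node01" and ./node01/extra are illustrative names, not confirmed repo paths.
set -eu
NODE=01
cmd="nix run github:nix-community/nixos-anywhere -- \
  --flake .#node${NODE} \
  --ssh-port 22${NODE} \
  --extra-files ./node${NODE}/extra \
  root@localhost"
# --extra-files copies a directory tree onto the installed system, one way to
# land the S3 certs at /etc/nixos/secrets/ (layout assumed).
echo "$cmd"
```

nixos-anywhere then partitions the disk per disko.nix, installs the configuration, and reboots into the system whose first-boot automation performs the cluster bootstrap.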
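The S6 health-check pass can be smoke-tested with a loop like the one below. The node IPs come from S4 and the CA file from S3, but the HTTPS port and `/healthz` path are placeholders: substitute the real Chainfire/FlareDB/IAM health endpoints documented in the T032 RUNBOOK.

```shell
#!/bin/sh
# Hedged sketch of an S6 smoke test; port 8443 and /healthz are placeholders
# for the per-service endpoints in the T032 RUNBOOK.
set -u
NODES="192.168.100.11 192.168.100.12 192.168.100.13"
PORT=8443
total=0; ok=0
for n in $NODES; do
  total=$((total + 1))
  if curl -fsS --max-time 5 --cacert ca.crt "https://${n}:${PORT}/healthz" >/dev/null 2>&1; then
    ok=$((ok + 1)); echo "OK   ${n}"
  else
    echo "FAIL ${n}"
  fi
done
echo "health: ${ok}/${total} nodes OK"
```

A 3/3 result on each service's real endpoint, plus a leader visible in both Raft clusters, would satisfy the Chainfire/FlareDB/IAM acceptance lines.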