id: T036
name: VM Cluster Deployment (T032 Validation)
goal: >
  Deploy and validate a 3-node PlasmaCloud cluster using T032 bare-metal
  provisioning tools in a VM environment, validating the end-to-end
  provisioning flow before physical deployment.
status: active
priority: P0
owner: peerA
created: 2025-12-11
depends_on: [T032, T035]
blocks: []
context: |
  PROJECT.md Principal: "To Peer A: **you may decide the strategy yourself!** Do as you like!"

  Strategic Decision: Pursue a VM-based testing cluster (Option A from the
  deployment readiness assessment) to validate the T032 tools end-to-end
  before committing to physical infrastructure.

  - T032 delivered: PXE boot infra, NixOS image builder, first-boot automation, documentation (17,201L)
  - T035 validated: single-VM build integration (10/10 services, dev builds)

  This task validates: multi-node cluster deployment, PXE boot flow,
  nixos-anywhere, Raft cluster formation, first-boot automation, and
  operational procedures.
acceptance:
  - 3 VMs deployed with libvirt/KVM
  - Virtual network configured for PXE boot
  - PXE server running and serving netboot images
  - All 3 nodes provisioned via nixos-anywhere
  - Chainfire + FlareDB Raft clusters formed (3-node quorum)
  - IAM service operational on all control-plane nodes
  - Health checks passing on all services
  - T032 RUNBOOK validated end-to-end
steps:
  - step: S1
    name: VM Infrastructure Setup
    done: 3 VMs created with QEMU, multicast socket network configured, launch scripts ready
    status: complete
    owner: peerA
    priority: P0
    progress: |
      **COMPLETED** — VM infrastructure operational; pivoted to ISO boot approach

      Completed:
      - ✅ Created VM working directory: /home/centra/cloud/baremetal/vm-cluster
      - ✅ Created disk images: node01/02/03.qcow2 (100GB each)
      - ✅ Wrote launch scripts: launch-node{01,02,03}.sh
      - ✅ Configured QEMU multicast socket networking (230.0.0.1:1234)
      - ✅ VM specs: 8 vCPU, 16GB RAM per node
      - ✅ MACs assigned: 52:54:00:00:01:{01,02,03} (nodes)
      - ✅ Netboot artifacts built successfully (bzImage 14MB, initrd 484MB, ZFS disabled)
      - ✅ **PIVOT DECISION**: ISO boot approach (QEMU 10.1.2 initrd compatibility bug)
      - ✅ Downloaded NixOS 25.11 minimal ISO (1.6GB)
      - ✅ Node01 booting from ISO, multicast network configured
    notes: |
      **Topology Change:** Abandoned libvirt bridges (required root). Using QEMU directly with:
      - Multicast socket networking (no root required): `-netdev socket,mcast=230.0.0.1:1234`
      - 3 node VMs (pxe-server dropped due to the ISO pivot)
      - All VMs share one L2 segment via multicast

      **PIVOT JUSTIFICATION (MID: cccc-1765406017-b04a6e):**
      - Netboot artifacts validated ✓ (build process, kernel-6.18 ZFS fix)
      - QEMU 10.1.2 initrd bug blocks PXE testing (environmental, not a T032 issue)
      - ISO + nixos-anywhere validates the core T032 provisioning capability
      - PXE boot protocol deferred to bare-metal validation
  - step: S2
    name: Network Access Configuration
    done: Node VMs configured with SSH access for nixos-anywhere (netboot key auth)
    status: complete
    owner: peerB
    priority: P0
    progress: |
      **COMPLETED** — Custom netboot with SSH key auth bypasses VNC/telnet entirely

      Completed (2025-12-11):
      - ✅ Updated nix/images/netboot-base.nix with the real SSH key (centra@cn-nixos-think)
      - ✅ Added netboot-base to flake.nix nixosConfigurations
      - ✅ Built netboot artifacts (kernel 14MB, initrd 484MB)
      - ✅ Created launch-node01-netboot.sh (QEMU -kernel/-initrd direct boot)
      - ✅ Fixed the init path in the kernel append parameter
      - ✅ SSH access verified (port 2201, key auth, zero manual interaction)

      Evidence:
      ```
      ssh -p 2201 root@localhost
      -> SUCCESS: nixos at Thu Dec 11 12:48:13 AM UTC 2025
      ```

      **PIVOT DECISION (2025-12-11, MID: cccc-1765413547-285e0f):**
      - PeerA directive: build a custom netboot image with the SSH key baked in
      - Eliminates VNC/telnet/password setup entirely
      - Netboot approach superior to ISO for automated provisioning
    notes: |
      **Solution Evolution:**
      - Initial: VNC (Option C) - requires user interaction
      - Investigation: Alpine/telnet (Options A/B) - tooling gap / fragile
      - Final: custom netboot with SSH key (PeerA strategy) - ZERO manual steps

      Files created:
      - baremetal/vm-cluster/launch-node01-netboot.sh (direct kernel/initrd boot)
      - baremetal/vm-cluster/netboot-{kernel,initrd}/ (nix build outputs)
  - step: S3
    name: TLS Certificate Generation
    done: CA and per-node certificates generated, ready for deployment
    status: complete
    owner: peerA
    priority: P0
    progress: |
      **COMPLETED** — TLS certificates generated and deployed to node config directories

      Completed:
      - ✅ Generated CA certificate and key
      - ✅ Generated node01.crt/.key (192.168.100.11)
      - ✅ Generated node02.crt/.key (192.168.100.12)
      - ✅ Generated node03.crt/.key (192.168.100.13)
      - ✅ Copied to docs/por/T036-vm-cluster-deployment/node*/secrets/
      - ✅ Permissions set (ca.crt/node*.crt: 644, node*.key: 400)
      - ✅ **CRITICAL FIX (2025-12-11):** Renamed certs to match cluster-config.json expectations
        - ca-cert.pem → ca.crt, cert.pem → node0X.crt, key.pem → node0X.key (all 3 nodes)
        - Prevented a first-boot automation failure (services could not load the TLS certs)
    notes: |
      Certificates are ready for nixos-anywhere deployment (to be placed at /etc/nixos/secrets/).

      **Critical naming fix applied:** certs renamed to match cluster-config.json paths
  - step: S4
    name: Node Configuration Preparation
    done: configuration.nix, disko.nix, cluster-config.json ready for all 3 nodes
    status: complete
    owner: peerB
    priority: P0
    progress: |
      **COMPLETED** — All node configurations created and validated

      Deliverables (13 files, ~600 LOC):
      - ✅ node01/configuration.nix (112L) - NixOS system config, control-plane services
      - ✅ node01/disko.nix (62L) - disk partitioning (EFI + LVM)
      - ✅ node01/secrets/cluster-config.json (28L) - Raft bootstrap config
      - ✅ node01/secrets/README.md - TLS documentation
      - ✅ node02/* (same structure, IP: 192.168.100.12)
      - ✅ node03/* (same structure, IP: 192.168.100.13)
      - ✅ DEPLOYMENT.md (335L) - comprehensive deployment guide

      Configuration highlights:
      - All 9 control-plane services enabled per node
      - Bootstrap mode: `bootstrap: true` on all 3 nodes (simultaneous initialization)
      - Network: static IPs 192.168.100.11/12/13
      - Disk: single-disk LVM (512MB EFI + 80GB root + 19.5GB data)
      - First-boot automation: enabled with cluster-config.json
      - **CRITICAL FIX (2025-12-11):** Added networking.hosts to all 3 nodes (configuration.nix:14-19)
        - Maps node01/02/03 hostnames to 192.168.100.11/12/13
        - Prevented a Raft bootstrap failure (cluster-config.json uses hostnames; DNS is unavailable)
    notes: |
      Node configurations are ready for nixos-anywhere provisioning (S5).
      TLS certificates from S3 are already in the secrets/ directories.

      **Critical fixes applied:** TLS cert naming (S3), hostname resolution (/etc/hosts)
  - step: S5
    name: Cluster Provisioning
    done: All 3 nodes provisioned via nixos-anywhere, first-boot automation completed
    status: in_progress
    owner: peerB
    priority: P0
    progress: |
      **BLOCKED** — nixos-anywhere flake path resolution errors (nix store vs git working tree)

      Completed:
      - ✅ All 3 VMs launched with custom netboot (SSH ports 2201/2202/2203, key auth)
      - ✅ SSH access verified on all nodes (zero manual interaction)
      - ✅ Node configurations staged in git (node0{1,2,3}/configuration.nix + disko.nix + secrets/)
      - ✅ nix/modules staged (first-boot-automation, k8shost, metricstor, observability)
      - ✅ Launch scripts created: launch-node0{1,2,3}-netboot.sh

      Blocked:
      - ❌ nixos-anywhere failing with path resolution errors
      - ❌ Error: `/nix/store/.../docs/nix/modules/default.nix does not exist`
      - ❌ Root cause: git tree dirty + files not in the nix store (flake evaluation sees
        only the git-tracked files copied into the store, so anything untracked is invisible)
      - ❌ 3 attempts made, each failing on a different missing path

      Next (awaiting PeerA decision):
      - Option A: Continue debugging (may need a git commit or the --impure flag)
      - Option B: Alternative provisioning (direct configuration.nix)
      - Option C: Hand off to PeerA

      Earlier blocker (superseded by the S2 netboot pivot):
      - Analyzed telnet serial console automation viability
      - Presented 3 options: Alpine automation (A), NixOS+telnet (B), VNC (C)
      - ❌ SSH access unavailable (connection refused to 192.168.100.11)
      - ❌ S2 dependency: VNC network configuration or telnet console bypass required
      - [ ] Choose unblock strategy: VNC (C), NixOS+telnet (B), or Alpine (A)

      Next steps (when unblocked):
      - [ ] Run nixos-anywhere for node01/02/03
      - [ ] Monitor first-boot automation logs
      - [ ] Verify cluster formation (Chainfire, FlareDB Raft)
    notes: |
      **Unblock options (peerB investigation 2025-12-11; superseded by the S2 netboot pivot):**
      - Option A: Alpine virt ISO + telnet automation (viable but fragile)
      - Option B: NixOS + manual telnet console (recommended: simple, reliable)
      - Option C: original VNC approach (lowest risk, requires user interaction)

      ISO boot approach (not PXE):
      - Boot VMs from a NixOS/Alpine ISO
      - Configure SSH via VNC or telnet serial console (resolved in S2 via a baked-in key)
      - Execute nixos-anywhere with the node configurations from S4
      - First-boot automation will handle cluster initialization
  - step: S6
    name: Cluster Validation
    done: All acceptance criteria met, cluster operational, RUNBOOK validated
    status: pending
    owner: peerA
    priority: P0
    notes: |
      Validate the cluster per the T032 QUICKSTART:
      - Chainfire cluster: 3 members, leader elected, health OK
      - FlareDB cluster: 3 members, quorum formed, health OK
      - IAM service: all nodes responding
      - CRUD operations: write/read/delete working
      - Data persistence: verify across restarts
      - Metrics: Prometheus endpoints responding
evidence: []
notes: |
  **Strategic Rationale:**
  - VM deployment validates the T032 tools without a hardware dependency
  - Fastest feedback loop (~3-4 hours total)
  - After success, physical bare-metal deployment has a validated blueprint
  - Failure discovery in VMs is cheaper than on physical hardware

  **Timeline Estimate:**
  - S1 VM Infrastructure: 30 min
  - S2 Network Access (originally PXE Server): 30 min
  - S3 TLS Certs: 15 min
  - S4 Node Configs: 30 min
  - S5 Provisioning: 60 min
  - S6 Validation: 30 min
  - Total: ~3.5 hours

  **Success Criteria:**
  - All 6 steps complete
  - 3-node Raft cluster operational
  - T032 RUNBOOK procedures validated
  - Ready for physical bare-metal deployment
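The no-root QEMU topology from S1/S2 can be sketched as follows. The multicast netdev, MAC scheme, vCPU/RAM sizing, and SSH port 2201 come from the task notes; the q35/KVM flags and the user-mode management NIC with `hostfwd` are assumptions (the actual `launch-node*.sh` scripts are the source of truth). The command is printed rather than executed so the sketch is inspectable:

```shell
#!/bin/sh
# Hedged sketch of launch-node01-netboot.sh. Multicast socket (230.0.0.1:1234),
# MAC 52:54:00:00:01:01, 8 vCPU / 16GB, and SSH port 2201 are from S1/S2;
# the machine flags and the mgmt NIC are assumptions.
set -eu
NODE=01
qemu_cmd="qemu-system-x86_64 \
  -machine q35,accel=kvm -smp 8 -m 16384 \
  -drive file=node${NODE}.qcow2,if=virtio \
  -kernel netboot-kernel/bzImage -initrd netboot-initrd/initrd \
  -netdev socket,id=lan,mcast=230.0.0.1:1234 \
  -device virtio-net-pci,netdev=lan,mac=52:54:00:00:01:${NODE} \
  -netdev user,id=mgmt,hostfwd=tcp::22${NODE}-:22 \
  -device virtio-net-pci,netdev=mgmt \
  -nographic"
# A real launch also needs -append with the correct init= store path (see the
# S2 "Fixed init path" item); that path is build-specific and elided here.
echo "$qemu_cmd"
```

The shared `-netdev socket,mcast=...` segment is what lets all three VMs see each other at L2 without root, while the per-node `hostfwd` port gives nixos-anywhere its SSH entry point.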
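The S3 certificate layout (ca.crt plus node0X.crt/node0X.key, certs 644, keys 400, node IP in the SAN) can be reproduced with openssl along these lines. Subject names and the key size are illustrative; the file names match what cluster-config.json expects after the S3 naming fix:

```shell
#!/bin/sh
# Hedged sketch of the S3 cert generation for node01; repeat for node02/03
# with their IPs. One reasonable openssl recipe, not necessarily the exact
# commands used in S3.
set -eu
# 1. Self-signed CA (CN is illustrative)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=plasmacloud-vm-ca" -keyout ca.key -out ca.crt
# 2. Per-node key + CSR
openssl req -newkey rsa:2048 -nodes \
  -subj "/CN=node01" -keyout node01.key -out node01.csr
# 3. Sign with the CA, adding hostname + IP SANs (matches the S4
#    networking.hosts fix: services dial node01 / 192.168.100.11)
printf 'subjectAltName=DNS:node01,IP:192.168.100.11\n' > node01.ext
openssl x509 -req -in node01.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 365 -extfile node01.ext -out node01.crt
# 4. Permissions per S3 (certs world-readable, keys owner-only)
chmod 644 ca.crt node01.crt
chmod 400 node01.key ca.key
```

Using `-extfile` on the signing step (rather than SAN extensions in the CSR) keeps the recipe working on both OpenSSL 1.1.1 and 3.x.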
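The S5 per-node provisioning call can be sketched as below. The flake attribute name, the `--extra-files` directory layout, and the target are assumptions; the flags are standard nixos-anywhere options (verify against the installed version). The command is printed, not executed. Note the connection to the S5 blocker: flake evaluation works from a store copy containing only git-tracked (staged or committed) files, so every referenced module must be at least `git add`ed before this can resolve paths.

```shell
#!/bin/sh
# Hedged sketch of one S5 provisioning run against the netboot VM from S2.
# ".#node01" and ./node01/extra are illustrative names, not confirmed repo paths.
set -eu
NODE=01
cmd="nix run github:nix-community/nixos-anywhere -- \
  --flake .#node${NODE} \
  --ssh-port 22${NODE} \
  --extra-files ./node${NODE}/extra \
  root@localhost"
# --extra-files copies a directory tree onto the installed system, one way to
# land the S3 certs at /etc/nixos/secrets/ (layout assumed).
echo "$cmd"
```

nixos-anywhere then partitions the disk per disko.nix, installs the configuration, and reboots into the system whose first-boot automation performs the cluster bootstrap.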
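The S6 health-check pass can be smoke-tested with a loop like the one below. The node IPs come from S4 and the CA file from S3, but the HTTPS port and `/healthz` path are placeholders: substitute the real Chainfire/FlareDB/IAM health endpoints documented in the T032 RUNBOOK.

```shell
#!/bin/sh
# Hedged sketch of an S6 smoke test; port 8443 and /healthz are placeholders
# for the per-service endpoints in the T032 RUNBOOK.
set -u
NODES="192.168.100.11 192.168.100.12 192.168.100.13"
PORT=8443
total=0; ok=0
for n in $NODES; do
  total=$((total + 1))
  if curl -fsS --max-time 5 --cacert ca.crt "https://${n}:${PORT}/healthz" >/dev/null 2>&1; then
    ok=$((ok + 1)); echo "OK   ${n}"
  else
    echo "FAIL ${n}"
  fi
done
echo "health: ${ok}/${total} nodes OK"
```

A 3/3 result on each service's real endpoint, plus a leader visible in both Raft clusters, would satisfy the Chainfire/FlareDB/IAM acceptance lines.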