id: T036
name: VM Cluster Deployment (T032 Validation)
goal: Deploy a 3-node PlasmaCloud cluster in a VM environment using the T032 bare-metal provisioning tools, validating the end-to-end provisioning flow before physical deployment.
status: active
priority: P0
owner: peerA
created: 2025-12-11
depends_on: [T032, T035]
blocks: []

context: |
  PROJECT.md Principal: "To Peer A: **you may decide the strategy yourself!** Do as you like!"

  Strategic Decision: Pursue VM-based testing cluster (Option A from deployment readiness assessment)
  to validate T032 tools end-to-end before committing to physical infrastructure.

  T032 delivered: PXE boot infra, NixOS image builder, first-boot automation, documentation (17,201L)
  T035 validated: Single-VM build integration (10/10 services, dev builds)

  This task validates: Multi-node cluster deployment, PXE boot flow, nixos-anywhere,
  Raft cluster formation, first-boot automation, and operational procedures.

acceptance:
  - 3 VMs deployed with libvirt/KVM
  - Virtual network configured for PXE boot
  - PXE server running and serving netboot images
  - All 3 nodes provisioned via nixos-anywhere
  - Chainfire + FlareDB Raft clusters formed (3-node quorum)
  - IAM service operational on all control-plane nodes
  - Health checks passing on all services
  - T032 RUNBOOK validated end-to-end

steps:
  - step: S1
    name: VM Infrastructure Setup
    done: 3 VMs created with QEMU, multicast socket network configured, launch scripts ready
    status: complete
    owner: peerA
    priority: P0
    progress: |
      **COMPLETED**: VM infrastructure operational, pivoted to ISO boot approach

      Completed:
      - ✅ Created VM working directory: /home/centra/cloud/baremetal/vm-cluster
      - ✅ Created disk images: node01/02/03.qcow2 (100GB each)
      - ✅ Wrote launch scripts: launch-node{01,02,03}.sh
      - ✅ Configured QEMU multicast socket networking (230.0.0.1:1234)
      - ✅ VM specs: 8 vCPU, 16GB RAM per node
      - ✅ MACs assigned: 52:54:00:00:01:{01,02,03} (nodes)
      - ✅ Netboot artifacts built successfully (bzImage 14MB, initrd 484MB, ZFS disabled)
      - ✅ **PIVOT DECISION**: ISO boot approach (QEMU 10.1.2 initrd compatibility bug)
      - ✅ Downloaded NixOS 25.11 minimal ISO (1.6GB)
      - ✅ Node01 booting from ISO, multicast network configured

    notes: |
      **Topology Change:** Abandoned libvirt bridges (required root). Using QEMU directly with:
      - Multicast socket networking (no root required): `-netdev socket,mcast=230.0.0.1:1234`
      - 3 node VMs (pxe-server dropped due to ISO pivot)
      - All VMs share L2 segment via multicast

      **PIVOT JUSTIFICATION (MID: cccc-1765406017-b04a6e):**
      - Netboot artifacts validated ✓ (build process, kernel-6.18 ZFS fix)
      - QEMU 10.1.2 initrd bug blocks PXE testing (environmental, not T032 issue)
      - ISO + nixos-anywhere validates core T032 provisioning capability
      - PXE boot protocol deferred for bare-metal validation

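      **Launch sketch (illustrative, not the actual launch-node0X.sh):** The topology above can be
      sketched as a dry-run script that only prints one QEMU command line per node. It assumes
      disk images named node0N.qcow2 in the current directory, a virtio NIC, and `-enable-kvm`;
      none of these choices are confirmed by the real launch scripts, and the helper names
      `node_mac` and `qemu_cmd` are hypothetical.

      ```shell
      #!/bin/sh
      # Dry-run sketch: prints one QEMU command line per node; nothing is launched.
      set -eu

      # MAC for node N, following the 52:54:00:00:01:{01,02,03} scheme from S1.
      node_mac() {
        printf '52:54:00:00:01:%02d' "$1"
      }

      # Command line for node N: 8 vCPU, 16GB RAM, shared multicast L2 segment.
      # Disk naming, virtio devices and -enable-kvm are assumptions.
      qemu_cmd() {
        name=$(printf 'node%02d' "$1")
        printf '%s ' qemu-system-x86_64 -enable-kvm -smp 8 -m 16G \
          -drive "file=${name}.qcow2,if=virtio,format=qcow2" \
          -netdev "socket,id=net0,mcast=230.0.0.1:1234" \
          -device "virtio-net-pci,netdev=net0,mac=$(node_mac "$1")" \
          -nographic
        printf '\n'
      }

      for n in 1 2 3; do
        qemu_cmd "$n"
      done
      ```

      Printing instead of executing keeps the sketch safe to run on a host without QEMU installed.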
  - step: S2
    name: Network Access Configuration
    done: Node VMs configured with SSH access for nixos-anywhere (netboot key auth)
    status: complete
    owner: peerB
    priority: P0
    progress: |
      **COMPLETED**: Custom netboot with SSH key auth bypasses VNC/telnet entirely

      Completed (2025-12-11):
      - ✅ Updated nix/images/netboot-base.nix with real SSH key (centra@cn-nixos-think)
      - ✅ Added netboot-base to flake.nix nixosConfigurations
      - ✅ Built netboot artifacts (kernel 14MB, initrd 484MB)
      - ✅ Created launch-node01-netboot.sh (QEMU -kernel/-initrd direct boot)
      - ✅ Fixed init path in kernel append parameter
      - ✅ SSH access verified (port 2201, key auth, zero manual interaction)

      Evidence:
      ```
      ssh -p 2201 root@localhost -> SUCCESS: nixos at Thu Dec 11 12:48:13 AM UTC 2025
      ```

      **PIVOT DECISION (2025-12-11, MID: cccc-1765413547-285e0f):**
      - PeerA directive: Build custom netboot with SSH key baked in
      - Eliminates VNC/telnet/password setup entirely
      - Netboot approach superior to ISO for automated provisioning
    notes: |
      **Solution Evolution:**
      - Initial: VNC (Option C) - requires user
      - Investigation: Alpine/telnet (Options A/B) - tooling gap/fragile
      - Final: Custom netboot with SSH key (PeerA strategy) - ZERO manual steps

      Files created:
      - baremetal/vm-cluster/launch-node01-netboot.sh (direct kernel/initrd boot)
      - baremetal/vm-cluster/netboot-{kernel,initrd}/ (nix build outputs)

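      **SSH probe sketch (illustrative):** The access check recorded in the evidence above could be
      repeated non-interactively along these lines. The sketch only prints the command. It assumes
      that the host port (2201 here) is forwarded to the guest's port 22, which this file does not
      show, and that skipping host-key checks is acceptable for a throwaway netboot environment.
      `ssh_probe_cmd` is a hypothetical helper name.

      ```shell
      #!/bin/sh
      # Dry-run sketch: prints the SSH probe for one forwarded port; nothing is executed.
      set -eu

      ssh_probe_cmd() {
        port="$1"
        # BatchMode=yes makes ssh fail instead of prompting, so a missing key is detected.
        # Host-key checks are disabled on the assumption that the netboot image is ephemeral.
        printf '%s ' ssh -p "$port" \
          -o BatchMode=yes \
          -o ConnectTimeout=5 \
          -o StrictHostKeyChecking=no \
          -o UserKnownHostsFile=/dev/null \
          root@localhost hostname
        printf '\n'
      }

      ssh_probe_cmd 2201
      ```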
  - step: S3
    name: TLS Certificate Generation
    done: CA and per-node certificates generated, ready for deployment
    status: complete
    owner: peerA
    priority: P0
    progress: |
      **COMPLETED**: TLS certificates generated and deployed to node config directories

      Completed:
      - ✅ Generated CA certificate and key
      - ✅ Generated node01.crt/.key (192.168.100.11)
      - ✅ Generated node02.crt/.key (192.168.100.12)
      - ✅ Generated node03.crt/.key (192.168.100.13)
      - ✅ Copied to docs/por/T036-vm-cluster-deployment/node*/secrets/
      - ✅ Permissions set (ca.crt/node*.crt: 644, node*.key: 400)
      - ✅ **CRITICAL FIX (2025-12-11):** Renamed certs to match cluster-config.json expectations
        - ca-cert.pem → ca.crt, cert.pem → node0X.crt, key.pem → node0X.key (all 3 nodes)
        - Prevented first-boot automation failure (services couldn't load TLS certs)

    notes: |
      Certificates ready for nixos-anywhere deployment (will be placed at /etc/nixos/secrets/)
      **Critical naming fix applied:** Certs renamed to match cluster-config.json paths

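      **Rename and permission sketch (illustrative):** The naming fix and permission scheme above
      can be reproduced on empty placeholder files in a temporary directory. No real certificates
      are generated here, and the temporary directory merely stands in for a node's secrets/
      directory.

      ```shell
      #!/bin/sh
      # Sketch: apply the S3 rename and permission scheme to placeholder files.
      set -eu

      secrets=$(mktemp -d)              # stand-in for node01/secrets/
      trap 'rm -rf "$secrets"' EXIT     # clean up the placeholder directory on exit

      # Empty placeholders under the original names; real files come from the CA step.
      touch "$secrets/ca-cert.pem" "$secrets/cert.pem" "$secrets/key.pem"

      # Rename to the names that cluster-config.json expects.
      mv "$secrets/ca-cert.pem" "$secrets/ca.crt"
      mv "$secrets/cert.pem" "$secrets/node01.crt"
      mv "$secrets/key.pem" "$secrets/node01.key"

      chmod 644 "$secrets/ca.crt" "$secrets/node01.crt"   # certificates: world-readable
      chmod 400 "$secrets/node01.key"                     # private key: owner read-only

      # List files whose mode is exactly 400; only the private key should appear.
      find "$secrets" -type f -perm 400
      ```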
  - step: S4
    name: Node Configuration Preparation
    done: configuration.nix, disko.nix, cluster-config.json ready for all 3 nodes
    status: complete
    owner: peerB
    priority: P0
    progress: |
      **COMPLETED**: All node configurations created and validated

      Deliverables (13 files, ~600 LOC):
      - ✅ node01/configuration.nix (112L) - NixOS system config, control-plane services
      - ✅ node01/disko.nix (62L) - Disk partitioning (EFI + LVM)
      - ✅ node01/secrets/cluster-config.json (28L) - Raft bootstrap config
      - ✅ node01/secrets/README.md - TLS documentation
      - ✅ node02/* (same structure, IP: 192.168.100.12)
      - ✅ node03/* (same structure, IP: 192.168.100.13)
      - ✅ DEPLOYMENT.md (335L) - Comprehensive deployment guide

      Configuration highlights:
      - All 9 control-plane services enabled per node
      - Bootstrap mode: `bootstrap: true` on all 3 nodes (simultaneous initialization)
      - Network: Static IPs 192.168.100.11/12/13
      - Disk: Single-disk LVM (512MB EFI + 80GB root + 19.5GB data)
      - First-boot automation: Enabled with cluster-config.json
      - **CRITICAL FIX (2025-12-11):** Added networking.hosts to all 3 nodes (configuration.nix:14-19)
        - Maps node01/02/03 hostnames to 192.168.100.11/12/13
        - Prevented Raft bootstrap failure (cluster-config.json uses hostnames, DNS unavailable)

    notes: |
      Node configurations ready for nixos-anywhere provisioning (S5)
      TLS certificates from S3 already in secrets/ directories
      **Critical fixes applied:** TLS cert naming (S3), hostname resolution (/etc/hosts)

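      **Hostname mapping sketch (illustrative):** The networking.hosts fix ends up as /etc/hosts
      entries on each node. The mapping it needs to cover can be printed as follows, which is handy
      for comparing against a provisioned node's /etc/hosts. `hosts_line` is a hypothetical helper,
      and the real configuration.nix content is not reproduced here.

      ```shell
      #!/bin/sh
      # Sketch: print the hosts entries that the hostname-resolution fix must cover.
      set -eu

      # Node N maps to 192.168.100.(10+N) and hostname nodeNN.
      hosts_line() {
        printf '192.168.100.%d node%02d\n' "$((10 + $1))" "$1"
      }

      for n in 1 2 3; do
        hosts_line "$n"
      done
      ```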
  - step: S5
    name: Cluster Provisioning
    done: All 3 nodes provisioned via nixos-anywhere, first-boot automation completed
    status: in_progress
    owner: peerB
    priority: P0
    progress: |
      **BLOCKED**: nixos-anywhere flake path resolution errors (nix store vs git working tree)

      Completed:
      - ✅ All 3 VMs launched with custom netboot (SSH ports 2201/2202/2203, key auth)
      - ✅ SSH access verified on all nodes (zero manual interaction)
      - ✅ Node configurations staged in git (node0{1,2,3}/configuration.nix + disko.nix + secrets/)
      - ✅ nix/modules staged (first-boot-automation, k8shost, metricstor, observability)
      - ✅ Launch scripts created: launch-node0{1,2,3}-netboot.sh

      Blocked:
      - ❌ nixos-anywhere failing with path resolution errors
      - ❌ Error: `/nix/store/.../docs/nix/modules/default.nix does not exist`
      - ❌ Root cause: Git tree dirty + files not in nix store
      - ❌ 3 attempts made, each failing on a different missing path

      Next (awaiting PeerA decision):
      - Option A: Continue debug (may need git commit or --impure flag)
      - Option B: Alternative provisioning (direct configuration.nix)
      - Option C: Hand off to PeerA

      Earlier status (recorded before S2 provided SSH access; superseded by the entries above):
      - Analyzed telnet serial console automation viability
      - Presented 3 options: Alpine automation (A), NixOS+telnet (B), VNC (C)

      Blocked:
      - ❌ SSH access unavailable (connection refused to 192.168.100.11)
      - ❌ S2 dependency: VNC network configuration or telnet console bypass required

      Next steps (when unblocked):
      - [ ] Choose unblock strategy: VNC (C), NixOS+telnet (B), or Alpine (A)
      - [ ] Run nixos-anywhere for node01/02/03
      - [ ] Monitor first-boot automation logs
      - [ ] Verify cluster formation (Chainfire, FlareDB Raft)

    notes: |
      **Unblock Options (peerB investigation 2025-12-11):**
      - Option A: Alpine virt ISO + telnet automation (viable but fragile)
      - Option B: NixOS + manual telnet console (recommended: simple, reliable)
      - Option C: Original VNC approach (lowest risk, requires user)

      ISO boot approach (not PXE):
      - Boot VMs from NixOS/Alpine ISO
      - Configure SSH via VNC or telnet serial console
      - Execute nixos-anywhere with node configurations from S4
      - First-boot automation will handle cluster initialization

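      **Provisioning dry-run sketch (illustrative):** The per-node invocation could look roughly
      like the commands printed below. The flake attribute names (`.#node01` and so on) are
      hypothetical because the flake outputs are not shown in this file, and the `-p` SSH port
      option should be checked against the installed nixos-anywhere version. The first printed
      command lists untracked files, on the understanding that a flake evaluated from a git tree
      only sees tracked files; whether that is the actual cause of the errors above is not
      confirmed.

      ```shell
      #!/bin/sh
      # Dry-run sketch: prints the provisioning commands; nothing is executed.
      set -eu

      # Command for node N. Flake attribute names are assumptions, not taken from flake.nix.
      provision_cmd() {
        name=$(printf 'node%02d' "$1")
        port=$((2200 + $1))   # SSH ports 2201/2202/2203 listed in the S5 progress
        printf '%s ' nixos-anywhere --flake ".#${name}" -p "$port" root@localhost
        printf '\n'
      }

      # Pre-flight check to run by hand: files listed here are untracked by git.
      printf '%s\n' 'git ls-files --others --exclude-standard'

      for n in 1 2 3; do
        provision_cmd "$n"
      done
      ```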
  - step: S6
    name: Cluster Validation
    done: All acceptance criteria met, cluster operational, RUNBOOK validated
    status: pending
    owner: peerA
    priority: P0
    notes: |
      Validate cluster per T032 QUICKSTART:
      - Chainfire cluster: 3 members, leader elected, health OK
      - FlareDB cluster: 3 members, quorum formed, health OK
      - IAM service: all nodes responding
      - CRUD operations: write/read/delete working
      - Data persistence: verify across restarts
      - Metrics: Prometheus endpoints responding

evidence: []
notes: |
  **Strategic Rationale:**
  - VM deployment validates T032 tools without hardware dependency
  - Fastest feedback loop (~3-4 hours total)
  - After success, physical bare-metal deployment has a validated blueprint
  - Failure discovery in VMs is cheaper than on physical hardware

  **Timeline Estimate:**
  - S1 VM Infrastructure: 30 min
  - S2 PXE Server: 30 min
  - S3 TLS Certs: 15 min
  - S4 Node Configs: 30 min
  - S5 Provisioning: 60 min
  - S6 Validation: 30 min
  - Total: ~3.5 hours

  **Success Criteria:**
  - All 6 steps complete
  - 3-node Raft cluster operational
  - T032 RUNBOOK procedures validated
  - Ready for physical bare-metal deployment