photoncloud-monorepo/docs/por/T036-vm-cluster-deployment/task.yaml
centra 5c6eb04a46 T036: Add VM cluster deployment configs for nixos-anywhere
- netboot-base.nix with SSH key auth
- Launch scripts for node01/02/03
- Node configuration.nix and disko.nix
- Nix modules for first-boot automation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-11 09:59:19 +09:00


id: T036
name: VM Cluster Deployment (T032 Validation)
goal: Deploy a 3-node PlasmaCloud cluster in a VM environment using the T032 bare-metal provisioning tools, validating the end-to-end provisioning flow before physical deployment.
status: active
priority: P0
owner: peerA
created: 2025-12-11
depends_on: [T032, T035]
blocks: []
context: |
PROJECT.md Principal: "To Peer A: **you may decide the strategy yourself**! Do as you like!"
Strategic Decision: Pursue VM-based testing cluster (Option A from deployment readiness assessment)
to validate T032 tools end-to-end before committing to physical infrastructure.
T032 delivered: PXE boot infra, NixOS image builder, first-boot automation, documentation (17,201L)
T035 validated: Single-VM build integration (10/10 services, dev builds)
This task validates: Multi-node cluster deployment, PXE boot flow, nixos-anywhere,
Raft cluster formation, first-boot automation, and operational procedures.
acceptance:
- 3 VMs deployed with libvirt/KVM
- Virtual network configured for PXE boot
- PXE server running and serving netboot images
- All 3 nodes provisioned via nixos-anywhere
- Chainfire + FlareDB Raft clusters formed (3-node quorum)
- IAM service operational on all control-plane nodes
- Health checks passing on all services
- T032 RUNBOOK validated end-to-end
steps:
- step: S1
name: VM Infrastructure Setup
done: 3 VMs created with QEMU, multicast socket network configured, launch scripts ready
status: complete
owner: peerA
priority: P0
progress: |
**COMPLETED** — VM infrastructure operational, pivoted to ISO boot approach
Completed:
- ✅ Created VM working directory: /home/centra/cloud/baremetal/vm-cluster
- ✅ Created disk images: node01/02/03.qcow2 (100GB each)
- ✅ Wrote launch scripts: launch-node{01,02,03}.sh
- ✅ Configured QEMU multicast socket networking (230.0.0.1:1234)
- ✅ VM specs: 8 vCPU, 16GB RAM per node
- ✅ MACs assigned: 52:54:00:00:01:{01,02,03} (nodes)
- ✅ Netboot artifacts built successfully (bzImage 14MB, initrd 484MB, ZFS disabled)
- ✅ **PIVOT DECISION**: ISO boot approach (QEMU 10.1.2 initrd compatibility bug)
- ✅ Downloaded NixOS 25.11 minimal ISO (1.6GB)
- ✅ Node01 booting from ISO, multicast network configured
notes: |
**Topology Change:** Abandoned libvirt bridges (required root). Using QEMU directly with:
- Multicast socket networking (no root required): `-netdev socket,mcast=230.0.0.1:1234`
- 3 node VMs (pxe-server dropped due to ISO pivot)
- All VMs share L2 segment via multicast
**PIVOT JUSTIFICATION (MID: cccc-1765406017-b04a6e):**
- Netboot artifacts validated ✓ (build process, kernel-6.18 ZFS fix)
- QEMU 10.1.2 initrd bug blocks PXE testing (environmental, not T032 issue)
- ISO + nixos-anywhere validates core T032 provisioning capability
- PXE boot protocol deferred for bare-metal validation
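The topology above can be sketched as a single QEMU command line per node. This is a minimal sketch, not the contents of the actual launch-node0N.sh scripts: the vCPU count, RAM, MAC scheme, and multicast address come from S1, while the helper name `qemu_cmd`, the disk path, the virtio device models, `-enable-kvm`, and the user-mode `hostfwd` NIC for SSH are assumptions.

```shell
# Hypothetical helper: prints (does not run) a QEMU command line for node N (1-3).
# From S1: 8 vCPU, 16GB RAM, MAC 52:54:00:00:01:0N, multicast socket 230.0.0.1:1234.
# Assumed: disk file name, virtio devices, -enable-kvm, and a second user-mode NIC
# that forwards host port 220N to guest port 22.
qemu_cmd() (
  n="$1"
  printf '%s ' \
    qemu-system-x86_64 -enable-kvm -smp 8 -m 16G \
    -drive "file=node0${n}.qcow2,format=qcow2,if=virtio" \
    -netdev "socket,id=cluster,mcast=230.0.0.1:1234" \
    -device "virtio-net-pci,netdev=cluster,mac=52:54:00:00:01:0${n}" \
    -netdev "user,id=mgmt,hostfwd=tcp::220${n}-:22" \
    -device "virtio-net-pci,netdev=mgmt" \
    -nographic
  printf '\n'
)
```

Running `qemu_cmd 1` prints the line for node01. Because every VM joins the same multicast group, all three share one L2 segment without root privileges.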
- step: S2
name: Network Access Configuration
done: Node VMs configured with SSH access for nixos-anywhere (netboot key auth)
status: complete
owner: peerB
priority: P0
progress: |
**COMPLETED** — Custom netboot with SSH key auth bypasses VNC/telnet entirely
Completed (2025-12-11):
- ✅ Updated nix/images/netboot-base.nix with real SSH key (centra@cn-nixos-think)
- ✅ Added netboot-base to flake.nix nixosConfigurations
- ✅ Built netboot artifacts (kernel 14MB, initrd 484MB)
- ✅ Created launch-node01-netboot.sh (QEMU -kernel/-initrd direct boot)
- ✅ Fixed init path in kernel append parameter
- ✅ SSH access verified (port 2201, key auth, zero manual interaction)
Evidence:
```
ssh -p 2201 root@localhost -> SUCCESS: nixos at Thu Dec 11 12:48:13 AM UTC 2025
```
**PIVOT DECISION (2025-12-11, MID: cccc-1765413547-285e0f):**
- PeerA directive: Build custom netboot with SSH key baked in
- Eliminates VNC/telnet/password setup entirely
- Netboot approach superior to ISO for automated provisioning
notes: |
**Solution Evolution:**
- Initial: VNC (Option C) - requires user
- Investigation: Alpine/telnet (Options A/B) - tooling gap/fragile
- Final: Custom netboot with SSH key (PeerA strategy) - ZERO manual steps
Files created:
- baremetal/vm-cluster/launch-node01-netboot.sh (direct kernel/initrd boot)
- baremetal/vm-cluster/netboot-{kernel,initrd}/ (nix build outputs)
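The "zero manual interaction" check can be repeated for every node with a non-interactive SSH probe. The sketch below only prints the command so it can be reviewed first. The helper name `ssh_check_cmd` is hypothetical, and the port scheme 220N is taken from the ports listed in this task (2201 here, 2202/2203 in S5).

```shell
# Hypothetical helper: prints an SSH probe for node N (1-3).
# BatchMode=yes makes ssh fail instead of prompting for a password,
# so success implies the baked-in key was accepted.
ssh_check_cmd() (
  n="$1"
  printf 'ssh -p 220%s -o BatchMode=yes -o ConnectTimeout=5 root@localhost true\n' "$n"
)
```

For example, `ssh_check_cmd 1 | sh` runs the probe against node01 and exits non-zero if key authentication fails.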
- step: S3
name: TLS Certificate Generation
done: CA and per-node certificates generated, ready for deployment
status: complete
owner: peerA
priority: P0
progress: |
**COMPLETED** — TLS certificates generated and deployed to node config directories
Completed:
- ✅ Generated CA certificate and key
- ✅ Generated node01.crt/.key (192.168.100.11)
- ✅ Generated node02.crt/.key (192.168.100.12)
- ✅ Generated node03.crt/.key (192.168.100.13)
- ✅ Copied to docs/por/T036-vm-cluster-deployment/node*/secrets/
- ✅ Permissions set (ca.crt/node*.crt: 644, node*.key: 400)
- ✅ **CRITICAL FIX (2025-12-11):** Renamed certs to match cluster-config.json expectations
- ca-cert.pem → ca.crt, cert.pem → node0X.crt, key.pem → node0X.key (all 3 nodes)
- Prevented first-boot automation failure (services couldn't load TLS certs)
notes: |
Certificates ready for nixos-anywhere deployment (will be placed at /etc/nixos/secrets/)
**Critical naming fix applied:** Certs renamed to match cluster-config.json paths
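The naming fix and permission settings above can be expressed as one repeatable step per node. This is a minimal sketch under the assumption that each secrets directory still holds the generic names (ca-cert.pem, cert.pem, key.pem). The helper name `normalize_certs` is hypothetical and is not part of the T036 tooling.

```shell
# Hypothetical helper: rename generic cert files to the names cluster-config.json
# expects (per S3) and apply the S3 permissions.
# Usage: normalize_certs DIR NODE   e.g. normalize_certs node01/secrets node01
normalize_certs() (
  dir="$1"; node="$2"
  cd "$dir" || exit 1
  # Rename only files that still carry the old generic names.
  if [ -f ca-cert.pem ]; then mv ca-cert.pem ca.crt; fi
  if [ -f cert.pem ]; then mv cert.pem "${node}.crt"; fi
  if [ -f key.pem ]; then mv key.pem "${node}.key"; fi
  # Certificates world-readable, private key readable by owner only.
  chmod 644 ca.crt "${node}.crt" && chmod 400 "${node}.key"
)
```

The function is safe to re-run: files that are already renamed are left in place and only their permissions are reapplied.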
- step: S4
name: Node Configuration Preparation
done: configuration.nix, disko.nix, cluster-config.json ready for all 3 nodes
status: complete
owner: peerB
priority: P0
progress: |
**COMPLETED** — All node configurations created and validated
Deliverables (13 files, ~600 LOC):
- ✅ node01/configuration.nix (112L) - NixOS system config, control-plane services
- ✅ node01/disko.nix (62L) - Disk partitioning (EFI + LVM)
- ✅ node01/secrets/cluster-config.json (28L) - Raft bootstrap config
- ✅ node01/secrets/README.md - TLS documentation
- ✅ node02/* (same structure, IP: 192.168.100.12)
- ✅ node03/* (same structure, IP: 192.168.100.13)
- ✅ DEPLOYMENT.md (335L) - Comprehensive deployment guide
Configuration highlights:
- All 9 control-plane services enabled per node
- Bootstrap mode: `bootstrap: true` on all 3 nodes (simultaneous initialization)
- Network: Static IPs 192.168.100.11/12/13
- Disk: Single-disk LVM (512MB EFI + 80GB root + 19.5GB data)
- First-boot automation: Enabled with cluster-config.json
- **CRITICAL FIX (2025-12-11):** Added networking.hosts to all 3 nodes (configuration.nix:14-19)
- Maps node01/02/03 hostnames to 192.168.100.11/12/13
- Prevented Raft bootstrap failure (cluster-config.json uses hostnames, DNS unavailable)
notes: |
Node configurations ready for nixos-anywhere provisioning (S5)
TLS certificates from S3 already in secrets/ directories
**Critical fixes applied:** TLS cert naming (S3), hostname resolution (/etc/hosts)
- step: S5
name: Cluster Provisioning
done: All 3 nodes provisioned via nixos-anywhere, first-boot automation completed
status: in_progress
owner: peerB
priority: P0
progress: |
**BLOCKED** — nixos-anywhere flake path resolution errors (nix store vs git working tree)
Completed:
- ✅ All 3 VMs launched with custom netboot (SSH ports 2201/2202/2203, key auth)
- ✅ SSH access verified on all nodes (zero manual interaction)
- ✅ Node configurations staged in git (node0{1,2,3}/configuration.nix + disko.nix + secrets/)
- ✅ nix/modules staged (first-boot-automation, k8shost, metricstor, observability)
- ✅ Launch scripts created: launch-node0{1,2,3}-netboot.sh
Blocked:
- ❌ nixos-anywhere failing with path resolution errors
- ❌ Error: `/nix/store/.../docs/nix/modules/default.nix does not exist`
- ❌ Root cause: Git tree dirty + files not in nix store
- ❌ 3 attempts made, each failing on different missing path
Next (awaiting PeerA decision):
- Option A: Continue debug (may need git commit or --impure flag)
- Option B: Alternative provisioning (direct configuration.nix)
- Option C: Hand off to PeerA
Earlier investigation (before the S2 netboot SSH key pivot):
- Analyzed telnet serial console automation viability
- Presented 3 options: Alpine automation (A), NixOS+telnet (B), VNC (C)
Earlier blockers (resolved by S2):
- ❌ SSH access unavailable (connection refused to 192.168.100.11)
- ❌ S2 dependency: VNC network configuration or telnet console bypass required
Next steps (when unblocked):
- [ ] Choose unblock strategy: VNC (C), NixOS+telnet (B), or Alpine (A)
- [ ] Run nixos-anywhere for node01/02/03
- [ ] Monitor first-boot automation logs
- [ ] Verify cluster formation (Chainfire, FlareDB Raft)
notes: |
**SSH Access Unblock Options (peerB investigation 2025-12-11, distinct from the flake-path Options A-C in progress):**
- Option A: Alpine virt ISO + telnet automation (viable but fragile)
- Option B: NixOS + manual telnet console (recommended: simple, reliable)
- Option C: Original VNC approach (lowest risk, requires user)
ISO boot approach (not PXE):
- Boot VMs from NixOS/Alpine ISO
- Configure SSH via VNC or telnet serial console
- Execute nixos-anywhere with node configurations from S4
- First-boot automation will handle cluster initialization
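For the flake path resolution blocker recorded in progress, one thing worth checking is where each relative import in a node configuration actually lands inside the repository. The reported path `/nix/store/.../docs/nix/modules/default.nix` would be consistent with an import that climbs one directory too few, but this is only a guess: the real import lines are not shown in this task. The sketch below assumes GNU `realpath` and a shell started at the repository root. The helper name `resolve_import` and the example import strings are hypothetical.

```shell
# Hypothetical helper: print where a relative import written inside FILE
# resolves to, relative to the current directory (assumed: repository root).
# Requires GNU coreutils realpath (-m allows paths that do not exist yet).
resolve_import() (
  file="$1"; rel="$2"
  realpath -m --relative-to="$PWD" "$(dirname "$file")/$rel"
)
```

With `docs/por/T036-vm-cluster-deployment/node01/configuration.nix` as FILE, `../../../nix/modules` resolves to `docs/nix/modules`, while `../../../../nix/modules` resolves to `nix/modules`. Separately, a flake evaluated from a git checkout only sees files git knows about, so `git ls-files --error-unmatch <path>` can confirm that each imported file is tracked or staged.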
- step: S6
name: Cluster Validation
done: All acceptance criteria met, cluster operational, RUNBOOK validated
status: pending
owner: peerA
priority: P0
notes: |
Validate cluster per T032 QUICKSTART:
- Chainfire cluster: 3 members, leader elected, health OK
- FlareDB cluster: 3 members, quorum formed, health OK
- IAM service: all nodes responding
- CRUD operations: write/read/delete working
- Data persistence: verify across restarts
- Metrics: Prometheus endpoints responding
evidence: []
notes: |
**Strategic Rationale:**
- VM deployment validates T032 tools without hardware dependency
- Fastest feedback loop (~3-4 hours total)
- After success, physical bare-metal deployment has validated blueprint
- Failure discovery in VMs is cheaper than on physical hardware
**Timeline Estimate:**
- S1 VM Infrastructure: 30 min
- S2 Network Access (originally planned as PXE Server): 30 min
- S3 TLS Certs: 15 min
- S4 Node Configs: 30 min
- S5 Provisioning: 60 min
- S6 Validation: 30 min
- Total: ~3.5 hours
**Success Criteria:**
- All 6 steps complete
- 3-node Raft cluster operational
- T032 RUNBOOK procedures validated
- Ready for physical bare-metal deployment