# T036 VM Cluster Deployment - Key Learnings

**Status:** Partial Success (Infrastructure Validated)
**Date:** 2025-12-11
**Duration:** ~5 hours
**Outcome:** Provisioning tools validated, service deployment deferred to T038

---

## Executive Summary

T036 successfully validated VM infrastructure, networking automation, and provisioning concepts for the T032 bare-metal deployment. The task demonstrated that the T032 tooling works correctly, with the build failures identified as orthogonal code maintenance issues (FlareDB API drift from T037).

**Key Achievement:** The VDE switch networking breakthrough proves multi-VM cluster viability on a single host.

---

## Technical Wins

### 1. VDE Switch Networking (Critical Breakthrough)

**Problem:** QEMU socket multicast networking is designed for cross-host VMs, not same-host L2 networking.

**Symptoms:**

- Static IPs configured successfully
- Ping failed: 100% packet loss
- ARP tables empty (no neighbor discovery)

**Solution:** VDE (Virtual Distributed Ethernet) switch

```bash
# Start VDE switch daemon
vde_switch -d -s /tmp/vde.sock -M /tmp/vde.mgmt

# QEMU launch with VDE
qemu-system-x86_64 \
  -netdev vde,id=vde0,sock=/tmp/vde.sock \
  -device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:01
```

**Evidence:**

- node01→node02: 0% packet loss, ~0.7ms latency
- node02→node03: 0% packet loss (after ARP delay)
- Full mesh L2 connectivity verified across 3 VMs

**Impact:** Enables a true L2 broadcast domain for Raft cluster testing on a single host.

---

### 2. Custom Netboot with SSH Key (Zero-Touch Provisioning)

**Problem:** VMs required manual network configuration via a VNC or telnet console.

**Solution:** Bake the SSH public key into the netboot image

```nix
# nix/images/netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
];
```

**Build & Launch:**

```bash
# Build custom netboot
nix build .#netboot-base

# Direct kernel/initrd boot with QEMU
qemu-system-x86_64 \
  -kernel netboot-kernel/bzImage \
  -initrd netboot-initrd/initrd \
  -append "init=/nix/store/.../init console=ttyS0,115200"
```

**Result:** SSH access is available immediately on boot (ports 2201/2202/2203), with zero manual steps.

**Impact:** Eliminates VNC/telnet/password requirements entirely for automation.

---

### 3. Disk Automation (Manual but Repeatable)

**Approach:** Direct SSH provisioning with a disk setup script

```bash
# Partition disk
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on

# Format and mount
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot
```

**Result:** All 3 VMs ready for NixOS install with a consistent disk layout.

**Impact:** Validates the T032 disk automation concepts; ready for final service deployment.

---

## Strategic Insights

### 1. MVP Validation Path Should Be Simplest First

**Observation:** 4+ hours were spent on tooling (nixos-anywhere, disko, flake integration) before the build drift was discovered.

**Cascade Pattern:**

1. nixos-anywhere attempt (~3h): git tree → path resolution → disko → package resolution
2. Networking pivot (~1h): multicast failure → VDE switch success ✅
3. Manual provisioning (P2): disk setup ✅ → build failures (code drift)

**Learning:** Start with P2 (manual binary deployment) for initial validation; automate after success.

**T032 Application:** Bare-metal should use the simpler provisioning path initially and add automation incrementally.

---

### 2. nixos-anywhere + Hybrid Flake Has Integration Complexity

**Challenges Encountered:**

1. **Dirty git tree:** Staged files are not in the nix store (requires a commit)
2. **Path resolution:** Relative imports fail in flake context (must be exact)
3.
   **Disko module:** Must be in the flake inputs AND the `nixosSystem` modules
4. **Package resolution:** The `nixosSystem` context lacks access to workspace packages (overlay not applied)

**Root Cause:** Flake evaluation purity conflicts with the development workflow.

**Learning:** Flake-based nixos-anywhere requires a clean git tree, exact paths, and the full dependency graph in `flake.nix`.

**T032 Application:** Consider a non-flake nixos-anywhere path for bare-metal, or maintain a separate deployment flake.

---

### 3. Code Drift Detection Needs Integration Testing

**Issue:** T037 SQL layer API changes broke `flaredb-server` without detection.

**Symptoms:**

```rust
error[E0599]: no method named `rows` found for struct `flaredb_sql::QueryResult`
error[E0560]: struct `ErrorResult` has no field named `message`
```

**Root Cause:** Workspace crates were updated independently, without cross-crate testing.

**Learning:** Integration tests across workspace dependencies are needed to catch API drift early.

**Action:** T038 created to fix the drift and establish integration testing.

---

## Execution Timeline

**Total:** ~5 hours
**Outcome:** Infrastructure validated, build drift identified

| Phase | Duration | Result |
|-------|----------|--------|
| S1: VM Infrastructure | 30 min | ✅ 3 VMs + netboot |
| S2: SSH Access (Custom Netboot) | 1h | ✅ Zero-touch SSH |
| S3: TLS Certificates | 15 min | ✅ Certs deployed |
| S4: Node Configurations | 30 min | ✅ Configs ready |
| S5: Provisioning Attempts | 3h+ | ⚠️ Infrastructure validated, builds blocked |
| - nixos-anywhere debugging | ~3h | ⚠️ Flake complexity |
| - Networking pivot (VDE) | ~1h | ✅ L2 breakthrough |
| - Disk setup (manual) | 30 min | ✅ All nodes ready |
| S6: Cluster Validation | Deferred | ⏸️ Blocked on T038 |

---

## Recommendations for T032 Bare-Metal

### 1. Networking

- **Use a VDE switch equivalent** (likely not needed for bare-metal with real switches)
- **For VM testing:** VDE is the correct approach for multi-VM on a single host
- **For bare-metal:** Standard L2 switches provide the broadcast domain

### 2. Provisioning

- **Option A (Simple):** Manual binary deployment + systemd units (like the P2 approach)
  - Pros: Fast, debuggable, no flake complexity
  - Cons: Less automated
- **Option B (Automated):** nixos-anywhere with a simplified non-flake config
  - Pros: Fully automated, reproducible
  - Cons: Requires debugging time, flake purity issues

**Recommendation:** Start with Option A for the initial deployment; migrate to Option B after validation.

### 3. Build System

- **Fix T038 first:** Ensure all builds work before bare-metal deployment
- **Test in nix-shell:** Verify the cargo build environment before `nix build`
- **Integration tests:** Add cross-workspace crate testing to CI/CD

### 4. Custom Netboot

- **Keep the SSH key approach:** Eliminates manual console access
- **Validate on bare-metal:** Test the PXE boot flow with the SSH key in the netboot image
- **Fallback plan:** Keep VNC/IPMI access available for debugging

---

## Technical Debt

### Immediate (T038)

- [ ] Fix FlareDB API drift from T037
- [ ] Verify nix-shell cargo build environment
- [ ] Build all 3 service binaries successfully
- [ ] Deploy to T036 VMs and complete S6 validation

### Future (T039+)

- [ ] Add integration tests across workspace crates
- [ ] Simplify nixos-anywhere flake integration
- [ ] Document development workflow (git, flakes, nix-shell)
- [ ] CI/CD for cross-crate API compatibility

---

## Conclusion

**T036 achieved its goal:** Validate T032 provisioning tools before bare-metal deployment.
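As a consolidated reference for resuming this work, the validated VDE bring-up can be captured in one script. This is a dry-run sketch that only prints the commands for review: the MAC addresses mirror the notes above, while the kernel/initrd paths and the second user-mode NIC carrying the SSH forwards (2201-2203) are assumptions about how the VMs were launched.

```bash
#!/usr/bin/env bash
# vde-cluster.sh - bring-up sketch for the 3-VM VDE cluster (dry-run).
# Prints the switch + VM commands instead of executing them.
# Kernel/initrd paths and the user-mode SSH NIC are illustrative assumptions.
set -euo pipefail

VDE_SOCK=/tmp/vde.sock

cluster_commands() {
  # One shared switch, then one VM per node with a unique MAC and SSH port.
  echo "vde_switch -d -s $VDE_SOCK -M /tmp/vde.mgmt"
  local i
  for i in 1 2 3; do
    echo "qemu-system-x86_64" \
      "-kernel netboot-kernel/bzImage -initrd netboot-initrd/initrd" \
      "-netdev vde,id=vde0,sock=$VDE_SOCK" \
      "-device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:0$i" \
      "-netdev user,id=ssh0,hostfwd=tcp::220$i-:22" \
      "-device virtio-net-pci,netdev=ssh0"
  done
}

cluster_commands
```

Piping the output to `sh` (after review) would start the switch and all three nodes; keeping it dry-run by default makes the launch plan auditable before anything runs.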
**Success Metrics:**

- ✅ VM infrastructure operational (3 nodes, VDE networking)
- ✅ Custom netboot with SSH key (zero-touch access)
- ✅ Disk automation validated (all nodes partitioned/mounted)
- ✅ TLS certificates deployed
- ✅ Network configuration validated (static IPs, hostname resolution)

**Blockers Identified:**

- ❌ FlareDB API drift (T037) - code maintenance, NOT a provisioning issue
- ❌ Cargo build environment - tooling configuration, NOT an infrastructure issue

**Risk Reduction for T032:**

- VDE breakthrough proves VM cluster viability
- Custom netboot validates automation concepts
- Disk setup process validated and documented
- Build drift identified before bare-metal investment

**Next Steps:**

1. Complete T038 (code drift cleanup)
2. Resume T036.S6 with working binaries (VMs still running, ready)
3. Assess T032 readiness (tooling validated, proceed with confidence)

**ROI:** Negative for cluster validation (4+ hours, no cluster), but positive for risk reduction (infrastructure proven, blockers identified early).
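When T036.S6 resumes, the manual ping checks behind the VDE evidence could be automated with a small helper. A dry-run sketch follows; the node IPs and the `root` SSH user are assumptions (the report does not record the static addresses), and the script only prints the commands so they can be reviewed first.

```bash
#!/usr/bin/env bash
# mesh-check.sh - full-mesh connectivity check for the 3-node cluster (dry-run).
# Prints one ssh+ping command per ordered node pair.
# The IPs below are illustrative; substitute the cluster's static addresses.
set -euo pipefail

NODES=(192.168.100.11 192.168.100.12 192.168.100.13)

mesh_commands() {
  local src dst
  for src in "${NODES[@]}"; do
    for dst in "${NODES[@]}"; do
      if [[ "$src" != "$dst" ]]; then
        # -c 3: three probes; -W 1: 1-second timeout so a dead link fails fast
        echo "ssh root@$src ping -c 3 -W 1 $dst"
      fi
    done
  done
}

mesh_commands
```

Executing the printed commands (e.g., piping to `sh` after review) exercises every src→dst pair; any non-zero exit identifies the failing link, which is the same evidence the VDE section above collected by hand.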