- Replace form_urlencoded with RFC 3986 compliant URI encoding - Implement aws_uri_encode() matching AWS SigV4 spec exactly - Unreserved chars (A-Z,a-z,0-9,-,_,.,~) not encoded - All other chars percent-encoded with uppercase hex - Preserve slashes in paths, encode in query params - Normalize empty paths to '/' per AWS spec - Fix test expectations (body hash, HMAC values) - Add comprehensive SigV4 signature determinism test This fixes the canonicalization mismatch that caused signature validation failures in T047. Auth can now be enabled for production. Refs: T058.S1
8.2 KiB
T036 VM Cluster Deployment - Key Learnings
Status: Partial Success (Infrastructure Validated) Date: 2025-12-11 Duration: ~5 hours Outcome: Provisioning tools validated, service deployment deferred to T038
Executive Summary
T036 successfully validated VM infrastructure, networking automation, and provisioning concepts for T032 bare-metal deployment. The task demonstrated that T032 tooling works correctly, with build failures identified as orthogonal code maintenance issues (FlareDB API drift from T037).
Key Achievement: VDE switch networking breakthrough proves multi-VM cluster viability on single host.
Technical Wins
1. VDE Switch Networking (Critical Breakthrough)
Problem: QEMU socket multicast designed for cross-host VMs, not same-host L2 networking.
Symptoms:
- Static IPs configured successfully
- Ping failed: 100% packet loss
- ARP tables empty (no neighbor discovery)
Solution: VDE (Virtual Distributed Ethernet) switch
# Start VDE switch daemon
vde_switch -d -s /tmp/vde.sock -M /tmp/vde.mgmt
# QEMU launch with VDE
qemu-system-x86_64 \
-netdev vde,id=vde0,sock=/tmp/vde.sock \
-device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:01
Evidence:
- node01→node02: 0% packet loss, ~0.7ms latency
- node02→node03: 0% packet loss (after ARP delay)
- Full mesh L2 connectivity verified across 3 VMs
Impact: Enables true L2 broadcast domain for Raft cluster testing on single host.
2. Custom Netboot with SSH Key (Zero-Touch Provisioning)
Problem: VMs required manual network configuration via VNC or telnet console.
Solution: Bake SSH public key into netboot image
# nix/images/netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
];
Build & Launch:
# Build custom netboot
nix build .#netboot-base
# Direct kernel/initrd boot with QEMU
qemu-system-x86_64 \
-kernel netboot-kernel/bzImage \
-initrd netboot-initrd/initrd \
-append "init=/nix/store/.../init console=ttyS0,115200"
Result: SSH access immediately available on boot (ports 2201/2202/2203), zero manual steps.
Impact: Eliminates VNC/telnet/password requirements entirely for automation.
3. Disk Automation (Manual but Repeatable)
Approach: Direct SSH provisioning with disk setup script
# Partition disk
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on
# Format and mount
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot
Result: All 3 VMs ready for NixOS install with consistent disk layout.
Impact: Validates T032 disk automation concepts, ready for final service deployment.
Strategic Insights
1. MVP Validation Path Should Be Simplest First
Observation: 4+ hours spent on tooling (nixos-anywhere, disko, flake integration) before discovering build drift.
Cascade Pattern:
- nixos-anywhere attempt (~3h): git tree → path resolution → disko → package resolution
- Networking pivot (~1h): multicast failure → VDE switch success ✅
- Manual provisioning (P2): disk setup ✅ → build failures (code drift)
Learning: Start with P2 (manual binary deployment) for initial validation, automate after success.
T032 Application: Bare-metal should use simpler provisioning path initially, add automation incrementally.
2. Nixos-anywhere + Hybrid Flake Has Integration Complexity
Challenges Encountered:
- Dirty git tree: Staged files not in nix store (requires commit)
- Path resolution: Relative imports fail in flake context (must be exact)
- Disko module: Must be in flake inputs AND nixosSystem modules
- Package resolution: nixosSystem context lacks access to workspace packages (overlay not applied)
Root Cause: Flake evaluation purity conflicts with development workflow.
Learning: Flake-based nixos-anywhere requires clean git, exact paths, and full dependency graph in flake.nix.
T032 Application: Consider non-flake nixos-anywhere path for bare-metal, or maintain separate deployment flake.
3. Code Drift Detection Needs Integration Testing
Issue: T037 SQL layer API changes broke flaredb-server without detection.
Symptoms:
error[E0599]: no method named `rows` found for struct `flaredb_sql::QueryResult`
error[E0560]: struct `ErrorResult` has no field named `message`
Root Cause: Workspace crates updated independently without cross-crate testing.
Learning: Need integration tests across workspace dependencies to catch API drift early.
Action: T038 created to fix drift + establish integration testing.
Execution Timeline
Total: ~5 hours Outcome: Infrastructure validated, build drift identified
| Phase | Duration | Result |
|---|---|---|
| S1: VM Infrastructure | 30 min | ✅ 3 VMs + netboot |
| S2: SSH Access (Custom Netboot) | 1h | ✅ Zero-touch SSH |
| S3: TLS Certificates | 15 min | ✅ Certs deployed |
| S4: Node Configurations | 30 min | ✅ Configs ready |
| S5: Provisioning Attempts | 3h+ | ⚠️ Infrastructure validated, builds blocked |
| - nixos-anywhere debugging | ~3h | ⚠️ Flake complexity |
| - Networking pivot (VDE) | ~1h | ✅ L2 breakthrough |
| - Disk setup (manual) | 30 min | ✅ All nodes ready |
| S6: Cluster Validation | Deferred | ⏸️ Blocked on T038 |
Recommendations for T032 Bare-Metal
1. Networking
- Use VDE switch equivalent (likely not needed for bare-metal with real switches)
- For VM testing: VDE is correct approach for multi-VM on single host
- For bare-metal: Standard L2 switches provide broadcast domain
2. Provisioning
- Option A (Simple): Manual binary deployment + systemd units (like P2 approach)
- Pros: Fast, debuggable, no flake complexity
- Cons: Less automated
- Option B (Automated): nixos-anywhere with simplified non-flake config
- Pros: Fully automated, reproducible
- Cons: Requires debugging time, flake purity issues
Recommendation: Start with Option A for initial deployment, migrate to Option B after validation.
3. Build System
- Fix T038 first: Ensure all builds work before bare-metal deployment
- Test in nix-shell: Verify cargo build environment before nix build
- Integration tests: Add cross-workspace crate testing to CI/CD
4. Custom Netboot
- Keep SSH key approach: Eliminates manual console access
- Validate on bare-metal: Test PXE boot flow with SSH key in netboot image
- Fallback plan: Keep VNC/IPMI access available for debugging
Technical Debt
Immediate (T038)
- Fix FlareDB API drift from T037
- Verify nix-shell cargo build environment
- Build all 3 service binaries successfully
- Deploy to T036 VMs and complete S6 validation
Future (T039+)
- Add integration tests across workspace crates
- Simplify nixos-anywhere flake integration
- Document development workflow (git, flakes, nix-shell)
- CI/CD for cross-crate API compatibility
Conclusion
T036 achieved its goal: Validate T032 provisioning tools before bare-metal deployment.
Success Metrics:
- ✅ VM infrastructure operational (3 nodes, VDE networking)
- ✅ Custom netboot with SSH key (zero-touch access)
- ✅ Disk automation validated (all nodes partitioned/mounted)
- ✅ TLS certificates deployed
- ✅ Network configuration validated (static IPs, hostname resolution)
Blockers Identified:
- ❌ FlareDB API drift (T037) - code maintenance, NOT provisioning issue
- ❌ Cargo build environment - tooling configuration, NOT infrastructure issue
Risk Reduction for T032:
- VDE breakthrough proves VM cluster viability
- Custom netboot validates automation concepts
- Disk setup process validated and documented
- Build drift identified before bare-metal investment
Next Steps:
- Complete T038 (code drift cleanup)
- Resume T036.S6 with working binaries (VMs still running, ready)
- Assess T032 readiness (tooling validated, proceed with confidence)
ROI: Negative for cluster validation (4+ hours, no cluster), but positive for risk reduction (infrastructure proven, blockers identified early).