# T036 VM Cluster Deployment - Key Learnings
**Status:** Partial Success (Infrastructure Validated)

**Date:** 2025-12-11

**Duration:** ~5 hours

**Outcome:** Provisioning tools validated, service deployment deferred to T038
---
## Executive Summary

T036 successfully validated VM infrastructure, networking automation, and provisioning concepts for T032 bare-metal deployment. The task demonstrated that T032 tooling works correctly, with the build failures identified as orthogonal code-maintenance issues (FlareDB API drift from T037).

**Key Achievement:** The VDE switch networking breakthrough proves multi-VM cluster viability on a single host.
---
## Technical Wins

### 1. VDE Switch Networking (Critical Breakthrough)

**Problem:** QEMU's socket multicast networking is designed for cross-host VMs, not same-host L2 networking.

**Symptoms:**
- Static IPs configured successfully
- Ping failed: 100% packet loss
- ARP tables empty (no neighbor discovery)

**Solution:** VDE (Virtual Distributed Ethernet) switch

```bash
# Start the VDE switch daemon
vde_switch -d -s /tmp/vde.sock -M /tmp/vde.mgmt

# Launch QEMU attached to the VDE switch
qemu-system-x86_64 \
  -netdev vde,id=vde0,sock=/tmp/vde.sock \
  -device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:01
```

**Evidence:**
- node01→node02: 0% packet loss, ~0.7 ms latency
- node02→node03: 0% packet loss (after ARP delay)
- Full-mesh L2 connectivity verified across 3 VMs

**Impact:** Enables a true L2 broadcast domain for Raft cluster testing on a single host.
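Scaling the launch above to the full 3-node mesh only requires a unique MAC per VM on the shared VDE socket. A minimal sketch of how the per-node commands could be generated; only node01's MAC appears in the source, so the `:02`/`:03` suffixes and the helper name are assumptions:

```shell
# Hypothetical helper: print the QEMU invocation for node $1, with all
# nodes attached to the single VDE socket started above. Only the MAC
# suffix differs per node; disk/kernel arguments are omitted for brevity.
gen_qemu_cmd() {
  printf 'qemu-system-x86_64 -netdev vde,id=vde0,sock=/tmp/vde.sock -device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:0%d\n' "$1"
}

# One command per node; each would run in its own shell/session.
for i in 1 2 3; do
  gen_qemu_cmd "$i"
done
```

Because every node attaches to the same `/tmp/vde.sock`, the switch gives them a common L2 segment, which is what made the full-mesh ping results possible.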
---
### 2. Custom Netboot with SSH Key (Zero-Touch Provisioning)

**Problem:** VMs required manual network configuration via VNC or a telnet console.

**Solution:** Bake the SSH public key into the netboot image

```nix
# nix/images/netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
];
```

**Build & Launch:**

```bash
# Build the custom netboot image
nix build .#netboot-base

# Direct kernel/initrd boot with QEMU
qemu-system-x86_64 \
  -kernel netboot-kernel/bzImage \
  -initrd netboot-initrd/initrd \
  -append "init=/nix/store/.../init console=ttyS0,115200"
```

**Result:** SSH access is available immediately on boot (ports 2201/2202/2203), with zero manual steps.

**Impact:** Eliminates VNC/telnet/password requirements entirely for automation.
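With one host port per node (2201-2203 per the result above), per-node SSH commands are mechanical to derive. A sketch; the `220N → node0N` mapping to guest port 22 is an inference about the QEMU port-forward setup, and the helper name is hypothetical:

```shell
# Hypothetical wrapper: print the ssh invocation for node N over its
# forwarded host port (2200 + N), assuming the ports listed above.
node_ssh_cmd() {
  node="$1"; shift
  printf 'ssh -p %d root@localhost %s\n' "$((2200 + node))" "$*"
}

node_ssh_cmd 1 uptime
node_ssh_cmd 3 'ip addr show'
```

Dropping the `printf` (or piping the printed command to `sh`) would execute it once the VMs are up.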
---
### 3. Disk Automation (Manual but Repeatable)

**Approach:** Direct SSH provisioning with a disk setup script

```bash
# Partition the disk
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on

# Format and mount
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot
```

**Result:** All 3 VMs ready for NixOS install with a consistent disk layout.

**Impact:** Validates T032 disk automation concepts; ready for final service deployment.
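To keep "manual but repeatable" honest across all three nodes, the same command sequence can be emitted by a small helper and replayed over SSH. A sketch; the helper name and the SSH replay are assumptions, not part of the validated flow:

```shell
# Hypothetical helper: emit the disk-setup commands for a given block
# device, so the identical layout can be replayed on every node.
disk_setup_cmds() {
  dev="$1"
  cat <<EOF
parted $dev -- mklabel gpt
parted $dev -- mkpart ESP fat32 1MB 512MB
parted $dev -- mkpart primary ext4 512MB 100%
parted $dev -- set 1 esp on
mkfs.fat -F 32 -n boot ${dev}1
mkfs.ext4 -L nixos ${dev}2
mount ${dev}2 /mnt
mkdir -p /mnt/boot
mount ${dev}1 /mnt/boot
EOF
}

disk_setup_cmds /dev/vda
```

Replaying on a node would then look like `disk_setup_cmds /dev/vda | ssh -p 2201 root@localhost sh` (port per the netboot section above).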
---
## Strategic Insights

### 1. MVP Validation Path Should Be Simplest First

**Observation:** 4+ hours were spent on tooling (nixos-anywhere, disko, flake integration) before the build drift was discovered.

**Cascade Pattern:**
1. nixos-anywhere attempt (~3h): git tree → path resolution → disko → package resolution
2. Networking pivot (~1h): multicast failure → VDE switch success ✅
3. Manual provisioning (P2): disk setup ✅ → build failures (code drift)

**Learning:** Start with P2 (manual binary deployment) for initial validation; automate after success.

**T032 Application:** Bare-metal should use the simpler provisioning path initially and add automation incrementally.
---
### 2. nixos-anywhere + Hybrid Flake Has Integration Complexity

**Challenges Encountered:**
1. **Dirty git tree:** Staged files are not in the nix store (a commit is required)
2. **Path resolution:** Relative imports fail in a flake context (paths must be exact)
3. **Disko module:** Must be in the flake inputs AND the nixosSystem modules
4. **Package resolution:** The nixosSystem context lacks access to workspace packages (overlay not applied)

**Root Cause:** Flake evaluation purity conflicts with the development workflow.

**Learning:** Flake-based nixos-anywhere requires a clean git tree, exact paths, and the full dependency graph in flake.nix.

**T032 Application:** Consider a non-flake nixos-anywhere path for bare-metal, or maintain a separate deployment flake.
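Challenge 1 is cheap to guard against up front, since flake evaluation only sees committed files. A preflight sketch (the helper is hypothetical, not part of the T036 tooling):

```shell
# Hypothetical preflight for challenge 1: flake-based nixos-anywhere only
# sees committed files, so refuse to deploy from a dirty working tree.
tree_is_clean() {
  # Empty porcelain output means no staged, unstaged, or untracked changes.
  [ -z "$(git -C "$1" status --porcelain)" ]
}

# Usage (before invoking nixos-anywhere):
#   tree_is_clean . || { echo "dirty git tree: commit first" >&2; exit 1; }
```

This catches the silent staged-but-uncommitted case that cost debugging time during the ~3h nixos-anywhere attempt.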
---
### 3. Code Drift Detection Needs Integration Testing

**Issue:** T037 SQL layer API changes broke flaredb-server without detection.

**Symptoms:**

```rust
error[E0599]: no method named `rows` found for struct `flaredb_sql::QueryResult`
error[E0560]: struct `ErrorResult` has no field named `message`
```

**Root Cause:** Workspace crates were updated independently without cross-crate testing.

**Learning:** Integration tests across workspace dependencies are needed to catch API drift early.

**Action:** T038 created to fix the drift and establish integration testing.
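Until real integration tests exist, a lightweight stopgap is to gate deploys on the absence of rustc error codes in a full-workspace build log. A sketch; the helper and log path are hypothetical:

```shell
# Hypothetical drift gate: scan a captured `cargo build --workspace` log
# for rustc error codes like the E0599/E0560 failures quoted above.
has_drift_errors() {
  grep -Eq 'error\[E[0-9]{4}\]' "$1"
}

# Demo against a captured log line:
printf 'error[E0599]: no method named `rows` found\n' > /tmp/build.log
if has_drift_errors /tmp/build.log; then
  echo "API drift detected: fix before deploying"
fi
```

This is a crude pre-deploy check, not a substitute for the cross-crate integration tests T038 is meant to establish.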
---
## Execution Timeline

**Total:** ~5 hours

**Outcome:** Infrastructure validated, build drift identified

| Phase | Duration | Result |
|-------|----------|--------|
| S1: VM Infrastructure | 30 min | ✅ 3 VMs + netboot |
| S2: SSH Access (Custom Netboot) | 1h | ✅ Zero-touch SSH |
| S3: TLS Certificates | 15 min | ✅ Certs deployed |
| S4: Node Configurations | 30 min | ✅ Configs ready |
| S5: Provisioning Attempts | 3h+ | ⚠️ Infrastructure validated, builds blocked |
| - nixos-anywhere debugging | ~3h | ⚠️ Flake complexity |
| - Networking pivot (VDE) | ~1h | ✅ L2 breakthrough |
| - Disk setup (manual) | 30 min | ✅ All nodes ready |
| S6: Cluster Validation | Deferred | ⏸️ Blocked on T038 |
---
## Recommendations for T032 Bare-Metal

### 1. Networking
- **For VM testing:** VDE is the correct approach for multi-VM on a single host
- **For bare-metal:** standard L2 switches provide the broadcast domain, so a VDE equivalent is likely not needed

### 2. Provisioning
- **Option A (Simple):** Manual binary deployment + systemd units (like the P2 approach)
  - Pros: fast, debuggable, no flake complexity
  - Cons: less automated
- **Option B (Automated):** nixos-anywhere with a simplified non-flake config
  - Pros: fully automated, reproducible
  - Cons: requires debugging time, flake purity issues

**Recommendation:** Start with Option A for the initial deployment; migrate to Option B after validation.
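For Option A, a unit file per service is most of the work. A minimal sketch; the unit name, binary path, and restart policy are assumptions, not taken from the source (only the `flaredb-server` binary name appears above):

```ini
# /etc/systemd/system/flaredb-server.service (hypothetical path and name)
[Unit]
Description=FlareDB server node
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/flaredb-server
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
```

Something like `systemctl enable --now flaredb-server` would then bring the service up once T038 delivers working binaries.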
### 3. Build System
|
|
- **Fix T038 first:** Ensure all builds work before bare-metal deployment
|
|
- **Test in nix-shell:** Verify cargo build environment before nix build
|
|
- **Integration tests:** Add cross-workspace crate testing to CI/CD
|
|
|
|
### 4. Custom Netboot
|
|
- **Keep SSH key approach:** Eliminates manual console access
|
|
- **Validate on bare-metal:** Test PXE boot flow with SSH key in netboot image
|
|
- **Fallback plan:** Keep VNC/IPMI access available for debugging
|
|
|
|
---
## Technical Debt

### Immediate (T038)
- [ ] Fix FlareDB API drift from T037
- [ ] Verify the nix-shell cargo build environment
- [ ] Build all 3 service binaries successfully
- [ ] Deploy to T036 VMs and complete S6 validation

### Future (T039+)
- [ ] Add integration tests across workspace crates
- [ ] Simplify nixos-anywhere flake integration
- [ ] Document the development workflow (git, flakes, nix-shell)
- [ ] CI/CD for cross-crate API compatibility
---
## Conclusion

**T036 achieved its goal:** validate T032 provisioning tools before bare-metal deployment.

**Success Metrics:**
- ✅ VM infrastructure operational (3 nodes, VDE networking)
- ✅ Custom netboot with SSH key (zero-touch access)
- ✅ Disk automation validated (all nodes partitioned/mounted)
- ✅ TLS certificates deployed
- ✅ Network configuration validated (static IPs, hostname resolution)

**Blockers Identified:**
- ❌ FlareDB API drift (T037): code maintenance, NOT a provisioning issue
- ❌ Cargo build environment: tooling configuration, NOT an infrastructure issue

**Risk Reduction for T032:**
- VDE breakthrough proves VM cluster viability
- Custom netboot validates automation concepts
- Disk setup process validated and documented
- Build drift identified before the bare-metal investment

**Next Steps:**
1. Complete T038 (code drift cleanup)
2. Resume T036.S6 with working binaries (VMs still running, ready)
3. Assess T032 readiness (tooling validated, proceed with confidence)

**ROI:** Negative for cluster validation (4+ hours, no cluster), but positive for risk reduction (infrastructure proven, blockers identified early).