photoncloud-monorepo/docs/por/T036-vm-cluster-deployment/LEARNINGS.md
# T036 VM Cluster Deployment - Key Learnings
**Status:** Partial Success (Infrastructure Validated)
**Date:** 2025-12-11
**Duration:** ~5 hours
**Outcome:** Provisioning tools validated, service deployment deferred to T038
---
## Executive Summary
T036 successfully validated VM infrastructure, networking automation, and provisioning concepts for T032 bare-metal deployment. The task demonstrated that T032 tooling works correctly, with build failures identified as orthogonal code maintenance issues (FlareDB API drift from T037).
**Key Achievement:** VDE switch networking breakthrough proves multi-VM cluster viability on single host.
---
## Technical Wins
### 1. VDE Switch Networking (Critical Breakthrough)
**Problem:** QEMU's socket multicast networking is designed for linking VMs across hosts, not for same-host L2 networking.
**Symptoms:**
- Static IPs configured successfully
- Ping failed: 100% packet loss
- ARP tables empty (no neighbor discovery)
**Solution:** VDE (Virtual Distributed Ethernet) switch
```bash
# Start VDE switch daemon
vde_switch -d -s /tmp/vde.sock -M /tmp/vde.mgmt
# QEMU launch with VDE
qemu-system-x86_64 \
  -netdev vde,id=vde0,sock=/tmp/vde.sock \
  -device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:01
```
**Evidence:**
- node01→node02: 0% packet loss, ~0.7ms latency
- node02→node03: 0% packet loss (after ARP delay)
- Full mesh L2 connectivity verified across 3 VMs
**Impact:** Enables true L2 broadcast domain for Raft cluster testing on single host.
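The mesh verification above can be scripted rather than pinged by hand. A minimal sketch, assuming hypothetical node addresses 10.0.0.1–3 (the actual static IPs aren't recorded here):

```shell
#!/usr/bin/env sh
# Hypothetical node addresses; substitute the cluster's actual static IPs.
NODES="10.0.0.1 10.0.0.2 10.0.0.3"

# Emit every ordered (src, dst) pair so both directions of the mesh get checked.
mesh_pairs() {
    for src in $NODES; do
        for dst in $NODES; do
            [ "$src" = "$dst" ] && continue
            echo "$src $dst"
        done
    done
}

# Usage, from a host with SSH access to each node:
#   mesh_pairs | while read -r src dst; do
#       ssh "root@$src" ping -c 3 "$dst"
#   done
```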
---
### 2. Custom Netboot with SSH Key (Zero-Touch Provisioning)
**Problem:** VMs required manual network configuration via VNC or telnet console.
**Solution:** Bake SSH public key into netboot image
```nix
# nix/images/netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
];
```
**Build & Launch:**
```bash
# Build custom netboot
nix build .#netboot-base
# Direct kernel/initrd boot with QEMU
qemu-system-x86_64 \
  -kernel netboot-kernel/bzImage \
  -initrd netboot-initrd/initrd \
  -append "init=/nix/store/.../init console=ttyS0,115200"
```
**Result:** SSH access immediately available on boot (ports 2201/2202/2203), zero manual steps.
**Impact:** Eliminates VNC/telnet/password requirements entirely for automation.
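Since the forwarded ports follow a fixed pattern (2201–2203), a tiny helper keeps automation scripts from hard-coding them. A sketch, assuming QEMU hostfwd rules map host port 2200+N to port 22 on node N:

```shell
#!/usr/bin/env sh
# Map a node index (1..3) to its forwarded SSH port (2201..2203).
node_port() {
    echo $((2200 + $1))
}

# Usage, assuming hostfwd rules forward these ports to each VM's sshd:
#   for n in 1 2 3; do
#       ssh -p "$(node_port "$n")" root@localhost hostname
#   done
```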
---
### 3. Disk Automation (Manual but Repeatable)
**Approach:** Direct SSH provisioning with disk setup script
```bash
# Partition disk
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on
# Format and mount
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot
```
**Result:** All 3 VMs ready for NixOS install with consistent disk layout.
**Impact:** Validates T032 disk automation concepts, ready for final service deployment.
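To apply the same sequence to all three nodes consistently, the commands can be wrapped in a generator and piped over SSH. A sketch (the wrapper function and port number are illustrative):

```shell
#!/usr/bin/env sh
# Hypothetical wrapper: emit the partition/format/mount commands for a given
# disk, so one reviewed sequence can be piped to each node over SSH.
disk_setup_script() {
    disk="$1"
    cat <<EOF
parted $disk -- mklabel gpt
parted $disk -- mkpart ESP fat32 1MB 512MB
parted $disk -- mkpart primary ext4 512MB 100%
parted $disk -- set 1 esp on
mkfs.fat -F 32 -n boot ${disk}1
mkfs.ext4 -L nixos ${disk}2
mount ${disk}2 /mnt
mkdir -p /mnt/boot
mount ${disk}1 /mnt/boot
EOF
}

# Usage: disk_setup_script /dev/vda | ssh -p 2201 root@localhost sh
```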
---
## Strategic Insights
### 1. MVP Validation Path Should Be Simplest First
**Observation:** 4+ hours spent on tooling (nixos-anywhere, disko, flake integration) before discovering build drift.
**Cascade Pattern:**
1. nixos-anywhere attempt (~3h): git tree → path resolution → disko → package resolution
2. Networking pivot (~1h): multicast failure → VDE switch success ✅
3. Manual provisioning (P2): disk setup ✅ → build failures (code drift)
**Learning:** Start with P2 (manual binary deployment) for initial validation, automate after success.
**T032 Application:** Bare-metal should use simpler provisioning path initially, add automation incrementally.
---
### 2. nixos-anywhere + Hybrid Flake Has Integration Complexity
**Challenges Encountered:**
1. **Dirty git tree:** Staged files not in nix store (requires commit)
2. **Path resolution:** Relative imports fail in flake context (must be exact)
3. **Disko module:** Must be in flake inputs AND nixosSystem modules
4. **Package resolution:** nixosSystem context lacks access to workspace packages (overlay not applied)
**Root Cause:** Flake evaluation purity conflicts with development workflow.
**Learning:** Flake-based nixos-anywhere requires clean git, exact paths, and full dependency graph in flake.nix.
**T032 Application:** Consider non-flake nixos-anywhere path for bare-metal, or maintain separate deployment flake.
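The checklist above implies a flake shape roughly like the following. This is a sketch with illustrative names, not the project's actual flake.nix:

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";
    disko.url = "github:nix-community/disko";
  };
  outputs = { nixpkgs, disko, ... }: {
    nixosConfigurations.node01 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        disko.nixosModules.disko  # disko must appear here AND in inputs
        ./nodes/node01.nix        # path must resolve in the flake context
      ];
    };
  };
}
```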
---
### 3. Code Drift Detection Needs Integration Testing
**Issue:** T037 SQL layer API changes broke flaredb-server without detection.
**Symptoms:**
```rust
error[E0599]: no method named `rows` found for struct `flaredb_sql::QueryResult`
error[E0560]: struct `ErrorResult` has no field named `message`
```
**Root Cause:** Workspace crates updated independently without cross-crate testing.
**Learning:** Need integration tests across workspace dependencies to catch API drift early.
**Action:** T038 created to fix drift + establish integration testing.
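A lightweight guard for this class of drift is to gate merges on a whole-workspace check instead of per-crate builds. A sketch, assuming a standard Cargo workspace layout:

```shell
#!/usr/bin/env sh
# Hypothetical pre-merge guard: type-check and test every workspace member
# together, so an API change in one crate (e.g. flaredb-sql) immediately
# fails the build of its dependents (e.g. flaredb-server).
drift_check() {
    cargo check --workspace --all-targets \
      && cargo test --workspace
}

# Usage: run from the workspace root before merging cross-crate changes:
#   drift_check
```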
---
## Execution Timeline
**Total:** ~5 hours
**Outcome:** Infrastructure validated, build drift identified
| Phase | Duration | Result |
|-------|----------|--------|
| S1: VM Infrastructure | 30 min | ✅ 3 VMs + netboot |
| S2: SSH Access (Custom Netboot) | 1h | ✅ Zero-touch SSH |
| S3: TLS Certificates | 15 min | ✅ Certs deployed |
| S4: Node Configurations | 30 min | ✅ Configs ready |
| S5: Provisioning Attempts | 3h+ | ⚠️ Infrastructure validated, builds blocked |
| - nixos-anywhere debugging | ~3h | ⚠️ Flake complexity |
| - Networking pivot (VDE) | ~1h | ✅ L2 breakthrough |
| - Disk setup (manual) | 30 min | ✅ All nodes ready |
| S6: Cluster Validation | Deferred | ⏸️ Blocked on T038 |
---
## Recommendations for T032 Bare-Metal
### 1. Networking
- **For VM testing:** a VDE switch is the correct approach for multi-VM L2 networking on a single host
- **For bare-metal:** physical L2 switches already provide the broadcast domain, so no VDE equivalent is needed
### 2. Provisioning
- **Option A (Simple):** Manual binary deployment + systemd units (like P2 approach)
- Pros: Fast, debuggable, no flake complexity
- Cons: Less automated
- **Option B (Automated):** nixos-anywhere with simplified non-flake config
- Pros: Fully automated, reproducible
- Cons: Requires debugging time, flake purity issues
**Recommendation:** Start with Option A for initial deployment, migrate to Option B after validation.
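Option A amounts to copying a binary and dropping in a unit file. A sketch (service name, binary path, and user are illustrative, not the project's actual layout):

```ini
# /etc/systemd/system/flaredb-server.service (illustrative)
[Unit]
Description=FlareDB server (manual deployment)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/photoncloud/bin/flaredb-server
Restart=on-failure
User=flaredb

[Install]
WantedBy=multi-user.target
```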
### 3. Build System
- **Fix T038 first:** Ensure all builds work before bare-metal deployment
- **Test in nix-shell:** Verify cargo build environment before nix build
- **Integration tests:** Add cross-workspace crate testing to CI/CD
### 4. Custom Netboot
- **Keep SSH key approach:** Eliminates manual console access
- **Validate on bare-metal:** Test PXE boot flow with SSH key in netboot image
- **Fallback plan:** Keep VNC/IPMI access available for debugging
---
## Technical Debt
### Immediate (T038)
- [ ] Fix FlareDB API drift from T037
- [ ] Verify nix-shell cargo build environment
- [ ] Build all 3 service binaries successfully
- [ ] Deploy to T036 VMs and complete S6 validation
### Future (T039+)
- [ ] Add integration tests across workspace crates
- [ ] Simplify nixos-anywhere flake integration
- [ ] Document development workflow (git, flakes, nix-shell)
- [ ] CI/CD for cross-crate API compatibility
---
## Conclusion
**T036 achieved its goal:** Validate T032 provisioning tools before bare-metal deployment.
**Success Metrics:**
- ✅ VM infrastructure operational (3 nodes, VDE networking)
- ✅ Custom netboot with SSH key (zero-touch access)
- ✅ Disk automation validated (all nodes partitioned/mounted)
- ✅ TLS certificates deployed
- ✅ Network configuration validated (static IPs, hostname resolution)
**Blockers Identified:**
- ❌ FlareDB API drift (T037) - code maintenance, NOT provisioning issue
- ❌ Cargo build environment - tooling configuration, NOT infrastructure issue
**Risk Reduction for T032:**
- VDE breakthrough proves VM cluster viability
- Custom netboot validates automation concepts
- Disk setup process validated and documented
- Build drift identified before bare-metal investment
**Next Steps:**
1. Complete T038 (code drift cleanup)
2. Resume T036.S6 with working binaries (VMs still running, ready)
3. Assess T032 readiness (tooling validated, proceed with confidence)
**ROI:** Negative for cluster validation (4+ hours, no cluster), but positive for risk reduction (infrastructure proven, blockers identified early).