photoncloud-monorepo/docs/por/T036-vm-cluster-deployment/LEARNINGS.md

T036 VM Cluster Deployment - Key Learnings

Status: Partial Success (Infrastructure Validated)
Date: 2025-12-11
Duration: ~5 hours
Outcome: Provisioning tools validated, service deployment deferred to T038


Executive Summary

T036 successfully validated VM infrastructure, networking automation, and provisioning concepts for the T032 bare-metal deployment. The task demonstrated that the T032 tooling works correctly; the build failures were traced to an orthogonal code-maintenance issue (FlareDB API drift from T037).

Key Achievement: VDE switch networking breakthrough proves multi-VM cluster viability on single host.


Technical Wins

1. VDE Switch Networking (Critical Breakthrough)

Problem: QEMU socket multicast is designed for cross-host VM networking, not same-host L2 switching.

Symptoms:

  • Static IPs configured successfully
  • Ping failed: 100% packet loss
  • ARP tables empty (no neighbor discovery)

Solution: VDE (Virtual Distributed Ethernet) switch

# Start VDE switch daemon
vde_switch -d -s /tmp/vde.sock -M /tmp/vde.mgmt

# QEMU launch with VDE
qemu-system-x86_64 \
  -netdev vde,id=vde0,sock=/tmp/vde.sock \
  -device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:01

Evidence:

  • node01→node02: 0% packet loss, ~0.7ms latency
  • node02→node03: 0% packet loss (after ARP delay)
  • Full mesh L2 connectivity verified across 3 VMs

Impact: Enables true L2 broadcast domain for Raft cluster testing on single host.
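Scaled to the full cluster, the two commands above can be combined into a small launch script. This is a minimal sketch, not the actual T036 tooling: memory size, disk image names, and the per-node MAC scheme (extrapolated from the `52:54:00:12:34:01` example) are assumptions. `DRY_RUN=1` prints the QEMU commands instead of executing them.

```shell
#!/usr/bin/env sh
# Sketch: bring up a 3-node cluster on one VDE switch (hypothetical layout).
VDE_SOCK=${VDE_SOCK:-/tmp/vde.sock}
DRY_RUN=${DRY_RUN:-1}

# Per-node MAC, following the 52:54:00:12:34:0N convention from the example
node_mac() {
  printf '52:54:00:12:34:0%d' "$1"
}

# Build (and in non-dry-run mode, execute) the QEMU command for node N
launch_node() {
  i=$1
  cmd="qemu-system-x86_64 -m 2048 \
    -netdev vde,id=vde0,sock=$VDE_SOCK \
    -device virtio-net-pci,netdev=vde0,mac=$(node_mac "$i") \
    -drive file=node0$i.qcow2,if=virtio"
  if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else eval "$cmd &"; fi
}

for i in 1 2 3; do launch_node "$i"; done
```

All three VMs share `sock=$VDE_SOCK`, which is what puts them on the same L2 broadcast domain; the MACs only need to be unique per node.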


2. Custom Netboot with SSH Key (Zero-Touch Provisioning)

Problem: VMs required manual network configuration via VNC or telnet console.

Solution: Bake SSH public key into netboot image

# nix/images/netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
];

Build & Launch:

# Build custom netboot
nix build .#netboot-base

# Direct kernel/initrd boot with QEMU
qemu-system-x86_64 \
  -kernel netboot-kernel/bzImage \
  -initrd netboot-initrd/initrd \
  -append "init=/nix/store/.../init console=ttyS0,115200"

Result: SSH access immediately available on boot (ports 2201/2202/2203), zero manual steps.

Impact: Eliminates VNC/telnet/password requirements entirely for automation.
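Pieced together, the fragment above could sit in a module roughly like the following sketch. The import path assumes the standard nixpkgs netboot profile; the SSH hardening lines are additions not shown in the original fragment, so treat the whole module as illustrative rather than the repo's actual `netboot-base.nix`.

```nix
# nix/images/netboot-base.nix -- illustrative sketch
{ modulesPath, ... }:
{
  # Assumption: build on the stock nixpkgs netboot profile
  imports = [ (modulesPath + "/installer/netboot/netboot-minimal.nix") ];

  services.openssh.enable = true;
  services.openssh.settings.PermitRootLogin = "prohibit-password";

  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
  ];

  # Serial console matching the -append "console=ttyS0,115200" launch line
  boot.kernelParams = [ "console=ttyS0,115200" ];
}
```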


3. Disk Automation (Manual but Repeatable)

Approach: Direct SSH provisioning with disk setup script

# Partition disk
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on

# Format and mount
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot

Result: All 3 VMs ready for NixOS install with consistent disk layout.

Impact: Validates T032 disk automation concepts, ready for final service deployment.
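Since the netboot image already grants key-based root SSH on forwarded ports 2201-2203, the disk-setup steps above can be fanned out to all three nodes from the host. A hedged sketch follows: the `localhost` port-forward addressing is an assumption about the QEMU setup, and `DRY_RUN=1` prints instead of connecting.

```shell
#!/usr/bin/env sh
# Sketch: run the disk-setup steps on each VM over SSH.
DRY_RUN=${DRY_RUN:-1}

# Emit the disk-setup script verbatim (same steps as documented above)
setup_disk_cmd() {
  cat <<'EOF'
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot
EOF
}

for port in 2201 2202 2203; do
  if [ "$DRY_RUN" = 1 ]; then
    echo "[dry-run] would run disk setup on localhost:$port"
  else
    setup_disk_cmd | ssh -p "$port" root@localhost 'sh -s'
  fi
done
```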


Strategic Insights

1. MVP Validation Path Should Be Simplest First

Observation: 4+ hours spent on tooling (nixos-anywhere, disko, flake integration) before discovering build drift.

Cascade Pattern:

  1. nixos-anywhere attempt (~3h): git tree → path resolution → disko → package resolution
  2. Networking pivot (~1h): multicast failure → VDE switch success
  3. Manual provisioning (P2): disk setup → build failures (code drift)

Learning: Start with P2 (manual binary deployment) for initial validation, automate after success.

T032 Application: Bare-metal should use simpler provisioning path initially, add automation incrementally.


2. nixos-anywhere + Hybrid Flake: Integration Complexity

Challenges Encountered:

  1. Dirty git tree: Staged files not in nix store (requires commit)
  2. Path resolution: Relative imports fail in flake context (must be exact)
  3. Disko module: Must be in flake inputs AND nixosSystem modules
  4. Package resolution: nixosSystem context lacks access to workspace packages (overlay not applied)

Root Cause: Flake evaluation purity conflicts with development workflow.

Learning: Flake-based nixos-anywhere requires clean git, exact paths, and full dependency graph in flake.nix.

T032 Application: Consider non-flake nixos-anywhere path for bare-metal, or maintain separate deployment flake.
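A separate deployment flake could be as small as the skeleton below. Everything here is illustrative (input pins, host module paths, node names); the point is that disko appears in the flake inputs AND in the `nixosSystem` modules list, addressing challenge 3 from the list above.

```nix
# deploy/flake.nix -- illustrative skeleton of a standalone deployment flake
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";
    disko.url = "github:nix-community/disko";
    disko.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { nixpkgs, disko, ... }: {
    nixosConfigurations.node01 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        disko.nixosModules.disko   # must be listed here as well as in inputs
        ./hosts/node01.nix         # hypothetical per-node config
      ];
    };
  };
}
```

Keeping this flake out of the workspace repo sidesteps the dirty-git-tree and workspace-package-overlay problems at the cost of pinning deployment inputs separately.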


3. Code Drift Detection Needs Integration Testing

Issue: T037 SQL layer API changes broke flaredb-server without detection.

Symptoms:

error[E0599]: no method named `rows` found for struct `flaredb_sql::QueryResult`
error[E0560]: struct `ErrorResult` has no field named `message`

Root Cause: Workspace crates updated independently without cross-crate testing.

Learning: Need integration tests across workspace dependencies to catch API drift early.

Action: T038 created to fix drift + establish integration testing.
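Errors like the E0599/E0560 ones above are exactly what a workspace-wide gate would catch before deployment. A minimal sketch of such a check script follows; the path and CI wiring are assumptions, and `DRY_RUN=1` prints the commands rather than requiring a cargo workspace.

```shell
#!/usr/bin/env sh
# ci/check-workspace.sh -- sketch of a cross-crate drift gate (hypothetical).
set -eu
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode; execute it in CI
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

run cargo check --workspace --all-targets  # surfaces E0599/E0560-style breakage early
run cargo test --workspace                 # exercises cross-crate integration tests
```

`--workspace` is the key flag: it forces every member crate to compile and test against the current state of its siblings, instead of each crate being checked in isolation.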


Execution Timeline

Total: ~5 hours
Outcome: Infrastructure validated, build drift identified

| Phase | Duration | Result |
|-------|----------|--------|
| S1: VM Infrastructure | 30 min | 3 VMs + netboot |
| S2: SSH Access (Custom Netboot) | 1h | Zero-touch SSH |
| S3: TLS Certificates | 15 min | Certs deployed |
| S4: Node Configurations | 30 min | Configs ready |
| S5: Provisioning Attempts | 3h+ | ⚠️ Infrastructure validated, builds blocked |
| - nixos-anywhere debugging | ~3h | ⚠️ Flake complexity |
| - Networking pivot (VDE) | ~1h | L2 breakthrough |
| - Disk setup (manual) | 30 min | All nodes ready |
| S6: Cluster Validation | Deferred | ⏸️ Blocked on T038 |

Recommendations for T032 Bare-Metal

1. Networking

  • For VM testing: VDE remains the correct approach for multi-VM L2 on a single host
  • For bare-metal: physical L2 switches already provide the broadcast domain, so no VDE equivalent is needed

2. Provisioning

  • Option A (Simple): Manual binary deployment + systemd units (like P2 approach)
    • Pros: Fast, debuggable, no flake complexity
    • Cons: Less automated
  • Option B (Automated): nixos-anywhere with simplified non-flake config
    • Pros: Fully automated, reproducible
    • Cons: Requires debugging time, flake purity issues

Recommendation: Start with Option A for initial deployment, migrate to Option B after validation.
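For Option A, the "systemd units" half could look like the sketch below. The unit name, binary path, and hardening choices are all illustrative, not the repo's actual deployment artifacts.

```ini
# /etc/systemd/system/flaredb-server.service -- illustrative sketch
[Unit]
Description=FlareDB server (manually deployed binary)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/photoncloud/bin/flaredb-server
Restart=on-failure
DynamicUser=yes

[Install]
WantedBy=multi-user.target
```

Because this is a plain unit plus a copied binary, a broken deploy is debuggable with `journalctl -u flaredb-server` alone, which is the "fast, debuggable" property Option A is chosen for.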

3. Build System

  • Fix T038 first: Ensure all builds work before bare-metal deployment
  • Test in nix-shell: Verify cargo build environment before nix build
  • Integration tests: Add cross-workspace crate testing to CI/CD

4. Custom Netboot

  • Keep SSH key approach: Eliminates manual console access
  • Validate on bare-metal: Test PXE boot flow with SSH key in netboot image
  • Fallback plan: Keep VNC/IPMI access available for debugging

Technical Debt

Immediate (T038)

  • Fix FlareDB API drift from T037
  • Verify nix-shell cargo build environment
  • Build all 3 service binaries successfully
  • Deploy to T036 VMs and complete S6 validation

Future (T039+)

  • Add integration tests across workspace crates
  • Simplify nixos-anywhere flake integration
  • Document development workflow (git, flakes, nix-shell)
  • CI/CD for cross-crate API compatibility

Conclusion

T036 achieved its goal: Validate T032 provisioning tools before bare-metal deployment.

Success Metrics:

  • VM infrastructure operational (3 nodes, VDE networking)
  • Custom netboot with SSH key (zero-touch access)
  • Disk automation validated (all nodes partitioned/mounted)
  • TLS certificates deployed
  • Network configuration validated (static IPs, hostname resolution)

Blockers Identified:

  • FlareDB API drift (T037) - code maintenance, NOT provisioning issue
  • Cargo build environment - tooling configuration, NOT infrastructure issue

Risk Reduction for T032:

  • VDE breakthrough proves VM cluster viability
  • Custom netboot validates automation concepts
  • Disk setup process validated and documented
  • Build drift identified before bare-metal investment

Next Steps:

  1. Complete T038 (code drift cleanup)
  2. Resume T036.S6 with working binaries (VMs still running, ready)
  3. Assess T032 readiness (tooling validated, proceed with confidence)

ROI: Negative for cluster validation (4+ hours, no cluster), but positive for risk reduction (infrastructure proven, blockers identified early).