photoncloud-monorepo/docs/por/T036-vm-cluster-deployment/LEARNINGS.md

T036 VM Cluster Deployment - Key Learnings

Status: Partial Success (Infrastructure Validated)
Date: 2025-12-11
Duration: ~5 hours
Outcome: Provisioning tools validated, service deployment deferred to T038


Executive Summary

T036 successfully validated VM infrastructure, networking automation, and provisioning concepts for the T032 bare-metal deployment. The task demonstrated that the T032 tooling works correctly; the build failures were traced to an orthogonal code-maintenance issue (FlareDB API drift from T037).

Key Achievement: VDE switch networking breakthrough proves multi-VM cluster viability on single host.


Technical Wins

1. VDE Switch Networking (Critical Breakthrough)

Problem: QEMU socket multicast is designed for cross-host VM networking, not same-host L2 switching.

Symptoms:

  • Static IPs configured successfully
  • Ping failed: 100% packet loss
  • ARP tables empty (no neighbor discovery)

Solution: VDE (Virtual Distributed Ethernet) switch

# Start VDE switch daemon
vde_switch -d -s /tmp/vde.sock -M /tmp/vde.mgmt

# QEMU launch with VDE
qemu-system-x86_64 \
  -netdev vde,id=vde0,sock=/tmp/vde.sock \
  -device virtio-net-pci,netdev=vde0,mac=52:54:00:12:34:01

Evidence:

  • node01→node02: 0% packet loss, ~0.7ms latency
  • node02→node03: 0% packet loss (after ARP delay)
  • Full mesh L2 connectivity verified across 3 VMs

Impact: Enables true L2 broadcast domain for Raft cluster testing on single host.
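Scaled to the full cluster, the two commands above can be combined into a small launch script. This is a minimal sketch, not the actual T036 tooling: memory size, disk image names, and the per-node MAC scheme (extrapolated from the `52:54:00:12:34:01` example) are assumptions. `DRY_RUN=1` prints the QEMU commands instead of executing them.

```shell
#!/usr/bin/env sh
# Sketch: bring up a 3-node cluster on one VDE switch (hypothetical layout).
VDE_SOCK=${VDE_SOCK:-/tmp/vde.sock}
DRY_RUN=${DRY_RUN:-1}

# Per-node MAC, following the 52:54:00:12:34:0N convention from the example
node_mac() {
  printf '52:54:00:12:34:0%d' "$1"
}

# Build (and in non-dry-run mode, execute) the QEMU command for node N
launch_node() {
  i=$1
  cmd="qemu-system-x86_64 -m 2048 \
    -netdev vde,id=vde0,sock=$VDE_SOCK \
    -device virtio-net-pci,netdev=vde0,mac=$(node_mac "$i") \
    -drive file=node0$i.qcow2,if=virtio"
  if [ "$DRY_RUN" = 1 ]; then echo "$cmd"; else eval "$cmd &"; fi
}

for i in 1 2 3; do launch_node "$i"; done
```

All three VMs share `sock=$VDE_SOCK`, which is what puts them on the same L2 broadcast domain; the MACs only need to be unique per node.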


2. Custom Netboot with SSH Key (Zero-Touch Provisioning)

Problem: VMs required manual network configuration via VNC or telnet console.

Solution: Bake SSH public key into netboot image

# nix/images/netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
];

Build & Launch:

# Build custom netboot
nix build .#netboot-base

# Direct kernel/initrd boot with QEMU
qemu-system-x86_64 \
  -kernel netboot-kernel/bzImage \
  -initrd netboot-initrd/initrd \
  -append "init=/nix/store/.../init console=ttyS0,115200"

Result: SSH access immediately available on boot (ports 2201/2202/2203), zero manual steps.

Impact: Eliminates VNC/telnet/password requirements entirely for automation.
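Pieced together, the fragment above could sit in a module roughly like the following sketch. The import path assumes the standard nixpkgs netboot profile; the SSH hardening lines are additions not shown in the original fragment, so treat the whole module as illustrative rather than the repo's actual `netboot-base.nix`.

```nix
# nix/images/netboot-base.nix -- illustrative sketch
{ modulesPath, ... }:
{
  # Assumption: build on the stock nixpkgs netboot profile
  imports = [ (modulesPath + "/installer/netboot/netboot-minimal.nix") ];

  services.openssh.enable = true;
  services.openssh.settings.PermitRootLogin = "prohibit-password";

  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAAC3Nza... centra@cn-nixos-think"
  ];

  # Serial console matching the -append "console=ttyS0,115200" launch line
  boot.kernelParams = [ "console=ttyS0,115200" ];
}
```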


3. Disk Automation (Manual but Repeatable)

Approach: Direct SSH provisioning with disk setup script

# Partition disk
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on

# Format and mount
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot

Result: All 3 VMs ready for NixOS install with consistent disk layout.

Impact: Validates T032 disk automation concepts, ready for final service deployment.
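Since the netboot image already grants key-based root SSH on forwarded ports 2201-2203, the disk-setup steps above can be fanned out to all three nodes from the host. A hedged sketch follows: the `localhost` port-forward addressing is an assumption about the QEMU setup, and `DRY_RUN=1` prints instead of connecting.

```shell
#!/usr/bin/env sh
# Sketch: run the disk-setup steps on each VM over SSH.
DRY_RUN=${DRY_RUN:-1}

# Emit the disk-setup script verbatim (same steps as documented above)
setup_disk_cmd() {
  cat <<'EOF'
parted /dev/vda -- mklabel gpt
parted /dev/vda -- mkpart ESP fat32 1MB 512MB
parted /dev/vda -- mkpart primary ext4 512MB 100%
parted /dev/vda -- set 1 esp on
mkfs.fat -F 32 -n boot /dev/vda1
mkfs.ext4 -L nixos /dev/vda2
mount /dev/vda2 /mnt
mkdir -p /mnt/boot
mount /dev/vda1 /mnt/boot
EOF
}

for port in 2201 2202 2203; do
  if [ "$DRY_RUN" = 1 ]; then
    echo "[dry-run] would run disk setup on localhost:$port"
  else
    setup_disk_cmd | ssh -p "$port" root@localhost 'sh -s'
  fi
done
```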


Strategic Insights

1. MVP Validation Path Should Be Simplest First

Observation: 4+ hours spent on tooling (nixos-anywhere, disko, flake integration) before discovering build drift.

Cascade Pattern:

  1. nixos-anywhere attempt (~3h): git tree → path resolution → disko → package resolution
  2. Networking pivot (~1h): multicast failure → VDE switch success
  3. Manual provisioning (P2): disk setup → build failures (code drift)

Learning: Start with P2 (manual binary deployment) for initial validation, automate after success.

T032 Application: Bare-metal should use simpler provisioning path initially, add automation incrementally.


2. nixos-anywhere + Hybrid Flake: Integration Complexity

Challenges Encountered:

  1. Dirty git tree: Staged files not in nix store (requires commit)
  2. Path resolution: Relative imports fail in flake context (must be exact)
  3. Disko module: Must be in flake inputs AND nixosSystem modules
  4. Package resolution: nixosSystem context lacks access to workspace packages (overlay not applied)

Root Cause: Flake evaluation purity conflicts with development workflow.

Learning: Flake-based nixos-anywhere requires clean git, exact paths, and full dependency graph in flake.nix.

T032 Application: Consider non-flake nixos-anywhere path for bare-metal, or maintain separate deployment flake.
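A separate deployment flake could be as small as the skeleton below. Everything here is illustrative (input pins, host module paths, node names); the point is that disko appears in the flake inputs AND in the `nixosSystem` modules list, addressing challenge 3 from the list above.

```nix
# deploy/flake.nix -- illustrative skeleton of a standalone deployment flake
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";
    disko.url = "github:nix-community/disko";
    disko.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { nixpkgs, disko, ... }: {
    nixosConfigurations.node01 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        disko.nixosModules.disko   # must be listed here as well as in inputs
        ./hosts/node01.nix         # hypothetical per-node config
      ];
    };
  };
}
```

Keeping this flake out of the workspace repo sidesteps the dirty-git-tree and workspace-package-overlay problems at the cost of pinning deployment inputs separately.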


3. Code Drift Detection Needs Integration Testing

Issue: T037 SQL layer API changes broke flaredb-server without detection.

Symptoms:

error[E0599]: no method named `rows` found for struct `flaredb_sql::QueryResult`
error[E0560]: struct `ErrorResult` has no field named `message`

Root Cause: Workspace crates updated independently without cross-crate testing.

Learning: Need integration tests across workspace dependencies to catch API drift early.

Action: T038 created to fix drift + establish integration testing.
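Errors like the E0599/E0560 ones above are exactly what a workspace-wide gate would catch before deployment. A minimal sketch of such a check script follows; the path and CI wiring are assumptions, and `DRY_RUN=1` prints the commands rather than requiring a cargo workspace.

```shell
#!/usr/bin/env sh
# ci/check-workspace.sh -- sketch of a cross-crate drift gate (hypothetical).
set -eu
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode; execute it in CI
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

run cargo check --workspace --all-targets  # surfaces E0599/E0560-style breakage early
run cargo test --workspace                 # exercises cross-crate integration tests
```

`--workspace` is the key flag: it forces every member crate to compile and test against the current state of its siblings, instead of each crate being checked in isolation.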


Execution Timeline

Total: ~5 hours
Outcome: Infrastructure validated, build drift identified

| Phase | Duration | Result |
|-------|----------|--------|
| S1: VM Infrastructure | 30 min | 3 VMs + netboot |
| S2: SSH Access (Custom Netboot) | 1h | Zero-touch SSH |
| S3: TLS Certificates | 15 min | Certs deployed |
| S4: Node Configurations | 30 min | Configs ready |
| S5: Provisioning Attempts | 3h+ | ⚠️ Infrastructure validated, builds blocked |
| - nixos-anywhere debugging | ~3h | ⚠️ Flake complexity |
| - Networking pivot (VDE) | ~1h | L2 breakthrough |
| - Disk setup (manual) | 30 min | All nodes ready |
| S6: Cluster Validation | Deferred | ⏸️ Blocked on T038 |

Recommendations for T032 Bare-Metal

1. Networking

  • For VM testing: VDE remains the correct approach for multi-VM L2 on a single host
  • For bare-metal: physical L2 switches already provide the broadcast domain, so no VDE equivalent is needed

2. Provisioning

  • Option A (Simple): Manual binary deployment + systemd units (like P2 approach)
    • Pros: Fast, debuggable, no flake complexity
    • Cons: Less automated
  • Option B (Automated): nixos-anywhere with simplified non-flake config
    • Pros: Fully automated, reproducible
    • Cons: Requires debugging time, flake purity issues

Recommendation: Start with Option A for initial deployment, migrate to Option B after validation.
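For Option A, the "systemd units" half could look like the sketch below. The unit name, binary path, and hardening choices are all illustrative, not the repo's actual deployment artifacts.

```ini
# /etc/systemd/system/flaredb-server.service -- illustrative sketch
[Unit]
Description=FlareDB server (manually deployed binary)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/opt/photoncloud/bin/flaredb-server
Restart=on-failure
DynamicUser=yes

[Install]
WantedBy=multi-user.target
```

Because this is a plain unit plus a copied binary, a broken deploy is debuggable with `journalctl -u flaredb-server` alone, which is the "fast, debuggable" property Option A is chosen for.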

3. Build System

  • Fix T038 first: Ensure all builds work before bare-metal deployment
  • Test in nix-shell: Verify cargo build environment before nix build
  • Integration tests: Add cross-workspace crate testing to CI/CD

4. Custom Netboot

  • Keep SSH key approach: Eliminates manual console access
  • Validate on bare-metal: Test PXE boot flow with SSH key in netboot image
  • Fallback plan: Keep VNC/IPMI access available for debugging

Technical Debt

Immediate (T038)

  • Fix FlareDB API drift from T037
  • Verify nix-shell cargo build environment
  • Build all 3 service binaries successfully
  • Deploy to T036 VMs and complete S6 validation

Future (T039+)

  • Add integration tests across workspace crates
  • Simplify nixos-anywhere flake integration
  • Document development workflow (git, flakes, nix-shell)
  • CI/CD for cross-crate API compatibility

Conclusion

T036 achieved its goal: Validate T032 provisioning tools before bare-metal deployment.

Success Metrics:

  • VM infrastructure operational (3 nodes, VDE networking)
  • Custom netboot with SSH key (zero-touch access)
  • Disk automation validated (all nodes partitioned/mounted)
  • TLS certificates deployed
  • Network configuration validated (static IPs, hostname resolution)

Blockers Identified:

  • FlareDB API drift (T037) - code maintenance, NOT provisioning issue
  • Cargo build environment - tooling configuration, NOT infrastructure issue

Risk Reduction for T032:

  • VDE breakthrough proves VM cluster viability
  • Custom netboot validates automation concepts
  • Disk setup process validated and documented
  • Build drift identified before bare-metal investment

Next Steps:

  1. Complete T038 (code drift cleanup)
  2. Resume T036.S6 with working binaries (VMs still running, ready)
  3. Assess T032 readiness (tooling validated, proceed with confidence)

ROI: Negative for cluster validation (4+ hours, no cluster), but positive for risk reduction (infrastructure proven, blockers identified early).