
# T040.S3 PlasmaVMC HA Behavior Runbook

## Objective

Document how PlasmaVMC behaves when its host fails. This is a gap-documentation exercise: live migration is NOT implemented.

## Current Capability Assessment

### What IS Implemented

| Feature | Status | Location |
| --- | --- | --- |
| VM state tracking | YES | `plasmavmc-types/src/vm.rs:56` - `VmState::Migrating` |
| KVM capability flag | YES | `plasmavmc-kvm/src/lib.rs:147` - `live_migration: true` |
| QMP state parsing | YES | `plasmavmc-kvm/src/qmp.rs:99` - parses `"inmigrate"`/`"postmigrate"` |
| ChainFire persistence | YES | VM metadata stored in the cluster KVS |
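The table above notes that `VmState::Migrating` exists even though no migration path uses it. As a hedged illustration only, a state enum with a transition check might look like the sketch below; every variant name other than `Migrating`, and the `can_transition` helper, are assumptions for this runbook, not the actual contents of `plasmavmc-types`.

```rust
// Hypothetical sketch of a VM state machine. Only `Migrating` is confirmed
// to exist in plasmavmc-types; the other variants and the transition rules
// here are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    Creating,
    Running,
    Migrating,
    Stopped,
    Failed,
}

/// Returns true if `from -> to` is an allowed transition in this sketch.
pub fn can_transition(from: VmState, to: VmState) -> bool {
    use VmState::*;
    matches!(
        (from, to),
        (Creating, Running)
            | (Running, Migrating)
            | (Migrating, Running)
            | (Running, Stopped)
            | (_, Failed) // any state may fail
    )
}
```

Note that `Migrating` is reachable in this sketch even though, as documented below, nothing in the codebase ever drives a VM into it.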

### What is NOT Implemented (GAPS)

| Feature | Gap | Impact |
| --- | --- | --- |
| Live migration API | No `migrate()` function | VMs cannot move between hosts |
| Host failure detection | No health monitoring | VM loss goes undetected |
| Automatic recovery | No failover logic | Manual intervention required |
| Shared storage | No VM disk migration | Would require shared storage (Ceph/NFS) |

## Test Scenarios

### Scenario 1: Document Current VM Lifecycle

```bash
# Create a VM (grpcurl requires -d before the address and method)
grpcurl -plaintext -d '{"name":"test-vm","vcpus":1,"memory_mb":512}' \
  localhost:50051 plasmavmc.VmService/CreateVm

# Get the VM ID from the response
VM_ID="<returned-id>"

# Check VM state
grpcurl -plaintext -d "{\"id\":\"$VM_ID\"}" \
  localhost:50051 plasmavmc.VmService/GetVm

# Expected: VM running on this host
```

### Scenario 2: Host Process Kill (Simulated Failure)

```bash
# Kill the PlasmaVMC server
kill -9 $(pgrep -f plasmavmc-server)

# QEMU processes continue running (orphaned)
ps aux | grep qemu

# Expected behavior:
# - QEMU continues (no longer managed)
# - VM metadata in ChainFire still shows "Running"
# - No automatic recovery
```

### Scenario 3: Restart PlasmaVMC Server

```bash
# Restart the server
./plasmavmc-server &

# Check whether the VM is rediscovered
grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms

# Expected behavior (DOCUMENT which one occurs):
# Option A: Server reads ChainFire, finds the orphan, reconnects QMP
# Option B: Server reads ChainFire, state mismatch (metadata vs reality)
# Option C: Server starts fresh, VMs lost from management
```
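Whichever option the server actually takes, the underlying question is the same: on startup, compare ChainFire metadata against the set of live QEMU processes and classify each VM. A minimal sketch of that comparison, using VM IDs as plain strings (the `Reconciled` type and `reconcile` function are hypothetical names, not existing code):

```rust
// Hypothetical startup-reconciliation sketch: classify every VM ID found in
// ChainFire metadata or among live QEMU processes. Names are assumptions.
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Reconciled {
    /// Metadata and a live QEMU both exist: reconnect QMP and adopt it.
    Adopted(String),
    /// Metadata says the VM exists but no QEMU is running: mark it Failed.
    Stale(String),
    /// A live QEMU has no metadata entry: adopt it or shut it down.
    Orphan(String),
}

pub fn reconcile(metadata: &[String], running: &[String]) -> Vec<Reconciled> {
    let live: HashSet<&String> = running.iter().collect();
    let known: HashSet<&String> = metadata.iter().collect();
    let mut out = Vec::new();
    for id in metadata {
        if live.contains(id) {
            out.push(Reconciled::Adopted(id.clone()));
        } else {
            out.push(Reconciled::Stale(id.clone()));
        }
    }
    for id in running {
        if !known.contains(id) {
            out.push(Reconciled::Orphan(id.clone()));
        }
    }
    out
}
```

Option A corresponds to every VM landing in `Adopted`; Option B produces `Stale` or `Orphan` entries; Option C is equivalent to treating every live QEMU as an `Orphan`.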

### Scenario 4: QEMU Process Kill (VM Crash)

```bash
# Kill QEMU directly
kill -9 $(pgrep -f "qemu.*$VM_ID")

# Check PlasmaVMC state
grpcurl -plaintext -d "{\"id\":\"$VM_ID\"}" \
  localhost:50051 plasmavmc.VmService/GetVm

# Expected:
# - State should transition to "Failed" or "Unknown"
# - (Or) state remains stale until the next QMP poll
```
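The capability table above confirms that `plasmavmc-kvm/src/qmp.rs` parses `"inmigrate"`/`"postmigrate"` from QMP; how the server maps a poll result (or a dead QMP socket) to a VM state is what this scenario should establish. One plausible mapping, purely as a sketch (the function and the `"running"`/`None` handling are assumptions, not the actual `qmp.rs` logic):

```rust
// Hypothetical mapping from a QMP status poll to a reported VM state.
// Only "inmigrate"/"postmigrate" parsing is confirmed in qmp.rs; the rest
// of this mapping is an assumption for illustration.
pub fn state_from_qmp(qmp_status: Option<&str>) -> &'static str {
    match qmp_status {
        Some("running") => "Running",
        Some("inmigrate") | Some("postmigrate") => "Migrating",
        Some(_) => "Unknown",
        // QMP socket gone: the QEMU process is likely dead.
        None => "Failed",
    }
}
```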

## Documentation Template

After testing, fill in this table:

| Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required |
| --- | --- | --- | --- |
| PlasmaVMC server crash | N/A | NO | Restart server, reconcile state |
| QEMU process crash | ? seconds | NO | Delete/recreate VM |
| Host reboot | N/A | NO | VMs lost; recreate from metadata |
| Network partition | N/A | NO | No detection mechanism |

## Recommendations for Future Work

Based on test findings, document gaps for future implementation:

1. **Host health monitoring**
   - PlasmaVMC servers should heartbeat to ChainFire
   - Other nodes detect failure via missed heartbeats
   - Estimated effort: Medium
2. **VM state reconciliation**
   - On startup, scan running QEMU processes and match them to ChainFire metadata
   - Handle orphans and stale entries
   - Estimated effort: Medium
3. **Live migration (full)**
   - Requires shared storage, the QMP `migrate` command, and network coordination
   - Estimated effort: Large (weeks)
4. **Cold migration (simpler)**
   - Stop the VM, copy the disk, start on the new host
   - More feasible in the short term
   - Estimated effort: Medium
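For the heartbeat recommendation, the failure-detection rule itself is small: a host is presumed dead once the time since its last heartbeat exceeds some number of allowed missed intervals. A sketch of that predicate, with all names and the specific policy being assumptions:

```rust
// Hypothetical failure-detection predicate for the heartbeat scheme.
// A host is presumed failed once the time since its last heartbeat
// exceeds `missed_allowed` full intervals. All timestamps in seconds.
pub fn is_host_failed(
    last_heartbeat_s: u64,
    now_s: u64,
    interval_s: u64,
    missed_allowed: u64,
) -> bool {
    // saturating_sub guards against clock values that run backwards.
    now_s.saturating_sub(last_heartbeat_s) > interval_s * missed_allowed
}
```

With, say, a 10-second interval and 3 allowed misses, a host whose last heartbeat is 31 seconds old is flagged as failed; the real thresholds would be a tuning decision for T039.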

## Success Criteria for S3

| Criterion | Status |
| --- | --- |
| Current HA capabilities documented | |
| Failure modes tested and recorded | |
| Recovery procedures documented | |
| Gap list with priorities created | |
| No false claims about live migration | |

## Notes

This runbook is intentionally about documenting current behavior, not testing features that do not exist. The value is in:

1. Clarifying what works today
2. Identifying gaps for production readiness
3. Informing T039 (production deployment) requirements