- Replace form_urlencoded with RFC 3986 compliant URI encoding - Implement aws_uri_encode() matching AWS SigV4 spec exactly - Unreserved chars (A-Z,a-z,0-9,-,_,.,~) not encoded - All other chars percent-encoded with uppercase hex - Preserve slashes in paths, encode in query params - Normalize empty paths to '/' per AWS spec - Fix test expectations (body hash, HMAC values) - Add comprehensive SigV4 signature determinism test This fixes the canonicalization mismatch that caused signature validation failures in T047. Auth can now be enabled for production. Refs: T058.S1
4.2 KiB
4.2 KiB
T040.S3 PlasmaVMC HA Behavior Runbook
Objective
Document PlasmaVMC behavior when host fails. This is a gap documentation exercise - live migration is NOT implemented.
Current Capability Assessment
What IS Implemented
| Feature | Status | Location |
|---|---|---|
| VM State tracking | YES | plasmavmc-types/src/vm.rs:56 - VmState::Migrating |
| KVM capability flag | YES | plasmavmc-kvm/src/lib.rs:147 - live_migration: true |
| QMP state parsing | YES | plasmavmc-kvm/src/qmp.rs:99 - parses "inmigrate"/"postmigrate" |
| ChainFire persistence | YES | VM metadata stored in cluster KVS |
What is NOT Implemented (GAPS)
| Feature | Gap | Impact |
|---|---|---|
| Live migration API | No migrate() function |
VMs cannot move between hosts |
| Host failure detection | No health monitoring | VM loss undetected |
| Automatic recovery | No failover logic | Manual intervention required |
| Shared storage | No VM disk migration | Would need shared storage (Ceph/NFS) |
Test Scenarios
Scenario 1: Document Current VM Lifecycle
# Create a VM
grpcurl -plaintext localhost:50051 plasmavmc.VmService/CreateVm \
-d '{"name":"test-vm","vcpus":1,"memory_mb":512}'
# Get VM ID from response
VM_ID="<returned-id>"
# Check VM state
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
-d "{\"id\":\"$VM_ID\"}"
# Expected: VM running on this host
Scenario 2: Host Process Kill (Simulated Failure)
# Kill PlasmaVMC server
kill -9 $(pgrep -f plasmavmc-server)
# QEMU processes continue running (orphaned)
ps aux | grep qemu
# Expected Behavior:
# - QEMU continues (not managed)
# - VM metadata in ChainFire still shows "Running"
# - No automatic recovery
Scenario 3: Restart PlasmaVMC Server
# Restart server
./plasmavmc-server &
# Check if VM is rediscovered
grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms
# Expected Behavior (DOCUMENT):
# Option A: Server reads ChainFire, finds orphan, reconnects QMP
# Option B: Server reads ChainFire, state mismatch (metadata vs reality)
# Option C: Server starts fresh, VMs lost from management
Scenario 4: QEMU Process Kill (VM Crash)
# Kill QEMU directly
kill -9 $(pgrep -f "qemu.*$VM_ID")
# Check PlasmaVMC state
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
-d "{\"id\":\"$VM_ID\"}"
# Expected:
# - State should transition to "Failed" or "Unknown"
# - (Or) State stale until next QMP poll
Documentation Template
After testing, fill in this table:
| Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required |
|---|---|---|---|
| PlasmaVMC server crash | N/A | NO | Restart server, reconcile state |
| QEMU process crash | ? seconds | NO | Delete/recreate VM |
| Host reboot | N/A | NO | VMs lost, recreate from metadata |
| Network partition | N/A | NO | No detection mechanism |
Recommendations for Future Work
Based on test findings, document gaps for future implementation:
-
Host Health Monitoring
- PlasmaVMC servers should heartbeat to ChainFire
- Other nodes detect failure via missed heartbeats
- Estimated effort: Medium
-
VM State Reconciliation
- On startup, scan running QEMUs, match to ChainFire metadata
- Handle orphans and stale entries
- Estimated effort: Medium
-
Live Migration (Full)
- Requires: shared storage, QMP migrate command, network coordination
- Estimated effort: Large (weeks)
-
Cold Migration (Simpler)
- Stop VM, copy disk, start on new host
- More feasible short-term
- Estimated effort: Medium
Success Criteria for S3
| Criterion | Status |
|---|---|
| Current HA capabilities documented | |
| Failure modes tested and recorded | |
| Recovery procedures documented | |
| Gap list with priorities created | |
| No false claims about live migration |
Notes
This runbook is intentionally about documenting current behavior, not testing features that don't exist. The value is in:
- Clarifying what works today
- Identifying gaps for production readiness
- Informing T039 (production deployment) requirements