- Replace form_urlencoded with RFC 3986 compliant URI encoding - Implement aws_uri_encode() matching AWS SigV4 spec exactly - Unreserved chars (A-Z,a-z,0-9,-,_,.,~) not encoded - All other chars percent-encoded with uppercase hex - Preserve slashes in paths, encode in query params - Normalize empty paths to '/' per AWS spec - Fix test expectations (body hash, HMAC values) - Add comprehensive SigV4 signature determinism test This fixes the canonicalization mismatch that caused signature validation failures in T047. Auth can now be enabled for production. Refs: T058.S1
147 lines
4.2 KiB
Markdown
147 lines
4.2 KiB
Markdown
# T040.S3 PlasmaVMC HA Behavior Runbook
|
|
|
|
## Objective
|
|
Document PlasmaVMC behavior when host fails. This is a **gap documentation** exercise - live migration is NOT implemented.
|
|
|
|
## Current Capability Assessment
|
|
|
|
### What IS Implemented
|
|
| Feature | Status | Location |
|
|
|---------|--------|----------|
|
|
| VM State tracking | YES | `plasmavmc-types/src/vm.rs:56` - VmState::Migrating |
|
|
| KVM capability flag | YES | `plasmavmc-kvm/src/lib.rs:147` - `live_migration: true` |
|
|
| QMP state parsing | YES | `plasmavmc-kvm/src/qmp.rs:99` - parses "inmigrate"/"postmigrate" |
|
|
| ChainFire persistence | YES | VM metadata stored in cluster KVS |
|
|
|
|
### What is NOT Implemented (GAPS)
|
|
| Feature | Gap | Impact |
|
|
|---------|-----|--------|
|
|
| Live migration API | No `migrate()` function | VMs cannot move between hosts |
|
|
| Host failure detection | No health monitoring | VM loss undetected |
|
|
| Automatic recovery | No failover logic | Manual intervention required |
|
|
| Shared storage | No VM disk migration | Would need shared storage (Ceph/NFS) |
|
|
|
|
---
|
|
|
|
## Test Scenarios
|
|
|
|
### Scenario 1: Document Current VM Lifecycle
|
|
|
|
```bash
|
|
# Create a VM
|
|
grpcurl -plaintext localhost:50051 plasmavmc.VmService/CreateVm \
|
|
-d '{"name":"test-vm","vcpus":1,"memory_mb":512}'
|
|
|
|
# Get VM ID from response
|
|
VM_ID="<returned-id>"
|
|
|
|
# Check VM state
|
|
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
|
|
-d "{\"id\":\"$VM_ID\"}"
|
|
|
|
# Expected: VM running on this host
|
|
```
|
|
|
|
### Scenario 2: Host Process Kill (Simulated Failure)
|
|
|
|
```bash
|
|
# Kill PlasmaVMC server
|
|
kill -9 $(pgrep -f plasmavmc-server)
|
|
|
|
# QEMU processes continue running (orphaned)
|
|
ps aux | grep qemu
|
|
|
|
# Expected Behavior:
|
|
# - QEMU continues (not managed)
|
|
# - VM metadata in ChainFire still shows "Running"
|
|
# - No automatic recovery
|
|
```
|
|
|
|
### Scenario 3: Restart PlasmaVMC Server
|
|
|
|
```bash
|
|
# Restart server
|
|
./plasmavmc-server &
|
|
|
|
# Check if VM is rediscovered
|
|
grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms
|
|
|
|
# Expected Behavior (DOCUMENT):
|
|
# Option A: Server reads ChainFire, finds orphan, reconnects QMP
|
|
# Option B: Server reads ChainFire, state mismatch (metadata vs reality)
|
|
# Option C: Server starts fresh, VMs lost from management
|
|
```
|
|
|
|
### Scenario 4: QEMU Process Kill (VM Crash)
|
|
|
|
```bash
|
|
# Kill QEMU directly
|
|
kill -9 $(pgrep -f "qemu.*$VM_ID")
|
|
|
|
# Check PlasmaVMC state
|
|
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
|
|
-d "{\"id\":\"$VM_ID\"}"
|
|
|
|
# Expected:
|
|
# - State should transition to "Failed" or "Unknown"
|
|
# - (Or) State stale until next QMP poll
|
|
```
|
|
|
|
---
|
|
|
|
## Documentation Template
|
|
|
|
After testing, fill in this table:
|
|
|
|
| Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required |
|
|
|--------------|----------------|--------------------|-----------------------|
|
|
| PlasmaVMC server crash | N/A | NO | Restart server, reconcile state |
|
|
| QEMU process crash | ? seconds | NO | Delete/recreate VM |
|
|
| Host reboot | N/A | NO | VMs lost, recreate from metadata |
|
|
| Network partition | N/A | NO | No detection mechanism |
|
|
|
|
---
|
|
|
|
## Recommendations for Future Work
|
|
|
|
Based on test findings, document gaps for future implementation:
|
|
|
|
1. **Host Health Monitoring**
|
|
- PlasmaVMC servers should heartbeat to ChainFire
|
|
- Other nodes detect failure via missed heartbeats
|
|
- Estimated effort: Medium
|
|
|
|
2. **VM State Reconciliation**
|
|
- On startup, scan running QEMUs, match to ChainFire metadata
|
|
- Handle orphans and stale entries
|
|
- Estimated effort: Medium
|
|
|
|
3. **Live Migration (Full)**
|
|
- Requires: shared storage, QMP migrate command, network coordination
|
|
- Estimated effort: Large (weeks)
|
|
|
|
4. **Cold Migration (Simpler)**
|
|
- Stop VM, copy disk, start on new host
|
|
- More feasible short-term
|
|
- Estimated effort: Medium
|
|
|
|
---
|
|
|
|
## Success Criteria for S3
|
|
|
|
| Criterion | Status |
|
|
|-----------|--------|
|
|
| Current HA capabilities documented | |
|
|
| Failure modes tested and recorded | |
|
|
| Recovery procedures documented | |
|
|
| Gap list with priorities created | |
|
|
| No false claims about live migration | |
|
|
|
|
---
|
|
|
|
## Notes
|
|
|
|
This runbook is intentionally about **documenting current behavior**, not testing features that don't exist. The value is in:
|
|
1. Clarifying what works today
|
|
2. Identifying gaps for production readiness
|
|
3. Informing T039 (production deployment) requirements
|