# T040.S3 PlasmaVMC HA Behavior Runbook ## Objective Document PlasmaVMC behavior when host fails. This is a **gap documentation** exercise - live migration is NOT implemented. ## Current Capability Assessment ### What IS Implemented | Feature | Status | Location | |---------|--------|----------| | VM State tracking | YES | `plasmavmc-types/src/vm.rs:56` - VmState::Migrating | | KVM capability flag | YES | `plasmavmc-kvm/src/lib.rs:147` - `live_migration: true` | | QMP state parsing | YES | `plasmavmc-kvm/src/qmp.rs:99` - parses "inmigrate"/"postmigrate" | | ChainFire persistence | YES | VM metadata stored in cluster KVS | ### What is NOT Implemented (GAPS) | Feature | Gap | Impact | |---------|-----|--------| | Live migration API | No `migrate()` function | VMs cannot move between hosts | | Host failure detection | No health monitoring | VM loss undetected | | Automatic recovery | No failover logic | Manual intervention required | | Shared storage | No VM disk migration | Would need shared storage (Ceph/NFS) | --- ## Test Scenarios ### Scenario 1: Document Current VM Lifecycle ```bash # Create a VM grpcurl -plaintext localhost:50051 plasmavmc.VmService/CreateVm \ -d '{"name":"test-vm","vcpus":1,"memory_mb":512}' # Get VM ID from response VM_ID="" # Check VM state grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \ -d "{\"id\":\"$VM_ID\"}" # Expected: VM running on this host ``` ### Scenario 2: Host Process Kill (Simulated Failure) ```bash # Kill PlasmaVMC server kill -9 $(pgrep -f plasmavmc-server) # QEMU processes continue running (orphaned) ps aux | grep qemu # Expected Behavior: # - QEMU continues (not managed) # - VM metadata in ChainFire still shows "Running" # - No automatic recovery ``` ### Scenario 3: Restart PlasmaVMC Server ```bash # Restart server ./plasmavmc-server & # Check if VM is rediscovered grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms # Expected Behavior (DOCUMENT): # Option A: Server reads ChainFire, finds orphan, reconnects QMP # Option B: Server reads ChainFire, state mismatch (metadata vs reality) # Option C: Server starts fresh, VMs lost from management ``` ### Scenario 4: QEMU Process Kill (VM Crash) ```bash # Kill QEMU directly kill -9 $(pgrep -f "qemu.*$VM_ID") # Check PlasmaVMC state grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \ -d "{\"id\":\"$VM_ID\"}" # Expected: # - State should transition to "Failed" or "Unknown" # - (Or) State stale until next QMP poll ``` --- ## Documentation Template After testing, fill in this table: | Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required | |--------------|----------------|--------------------|-----------------------| | PlasmaVMC server crash | N/A | NO | Restart server, reconcile state | | QEMU process crash | ? seconds | NO | Delete/recreate VM | | Host reboot | N/A | NO | VMs lost, recreate from metadata | | Network partition | N/A | NO | No detection mechanism | --- ## Recommendations for Future Work Based on test findings, document gaps for future implementation: 1. **Host Health Monitoring** - PlasmaVMC servers should heartbeat to ChainFire - Other nodes detect failure via missed heartbeats - Estimated effort: Medium 2. **VM State Reconciliation** - On startup, scan running QEMUs, match to ChainFire metadata - Handle orphans and stale entries - Estimated effort: Medium 3. **Live Migration (Full)** - Requires: shared storage, QMP migrate command, network coordination - Estimated effort: Large (weeks) 4. **Cold Migration (Simpler)** - Stop VM, copy disk, start on new host - More feasible short-term - Estimated effort: Medium --- ## Success Criteria for S3 | Criterion | Status | |-----------|--------| | Current HA capabilities documented | | | Failure modes tested and recorded | | | Recovery procedures documented | | | Gap list with priorities created | | | No false claims about live migration | | --- ## Notes This runbook is intentionally about **documenting current behavior**, not testing features that don't exist. The value is in: 1. Clarifying what works today 2. Identifying gaps for production readiness 3. Informing T039 (production deployment) requirements