photoncloud-monorepo/docs/por/T040-ha-validation/s3-plasmavmc-ha-runbook.md

# T040.S3 PlasmaVMC HA Behavior Runbook

## Objective
Document PlasmaVMC behavior when host fails. This is a **gap documentation** exercise - live migration is NOT implemented.

## Current Capability Assessment

### What IS Implemented
| Feature | Status | Location |
|---------|--------|----------|
| VM State tracking | YES | `plasmavmc-types/src/vm.rs:56` - VmState::Migrating |
| KVM capability flag | YES | `plasmavmc-kvm/src/lib.rs:147` - `live_migration: true` |
| QMP state parsing | YES | `plasmavmc-kvm/src/qmp.rs:99` - parses "inmigrate"/"postmigrate" |
| ChainFire persistence | YES | VM metadata stored in cluster KVS |

### What is NOT Implemented (GAPS)
| Feature | Gap | Impact |
|---------|-----|--------|
| Live migration API | No `migrate()` function | VMs cannot move between hosts |
| Host failure detection | No health monitoring | VM loss undetected |
| Automatic recovery | No failover logic | Manual intervention required |
| Shared storage | No VM disk migration | Would need shared storage (Ceph/NFS) |

---

## Test Scenarios

### Scenario 1: Document Current VM Lifecycle

```bash
# Create a VM
grpcurl -plaintext localhost:50051 plasmavmc.VmService/CreateVm \
  -d '{"name":"test-vm","vcpus":1,"memory_mb":512}'

# Get VM ID from response
VM_ID="<returned-id>"

# Check VM state
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
  -d "{\"id\":\"$VM_ID\"}"

# Expected: VM running on this host
```

### Scenario 2: Host Process Kill (Simulated Failure)

```bash
# Kill PlasmaVMC server
kill -9 $(pgrep -f plasmavmc-server)

# QEMU processes continue running (orphaned)
ps aux | grep qemu

# Expected Behavior:
# - QEMU continues (not managed)
# - VM metadata in ChainFire still shows "Running"
# - No automatic recovery
```

### Scenario 3: Restart PlasmaVMC Server

```bash
# Restart server
./plasmavmc-server &

# Check if VM is rediscovered
grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms

# Expected Behavior (DOCUMENT):
# Option A: Server reads ChainFire, finds orphan, reconnects QMP
# Option B: Server reads ChainFire, state mismatch (metadata vs reality)
# Option C: Server starts fresh, VMs lost from management
```

### Scenario 4: QEMU Process Kill (VM Crash)

```bash
# Kill QEMU directly
kill -9 $(pgrep -f "qemu.*$VM_ID")

# Check PlasmaVMC state
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
  -d "{\"id\":\"$VM_ID\"}"

# Expected:
# - State should transition to "Failed" or "Unknown"
# - (Or) State stale until next QMP poll
```

---

## Documentation Template

After testing, fill in this table:

| Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required |
|--------------|----------------|--------------------|-----------------------|
| PlasmaVMC server crash | N/A | NO | Restart server, reconcile state |
| QEMU process crash | ? seconds | NO | Delete/recreate VM |
| Host reboot | N/A | NO | VMs lost, recreate from metadata |
| Network partition | N/A | NO | No detection mechanism |

---

## Recommendations for Future Work

Based on test findings, document gaps for future implementation:

1. **Host Health Monitoring**
   - PlasmaVMC servers should heartbeat to ChainFire
   - Other nodes detect failure via missed heartbeats
   - Estimated effort: Medium

2. **VM State Reconciliation**
   - On startup, scan running QEMUs, match to ChainFire metadata
   - Handle orphans and stale entries
   - Estimated effort: Medium

3. **Live Migration (Full)**
   - Requires: shared storage, QMP migrate command, network coordination
   - Estimated effort: Large (weeks)

4. **Cold Migration (Simpler)**
   - Stop VM, copy disk, start on new host
   - More feasible short-term
   - Estimated effort: Medium

---

## Success Criteria for S3

| Criterion | Status |
|-----------|--------|
| Current HA capabilities documented | |
| Failure modes tested and recorded | |
| Recovery procedures documented | |
| Gap list with priorities created | |
| No false claims about live migration | |

---

## Notes

This runbook is intentionally about **documenting current behavior**, not testing features that don't exist. The value is in:
1. Clarifying what works today
2. Identifying gaps for production readiness
3. Informing T039 (production deployment) requirements