
# T040.S3 PlasmaVMC HA Behavior Runbook

## Objective

Document how PlasmaVMC behaves when its host fails. This is a gap-documentation exercise: live migration is NOT implemented.

## Current Capability Assessment

### What IS Implemented

| Feature | Status | Location |
| --- | --- | --- |
| VM state tracking | YES | `plasmavmc-types/src/vm.rs:56` - `VmState::Migrating` |
| KVM capability flag | YES | `plasmavmc-kvm/src/lib.rs:147` - `live_migration: true` |
| QMP state parsing | YES | `plasmavmc-kvm/src/qmp.rs:99` - parses `"inmigrate"`/`"postmigrate"` |
| ChainFire persistence | YES | VM metadata stored in the cluster KVS |
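The table above notes that `VmState::Migrating` exists even though no migration path uses it. As a hedged illustration only, a state enum with a transition check might look like the sketch below; every variant name other than `Migrating`, and the `can_transition` helper, are assumptions for this runbook, not the actual contents of `plasmavmc-types`.

```rust
// Hypothetical sketch of a VM state machine. Only `Migrating` is confirmed
// to exist in plasmavmc-types; the other variants and the transition rules
// here are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VmState {
    Creating,
    Running,
    Migrating,
    Stopped,
    Failed,
}

/// Returns true if `from -> to` is an allowed transition in this sketch.
pub fn can_transition(from: VmState, to: VmState) -> bool {
    use VmState::*;
    matches!(
        (from, to),
        (Creating, Running)
            | (Running, Migrating)
            | (Migrating, Running)
            | (Running, Stopped)
            | (_, Failed) // any state may fail
    )
}
```

Note that `Migrating` is reachable in this sketch even though, as documented below, nothing in the codebase ever drives a VM into it.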

### What is NOT Implemented (GAPS)

| Feature | Gap | Impact |
| --- | --- | --- |
| Live migration API | No `migrate()` function | VMs cannot move between hosts |
| Host failure detection | No health monitoring | VM loss goes undetected |
| Automatic recovery | No failover logic | Manual intervention required |
| Shared storage | No VM disk migration | Would require shared storage (Ceph/NFS) |

## Test Scenarios

### Scenario 1: Document Current VM Lifecycle

```bash
# Create a VM (grpcurl requires -d before the address and method)
grpcurl -plaintext -d '{"name":"test-vm","vcpus":1,"memory_mb":512}' \
  localhost:50051 plasmavmc.VmService/CreateVm

# Get the VM ID from the response
VM_ID="<returned-id>"

# Check VM state
grpcurl -plaintext -d "{\"id\":\"$VM_ID\"}" \
  localhost:50051 plasmavmc.VmService/GetVm

# Expected: VM running on this host
```

### Scenario 2: Host Process Kill (Simulated Failure)

```bash
# Kill the PlasmaVMC server
kill -9 $(pgrep -f plasmavmc-server)

# QEMU processes continue running (orphaned)
ps aux | grep qemu

# Expected behavior:
# - QEMU continues (no longer managed)
# - VM metadata in ChainFire still shows "Running"
# - No automatic recovery
```

### Scenario 3: Restart PlasmaVMC Server

```bash
# Restart the server
./plasmavmc-server &

# Check whether the VM is rediscovered
grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms

# Expected behavior (DOCUMENT which one occurs):
# Option A: Server reads ChainFire, finds the orphan, reconnects QMP
# Option B: Server reads ChainFire, state mismatch (metadata vs reality)
# Option C: Server starts fresh, VMs lost from management
```
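Whichever option the server actually takes, the underlying question is the same: on startup, compare ChainFire metadata against the set of live QEMU processes and classify each VM. A minimal sketch of that comparison, using VM IDs as plain strings (the `Reconciled` type and `reconcile` function are hypothetical names, not existing code):

```rust
// Hypothetical startup-reconciliation sketch: classify every VM ID found in
// ChainFire metadata or among live QEMU processes. Names are assumptions.
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Reconciled {
    /// Metadata and a live QEMU both exist: reconnect QMP and adopt it.
    Adopted(String),
    /// Metadata says the VM exists but no QEMU is running: mark it Failed.
    Stale(String),
    /// A live QEMU has no metadata entry: adopt it or shut it down.
    Orphan(String),
}

pub fn reconcile(metadata: &[String], running: &[String]) -> Vec<Reconciled> {
    let live: HashSet<&String> = running.iter().collect();
    let known: HashSet<&String> = metadata.iter().collect();
    let mut out = Vec::new();
    for id in metadata {
        if live.contains(id) {
            out.push(Reconciled::Adopted(id.clone()));
        } else {
            out.push(Reconciled::Stale(id.clone()));
        }
    }
    for id in running {
        if !known.contains(id) {
            out.push(Reconciled::Orphan(id.clone()));
        }
    }
    out
}
```

Option A corresponds to every VM landing in `Adopted`; Option B produces `Stale` or `Orphan` entries; Option C is equivalent to treating every live QEMU as an `Orphan`.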

### Scenario 4: QEMU Process Kill (VM Crash)

```bash
# Kill QEMU directly
kill -9 $(pgrep -f "qemu.*$VM_ID")

# Check PlasmaVMC state
grpcurl -plaintext -d "{\"id\":\"$VM_ID\"}" \
  localhost:50051 plasmavmc.VmService/GetVm

# Expected:
# - State should transition to "Failed" or "Unknown"
# - (Or) state remains stale until the next QMP poll
```
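The capability table above confirms that `plasmavmc-kvm/src/qmp.rs` parses `"inmigrate"`/`"postmigrate"` from QMP; how the server maps a poll result (or a dead QMP socket) to a VM state is what this scenario should establish. One plausible mapping, purely as a sketch (the function and the `"running"`/`None` handling are assumptions, not the actual `qmp.rs` logic):

```rust
// Hypothetical mapping from a QMP status poll to a reported VM state.
// Only "inmigrate"/"postmigrate" parsing is confirmed in qmp.rs; the rest
// of this mapping is an assumption for illustration.
pub fn state_from_qmp(qmp_status: Option<&str>) -> &'static str {
    match qmp_status {
        Some("running") => "Running",
        Some("inmigrate") | Some("postmigrate") => "Migrating",
        Some(_) => "Unknown",
        // QMP socket gone: the QEMU process is likely dead.
        None => "Failed",
    }
}
```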

## Documentation Template

After testing, fill in this table:

| Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required |
| --- | --- | --- | --- |
| PlasmaVMC server crash | N/A | NO | Restart server, reconcile state |
| QEMU process crash | ? seconds | NO | Delete/recreate VM |
| Host reboot | N/A | NO | VMs lost; recreate from metadata |
| Network partition | N/A | NO | No detection mechanism |

## Recommendations for Future Work

Based on test findings, document gaps for future implementation:

1. **Host health monitoring**
   - PlasmaVMC servers should heartbeat to ChainFire
   - Other nodes detect failure via missed heartbeats
   - Estimated effort: Medium
2. **VM state reconciliation**
   - On startup, scan running QEMU processes and match them to ChainFire metadata
   - Handle orphans and stale entries
   - Estimated effort: Medium
3. **Live migration (full)**
   - Requires shared storage, the QMP `migrate` command, and network coordination
   - Estimated effort: Large (weeks)
4. **Cold migration (simpler)**
   - Stop the VM, copy the disk, start on the new host
   - More feasible in the short term
   - Estimated effort: Medium
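For the heartbeat recommendation, the failure-detection rule itself is small: a host is presumed dead once the time since its last heartbeat exceeds some number of allowed missed intervals. A sketch of that predicate, with all names and the specific policy being assumptions:

```rust
// Hypothetical failure-detection predicate for the heartbeat scheme.
// A host is presumed failed once the time since its last heartbeat
// exceeds `missed_allowed` full intervals. All timestamps in seconds.
pub fn is_host_failed(
    last_heartbeat_s: u64,
    now_s: u64,
    interval_s: u64,
    missed_allowed: u64,
) -> bool {
    // saturating_sub guards against clock values that run backwards.
    now_s.saturating_sub(last_heartbeat_s) > interval_s * missed_allowed
}
```

With, say, a 10-second interval and 3 allowed misses, a host whose last heartbeat is 31 seconds old is flagged as failed; the real thresholds would be a tuning decision for T039.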

## Success Criteria for S3

| Criterion | Status |
| --- | --- |
| Current HA capabilities documented | |
| Failure modes tested and recorded | |
| Recovery procedures documented | |
| Gap list with priorities created | |
| No false claims about live migration | |

## Notes

This runbook is intentionally about documenting current behavior, not testing features that do not exist. The value is in:

1. Clarifying what works today
2. Identifying gaps for production readiness
3. Informing T039 (production deployment) requirements