photoncloud-monorepo/docs/por/T040-ha-validation/s3-plasmavmc-ha-runbook.md
centra d2149b6249 fix(lightningstor): Fix SigV4 canonicalization for AWS S3 auth
- Replace form_urlencoded with RFC 3986 compliant URI encoding
- Implement aws_uri_encode() matching AWS SigV4 spec exactly
- Unreserved chars (A-Z,a-z,0-9,-,_,.,~) not encoded
- All other chars percent-encoded with uppercase hex
- Preserve slashes in paths, encode in query params
- Normalize empty paths to '/' per AWS spec
- Fix test expectations (body hash, HMAC values)
- Add comprehensive SigV4 signature determinism test

This fixes the canonicalization mismatch that caused signature
validation failures in T047. Auth can now be enabled for production.

Refs: T058.S1
2025-12-12 06:23:46 +09:00

147 lines
4.2 KiB
Markdown

# T040.S3 PlasmaVMC HA Behavior Runbook
## Objective
Document PlasmaVMC behavior when host fails. This is a **gap documentation** exercise - live migration is NOT implemented.
## Current Capability Assessment
### What IS Implemented
| Feature | Status | Location |
|---------|--------|----------|
| VM State tracking | YES | `plasmavmc-types/src/vm.rs:56` - VmState::Migrating |
| KVM capability flag | YES | `plasmavmc-kvm/src/lib.rs:147` - `live_migration: true` |
| QMP state parsing | YES | `plasmavmc-kvm/src/qmp.rs:99` - parses "inmigrate"/"postmigrate" |
| ChainFire persistence | YES | VM metadata stored in cluster KVS |
### What is NOT Implemented (GAPS)
| Feature | Gap | Impact |
|---------|-----|--------|
| Live migration API | No `migrate()` function | VMs cannot move between hosts |
| Host failure detection | No health monitoring | VM loss undetected |
| Automatic recovery | No failover logic | Manual intervention required |
| Shared storage | No VM disk migration | Would need shared storage (Ceph/NFS) |
---
## Test Scenarios
### Scenario 1: Document Current VM Lifecycle
```bash
# Create a VM
grpcurl -plaintext localhost:50051 plasmavmc.VmService/CreateVm \
-d '{"name":"test-vm","vcpus":1,"memory_mb":512}'
# Get VM ID from response
VM_ID="<returned-id>"
# Check VM state
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
-d "{\"id\":\"$VM_ID\"}"
# Expected: VM running on this host
```
### Scenario 2: Host Process Kill (Simulated Failure)
```bash
# Kill PlasmaVMC server
kill -9 $(pgrep -f plasmavmc-server)
# QEMU processes continue running (orphaned)
ps aux | grep qemu
# Expected Behavior:
# - QEMU continues (not managed)
# - VM metadata in ChainFire still shows "Running"
# - No automatic recovery
```
### Scenario 3: Restart PlasmaVMC Server
```bash
# Restart server
./plasmavmc-server &
# Check if VM is rediscovered
grpcurl -plaintext localhost:50051 plasmavmc.VmService/ListVms
# Expected Behavior (DOCUMENT):
# Option A: Server reads ChainFire, finds orphan, reconnects QMP
# Option B: Server reads ChainFire, state mismatch (metadata vs reality)
# Option C: Server starts fresh, VMs lost from management
```
### Scenario 4: QEMU Process Kill (VM Crash)
```bash
# Kill QEMU directly
kill -9 $(pgrep -f "qemu.*$VM_ID")
# Check PlasmaVMC state
grpcurl -plaintext localhost:50051 plasmavmc.VmService/GetVm \
-d "{\"id\":\"$VM_ID\"}"
# Expected:
# - State should transition to "Failed" or "Unknown"
# - (Or) State stale until next QMP poll
```
---
## Documentation Template
After testing, fill in this table:
| Failure Mode | Detection Time | Automatic Recovery? | Manual Steps Required |
|--------------|----------------|--------------------|-----------------------|
| PlasmaVMC server crash | N/A | NO | Restart server, reconcile state |
| QEMU process crash | ? seconds | NO | Delete/recreate VM |
| Host reboot | N/A | NO | VMs lost, recreate from metadata |
| Network partition | N/A | NO | No detection mechanism |
---
## Recommendations for Future Work
Based on test findings, document gaps for future implementation:
1. **Host Health Monitoring**
- PlasmaVMC servers should heartbeat to ChainFire
- Other nodes detect failure via missed heartbeats
- Estimated effort: Medium
2. **VM State Reconciliation**
- On startup, scan running QEMUs, match to ChainFire metadata
- Handle orphans and stale entries
- Estimated effort: Medium
3. **Live Migration (Full)**
- Requires: shared storage, QMP migrate command, network coordination
- Estimated effort: Large (weeks)
4. **Cold Migration (Simpler)**
- Stop VM, copy disk, start on new host
- More feasible short-term
- Estimated effort: Medium
---
## Success Criteria for S3
| Criterion | Status |
|-----------|--------|
| Current HA capabilities documented | |
| Failure modes tested and recorded | |
| Recovery procedures documented | |
| Gap list with priorities created | |
| No false claims about live migration | |
---
## Notes
This runbook is intentionally about **documenting current behavior**, not testing features that don't exist. The value is in:
1. Clarifying what works today
2. Identifying gaps for production readiness
3. Informing T039 (production deployment) requirements