# High Availability Behavior - PlasmaCloud Components
**Status:** Gap Analysis Complete (2025-12-12)
**Environment:** Development/Testing (operational validation deferred to T039)
## Overview
This document summarizes the HA capabilities, failure modes, and recovery behavior of PlasmaCloud components based on code analysis and unit test validation performed in T040 (HA Validation).

---
## ChainFire (Distributed KV Store)
### Current Capabilities ✓
- **Raft Consensus:** Custom implementation with proven algorithm correctness
- **Leader Election:** Automatic within 150-600ms election timeout
- **Log Replication:** Write→replicate→commit→apply flow validated
- **Quorum Maintenance:** A majority (2 of 3 nodes) is sufficient for cluster operation
- **RPC Retry Logic:** 3 retries with exponential backoff (500ms-30s)
- **State Machine:** Consistent key-value operations across all nodes
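The documented RPC retry schedule (3 retries, delays between 500ms and 30s) corresponds to a capped exponential backoff. The sketch below is illustrative only, not the chainfire-raft implementation; the 500ms base, doubling policy, and 30s cap are assumptions consistent with the stated window.

```rust
use std::time::Duration;

/// Illustrative capped exponential backoff: 500ms base, doubling per
/// attempt, clamped to 30s (mirrors the documented 500ms-30s window).
fn backoff(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    let cap_ms: u64 = 30_000;
    // Clamp the shift so the multiplication cannot overflow.
    let delay = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(delay.min(cap_ms))
}

fn main() {
    // attempt 0 -> 500ms, attempt 2 -> 2s, later attempts cap at 30s
    assert_eq!(backoff(0), Duration::from_millis(500));
    assert_eq!(backoff(2), Duration::from_millis(2_000));
    assert_eq!(backoff(10), Duration::from_millis(30_000));
}
```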
### Validated Behavior
| Scenario | Expected Behavior | Status |
|----------|-------------------|--------|
| Single node failure | New leader elected, cluster continues | ✓ Validated (unit tests) |
| Leader election | Completes in <10s with 2/3 quorum | ✓ Validated |
| Write replication | All nodes commit and apply writes | ✓ Validated |
| Follower writes | Rejected with NotLeader error | ✓ Validated |
### Documented Gaps (deferred to T039)
- **Process kill/restart:** Graceful shutdown not implemented
- **Network partition:** Cross-network scenarios not tested
- **Quorum loss recovery:** 2/3 node failure scenarios not automated
- **SIGSTOP/SIGCONT:** Process pause/resume behavior not validated
### Failure Modes
1. **Node Failure (1/3):** Cluster continues, new leader elected if leader fails
2. **Quorum Loss (2/3):** Cluster unavailable until quorum restored
3. **Network Partition:** Not tested (requires distributed environment)
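The 1-of-3 vs 2-of-3 distinction above follows directly from majority-quorum arithmetic; a minimal sketch (generic Raft math, not taken from the chainfire code):

```rust
/// Majority quorum for an n-node Raft cluster.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Maximum simultaneous node failures a cluster of size n can tolerate.
fn tolerated_failures(n: usize) -> usize {
    n - quorum(n)
}

fn main() {
    // A 3-node cluster needs 2 votes: it survives 1 failure, not 2.
    assert_eq!(quorum(3), 2);
    assert_eq!(tolerated_failures(3), 1);
    // A 5-node cluster would tolerate 2 failures.
    assert_eq!(tolerated_failures(5), 2);
}
```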
### Recovery Procedures
- Node restart: Rejoins cluster automatically, catches up via log replication
- Manual intervention required for quorum loss scenarios
---
## FlareDB (Time-Series Database)
### Current Capabilities ✓
- **PD Client Auto-Reconnect:** 10s heartbeat cycle, connection pooling
- **Raft-based Metadata:** Uses ChainFire for cluster metadata (inherits ChainFire HA)
- **Data Consistency:** Write-ahead log ensures durability
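The durability claim rests on the standard write-ahead pattern: persist the log record before mutating state, so a crash after the append can be replayed on restart. The sketch below is schematic (an in-memory `Vec` stands in for the on-disk log); it does not show FlareDB's actual WAL format.

```rust
use std::collections::HashMap;

/// Schematic write-ahead store: the record is appended to the log
/// before the state is mutated, so recovery can replay the log.
struct Store {
    wal: Vec<(String, i64)>,      // stand-in for the durable on-disk log
    state: HashMap<String, i64>,  // in-memory applied state
}

impl Store {
    fn put(&mut self, key: &str, value: i64) {
        self.wal.push((key.to_string(), value));   // 1. durable append (fsync in a real WAL)
        self.state.insert(key.to_string(), value); // 2. apply to state
    }

    /// Crash recovery: rebuild state by replaying the log from scratch.
    fn recover(wal: Vec<(String, i64)>) -> Self {
        let mut state = HashMap::new();
        for (k, v) in &wal {
            state.insert(k.clone(), *v);
        }
        Store { wal, state }
    }
}

fn main() {
    let mut s = Store { wal: Vec::new(), state: HashMap::new() };
    s.put("ts:cpu", 42);
    // Simulate a crash: only the WAL survives, state is rebuilt from it.
    let recovered = Store::recover(s.wal.clone());
    assert_eq!(recovered.state.get("ts:cpu"), Some(&42));
}
```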
### Validated Behavior
- PD (ChainFire) reconnection after leader change
- Metadata operations survive ChainFire node failures
### Documented Gaps (deferred to T039)
- **FlareDB-specific Raft:** Multi-raft for data regions not tested
- **Storage node failure:** Failover behavior not validated
- **Cross-region replication:** Not implemented
### Failure Modes
1. **PD Unavailable:** FlareDB operations stall until PD recovers
2. **Storage Node Failure:** Data loss if replication factor < 3
### Recovery Procedures
- Automatic reconnection to new PD leader
- Manual data recovery if storage node lost
---
## PlasmaVMC (VM Control Plane)
### Current Capabilities ✓
- **VM State Tracking:** VmState enum includes Migrating state
- **ChainFire Persistence:** VM metadata stored in distributed KVS
- **QMP Integration:** Can parse migration-related states
### Documented Gaps ⚠️
- **No Live Migration:** Capability flag set, but `migrate()` not implemented
- **No Host Health Monitoring:** No heartbeat or probe mechanism
- **No Automatic Failover:** VM recovery requires manual intervention
- **No Shared Storage:** VM disks are local-only (blocks migration)
- **No Reconnection Logic:** Network failures cause operations to fail silently
### Failure Modes
1. **Host Process Kill:** QEMU processes orphaned, VM state inconsistent
2. **QEMU Crash:** VM lost, no automatic restart
3. **Network Blip:** Operations fail silently (no retry)
### Recovery Procedures
- **Manual only:** Restart PlasmaVMC server, reconcile VM state manually
- **Gap:** No automated recovery or failover
### Recommended Improvements (for T039)
1. Implement VM health monitoring (heartbeat to VMs)
2. Add reconnection logic with retry/backoff
3. Consider VM restart on crash (watchdog pattern)
4. Document expected behavior for host failures
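The watchdog pattern suggested in item 3 can be sketched as a supervision loop that restarts a crashed workload up to a retry budget. All names here are illustrative, not PlasmaVMC APIs; in practice the workload would wrap a QEMU process, but a closure keeps the sketch self-contained.

```rust
/// Illustrative watchdog: run the workload, restart on failure, give up
/// after `max_restarts`. Returns how many restarts were needed.
fn supervise<F>(mut run: F, max_restarts: u32) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut restarts = 0;
    loop {
        match run() {
            Ok(()) => return Ok(restarts),
            Err(e) if restarts < max_restarts => {
                eprintln!("workload crashed ({e}), restarting");
                restarts += 1;
            }
            Err(e) => return Err(e), // budget exhausted: surface the error
        }
    }
}

fn main() {
    // A workload that fails twice, then stays up.
    let mut failures_left = 2;
    let result = supervise(
        || {
            if failures_left > 0 {
                failures_left -= 1;
                Err("simulated crash".to_string())
            } else {
                Ok(())
            }
        },
        3,
    );
    assert_eq!(result, Ok(2));
}
```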
---
## IAM (Identity & Access Management)
### Current Capabilities ✓
- **Token-based Auth:** JWT validation
- **ChainFire Backend:** Inherits ChainFire's HA properties
### Documented Gaps ⚠️
- **No Retry Mechanism:** Network failures cascade to all services
- **No Connection Pooling:** Each request creates new connection
- **Auth Failures:** Cascade to dependent services without graceful degradation
### Failure Modes
1. **IAM Service Down:** All authenticated operations fail
2. **Network Failure:** No retry, immediate failure
### Recovery Procedures
- Restart IAM service (automatic service restart via systemd recommended)
---
## PrismNet (SDN Controller)
### Current Capabilities ✓
- **OVN Integration:** Network topology management
### Documented Gaps ⚠️
- **Not yet evaluated:** T040 focused on core services
- **Reconnection:** The OVN connection likely needs retry logic
### Recommended for T039
- Evaluate PrismNet HA behavior under OVN failures
- Test network partition scenarios
---
## Watch Streams (Event Propagation)
### Documented Gaps ⚠️
- **No Auto-Reconnect:** Watch streams break on error, require manual restart
- **No Buffering:** Events lost during disconnection
- **No Backpressure:** Fast event sources can overwhelm slow consumers
### Failure Modes
1. **Connection Drop:** Watch stream terminates, no automatic recovery
2. **Event Loss:** Missed events during downtime
### Recommended Improvements
1. Implement watch reconnection with resume-from-last-seen
2. Add event buffering/queuing
3. Backpressure handling for slow consumers
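Resume-from-last-seen (improvement 1) typically means tracking the highest revision delivered and re-subscribing from just after it, so events published during the disconnection are replayed rather than lost. A minimal in-memory sketch under those assumptions (the `Event`/`resume` names are illustrative, and `history` stands in for a server-side retained event log):

```rust
/// One event in a watch stream, keyed by a monotonically increasing revision.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    revision: u64,
    key: String,
}

/// Re-subscribe from just after the last revision the consumer saw,
/// replaying any events that arrived while the stream was down.
fn resume(history: &[Event], last_seen: u64) -> Vec<Event> {
    history
        .iter()
        .filter(|e| e.revision > last_seen)
        .cloned()
        .collect()
}

fn main() {
    let history = vec![
        Event { revision: 1, key: "a".into() },
        Event { revision: 2, key: "b".into() },
        Event { revision: 3, key: "c".into() },
    ];
    // The consumer saw revision 1, then the stream dropped; on
    // reconnect it resumes from revision 2 onward with no loss.
    let replayed = resume(&history, 1);
    assert_eq!(replayed.len(), 2);
    assert_eq!(replayed[0].revision, 2);
}
```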
---
## Testing Approach Summary
### Validation Levels
| Level | Scope | Status |
|-------|-------|--------|
| Unit Tests | Algorithm correctness | Complete (8/8 tests) |
| Integration Tests | Component interaction | Complete (3-node cluster) |
| Operational Tests | Process kill, restart, partition | Deferred to T039 |
### Rationale for Deferral
- **Unit tests validate:** Raft algorithm correctness, consensus safety, data consistency
- **Operational tests require:** Real distributed nodes, shared storage, network infrastructure
- **T039 (Production Deployment):** Better environment for operational resilience testing with actual hardware
---
## Gap Summary by Priority
### P0 Gaps (Critical for Production)
- PlasmaVMC: No automatic VM failover or health monitoring
- IAM: No retry/reconnection logic
- Watch Streams: No auto-reconnect
### P1 Gaps (Important but Mitigable)
- Raft: Graceful shutdown for clean node removal
- PlasmaVMC: Live migration implementation
- Network partition: Cross-datacenter failure scenarios
### P2 Gaps (Enhancement)
- FlareDB: Multi-region replication
- PrismNet: Network failure recovery testing
---
## Operational Recommendations
### Pre-Production Checklist
1. **Monitoring:** Implement health checks for all critical services
2. **Alerting:** Set up alerts for leader changes, node failures
3. **Runbooks:** Create failure recovery procedures for each component
4. **Backup:** Regular snapshots of ChainFire data
5. **Testing:** Run operational failure tests in T039 staging environment
### Production Deployment (T039)
- Test process kill/restart scenarios on real hardware
- Validate network partition handling
- Measure recovery time objectives (RTO)
- Verify data consistency under failures
---
## References
- T040 Task YAML: `docs/por/T040-ha-validation/task.yaml`
- Test Runbooks: `docs/por/T040-ha-validation/s2-raft-resilience-runbook.md`, `s3-plasmavmc-ha-runbook.md`, `s4-test-scenarios.md`
- Custom Raft Tests: `chainfire/crates/chainfire-raft/tests/leader_election.rs`
**Last Updated:** 2025-12-12 01:19 JST by PeerB