# High Availability Behavior - PlasmaCloud Components

**Status:** Gap Analysis Complete (2025-12-12)
**Environment:** Development/Testing (operational validation deferred to T039)

## Overview

This document summarizes the HA capabilities, failure modes, and recovery behavior of PlasmaCloud components based on code analysis and unit test validation performed in T040 (HA Validation).

## ChainFire (Distributed KV Store)

### Current Capabilities ✓

- Raft Consensus: Custom implementation with proven algorithm correctness
- Leader Election: Automatic within 150-600ms election timeout
- Log Replication: Write→replicate→commit→apply flow validated
- Quorum Maintenance: 2/3 nodes sufficient for cluster operation
- RPC Retry Logic: 3 retries with exponential backoff (500ms-30s)
- State Machine: Consistent key-value operations across all nodes
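
The retry policy above (3 retries, exponential backoff from 500ms capped at 30s) can be sketched as a synchronous loop. This is a minimal illustration under stated assumptions — `retry_with_backoff` and the closure-based `rpc` are hypothetical stand-ins, not the actual ChainFire API:

```rust
use std::cmp;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical sketch of the retry policy described above: up to 3 retries
/// with exponential backoff starting at 500ms and capped at 30s. Not the
/// actual ChainFire API; `rpc` stands in for a synchronous RPC attempt.
fn retry_with_backoff<T, E>(mut rpc: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let base = Duration::from_millis(500);
    let cap = Duration::from_secs(30);
    let mut attempt = 0u32;
    loop {
        match rpc() {
            Ok(v) => return Ok(v),
            Err(_) if attempt < 3 => {
                // Back off before retrying: 500ms, 1s, 2s (doubling, capped at 30s).
                sleep(cmp::min(base * 2u32.pow(attempt), cap));
                attempt += 1;
            }
            // Retries exhausted: surface the last error to the caller.
            Err(e) => return Err(e),
        }
    }
}
```

A production client would typically also add jitter to the delay to avoid synchronized retry storms; the sketch omits that for brevity.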

### Validated Behavior

| Scenario | Expected Behavior | Status |
|---|---|---|
| Single node failure | New leader elected, cluster continues | ✓ Validated (unit tests) |
| Leader election | Completes in <10s with 2/3 quorum | ✓ Validated |
| Write replication | All nodes commit and apply writes | ✓ Validated |
| Follower writes | Rejected with NotLeader error | ✓ Validated |

### Documented Gaps (deferred to T039)

- Process kill/restart: Graceful shutdown not implemented
- Network partition: Cross-network scenarios not tested
- Quorum loss recovery: 2/3 node failure scenarios not automated
- SIGSTOP/SIGCONT: Process pause/resume behavior not validated

### Failure Modes

- Node Failure (1/3): Cluster continues, new leader elected if leader fails
- Quorum Loss (2/3): Cluster unavailable until quorum restored
- Network Partition: Not tested (requires distributed environment)

### Recovery Procedures

- Node restart: Rejoins cluster automatically, catches up via log replication
- Manual intervention required for quorum loss scenarios

## FlareDB (Time-Series Database)

### Current Capabilities ✓

- PD Client Auto-Reconnect: 10s heartbeat cycle, connection pooling
- Raft-based Metadata: Uses ChainFire for cluster metadata (inherits ChainFire HA)
- Data Consistency: Write-ahead log ensures durability
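
The 10s heartbeat cycle can be sketched as a tick loop that reconnects on a missed ping. This is a hypothetical outline only — `ping_pd` and `reconnect` are stand-ins, not the real PD client API:

```rust
/// Hypothetical sketch of the heartbeat-driven reconnect described above.
/// `ping_pd` returns false on a missed heartbeat; `reconnect` stands in for
/// re-discovering the PD leader. A real loop would sleep ~10s (the
/// heartbeat interval) between ticks; this version is ticked synchronously.
fn heartbeat_cycle(
    mut ping_pd: impl FnMut() -> bool,
    mut reconnect: impl FnMut(),
    ticks: u32,
) -> u32 {
    let mut reconnections = 0;
    for _ in 0..ticks {
        if !ping_pd() {
            // Missed heartbeat: re-establish the PD connection before the next tick.
            reconnect();
            reconnections += 1;
        }
    }
    reconnections
}
```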

### Validated Behavior

- PD (ChainFire) reconnection after leader change
- Metadata operations survive ChainFire node failures

### Documented Gaps (deferred to T039)

- FlareDB-specific Raft: Multi-raft for data regions not tested
- Storage node failure: Failover behavior not validated
- Cross-region replication: Not implemented

### Failure Modes

- PD Unavailable: FlareDB operations stall until PD recovers
- Storage Node Failure: Data loss if replication factor < 3

### Recovery Procedures

- Automatic reconnection to new PD leader
- Manual data recovery if storage node lost

## PlasmaVMC (VM Control Plane)

### Current Capabilities ✓

- VM State Tracking: VmState enum includes Migrating state
- ChainFire Persistence: VM metadata stored in distributed KVS
- QMP Integration: Can parse migration-related states

### Documented Gaps ⚠️

- No Live Migration: Capability flag set, but `migrate()` not implemented
- No Host Health Monitoring: No heartbeat or probe mechanism
- No Automatic Failover: VM recovery requires manual intervention
- No Shared Storage: VM disks are local-only (blocks migration)
- No Reconnection Logic: Network failures cause silent operation failures

### Failure Modes

- Host Process Kill: QEMU processes orphaned, VM state inconsistent
- QEMU Crash: VM lost, no automatic restart
- Network Blip: Operations fail silently (no retry)

### Recovery Procedures

- Manual only: Restart PlasmaVMC server, reconcile VM state manually
- Gap: No automated recovery or failover

### Recommended Improvements (for T039)

- Implement VM health monitoring (heartbeat to VMs)
- Add reconnection logic with retry/backoff
- Consider VM restart on crash (watchdog pattern)
- Document expected behavior for host failures
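
The watchdog pattern suggested above could take roughly this shape. It is illustrative only — PlasmaVMC does not currently implement it, and `run_vm` is a stand-in for launching QEMU and blocking until the process exits:

```rust
/// Hypothetical watchdog sketch for the "VM restart on crash" recommendation.
/// `run_vm` stands in for launching QEMU and waiting for it to exit: it
/// returns true on a clean guest shutdown and false on a crash. Illustrative
/// only; not part of the current PlasmaVMC codebase.
fn watchdog(mut run_vm: impl FnMut() -> bool, max_restarts: u32) -> u32 {
    let mut restarts = 0;
    loop {
        if run_vm() {
            // Clean guest shutdown: stop supervising.
            return restarts;
        }
        if restarts >= max_restarts {
            // Restart budget exhausted: surface the failure to the operator.
            return restarts;
        }
        restarts += 1;
        // A real implementation would back off here before relaunching QEMU.
    }
}
```

Capping restarts matters: without a budget, a VM that crashes on boot would be relaunched in a tight loop.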

## IAM (Identity & Access Management)

### Current Capabilities ✓

- Token-based Auth: JWT validation
- ChainFire Backend: Inherits ChainFire's HA properties

### Documented Gaps ⚠️

- No Retry Mechanism: Network failures cascade to all services
- No Connection Pooling: Each request creates new connection
- Auth Failures: Cascade to dependent services without graceful degradation

### Failure Modes

- IAM Service Down: All authenticated operations fail
- Network Failure: No retry, immediate failure

### Recovery Procedures

- Restart IAM service (automatic service restart via systemd recommended)

## PrismNet (SDN Controller)

### Current Capabilities ✓

- OVN Integration: Network topology management

### Documented Gaps ⚠️

- Not yet evaluated: T040 focused on core services
- Reconnection: Likely needs retry logic for OVN

### Recommended for T039

- Evaluate PrismNet HA behavior under OVN failures
- Test network partition scenarios

## Watch Streams (Event Propagation)

### Documented Gaps ⚠️

- No Auto-Reconnect: Watch streams break on error, require manual restart
- No Buffering: Events lost during disconnection
- No Backpressure: Fast event sources can overwhelm slow consumers

### Failure Modes

- Connection Drop: Watch stream terminates, no automatic recovery
- Event Loss: Missed events during downtime

### Recommended Improvements

- Implement watch reconnection with resume-from-last-seen
- Add event buffering/queuing
- Backpressure handling for slow consumers
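
The resume-from-last-seen idea can be sketched as follows. All names here are illustrative assumptions — `subscribe` stands in for the real watch API, taking the revision to resume from and returning either a batch of event revisions or a stream error:

```rust
/// Hypothetical sketch of watch reconnection with resume-from-last-seen.
/// The client remembers the last event revision it applied and, after a
/// stream error, re-subscribes from `last_seen + 1` so no events are lost.
/// `subscribe` and `apply` are stand-ins, not a real PlasmaCloud API.
fn watch_with_resume(
    mut subscribe: impl FnMut(u64) -> Result<Vec<u64>, &'static str>,
    mut apply: impl FnMut(u64),
    max_reconnects: u32,
) -> u64 {
    let mut last_seen: u64 = 0;
    let mut reconnects = 0;
    loop {
        match subscribe(last_seen + 1) {
            Ok(events) => {
                for rev in events {
                    apply(rev);
                    last_seen = rev; // remember progress for resumption
                }
                return last_seen; // stream ended cleanly
            }
            Err(_) if reconnects < max_reconnects => {
                // Reconnect and resume from last_seen + 1.
                reconnects += 1;
            }
            Err(_) => return last_seen, // give up after repeated failures
        }
    }
}
```

This pattern only closes the event-loss gap if the server retains enough history to serve the resume revision; otherwise the client must fall back to a full re-list.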

## Testing Approach Summary

### Validation Levels

| Level | Scope | Status |
|---|---|---|
| Unit Tests | Algorithm correctness | ✓ Complete (8/8 tests) |
| Integration Tests | Component interaction | ✓ Complete (3-node cluster) |
| Operational Tests | Process kill, restart, partition | ⚠️ Deferred to T039 |

### Rationale for Deferral

- Unit tests validate: Raft algorithm correctness, consensus safety, data consistency
- Operational tests require: Real distributed nodes, shared storage, network infrastructure
- T039 (Production Deployment): Better environment for operational resilience testing with actual hardware

## Gap Summary by Priority

### P0 Gaps (Critical for Production)

- PlasmaVMC: No automatic VM failover or health monitoring
- IAM: No retry/reconnection logic
- Watch Streams: No auto-reconnect

### P1 Gaps (Important but Mitigable)

- Raft: Graceful shutdown for clean node removal
- PlasmaVMC: Live migration implementation
- Network partition: Cross-datacenter failure scenarios

### P2 Gaps (Enhancement)

- FlareDB: Multi-region replication
- PrismNet: Network failure recovery testing

## Operational Recommendations

### Pre-Production Checklist

- Monitoring: Implement health checks for all critical services
- Alerting: Set up alerts for leader changes, node failures
- Runbooks: Create failure recovery procedures for each component
- Backup: Regular snapshots of ChainFire data
- Testing: Run operational failure tests in T039 staging environment

### Production Deployment (T039)

- Test process kill/restart scenarios on real hardware
- Validate network partition handling
- Measure recovery time objectives (RTO)
- Verify data consistency under failures

## References

- T040 Task YAML: `docs/por/T040-ha-validation/task.yaml`
- Test Runbooks: `docs/por/T040-ha-validation/s2-raft-resilience-runbook.md`, `s3-plasmavmc-ha-runbook.md`, `s4-test-scenarios.md`
- Custom Raft Tests: `chainfire/crates/chainfire-raft/tests/leader_election.rs`

---

*Last Updated: 2025-12-12 01:19 JST by PeerB*