# High Availability Behavior - PlasmaCloud Components

**Status:** Gap Analysis Complete (2025-12-12)
**Environment:** Development/Testing (operational validation deferred to T039)

## Overview

This document summarizes the HA capabilities, failure modes, and recovery behavior of PlasmaCloud components, based on code analysis and unit test validation performed in T040 (HA Validation).

---
## ChainFire (Distributed KV Store)

### Current Capabilities ✓

- **Raft Consensus:** Custom implementation with proven algorithm correctness
- **Leader Election:** Automatic, completing within a randomized 150-600ms election timeout
- **Log Replication:** Write→replicate→commit→apply flow validated
- **Quorum Maintenance:** 2/3 nodes are sufficient for cluster operation
- **RPC Retry Logic:** 3 retries with exponential backoff (500ms-30s)
- **State Machine:** Consistent key-value operations across all nodes
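The retry schedule above can be sketched as a doubling backoff capped at the upper bound. This is a minimal illustration only: the doubling multiplier and the absence of jitter are assumptions, not confirmed ChainFire behavior.

```rust
use std::time::Duration;

// Exponential backoff schedule: the base delay doubles per attempt and
// is capped at `max`. Multiplier and lack of jitter are illustrative
// assumptions, not confirmed from the ChainFire source.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    base.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    for attempt in 0..3 {
        // Yields 500ms, 1s, 2s for the three retries; a longer
        // schedule would eventually hit the 30s cap.
        println!("retry {} after {:?}", attempt + 1, backoff_delay(attempt, base, max));
    }
}
```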
### Validated Behavior

| Scenario | Expected Behavior | Status |
|----------|-------------------|--------|
| Single node failure | New leader elected, cluster continues | ✓ Validated (unit tests) |
| Leader election | Completes in <10s with 2/3 quorum | ✓ Validated |
| Write replication | All nodes commit and apply writes | ✓ Validated |
| Follower writes | Rejected with NotLeader error | ✓ Validated |
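The follower-write rejection implies a client-side redirect step. A minimal sketch of that handling, where the error shape and leader-hint field are assumptions for illustration rather than ChainFire's actual client API:

```rust
// Client reaction to a NotLeader rejection: retarget when a hint is
// given, otherwise fall back to leader rediscovery. Types here are
// hypothetical stand-ins.
#[derive(Debug)]
enum WriteError {
    NotLeader { leader_hint: Option<u64> },
}

struct Client {
    target_node: u64,
}

impl Client {
    // Returns true if the caller should retry against the new target.
    fn handle_rejection(&mut self, err: &WriteError) -> bool {
        match err {
            WriteError::NotLeader { leader_hint: Some(id) } => {
                self.target_node = *id; // redirect to the hinted leader
                true
            }
            // No hint: the caller must rediscover the leader first.
            WriteError::NotLeader { leader_hint: None } => false,
        }
    }
}

fn main() {
    let mut c = Client { target_node: 1 };
    let retry = c.handle_rejection(&WriteError::NotLeader { leader_hint: Some(3) });
    println!("retry={} target={}", retry, c.target_node);
}
```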
### Documented Gaps (deferred to T039)

- **Process kill/restart:** Graceful shutdown not implemented
- **Network partition:** Cross-network scenarios not tested
- **Quorum loss recovery:** 2/3 node failure scenarios not automated
- **SIGSTOP/SIGCONT:** Process pause/resume behavior not validated
### Failure Modes

1. **Node Failure (1/3):** Cluster continues; a new leader is elected if the failed node was the leader
2. **Quorum Loss (2/3):** Cluster unavailable until quorum is restored
3. **Network Partition:** Not tested (requires a distributed environment)

### Recovery Procedures

- Node restart: rejoins the cluster automatically and catches up via log replication
- Quorum loss scenarios require manual intervention

---
## FlareDB (Time-Series Database)

### Current Capabilities ✓

- **PD Client Auto-Reconnect:** 10s heartbeat cycle, connection pooling
- **Raft-based Metadata:** Uses ChainFire for cluster metadata (inherits ChainFire's HA properties)
- **Data Consistency:** Write-ahead log ensures durability
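The durability guarantee rests on write-ahead ordering: persist the record before mutating in-memory state, so replay after a crash reconstructs exactly what was acknowledged. A minimal sketch (the record format and in-memory map are illustrative assumptions, not FlareDB's actual types):

```rust
use std::collections::HashMap;

// Write-ahead ordering sketch: log first, apply second, and rebuild
// state purely from the log on recovery. Record format is hypothetical.
struct Store {
    log: Vec<String>, // stand-in for an fsync'd on-disk log
    state: HashMap<String, String>,
}

impl Store {
    fn put(&mut self, key: &str, value: &str) {
        // 1. Durably record the intent first.
        self.log.push(format!("PUT {key}={value}"));
        // 2. Only then apply it to the state machine.
        self.state.insert(key.to_string(), value.to_string());
    }

    // Crash recovery: reconstruct state from the log alone.
    fn replay(log: &[String]) -> HashMap<String, String> {
        let mut state = HashMap::new();
        for entry in log {
            if let Some((k, v)) = entry.strip_prefix("PUT ").and_then(|r| r.split_once('=')) {
                state.insert(k.to_string(), v.to_string());
            }
        }
        state
    }
}

fn main() {
    let mut s = Store { log: Vec::new(), state: HashMap::new() };
    s.put("cpu.usage", "42");
    assert_eq!(Store::replay(&s.log), s.state); // replay matches live state
    println!("replayed {} entrie(s)", s.log.len());
}
```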
### Validated Behavior

- PD (ChainFire) reconnection after leader change
- Metadata operations survive ChainFire node failures
### Documented Gaps (deferred to T039)

- **FlareDB-specific Raft:** Multi-raft for data regions not tested
- **Storage node failure:** Failover behavior not validated
- **Cross-region replication:** Not implemented
### Failure Modes

1. **PD Unavailable:** FlareDB operations stall until PD recovers
2. **Storage Node Failure:** Data loss if replication factor < 3

### Recovery Procedures

- Automatic reconnection to the new PD leader
- Manual data recovery if a storage node is lost

---
## PlasmaVMC (VM Control Plane)

### Current Capabilities ✓

- **VM State Tracking:** VmState enum includes a Migrating state
- **ChainFire Persistence:** VM metadata stored in the distributed KV store
- **QMP Integration:** Can parse migration-related states
### Documented Gaps ⚠️

- **No Live Migration:** Capability flag is set, but `migrate()` is not implemented
- **No Host Health Monitoring:** No heartbeat or probe mechanism
- **No Automatic Failover:** VM recovery requires manual intervention
- **No Shared Storage:** VM disks are local-only (blocks migration)
- **No Reconnection Logic:** Network failures cause silent operation failures
### Failure Modes

1. **Host Process Kill:** QEMU processes are orphaned, leaving VM state inconsistent
2. **QEMU Crash:** VM is lost; no automatic restart
3. **Network Blip:** Operations fail silently (no retry)
### Recovery Procedures

- **Manual only:** Restart the PlasmaVMC server and reconcile VM state by hand
- **Gap:** No automated recovery or failover
### Recommended Improvements (for T039)
|
||
|
||
1. Implement VM health monitoring (heartbeat to VMs)
|
||
2. Add reconnection logic with retry/backoff
|
||
3. Consider VM restart on crash (watchdog pattern)
|
||
4. Document expected behavior for host failures
|
||
|
||
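Item 3 could follow a standard watchdog policy: restart a crashed VM a bounded number of times, then escalate. PlasmaVMC has no such mechanism today, so everything in this sketch (the restart cap, the action names) is hypothetical:

```rust
// Hypothetical watchdog policy for crashed QEMU processes. The restart
// cap and actions are assumptions; PlasmaVMC does not implement this.
struct Watchdog {
    max_restarts: u32,
    restarts: u32,
}

#[derive(Debug, PartialEq)]
enum Action {
    Restart,
    AlertOperator,
}

impl Watchdog {
    // Called when a monitored VM process is observed dead.
    fn on_process_dead(&mut self) -> Action {
        if self.restarts < self.max_restarts {
            self.restarts += 1;
            Action::Restart
        } else {
            Action::AlertOperator // stop flapping; hand off to a human
        }
    }
}

fn main() {
    let mut w = Watchdog { max_restarts: 2, restarts: 0 };
    println!("{:?}", w.on_process_dead()); // Restart
    println!("{:?}", w.on_process_dead()); // Restart
    println!("{:?}", w.on_process_dead()); // AlertOperator
}
```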
---
## IAM (Identity & Access Management)

### Current Capabilities ✓

- **Token-based Auth:** JWT validation
- **ChainFire Backend:** Inherits ChainFire's HA properties

### Documented Gaps ⚠️

- **No Retry Mechanism:** Network failures cascade to all services
- **No Connection Pooling:** Each request creates a new connection
- **Auth Failures:** Cascade to dependent services without graceful degradation
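The pooling gap amounts to "dial once, reuse many times" being absent. A minimal sketch of the missing behavior, where `Conn` and the pool shape are hypothetical stand-ins rather than real IAM types:

```rust
// Hypothetical connection pool for the IAM client. Today each request
// dials fresh; this sketch only dials when no idle connection exists.
struct Conn {
    id: u32,
}

struct Pool {
    idle: Vec<Conn>,
    dialed: u32,
}

impl Pool {
    fn checkout(&mut self) -> Conn {
        self.idle.pop().unwrap_or_else(|| {
            self.dialed += 1; // only dial when the pool is empty
            Conn { id: self.dialed }
        })
    }

    fn checkin(&mut self, conn: Conn) {
        self.idle.push(conn); // return for reuse instead of closing
    }
}

fn main() {
    let mut pool = Pool { idle: Vec::new(), dialed: 0 };
    let c1 = pool.checkout(); // dials
    pool.checkin(c1);
    let c2 = pool.checkout(); // reuses; no new dial
    println!("dialed {} connection(s), reused id {}", pool.dialed, c2.id);
}
```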
### Failure Modes

1. **IAM Service Down:** All authenticated operations fail
2. **Network Failure:** No retry; immediate failure

### Recovery Procedures

- Restart the IAM service (automatic restart via systemd is recommended)

---
## PrismNet (SDN Controller)

### Current Capabilities ✓

- **OVN Integration:** Network topology management

### Documented Gaps ⚠️

- **Not yet evaluated:** T040 focused on core services
- **Reconnection:** Likely needs retry logic for OVN connections

### Recommended for T039

- Evaluate PrismNet HA behavior under OVN failures
- Test network partition scenarios

---
## Watch Streams (Event Propagation)

### Documented Gaps ⚠️

- **No Auto-Reconnect:** Watch streams break on error and require manual restart
- **No Buffering:** Events are lost during disconnection
- **No Backpressure:** Fast event sources can overwhelm slow consumers
### Failure Modes

1. **Connection Drop:** Watch stream terminates with no automatic recovery
2. **Event Loss:** Events emitted during downtime are missed
### Recommended Improvements

1. Implement watch reconnection with resume-from-last-seen
2. Add event buffering/queuing
3. Add backpressure handling for slow consumers
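Resume-from-last-seen (item 1) means tracking a cursor per stream and re-subscribing just past it after a drop. The revision cursor and request shape below are assumptions; the current watch API has no resume support:

```rust
// Resume-from-last-seen sketch for watch reconnection. The revision
// cursor is a hypothetical design, not an existing API.
struct WatchClient {
    last_seen_rev: u64,
}

impl WatchClient {
    // Record the revision of each event as it is processed.
    fn on_event(&mut self, rev: u64) {
        self.last_seen_rev = rev;
    }

    // After a dropped connection, re-subscribe starting just past the
    // last processed revision so nothing in between is silently lost.
    fn resume_from(&self) -> u64 {
        self.last_seen_rev + 1
    }
}

fn main() {
    let mut w = WatchClient { last_seen_rev: 0 };
    w.on_event(41);
    w.on_event(42);
    // Connection drops here; on reconnect, ask for revision 43 onward.
    println!("resume watch at revision {}", w.resume_from());
}
```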
---
## Testing Approach Summary

### Validation Levels

| Level | Scope | Status |
|-------|-------|--------|
| Unit Tests | Algorithm correctness | ✓ Complete (8/8 tests) |
| Integration Tests | Component interaction | ✓ Complete (3-node cluster) |
| Operational Tests | Process kill, restart, partition | ⚠️ Deferred to T039 |

### Rationale for Deferral

- **Unit tests validate:** Raft algorithm correctness, consensus safety, data consistency
- **Operational tests require:** Real distributed nodes, shared storage, network infrastructure
- **T039 (Production Deployment):** Provides a better environment for operational resilience testing on actual hardware

---
## Gap Summary by Priority

### P0 Gaps (Critical for Production)

- PlasmaVMC: No automatic VM failover or health monitoring
- IAM: No retry/reconnection logic
- Watch Streams: No auto-reconnect

### P1 Gaps (Important but Mitigable)

- Raft: Graceful shutdown for clean node removal
- PlasmaVMC: Live migration implementation
- Network partition: Cross-datacenter failure scenarios

### P2 Gaps (Enhancement)

- FlareDB: Multi-region replication
- PrismNet: Network failure recovery testing

---
## Operational Recommendations

### Pre-Production Checklist

1. **Monitoring:** Implement health checks for all critical services
2. **Alerting:** Set up alerts for leader changes and node failures
3. **Runbooks:** Create failure recovery procedures for each component
4. **Backup:** Take regular snapshots of ChainFire data
5. **Testing:** Run operational failure tests in the T039 staging environment

### Production Deployment (T039)

- Test process kill/restart scenarios on real hardware
- Validate network partition handling
- Measure recovery time objectives (RTO)
- Verify data consistency under failures

---
## References

- T040 Task YAML: `docs/por/T040-ha-validation/task.yaml`
- Test Runbooks: `docs/por/T040-ha-validation/s2-raft-resilience-runbook.md`, `s3-plasmavmc-ha-runbook.md`, `s4-test-scenarios.md`
- Custom Raft Tests: `chainfire/crates/chainfire-raft/tests/leader_election.rs`

**Last Updated:** 2025-12-12 01:19 JST by PeerB