# High Availability Behavior - PlasmaCloud Components
**Status:** Gap Analysis Complete (2025-12-12)
**Environment:** Development/Testing (operational validation deferred to T039)
## Overview
This document summarizes the HA capabilities, failure modes, and recovery behavior of PlasmaCloud components based on code analysis and unit test validation performed in T040 (HA Validation).

---
## ChainFire (Distributed KV Store)
### Current Capabilities ✓
- **Raft Consensus:** Custom implementation with proven algorithm correctness
- **Leader Election:** Automatic within 150-600ms election timeout
- **Log Replication:** Write→replicate→commit→apply flow validated
- **Quorum Maintenance:** A majority (2 of 3 nodes) is sufficient for cluster operation
- **RPC Retry Logic:** 3 retries with exponential backoff (500ms-30s)
- **State Machine:** Consistent key-value operations across all nodes
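The documented RPC retry schedule (3 retries, delays between 500ms and 30s) corresponds to a capped exponential backoff. The sketch below is illustrative only, not the chainfire-raft implementation; the 500ms base, doubling policy, and 30s cap are assumptions consistent with the stated window.

```rust
use std::time::Duration;

/// Illustrative capped exponential backoff: 500ms base, doubling per
/// attempt, clamped to 30s (mirrors the documented 500ms-30s window).
fn backoff(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    let cap_ms: u64 = 30_000;
    // Clamp the shift so the multiplication cannot overflow.
    let delay = base_ms.saturating_mul(1u64 << attempt.min(16));
    Duration::from_millis(delay.min(cap_ms))
}

fn main() {
    // attempt 0 -> 500ms, attempt 2 -> 2s, later attempts cap at 30s
    assert_eq!(backoff(0), Duration::from_millis(500));
    assert_eq!(backoff(2), Duration::from_millis(2_000));
    assert_eq!(backoff(10), Duration::from_millis(30_000));
}
```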
### Validated Behavior
| Scenario | Expected Behavior | Status |
|----------|-------------------|--------|
| Single node failure | New leader elected, cluster continues | ✓ Validated (unit tests) |
| Leader election | Completes in <10s with 2/3 quorum | ✓ Validated |
| Write replication | All nodes commit and apply writes | ✓ Validated |
| Follower writes | Rejected with NotLeader error | ✓ Validated |
### Documented Gaps (deferred to T039)
- **Process kill/restart:** Graceful shutdown not implemented
- **Network partition:** Cross-network scenarios not tested
- **Quorum loss recovery:** 2/3 node failure scenarios not automated
- **SIGSTOP/SIGCONT:** Process pause/resume behavior not validated
### Failure Modes
1. **Node Failure (1/3):** Cluster continues, new leader elected if leader fails
2. **Quorum Loss (2/3):** Cluster unavailable until quorum restored
3. **Network Partition:** Not tested (requires distributed environment)
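The 1-of-3 vs 2-of-3 distinction above follows directly from majority-quorum arithmetic; a minimal sketch (generic Raft math, not taken from the chainfire code):

```rust
/// Majority quorum for an n-node Raft cluster.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}

/// Maximum simultaneous node failures a cluster of size n can tolerate.
fn tolerated_failures(n: usize) -> usize {
    n - quorum(n)
}

fn main() {
    // A 3-node cluster needs 2 votes: it survives 1 failure, not 2.
    assert_eq!(quorum(3), 2);
    assert_eq!(tolerated_failures(3), 1);
    // A 5-node cluster would tolerate 2 failures.
    assert_eq!(tolerated_failures(5), 2);
}
```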
### Recovery Procedures
- Node restart: Rejoins cluster automatically, catches up via log replication
- Manual intervention required for quorum loss scenarios
---
## FlareDB (Time-Series Database)
### Current Capabilities ✓
- **PD Client Auto-Reconnect:** 10s heartbeat cycle, connection pooling
- **Raft-based Metadata:** Uses ChainFire for cluster metadata (inherits ChainFire HA)
- **Data Consistency:** Write-ahead log ensures durability
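The durability claim rests on the standard write-ahead pattern: persist the log record before mutating state, so a crash after the append can be replayed on restart. The sketch below is schematic (an in-memory `Vec` stands in for the on-disk log); it does not show FlareDB's actual WAL format.

```rust
use std::collections::HashMap;

/// Schematic write-ahead store: the record is appended to the log
/// before the state is mutated, so recovery can replay the log.
struct Store {
    wal: Vec<(String, i64)>,      // stand-in for the durable on-disk log
    state: HashMap<String, i64>,  // in-memory applied state
}

impl Store {
    fn put(&mut self, key: &str, value: i64) {
        self.wal.push((key.to_string(), value));   // 1. durable append (fsync in a real WAL)
        self.state.insert(key.to_string(), value); // 2. apply to state
    }

    /// Crash recovery: rebuild state by replaying the log from scratch.
    fn recover(wal: Vec<(String, i64)>) -> Self {
        let mut state = HashMap::new();
        for (k, v) in &wal {
            state.insert(k.clone(), *v);
        }
        Store { wal, state }
    }
}

fn main() {
    let mut s = Store { wal: Vec::new(), state: HashMap::new() };
    s.put("ts:cpu", 42);
    // Simulate a crash: only the WAL survives, state is rebuilt from it.
    let recovered = Store::recover(s.wal.clone());
    assert_eq!(recovered.state.get("ts:cpu"), Some(&42));
}
```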
### Validated Behavior
- PD (ChainFire) reconnection after leader change
- Metadata operations survive ChainFire node failures
### Documented Gaps (deferred to T039)
- **FlareDB-specific Raft:** Multi-raft for data regions not tested
- **Storage node failure:** Failover behavior not validated
- **Cross-region replication:** Not implemented
### Failure Modes
1. **PD Unavailable:** FlareDB operations stall until PD recovers
2. **Storage Node Failure:** Data loss if replication factor < 3
### Recovery Procedures
- Automatic reconnection to new PD leader
- Manual data recovery if storage node lost
---
## PlasmaVMC (VM Control Plane)
### Current Capabilities ✓
- **VM State Tracking:** VmState enum includes Migrating state
- **ChainFire Persistence:** VM metadata stored in distributed KVS
- **QMP Integration:** Can parse migration-related states
### Documented Gaps ⚠️
- **No Live Migration:** Capability flag set, but `migrate()` not implemented
- **No Host Health Monitoring:** No heartbeat or probe mechanism
- **No Automatic Failover:** VM recovery requires manual intervention
- **No Shared Storage:** VM disks are local-only (blocks migration)
- **No Reconnection Logic:** Network failures cause operations to fail silently
### Failure Modes
1. **Host Process Kill:** QEMU processes orphaned, VM state inconsistent
2. **QEMU Crash:** VM lost, no automatic restart
3. **Network Blip:** Operations fail silently (no retry)
### Recovery Procedures
- **Manual only:** Restart PlasmaVMC server, reconcile VM state manually
- **Gap:** No automated recovery or failover
### Recommended Improvements (for T039)
1. Implement VM health monitoring (heartbeat to VMs)
2. Add reconnection logic with retry/backoff
3. Consider VM restart on crash (watchdog pattern)
4. Document expected behavior for host failures
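The watchdog pattern suggested in item 3 can be sketched as a supervision loop that restarts a crashed workload up to a retry budget. All names here are illustrative, not PlasmaVMC APIs; in practice the workload would wrap a QEMU process, but a closure keeps the sketch self-contained.

```rust
/// Illustrative watchdog: run the workload, restart on failure, give up
/// after `max_restarts`. Returns how many restarts were needed.
fn supervise<F>(mut run: F, max_restarts: u32) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut restarts = 0;
    loop {
        match run() {
            Ok(()) => return Ok(restarts),
            Err(e) if restarts < max_restarts => {
                eprintln!("workload crashed ({e}), restarting");
                restarts += 1;
            }
            Err(e) => return Err(e), // budget exhausted: surface the error
        }
    }
}

fn main() {
    // A workload that fails twice, then stays up.
    let mut failures_left = 2;
    let result = supervise(
        || {
            if failures_left > 0 {
                failures_left -= 1;
                Err("simulated crash".to_string())
            } else {
                Ok(())
            }
        },
        3,
    );
    assert_eq!(result, Ok(2));
}
```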
---
## IAM (Identity & Access Management)
### Current Capabilities ✓
- **Token-based Auth:** JWT validation
- **ChainFire Backend:** Inherits ChainFire's HA properties
### Documented Gaps ⚠️
- **No Retry Mechanism:** Network failures cascade to all services
- **No Connection Pooling:** Each request creates new connection
- **Auth Failures:** Cascade to dependent services without graceful degradation
### Failure Modes
1. **IAM Service Down:** All authenticated operations fail
2. **Network Failure:** No retry, immediate failure
### Recovery Procedures
- Restart IAM service (automatic service restart via systemd recommended)
---
## PrismNet (SDN Controller)
### Current Capabilities ✓
- **OVN Integration:** Network topology management
### Documented Gaps ⚠️
- **Not yet evaluated:** T040 focused on core services
- **Reconnection:** The OVN connection likely needs retry logic
### Recommended for T039
- Evaluate PrismNet HA behavior under OVN failures
- Test network partition scenarios
---
## Watch Streams (Event Propagation)
### Documented Gaps ⚠️
- **No Auto-Reconnect:** Watch streams break on error, require manual restart
- **No Buffering:** Events lost during disconnection
- **No Backpressure:** Fast event sources can overwhelm slow consumers
### Failure Modes
1. **Connection Drop:** Watch stream terminates, no automatic recovery
2. **Event Loss:** Missed events during downtime
### Recommended Improvements
1. Implement watch reconnection with resume-from-last-seen
2. Add event buffering/queuing
3. Backpressure handling for slow consumers
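Resume-from-last-seen (improvement 1) typically means tracking the highest revision delivered and re-subscribing from just after it, so events published during the disconnection are replayed rather than lost. A minimal in-memory sketch under those assumptions (the `Event`/`resume` names are illustrative, and `history` stands in for a server-side retained event log):

```rust
/// One event in a watch stream, keyed by a monotonically increasing revision.
#[derive(Clone, Debug, PartialEq)]
struct Event {
    revision: u64,
    key: String,
}

/// Re-subscribe from just after the last revision the consumer saw,
/// replaying any events that arrived while the stream was down.
fn resume(history: &[Event], last_seen: u64) -> Vec<Event> {
    history
        .iter()
        .filter(|e| e.revision > last_seen)
        .cloned()
        .collect()
}

fn main() {
    let history = vec![
        Event { revision: 1, key: "a".into() },
        Event { revision: 2, key: "b".into() },
        Event { revision: 3, key: "c".into() },
    ];
    // The consumer saw revision 1, then the stream dropped; on
    // reconnect it resumes from revision 2 onward with no loss.
    let replayed = resume(&history, 1);
    assert_eq!(replayed.len(), 2);
    assert_eq!(replayed[0].revision, 2);
}
```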
---
## Testing Approach Summary
### Validation Levels
| Level | Scope | Status |
|-------|-------|--------|
| Unit Tests | Algorithm correctness | Complete (8/8 tests) |
| Integration Tests | Component interaction | Complete (3-node cluster) |
| Operational Tests | Process kill, restart, partition | Deferred to T039 |
### Rationale for Deferral
- **Unit tests validate:** Raft algorithm correctness, consensus safety, data consistency
- **Operational tests require:** Real distributed nodes, shared storage, network infrastructure
- **T039 (Production Deployment):** Better environment for operational resilience testing with actual hardware
---
## Gap Summary by Priority
### P0 Gaps (Critical for Production)
- PlasmaVMC: No automatic VM failover or health monitoring
- IAM: No retry/reconnection logic
- Watch Streams: No auto-reconnect
### P1 Gaps (Important but Mitigable)
- Raft: Graceful shutdown for clean node removal
- PlasmaVMC: Live migration implementation
- Network partition: Cross-datacenter failure scenarios
### P2 Gaps (Enhancement)
- FlareDB: Multi-region replication
- PrismNet: Network failure recovery testing
---
## Operational Recommendations
### Pre-Production Checklist
1. **Monitoring:** Implement health checks for all critical services
2. **Alerting:** Set up alerts for leader changes, node failures
3. **Runbooks:** Create failure recovery procedures for each component
4. **Backup:** Regular snapshots of ChainFire data
5. **Testing:** Run operational failure tests in T039 staging environment
### Production Deployment (T039)
- Test process kill/restart scenarios on real hardware
- Validate network partition handling
- Measure recovery time objectives (RTO)
- Verify data consistency under failures
---
## References
- T040 Task YAML: `docs/por/T040-ha-validation/task.yaml`
- Test Runbooks: `docs/por/T040-ha-validation/s2-raft-resilience-runbook.md`, `s3-plasmavmc-ha-runbook.md`, `s4-test-scenarios.md`
- Custom Raft Tests: `chainfire/crates/chainfire-raft/tests/leader_election.rs`
**Last Updated:** 2025-12-12 01:19 JST by PeerB