
High Availability Behavior - PlasmaCloud Components

Status: Gap Analysis Complete (2025-12-12)
Environment: Development/Testing (operational validation deferred to T039)

Overview

This document summarizes the HA capabilities, failure modes, and recovery behavior of PlasmaCloud components based on code analysis and unit test validation performed in T040 (HA Validation).


ChainFire (Distributed KV Store)

Current Capabilities ✓

  • Raft Consensus: Custom implementation with proven algorithm correctness
  • Leader Election: Automatic within 150-600ms election timeout
  • Log Replication: Write→replicate→commit→apply flow validated
  • Quorum Maintenance: 2/3 nodes sufficient for cluster operation
  • RPC Retry Logic: 3 retries with exponential backoff (500ms-30s)
  • State Machine: Consistent key-value operations across all nodes
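
The retry behavior above (3 retries, exponential backoff from 500ms capped at 30s) can be sketched as a small delay calculation. This is an illustrative sketch, not the actual chainfire-raft implementation; `backoff_delay` is a hypothetical helper:

```rust
use std::time::Duration;

/// Hypothetical sketch of the documented RPC retry backoff: the delay
/// starts at `base`, doubles on each attempt, and is capped at `max`
/// (500ms to 30s per the capability list above).
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}
```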

Validated Behavior

| Scenario | Expected Behavior | Status |
| --- | --- | --- |
| Single node failure | New leader elected, cluster continues | ✓ Validated (unit tests) |
| Leader election | Completes in <10s with 2/3 quorum | ✓ Validated |
| Write replication | All nodes commit and apply writes | ✓ Validated |
| Follower writes | Rejected with NotLeader error | ✓ Validated |

Documented Gaps (deferred to T039)

  • Process kill/restart: Graceful shutdown not implemented
  • Network partition: Cross-network scenarios not tested
  • Quorum loss recovery: 2/3 node failure scenarios not automated
  • SIGSTOP/SIGCONT: Process pause/resume behavior not validated

Failure Modes

  1. Node Failure (1/3): Cluster continues, new leader elected if leader fails
  2. Quorum Loss (2/3): Cluster unavailable until quorum restored
  3. Network Partition: Not tested (requires distributed environment)
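
The quorum arithmetic behind these failure modes is a strict majority of voters. A minimal sketch (`quorum` is a hypothetical helper, not PlasmaCloud code):

```rust
/// Minimum nodes required for quorum in a Raft cluster of `n` voters:
/// a strict majority, n/2 + 1. With n = 3, quorum is 2, so one node may
/// fail; losing two of three nodes makes the cluster unavailable.
fn quorum(n: usize) -> usize {
    n / 2 + 1
}
```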

Recovery Procedures

  • Node restart: rejoins the cluster automatically and catches up via log replication
  • Quorum loss: manual intervention required to restore the cluster

FlareDB (Time-Series Database)

Current Capabilities ✓

  • PD Client Auto-Reconnect: 10s heartbeat cycle, connection pooling
  • Raft-based Metadata: Uses ChainFire for cluster metadata (inherits ChainFire HA)
  • Data Consistency: Write-ahead log ensures durability

Validated Behavior

  • PD (ChainFire) reconnection after leader change
  • Metadata operations survive ChainFire node failures

Documented Gaps (deferred to T039)

  • FlareDB-specific Raft: Multi-raft for data regions not tested
  • Storage node failure: Failover behavior not validated
  • Cross-region replication: Not implemented

Failure Modes

  1. PD Unavailable: FlareDB operations stall until PD recovers
  2. Storage Node Failure: Risk of data loss if replication factor < 3

Recovery Procedures

  • Automatic reconnection to new PD leader
  • Manual data recovery if storage node lost

PlasmaVMC (VM Control Plane)

Current Capabilities ✓

  • VM State Tracking: VmState enum includes Migrating state
  • ChainFire Persistence: VM metadata stored in distributed KVS
  • QMP Integration: Can parse migration-related states

Documented Gaps ⚠️

  • No Live Migration: Capability flag set, but migrate() not implemented
  • No Host Health Monitoring: No heartbeat or probe mechanism
  • No Automatic Failover: VM recovery requires manual intervention
  • No Shared Storage: VM disks are local-only (blocks migration)
  • No Reconnection Logic: Network failures cause silent operation failures

Failure Modes

  1. Host Process Kill: QEMU processes orphaned, VM state inconsistent
  2. QEMU Crash: VM lost, no automatic restart
  3. Network Blip: Operations fail silently (no retry)

Recovery Procedures

  • Manual only: restart the PlasmaVMC server and reconcile VM state by hand
  • Gap: no automated recovery or failover

Recommended Improvements

  1. Implement VM health monitoring (heartbeat to VMs)
  2. Add reconnection logic with retry/backoff
  3. Consider restarting VMs on crash (watchdog pattern)
  4. Document expected behavior for host failures
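
The watchdog pattern mentioned above can be sketched as a restart policy that gives up after repeated crashes within a short window. `RestartPolicy` is a hypothetical illustration, not existing PlasmaVMC code:

```rust
use std::time::{Duration, Instant};

/// Hypothetical watchdog restart policy: restart a crashed VM unless it
/// has already crashed more than `max_crashes` times within `window`
/// (to avoid restart loops for persistently broken VMs).
struct RestartPolicy {
    max_crashes: usize,
    window: Duration,
    crashes: Vec<Instant>,
}

impl RestartPolicy {
    fn new(max_crashes: usize, window: Duration) -> Self {
        Self { max_crashes, window, crashes: Vec::new() }
    }

    /// Record a crash observed at `now`; returns true if the VM
    /// should be restarted, false if the watchdog should give up.
    fn on_crash(&mut self, now: Instant) -> bool {
        // Drop crash records that have aged out of the window.
        self.crashes.retain(|t| now.duration_since(*t) < self.window);
        self.crashes.push(now);
        self.crashes.len() <= self.max_crashes
    }
}
```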

IAM (Identity & Access Management)

Current Capabilities ✓

  • Token-based Auth: JWT validation
  • ChainFire Backend: Inherits ChainFire's HA properties

Documented Gaps ⚠️

  • No Retry Mechanism: Network failures cascade to all services
  • No Connection Pooling: Each request creates new connection
  • Auth Failures: Cascade to dependent services without graceful degradation
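
The missing retry mechanism could take roughly this shape: a generic wrapper that retries transient failures with doubling backoff. This is an illustrative sketch; `retry` is hypothetical, not existing IAM code:

```rust
use std::time::Duration;

/// Hypothetical retry wrapper for IAM calls: retry a fallible operation
/// up to `attempts` times, sleeping with a doubling delay between tries.
fn retry<T, E>(
    mut attempts: u32,
    mut delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempts <= 1 => return Err(e),
            Err(_) => {
                attempts -= 1;
                std::thread::sleep(delay);
                delay = delay.saturating_mul(2);
            }
        }
    }
}
```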

Failure Modes

  1. IAM Service Down: All authenticated operations fail
  2. Network Failure: No retry, immediate failure

Recovery Procedures

  • Restart IAM service (automatic service restart via systemd recommended)

PrismNet (SDN Controller)

Current Capabilities ✓

  • OVN Integration: Network topology management

Documented Gaps ⚠️

  • Not yet evaluated: T040 focused on core services
  • Reconnection: likely needs retry logic for OVN connections

Next Steps

  • Evaluate PrismNet HA behavior under OVN failures
  • Test network partition scenarios

Watch Streams (Event Propagation)

Documented Gaps ⚠️

  • No Auto-Reconnect: Watch streams break on error, require manual restart
  • No Buffering: Events lost during disconnection
  • No Backpressure: Fast event sources can overwhelm slow consumers

Failure Modes

  1. Connection Drop: Watch stream terminates, no automatic recovery
  2. Event Loss: Events produced during downtime are missed

Recommended Improvements

  1. Implement watch reconnection with resume-from-last-seen
  2. Add event buffering/queuing
  3. Add backpressure handling for slow consumers
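
Resume-from-last-seen, as recommended above, assumes the server retains a revision-ordered, replayable event log. A minimal hypothetical sketch of the client-side filter (`Event` and `resume_after` are illustrative, not PlasmaCloud types):

```rust
/// Hypothetical watch event with a monotonically increasing revision.
#[derive(Debug, PartialEq)]
struct Event {
    revision: u64,
    key: String,
}

/// On reconnect, request only events newer than the highest revision the
/// client already applied, so events produced during the disconnection
/// are replayed rather than lost.
fn resume_after(log: &[Event], last_seen: u64) -> Vec<&Event> {
    log.iter().filter(|e| e.revision > last_seen).collect()
}
```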

Testing Approach Summary

Validation Levels

| Level | Scope | Status |
| --- | --- | --- |
| Unit Tests | Algorithm correctness | ✓ Complete (8/8 tests) |
| Integration Tests | Component interaction | ✓ Complete (3-node cluster) |
| Operational Tests | Process kill, restart, partition | ⚠️ Deferred to T039 |

Rationale for Deferral

  • Unit tests validate: Raft algorithm correctness, consensus safety, data consistency
  • Operational tests require: Real distributed nodes, shared storage, network infrastructure
  • T039 (Production Deployment): Better environment for operational resilience testing with actual hardware

Gap Summary by Priority

P0 Gaps (Critical for Production)

  • PlasmaVMC: No automatic VM failover or health monitoring
  • IAM: No retry/reconnection logic
  • Watch Streams: No auto-reconnect

P1 Gaps (Important but Mitigable)

  • Raft: Graceful shutdown for clean node removal
  • PlasmaVMC: Live migration implementation
  • Network partition: Cross-datacenter failure scenarios

P2 Gaps (Enhancement)

  • FlareDB: Multi-region replication
  • PrismNet: Network failure recovery testing

Operational Recommendations

Pre-Production Checklist

  1. Monitoring: Implement health checks for all critical services
  2. Alerting: Set up alerts for leader changes, node failures
  3. Runbooks: Create failure recovery procedures for each component
  4. Backup: Regular snapshots of ChainFire data
  5. Testing: Run operational failure tests in T039 staging environment

Production Deployment (T039)

  • Test process kill/restart scenarios on real hardware
  • Validate network partition handling
  • Measure recovery time objectives (RTO)
  • Verify data consistency under failures

References

  • T040 Task YAML: docs/por/T040-ha-validation/task.yaml
  • Test Runbooks: docs/por/T040-ha-validation/s2-raft-resilience-runbook.md, s3-plasmavmc-ha-runbook.md, s4-test-scenarios.md
  • Custom Raft Tests: chainfire/crates/chainfire-raft/tests/leader_election.rs

Last Updated: 2025-12-12 01:19 JST by PeerB