# T040.S4 Service Reconnection Test Scenarios

## Overview

Test scenarios for validating service reconnection behavior after transient failures.

Test Environment: Option B2 (Local Multi-Instance)

Approved: 2025-12-11

Setup: 3 instances per service running on localhost with different ports

  - ChainFire: ports 2379, 2380, 2381 (or similar)
  - FlareDB: ports 5000, 5001, 5002 (or similar)

Failure Simulation Methods (adapted from VM approach):

  - Process kill: `kill -9 <pid>` simulates sudden node failure
  - SIGTERM: `kill <pid>` simulates graceful shutdown
  - Port blocking: `iptables -A INPUT -p tcp --dport <port> -j DROP` (requires root)
  - Pause: `kill -STOP <pid>` / `kill -CONT <pid>` simulates a freeze

Note: Cross-VM network partition tests are deferred to T039 (production deployment).
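The pause/resume method above can be verified end to end before pointing it at a real service. This is a minimal sketch using a background `sleep` as a stand-in process; substitute the PID of an actual service instance.

```shell
# Demonstrate the pause/resume simulation on a stand-in process
# (a background `sleep`; substitute a real service PID).
sleep 60 &
pid=$!

kill -STOP "$pid"                       # simulate a freeze
sleep 0.2
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "paused state: $state"             # T = stopped

kill -CONT "$pid"                       # resume
sleep 0.2
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "resumed state: $state"            # S = sleeping (running again)

kill "$pid"                             # clean up
```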

## Current State Analysis

### Services WITH Reconnection Logic

| Service   | Mechanism                                                   | Location                                            |
|-----------|-------------------------------------------------------------|-----------------------------------------------------|
| ChainFire | Exponential backoff (3 retries, 2.0x multiplier, 500ms-30s) | `chainfire/crates/chainfire-api/src/raft_client.rs` |
| FlareDB   | PD client auto-reconnect (10s cycle), connection pooling    | `flaredb/crates/flaredb-server/src/main.rs:283-356` |

### Services WITHOUT Reconnection Logic (GAPS)

| Service       | Gap                               | Risk                                          |
|---------------|-----------------------------------|-----------------------------------------------|
| PlasmaVMC     | No retry/reconnection             | VM operations fail silently on a network blip |
| IAM           | No retry mechanism                | Auth failures cascade to all services         |
| Watch streams | Break on error, no auto-reconnect | Config/event propagation stops                |

## Test Scenarios

### Scenario 1: ChainFire Raft Recovery

Goal: Verify Raft RPC retry logic works under network failures

Steps:

  1. Start 3-node ChainFire cluster
  2. Write key-value pair
  3. Use iptables to block traffic to leader node
  4. Attempt read/write operation from client
  5. Observe retry behavior (should retry with backoff)
  6. Unblock traffic
  7. Verify operation completes or fails gracefully

Expected:

  - Client retries up to 3 times with exponential backoff
  - Clear error message on final failure
  - No data corruption

Evidence: Client logs showing retry attempts, timing
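The retry behavior expected in step 5 can be sketched as a backoff loop using the ChainFire parameters listed earlier (3 retries, 500ms initial delay, 2.0x multiplier). `client_op` is a stand-in that always fails, simulating a blocked leader; substitute a real read/write against the cluster.

```shell
# Backoff sketch: 3 attempts, 500ms initial delay, 2.0x multiplier.
client_op() { false; }    # stand-in: replace with a real client call

delay_ms=500
ok=0
for attempt in 1 2 3; do
  if client_op; then
    ok=1
    echo "attempt $attempt: succeeded"
    break
  fi
  echo "attempt $attempt: failed, backing off ${delay_ms}ms"
  sleep "$(awk "BEGIN { print $delay_ms / 1000 }")"
  delay_ms=$(( delay_ms * 2 ))
done
[ "$ok" -eq 1 ] || echo "all attempts failed: surface a clear error to the caller"
```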


### Scenario 2: FlareDB PD Reconnection

Goal: Verify FlareDB server reconnects to ChainFire (PD) after restart

Steps:

  1. Start ChainFire cluster (PD)
  2. Start FlareDB server connected to PD
  3. Verify heartbeat working (check logs)
  4. Kill ChainFire leader
  5. Wait for new leader election
  6. Observe FlareDB reconnection behavior

Expected:

  - FlareDB logs "Reconnected to PD" within 10-20s
  - Client operations resume after reconnection
  - No data loss during transition

Evidence: Server logs, client operation success
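Step 6 can be automated by polling the server log for the reconnect message with a deadline. The "Reconnected to PD" text and log location are assumptions about FlareDB's output; adjust to match the real log. The demo below writes the line into a temp file from a background job to exercise the helper.

```shell
# Wait until a log line appears, with a timeout.
wait_for_log() {   # usage: wait_for_log <file> <pattern> <timeout_s>
  deadline=$(( $(date +%s) + $3 ))
  while [ "$(date +%s)" -le "$deadline" ]; do
    if grep -q "$2" "$1" 2>/dev/null; then
      echo "found: $2"
      return 0
    fi
    sleep 1
  done
  echo "timeout waiting for: $2"
  return 1
}

# Demo against a stand-in log file (replace with the real server log):
log=$(mktemp)
( sleep 2; echo "INFO Reconnected to PD" >> "$log" ) &
wait_for_log "$log" "Reconnected to PD" 10
```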


### Scenario 3: Network Partition (iptables)

Goal: Verify cluster behavior during network partition

Steps:

  1. Start 3-node cluster (ChainFire + FlareDB)
  2. Write data to cluster
  3. Create a network partition: `iptables -A INPUT -s <node2-ip> -j DROP`
  4. Attempt writes (should succeed with 2/3 quorum)
  5. Kill another node (should lose quorum)
  6. Verify writes fail gracefully
  7. Restore partition, verify cluster recovery

Expected:

  - 2/3 nodes: writes succeed
  - 1/3 nodes: writes fail, no data corruption
  - Recovery: cluster resumes normal operation

Evidence: Write success/failure, data consistency check
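The quorum arithmetic behind the expectations above: Raft writes need a strict majority of the 3 voting members, which is why losing one node is survivable and losing two is not.

```shell
# Majority quorum for a 3-node cluster.
total=3
quorum=$(( total / 2 + 1 ))        # = 2 for 3 voters
for up in 3 2 1; do
  if [ "$up" -ge "$quorum" ]; then
    echo "$up/$total nodes up: writes succeed"
  else
    echo "$up/$total nodes up: writes fail"
  fi
done
```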


### Scenario 4: Service Restart Recovery

Goal: Verify clients reconnect after service restart

Steps:

  1. Start service (FlareDB/ChainFire)
  2. Connect client
  3. Perform operations
  4. Restart service (systemctl restart or SIGTERM + start)
  5. Attempt client operations

Expected:

  - ChainFire: client reconnects via retry logic
  - FlareDB: connection pool creates a new connection
  - IAM: manual reconnect required (gap)

Evidence: Client operation success after restart
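Steps 4-5 can be sketched with a stand-in service process (a background `sleep` here; substitute the real FlareDB/ChainFire binary, and a real client call for the liveness probe).

```shell
# Restart the stand-in service, then let the "client" retry until it
# can reach the new process.
sleep 300 & svc=$!
kill "$svc"                          # step 4: SIGTERM the service
wait "$svc" 2>/dev/null
sleep 300 & svc=$!                   # step 4: start it again

# step 5: client-side reconnect loop
reconnected=0
for attempt in 1 2 3; do
  if kill -0 "$svc" 2>/dev/null; then    # stand-in liveness probe
    reconnected=1
    echo "client reconnected on attempt $attempt"
    break
  fi
  sleep 1
done
kill "$svc"                          # clean up
```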


### Scenario 5: Watch Stream Recovery (GAP DOCUMENTATION)

Goal: Document watch stream behavior on connection loss

Steps:

  1. Start ChainFire server
  2. Connect watch client
  3. Verify events received
  4. Kill server
  5. Observe client behavior

Expected: Watch breaks; no auto-reconnect.

GAP: Needs an application-level reconnect loop.

Evidence: Client logs showing stream termination
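The gap above can be closed with an application-level reconnect loop. `watch_client` is a hypothetical stand-in for the real watch command; here it "breaks" twice before staying up, to show the loop recovering from stream termination.

```shell
# Application-level reconnect wrapper around a watch stream.
runs=0
watch_client() {
  runs=$(( runs + 1 ))
  [ "$runs" -ge 3 ]        # stand-in: stream breaks on the first two runs
}

until watch_client; do
  echo "watch stream broke (run $runs), reconnecting in 1s"
  sleep 1
done
echo "watch stream established (run $runs)"
```

A production wrapper would also resume the watch from the last seen revision so no events are lost across reconnects.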


## Test Matrix

| Scenario              | ChainFire | FlareDB | PlasmaVMC | IAM     |
|-----------------------|-----------|---------|-----------|---------|
| S1: Raft Recovery     | TEST      | n/a     | n/a       | n/a     |
| S2: PD Reconnect      | n/a       | TEST    | n/a       | n/a     |
| S3: Network Partition | TEST      | TEST    | SKIP      | SKIP    |
| S4: Restart Recovery  | TEST      | TEST    | DOC-GAP   | DOC-GAP |
| S5: Watch Recovery    | DOC-GAP   | DOC-GAP | n/a       | n/a     |

## Prerequisites (Option B2 - Local Multi-Instance)

  - 3 ChainFire instances running on localhost (provided by S1)
  - 3 FlareDB instances running on localhost (provided by S1)
  - Separate data directories per instance
  - Logging enabled at DEBUG level for evidence
  - Process management tools (`kill`, `pkill`)
  - Optional: `iptables` for port-blocking tests (requires root)
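The per-instance layout can be sketched as below. The binary name and flags are assumptions, not confirmed CLI options; only the directory-per-instance layout and the port scheme come from this document.

```shell
# Create a data directory per instance and show the intended launch
# commands (hypothetical binary name and flags).
for i in 0 1 2; do
  dir="./data/chainfire-$i"
  port=$(( 2379 + i ))
  mkdir -p "$dir"
  echo "would start: chainfire-server --data-dir $dir --port $port"
done
```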

## Success Criteria

  - All TEST scenarios pass
  - GAP scenarios documented with recommendations
  - No data loss in any failure scenario
  - Clear error messages on unrecoverable failures

## Future Work (Identified Gaps)

  1. PlasmaVMC: Add retry logic for remote service calls
  2. IAM Client: Add exponential backoff retry
  3. Watch streams: Add auto-reconnect wrapper