# T040.S4 Service Reconnection Test Scenarios

## Overview

Test scenarios for validating service reconnection behavior after transient failures.

## Test Environment: Option B2 (Local Multi-Instance)

Approved: 2025-12-11

Setup: 3 instances per service running on localhost with different ports
- ChainFire: ports 2379, 2380, 2381 (or similar)
- FlareDB: ports 5000, 5001, 5002 (or similar)
Failure Simulation Methods (adapted from VM approach):
- Process kill: `kill -9 <pid>` simulates sudden node failure
- SIGTERM: `kill <pid>` simulates graceful shutdown
- Port blocking: `iptables -A INPUT -p tcp --dport <port> -j DROP` (if root)
- Pause: `kill -STOP <pid>` / `kill -CONT <pid>` simulates a freeze
Note: Cross-VM network partition tests deferred to T039 (production deployment)
## Current State Analysis

### Services WITH Reconnection Logic

| Service | Mechanism | Location |
|---|---|---|
| ChainFire | Exponential backoff (3 retries, 2.0x multiplier, 500ms-30s) | `chainfire/crates/chainfire-api/src/raft_client.rs` |
| FlareDB | PD client auto-reconnect (10s cycle), connection pooling | `flaredb/crates/flaredb-server/src/main.rs:283-356` |
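The backoff schedule in the table above (500ms base, 2.0x multiplier, 30s cap, 3 retries) can be sketched as a small helper. This is illustrative only, not the actual `raft_client.rs` implementation:

```rust
use std::time::Duration;

// Illustrative sketch of the backoff parameters described above:
// 500 ms base, 2.0x multiplier, capped at 30 s. Not the real client code.
fn backoff_delay(attempt: u32) -> Duration {
    const BASE_MS: u64 = 500;
    const CAP_MS: u64 = 30_000;
    let delay_ms = BASE_MS.saturating_mul(2u64.saturating_pow(attempt));
    Duration::from_millis(delay_ms.min(CAP_MS))
}

fn main() {
    // Three retries: waits of 500 ms, 1 s, 2 s before giving up.
    for attempt in 0..3 {
        println!("retry {} after {:?}", attempt + 1, backoff_delay(attempt));
    }
}
```

With these parameters the cap only matters for long retry chains; three retries stay well under 30s.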
### Services WITHOUT Reconnection Logic (GAPS)
| Service | Gap | Risk |
|---|---|---|
| PlasmaVMC | No retry/reconnection | VM operations fail silently on network blip |
| IAM | No retry mechanism | Auth failures cascade to all services |
| Watch streams | Break on error, no auto-reconnect | Config/event propagation stops |
## Test Scenarios

### Scenario 1: ChainFire Raft Recovery
Goal: Verify Raft RPC retry logic works under network failures
Steps:
- Start 3-node ChainFire cluster
- Write key-value pair
- Use `iptables` to block traffic to the leader node
- Attempt read/write operation from client
- Observe retry behavior (should retry with backoff)
- Unblock traffic
- Verify operation completes or fails gracefully
Expected:
- Client retries up to 3 times with exponential backoff
- Clear error message on final failure
- No data corruption
Evidence: Client logs showing retry attempts, timing
### Scenario 2: FlareDB PD Reconnection
Goal: Verify FlareDB server reconnects to ChainFire (PD) after restart
Steps:
- Start ChainFire cluster (PD)
- Start FlareDB server connected to PD
- Verify heartbeat working (check logs)
- Kill ChainFire leader
- Wait for new leader election
- Observe FlareDB reconnection behavior
Expected:
- FlareDB logs "Reconnected to PD" within 10-20s
- Client operations resume after reconnection
- No data loss during transition
Evidence: Server logs, client operation success
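The 10s reconnect cycle behind the 10-20s expectation can be sketched as a bounded retry loop. `reconnect` and `try_connect` are hypothetical stand-ins for the real PD client internals:

```rust
use std::time::Duration;

// Hypothetical sketch of a fixed-cycle reconnect loop like the 10 s PD
// cycle described above. Returns the cycle on which reconnection succeeded.
fn reconnect<F>(mut try_connect: F, cycle: Duration, max_cycles: u32) -> Option<u32>
where
    F: FnMut() -> bool,
{
    for n in 1..=max_cycles {
        if try_connect() {
            return Some(n); // reconnected on cycle n
        }
        // The real client would sleep(cycle) between attempts.
        let _ = cycle;
    }
    None
}

fn main() {
    // Simulate the PD leader being unreachable for two cycles, then recovering.
    let mut down_cycles = 2;
    let cycles = reconnect(
        || {
            if down_cycles > 0 {
                down_cycles -= 1;
                false
            } else {
                true
            }
        },
        Duration::from_secs(10),
        6,
    );
    println!("reconnected on cycle {:?}", cycles);
}
```

The worst-case reconnection delay is one full cycle after the new leader is up, which is why the expected window is wider than the cycle length itself.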
### Scenario 3: Network Partition (iptables)
Goal: Verify cluster behavior during network partition
Steps:
- Start 3-node cluster (ChainFire + FlareDB)
- Write data to cluster
- Create a network partition: `iptables -A INPUT -s <node2-ip> -j DROP`
- Attempt writes (should succeed with 2/3 quorum)
- Kill another node (should lose quorum)
- Verify writes fail gracefully
- Restore partition, verify cluster recovery
Expected:
- 2/3 nodes: writes succeed
- 1/3 nodes: writes fail, no data corruption
- Recovery: cluster resumes normal operation
Evidence: Write success/failure, data consistency check
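The 2/3-vs-1/3 expectations above follow from the Raft majority rule: writes commit only when more than half the cluster is reachable. A minimal illustrative check (not actual cluster code):

```rust
// Raft quorum rule: a write commits only if a strict majority of the
// cluster is reachable. Illustrative helper, not actual cluster code.
fn has_quorum(reachable: usize, cluster_size: usize) -> bool {
    reachable > cluster_size / 2
}

fn main() {
    assert!(has_quorum(2, 3));  // partitioned to 2/3 nodes: writes succeed
    assert!(!has_quorum(1, 3)); // down to 1/3 nodes: writes fail
    println!("3-node cluster quorum = {} nodes", 3 / 2 + 1);
}
```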
### Scenario 4: Service Restart Recovery
Goal: Verify clients reconnect after service restart
Steps:
- Start service (FlareDB/ChainFire)
- Connect client
- Perform operations
- Restart the service (`systemctl restart` or SIGTERM + start)
- Attempt client operations

Expected:
- ChainFire: client reconnects via retry logic
- FlareDB: connection pool creates a new connection
- IAM: manual reconnect required (gap)
Evidence: Client operation success after restart
### Scenario 5: Watch Stream Recovery (GAP DOCUMENTATION)
Goal: Document watch stream behavior on connection loss
Steps:
- Start ChainFire server
- Connect watch client
- Verify events received
- Kill server
- Observe client behavior
Expected: Watch breaks, no auto-reconnect
GAP: Need application-level reconnect loop
Evidence: Client logs showing stream termination
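The gap above calls for an application-level reconnect loop around the watch stream. A minimal sketch, where `watch_with_reconnect`, `connect`, and `run_stream` are hypothetical placeholders for the real watch client:

```rust
use std::thread::sleep;
use std::time::Duration;

// Sketch of an application-level reconnect wrapper for a watch stream.
// `connect` and `run_stream` are hypothetical placeholders: run_stream
// returns Err when the stream breaks (e.g. server killed).
fn watch_with_reconnect<C, R>(mut connect: C, mut run_stream: R, max_attempts: u32) -> bool
where
    C: FnMut() -> Result<(), String>,
    R: FnMut() -> Result<(), String>,
{
    for attempt in 1..=max_attempts {
        if connect().is_ok() {
            match run_stream() {
                Ok(()) => return true, // stream ended cleanly
                Err(e) => eprintln!("watch broke: {e} (attempt {attempt}); reconnecting"),
            }
        }
        sleep(Duration::from_millis(100)); // fixed delay between attempts
    }
    false // gave up after max_attempts
}

fn main() {
    // Simulate a stream that breaks twice, then stays up.
    let mut breaks_left = 2;
    let recovered = watch_with_reconnect(
        || Ok(()),
        || {
            if breaks_left > 0 {
                breaks_left -= 1;
                Err("stream terminated".to_string())
            } else {
                Ok(())
            }
        },
        5,
    );
    assert!(recovered);
    println!("watch recovered after reconnect");
}
```

A production version would use the exponential backoff already present in ChainFire's client rather than a fixed delay, and resume the watch from the last observed revision to avoid missing events.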
## Test Matrix
| Scenario | ChainFire | FlareDB | PlasmaVMC | IAM |
|---|---|---|---|---|
| S1: Raft Recovery | TEST | n/a | n/a | n/a |
| S2: PD Reconnect | n/a | TEST | n/a | n/a |
| S3: Network Partition | TEST | TEST | SKIP | SKIP |
| S4: Restart Recovery | TEST | TEST | DOC-GAP | DOC-GAP |
| S5: Watch Recovery | DOC-GAP | DOC-GAP | n/a | n/a |
## Prerequisites (Option B2 - Local Multi-Instance)
- 3 ChainFire instances running on localhost (S1 provides)
- 3 FlareDB instances running on localhost (S1 provides)
- Separate data directories per instance
- Logging enabled at DEBUG level for evidence
- Process management tools (kill, pkill)
- Optional: iptables for port blocking tests (requires root)
## Success Criteria
- All TEST scenarios pass
- GAP scenarios documented with recommendations
- No data loss in any failure scenario
- Clear error messages on unrecoverable failures
## Future Work (Identified Gaps)
- PlasmaVMC: Add retry logic for remote service calls
- IAM Client: Add exponential backoff retry
- Watch streams: Add auto-reconnect wrapper