# T040.S4 Service Reconnection Test Scenarios

## Overview

Test scenarios for validating service reconnection behavior after transient failures.

## Test Environment: Option B2 (Local Multi-Instance)

**Approved**: 2025-12-11

**Setup**: 3 instances per service running on localhost with different ports

- ChainFire: ports 2379, 2380, 2381 (or similar)
- FlareDB: ports 5000, 5001, 5002 (or similar)

**Failure Simulation Methods** (adapted from VM approach):

- **Process kill**: `kill -9 <pid>` simulates sudden node failure
- **SIGTERM**: `kill <pid>` simulates graceful shutdown
- **Port blocking**: `iptables -A INPUT -p tcp --dport <port> -j DROP` (if root)
- **Pause**: `kill -STOP <pid>` / `kill -CONT <pid>` simulates a freeze

**Note**: Cross-VM network partition tests are deferred to T039 (production deployment).

## Current State Analysis

### Services WITH Reconnection Logic

| Service | Mechanism | Location |
|---------|-----------|----------|
| ChainFire | Exponential backoff (3 retries, 2.0x multiplier, 500ms-30s) | `chainfire/crates/chainfire-api/src/raft_client.rs` |
| FlareDB | PD client auto-reconnect (10s cycle), connection pooling | `flaredb/crates/flaredb-server/src/main.rs:283-356` |

### Services WITHOUT Reconnection Logic (GAPS)

| Service | Gap | Risk |
|---------|-----|------|
| PlasmaVMC | No retry/reconnection | VM operations fail silently on network blip |
| IAM | No retry mechanism | Auth failures cascade to all services |
| Watch streams | Break on error, no auto-reconnect | Config/event propagation stops |

---

## Test Scenarios
### Scenario 1: ChainFire Raft Recovery

**Goal**: Verify Raft RPC retry logic works under network failures

**Steps**:

1. Start 3-node ChainFire cluster
2. Write key-value pair
3. Use `iptables` to block traffic to leader node
4. Attempt read/write operation from client
5. Observe retry behavior (should retry with backoff)
6. Unblock traffic
7. Verify operation completes or fails gracefully

**Expected**:

- Client retries up to 3 times with exponential backoff
- Clear error message on final failure
- No data corruption

**Evidence**: Client logs showing retry attempts, timing
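
The retry policy under test (3 retries, 2.0x multiplier, 500 ms initial delay capped at 30 s) can be sketched as a generic backoff loop. This is an illustrative stand-in, not the actual `raft_client.rs` implementation; the demo uses millisecond-scale delays and a simulated flaky RPC so it runs instantly.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_retries` times with exponential backoff.
/// Sketch only: mirrors the documented ChainFire defaults
/// (3 retries, 2.0x multiplier, 500 ms initial, 30 s cap).
fn retry_with_backoff<T, E>(
    max_retries: u32,
    initial_ms: u64,
    multiplier: f64,
    cap_ms: u64,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay_ms = initial_ms;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Final failure: surface the error to the caller.
            Err(e) if attempt >= max_retries => return Err(e),
            Err(_) => {
                sleep(Duration::from_millis(delay_ms.min(cap_ms)));
                delay_ms = ((delay_ms as f64) * multiplier) as u64;
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated flaky RPC: fails twice (blocked leader), then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(3, 1, 2.0, 30, || {
        calls += 1;
        if calls < 3 { Err("connection refused") } else { Ok("value") }
    });
    assert_eq!(result, Ok("value"));
    println!("succeeded after {calls} attempts");
}
```

In the real test the production delays apply, so three failed attempts should show roughly 0.5 s, 1 s, and 2 s gaps in the client logs.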

---

### Scenario 2: FlareDB PD Reconnection

**Goal**: Verify FlareDB server reconnects to ChainFire (PD) after restart

**Steps**:

1. Start ChainFire cluster (PD)
2. Start FlareDB server connected to PD
3. Verify heartbeat working (check logs)
4. Kill ChainFire leader
5. Wait for new leader election
6. Observe FlareDB reconnection behavior

**Expected**:

- FlareDB logs "Reconnected to PD" within 10-20s
- Client operations resume after reconnection
- No data loss during transition

**Evidence**: Server logs, client operation success
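
The behavior being exercised is a fixed-interval re-dial loop, which is why reconnection is expected within one or two 10 s cycles. A minimal sketch of that pattern follows; `try_connect` is a stand-in for the real PD dial, and the interval is shortened so the demo runs fast.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Probe the endpoint once per `interval` until it answers or we give up.
/// Returns the 1-based cycle on which reconnection succeeded, or None.
/// Sketch of the documented 10 s reconnect cycle, not the FlareDB source.
fn reconnect_cycle(
    interval: Duration,
    max_cycles: u32,
    mut try_connect: impl FnMut() -> bool,
) -> Option<u32> {
    for cycle in 1..=max_cycles {
        if try_connect() {
            return Some(cycle); // reconnected on this cycle
        }
        sleep(interval);
    }
    None // gave up: caller should surface a clear error
}

fn main() {
    // Simulate a leader election that completes after two failed probes.
    let mut probes = 0;
    let cycles = reconnect_cycle(Duration::from_millis(1), 10, || {
        probes += 1;
        probes >= 3
    });
    assert_eq!(cycles, Some(3));
    println!("reconnected after {probes} probes");
}
```

With a real 10 s interval, success on the second or third probe matches the expected 10-20 s reconnection window above.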

---

### Scenario 3: Network Partition (iptables)

**Goal**: Verify cluster behavior during network partition

**Steps**:

1. Start 3-node cluster (ChainFire + FlareDB)
2. Write data to cluster
3. Create network partition: `iptables -A INPUT -s <node2-ip> -j DROP`
4. Attempt writes (should succeed with 2/3 quorum)
5. Kill another node (should lose quorum)
6. Verify writes fail gracefully
7. Remove the partition rule, verify cluster recovery

**Expected**:

- 2/3 nodes: writes succeed
- 1/3 nodes: writes fail, no data corruption
- Recovery: cluster resumes normal operation

**Evidence**: Write success/failure, data consistency check
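
The pass/fail expectations above follow directly from the majority-quorum rule standard in Raft-style clusters: writes commit only while a strict majority of voting nodes is reachable. A one-line sketch:

```rust
/// Strict-majority quorum rule (standard Raft; assumed to match ChainFire).
fn has_quorum(live_nodes: usize, total_nodes: usize) -> bool {
    live_nodes >= total_nodes / 2 + 1
}

fn main() {
    assert!(has_quorum(2, 3));  // step 4: one node partitioned, writes succeed
    assert!(!has_quorum(1, 3)); // step 5: second node down, writes must fail
    assert!(has_quorum(3, 3));  // step 7: partition healed, normal operation
    println!("quorum checks pass");
}
```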

---

### Scenario 4: Service Restart Recovery

**Goal**: Verify clients reconnect after service restart

**Steps**:

1. Start service (FlareDB/ChainFire)
2. Connect client
3. Perform operations
4. Restart service (`systemctl restart` or SIGTERM + start)
5. Attempt client operations

**Expected ChainFire**: Client reconnects via retry logic
**Expected FlareDB**: Connection pool creates new connection
**Expected IAM**: Manual reconnect required (gap)

**Evidence**: Client operation success after restart
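
The FlareDB expectation relies on a validate-on-checkout pool: after a restart, every pooled connection is dead, so the pool must discard stale entries and dial fresh ones. A sketch of that pattern, where `Conn` and `dial` are illustrative stand-ins rather than FlareDB types:

```rust
use std::collections::VecDeque;

/// Toy connection whose liveness we can check (stand-in for a real socket).
#[derive(Debug)]
struct Conn { alive: bool }

struct Pool { idle: VecDeque<Conn> }

impl Pool {
    /// Hand out a healthy connection, discarding any stale ones found first.
    fn checkout(&mut self, dial: impl Fn() -> Conn) -> Conn {
        while let Some(conn) = self.idle.pop_front() {
            if conn.alive {
                return conn; // reuse a healthy pooled connection
            }
            // Stale connection from before the restart: drop it.
        }
        dial() // pool empty or all stale: dial a new connection
    }
}

fn main() {
    // After a service restart, every pooled connection is dead.
    let mut pool = Pool {
        idle: VecDeque::from([Conn { alive: false }, Conn { alive: false }]),
    };
    let conn = pool.checkout(|| Conn { alive: true });
    assert!(conn.alive);           // client operations succeed after restart
    assert!(pool.idle.is_empty()); // stale connections were discarded
    println!("checkout ok: {conn:?}");
}
```

If FlareDB's pool instead hands back a dead connection, step 5 will surface it as a client error rather than a transparent reconnect, which is exactly what this scenario distinguishes.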

---

### Scenario 5: Watch Stream Recovery (GAP DOCUMENTATION)

**Goal**: Document watch stream behavior on connection loss

**Steps**:

1. Start ChainFire server
2. Connect watch client
3. Verify events received
4. Kill server
5. Observe client behavior

**Expected**: Watch breaks, no auto-reconnect
**GAP**: Need application-level reconnect loop

**Evidence**: Client logs showing stream termination
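
The application-level reconnect loop this gap calls for would wrap the watch stream, and on termination reopen it from the last observed revision so no events are replayed or lost. A minimal sketch, where `WatchOutcome`, the revision numbers, and the `open_watch` closure are hypothetical stand-ins for the real ChainFire watch API:

```rust
/// What one pass over a (mock) watch stream can yield.
enum WatchOutcome {
    Event(u64),   // event carrying its revision
    Disconnected, // stream broke mid-flight
}

/// Reconnect wrapper: reopen the stream from the last seen revision each
/// time it breaks, up to `max_reconnects`. Returns (events, reconnects).
fn watch_with_reconnect(
    mut open_watch: impl FnMut(u64) -> Vec<WatchOutcome>,
    max_reconnects: u32,
) -> (Vec<u64>, u32) {
    let mut last_rev = 0;
    let mut events = Vec::new();
    let mut reconnects = 0;
    loop {
        for outcome in open_watch(last_rev) {
            match outcome {
                WatchOutcome::Event(rev) => {
                    last_rev = rev; // remember resume point
                    events.push(rev);
                }
                WatchOutcome::Disconnected => break,
            }
        }
        if reconnects >= max_reconnects {
            return (events, reconnects);
        }
        reconnects += 1; // stream ended: reopen from last_rev
    }
}

fn main() {
    // First stream delivers two events then dies; the wrapper resumes
    // from revision 2 and picks up the third event.
    let mut calls = 0;
    let (events, reconnects) = watch_with_reconnect(
        |from_rev| {
            calls += 1;
            match calls {
                1 => vec![
                    WatchOutcome::Event(1),
                    WatchOutcome::Event(2),
                    WatchOutcome::Disconnected,
                ],
                _ => {
                    assert_eq!(from_rev, 2); // resumed at the right revision
                    vec![WatchOutcome::Event(3)]
                }
            }
        },
        1,
    );
    assert_eq!(events, vec![1, 2, 3]);
    assert_eq!(reconnects, 1);
    println!("events: {events:?}, reconnects: {reconnects}");
}
```

A production version would add backoff between reopen attempts (as in the ChainFire retry policy) rather than reconnecting immediately.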

---

## Test Matrix

| Scenario | ChainFire | FlareDB | PlasmaVMC | IAM |
|----------|-----------|---------|-----------|-----|
| S1: Raft Recovery | TEST | n/a | n/a | n/a |
| S2: PD Reconnect | n/a | TEST | n/a | n/a |
| S3: Network Partition | TEST | TEST | SKIP | SKIP |
| S4: Restart Recovery | TEST | TEST | DOC-GAP | DOC-GAP |
| S5: Watch Recovery | DOC-GAP | DOC-GAP | n/a | n/a |

---

## Prerequisites (Option B2 - Local Multi-Instance)

- 3 ChainFire instances running on localhost (S1 provides)
- 3 FlareDB instances running on localhost (S1 provides)
- Separate data directories per instance
- Logging enabled at DEBUG level for evidence
- Process management tools (kill, pkill)
- Optional: iptables for port blocking tests (requires root)

## Success Criteria

- All TEST scenarios pass
- GAP scenarios documented with recommendations
- No data loss in any failure scenario
- Clear error messages on unrecoverable failures

## Future Work (Identified Gaps)

1. PlasmaVMC: Add retry logic for remote service calls
2. IAM Client: Add exponential backoff retry
3. Watch streams: Add auto-reconnect wrapper