T040.S2 Raft Cluster Resilience Test Runbook
Prerequisites
- S1 complete: 3 ChainFire + 3 FlareDB instances running
- All instances share the same directory layout (a launch sketch follows the layout below):
/tmp/t040/
  chainfire-1/  (data-dir, ports 2379/2380)
  chainfire-2/  (data-dir, ports 2381/2382)
  chainfire-3/  (data-dir, ports 2383/2384)
  flaredb-1/    (data-dir, port 5001)
  flaredb-2/    (data-dir, port 5002)
  flaredb-3/    (data-dir, port 5003)
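If the S1 instances need to be (re)started, a minimal launch sketch is below. The chainfire-server invocation matches the restart commands used later in this runbook; the flaredb-server --config flag is an assumption and may differ in your build.
# Launch sketch (assumption: each instance dir contains a config.toml and
# flaredb-server accepts --config the same way chainfire-server does)
for i in 1 2 3; do
  ./chainfire-server --config /tmp/t040/chainfire-$i/config.toml &
  ./flaredb-server --config /tmp/t040/flaredb-$i/config.toml &
done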
Test 1: Single Node Failure (Quorum Maintained)
1.1 ChainFire Leader Kill
# Find leader (check logs or use API)
# Kill leader node (e.g., node-1)
kill -9 $(pgrep -f "chainfire-server.*2379")
# Verify cluster still works (2/3 quorum)
# From remaining node (port 2381):
grpcurl -plaintext localhost:2381 chainfire.api.Kv/Put \
-d '{"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}'
# Expected: Operation succeeds, new leader elected
# Evidence: Logs show "became leader" on surviving node
1.2 Verify New Leader Election
# Check cluster status
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
# Expected: Returns node_id != killed node
# Timing: Leader election should complete within 5-10 seconds
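To put a number on the election time, a polling loop can be started right after the kill. The JSON field name checked here is an assumption; adjust the grep to whatever GetLeader actually returns.
# Sketch: record how long until the surviving node reports a leader.
# Assumption: GetLeader returns JSON containing a node_id field once a leader exists.
start=$(date +%s)
until grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader 2>/dev/null | grep -q '"node_id"'; do
  sleep 1
done
echo "leader reported after $(( $(date +%s) - start ))s"
# Compare the reported node_id against the killed node's ID by hand.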
1.3 Restart Failed Node
# Restart node-1
./chainfire-server --config /tmp/t040/chainfire-1/config.toml &
# Wait for rejoin (check logs)
# Verify cluster is 3/3 again
grpcurl -plaintext localhost:2379 chainfire.api.Cluster/GetMembers
# Expected: All 3 nodes listed, cluster healthy
Test 2: FlareDB Node Failure
2.1 Write Test Data
# Write to FlareDB cluster
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawPut \
-d '{"key":"dGVzdC1rZXk=","value":"dGVzdC12YWx1ZQ==","cf":"default"}'
# Verify read
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawGet \
-d '{"key":"dGVzdC1rZXk=","cf":"default"}'
2.2 Kill FlareDB Node
# Kill node-2
kill -9 $(pgrep -f "flaredb-server.*5002")
# Verify writes still work (2/3 quorum)
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawPut \
-d '{"key":"YWZ0ZXItZmFpbA==","value":"c3RpbGwtd29ya3M="}'
# Verify read from another node
grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawGet \
-d '{"key":"YWZ0ZXItZmFpbA=="}'
# Expected: Both operations succeed
2.3 Data Consistency Check
# Read all keys from surviving nodes - should match
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawScan \
-d '{"start_key":"","end_key":"//8=","limit":100}'
grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawScan \
-d '{"start_key":"","end_key":"//8=","limit":100}'
# Expected: Identical results (no data loss)
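To make the comparison mechanical, capture both scans and diff them (a sketch; if RawScan does not return entries in a stable order, sort both files before diffing):
# Sketch: capture both scans and diff; empty diff output means no divergence.
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawScan \
-d '{"start_key":"","end_key":"//8=","limit":100}' > /tmp/t040/scan-5001.json
grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawScan \
-d '{"start_key":"","end_key":"//8=","limit":100}' > /tmp/t040/scan-5003.json
diff /tmp/t040/scan-5001.json /tmp/t040/scan-5003.json && echo "scans identical"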
Test 3: Quorum Loss (2 of 3 Nodes Down)
3.1 Kill Second Node
# With node-2 already down, kill node-3
kill -9 $(pgrep -f "chainfire-server.*2383")
# Attempt write
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
-d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}'
# Expected: Timeout or error (no quorum)
# Error message should indicate cluster unavailable
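To keep the runbook moving and capture the exact error text for the evidence log, the no-quorum write can be wrapped in a coreutils timeout (a sketch):
# Sketch: cap the attempt at 15s; stderr (the error message) is saved as evidence.
timeout 15 grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
-d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}' \
2> /tmp/t040/no-quorum-error.txt
echo "exit code: $? (124 = hit the 15s timeout)"
cat /tmp/t040/no-quorum-error.txt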
3.2 Graceful Degradation
# Verify reads still work (served from the node's local state)
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Get \
-d '{"key":"dGVzdA=="}'
# Expected: Read succeeds (stale read allowed)
# OR: Read fails with clear "no quorum" error
3.3 Recovery
# Restart node-3
./chainfire-server --config /tmp/t040/chainfire-3/config.toml &
# Wait for quorum restoration
# Retry write
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
-d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}'
# Expected: Write succeeds, cluster operational
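Quorum restoration after the restart is not instantaneous, so a bounded retry loop avoids declaring a false failure (a sketch):
# Sketch: retry the recovery write for up to ~30s while node-3 rejoins.
for i in $(seq 1 30); do
  if grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
     -d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}' 2>/dev/null; then
    echo "write succeeded on attempt $i"
    break
  fi
  sleep 1
done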
Test 4: Process Pause (Simulated Freeze)
# Pause the current leader process (node-1 / port 2379 shown; substitute the actual leader's port)
kill -STOP $(pgrep -f "chainfire-server.*2379")
# Wait for heartbeat timeout (typically 1-5 seconds)
sleep 10
# Verify new leader elected
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
# Resume paused process
kill -CONT $(pgrep -f "chainfire-server.*2379")
# Verify old leader rejoins as follower
# (check logs for "became follower" message)
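Rather than tailing logs by hand, a bounded grep loop can wait for the step-down message (a sketch; the log path comes from the Evidence Collection section below, and the exact wording of the message may differ):
# Sketch: wait up to 30s for the resumed node to log that it became a follower.
# Adjust chainfire-1 to whichever node was paused.
for i in $(seq 1 30); do
  if grep -qi "became follower" /tmp/t040/chainfire-1/logs/* 2>/dev/null; then
    echo "paused node rejoined as follower"
    break
  fi
  sleep 1
done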
Evidence Collection
For each test, record:
- Timestamps: When failure injected, when detected, when recovered
- Leader transitions: Old leader ID → New leader ID
- Data verification: Keys written during failure, confirmed after recovery
- Error messages: Exact error returned during quorum loss
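A tiny helper keeps the timestamps in one place and in a consistent format (a sketch; the evidence file name is arbitrary):
# Sketch: append UTC-timestamped evidence lines to one file per run.
EVIDENCE=/tmp/t040/evidence-$(date +%Y%m%d-%H%M%S).log
note() { echo "$(date -u +%FT%TZ) $*" >> "$EVIDENCE"; }
# Usage:
#   note "1.1 killed chainfire node-1 (old leader)"
#   note "1.1 new leader reported by node on :2381"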
Log Snippets to Capture
# ChainFire leader election
grep -i "leader\|election\|became" /tmp/t040/chainfire-*/logs/*
# FlareDB Raft state
grep -i "raft\|leader\|commit" /tmp/t040/flaredb-*/logs/*
Success Criteria
| Test | Expected | Pass/Fail |
|---|---|---|
| 1.1 Leader kill | Cluster continues, new leader in <10s | |
| 1.2 Leader election | Correct leader ID returned | |
| 1.3 Node rejoin | Cluster returns to 3/3 | |
| 2.1-2.3 FlareDB quorum | Writes succeed with 2/3, data consistent | |
| 3.1-3.3 Quorum loss | Graceful error, recovery works | |
| 4 Process pause | Leader election on timeout, old node rejoins | |
Known Gaps (Document, Don't Block)
- Cross-network partition: Not tested (requires iptables/network namespace)
- Disk failure: Not simulated
- Clock skew: Not tested
These are deferred to T039 (production deployment) or future work.