# T040.S2 Raft Cluster Resilience Test Runbook

## Prerequisites

- S1 complete: 3 ChainFire + 3 FlareDB instances running
- All instances in same directory structure:

```
/tmp/t040/
  chainfire-1/   (data-dir, ports 2379/2380)
  chainfire-2/   (data-dir, ports 2381/2382)
  chainfire-3/   (data-dir, ports 2383/2384)
  flaredb-1/     (data-dir, port 5001)
  flaredb-2/     (data-dir, port 5002)
  flaredb-3/     (data-dir, port 5003)
```

## Test 1: Single Node Failure (Quorum Maintained)

### 1.1 ChainFire Leader Kill

```bash
# Find leader (check logs or use API)
# Kill leader node (e.g., node-1)
kill -9 $(pgrep -f "chainfire-server.*2379")

# Verify cluster still works (2/3 quorum)
# From remaining node (port 2381):
grpcurl -plaintext localhost:2381 chainfire.api.Kv/Put \
  -d '{"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}'

# Expected: Operation succeeds, new leader elected
# Evidence: Logs show "became leader" on surviving node
```

### 1.2 Verify New Leader Election

```bash
# Check cluster status
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader

# Expected: Returns node_id != killed node
# Timing: Leader election should complete within 5-10 seconds
```

### 1.3 Restart Failed Node

```bash
# Restart node-1
./chainfire-server --config /tmp/t040/chainfire-1/config.toml &

# Wait for rejoin (check logs)
# Verify cluster is 3/3 again
grpcurl -plaintext localhost:2379 chainfire.api.Cluster/GetMembers

# Expected: All 3 nodes listed, cluster healthy
```

---

## Test 2: FlareDB Node Failure

### 2.1 Write Test Data

```bash
# Write to FlareDB cluster
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawPut \
  -d '{"key":"dGVzdC1rZXk=","value":"dGVzdC12YWx1ZQ==","cf":"default"}'

# Verify read
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawGet \
  -d '{"key":"dGVzdC1rZXk=","cf":"default"}'
```

### 2.2 Kill FlareDB Node

```bash
# Kill node-2
kill -9 $(pgrep -f "flaredb-server.*5002")

# Verify writes still work (2/3 quorum)
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawPut \
  -d \
    '{"key":"YWZ0ZXItZmFpbA==","value":"c3RpbGwtd29ya3M="}'

# Verify read from another node
grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawGet \
  -d '{"key":"YWZ0ZXItZmFpbA=="}'

# Expected: Both operations succeed
```

### 2.3 Data Consistency Check

```bash
# Read all keys from surviving nodes - should match
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawScan \
  -d '{"start_key":"","end_key":"//8=","limit":100}'

grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawScan \
  -d '{"start_key":"","end_key":"//8=","limit":100}'

# Expected: Identical results (no data loss)
```

---

## Test 3: Quorum Loss (2 of 3 Nodes Down)

### 3.1 Kill Two of Three Nodes

```bash
# ChainFire node-1 was restored in Test 1.3, so all 3 nodes are up here.
# Kill node-2, then node-3, leaving only node-1 (no quorum)
kill -9 $(pgrep -f "chainfire-server.*2381")
kill -9 $(pgrep -f "chainfire-server.*2383")

# Attempt write
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
  -d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}'

# Expected: Timeout or error (no quorum)
# Error message should indicate cluster unavailable
```

### 3.2 Graceful Degradation

```bash
# Verify reads still work (from local Raft log)
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Get \
  -d '{"key":"dGVzdA=="}'

# Expected: Read succeeds (stale read allowed)
# OR: Read fails with clear "no quorum" error
```

### 3.3 Recovery

```bash
# Restart node-3 (node-1 + node-3 restores a 2/3 quorum)
./chainfire-server --config /tmp/t040/chainfire-3/config.toml &

# Wait for quorum restoration
# Retry write
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
  -d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}'

# Expected: Write succeeds, cluster operational
```

---

## Test 4: Process Pause (Simulated Freeze)

```bash
# Precondition: all 3 nodes running (restart node-2 if still down from Test 3)
./chainfire-server --config /tmp/t040/chainfire-2/config.toml &

# Pause leader process
kill -STOP $(pgrep -f "chainfire-server.*2379")

# Wait for heartbeat timeout (typically 1-5 seconds)
sleep 10

# Verify new leader elected
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader

# Resume paused process
kill -CONT $(pgrep -f "chainfire-server.*2379")

# Verify old leader rejoins as follower
# (check logs for "became follower" message)
```

---
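The "wait for rejoin" and "wait for quorum restoration" steps above can be scripted instead of eyeballed, which also yields the recovery timestamps needed for evidence collection. A minimal polling helper (`wait_until_up` is a hypothetical name, not part of either CLI; the probe command is anything that exits 0 when the cluster answers):

```bash
# Hypothetical helper: run the probe command until it succeeds or a timeout
# (in seconds) expires, and report how long recovery took.
wait_until_up() {
  local timeout="$1"; shift
  local start elapsed
  start=$(date +%s)
  while true; do
    if "$@" >/dev/null 2>&1; then
      elapsed=$(( $(date +%s) - start ))
      echo "up after ${elapsed}s"
      return 0
    fi
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "still down after ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# Example: measure leader election time after a kill, using the GetLeader
# RPC from Test 1.2 against any surviving node:
#   wait_until_up 30 grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
```

Run it immediately after injecting a failure; the elapsed figure feeds the "<10s" criterion in the Success Criteria table.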
## Evidence Collection

For each test, record:

1. **Timestamps**: When failure injected, when detected, when recovered
2. **Leader transitions**: Old leader ID → New leader ID
3. **Data verification**: Keys written during failure, confirmed after recovery
4. **Error messages**: Exact error returned during quorum loss

### Log Snippets to Capture

```bash
# ChainFire leader election
grep -i "leader\|election\|became" /tmp/t040/chainfire-*/logs/*

# FlareDB Raft state
grep -i "raft\|leader\|commit" /tmp/t040/flaredb-*/logs/*
```

---

## Success Criteria

| Test | Expected | Pass/Fail |
|------|----------|-----------|
| 1.1 Leader kill | Cluster continues, new leader in <10s | |
| 1.2 Leader election | Correct leader ID returned | |
| 1.3 Node rejoin | Cluster returns to 3/3 | |
| 2.1-2.3 FlareDB quorum | Writes succeed with 2/3, data consistent | |
| 3.1-3.3 Quorum loss | Graceful error, recovery works | |
| 4 Process pause | Leader election on timeout, old node rejoins | |

---

## Known Gaps (Document, Don't Block)

1. **Cross-network partition**: Not tested (requires iptables/network namespace)
2. **Disk failure**: Not simulated
3. **Clock skew**: Not tested

These are deferred to T039 (production deployment) or future work.
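When the deferred partition test is eventually picked up, one possible shape is an iptables wrapper along these lines. This is an untested sketch: `partition_node` is a hypothetical helper, it requires root, and it assumes the Raft peer ports from the Prerequisites (it would also need adapting for loopback traffic, since all nodes here share one host). `DRY_RUN=1` only prints the commands.

```bash
# Untested sketch for the cross-network partition gap above.
# Drops TCP traffic to/from one node's Raft port; requires root unless
# DRY_RUN=1, which echoes the iptables commands instead of running them.
partition_node() {
  local port="$1" action="${2:--A}"   # -A adds the DROP rules, -D removes them
  local ipt="iptables"
  [ "${DRY_RUN:-0}" = "1" ] && ipt="echo iptables"
  $ipt "$action" INPUT  -p tcp --dport "$port" -j DROP
  $ipt "$action" OUTPUT -p tcp --sport "$port" -j DROP
}

# Example: isolate chainfire-2's Raft peer port (2382), then heal:
#   partition_node 2382        # inject partition
#   partition_node 2382 -D     # heal
```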