# T040.S2 Raft Cluster Resilience Test Runbook

## Prerequisites

- S1 complete: 3 ChainFire + 3 FlareDB instances running
- All instances in the same directory structure:

```
/tmp/t040/
  chainfire-1/   (data-dir, ports 2379/2380)
  chainfire-2/   (data-dir, ports 2381/2382)
  chainfire-3/   (data-dir, ports 2383/2384)
  flaredb-1/     (data-dir, port 5001)
  flaredb-2/     (data-dir, port 5002)
  flaredb-3/     (data-dir, port 5003)
```
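
Before injecting any failures, it helps to confirm all six processes are actually up. A minimal sketch (the `check_nodes` name is ours; the patterns mirror the `pgrep` patterns used by the kill commands later in this runbook):

```shell
#!/usr/bin/env bash
# Print OK/DOWN for each expected process. Patterns match the
# pgrep -f patterns used by the kill commands in the tests below.
check_nodes() {
  local pat
  for pat in \
      "chainfire-server.*2379" "chainfire-server.*2381" "chainfire-server.*2383" \
      "flaredb-server.*5001" "flaredb-server.*5002" "flaredb-server.*5003"; do
    if pgrep -f "$pat" >/dev/null; then
      echo "OK   $pat"
    else
      echo "DOWN $pat"
    fi
  done
}
check_nodes
```

All six lines should read `OK` before starting Test 1.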

## Test 1: Single Node Failure (Quorum Maintained)

### 1.1 ChainFire Leader Kill

```bash
# Find the leader (check logs or use the API)
# Kill the leader node (e.g., node-1)
kill -9 $(pgrep -f "chainfire-server.*2379")

# Verify the cluster still works (2/3 quorum)
# From a remaining node (port 2381):
grpcurl -plaintext localhost:2381 chainfire.api.Kv/Put \
  -d '{"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}'

# Expected: operation succeeds, new leader elected
# Evidence: logs show "became leader" on a surviving node
```
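
The opaque `key`/`value` strings in these payloads are just the plain-text key and value, base64-encoded (grpcurl's JSON encoding represents bytes fields as base64). A small helper for building them (the `put_payload` name is ours):

```shell
#!/usr/bin/env bash
# Build a Put payload from plain-text key/value; bytes fields are
# base64 in grpcurl's JSON encoding. Helper name is illustrative.
put_payload() {
  local k v
  k=$(printf '%s' "$1" | base64)
  v=$(printf '%s' "$2" | base64)
  printf '{"key":"%s","value":"%s"}' "$k" "$v"
}

put_payload test after-failure
# → {"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}
```

Usage: `grpcurl -plaintext localhost:2381 chainfire.api.Kv/Put -d "$(put_payload test after-failure)"`.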

### 1.2 Verify New Leader Election

```bash
# Check cluster status
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader

# Expected: returns a node_id != the killed node
# Timing: leader election should complete within 5-10 seconds
```
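
Rather than sleeping a fixed interval, the election can be polled until it completes. A generic retry helper, sketched (the `wait_for` name is ours; the `GetLeader` call is the one used above):

```shell
#!/usr/bin/env bash
# Retry a command once per second until it succeeds or attempts run out.
wait_for() {
  local tries=$1; shift
  local i
  for ((i = 0; i < tries; i++)); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# Example: wait up to 10s for a leader to be reported
# wait_for 10 grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
```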

### 1.3 Restart Failed Node

```bash
# Restart node-1
./chainfire-server --config /tmp/t040/chainfire-1/config.toml &

# Wait for rejoin (check logs)
# Verify cluster is 3/3 again
grpcurl -plaintext localhost:2379 chainfire.api.Cluster/GetMembers

# Expected: all 3 nodes listed, cluster healthy
```

---

## Test 2: FlareDB Node Failure

### 2.1 Write Test Data

```bash
# Write to the FlareDB cluster
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawPut \
  -d '{"key":"dGVzdC1rZXk=","value":"dGVzdC12YWx1ZQ==","cf":"default"}'

# Verify read
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawGet \
  -d '{"key":"dGVzdC1rZXk=","cf":"default"}'
```

### 2.2 Kill FlareDB Node

```bash
# Kill node-2
kill -9 $(pgrep -f "flaredb-server.*5002")

# Verify writes still work (2/3 quorum)
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawPut \
  -d '{"key":"YWZ0ZXItZmFpbA==","value":"c3RpbGwtd29ya3M=","cf":"default"}'

# Verify read from another node
grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawGet \
  -d '{"key":"YWZ0ZXItZmFpbA==","cf":"default"}'

# Expected: both operations succeed
```

### 2.3 Data Consistency Check

```bash
# Read all keys from the surviving nodes - results should match
grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawScan \
  -d '{"start_key":"","end_key":"//8=","limit":100}'

grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawScan \
  -d '{"start_key":"","end_key":"//8=","limit":100}'

# Expected: identical results (no data loss)
```
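
Eyeballing two scan outputs is error-prone; diffing saved dumps is mechanical. A sketch (the `compare_dumps` name and the dump file paths are ours; sorting guards against line-ordering differences between nodes):

```shell
#!/usr/bin/env bash
# Exit 0 iff two saved RawScan dumps contain the same lines.
compare_dumps() {
  diff <(sort "$1") <(sort "$2") >/dev/null
}

# Usage sketch: save each node's scan, then compare.
# grpcurl -plaintext localhost:5001 flaredb.kv.KvRaw/RawScan \
#   -d '{"start_key":"","end_key":"//8=","limit":100}' > /tmp/t040/scan-5001.json
# grpcurl -plaintext localhost:5003 flaredb.kv.KvRaw/RawScan \
#   -d '{"start_key":"","end_key":"//8=","limit":100}' > /tmp/t040/scan-5003.json
# compare_dumps /tmp/t040/scan-5001.json /tmp/t040/scan-5003.json && echo "consistent"
```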

---

## Test 3: Quorum Loss (2 of 3 Nodes Down)

### 3.1 Kill Two ChainFire Nodes

```bash
# Test 2 killed a FlareDB node, so all three ChainFire nodes are still
# up at this point. Kill two of them (node-2 and node-3) to break quorum.
kill -9 $(pgrep -f "chainfire-server.*2381")
kill -9 $(pgrep -f "chainfire-server.*2383")

# Attempt write
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
  -d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}'

# Expected: timeout or error (no quorum)
# Error message should indicate the cluster is unavailable
```
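
Why this fails: Raft only commits an entry once a majority of the group has accepted it. For an n-node group the quorum size is floor(n/2) + 1, so a 3-node cluster tolerates exactly one failure:

```shell
#!/usr/bin/env bash
# Majority quorum size for an n-node Raft group.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # → 2  (tolerates 1 failure; 2 failures lose quorum)
quorum 5   # → 3  (tolerates 2 failures)
```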

### 3.2 Graceful Degradation

```bash
# Verify reads still work (served from local state)
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Get \
  -d '{"key":"dGVzdA=="}'

# Expected: read succeeds (stale read allowed)
# OR: read fails with a clear "no quorum" error
```

### 3.3 Recovery

```bash
# Restart node-2 and node-3
./chainfire-server --config /tmp/t040/chainfire-2/config.toml &
./chainfire-server --config /tmp/t040/chainfire-3/config.toml &

# Wait for quorum restoration
# Retry write
grpcurl -plaintext localhost:2379 chainfire.api.Kv/Put \
  -d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}'

# Expected: write succeeds, cluster operational
```

---

## Test 4: Process Pause (Simulated Freeze)

```bash
# Pause the leader process
kill -STOP $(pgrep -f "chainfire-server.*2379")

# Wait past the heartbeat timeout (typically 1-5 seconds)
sleep 10

# Verify a new leader was elected
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader

# Resume the paused process
kill -CONT $(pgrep -f "chainfire-server.*2379")

# Verify the old leader rejoins as a follower
# (check logs for a "became follower" message)
```

---

## Evidence Collection

For each test, record:

1. **Timestamps**: when the failure was injected, when it was detected, when it recovered
2. **Leader transitions**: old leader ID → new leader ID
3. **Data verification**: keys written during the failure, confirmed after recovery
4. **Error messages**: the exact error returned during quorum loss
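
A lightweight way to capture the timestamps above is to log each injection/detection/recovery event as it happens. A sketch (the `log_event` name and evidence-log path are ours):

```shell
#!/usr/bin/env bash
# Append a UTC-timestamped event line to the evidence log.
EVIDENCE_LOG="${EVIDENCE_LOG:-/tmp/t040/evidence.log}"
log_event() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$EVIDENCE_LOG"
}

# Usage sketch:
# log_event "injected: kill -9 chainfire-1 (leader)"
# log_event "detected: new leader elected"
# log_event "recovered: chainfire-1 rejoined as follower"
```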

### Log Snippets to Capture

```bash
# ChainFire leader election
grep -i "leader\|election\|became" /tmp/t040/chainfire-*/logs/*

# FlareDB Raft state
grep -i "raft\|leader\|commit" /tmp/t040/flaredb-*/logs/*
```

---

## Success Criteria

| Test | Expected | Pass/Fail |
|------|----------|-----------|
| 1.1 Leader kill | Cluster continues, new leader in <10s | |
| 1.2 Leader election | Correct leader ID returned | |
| 1.3 Node rejoin | Cluster returns to 3/3 | |
| 2.1-2.3 FlareDB quorum | Writes succeed with 2/3, data consistent | |
| 3.1-3.3 Quorum loss | Graceful error, recovery works | |
| 4 Process pause | Leader election on timeout, old node rejoins | |

---

## Known Gaps (Document, Don't Block)

1. **Cross-network partition**: not tested (requires iptables/network namespaces)
2. **Disk failure**: not simulated
3. **Clock skew**: not tested

These are deferred to T039 (production deployment) or future work.