# T040.S2 Raft Cluster Resilience Test Runbook
## Prerequisites
- S1 complete: 3 ChainFire + 3 FlareDB instances running
- All instances in same directory structure:
```
/tmp/t040/
chainfire-1/ (data-dir, port 2379/2380)
chainfire-2/ (data-dir, port 2381/2382)
chainfire-3/ (data-dir, port 2383/2384)
flaredb-1/ (data-dir, port 5001)
flaredb-2/ (data-dir, port 5002)
flaredb-3/ (data-dir, port 5003)
```
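If the S1 layout needs to be recreated (fresh host, wiped /tmp), the directory skeleton can be rebuilt with a short loop. This is a sketch of the data-dir layout only; the per-node config files still have to be restored from S1:
```bash
# Rebuild the /tmp/t040 directory skeleton (data dirs only; configs come from S1)
for n in 1 2 3; do
  mkdir -p "/tmp/t040/chainfire-$n" "/tmp/t040/flaredb-$n"
done
ls /tmp/t040
```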
## Test 1: Single Node Failure (Quorum Maintained)
### 1.1 ChainFire Leader Kill
```bash
# Find the current leader (check logs, or query Cluster/GetLeader as in 1.2)
# Kill the leader; this example assumes node-1 (port 2379) holds the lease
kill -9 $(pgrep -f "chainfire-server.*2379")
# Verify cluster still works (2/3 quorum)
# From remaining node (port 2381):
# Note: grpcurl flags, including -d, must come before the address
grpcurl -plaintext -d '{"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}' \
  localhost:2381 chainfire.api.Kv/Put
# Expected: Operation succeeds, new leader elected
# Evidence: Logs show "became leader" on surviving node
```
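The `key`/`value` fields above are base64 because grpcurl's JSON mapping carries protobuf `bytes` fields as base64 strings; `dGVzdA==` is just `test`. Payloads can be built inline rather than pre-encoded by hand:
```bash
# grpcurl JSON carries protobuf bytes fields as base64
key=$(printf 'test' | base64)            # dGVzdA==
val=$(printf 'after-failure' | base64)   # YWZ0ZXItZmFpbHVyZQ==
echo "$key $val"
```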
### 1.2 Verify New Leader Election
```bash
# Check cluster status
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
# Expected: Returns node_id != killed node
# Timing: Leader election should complete within 5-10 seconds
```
### 1.3 Restart Failed Node
```bash
# Restart node-1
./chainfire-server --config /tmp/t040/chainfire-1/config.toml &
# Wait for rejoin (check logs)
# Verify cluster is 3/3 again
grpcurl -plaintext localhost:2379 chainfire.api.Cluster/GetMembers
# Expected: All 3 nodes listed, cluster healthy
```
---
## Test 2: FlareDB Node Failure
### 2.1 Write Test Data
```bash
# Write to FlareDB cluster
grpcurl -plaintext -d '{"key":"dGVzdC1rZXk=","value":"dGVzdC12YWx1ZQ==","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawPut
# Verify read
grpcurl -plaintext -d '{"key":"dGVzdC1rZXk=","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawGet
```
### 2.2 Kill FlareDB Node
```bash
# Kill node-2
kill -9 $(pgrep -f "flaredb-server.*5002")
# Verify writes still work (2/3 quorum)
grpcurl -plaintext -d '{"key":"YWZ0ZXItZmFpbA==","value":"c3RpbGwtd29ya3M=","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawPut
# Verify read from another node
grpcurl -plaintext -d '{"key":"YWZ0ZXItZmFpbA==","cf":"default"}' \
  localhost:5003 flaredb.kv.KvRaw/RawGet
# Expected: Both operations succeed
```
### 2.3 Data Consistency Check
```bash
# Read all keys from surviving nodes - should match
grpcurl -plaintext -d '{"start_key":"","end_key":"//8=","limit":100}' \
  localhost:5001 flaredb.kv.KvRaw/RawScan
grpcurl -plaintext -d '{"start_key":"","end_key":"//8=","limit":100}' \
  localhost:5003 flaredb.kv.KvRaw/RawScan
# Expected: Identical results (no data loss)
```
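With the two scan responses saved to files, the comparison can be scripted with `diff`. The capture files below are hypothetical sample data standing in for real RawScan output:
```bash
# Diff captured scan output from two nodes; identical files mean no divergence
printf '{"kvs":[{"key":"dGVzdC1rZXk="}]}\n' > /tmp/scan-5001.json   # sample stand-in
printf '{"kvs":[{"key":"dGVzdC1rZXk="}]}\n' > /tmp/scan-5003.json   # sample stand-in
if diff -q /tmp/scan-5001.json /tmp/scan-5003.json >/dev/null; then
  echo "consistent"
else
  echo "DIVERGENCE DETECTED"
fi
```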
---
## Test 3: Quorum Loss (2 of 3 Nodes Down)
### 3.1 Kill Second Node
```bash
# With node-2 already down, kill node-3
kill -9 $(pgrep -f "chainfire-server.*2383")
# Attempt write
grpcurl -plaintext -d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}' \
  localhost:2379 chainfire.api.Kv/Put
# Expected: Timeout or error (no quorum)
# Error message should indicate cluster unavailable
```
### 3.2 Graceful Degradation
```bash
# Attempt a read from the surviving node (may be served from local state
# without a quorum round-trip, depending on read consistency settings)
grpcurl -plaintext -d '{"key":"dGVzdA=="}' \
  localhost:2379 chainfire.api.Kv/Get
# Expected: Read succeeds (stale read allowed)
# OR: Read fails with clear "no quorum" error
```
### 3.3 Recovery
```bash
# Restart node-3
./chainfire-server --config /tmp/t040/chainfire-3/config.toml &
# Wait for quorum restoration
# Retry write
grpcurl -plaintext -d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}' \
  localhost:2379 chainfire.api.Kv/Put
# Expected: Write succeeds, cluster operational
```
---
## Test 4: Process Pause (Simulated Freeze)
```bash
# Pause leader process
kill -STOP $(pgrep -f "chainfire-server.*2379")
# Wait for heartbeat timeout (typically 1-5 seconds)
sleep 10
# Verify new leader elected
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
# Resume paused process
kill -CONT $(pgrep -f "chainfire-server.*2379")
# Verify old leader rejoins as follower
# (check logs for "became follower" message)
```
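The fixed `sleep 10` above is the crude version; a small polling helper gives tighter timing evidence for the success-criteria table. `wait_for` is a hypothetical helper, and the trailing `true` stands in for whatever readiness check the test needs (e.g. the GetLeader call succeeding):
```bash
# Retry a command once per second until it succeeds or attempts run out
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    if "$@" >/dev/null 2>&1; then
      echo "succeeded on attempt ${i}"
      return 0
    fi
    sleep 1
  done
  echo "gave up after ${tries} attempts" >&2
  return 1
}

# Example: replace `true` with the real grpcurl GetLeader check
wait_for 10 true
```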
---
## Evidence Collection
For each test, record:
1. **Timestamps**: When failure injected, when detected, when recovered
2. **Leader transitions**: Old leader ID → New leader ID
3. **Data verification**: Keys written during failure, confirmed after recovery
4. **Error messages**: Exact error returned during quorum loss
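Timestamps are easiest to keep honest if every injection and observation goes through one logging helper. A sketch, where `evidence` and the log path are hypothetical rather than an existing tool:
```bash
# Append a UTC-timestamped event line to the run's evidence log
EVIDENCE_LOG=/tmp/t040/evidence.log
evidence() {
  mkdir -p "$(dirname "$EVIDENCE_LOG")"
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$EVIDENCE_LOG"
}

evidence "killed chainfire-1 (leader)"
evidence "new leader observed on node-2"
cat "$EVIDENCE_LOG"
```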
### Log Snippets to Capture
```bash
# ChainFire leader election
grep -i "leader\|election\|became" /tmp/t040/chainfire-*/logs/*
# FlareDB Raft state
grep -i "raft\|leader\|commit" /tmp/t040/flaredb-*/logs/*
```
---
## Success Criteria
| Test | Expected | Pass/Fail |
|------|----------|-----------|
| 1.1 Leader kill | Cluster continues, new leader in <10s | |
| 1.2 Leader election | Correct leader ID returned | |
| 1.3 Node rejoin | Cluster returns to 3/3 | |
| 2.1-2.3 FlareDB quorum | Writes succeed with 2/3, data consistent | |
| 3.1-3.3 Quorum loss | Graceful error, recovery works | |
| 4 Process pause | Leader election on timeout, old node rejoins | |
---
## Known Gaps (Document, Don't Block)
1. **Cross-network partition**: Not tested (requires iptables/network namespace)
2. **Disk failure**: Not simulated
3. **Clock skew**: Not tested
These are deferred to T039 (production deployment) or future work.