
T040.S2 Raft Cluster Resilience Test Runbook

Prerequisites

  • S1 complete: 3 ChainFire + 3 FlareDB instances running
  • All instances in same directory structure:
    /tmp/t040/
      chainfire-1/  (data-dir, port 2379/2380)
      chainfire-2/  (data-dir, port 2381/2382)
      chainfire-3/  (data-dir, port 2383/2384)
      flaredb-1/    (data-dir, port 5001)
      flaredb-2/    (data-dir, port 5002)
      flaredb-3/    (data-dir, port 5003)
    
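
Before injecting failures, it is worth confirming that the S1 layout is actually in place; `check_layout` below is a hypothetical helper (not part of the T040 tooling) that only checks the six instance directories exist:

```shell
# check_layout BASE_DIR: verify the six expected instance directories exist
check_layout() {
  local base=$1 d
  for d in chainfire-1 chainfire-2 chainfire-3 flaredb-1 flaredb-2 flaredb-3; do
    if [ ! -d "$base/$d" ]; then
      echo "missing: $base/$d" >&2
      return 1
    fi
  done
  return 0
}

# Usage against the layout above: check_layout /tmp/t040
```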

Test 1: Single Node Failure (Quorum Maintained)

1.1 ChainFire Leader Kill

# Find leader (check logs or use API)
# Kill leader node (e.g., node-1)
kill -9 $(pgrep -f "chainfire-server.*2379")

# Verify cluster still works (2/3 quorum)
# From remaining node (port 2381):
grpcurl -plaintext -d '{"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}' \
  localhost:2381 chainfire.api.Kv/Put

# Expected: Operation succeeds, new leader elected
# Evidence: Logs show "became leader" on surviving node
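
The `key` and `value` fields in these grpcurl payloads are base64 because the proto fields are `bytes`; `dGVzdA==` above is simply `test`. A quick sketch for encoding and decoding keys at the shell (`base64 -d` is the GNU coreutils flag; older macOS uses `-D`):

```shell
# Encode plain strings into the base64 form grpcurl expects for bytes fields
key=$(printf '%s' 'test' | base64)            # dGVzdA==
val=$(printf '%s' 'after-failure' | base64)   # YWZ0ZXItZmFpbHVyZQ==

# Decode a base64 value returned in a grpcurl response
printf '%s' "$val" | base64 -d                # after-failure
```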

1.2 Verify New Leader Election

# Check cluster status
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader

# Expected: Returns node_id != killed node
# Timing: Leader election should complete within 5-10 seconds
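
Instead of eyeballing the 5-10 second window, a small poll loop makes the election time observable; `wait_until` is a hypothetical bash helper, not part of the ChainFire tooling:

```shell
# wait_until TIMEOUT_SECS CMD...: retry CMD once per second until it
# succeeds or TIMEOUT_SECS elapse; the exit status reports the outcome
wait_until() {
  local timeout=$1 start=$SECONDS
  shift
  until "$@"; do
    if (( SECONDS - start >= timeout )); then
      return 1
    fi
    sleep 1
  done
}

# Example against the cluster above:
# wait_until 10 grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
```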

1.3 Restart Failed Node

# Restart node-1
./chainfire-server --config /tmp/t040/chainfire-1/config.toml &

# Wait for rejoin (check logs)
# Verify cluster is 3/3 again
grpcurl -plaintext localhost:2379 chainfire.api.Cluster/GetMembers

# Expected: All 3 nodes listed, cluster healthy

Test 2: FlareDB Node Failure

2.1 Write Test Data

# Write to FlareDB cluster
grpcurl -plaintext -d '{"key":"dGVzdC1rZXk=","value":"dGVzdC12YWx1ZQ==","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawPut

# Verify read
grpcurl -plaintext -d '{"key":"dGVzdC1rZXk=","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawGet

2.2 Kill FlareDB Node

# Kill node-2
kill -9 $(pgrep -f "flaredb-server.*5002")

# Verify writes still work (2/3 quorum)
grpcurl -plaintext -d '{"key":"YWZ0ZXItZmFpbA==","value":"c3RpbGwtd29ya3M=","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawPut

# Verify read from another node
grpcurl -plaintext -d '{"key":"YWZ0ZXItZmFpbA==","cf":"default"}' \
  localhost:5003 flaredb.kv.KvRaw/RawGet

# Expected: Both operations succeed
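
The reason 2/3 keeps accepting writes: Raft commits an entry once a majority of the cluster, floor(n/2)+1 nodes, has acknowledged it. The arithmetic as a one-liner:

```shell
# Majority quorum size for an n-node Raft cluster
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # 2 -> tolerates one failed node
quorum 5   # 3 -> tolerates two failed nodes
```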

2.3 Data Consistency Check

# Read all keys from surviving nodes - should match
grpcurl -plaintext -d '{"start_key":"","end_key":"//8=","limit":100}' \
  localhost:5001 flaredb.kv.KvRaw/RawScan

grpcurl -plaintext -d '{"start_key":"","end_key":"//8=","limit":100}' \
  localhost:5003 flaredb.kv.KvRaw/RawScan

# Expected: Identical results (no data loss)
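
The comparison can be made mechanical with a helper that diffs the stdout of two commands; `same_output` is a hypothetical bash sketch (it assumes RawScan returns keys in the same order on both nodes, which holds for an ordered scan):

```shell
# same_output CMD_A... -- CMD_B...: run both commands and diff their stdout;
# exits 0 only when the outputs are byte-identical
same_output() {
  local a=()
  while [ "$1" != "--" ]; do a+=("$1"); shift; done
  shift
  diff <("${a[@]}") <("$@")
}

# Example against the cluster above:
# scan='{"start_key":"","end_key":"//8=","limit":100}'
# same_output grpcurl -plaintext -d "$scan" localhost:5001 flaredb.kv.KvRaw/RawScan \
#          -- grpcurl -plaintext -d "$scan" localhost:5003 flaredb.kv.KvRaw/RawScan
```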

Test 3: Quorum Loss (2 of 3 Nodes Down)

3.1 Kill Two Nodes

# All 3 ChainFire nodes are up after Test 1.3, so losing quorum
# requires killing two of them. Kill node-2 and node-3, leaving
# only node-1 (1/3, below quorum)
kill -9 $(pgrep -f "chainfire-server.*2381")
kill -9 $(pgrep -f "chainfire-server.*2383")

# Attempt write
grpcurl -plaintext -d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}' \
  localhost:2379 chainfire.api.Kv/Put

# Expected: Timeout or error (no quorum)
# Error message should indicate cluster unavailable

3.2 Graceful Degradation

# Check whether reads still work (served from node-1's locally applied state)
grpcurl -plaintext -d '{"key":"dGVzdA=="}' \
  localhost:2379 chainfire.api.Kv/Get

# Expected: Read succeeds (stale read allowed)
# OR: Read fails with clear "no quorum" error

3.3 Recovery

# Restart node-2 and node-3 so the cluster returns to 3/3 before Test 4
./chainfire-server --config /tmp/t040/chainfire-2/config.toml &
./chainfire-server --config /tmp/t040/chainfire-3/config.toml &

# Wait for quorum restoration, then retry the write
grpcurl -plaintext -d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}' \
  localhost:2379 chainfire.api.Kv/Put

# Expected: Write succeeds, cluster operational

Test 4: Process Pause (Simulated Freeze)

# Pause leader process
kill -STOP $(pgrep -f "chainfire-server.*2379")

# Wait for heartbeat timeout (typically 1-5 seconds)
sleep 10

# Verify new leader elected
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader

# Resume paused process
kill -CONT $(pgrep -f "chainfire-server.*2379")

# Verify old leader rejoins as follower
# (check logs for "became follower" message)
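
SIGSTOP freezes the process without terminating it, so to the rest of the cluster the node simply stops heartbeating, and SIGCONT resumes it with all state intact. A self-contained demonstration of the mechanism on a throwaway `sleep` process:

```shell
# Freeze and resume a scratch process, watching its state via ps
sleep 30 &
pid=$!

kill -STOP "$pid"
sleep 1
stopped=$(ps -o stat= -p "$pid")   # state contains 'T' while stopped

kill -CONT "$pid"
sleep 1                            # give the scheduler a moment to resume it
resumed=$(ps -o stat= -p "$pid")   # back to 'S' (sleeping)

kill "$pid" 2>/dev/null
```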

Evidence Collection

For each test, record:

  1. Timestamps: When failure injected, when detected, when recovered
  2. Leader transitions: Old leader ID → New leader ID
  3. Data verification: Keys written during failure, confirmed after recovery
  4. Error messages: Exact error returned during quorum loss
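
A small helper keeps the timestamps uniform across all four tests; `evlog` and the evidence file path are hypothetical conventions, not part of the runbook tooling:

```shell
# evlog MESSAGE...: append an ISO-8601 UTC timestamped line to the evidence log
EVIDENCE=${EVIDENCE:-/tmp/t040/evidence.log}

evlog() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$EVIDENCE"
}

# Example:
# evlog "test 1.1: injected failure, killed chainfire node-1 (leader)"
```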

Log Snippets to Capture

# ChainFire leader election
grep -i "leader\|election\|became" /tmp/t040/chainfire-*/logs/*

# FlareDB Raft state
grep -i "raft\|leader\|commit" /tmp/t040/flaredb-*/logs/*

Success Criteria

| Test | Expected | Pass/Fail |
|------|----------|-----------|
| 1.1 Leader kill | Cluster continues, new leader in <10s | |
| 1.2 Leader election | Correct leader ID returned | |
| 1.3 Node rejoin | Cluster returns to 3/3 | |
| 2.1-2.3 FlareDB quorum | Writes succeed with 2/3, data consistent | |
| 3.1-3.3 Quorum loss | Graceful error, recovery works | |
| 4 Process pause | Leader election on timeout, old node rejoins | |

Known Gaps (Document, Don't Block)

  1. Cross-network partition: Not tested (requires iptables/network namespace)
  2. Disk failure: Not simulated
  3. Clock skew: Not tested

These are deferred to T039 (production deployment) or future work.