# T040.S2 Raft Cluster Resilience Test Runbook
## Prerequisites
- S1 complete: 3 ChainFire + 3 FlareDB instances running
- All instances in same directory structure:
```
/tmp/t040/
chainfire-1/ (data-dir, port 2379/2380)
chainfire-2/ (data-dir, port 2381/2382)
chainfire-3/ (data-dir, port 2383/2384)
flaredb-1/ (data-dir, port 5001)
flaredb-2/ (data-dir, port 5002)
flaredb-3/ (data-dir, port 5003)
```
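If the S1 layout needs to be recreated (fresh host, wiped /tmp), the directory skeleton can be rebuilt with a short loop. This is a sketch of the data-dir layout only; the per-node config files still have to be restored from S1:
```bash
# Rebuild the /tmp/t040 directory skeleton (data dirs only; configs come from S1)
for n in 1 2 3; do
  mkdir -p "/tmp/t040/chainfire-$n" "/tmp/t040/flaredb-$n"
done
ls /tmp/t040
```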
## Test 1: Single Node Failure (Quorum Maintained)
### 1.1 ChainFire Leader Kill
```bash
# Find the current leader (check logs, or query Cluster/GetLeader as in 1.2)
# Kill the leader; this example assumes node-1 (port 2379) holds the lease
kill -9 $(pgrep -f "chainfire-server.*2379")
# Verify cluster still works (2/3 quorum)
# From remaining node (port 2381):
# Note: grpcurl flags, including -d, must come before the address
grpcurl -plaintext -d '{"key":"dGVzdA==","value":"YWZ0ZXItZmFpbHVyZQ=="}' \
  localhost:2381 chainfire.api.Kv/Put
# Expected: Operation succeeds, new leader elected
# Evidence: Logs show "became leader" on surviving node
```
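The `key`/`value` fields above are base64 because grpcurl's JSON mapping carries protobuf `bytes` fields as base64 strings; `dGVzdA==` is just `test`. Payloads can be built inline rather than pre-encoded by hand:
```bash
# grpcurl JSON carries protobuf bytes fields as base64
key=$(printf 'test' | base64)            # dGVzdA==
val=$(printf 'after-failure' | base64)   # YWZ0ZXItZmFpbHVyZQ==
echo "$key $val"
```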
### 1.2 Verify New Leader Election
```bash
# Check cluster status
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
# Expected: Returns node_id != killed node
# Timing: Leader election should complete within 5-10 seconds
```
### 1.3 Restart Failed Node
```bash
# Restart node-1
./chainfire-server --config /tmp/t040/chainfire-1/config.toml &
# Wait for rejoin (check logs)
# Verify cluster is 3/3 again
grpcurl -plaintext localhost:2379 chainfire.api.Cluster/GetMembers
# Expected: All 3 nodes listed, cluster healthy
```
---
## Test 2: FlareDB Node Failure
### 2.1 Write Test Data
```bash
# Write to FlareDB cluster
grpcurl -plaintext -d '{"key":"dGVzdC1rZXk=","value":"dGVzdC12YWx1ZQ==","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawPut
# Verify read
grpcurl -plaintext -d '{"key":"dGVzdC1rZXk=","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawGet
```
### 2.2 Kill FlareDB Node
```bash
# Kill node-2
kill -9 $(pgrep -f "flaredb-server.*5002")
# Verify writes still work (2/3 quorum)
grpcurl -plaintext -d '{"key":"YWZ0ZXItZmFpbA==","value":"c3RpbGwtd29ya3M=","cf":"default"}' \
  localhost:5001 flaredb.kv.KvRaw/RawPut
# Verify read from another node
grpcurl -plaintext -d '{"key":"YWZ0ZXItZmFpbA==","cf":"default"}' \
  localhost:5003 flaredb.kv.KvRaw/RawGet
# Expected: Both operations succeed
```
### 2.3 Data Consistency Check
```bash
# Read all keys from surviving nodes - should match
grpcurl -plaintext -d '{"start_key":"","end_key":"//8=","limit":100}' \
  localhost:5001 flaredb.kv.KvRaw/RawScan
grpcurl -plaintext -d '{"start_key":"","end_key":"//8=","limit":100}' \
  localhost:5003 flaredb.kv.KvRaw/RawScan
# Expected: Identical results (no data loss)
```
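With the two scan responses saved to files, the comparison can be scripted with `diff`. The capture files below are hypothetical sample data standing in for real RawScan output:
```bash
# Diff captured scan output from two nodes; identical files mean no divergence
printf '{"kvs":[{"key":"dGVzdC1rZXk="}]}\n' > /tmp/scan-5001.json   # sample stand-in
printf '{"kvs":[{"key":"dGVzdC1rZXk="}]}\n' > /tmp/scan-5003.json   # sample stand-in
if diff -q /tmp/scan-5001.json /tmp/scan-5003.json >/dev/null; then
  echo "consistent"
else
  echo "DIVERGENCE DETECTED"
fi
```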
---
## Test 3: Quorum Loss (2 of 3 Nodes Down)
### 3.1 Kill Second Node
```bash
# With node-2 already down, kill node-3
kill -9 $(pgrep -f "chainfire-server.*2383")
# Attempt write
grpcurl -plaintext -d '{"key":"bm8tcXVvcnVt","value":"c2hvdWxkLWZhaWw="}' \
  localhost:2379 chainfire.api.Kv/Put
# Expected: Timeout or error (no quorum)
# Error message should indicate cluster unavailable
```
### 3.2 Graceful Degradation
```bash
# Attempt a read from the surviving node (may be served from local state
# without a quorum round-trip, depending on read consistency settings)
grpcurl -plaintext -d '{"key":"dGVzdA=="}' \
  localhost:2379 chainfire.api.Kv/Get
# Expected: Read succeeds (stale read allowed)
# OR: Read fails with clear "no quorum" error
```
### 3.3 Recovery
```bash
# Restart node-3
./chainfire-server --config /tmp/t040/chainfire-3/config.toml &
# Wait for quorum restoration
# Retry write
grpcurl -plaintext -d '{"key":"cmVjb3ZlcmVk","value":"c3VjY2Vzcw=="}' \
  localhost:2379 chainfire.api.Kv/Put
# Expected: Write succeeds, cluster operational
```
---
## Test 4: Process Pause (Simulated Freeze)
```bash
# Pause leader process
kill -STOP $(pgrep -f "chainfire-server.*2379")
# Wait for heartbeat timeout (typically 1-5 seconds)
sleep 10
# Verify new leader elected
grpcurl -plaintext localhost:2381 chainfire.api.Cluster/GetLeader
# Resume paused process
kill -CONT $(pgrep -f "chainfire-server.*2379")
# Verify old leader rejoins as follower
# (check logs for "became follower" message)
```
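The fixed `sleep 10` above is the crude version; a small polling helper gives tighter timing evidence for the success-criteria table. `wait_for` is a hypothetical helper, and the trailing `true` stands in for whatever readiness check the test needs (e.g. the GetLeader call succeeding):
```bash
# Retry a command once per second until it succeeds or attempts run out
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    if "$@" >/dev/null 2>&1; then
      echo "succeeded on attempt ${i}"
      return 0
    fi
    sleep 1
  done
  echo "gave up after ${tries} attempts" >&2
  return 1
}

# Example: replace `true` with the real grpcurl GetLeader check
wait_for 10 true
```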
---
## Evidence Collection
For each test, record:
1. **Timestamps**: When failure injected, when detected, when recovered
2. **Leader transitions**: Old leader ID → New leader ID
3. **Data verification**: Keys written during failure, confirmed after recovery
4. **Error messages**: Exact error returned during quorum loss
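Timestamps are easiest to keep honest if every injection and observation goes through one logging helper. A sketch, where `evidence` and the log path are hypothetical rather than an existing tool:
```bash
# Append a UTC-timestamped event line to the run's evidence log
EVIDENCE_LOG=/tmp/t040/evidence.log
evidence() {
  mkdir -p "$(dirname "$EVIDENCE_LOG")"
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$EVIDENCE_LOG"
}

evidence "killed chainfire-1 (leader)"
evidence "new leader observed on node-2"
cat "$EVIDENCE_LOG"
```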
### Log Snippets to Capture
```bash
# ChainFire leader election
grep -i "leader\|election\|became" /tmp/t040/chainfire-*/logs/*
# FlareDB Raft state
grep -i "raft\|leader\|commit" /tmp/t040/flaredb-*/logs/*
```
---
## Success Criteria
| Test | Expected | Pass/Fail |
|------|----------|-----------|
| 1.1 Leader kill | Cluster continues, new leader in <10s | |
| 1.2 Leader election | Correct leader ID returned | |
| 1.3 Node rejoin | Cluster returns to 3/3 | |
| 2.1-2.3 FlareDB quorum | Writes succeed with 2/3, data consistent | |
| 3.1-3.3 Quorum loss | Graceful error, recovery works | |
| 4 Process pause | Leader election on timeout, old node rejoins | |
---
## Known Gaps (Document, Don't Block)
1. **Cross-network partition**: Not tested (requires iptables/network namespace)
2. **Disk failure**: Not simulated
3. **Clock skew**: Not tested
These are deferred to T039 (production deployment) or future work.