# T040.S4 Service Reconnection Test Scenarios

## Overview

Test scenarios for validating service reconnection behavior after transient failures.

## Test Environment: Option B2 (Local Multi-Instance)

**Approved**: 2025-12-11

**Setup**: 3 instances per service running on localhost with different ports

- ChainFire: ports 2379, 2380, 2381 (or similar)
- FlareDB: ports 5000, 5001, 5002 (or similar)

**Failure Simulation Methods** (adapted from VM approach):

- **Process kill**: `kill -9 <pid>` simulates sudden node failure
- **SIGTERM**: `kill <pid>` simulates graceful shutdown
- **Port blocking**: `iptables -A INPUT -p tcp --dport <port> -j DROP` (if root)
- **Pause**: `kill -STOP <pid>` / `kill -CONT <pid>` simulates a freeze

**Note**: Cross-VM network partition tests are deferred to T039 (production deployment).

## Current State Analysis

### Services WITH Reconnection Logic

| Service | Mechanism | Location |
|---------|-----------|----------|
| ChainFire | Exponential backoff (3 retries, 2.0x multiplier, 500ms-30s) | `chainfire/crates/chainfire-api/src/raft_client.rs` |
| FlareDB | PD client auto-reconnect (10s cycle), connection pooling | `flaredb/crates/flaredb-server/src/main.rs:283-356` |

### Services WITHOUT Reconnection Logic (GAPS)

| Service | Gap | Risk |
|---------|-----|------|
| PlasmaVMC | No retry/reconnection | VM operations fail silently on network blip |
| IAM | No retry mechanism | Auth failures cascade to all services |
| Watch streams | Break on error, no auto-reconnect | Config/event propagation stops |

---

## Test Scenarios
### Scenario 1: ChainFire Raft Recovery

**Goal**: Verify Raft RPC retry logic works under network failures

**Steps**:

1. Start 3-node ChainFire cluster
2. Write key-value pair
3. Use `iptables` to block traffic to leader node
4. Attempt read/write operation from client
5. Observe retry behavior (should retry with backoff)
6. Unblock traffic
7. Verify operation completes or fails gracefully

**Expected**:

- Client retries up to 3 times with exponential backoff
- Clear error message on final failure
- No data corruption

**Evidence**: Client logs showing retry attempts, timing
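
The retry policy under test (3 retries, 2.0x multiplier, 500 ms initial delay capped at 30 s) can be sketched as a generic backoff loop. This is an illustrative stand-in, not the actual `raft_client.rs` implementation; the demo uses millisecond-scale delays and a simulated flaky RPC so it runs instantly.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry `op` up to `max_retries` times with exponential backoff.
/// Sketch only: mirrors the documented ChainFire defaults
/// (3 retries, 2.0x multiplier, 500 ms initial, 30 s cap).
fn retry_with_backoff<T, E>(
    max_retries: u32,
    initial_ms: u64,
    multiplier: f64,
    cap_ms: u64,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay_ms = initial_ms;
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Final failure: surface the error to the caller.
            Err(e) if attempt >= max_retries => return Err(e),
            Err(_) => {
                sleep(Duration::from_millis(delay_ms.min(cap_ms)));
                delay_ms = ((delay_ms as f64) * multiplier) as u64;
                attempt += 1;
            }
        }
    }
}

fn main() {
    // Simulated flaky RPC: fails twice (blocked leader), then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(3, 1, 2.0, 30, || {
        calls += 1;
        if calls < 3 { Err("connection refused") } else { Ok("value") }
    });
    assert_eq!(result, Ok("value"));
    println!("succeeded after {calls} attempts");
}
```

In the real test the production delays apply, so three failed attempts should show roughly 0.5 s, 1 s, and 2 s gaps in the client logs.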

---

### Scenario 2: FlareDB PD Reconnection

**Goal**: Verify FlareDB server reconnects to ChainFire (PD) after restart

**Steps**:

1. Start ChainFire cluster (PD)
2. Start FlareDB server connected to PD
3. Verify heartbeat working (check logs)
4. Kill ChainFire leader
5. Wait for new leader election
6. Observe FlareDB reconnection behavior

**Expected**:

- FlareDB logs "Reconnected to PD" within 10-20s
- Client operations resume after reconnection
- No data loss during transition

**Evidence**: Server logs, client operation success
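
The behavior being exercised is a fixed-interval re-dial loop, which is why reconnection is expected within one or two 10 s cycles. A minimal sketch of that pattern follows; `try_connect` is a stand-in for the real PD dial, and the interval is shortened so the demo runs fast.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Probe the endpoint once per `interval` until it answers or we give up.
/// Returns the 1-based cycle on which reconnection succeeded, or None.
/// Sketch of the documented 10 s reconnect cycle, not the FlareDB source.
fn reconnect_cycle(
    interval: Duration,
    max_cycles: u32,
    mut try_connect: impl FnMut() -> bool,
) -> Option<u32> {
    for cycle in 1..=max_cycles {
        if try_connect() {
            return Some(cycle); // reconnected on this cycle
        }
        sleep(interval);
    }
    None // gave up: caller should surface a clear error
}

fn main() {
    // Simulate a leader election that completes after two failed probes.
    let mut probes = 0;
    let cycles = reconnect_cycle(Duration::from_millis(1), 10, || {
        probes += 1;
        probes >= 3
    });
    assert_eq!(cycles, Some(3));
    println!("reconnected after {probes} probes");
}
```

With a real 10 s interval, success on the second or third probe matches the expected 10-20 s reconnection window above.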

---

### Scenario 3: Network Partition (iptables)

**Goal**: Verify cluster behavior during network partition

**Steps**:

1. Start 3-node cluster (ChainFire + FlareDB)
2. Write data to cluster
3. Create network partition: `iptables -A INPUT -s <node2-ip> -j DROP`
4. Attempt writes (should succeed with 2/3 quorum)
5. Kill another node (should lose quorum)
6. Verify writes fail gracefully
7. Remove the partition rule, verify cluster recovery

**Expected**:

- 2/3 nodes: writes succeed
- 1/3 nodes: writes fail, no data corruption
- Recovery: cluster resumes normal operation

**Evidence**: Write success/failure, data consistency check
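
The pass/fail expectations above follow directly from the majority-quorum rule standard in Raft-style clusters: writes commit only while a strict majority of voting nodes is reachable. A one-line sketch:

```rust
/// Strict-majority quorum rule (standard Raft; assumed to match ChainFire).
fn has_quorum(live_nodes: usize, total_nodes: usize) -> bool {
    live_nodes >= total_nodes / 2 + 1
}

fn main() {
    assert!(has_quorum(2, 3));  // step 4: one node partitioned, writes succeed
    assert!(!has_quorum(1, 3)); // step 5: second node down, writes must fail
    assert!(has_quorum(3, 3));  // step 7: partition healed, normal operation
    println!("quorum checks pass");
}
```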

---

### Scenario 4: Service Restart Recovery

**Goal**: Verify clients reconnect after service restart

**Steps**:

1. Start service (FlareDB/ChainFire)
2. Connect client
3. Perform operations
4. Restart service (`systemctl restart` or SIGTERM + start)
5. Attempt client operations

**Expected ChainFire**: Client reconnects via retry logic
**Expected FlareDB**: Connection pool creates new connection
**Expected IAM**: Manual reconnect required (gap)

**Evidence**: Client operation success after restart
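
The FlareDB expectation relies on a validate-on-checkout pool: after a restart, every pooled connection is dead, so the pool must discard stale entries and dial fresh ones. A sketch of that pattern, where `Conn` and `dial` are illustrative stand-ins rather than FlareDB types:

```rust
use std::collections::VecDeque;

/// Toy connection whose liveness we can check (stand-in for a real socket).
#[derive(Debug)]
struct Conn { alive: bool }

struct Pool { idle: VecDeque<Conn> }

impl Pool {
    /// Hand out a healthy connection, discarding any stale ones found first.
    fn checkout(&mut self, dial: impl Fn() -> Conn) -> Conn {
        while let Some(conn) = self.idle.pop_front() {
            if conn.alive {
                return conn; // reuse a healthy pooled connection
            }
            // Stale connection from before the restart: drop it.
        }
        dial() // pool empty or all stale: dial a new connection
    }
}

fn main() {
    // After a service restart, every pooled connection is dead.
    let mut pool = Pool {
        idle: VecDeque::from([Conn { alive: false }, Conn { alive: false }]),
    };
    let conn = pool.checkout(|| Conn { alive: true });
    assert!(conn.alive);           // client operations succeed after restart
    assert!(pool.idle.is_empty()); // stale connections were discarded
    println!("checkout ok: {conn:?}");
}
```

If FlareDB's pool instead hands back a dead connection, step 5 will surface it as a client error rather than a transparent reconnect, which is exactly what this scenario distinguishes.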

---

### Scenario 5: Watch Stream Recovery (GAP DOCUMENTATION)

**Goal**: Document watch stream behavior on connection loss

**Steps**:

1. Start ChainFire server
2. Connect watch client
3. Verify events received
4. Kill server
5. Observe client behavior

**Expected**: Watch breaks, no auto-reconnect
**GAP**: Need application-level reconnect loop

**Evidence**: Client logs showing stream termination
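
The application-level reconnect loop this gap calls for would wrap the watch stream, and on termination reopen it from the last observed revision so no events are replayed or lost. A minimal sketch, where `WatchOutcome`, the revision numbers, and the `open_watch` closure are hypothetical stand-ins for the real ChainFire watch API:

```rust
/// What one pass over a (mock) watch stream can yield.
enum WatchOutcome {
    Event(u64),   // event carrying its revision
    Disconnected, // stream broke mid-flight
}

/// Reconnect wrapper: reopen the stream from the last seen revision each
/// time it breaks, up to `max_reconnects`. Returns (events, reconnects).
fn watch_with_reconnect(
    mut open_watch: impl FnMut(u64) -> Vec<WatchOutcome>,
    max_reconnects: u32,
) -> (Vec<u64>, u32) {
    let mut last_rev = 0;
    let mut events = Vec::new();
    let mut reconnects = 0;
    loop {
        for outcome in open_watch(last_rev) {
            match outcome {
                WatchOutcome::Event(rev) => {
                    last_rev = rev; // remember resume point
                    events.push(rev);
                }
                WatchOutcome::Disconnected => break,
            }
        }
        if reconnects >= max_reconnects {
            return (events, reconnects);
        }
        reconnects += 1; // stream ended: reopen from last_rev
    }
}

fn main() {
    // First stream delivers two events then dies; the wrapper resumes
    // from revision 2 and picks up the third event.
    let mut calls = 0;
    let (events, reconnects) = watch_with_reconnect(
        |from_rev| {
            calls += 1;
            match calls {
                1 => vec![
                    WatchOutcome::Event(1),
                    WatchOutcome::Event(2),
                    WatchOutcome::Disconnected,
                ],
                _ => {
                    assert_eq!(from_rev, 2); // resumed at the right revision
                    vec![WatchOutcome::Event(3)]
                }
            }
        },
        1,
    );
    assert_eq!(events, vec![1, 2, 3]);
    assert_eq!(reconnects, 1);
    println!("events: {events:?}, reconnects: {reconnects}");
}
```

A production version would add backoff between reopen attempts (as in the ChainFire retry policy) rather than reconnecting immediately.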

---

## Test Matrix

| Scenario | ChainFire | FlareDB | PlasmaVMC | IAM |
|----------|-----------|---------|-----------|-----|
| S1: Raft Recovery | TEST | n/a | n/a | n/a |
| S2: PD Reconnect | n/a | TEST | n/a | n/a |
| S3: Network Partition | TEST | TEST | SKIP | SKIP |
| S4: Restart Recovery | TEST | TEST | DOC-GAP | DOC-GAP |
| S5: Watch Recovery | DOC-GAP | DOC-GAP | n/a | n/a |

---

## Prerequisites (Option B2 - Local Multi-Instance)

- 3 ChainFire instances running on localhost (S1 provides)
- 3 FlareDB instances running on localhost (S1 provides)
- Separate data directories per instance
- Logging enabled at DEBUG level for evidence
- Process management tools (kill, pkill)
- Optional: iptables for port blocking tests (requires root)

## Success Criteria

- All TEST scenarios pass
- GAP scenarios documented with recommendations
- No data loss in any failure scenario
- Clear error messages on unrecoverable failures

## Future Work (Identified Gaps)

1. PlasmaVMC: Add retry logic for remote service calls
2. IAM Client: Add exponential backoff retry
3. Watch streams: Add auto-reconnect wrapper