# Troubleshooting Runbook

## Overview

This runbook provides diagnostic procedures and solutions for common operational issues with Chainfire (a distributed key-value store) and FlareDB (a time-series database).

## Quick Diagnostics

### Health Check Commands

```bash
# Chainfire cluster health
chainfire-client --endpoint http://NODE_IP:2379 status
chainfire-client --endpoint http://NODE_IP:2379 member-list

# FlareDB cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, capacity}'

# Service status
systemctl status chainfire
systemctl status flaredb

# Port connectivity
nc -zv NODE_IP 2379   # API port
nc -zv NODE_IP 2380   # Raft port
nc -zv NODE_IP 2381   # Gossip port

# Resource usage
top -bn1 | head -20
df -h
iostat -x 1 5

# Recent logs
journalctl -u chainfire -n 100 --no-pager
journalctl -u flaredb -n 100 --no-pager
```
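
The port checks above can be rolled into a small helper that reports which of the expected ports are not listening. The sketch below expects the local-address column of `ss -tln` on stdin; the sample input stands in for real output.

```shell
# Sketch: compare listening ports against the three expected Chainfire ports.
# Feed it the local-address column of `ss -tln`.
missing_ports() {
  expected="2379 2380 2381"
  # Take everything after the last colon (works for IPv4 and IPv6 addresses)
  listening="$(awk -F: '{print $NF}' | sort -u)"
  for p in $expected; do
    echo "$listening" | grep -qx "$p" || echo "missing: $p"
  done
}

# Example with sample data (only the API and Raft ports listening):
printf '0.0.0.0:2379\n0.0.0.0:2380\n' | missing_ports
# missing: 2381
```

On a live node, something like `ss -tln | awk 'NR>1 {print $4}' | missing_ports` feeds it real data.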

## Chainfire Issues

### Issue: Node Cannot Join Cluster

**Symptoms:**
- `member-add` command hangs or times out
- New node logs show "connection refused" or "timeout" errors
- `member-list` does not show the new node

**Diagnosis:**
```bash
# 1. Check network connectivity
nc -zv NEW_NODE_IP 2380

# 2. Verify the Raft server is listening on the new node
ssh NEW_NODE_IP "ss -tlnp | grep 2380"

# 3. Check firewall rules
ssh NEW_NODE_IP "sudo iptables -L -n | grep 2380"

# 4. Verify TLS configuration matches
ssh NEW_NODE_IP "grep -A5 '\[network.tls\]' /etc/centra-cloud/chainfire.toml"

# 5. Check leader logs
ssh LEADER_NODE "journalctl -u chainfire -n 50 | grep -i 'add.*node'"
```

**Resolution:**

**If network issue:**
```bash
# Open firewall ports on the new node
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --permanent --add-port=2381/tcp
sudo firewall-cmd --reload
```

**If TLS mismatch:**
```bash
# Ensure the new node has the correct certificates
sudo ls -l /etc/centra-cloud/certs/
# Should have: ca.crt, chainfire-node-N.crt, chainfire-node-N.key

# Verify the certificate is valid
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-N.crt -noout -text
```

**If bootstrap flag set incorrectly:**
```bash
# Edit config on the new node
sudo vi /etc/centra-cloud/chainfire.toml

# Ensure:
# [cluster]
# bootstrap = false   # MUST be false for joining nodes

sudo systemctl restart chainfire
```

### Issue: No Leader / Leader Election Fails

**Symptoms:**
- Writes fail with "no leader elected" error
- `chainfire-client status` shows `leader: none`
- Logs show repeated "election timeout" messages

**Diagnosis:**
```bash
# 1. Check cluster membership
chainfire-client --endpoint http://NODE1_IP:2379 member-list

# 2. Check Raft state on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u chainfire -n 20 | grep -i 'raft\|leader\|election'"
done

# 3. Check for a network partition
for node in node1 node2 node3; do
  for peer in node1 node2 node3; do
    echo "$node -> $peer:"
    ssh $node "ping -c 3 $peer"
  done
done

# 4. Check quorum
# A 3-node cluster needs 2 nodes (a majority)
RUNNING_NODES=$(for node in node1 node2 node3; do ssh $node "systemctl is-active chainfire" 2>/dev/null; done | grep -c active)
echo "Running nodes: $RUNNING_NODES (need >= 2 for quorum)"
```

**Resolution:**

**If fewer than half the nodes are up (no quorum):**
```bash
# Start a majority of nodes
ssh node1 "sudo systemctl start chainfire"
ssh node2 "sudo systemctl start chainfire"

# Wait for leader election
sleep 10

# Verify a leader was elected
chainfire-client --endpoint http://node1:2379 status | grep leader
```

**If network partition:**
```bash
# Check and fix network connectivity
# Ensure bidirectional connectivity between all nodes

# Restart affected nodes
ssh ISOLATED_NODE "sudo systemctl restart chainfire"
```

**If split-brain (multiple leaders):**
```bash
# DANGER: this wipes follower data
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Keep only the node with the highest raft_index
# Wipe the others
ssh node2 "sudo rm -rf /var/lib/chainfire/*"
ssh node3 "sudo rm -rf /var/lib/chainfire/*"

# Restart the surviving node (node1 in this example)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
ssh node2 "sudo systemctl start chainfire"
ssh node3 "sudo systemctl start chainfire"
```
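
Picking the survivor can be scripted rather than eyeballed. The sketch below assumes one `<node> raft_index=<n>` line per node (e.g. collected by grepping each node's `status` output, as shown elsewhere in this runbook; the exact field name may differ):

```shell
# Sketch: print the node with the highest raft_index -- the one to keep
# when recovering from split-brain. Input: "<node> raft_index=<n>" lines.
pick_survivor() {
  # Sort numerically on the value after '=', keep the last (largest) line
  sort -t= -k2 -n | tail -1 | awk '{print $1}'
}

printf 'node1 raft_index=12345\nnode2 raft_index=12340\nnode3 raft_index=12344\n' | pick_survivor
# node1
```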

### Issue: High Write Latency

**Symptoms:**
- `chainfire-client put` commands take >100ms
- Application reports slow writes
- Metrics show p99 latency >500ms

**Diagnosis:**
```bash
# 1. Check disk I/O
iostat -x 1 10
# Watch for %util > 80% or await > 20ms

# 2. Check Raft replication lag
chainfire-client --endpoint http://LEADER_IP:2379 status
# Compare raft_index across nodes

# 3. Check network latency between nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ping -c 10 $node
done

# 4. Check CPU usage
top -bn1 | grep chainfire

# 5. Check RocksDB stats
# Look for stalls in the logs
journalctl -u chainfire -n 500 | grep -i stall
```

**Resolution:**

**If disk I/O bottleneck:**
```bash
# 1. Check the data directory is on an SSD (not an HDD)
df -h /var/lib/chainfire
mount | grep /var/lib/chainfire
```

```toml
# 2. Tune RocksDB settings in /etc/centra-cloud/chainfire.toml
[storage]
# Increase write buffer size
write_buffer_size = 134217728   # 128MB (default: 64MB)
# Increase block cache
block_cache_size = 536870912    # 512MB (default: 256MB)
# 3. Enable direct I/O if on a dedicated disk
use_direct_io_for_flush_and_compaction = true
```

```bash
# 4. Restart the service
sudo systemctl restart chainfire
```

**If network latency:**
```bash
# Verify nodes are in the same datacenter
# For cross-datacenter deployments, expect higher latency
# Consider adding learner nodes instead of voters

# Check MTU settings
ip link show | grep mtu
# Ensure MTU is consistent across nodes (typically 1500, or 9000 for jumbo frames)
```

**If CPU bottleneck:**
```bash
# Scale vertically (add CPU cores)
# or scale horizontally (add read replicas as learner nodes)
```

```toml
# Tune the Raft tick interval in the config
[raft]
tick_interval_ms = 200   # Increase from the default 100ms
```

### Issue: Data Inconsistency After Crash

**Symptoms:**
- After node crash/restart, reads return stale data
- `raft_index` does not advance
- Logs show "corrupted log entry" errors

**Diagnosis:**
```bash
# 1. Check RocksDB integrity
# Stop the service first
sudo systemctl stop chainfire

# Run RocksDB repair
rocksdb_ldb --db=/var/lib/chainfire repair

# Check for corruption
rocksdb_ldb --db=/var/lib/chainfire checkconsistency
```

**Resolution:**

**If minor corruption (repair successful):**
```bash
# Restart the service
sudo systemctl start chainfire

# Let Raft catch up from the leader
# Monitor raft_index
watch -n 1 "chainfire-client --endpoint http://localhost:2379 status | grep raft_index"
```

**If major corruption (repair failed):**
```bash
# Restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.corrupted
sudo mkdir -p /var/lib/chainfire

# Extract the latest backup
LATEST_BACKUP=$(ls -t /var/backups/chainfire/*.tar.gz | head -1)
sudo tar -xzf "$LATEST_BACKUP" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Restart
sudo systemctl start chainfire
```

**If restore is impossible (no backup):**
```bash
# Remove the node from the cluster and re-add it fresh
# From the leader node:
chainfire-client --endpoint http://LEADER_IP:2379 member-remove --node-id FAILED_NODE_ID

# On the failed node, wipe and rejoin
sudo systemctl stop chainfire
sudo rm -rf /var/lib/chainfire/*
sudo systemctl start chainfire

# Re-add from the leader
chainfire-client --endpoint http://LEADER_IP:2379 member-add \
  --node-id FAILED_NODE_ID \
  --peer-url FAILED_NODE_IP:2380 \
  --learner

# Promote after catch-up
chainfire-client --endpoint http://LEADER_IP:2379 member-promote --node-id FAILED_NODE_ID
```

## FlareDB Issues

### Issue: Store Not Registering with PD

**Symptoms:**
- New FlareDB store starts but doesn't appear in `cluster-status`
- Store logs show "failed to register with PD" errors
- PD logs show no registration attempts

**Diagnosis:**
```bash
# 1. Check PD connectivity
ssh FLAREDB_NODE "nc -zv PD_IP 2379"

# 2. Verify the PD address in the config
ssh FLAREDB_NODE "grep pd_addr /etc/centra-cloud/flaredb.toml"

# 3. Check store logs
ssh FLAREDB_NODE "journalctl -u flaredb -n 100 | grep -i 'pd\|register'"

# 4. Check PD logs
ssh PD_NODE "journalctl -u placement-driver -n 100 | grep -i register"

# 5. Verify store_id is unique
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | .id'
```

**Resolution:**

**If network issue:**
```bash
# Open the firewall on the PD node
ssh PD_NODE "sudo firewall-cmd --permanent --add-port=2379/tcp"
ssh PD_NODE "sudo firewall-cmd --reload"

# Restart the store
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

**If duplicate store_id:**
```bash
# Assign a new unique store_id
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Change: store_id = <NEW_UNIQUE_ID>

# Wipe the old data (it contains the old store_id)
ssh FLAREDB_NODE "sudo rm -rf /var/lib/flaredb/*"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

**If TLS mismatch:**
```bash
# Ensure PD and the store have matching TLS configs
# Either both use TLS or neither does

# If PD uses TLS:
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Add/verify:
# [tls]
# cert_file = "/etc/centra-cloud/certs/flaredb-node-N.crt"
# key_file = "/etc/centra-cloud/certs/flaredb-node-N.key"
# ca_file = "/etc/centra-cloud/certs/ca.crt"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

### Issue: Region Rebalancing Stuck

**Symptoms:**
- `pd/api/v1/stats/region` shows a high `pending_peers` count
- Regions are not moving to new stores
- PD logs show "failed to schedule operator" errors

**Diagnosis:**
```bash
# 1. Check region stats
curl http://PD_IP:2379/pd/api/v1/stats/region | jq

# 2. Check store capacity
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, available, capacity}'

# 3. Check pending operators
curl http://PD_IP:2379/pd/api/v1/operators | jq

# 4. Check the PD scheduler config
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq
```

**Resolution:**

**If a store is down:**
```bash
# Identify the down store
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.state!="Up")'

# Fix or remove the down store
ssh DOWN_STORE_NODE "sudo systemctl restart flaredb"

# If it cannot be recovered, remove the store:
curl -X DELETE http://PD_IP:2379/pd/api/v1/store/DOWN_STORE_ID
```

**If disk full:**
```bash
# Identify full stores (less than 10% available)
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select((.available / .capacity) < 0.1)'

# Add more storage or scale out with new stores
# See scale-out.md for adding stores
```

**If the scheduler is disabled:**
```bash
# Check scheduler status
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq '.schedulers'

# Enable schedulers if disabled
curl -X POST http://PD_IP:2379/pd/api/v1/config/schedule \
  -d '{"max-snapshot-count": 3, "max-pending-peer-count": 16}'
```
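
The same 10% free-space check can be done without a jq filter by post-processing `id available capacity` triples, which is handy when piping from other sources. A sketch (the jq extraction in the comment assumes the store fields used elsewhere in this runbook):

```shell
# Sketch: flag stores under 10% free from "id available capacity" triples,
# e.g. extracted with: jq -r '.stores[] | "\(.id) \(.available) \(.capacity)"'
low_space() {
  awk '$2 / $3 < 0.1 { print "store " $1 " low on space" }'
}

# Example with sample data (store 1 has 50 of 1000 units free):
printf '1 50 1000\n2 400 1000\n' | low_space
# store 1 low on space
```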

### Issue: Read/Write Timeout

**Symptoms:**
- Client operations time out after 30s
- Logs show "context deadline exceeded"
- No leader election issues are visible

**Diagnosis:**
```bash
# 1. Check the client timeout config
# The default timeout is 30s

# 2. Check store responsiveness
time flaredb-client --endpoint http://STORE_IP:2379 get test-key

# 3. Check CPU usage on the stores
ssh STORE_NODE "top -bn1 | grep flaredb"

# 4. Check for slow queries
ssh STORE_NODE "journalctl -u flaredb -n 500 | grep -i 'slow\|timeout'"

# 5. Check disk latency
ssh STORE_NODE "iostat -x 1 10"
```

**Resolution:**

**If disk I/O bottleneck:**
```bash
# Same approach as the Chainfire high-latency issue:
# 1. Verify SSD usage
# 2. Tune RocksDB settings
# 3. Add more stores for read distribution
```

**If CPU bottleneck:**
```bash
# Check for compaction storms
ssh STORE_NODE "journalctl -u flaredb | grep -i compaction | tail -50"
```

```toml
# Throttle compaction if needed, in the flaredb config:
[storage]
max_background_compactions = 2   # Reduce from the default 4
max_background_flushes = 1       # Reduce from the default 2
```

```bash
sudo systemctl restart flaredb
```

**If network partition:**
```bash
# Check connectivity between the store and PD
ssh STORE_NODE "ping -c 10 PD_IP"

# Check for packet loss
# If loss exceeds 1%, investigate the network infrastructure
```

## TLS/mTLS Issues

### Issue: TLS Handshake Failures

**Symptoms:**
- Logs show "tls: bad certificate" or "certificate verify failed"
- Connections fail immediately
- curl commands fail with SSL errors

**Diagnosis:**
```bash
# 1. Verify the certificate files exist
ls -l /etc/centra-cloud/certs/

# 2. Check certificate validity
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -dates

# 3. Verify the CA matches
openssl x509 -in /etc/centra-cloud/certs/ca.crt -noout -subject
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -issuer

# 4. Test the TLS connection
openssl s_client -connect NODE_IP:2379 \
  -CAfile /etc/centra-cloud/certs/ca.crt \
  -cert /etc/centra-cloud/certs/chainfire-node-1.crt \
  -key /etc/centra-cloud/certs/chainfire-node-1.key
```
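
Expiry problems are easier to catch before handshakes start failing. The helper below is a sketch that flags certificates expiring within 30 days (an assumed threshold) using `openssl x509 -checkend`; the certificate directory follows this runbook's layout.

```shell
# Sketch: warn about certificates expiring within 30 days.
check_cert_expiry() {
  CERT_DIR="${1:-/etc/centra-cloud/certs}"
  for crt in "$CERT_DIR"/*.crt; do
    [ -e "$crt" ] || continue   # skip if the glob matched nothing
    # -checkend exits non-zero if the cert expires within N seconds
    if openssl x509 -in "$crt" -noout -checkend $((30 * 86400)) >/dev/null; then
      echo "OK:       $crt"
    else
      echo "EXPIRING: $crt"
    fi
  done
}

check_cert_expiry /etc/centra-cloud/certs
```

Running this from cron before each rotation window gives advance warning rather than a page.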

**Resolution:**

**If certificate expired:**
```bash
# Regenerate certificates
cd /path/to/centra-cloud
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /etc/centra-cloud/certs/* $node:/etc/centra-cloud/certs/
done

# Restart services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done
```

**If CA mismatch:**
```bash
# Ensure all nodes use the same CA
# Regenerate all certs from a single CA

# On the CA-generating node:
./scripts/generate-dev-certs.sh /tmp/new-certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /tmp/new-certs/* $node:/etc/centra-cloud/certs/
  ssh $node "sudo chown -R chainfire:chainfire /etc/centra-cloud/certs"
  ssh $node "sudo chmod 600 /etc/centra-cloud/certs/*.key"
done

# Restart all services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done
```

**If permissions issue:**
```bash
# Fix certificate file permissions
sudo chown chainfire:chainfire /etc/centra-cloud/certs/*
sudo chmod 644 /etc/centra-cloud/certs/*.crt
sudo chmod 600 /etc/centra-cloud/certs/*.key

# Restart the service
sudo systemctl restart chainfire
```

## Performance Tuning

### Chainfire Performance Optimization

**For write-heavy workloads:**
```toml
# /etc/centra-cloud/chainfire.toml

[storage]
# Increase the write buffer
write_buffer_size = 134217728   # 128MB

# More write buffers
max_write_buffer_number = 4

# Larger block cache for hot data
block_cache_size = 1073741824   # 1GB

# Reduce compaction frequency
level0_file_num_compaction_trigger = 8   # Default: 4
```

**For read-heavy workloads:**
```toml
[storage]
# Larger block cache
block_cache_size = 2147483648   # 2GB

# Enable bloom filters
bloom_filter_bits_per_key = 10

# Larger table cache
max_open_files = 10000   # Default: 1000
```

**For low-latency requirements:**
```toml
[raft]
# Reduce the tick interval
tick_interval_ms = 50   # Default: 100

[storage]
# Enable direct I/O
use_direct_io_for_flush_and_compaction = true
```

### FlareDB Performance Optimization

**For high ingestion rates:**
```toml
# /etc/centra-cloud/flaredb.toml

[storage]
# Larger write buffers
write_buffer_size = 268435456   # 256MB
max_write_buffer_number = 6

# More background jobs
max_background_compactions = 4
max_background_flushes = 2
```

**For large query workloads:**
```toml
[storage]
# Larger block cache
block_cache_size = 4294967296   # 4GB

# Keep more files open
max_open_files = 20000
```

## Monitoring & Alerts

### Key Metrics to Monitor

**Chainfire:**
- `raft_index` - should advance steadily
- `raft_term` - should be stable (not increasing frequently)
- Write latency p50, p95, p99
- Disk I/O utilization
- Network bandwidth between nodes

**FlareDB:**
- Store state (Up/Down)
- Region count and distribution
- Pending peers count (should stay near 0)
- Read/write QPS per store
- Available disk space

### Prometheus Queries

```promql
# Chainfire p99 write latency
histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m]))

# Raft log replication lag
chainfire_raft_index{role="leader"} - chainfire_raft_index{role="follower"}

# FlareDB store health
flaredb_store_state == 1   # 1 = Up, 0 = Down

# Region rebalancing activity
rate(flaredb_pending_peers_total[5m])
```
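
When Prometheus is unavailable, the replication-lag query can be approximated directly from `status` output. A sketch, assuming one `<node> raft_index=<n>` line per node with the leader's line first (the line format is illustrative):

```shell
# Sketch: per-follower raft_index lag from "<node> raft_index=<n>" lines.
# First line is the leader; remaining lines are followers.
raft_lag() {
  # Split on space or '=': $1 is the node name, $3 is the raft_index value
  awk -F'[ =]' 'NR==1 {leader=$3; next} {print $1, leader-$3}'
}

printf 'leader raft_index=12345\nnode2 raft_index=12340\nnode3 raft_index=12344\n' | raft_lag
# node2 5
# node3 1
```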

### Alerting Rules

```yaml
# Prometheus alerting rules

groups:
  - name: chainfire
    rules:
      - alert: ChainfireNoLeader
        expr: chainfire_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire cluster has no leader"

      - alert: ChainfireHighWriteLatency
        expr: histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chainfire p99 write latency >500ms"

      - alert: ChainfireNodeDown
        expr: up{job="chainfire"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire node {{ $labels.instance }} is down"

  - name: flaredb
    rules:
      - alert: FlareDBStoreDown
        expr: flaredb_store_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "FlareDB store {{ $labels.store_id }} is down"

      - alert: FlareDBHighPendingPeers
        expr: flaredb_pending_peers_total > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FlareDB has {{ $value }} pending peers (rebalancing stuck?)"
```

## Log Analysis

### Common Log Patterns

**Chainfire healthy operation:**
```
INFO chainfire_raft: Leader elected, term=3
INFO chainfire_storage: Committed entry, index=12345
INFO chainfire_api: Handled put request, latency=15ms
```

**Chainfire warning signs:**
```
WARN chainfire_raft: Election timeout, no heartbeat from leader
WARN chainfire_storage: RocksDB stall detected, duration=2000ms
ERROR chainfire_network: Failed to connect to peer, addr=node2:2380
```

**FlareDB healthy operation:**
```
INFO flaredb_pd_client: Registered with PD, store_id=1
INFO flaredb_raft: Applied snapshot, index=5000
INFO flaredb_service: Handled query, rows=1000, latency=50ms
```

**FlareDB warning signs:**
```
WARN flaredb_pd_client: Heartbeat to PD failed, retrying...
WARN flaredb_storage: Compaction is slow, duration=30s
ERROR flaredb_raft: Failed to replicate log, peer=store2
```

### Log Aggregation Queries

**Using journalctl:**
```bash
# Find all errors in the last hour
journalctl -u chainfire --since "1 hour ago" | grep ERROR

# Count error types
journalctl -u chainfire --since "1 day ago" | grep ERROR | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track leader changes
journalctl -u chainfire | grep "Leader elected" | tail -20
```

**Using grep for pattern matching:**
```bash
# Find slow operations (latency of 100ms or more)
journalctl -u chainfire -n 10000 | grep -E 'latency=[0-9]{3,}ms'

# Find connection errors
journalctl -u chainfire -n 5000 | grep -i 'connection refused\|timeout\|unreachable'

# Find replication lag
journalctl -u chainfire | grep -i 'lag\|behind\|catch.*up'
```

## References

- Configuration: `specifications/configuration.md`
- Backup/Restore: `docs/ops/backup-restore.md`
- Scale-Out: `docs/ops/scale-out.md`
- Upgrade: `docs/ops/upgrade.md`
- RocksDB Tuning: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide