
Troubleshooting Runbook

Overview

This runbook provides diagnostic procedures and solutions for common operational issues with Chainfire (a distributed key-value store) and FlareDB (a time-series database).

Quick Diagnostics

Health Check Commands

# Chainfire cluster health
chainfire-client --endpoint http://NODE_IP:2379 status
chainfire-client --endpoint http://NODE_IP:2379 member-list

# FlareDB cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, capacity}'

# Service status
systemctl status chainfire
systemctl status flaredb

# Port connectivity
nc -zv NODE_IP 2379  # API port
nc -zv NODE_IP 2380  # Raft port
nc -zv NODE_IP 2381  # Gossip port

# Resource usage
top -bn1 | head -20
df -h
iostat -x 1 5

# Recent logs
journalctl -u chainfire -n 100 --no-pager
journalctl -u flaredb -n 100 --no-pager

Chainfire Issues

Issue: Node Cannot Join Cluster

Symptoms:

  • member-add command hangs or times out
  • New node logs show "connection refused" or "timeout" errors
  • member-list does not show the new node

Diagnosis:

# 1. Check network connectivity
nc -zv NEW_NODE_IP 2380

# 2. Verify Raft server is listening on new node
ssh NEW_NODE_IP "ss -tlnp | grep 2380"

# 3. Check firewall rules
ssh NEW_NODE_IP "sudo iptables -L -n | grep 2380"

# 4. Verify TLS configuration matches
ssh NEW_NODE_IP "grep -A5 '\[network.tls\]' /etc/centra-cloud/chainfire.toml"

# 5. Check leader logs
ssh LEADER_NODE "journalctl -u chainfire -n 50 | grep -i 'add.*node'"

Resolution:

If network issue:

# Open firewall ports on new node
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --permanent --add-port=2381/tcp
sudo firewall-cmd --reload

If TLS mismatch:

# Ensure new node has correct certificates
sudo ls -l /etc/centra-cloud/certs/
# Should have: ca.crt, chainfire-node-N.crt, chainfire-node-N.key

# Verify certificate is valid
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-N.crt -noout -text

If bootstrap flag set incorrectly:

# Edit config on new node
sudo vi /etc/centra-cloud/chainfire.toml

# Ensure:
# [cluster]
# bootstrap = false  # MUST be false for joining nodes

sudo systemctl restart chainfire

Issue: No Leader / Leader Election Fails

Symptoms:

  • Writes fail with "no leader elected" error
  • chainfire-client status shows leader: none
  • Logs show repeated "election timeout" messages

Diagnosis:

# 1. Check cluster membership
chainfire-client --endpoint http://NODE1_IP:2379 member-list

# 2. Check Raft state on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u chainfire -n 20 | grep -i 'raft\|leader\|election'"
done

# 3. Check network partition
for node in node1 node2 node3; do
  for peer in node1 node2 node3; do
    echo "$node -> $peer:"
    ssh $node "ping -c 3 $peer"
  done
done

# 4. Check quorum
# For 3-node cluster, need 2 nodes (majority)
RUNNING_NODES=$(for node in node1 node2 node3; do ssh $node "systemctl is-active chainfire" 2>/dev/null; done | grep -cx active)  # -x: exact match, so "inactive" is not counted
echo "Running nodes: $RUNNING_NODES (need >= 2 for quorum)"
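The quorum arithmetic above generalizes to any cluster size: a Raft cluster needs a strict majority, i.e. floor(N/2) + 1 voters. A minimal helper (the function names are illustrative, not part of any shipped tooling):

```shell
# Quorum size for an N-node Raft cluster is floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# A cluster has quorum when running >= quorum_size(total).
has_quorum() {
  local running=$1 total=$2
  [ "$running" -ge "$(quorum_size "$total")" ] && echo yes || echo no
}

quorum_size 3   # -> 2
quorum_size 5   # -> 3
has_quorum 1 3  # -> no
has_quorum 2 3  # -> yes
```

Note that an even-sized cluster tolerates no more failures than the next-smaller odd size: 4 nodes still need 3 for quorum, which is why 3- or 5-node clusters are the usual recommendation.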

Resolution:

If a majority of nodes is down (no quorum):

# Start majority of nodes
ssh node1 "sudo systemctl start chainfire"
ssh node2 "sudo systemctl start chainfire"

# Wait for leader election
sleep 10

# Verify leader elected
chainfire-client --endpoint http://node1:2379 status | grep leader

If network partition:

# Check and fix network connectivity
# Ensure bidirectional connectivity between all nodes

# Restart affected nodes
ssh ISOLATED_NODE "sudo systemctl restart chainfire"

If split-brain (multiple leaders):

# DANGER: This wipes follower data
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Keep only the node with the highest raft_index (compare `status` output on each node first)
# Wipe the others (node2 and node3 in this example)
ssh node2 "sudo rm -rf /var/lib/chainfire/*"
ssh node3 "sudo rm -rf /var/lib/chainfire/*"

# Restart leader (node1 in this example)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
ssh node2 "sudo systemctl start chainfire"
ssh node3 "sudo systemctl start chainfire"

Issue: High Write Latency

Symptoms:

  • chainfire-client put commands take >100ms
  • Application reports slow writes
  • Metrics show p99 latency >500ms

Diagnosis:

# 1. Check disk I/O
iostat -x 1 10
# Watch for %util > 80% or await > 20ms

# 2. Check Raft replication lag
chainfire-client --endpoint http://LEADER_IP:2379 status
# Compare raft_index across nodes

# 3. Check network latency between nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ping -c 10 $node
done

# 4. Check CPU usage
top -bn1 | grep chainfire

# 5. Check RocksDB stats
# Look for stalls in logs
journalctl -u chainfire -n 500 | grep -i stall
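The %util threshold from step 1 can be checked mechanically. A sketch that finds the "%util" column from the `iostat -x` header rather than hard-coding its position (column layout varies across sysstat versions; the function name is illustrative):

```shell
# Print "device %util" for devices whose %util exceeds a threshold.
flag_saturated() {
  awk -v thresh="$1" '
    /%util/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
    col && NF >= col && $col + 0 > thresh { print $1, $col }
  '
}

# Sample iostat-like input (Device ... %util):
printf 'Device r/s w/s %%util\nsda 10 5 92.3\nsdb 1 0 12.0\n' | flag_saturated 80
# -> sda 92.3
```

In practice you would pipe a single `iostat -x` report into it, e.g. `iostat -x 1 1 | flag_saturated 80`.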

Resolution:

If disk I/O bottleneck:

# 1. Check data directory is on SSD (not HDD)
df -h /var/lib/chainfire
mount | grep /var/lib/chainfire

# 2. Tune RocksDB settings (in config)
[storage]
# Increase write buffer size
write_buffer_size = 134217728  # 128MB (default: 64MB)
# Increase block cache
block_cache_size = 536870912   # 512MB (default: 256MB)

# 3. Enable direct I/O if on dedicated disk
# Add to config:
use_direct_io_for_flush_and_compaction = true

# 4. Restart service
sudo systemctl restart chainfire

If network latency:

# Verify nodes are in same datacenter
# For cross-datacenter, expect higher latency
# Consider adding learner nodes instead of voters

# Check MTU settings
ip link show | grep mtu
# Ensure MTU is consistent across nodes (typically 1500 or 9000 for jumbo frames)

If CPU bottleneck:

# Scale vertically (add CPU cores)
# Or scale horizontally (add read replicas as learner nodes)

# Tune Raft tick interval (in config)
[raft]
tick_interval_ms = 200  # Increase from default 100ms

Issue: Data Inconsistency After Crash

Symptoms:

  • After node crash/restart, reads return stale data
  • raft_index does not advance
  • Logs show "corrupted log entry" errors

Diagnosis:

# 1. Check RocksDB integrity
# Stop service first
sudo systemctl stop chainfire

# Run RocksDB repair
rocksdb_ldb --db=/var/lib/chainfire repair

# Check for corruption
rocksdb_ldb --db=/var/lib/chainfire checkconsistency

Resolution:

If minor corruption (repair successful):

# Restart service
sudo systemctl start chainfire

# Let Raft catch up from leader
# Monitor raft_index
watch -n 1 "chainfire-client --endpoint http://localhost:2379 status | grep raft_index"

If major corruption (repair failed):

# Restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.corrupted
sudo mkdir -p /var/lib/chainfire

# Extract latest backup
LATEST_BACKUP=$(ls -t /var/backups/chainfire/*.tar.gz | head -1)
sudo tar -xzf "$LATEST_BACKUP" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Restart
sudo systemctl start chainfire
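Before moving the corrupted directory aside, it is worth confirming the backup archive itself is readable; `tar -tzf` lists the archive contents and fails on a truncated or corrupt file. A sketch (the demo archive and paths are throwaway examples, not the real backup layout):

```shell
# Return 0 if the gzip tarball can be fully listed, nonzero otherwise.
verify_backup() {
  tar -tzf "$1" > /dev/null 2>&1
}

# Demo against a throwaway archive:
tmpdir=$(mktemp -d)
echo data > "$tmpdir/member.db"
tar -czf "$tmpdir/backup.tar.gz" -C "$tmpdir" member.db

if verify_backup "$tmpdir/backup.tar.gz"; then
  echo "backup OK"
else
  echo "backup CORRUPT - do not restore from it"
fi
rm -rf "$tmpdir"
```

Run this against `$LATEST_BACKUP` before the `mv`; if the newest archive is corrupt, fall back to the next one in `ls -t` order.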

If cannot restore (no backup):

# Remove node from cluster and re-add fresh
# From leader node:
chainfire-client --endpoint http://LEADER_IP:2379 member-remove --node-id FAILED_NODE_ID

# On failed node, wipe and rejoin
sudo systemctl stop chainfire
sudo rm -rf /var/lib/chainfire/*
sudo systemctl start chainfire

# Re-add from leader
chainfire-client --endpoint http://LEADER_IP:2379 member-add \
  --node-id FAILED_NODE_ID \
  --peer-url FAILED_NODE_IP:2380 \
  --learner

# Promote after catchup
chainfire-client --endpoint http://LEADER_IP:2379 member-promote --node-id FAILED_NODE_ID

FlareDB Issues

Issue: Store Not Registering with PD

Symptoms:

  • New FlareDB store starts but doesn't appear in cluster-status
  • Store logs show "failed to register with PD" errors
  • PD logs show no registration attempts

Diagnosis:

# 1. Check PD connectivity
ssh FLAREDB_NODE "nc -zv PD_IP 2379"

# 2. Verify PD address in config
ssh FLAREDB_NODE "grep pd_addr /etc/centra-cloud/flaredb.toml"

# 3. Check store logs
ssh FLAREDB_NODE "journalctl -u flaredb -n 100 | grep -i 'pd\|register'"

# 4. Check PD logs
ssh PD_NODE "journalctl -u placement-driver -n 100 | grep -i register"

# 5. Verify store_id is unique
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | .id'
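The uniqueness check in step 5 can be automated: pipe the id list emitted by the jq query into `sort | uniq -d`; any output is a duplicated store_id. A sketch with sample input:

```shell
# Print any store_id that appears more than once (one id per input line).
find_duplicate_ids() {
  sort -n | uniq -d
}

# Sample id list, as the jq query above would emit:
printf '1\n2\n2\n3\n' | find_duplicate_ids
# -> 2
```

Empty output means all ids are unique and the registration failure lies elsewhere.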

Resolution:

If network issue:

# Open firewall on PD node
ssh PD_NODE "sudo firewall-cmd --permanent --add-port=2379/tcp"
ssh PD_NODE "sudo firewall-cmd --reload"

# Restart store
ssh FLAREDB_NODE "sudo systemctl restart flaredb"

If duplicate store_id:

# Assign new unique store_id
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Change: store_id = <NEW_UNIQUE_ID>

# Wipe old data (contains old store_id)
ssh FLAREDB_NODE "sudo rm -rf /var/lib/flaredb/*"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"

If TLS mismatch:

# Ensure PD and store have matching TLS config
# Either both use TLS or both don't

# If PD uses TLS:
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Add/verify:
# [tls]
# cert_file = "/etc/centra-cloud/certs/flaredb-node-N.crt"
# key_file = "/etc/centra-cloud/certs/flaredb-node-N.key"
# ca_file = "/etc/centra-cloud/certs/ca.crt"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"

Issue: Region Rebalancing Stuck

Symptoms:

  • pd/api/v1/stats/region shows high pending_peers count
  • Regions not moving to new stores
  • PD logs show "failed to schedule operator" errors

Diagnosis:

# 1. Check region stats
curl http://PD_IP:2379/pd/api/v1/stats/region | jq

# 2. Check store capacity
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, available, capacity}'

# 3. Check pending operators
curl http://PD_IP:2379/pd/api/v1/operators | jq

# 4. Check PD scheduler config
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq

Resolution:

If store is down:

# Identify down store
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.state!="Up")'

# Fix or remove down store
ssh DOWN_STORE_NODE "sudo systemctl restart flaredb"

# If cannot recover, remove store:
curl -X DELETE http://PD_IP:2379/pd/api/v1/store/DOWN_STORE_ID

If disk full:

# Identify full stores
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select((.available / .capacity) < 0.1)'

# Add more storage or scale out with new stores
# See scale-out.md for adding stores
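The 10%-free rule behind the jq filter above is easy to apply to exported data as well. A sketch that flags low-space stores given plain "id available capacity" rows in bytes (the input format here is an assumption for illustration):

```shell
# Print "id percent-free" for stores whose free ratio is below min_free.
flag_low_space() {
  awk -v min_free="$1" '$3 > 0 && ($2 / $3) < min_free { printf "%s %.0f%%\n", $1, 100 * $2 / $3 }'
}

printf '1 5 100\n2 50 100\n' | flag_low_space 0.10
# -> 1 5%
```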

If schedulers are disabled or their limits are too low:

# Check scheduler status
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq '.schedulers'

# Raise scheduling limits if they are throttling rebalancing
curl -X POST http://PD_IP:2379/pd/api/v1/config/schedule \
  -d '{"max-snapshot-count": 3, "max-pending-peer-count": 16}'

Issue: Read/Write Timeout

Symptoms:

  • Client operations timeout after 30s
  • Logs show "context deadline exceeded"
  • No leader election issues visible

Diagnosis:

# 1. Check client timeout config
# Default timeout is 30s

# 2. Check store responsiveness
time flaredb-client --endpoint http://STORE_IP:2379 get test-key

# 3. Check CPU usage on stores
ssh STORE_NODE "top -bn1 | grep flaredb"

# 4. Check slow queries
ssh STORE_NODE "journalctl -u flaredb -n 500 | grep -i 'slow\|timeout'"

# 5. Check disk latency
ssh STORE_NODE "iostat -x 1 10"

Resolution:

If disk I/O bottleneck:

# Same as Chainfire high latency issue
# 1. Verify SSD usage
# 2. Tune RocksDB settings
# 3. Add more stores for read distribution

If CPU bottleneck:

# Check compaction storms
ssh STORE_NODE "journalctl -u flaredb | grep -i compaction | tail -50"

# Throttle compaction if needed
# Add to flaredb config:
[storage]
max_background_compactions = 2  # Reduce from default 4
max_background_flushes = 1      # Reduce from default 2

sudo systemctl restart flaredb

If network partition:

# Check connectivity between store and PD
ssh STORE_NODE "ping -c 10 PD_IP"

# Check for packet loss
# If >1% loss, investigate network infrastructure

TLS/mTLS Issues

Issue: TLS Handshake Failures

Symptoms:

  • Logs show "tls: bad certificate" or "certificate verify failed"
  • Connections fail immediately
  • curl commands fail with SSL errors

Diagnosis:

# 1. Verify certificate files exist
ls -l /etc/centra-cloud/certs/

# 2. Check certificate validity
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -dates

# 3. Verify CA matches
openssl x509 -in /etc/centra-cloud/certs/ca.crt -noout -subject
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -issuer

# 4. Test TLS connection
openssl s_client -connect NODE_IP:2379 \
  -CAfile /etc/centra-cloud/certs/ca.crt \
  -cert /etc/centra-cloud/certs/chainfire-node-1.crt \
  -key /etc/centra-cloud/certs/chainfire-node-1.key
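Expiry can also be checked without reading the dates by eye: `openssl x509 -checkend N` exits 0 only if the certificate is still valid N seconds from now, which makes a convenient pre-rotation cron check. A sketch (the helper name is illustrative; the demo generates a throwaway self-signed certificate):

```shell
# Return 0 if the certificate expires within the given number of days.
cert_expires_within_days() {
  local cert=$1 days=$2
  ! openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert"
}

# Demo with a throwaway self-signed cert valid for 10 days:
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 10 \
  -keyout "$tmp/demo.key" -out "$tmp/demo.crt" 2>/dev/null

cert_expires_within_days "$tmp/demo.crt" 30 && echo "renew soon"
cert_expires_within_days "$tmp/demo.crt" 5  || echo "fine for now"
rm -rf "$tmp"
```

Pointed at `/etc/centra-cloud/certs/*.crt`, this can alert well before a handshake failure occurs.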

Resolution:

If certificate expired:

# Regenerate certificates
cd /path/to/centra-cloud
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /etc/centra-cloud/certs/* $node:/etc/centra-cloud/certs/
done

# Restart services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done

If CA mismatch:

# Ensure all nodes use same CA
# Regenerate all certs from same CA

# On CA-generating node:
./scripts/generate-dev-certs.sh /tmp/new-certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /tmp/new-certs/* $node:/etc/centra-cloud/certs/
  ssh $node "sudo chown -R chainfire:chainfire /etc/centra-cloud/certs"
  ssh $node "sudo chmod 600 /etc/centra-cloud/certs/*.key"
done

# Restart all services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done

If permissions issue:

# Fix certificate file permissions
sudo chown chainfire:chainfire /etc/centra-cloud/certs/*
sudo chmod 644 /etc/centra-cloud/certs/*.crt
sudo chmod 600 /etc/centra-cloud/certs/*.key

# Restart service
sudo systemctl restart chainfire

Performance Tuning

Chainfire Performance Optimization

For write-heavy workloads:

# /etc/centra-cloud/chainfire.toml

[storage]
# Increase write buffer
write_buffer_size = 134217728  # 128MB

# More write buffers
max_write_buffer_number = 4

# Larger block cache for hot data
block_cache_size = 1073741824  # 1GB

# Reduce compaction frequency
level0_file_num_compaction_trigger = 8  # Default: 4

For read-heavy workloads:

[storage]
# Larger block cache
block_cache_size = 2147483648  # 2GB

# Enable bloom filters
bloom_filter_bits_per_key = 10

# More table cache
max_open_files = 10000  # Default: 1000

For low-latency requirements:

[raft]
# Reduce tick interval
tick_interval_ms = 50  # Default: 100

[storage]
# Enable direct I/O
use_direct_io_for_flush_and_compaction = true

FlareDB Performance Optimization

For high ingestion rate:

# /etc/centra-cloud/flaredb.toml

[storage]
# Larger write buffers
write_buffer_size = 268435456  # 256MB
max_write_buffer_number = 6

# More background jobs
max_background_compactions = 4
max_background_flushes = 2

For large query workloads:

[storage]
# Larger block cache
block_cache_size = 4294967296  # 4GB

# Keep more files open
max_open_files = 20000

Monitoring & Alerts

Key Metrics to Monitor

Chainfire:

  • raft_index - should advance steadily
  • raft_term - should be stable (not increasing frequently)
  • Write latency p50, p95, p99
  • Disk I/O utilization
  • Network bandwidth between nodes

FlareDB:

  • Store state (Up/Down)
  • Region count and distribution
  • Pending peers count (should be near 0)
  • Read/write QPS per store
  • Disk space available

Prometheus Queries

# Chainfire write latency
histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m]))

# Raft log replication lag (leader index minus the most lagged follower;
# aggregation is needed because series with different labels don't subtract directly)
max(chainfire_raft_index{role="leader"}) - min(chainfire_raft_index{role="follower"})

# FlareDB store health
flaredb_store_state == 1  # 1 = Up, 0 = Down

# Region rebalancing activity
rate(flaredb_pending_peers_total[5m])

Alerting Rules

# Prometheus alerting rules

groups:
  - name: chainfire
    rules:
      - alert: ChainfireNoLeader
        expr: chainfire_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire cluster has no leader"

      - alert: ChainfireHighWriteLatency
        expr: histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chainfire p99 write latency >500ms"

      - alert: ChainfireNodeDown
        expr: up{job="chainfire"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire node {{ $labels.instance }} is down"

  - name: flaredb
    rules:
      - alert: FlareDBStoreDown
        expr: flaredb_store_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "FlareDB store {{ $labels.store_id }} is down"

      - alert: FlareDBHighPendingPeers
        expr: flaredb_pending_peers_total > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FlareDB has {{ $value }} pending peers (rebalancing stuck?)"

Log Analysis

Common Log Patterns

Chainfire healthy operation:

INFO chainfire_raft: Leader elected, term=3
INFO chainfire_storage: Committed entry, index=12345
INFO chainfire_api: Handled put request, latency=15ms

Chainfire warning signs:

WARN chainfire_raft: Election timeout, no heartbeat from leader
WARN chainfire_storage: RocksDB stall detected, duration=2000ms
ERROR chainfire_network: Failed to connect to peer, addr=node2:2380

FlareDB healthy operation:

INFO flaredb_pd_client: Registered with PD, store_id=1
INFO flaredb_raft: Applied snapshot, index=5000
INFO flaredb_service: Handled query, rows=1000, latency=50ms

FlareDB warning signs:

WARN flaredb_pd_client: Heartbeat to PD failed, retrying...
WARN flaredb_storage: Compaction is slow, duration=30s
ERROR flaredb_raft: Failed to replicate log, peer=store2

Log Aggregation Queries

Using journalctl:

# Find all errors in last hour
journalctl -u chainfire --since "1 hour ago" | grep ERROR

# Count error types
journalctl -u chainfire --since "1 day ago" | grep ERROR | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track leader changes
journalctl -u chainfire | grep "Leader elected" | tail -20
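The `$NF` pipeline above keys on the last field, which in these logs is often a value like `addr=node2:2380`; grouping by the component field (second word in the log patterns shown in this runbook) usually gives a more useful breakdown. A sketch, assuming that log format:

```shell
# Count ERROR lines per component, most frequent first.
count_errors_by_component() {
  grep ERROR | awk '{ print $2 }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

printf '%s\n' \
  'ERROR chainfire_network: Failed to connect to peer, addr=node2:2380' \
  'ERROR chainfire_network: Failed to connect to peer, addr=node3:2380' \
  'ERROR chainfire_storage: Write stalled' \
  | count_errors_by_component
# -> 2 chainfire_network:
#    1 chainfire_storage:
```

In practice: `journalctl -u chainfire --since "1 day ago" | count_errors_by_component`.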

Using grep for pattern matching:

# Find slow operations
journalctl -u chainfire -n 10000 | grep -E 'latency=[0-9]{3,}ms'

# Find connection errors
journalctl -u chainfire -n 5000 | grep -i 'connection refused\|timeout\|unreachable'

# Find replication lag
journalctl -u chainfire | grep -i 'lag\|behind\|catch.*up'
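The latency regex above can be extended to extract the numeric values so an arbitrary millisecond threshold can be applied, rather than relying on digit count. A sketch, assuming the `latency=NNNms` format shown in this runbook:

```shell
# Print latency values (in ms) above a threshold from log lines on stdin.
slow_ops() {
  grep -oE 'latency=[0-9]+ms' \
    | sed -E 's/latency=([0-9]+)ms/\1/' \
    | awk -v thresh="$1" '$1 + 0 > thresh'
}

printf '%s\n' \
  'INFO api: Handled put request, latency=15ms' \
  'INFO api: Handled put request, latency=420ms' \
  | slow_ops 100
# -> 420
```

For example, `journalctl -u chainfire -n 10000 | slow_ops 250` lists every operation slower than 250ms.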
