Troubleshooting Runbook
Overview
This runbook provides diagnostic procedures and solutions for common operational issues with Chainfire (distributed KV) and FlareDB (time-series DB).
Quick Diagnostics
Health Check Commands
# Chainfire cluster health
chainfire-client --endpoint http://NODE_IP:2379 status
chainfire-client --endpoint http://NODE_IP:2379 member-list
# FlareDB cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, capacity}'
# Service status
systemctl status chainfire
systemctl status flaredb
# Port connectivity
nc -zv NODE_IP 2379 # API port
nc -zv NODE_IP 2380 # Raft port
nc -zv NODE_IP 2381 # Gossip port
# Resource usage
top -bn1 | head -20
df -h
iostat -x 1 5
# Recent logs
journalctl -u chainfire -n 100 --no-pager
journalctl -u flaredb -n 100 --no-pager
Chainfire Issues
Issue: Node Cannot Join Cluster
Symptoms:
- member-add command hangs or times out
- New node logs show "connection refused" or "timeout" errors
- member-list does not show the new node
Diagnosis:
# 1. Check network connectivity
nc -zv NEW_NODE_IP 2380
# 2. Verify Raft server is listening on new node
ssh NEW_NODE_IP "ss -tlnp | grep 2380"
# 3. Check firewall rules
ssh NEW_NODE_IP "sudo iptables -L -n | grep 2380"
# 4. Verify TLS configuration matches
ssh NEW_NODE_IP "grep -A5 '\[network.tls\]' /etc/centra-cloud/chainfire.toml"
# 5. Check leader logs
ssh LEADER_NODE "journalctl -u chainfire -n 50 | grep -i 'add.*node'"
Resolution:
If network issue:
# Open firewall ports on new node
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --permanent --add-port=2381/tcp
sudo firewall-cmd --reload
If TLS mismatch:
# Ensure new node has correct certificates
sudo ls -l /etc/centra-cloud/certs/
# Should have: ca.crt, chainfire-node-N.crt, chainfire-node-N.key
# Verify certificate is valid
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-N.crt -noout -text
If bootstrap flag set incorrectly:
# Edit config on new node
sudo vi /etc/centra-cloud/chainfire.toml
# Ensure:
# [cluster]
# bootstrap = false # MUST be false for joining nodes
sudo systemctl restart chainfire
Issue: No Leader / Leader Election Fails
Symptoms:
- Writes fail with "no leader elected" error
- chainfire-client status shows leader: none
- Logs show repeated "election timeout" messages
Diagnosis:
# 1. Check cluster membership
chainfire-client --endpoint http://NODE1_IP:2379 member-list
# 2. Check Raft state on all nodes
for node in node1 node2 node3; do
echo "=== $node ==="
ssh $node "journalctl -u chainfire -n 20 | grep -i 'raft\|leader\|election'"
done
# 3. Check network partition
for node in node1 node2 node3; do
for peer in node1 node2 node3; do
echo "$node -> $peer:"
ssh $node "ping -c 3 $peer"
done
done
# 4. Check quorum
# For 3-node cluster, need 2 nodes (majority)
# (grep -x matches the whole line, so "inactive" is not miscounted as "active")
RUNNING_NODES=$(for node in node1 node2 node3; do ssh $node "systemctl is-active chainfire" 2>/dev/null; done | grep -cx active)
echo "Running nodes: $RUNNING_NODES (need >= 2 for quorum)"
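The majority arithmetic behind that check can be captured in a throwaway helper (a sketch; quorum for an N-node Raft group is a strict majority, floor(N/2) + 1):

```shell
# Quorum for an N-node Raft group is a strict majority: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Succeeds (exit 0) when the live-node count still forms a majority.
has_quorum() {
  [ "$2" -ge "$(quorum_size "$1")" ]
}

quorum_size 3                         # -> 2
quorum_size 5                         # -> 3
has_quorum 3 2 && echo "quorum OK"    # -> quorum OK
has_quorum 3 1 || echo "quorum LOST"  # -> quorum LOST
```

Note that a 4-node cluster needs 3 live nodes, so even cluster sizes tolerate no more failures than the next-smaller odd size.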
Resolution:
If fewer than a majority of nodes are up (no quorum):
# Start majority of nodes
ssh node1 "sudo systemctl start chainfire"
ssh node2 "sudo systemctl start chainfire"
# Wait for leader election
sleep 10
# Verify leader elected
chainfire-client --endpoint http://node1:2379 status | grep leader
If network partition:
# Check and fix network connectivity
# Ensure bidirectional connectivity between all nodes
# Restart affected nodes
ssh ISOLATED_NODE "sudo systemctl restart chainfire"
If split-brain (multiple leaders):
# DANGER: This wipes follower data
# Stop all nodes
for node in node1 node2 node3; do
ssh $node "sudo systemctl stop chainfire"
done
# Keep only the node with highest raft_index
# Wipe others
ssh node2 "sudo rm -rf /var/lib/chainfire/*"
ssh node3 "sudo rm -rf /var/lib/chainfire/*"
# Restart leader (node1 in this example)
ssh node1 "sudo systemctl start chainfire"
sleep 10
# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380
# Start followers
ssh node2 "sudo systemctl start chainfire"
ssh node3 "sudo systemctl start chainfire"
Issue: High Write Latency
Symptoms:
- chainfire-client put commands take >100ms
- Application reports slow writes
- Metrics show p99 latency >500ms
Diagnosis:
# 1. Check disk I/O
iostat -x 1 10
# Watch for %util > 80% or await > 20ms
# 2. Check Raft replication lag
chainfire-client --endpoint http://LEADER_IP:2379 status
# Compare raft_index across nodes
# 3. Check network latency between nodes
for node in node1 node2 node3; do
echo "=== $node ==="
ping -c 10 $node
done
# 4. Check CPU usage
top -bn1 | grep chainfire
# 5. Check RocksDB stats
# Look for stalls in logs
journalctl -u chainfire -n 500 | grep -i stall
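The iostat check in step 1 can be turned into a filter that prints only devices over threshold. A sketch that assumes recent sysstat output, where %util is the last column (adjust $NF if your version differs):

```shell
# Print "device util%" for block devices whose %util exceeds a threshold.
# Reads `iostat -x` output on stdin; threshold defaults to 80.
high_util() {
  awk -v limit="${1:-80}" '$1 ~ /^(sd|nvme|vd)/ && $NF + 0 > limit { print $1, $NF "%" }'
}

# Live use: iostat -x 1 10 | high_util 80
# Example against captured output:
high_util 80 <<'EOF'
Device            r/s     w/s   %util
sda              10.0    20.0    95.3
nvme0n1           5.0     8.0    12.0
EOF
# prints: sda 95.3%
```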
Resolution:
If disk I/O bottleneck:
# 1. Check data directory is on SSD (not HDD)
df -h /var/lib/chainfire
mount | grep /var/lib/chainfire
# 2. Tune RocksDB settings (in config)
[storage]
# Increase write buffer size
write_buffer_size = 134217728 # 128MB (default: 64MB)
# Increase block cache
block_cache_size = 536870912 # 512MB (default: 256MB)
# 3. Enable direct I/O if on dedicated disk
# Add to config:
use_direct_io_for_flush_and_compaction = true
# 4. Restart service
sudo systemctl restart chainfire
If network latency:
# Verify nodes are in same datacenter
# For cross-datacenter, expect higher latency
# Consider adding learner nodes instead of voters
# Check MTU settings
ip link show | grep mtu
# Ensure MTU is consistent across nodes (typically 1500 or 9000 for jumbo frames)
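To compare MTUs mechanically, collect "node mtu" pairs (e.g. `for n in node1 node2 node3; do echo "$n $(ssh $n cat /sys/class/net/eth0/mtu)"; done` — the interface name eth0 is an assumption) and feed them to a small checker. A sketch:

```shell
# Reads "node mtu" pairs on stdin; reports whether all MTUs agree.
check_mtu() {
  awk '{ if (!($2 in seen)) kinds++; seen[$2] = 1; node[NR] = $1; mtu[NR] = $2 }
       END {
         if (kinds == 1) print "MTU consistent: " mtu[1]
         else { print "MTU MISMATCH"; for (i = 1; i <= NR; i++) print "  " node[i] " " mtu[i] }
       }'
}

# Example with captured values (node3 accidentally set for jumbo frames):
check_mtu <<'EOF'
node1 1500
node2 1500
node3 9000
EOF
```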
If CPU bottleneck:
# Scale vertically (add CPU cores)
# Or scale horizontally (add read replicas as learner nodes)
# Tune Raft tick interval (in config)
[raft]
tick_interval_ms = 200 # Increase from default 100ms
Issue: Data Inconsistency After Crash
Symptoms:
- After node crash/restart, reads return stale data
- raft_index does not advance
- Logs show "corrupted log entry" errors
Diagnosis:
# 1. Check RocksDB integrity
# Stop service first
sudo systemctl stop chainfire
# Run RocksDB repair
rocksdb_ldb --db=/var/lib/chainfire repair
# Check for corruption
rocksdb_ldb --db=/var/lib/chainfire checkconsistency
Resolution:
If minor corruption (repair successful):
# Restart service
sudo systemctl start chainfire
# Let Raft catch up from leader
# Monitor raft_index
watch -n 1 "chainfire-client --endpoint http://localhost:2379 status | grep raft_index"
If major corruption (repair failed):
# Restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.corrupted
sudo mkdir -p /var/lib/chainfire
# Extract latest backup
LATEST_BACKUP=$(ls -t /var/backups/chainfire/*.tar.gz | head -1)
sudo tar -xzf "$LATEST_BACKUP" -C /var/lib/chainfire --strip-components=1
# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire
# Restart
sudo systemctl start chainfire
If cannot restore (no backup):
# Remove node from cluster and re-add fresh
# From leader node:
chainfire-client --endpoint http://LEADER_IP:2379 member-remove --node-id FAILED_NODE_ID
# On failed node, wipe and rejoin
sudo systemctl stop chainfire
sudo rm -rf /var/lib/chainfire/*
sudo systemctl start chainfire
# Re-add from leader
chainfire-client --endpoint http://LEADER_IP:2379 member-add \
--node-id FAILED_NODE_ID \
--peer-url FAILED_NODE_IP:2380 \
--learner
# Promote after catchup
chainfire-client --endpoint http://LEADER_IP:2379 member-promote --node-id FAILED_NODE_ID
FlareDB Issues
Issue: Store Not Registering with PD
Symptoms:
- New FlareDB store starts but doesn't appear in cluster-status
- Store logs show "failed to register with PD" errors
- PD logs show no registration attempts
Diagnosis:
# 1. Check PD connectivity
ssh FLAREDB_NODE "nc -zv PD_IP 2379"
# 2. Verify PD address in config
ssh FLAREDB_NODE "grep pd_addr /etc/centra-cloud/flaredb.toml"
# 3. Check store logs
ssh FLAREDB_NODE "journalctl -u flaredb -n 100 | grep -i 'pd\|register'"
# 4. Check PD logs
ssh PD_NODE "journalctl -u placement-driver -n 100 | grep -i register"
# 5. Verify store_id is unique
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | .id'
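The ID list from step 5 can be checked for duplicates mechanically. A sketch against a captured list (the sample IDs are illustrative):

```shell
# Print any store ID that appears more than once (empty output = all unique).
dup_store_ids() { sort -n | uniq -d; }

# Live use: curl -s http://PD_IP:2379/pd/api/v1/stores | jq '.stores[].id' | dup_store_ids
dup_store_ids <<'EOF'
1
2
2
3
EOF
# prints: 2
```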
Resolution:
If network issue:
# Open firewall on PD node
ssh PD_NODE "sudo firewall-cmd --permanent --add-port=2379/tcp"
ssh PD_NODE "sudo firewall-cmd --reload"
# Restart store
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
If duplicate store_id:
# Assign new unique store_id
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Change: store_id = <NEW_UNIQUE_ID>
# Wipe old data (contains old store_id)
ssh FLAREDB_NODE "sudo rm -rf /var/lib/flaredb/*"
# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
If TLS mismatch:
# Ensure PD and store have matching TLS config
# Either both use TLS or both don't
# If PD uses TLS:
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Add/verify:
# [tls]
# cert_file = "/etc/centra-cloud/certs/flaredb-node-N.crt"
# key_file = "/etc/centra-cloud/certs/flaredb-node-N.key"
# ca_file = "/etc/centra-cloud/certs/ca.crt"
# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
Issue: Region Rebalancing Stuck
Symptoms:
- pd/api/v1/stats/region shows a high pending_peers count
- Regions not moving to new stores
- PD logs show "failed to schedule operator" errors
Diagnosis:
# 1. Check region stats
curl http://PD_IP:2379/pd/api/v1/stats/region | jq
# 2. Check store capacity
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, available, capacity}'
# 3. Check pending operators
curl http://PD_IP:2379/pd/api/v1/operators | jq
# 4. Check PD scheduler config
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq
Resolution:
If store is down:
# Identify down store
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.state!="Up")'
# Fix or remove down store
ssh DOWN_STORE_NODE "sudo systemctl restart flaredb"
# If cannot recover, remove store:
curl -X DELETE http://PD_IP:2379/pd/api/v1/store/DOWN_STORE_ID
If disk full:
# Identify full stores
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select((.available / .capacity) < 0.1)'
# Add more storage or scale out with new stores
# See scale-out.md for adding stores
If scheduler disabled:
# Check scheduler status
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq '.schedulers'
# Enable schedulers if disabled
curl -X POST http://PD_IP:2379/pd/api/v1/config/schedule \
-d '{"max-snapshot-count": 3, "max-pending-peer-count": 16}'
Issue: Read/Write Timeout
Symptoms:
- Client operations timeout after 30s
- Logs show "context deadline exceeded"
- No leader election issues visible
Diagnosis:
# 1. Check client timeout config
# Default timeout is 30s
# 2. Check store responsiveness
time flaredb-client --endpoint http://STORE_IP:2379 get test-key
# 3. Check CPU usage on stores
ssh STORE_NODE "top -bn1 | grep flaredb"
# 4. Check slow queries
ssh STORE_NODE "journalctl -u flaredb -n 500 | grep -i 'slow\|timeout'"
# 5. Check disk latency
ssh STORE_NODE "iostat -x 1 10"
Resolution:
If disk I/O bottleneck:
# Same as Chainfire high latency issue
# 1. Verify SSD usage
# 2. Tune RocksDB settings
# 3. Add more stores for read distribution
If CPU bottleneck:
# Check compaction storms
ssh STORE_NODE "journalctl -u flaredb | grep -i compaction | tail -50"
# Throttle compaction if needed
# Add to flaredb config:
[storage]
max_background_compactions = 2 # Reduce from default 4
max_background_flushes = 1 # Reduce from default 2
sudo systemctl restart flaredb
If network partition:
# Check connectivity between store and PD
ssh STORE_NODE "ping -c 10 PD_IP"
# Check for packet loss
# If >1% loss, investigate network infrastructure
TLS/mTLS Issues
Issue: TLS Handshake Failures
Symptoms:
- Logs show "tls: bad certificate" or "certificate verify failed"
- Connections fail immediately
- curl commands fail with SSL errors
Diagnosis:
# 1. Verify certificate files exist
ls -l /etc/centra-cloud/certs/
# 2. Check certificate validity
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -dates
# 3. Verify CA matches
openssl x509 -in /etc/centra-cloud/certs/ca.crt -noout -subject
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -issuer
# 4. Test TLS connection
openssl s_client -connect NODE_IP:2379 \
-CAfile /etc/centra-cloud/certs/ca.crt \
-cert /etc/centra-cloud/certs/chainfire-node-1.crt \
-key /etc/centra-cloud/certs/chainfire-node-1.key
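Expiry (step 2) can also be checked with a pass/fail exit code rather than by reading dates; openssl's -checkend flag takes a validity window in seconds. A sketch:

```shell
# Succeeds (exit 0) when the certificate is still valid for at least N days.
cert_valid_for_days() {
  crt=$1
  days=${2:-30}
  openssl x509 -in "$crt" -noout -checkend $(( days * 86400 ))
}

# Live use:
# cert_valid_for_days /etc/centra-cloud/certs/chainfire-node-1.crt 30 \
#   || echo "certificate expires within 30 days"
```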
Resolution:
If certificate expired:
# Regenerate certificates
cd /path/to/centra-cloud
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs
# Distribute to all nodes
for node in node1 node2 node3; do
scp /etc/centra-cloud/certs/* $node:/etc/centra-cloud/certs/
done
# Restart services
for node in node1 node2 node3; do
ssh $node "sudo systemctl restart chainfire"
done
If CA mismatch:
# Ensure all nodes use same CA
# Regenerate all certs from same CA
# On CA-generating node:
./scripts/generate-dev-certs.sh /tmp/new-certs
# Distribute to all nodes
for node in node1 node2 node3; do
scp /tmp/new-certs/* $node:/etc/centra-cloud/certs/
ssh $node "sudo chown -R chainfire:chainfire /etc/centra-cloud/certs"
ssh $node "sudo chmod 600 /etc/centra-cloud/certs/*.key"
done
# Restart all services
for node in node1 node2 node3; do
ssh $node "sudo systemctl restart chainfire"
done
If permissions issue:
# Fix certificate file permissions
sudo chown chainfire:chainfire /etc/centra-cloud/certs/*
sudo chmod 644 /etc/centra-cloud/certs/*.crt
sudo chmod 600 /etc/centra-cloud/certs/*.key
# Restart service
sudo systemctl restart chainfire
Performance Tuning
Chainfire Performance Optimization
For write-heavy workloads:
# /etc/centra-cloud/chainfire.toml
[storage]
# Increase write buffer
write_buffer_size = 134217728 # 128MB
# More write buffers
max_write_buffer_number = 4
# Larger block cache for hot data
block_cache_size = 1073741824 # 1GB
# Reduce compaction frequency
level0_file_num_compaction_trigger = 8 # Default: 4
For read-heavy workloads:
[storage]
# Larger block cache
block_cache_size = 2147483648 # 2GB
# Enable bloom filters
bloom_filter_bits_per_key = 10
# More table cache
max_open_files = 10000 # Default: 1000
For low-latency requirements:
[raft]
# Reduce tick interval
tick_interval_ms = 50 # Default: 100
[storage]
# Enable direct I/O
use_direct_io_for_flush_and_compaction = true
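The raw byte values in these fragments are easy to mistype; a throwaway converter from MiB to bytes helps sanity-check them (a sketch):

```shell
# MiB -> bytes, to sanity-check values like write_buffer_size above.
mib() { echo $(( $1 * 1024 * 1024 )); }

mib 128    # -> 134217728
mib 512    # -> 536870912
mib 2048   # -> 2147483648
```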
FlareDB Performance Optimization
For high ingestion rate:
# /etc/centra-cloud/flaredb.toml
[storage]
# Larger write buffers
write_buffer_size = 268435456 # 256MB
max_write_buffer_number = 6
# More background jobs
max_background_compactions = 4
max_background_flushes = 2
For large query workloads:
[storage]
# Larger block cache
block_cache_size = 4294967296 # 4GB
# Keep more files open
max_open_files = 20000
Monitoring & Alerts
Key Metrics to Monitor
Chainfire:
- raft_index should advance steadily
- raft_term should be stable (not increasing frequently)
- Write latency p50, p95, p99
- Disk I/O utilization
- Network bandwidth between nodes
FlareDB:
- Store state (Up/Down)
- Region count and distribution
- Pending peers count (should be near 0)
- Read/write QPS per store
- Disk space available
Prometheus Queries
# Chainfire write latency
histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m]))
# Raft log replication lag (leader and follower series carry different labels,
# so compare aggregates rather than subtracting the vectors directly)
max(chainfire_raft_index{role="leader"}) - min(chainfire_raft_index{role="follower"})
# FlareDB store health
flaredb_store_state == 1 # 1 = Up, 0 = Down
# Region rebalancing activity
rate(flaredb_pending_peers_total[5m])
Alerting Rules
# Prometheus alerting rules
groups:
- name: chainfire
rules:
- alert: ChainfireNoLeader
expr: chainfire_has_leader == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Chainfire cluster has no leader"
- alert: ChainfireHighWriteLatency
expr: histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Chainfire p99 write latency >500ms"
- alert: ChainfireNodeDown
expr: up{job="chainfire"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Chainfire node {{ $labels.instance }} is down"
- name: flaredb
rules:
- alert: FlareDBStoreDown
expr: flaredb_store_state == 0
for: 2m
labels:
severity: critical
annotations:
summary: "FlareDB store {{ $labels.store_id }} is down"
- alert: FlareDBHighPendingPeers
expr: flaredb_pending_peers_total > 100
for: 10m
labels:
severity: warning
annotations:
summary: "FlareDB has {{ $value }} pending peers (rebalancing stuck?)"
Log Analysis
Common Log Patterns
Chainfire healthy operation:
INFO chainfire_raft: Leader elected, term=3
INFO chainfire_storage: Committed entry, index=12345
INFO chainfire_api: Handled put request, latency=15ms
Chainfire warning signs:
WARN chainfire_raft: Election timeout, no heartbeat from leader
WARN chainfire_storage: RocksDB stall detected, duration=2000ms
ERROR chainfire_network: Failed to connect to peer, addr=node2:2380
FlareDB healthy operation:
INFO flaredb_pd_client: Registered with PD, store_id=1
INFO flaredb_raft: Applied snapshot, index=5000
INFO flaredb_service: Handled query, rows=1000, latency=50ms
FlareDB warning signs:
WARN flaredb_pd_client: Heartbeat to PD failed, retrying...
WARN flaredb_storage: Compaction is slow, duration=30s
ERROR flaredb_raft: Failed to replicate log, peer=store2
Log Aggregation Queries
Using journalctl:
# Find all errors in last hour
journalctl -u chainfire --since "1 hour ago" | grep ERROR
# Count error types
journalctl -u chainfire --since "1 day ago" | grep ERROR | awk '{print $NF}' | sort | uniq -c | sort -rn
# Track leader changes
journalctl -u chainfire | grep "Leader elected" | tail -20
Using grep for pattern matching:
# Find slow operations
journalctl -u chainfire -n 10000 | grep -E 'latency=[0-9]{3,}ms'
# Find connection errors
journalctl -u chainfire -n 5000 | grep -i 'connection refused\|timeout\|unreachable'
# Find replication lag
journalctl -u chainfire | grep -i 'lag\|behind\|catch.*up'
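The latency pattern can also be reduced to a single worst-case number. A sketch that extracts latency=NNNms values and prints the maximum:

```shell
# Print the largest latency (in ms) found in log lines on stdin.
max_latency() {
  grep -o 'latency=[0-9]*ms' | tr -dc '0-9\n' | sort -n | tail -1
}

# Live use: journalctl -u chainfire -n 10000 | max_latency
max_latency <<'EOF'
INFO chainfire_api: Handled put request, latency=15ms
INFO chainfire_api: Handled put request, latency=230ms
INFO chainfire_api: Handled put request, latency=8ms
EOF
# prints: 230
```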
References
- Configuration: specifications/configuration.md
- Backup/Restore: docs/ops/backup-restore.md
- Scale-Out: docs/ops/scale-out.md
- Upgrade: docs/ops/upgrade.md
- RocksDB Tuning: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide