# Troubleshooting Runbook

## Overview

This runbook provides diagnostic procedures and solutions for common operational issues with Chainfire (a distributed key-value store) and FlareDB (a time-series database).

## Quick Diagnostics

### Health Check Commands

```bash
# Chainfire cluster health
chainfire-client --endpoint http://NODE_IP:2379 status
chainfire-client --endpoint http://NODE_IP:2379 member-list

# FlareDB cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, capacity}'

# Service status
systemctl status chainfire
systemctl status flaredb

# Port connectivity
nc -zv NODE_IP 2379   # API port
nc -zv NODE_IP 2380   # Raft port
nc -zv NODE_IP 2381   # Gossip port

# Resource usage
top -bn1 | head -20
df -h
iostat -x 1 5

# Recent logs
journalctl -u chainfire -n 100 --no-pager
journalctl -u flaredb -n 100 --no-pager
```
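
The port checks above can be rolled into a small helper that reports which of the expected ports are not listening. The sketch below expects the local-address column of `ss -tln` on stdin; the sample input stands in for real output.

```shell
# Sketch: compare listening ports against the three expected Chainfire ports.
# Feed it the local-address column of `ss -tln`.
missing_ports() {
  expected="2379 2380 2381"
  # Take everything after the last colon (works for IPv4 and IPv6 addresses)
  listening="$(awk -F: '{print $NF}' | sort -u)"
  for p in $expected; do
    echo "$listening" | grep -qx "$p" || echo "missing: $p"
  done
}

# Example with sample data (only the API and Raft ports listening):
printf '0.0.0.0:2379\n0.0.0.0:2380\n' | missing_ports
# missing: 2381
```

On a live node, something like `ss -tln | awk 'NR>1 {print $4}' | missing_ports` feeds it real data.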

## Chainfire Issues

### Issue: Node Cannot Join Cluster

**Symptoms:**
- `member-add` command hangs or times out
- New node logs show "connection refused" or "timeout" errors
- `member-list` does not show the new node

**Diagnosis:**
```bash
# 1. Check network connectivity
nc -zv NEW_NODE_IP 2380

# 2. Verify the Raft server is listening on the new node
ssh NEW_NODE_IP "ss -tlnp | grep 2380"

# 3. Check firewall rules
ssh NEW_NODE_IP "sudo iptables -L -n | grep 2380"

# 4. Verify TLS configuration matches
ssh NEW_NODE_IP "grep -A5 '\[network.tls\]' /etc/centra-cloud/chainfire.toml"

# 5. Check leader logs
ssh LEADER_NODE "journalctl -u chainfire -n 50 | grep -i 'add.*node'"
```

**Resolution:**

**If network issue:**
```bash
# Open firewall ports on the new node
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --permanent --add-port=2381/tcp
sudo firewall-cmd --reload
```

**If TLS mismatch:**
```bash
# Ensure the new node has the correct certificates
sudo ls -l /etc/centra-cloud/certs/
# Should have: ca.crt, chainfire-node-N.crt, chainfire-node-N.key

# Verify the certificate is valid
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-N.crt -noout -text
```

**If bootstrap flag set incorrectly:**
```bash
# Edit config on the new node
sudo vi /etc/centra-cloud/chainfire.toml

# Ensure:
# [cluster]
# bootstrap = false   # MUST be false for joining nodes

sudo systemctl restart chainfire
```

### Issue: No Leader / Leader Election Fails

**Symptoms:**
- Writes fail with "no leader elected" error
- `chainfire-client status` shows `leader: none`
- Logs show repeated "election timeout" messages

**Diagnosis:**
```bash
# 1. Check cluster membership
chainfire-client --endpoint http://NODE1_IP:2379 member-list

# 2. Check Raft state on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u chainfire -n 20 | grep -i 'raft\|leader\|election'"
done

# 3. Check for a network partition
for node in node1 node2 node3; do
  for peer in node1 node2 node3; do
    echo "$node -> $peer:"
    ssh $node "ping -c 3 $peer"
  done
done

# 4. Check quorum
# A 3-node cluster needs 2 nodes (a majority)
RUNNING_NODES=$(for node in node1 node2 node3; do ssh $node "systemctl is-active chainfire" 2>/dev/null; done | grep -c active)
echo "Running nodes: $RUNNING_NODES (need >= 2 for quorum)"
```

**Resolution:**

**If fewer than half the nodes are up (no quorum):**
```bash
# Start a majority of nodes
ssh node1 "sudo systemctl start chainfire"
ssh node2 "sudo systemctl start chainfire"

# Wait for leader election
sleep 10

# Verify a leader was elected
chainfire-client --endpoint http://node1:2379 status | grep leader
```

**If network partition:**
```bash
# Check and fix network connectivity
# Ensure bidirectional connectivity between all nodes

# Restart affected nodes
ssh ISOLATED_NODE "sudo systemctl restart chainfire"
```

**If split-brain (multiple leaders):**
```bash
# DANGER: this wipes follower data
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Keep only the node with the highest raft_index
# Wipe the others
ssh node2 "sudo rm -rf /var/lib/chainfire/*"
ssh node3 "sudo rm -rf /var/lib/chainfire/*"

# Restart the surviving node (node1 in this example)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
ssh node2 "sudo systemctl start chainfire"
ssh node3 "sudo systemctl start chainfire"
```
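
Picking the survivor can be scripted rather than eyeballed. The sketch below assumes one `<node> raft_index=<n>` line per node (e.g. collected by grepping each node's `status` output, as shown elsewhere in this runbook; the exact field name may differ):

```shell
# Sketch: print the node with the highest raft_index -- the one to keep
# when recovering from split-brain. Input: "<node> raft_index=<n>" lines.
pick_survivor() {
  # Sort numerically on the value after '=', keep the last (largest) line
  sort -t= -k2 -n | tail -1 | awk '{print $1}'
}

printf 'node1 raft_index=12345\nnode2 raft_index=12340\nnode3 raft_index=12344\n' | pick_survivor
# node1
```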

### Issue: High Write Latency

**Symptoms:**
- `chainfire-client put` commands take >100ms
- Application reports slow writes
- Metrics show p99 latency >500ms

**Diagnosis:**
```bash
# 1. Check disk I/O
iostat -x 1 10
# Watch for %util > 80% or await > 20ms

# 2. Check Raft replication lag
chainfire-client --endpoint http://LEADER_IP:2379 status
# Compare raft_index across nodes

# 3. Check network latency between nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ping -c 10 $node
done

# 4. Check CPU usage
top -bn1 | grep chainfire

# 5. Check RocksDB stats
# Look for stalls in the logs
journalctl -u chainfire -n 500 | grep -i stall
```

**Resolution:**

**If disk I/O bottleneck:**
```bash
# 1. Check the data directory is on an SSD (not an HDD)
df -h /var/lib/chainfire
mount | grep /var/lib/chainfire
```

```toml
# 2. Tune RocksDB settings in /etc/centra-cloud/chainfire.toml
[storage]
# Increase write buffer size
write_buffer_size = 134217728   # 128MB (default: 64MB)
# Increase block cache
block_cache_size = 536870912    # 512MB (default: 256MB)
# 3. Enable direct I/O if on a dedicated disk
use_direct_io_for_flush_and_compaction = true
```

```bash
# 4. Restart the service
sudo systemctl restart chainfire
```

**If network latency:**
```bash
# Verify nodes are in the same datacenter
# For cross-datacenter deployments, expect higher latency
# Consider adding learner nodes instead of voters

# Check MTU settings
ip link show | grep mtu
# Ensure MTU is consistent across nodes (typically 1500, or 9000 for jumbo frames)
```

**If CPU bottleneck:**
```bash
# Scale vertically (add CPU cores)
# or scale horizontally (add read replicas as learner nodes)
```

```toml
# Tune the Raft tick interval in the config
[raft]
tick_interval_ms = 200   # Increase from the default 100ms
```

### Issue: Data Inconsistency After Crash

**Symptoms:**
- After node crash/restart, reads return stale data
- `raft_index` does not advance
- Logs show "corrupted log entry" errors

**Diagnosis:**
```bash
# 1. Check RocksDB integrity
# Stop the service first
sudo systemctl stop chainfire

# Run RocksDB repair
rocksdb_ldb --db=/var/lib/chainfire repair

# Check for corruption
rocksdb_ldb --db=/var/lib/chainfire checkconsistency
```

**Resolution:**

**If minor corruption (repair successful):**
```bash
# Restart the service
sudo systemctl start chainfire

# Let Raft catch up from the leader
# Monitor raft_index
watch -n 1 "chainfire-client --endpoint http://localhost:2379 status | grep raft_index"
```

**If major corruption (repair failed):**
```bash
# Restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.corrupted
sudo mkdir -p /var/lib/chainfire

# Extract the latest backup
LATEST_BACKUP=$(ls -t /var/backups/chainfire/*.tar.gz | head -1)
sudo tar -xzf "$LATEST_BACKUP" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Restart
sudo systemctl start chainfire
```

**If restore is impossible (no backup):**
```bash
# Remove the node from the cluster and re-add it fresh
# From the leader node:
chainfire-client --endpoint http://LEADER_IP:2379 member-remove --node-id FAILED_NODE_ID

# On the failed node, wipe and rejoin
sudo systemctl stop chainfire
sudo rm -rf /var/lib/chainfire/*
sudo systemctl start chainfire

# Re-add from the leader
chainfire-client --endpoint http://LEADER_IP:2379 member-add \
  --node-id FAILED_NODE_ID \
  --peer-url FAILED_NODE_IP:2380 \
  --learner

# Promote after catch-up
chainfire-client --endpoint http://LEADER_IP:2379 member-promote --node-id FAILED_NODE_ID
```

## FlareDB Issues

### Issue: Store Not Registering with PD

**Symptoms:**
- New FlareDB store starts but doesn't appear in `cluster-status`
- Store logs show "failed to register with PD" errors
- PD logs show no registration attempts

**Diagnosis:**
```bash
# 1. Check PD connectivity
ssh FLAREDB_NODE "nc -zv PD_IP 2379"

# 2. Verify the PD address in the config
ssh FLAREDB_NODE "grep pd_addr /etc/centra-cloud/flaredb.toml"

# 3. Check store logs
ssh FLAREDB_NODE "journalctl -u flaredb -n 100 | grep -i 'pd\|register'"

# 4. Check PD logs
ssh PD_NODE "journalctl -u placement-driver -n 100 | grep -i register"

# 5. Verify store_id is unique
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | .id'
```

**Resolution:**

**If network issue:**
```bash
# Open the firewall on the PD node
ssh PD_NODE "sudo firewall-cmd --permanent --add-port=2379/tcp"
ssh PD_NODE "sudo firewall-cmd --reload"

# Restart the store
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

**If duplicate store_id:**
```bash
# Assign a new unique store_id
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Change: store_id = <NEW_UNIQUE_ID>

# Wipe the old data (it contains the old store_id)
ssh FLAREDB_NODE "sudo rm -rf /var/lib/flaredb/*"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

**If TLS mismatch:**
```bash
# Ensure PD and the store have matching TLS configs
# Either both use TLS or neither does

# If PD uses TLS:
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Add/verify:
# [tls]
# cert_file = "/etc/centra-cloud/certs/flaredb-node-N.crt"
# key_file = "/etc/centra-cloud/certs/flaredb-node-N.key"
# ca_file = "/etc/centra-cloud/certs/ca.crt"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

### Issue: Region Rebalancing Stuck

**Symptoms:**
- `pd/api/v1/stats/region` shows a high `pending_peers` count
- Regions are not moving to new stores
- PD logs show "failed to schedule operator" errors

**Diagnosis:**
```bash
# 1. Check region stats
curl http://PD_IP:2379/pd/api/v1/stats/region | jq

# 2. Check store capacity
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, available, capacity}'

# 3. Check pending operators
curl http://PD_IP:2379/pd/api/v1/operators | jq

# 4. Check the PD scheduler config
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq
```

**Resolution:**

**If a store is down:**
```bash
# Identify the down store
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.state!="Up")'

# Fix or remove the down store
ssh DOWN_STORE_NODE "sudo systemctl restart flaredb"

# If it cannot be recovered, remove the store:
curl -X DELETE http://PD_IP:2379/pd/api/v1/store/DOWN_STORE_ID
```

**If disk full:**
```bash
# Identify full stores (less than 10% available)
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select((.available / .capacity) < 0.1)'

# Add more storage or scale out with new stores
# See scale-out.md for adding stores
```

**If the scheduler is disabled:**
```bash
# Check scheduler status
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq '.schedulers'

# Enable schedulers if disabled
curl -X POST http://PD_IP:2379/pd/api/v1/config/schedule \
  -d '{"max-snapshot-count": 3, "max-pending-peer-count": 16}'
```
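
The same 10% free-space check can be done without a jq filter by post-processing `id available capacity` triples, which is handy when piping from other sources. A sketch (the jq extraction in the comment assumes the store fields used elsewhere in this runbook):

```shell
# Sketch: flag stores under 10% free from "id available capacity" triples,
# e.g. extracted with: jq -r '.stores[] | "\(.id) \(.available) \(.capacity)"'
low_space() {
  awk '$2 / $3 < 0.1 { print "store " $1 " low on space" }'
}

# Example with sample data (store 1 has 50 of 1000 units free):
printf '1 50 1000\n2 400 1000\n' | low_space
# store 1 low on space
```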

### Issue: Read/Write Timeout

**Symptoms:**
- Client operations time out after 30s
- Logs show "context deadline exceeded"
- No leader election issues are visible

**Diagnosis:**
```bash
# 1. Check the client timeout config
# The default timeout is 30s

# 2. Check store responsiveness
time flaredb-client --endpoint http://STORE_IP:2379 get test-key

# 3. Check CPU usage on the stores
ssh STORE_NODE "top -bn1 | grep flaredb"

# 4. Check for slow queries
ssh STORE_NODE "journalctl -u flaredb -n 500 | grep -i 'slow\|timeout'"

# 5. Check disk latency
ssh STORE_NODE "iostat -x 1 10"
```

**Resolution:**

**If disk I/O bottleneck:**
```bash
# Same approach as the Chainfire high-latency issue:
# 1. Verify SSD usage
# 2. Tune RocksDB settings
# 3. Add more stores for read distribution
```

**If CPU bottleneck:**
```bash
# Check for compaction storms
ssh STORE_NODE "journalctl -u flaredb | grep -i compaction | tail -50"
```

```toml
# Throttle compaction if needed, in the flaredb config:
[storage]
max_background_compactions = 2   # Reduce from the default 4
max_background_flushes = 1       # Reduce from the default 2
```

```bash
sudo systemctl restart flaredb
```

**If network partition:**
```bash
# Check connectivity between the store and PD
ssh STORE_NODE "ping -c 10 PD_IP"

# Check for packet loss
# If loss exceeds 1%, investigate the network infrastructure
```

## TLS/mTLS Issues

### Issue: TLS Handshake Failures

**Symptoms:**
- Logs show "tls: bad certificate" or "certificate verify failed"
- Connections fail immediately
- curl commands fail with SSL errors

**Diagnosis:**
```bash
# 1. Verify the certificate files exist
ls -l /etc/centra-cloud/certs/

# 2. Check certificate validity
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -dates

# 3. Verify the CA matches
openssl x509 -in /etc/centra-cloud/certs/ca.crt -noout -subject
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -issuer

# 4. Test the TLS connection
openssl s_client -connect NODE_IP:2379 \
  -CAfile /etc/centra-cloud/certs/ca.crt \
  -cert /etc/centra-cloud/certs/chainfire-node-1.crt \
  -key /etc/centra-cloud/certs/chainfire-node-1.key
```
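
Expiry problems are easier to catch before handshakes start failing. The helper below is a sketch that flags certificates expiring within 30 days (an assumed threshold) using `openssl x509 -checkend`; the certificate directory follows this runbook's layout.

```shell
# Sketch: warn about certificates expiring within 30 days.
check_cert_expiry() {
  CERT_DIR="${1:-/etc/centra-cloud/certs}"
  for crt in "$CERT_DIR"/*.crt; do
    [ -e "$crt" ] || continue   # skip if the glob matched nothing
    # -checkend exits non-zero if the cert expires within N seconds
    if openssl x509 -in "$crt" -noout -checkend $((30 * 86400)) >/dev/null; then
      echo "OK:       $crt"
    else
      echo "EXPIRING: $crt"
    fi
  done
}

check_cert_expiry /etc/centra-cloud/certs
```

Running this from cron before each rotation window gives advance warning rather than a page.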

**Resolution:**

**If certificate expired:**
```bash
# Regenerate certificates
cd /path/to/centra-cloud
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /etc/centra-cloud/certs/* $node:/etc/centra-cloud/certs/
done

# Restart services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done
```

**If CA mismatch:**
```bash
# Ensure all nodes use the same CA
# Regenerate all certs from a single CA

# On the CA-generating node:
./scripts/generate-dev-certs.sh /tmp/new-certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /tmp/new-certs/* $node:/etc/centra-cloud/certs/
  ssh $node "sudo chown -R chainfire:chainfire /etc/centra-cloud/certs"
  ssh $node "sudo chmod 600 /etc/centra-cloud/certs/*.key"
done

# Restart all services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done
```

**If permissions issue:**
```bash
# Fix certificate file permissions
sudo chown chainfire:chainfire /etc/centra-cloud/certs/*
sudo chmod 644 /etc/centra-cloud/certs/*.crt
sudo chmod 600 /etc/centra-cloud/certs/*.key

# Restart the service
sudo systemctl restart chainfire
```

## Performance Tuning

### Chainfire Performance Optimization

**For write-heavy workloads:**
```toml
# /etc/centra-cloud/chainfire.toml

[storage]
# Increase the write buffer
write_buffer_size = 134217728   # 128MB

# More write buffers
max_write_buffer_number = 4

# Larger block cache for hot data
block_cache_size = 1073741824   # 1GB

# Reduce compaction frequency
level0_file_num_compaction_trigger = 8   # Default: 4
```

**For read-heavy workloads:**
```toml
[storage]
# Larger block cache
block_cache_size = 2147483648   # 2GB

# Enable bloom filters
bloom_filter_bits_per_key = 10

# Larger table cache
max_open_files = 10000   # Default: 1000
```

**For low-latency requirements:**
```toml
[raft]
# Reduce the tick interval
tick_interval_ms = 50   # Default: 100

[storage]
# Enable direct I/O
use_direct_io_for_flush_and_compaction = true
```

### FlareDB Performance Optimization

**For high ingestion rates:**
```toml
# /etc/centra-cloud/flaredb.toml

[storage]
# Larger write buffers
write_buffer_size = 268435456   # 256MB
max_write_buffer_number = 6

# More background jobs
max_background_compactions = 4
max_background_flushes = 2
```

**For large query workloads:**
```toml
[storage]
# Larger block cache
block_cache_size = 4294967296   # 4GB

# Keep more files open
max_open_files = 20000
```

## Monitoring & Alerts

### Key Metrics to Monitor

**Chainfire:**
- `raft_index` - should advance steadily
- `raft_term` - should be stable (not increasing frequently)
- Write latency p50, p95, p99
- Disk I/O utilization
- Network bandwidth between nodes

**FlareDB:**
- Store state (Up/Down)
- Region count and distribution
- Pending peers count (should stay near 0)
- Read/write QPS per store
- Available disk space

### Prometheus Queries

```promql
# Chainfire p99 write latency
histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m]))

# Raft log replication lag
chainfire_raft_index{role="leader"} - chainfire_raft_index{role="follower"}

# FlareDB store health
flaredb_store_state == 1   # 1 = Up, 0 = Down

# Region rebalancing activity
rate(flaredb_pending_peers_total[5m])
```
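
When Prometheus is unavailable, the replication-lag query can be approximated directly from `status` output. A sketch, assuming one `<node> raft_index=<n>` line per node with the leader's line first (the line format is illustrative):

```shell
# Sketch: per-follower raft_index lag from "<node> raft_index=<n>" lines.
# First line is the leader; remaining lines are followers.
raft_lag() {
  # Split on space or '=': $1 is the node name, $3 is the raft_index value
  awk -F'[ =]' 'NR==1 {leader=$3; next} {print $1, leader-$3}'
}

printf 'leader raft_index=12345\nnode2 raft_index=12340\nnode3 raft_index=12344\n' | raft_lag
# node2 5
# node3 1
```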

### Alerting Rules

```yaml
# Prometheus alerting rules

groups:
  - name: chainfire
    rules:
      - alert: ChainfireNoLeader
        expr: chainfire_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire cluster has no leader"

      - alert: ChainfireHighWriteLatency
        expr: histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chainfire p99 write latency >500ms"

      - alert: ChainfireNodeDown
        expr: up{job="chainfire"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire node {{ $labels.instance }} is down"

  - name: flaredb
    rules:
      - alert: FlareDBStoreDown
        expr: flaredb_store_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "FlareDB store {{ $labels.store_id }} is down"

      - alert: FlareDBHighPendingPeers
        expr: flaredb_pending_peers_total > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FlareDB has {{ $value }} pending peers (rebalancing stuck?)"
```

## Log Analysis

### Common Log Patterns

**Chainfire healthy operation:**
```
INFO chainfire_raft: Leader elected, term=3
INFO chainfire_storage: Committed entry, index=12345
INFO chainfire_api: Handled put request, latency=15ms
```

**Chainfire warning signs:**
```
WARN chainfire_raft: Election timeout, no heartbeat from leader
WARN chainfire_storage: RocksDB stall detected, duration=2000ms
ERROR chainfire_network: Failed to connect to peer, addr=node2:2380
```

**FlareDB healthy operation:**
```
INFO flaredb_pd_client: Registered with PD, store_id=1
INFO flaredb_raft: Applied snapshot, index=5000
INFO flaredb_service: Handled query, rows=1000, latency=50ms
```

**FlareDB warning signs:**
```
WARN flaredb_pd_client: Heartbeat to PD failed, retrying...
WARN flaredb_storage: Compaction is slow, duration=30s
ERROR flaredb_raft: Failed to replicate log, peer=store2
```

### Log Aggregation Queries

**Using journalctl:**
```bash
# Find all errors in the last hour
journalctl -u chainfire --since "1 hour ago" | grep ERROR

# Count error types
journalctl -u chainfire --since "1 day ago" | grep ERROR | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track leader changes
journalctl -u chainfire | grep "Leader elected" | tail -20
```

**Using grep for pattern matching:**
```bash
# Find slow operations (latency of 100ms or more)
journalctl -u chainfire -n 10000 | grep -E 'latency=[0-9]{3,}ms'

# Find connection errors
journalctl -u chainfire -n 5000 | grep -i 'connection refused\|timeout\|unreachable'

# Find replication lag
journalctl -u chainfire | grep -i 'lag\|behind\|catch.*up'
```

## References

- Configuration: `specifications/configuration.md`
- Backup/Restore: `docs/ops/backup-restore.md`
- Scale-Out: `docs/ops/scale-out.md`
- Upgrade: `docs/ops/upgrade.md`
- RocksDB Tuning: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide