# Troubleshooting Runbook

## Overview

This runbook provides diagnostic procedures and solutions for common operational issues with Chainfire (distributed KV) and FlareDB (time-series DB).

## Quick Diagnostics

### Health Check Commands

```bash
# Chainfire cluster health
chainfire-client --endpoint http://NODE_IP:2379 status
chainfire-client --endpoint http://NODE_IP:2379 member-list

# FlareDB cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, capacity}'

# Service status
systemctl status chainfire
systemctl status flaredb

# Port connectivity
nc -zv NODE_IP 2379   # API port
nc -zv NODE_IP 2380   # Raft port
nc -zv NODE_IP 2381   # Gossip port

# Resource usage
top -bn1 | head -20
df -h
iostat -x 1 5

# Recent logs
journalctl -u chainfire -n 100 --no-pager
journalctl -u flaredb -n 100 --no-pager
```

## Chainfire Issues

### Issue: Node Cannot Join Cluster

**Symptoms:**
- `member-add` command hangs or times out
- New node logs show "connection refused" or "timeout" errors
- `member-list` does not show the new node

**Diagnosis:**

```bash
# 1. Check network connectivity
nc -zv NEW_NODE_IP 2380

# 2. Verify Raft server is listening on new node
ssh NEW_NODE_IP "ss -tlnp | grep 2380"

# 3. Check firewall rules
ssh NEW_NODE_IP "sudo iptables -L -n | grep 2380"

# 4. Verify TLS configuration matches
ssh NEW_NODE_IP "grep -A5 '\[network.tls\]' /etc/centra-cloud/chainfire.toml"
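
# Related sketch (not one of the numbered steps above): the Raft port probed in
# step 1 is also what member-add expects in --peer-url later in this runbook,
# so a quick sanity echo with a placeholder address can catch typos before the join:
NEW_NODE_IP=10.0.0.4   # placeholder, not a real cluster address
echo "expected peer-url: ${NEW_NODE_IP}:2380"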
# 5. Check leader logs
ssh LEADER_NODE "journalctl -u chainfire -n 50 | grep -i 'add.*node'"
```

**Resolution:**

**If network issue:**

```bash
# Open firewall ports on new node
sudo firewall-cmd --permanent --add-port=2379/tcp
sudo firewall-cmd --permanent --add-port=2380/tcp
sudo firewall-cmd --permanent --add-port=2381/tcp
sudo firewall-cmd --reload
```

**If TLS mismatch:**

```bash
# Ensure new node has correct certificates
sudo ls -l /etc/centra-cloud/certs/
# Should have: ca.crt, chainfire-node-N.crt, chainfire-node-N.key

# Verify certificate is valid
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-N.crt -noout -text
```

**If bootstrap flag set incorrectly:**

```bash
# Edit config on new node
sudo vi /etc/centra-cloud/chainfire.toml
# Ensure:
# [cluster]
# bootstrap = false   # MUST be false for joining nodes

sudo systemctl restart chainfire
```

### Issue: No Leader / Leader Election Fails

**Symptoms:**
- Writes fail with "no leader elected" error
- `chainfire-client status` shows `leader: none`
- Logs show repeated "election timeout" messages

**Diagnosis:**

```bash
# 1. Check cluster membership
chainfire-client --endpoint http://NODE1_IP:2379 member-list

# 2. Check Raft state on all nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u chainfire -n 20 | grep -i 'raft\|leader\|election'"
done

# 3. Check network partition
for node in node1 node2 node3; do
  for peer in node1 node2 node3; do
    echo "$node -> $peer:"
    ssh $node "ping -c 3 $peer"
  done
done
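
# Side note (sketch, arithmetic only): quorum for an n-node Raft cluster is
# floor(n/2) + 1, which is what the quorum check below relies on.
n=3
echo "quorum size for $n nodes: $(( n / 2 + 1 ))"   # prints: quorum size for 3 nodes: 2
n=5
echo "quorum size for $n nodes: $(( n / 2 + 1 ))"   # prints: quorum size for 5 nodes: 3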
# 4. Check quorum
# For 3-node cluster, need 2 nodes (majority)
RUNNING_NODES=$(for node in node1 node2 node3; do
  ssh $node "systemctl is-active chainfire" 2>/dev/null
done | grep -c active)
echo "Running nodes: $RUNNING_NODES (need >= 2 for quorum)"
```

**Resolution:**

**If fewer than a majority of nodes are up (no quorum):**

```bash
# Start majority of nodes
ssh node1 "sudo systemctl start chainfire"
ssh node2 "sudo systemctl start chainfire"

# Wait for leader election
sleep 10

# Verify leader elected
chainfire-client --endpoint http://node1:2379 status | grep leader
```

**If network partition:**

```bash
# Check and fix network connectivity
# Ensure bidirectional connectivity between all nodes

# Restart affected nodes
ssh ISOLATED_NODE "sudo systemctl restart chainfire"
```

**If split-brain (multiple leaders):**

```bash
# DANGER: This wipes follower data
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Keep only the node with highest raft_index
# Wipe others
ssh node2 "sudo rm -rf /var/lib/chainfire/*"
ssh node3 "sudo rm -rf /var/lib/chainfire/*"

# Restart leader (node1 in this example)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
ssh node2 "sudo systemctl start chainfire"
ssh node3 "sudo systemctl start chainfire"
```

### Issue: High Write Latency

**Symptoms:**
- `chainfire-client put` commands take >100ms
- Application reports slow writes
- Metrics show p99 latency >500ms

**Diagnosis:**

```bash
# 1. Check disk I/O
iostat -x 1 10
# Watch for %util > 80% or await > 20ms

# 2. Check Raft replication lag
chainfire-client --endpoint http://LEADER_IP:2379 status
# Compare raft_index across nodes
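
# Sketch of the lag arithmetic (index values are made-up examples, not real
# chainfire-client output): lag = leader raft_index - follower raft_index
leader_index=12450
follower_index=12300
echo "replication lag: $(( leader_index - follower_index )) entries"   # prints: replication lag: 150 entries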
# 3. Check network latency between nodes
for node in node1 node2 node3; do
  echo "=== $node ==="
  ping -c 10 $node
done

# 4. Check CPU usage
top -bn1 | grep chainfire

# 5. Check RocksDB stats
# Look for stalls in logs
journalctl -u chainfire -n 500 | grep -i stall
```

**Resolution:**

**If disk I/O bottleneck:**

```bash
# 1. Check data directory is on SSD (not HDD)
df -h /var/lib/chainfire
mount | grep /var/lib/chainfire

# 2. Tune RocksDB settings (in config):
#    [storage]
#    # Increase write buffer size
#    write_buffer_size = 134217728   # 128MB (default: 64MB)
#    # Increase block cache
#    block_cache_size = 536870912    # 512MB (default: 256MB)

# 3. Enable direct I/O if on dedicated disk
#    Add to config:
#    use_direct_io_for_flush_and_compaction = true

# 4. Restart service
sudo systemctl restart chainfire
```

**If network latency:**

```bash
# Verify nodes are in same datacenter
# For cross-datacenter, expect higher latency
# Consider adding learner nodes instead of voters

# Check MTU settings
ip link show | grep mtu
# Ensure MTU is consistent across nodes (typically 1500, or 9000 for jumbo frames)
```

**If CPU bottleneck:**

```bash
# Scale vertically (add CPU cores)
# Or scale horizontally (add read replicas as learner nodes)

# Tune Raft tick interval (in config):
#    [raft]
#    tick_interval_ms = 200   # Increase from default 100ms
```

### Issue: Data Inconsistency After Crash

**Symptoms:**
- After node crash/restart, reads return stale data
- `raft_index` does not advance
- Logs show "corrupted log entry" errors

**Diagnosis:**

```bash
# 1. Check RocksDB integrity
# Stop service first
sudo systemctl stop chainfire

# Run RocksDB repair
rocksdb_ldb --db=/var/lib/chainfire repair

# Check for corruption
rocksdb_ldb --db=/var/lib/chainfire checkconsistency
```

**Resolution:**

**If minor corruption (repair successful):**

```bash
# Restart service
sudo systemctl start chainfire

# Let Raft catch up from leader
# Monitor raft_index
watch -n 1 "chainfire-client --endpoint http://localhost:2379 status | grep raft_index"
```

**If major corruption (repair failed):**

```bash
# Restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.corrupted
sudo mkdir -p /var/lib/chainfire

# Extract latest backup
LATEST_BACKUP=$(ls -t /var/backups/chainfire/*.tar.gz | head -1)
sudo tar -xzf "$LATEST_BACKUP" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Restart
sudo systemctl start chainfire
```

**If cannot restore (no backup):**

```bash
# Remove node from cluster and re-add fresh
# From leader node:
chainfire-client --endpoint http://LEADER_IP:2379 member-remove --node-id FAILED_NODE_ID

# On failed node, wipe and rejoin
sudo systemctl stop chainfire
sudo rm -rf /var/lib/chainfire/*
sudo systemctl start chainfire

# Re-add from leader
chainfire-client --endpoint http://LEADER_IP:2379 member-add \
  --node-id FAILED_NODE_ID \
  --peer-url FAILED_NODE_IP:2380 \
  --learner

# Promote after catchup
chainfire-client --endpoint http://LEADER_IP:2379 member-promote --node-id FAILED_NODE_ID
```

## FlareDB Issues

### Issue: Store Not Registering with PD

**Symptoms:**
- New FlareDB store starts but doesn't appear in `cluster-status`
- Store logs show "failed to register with PD" errors
- PD logs show no registration attempts

**Diagnosis:**

```bash
# 1. Check PD connectivity
ssh FLAREDB_NODE "nc -zv PD_IP 2379"

# 2. Verify PD address in config
ssh FLAREDB_NODE "grep pd_addr /etc/centra-cloud/flaredb.toml"
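
# The grep above should print a single line of roughly this shape (placeholder
# IP; the exact format is an assumption, check specifications/configuration.md):
echo 'pd_addr = "10.0.0.10:2379"'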
# 3. Check store logs
ssh FLAREDB_NODE "journalctl -u flaredb -n 100 | grep -i 'pd\|register'"

# 4. Check PD logs
ssh PD_NODE "journalctl -u placement-driver -n 100 | grep -i register"

# 5. Verify store_id is unique
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | .id'
```

**Resolution:**

**If network issue:**

```bash
# Open firewall on PD node
ssh PD_NODE "sudo firewall-cmd --permanent --add-port=2379/tcp"
ssh PD_NODE "sudo firewall-cmd --reload"

# Restart store
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

**If duplicate store_id:**

```bash
# Assign new unique store_id
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Change: store_id =

# Wipe old data (contains old store_id)
ssh FLAREDB_NODE "sudo rm -rf /var/lib/flaredb/*"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

**If TLS mismatch:**

```bash
# Ensure PD and store have matching TLS config
# Either both use TLS or both don't

# If PD uses TLS:
ssh FLAREDB_NODE "sudo vi /etc/centra-cloud/flaredb.toml"
# Add/verify:
# [tls]
# cert_file = "/etc/centra-cloud/certs/flaredb-node-N.crt"
# key_file = "/etc/centra-cloud/certs/flaredb-node-N.key"
# ca_file = "/etc/centra-cloud/certs/ca.crt"

# Restart
ssh FLAREDB_NODE "sudo systemctl restart flaredb"
```

### Issue: Region Rebalancing Stuck

**Symptoms:**
- `pd/api/v1/stats/region` shows high `pending_peers` count
- Regions not moving to new stores
- PD logs show "failed to schedule operator" errors

**Diagnosis:**

```bash
# 1. Check region stats
curl http://PD_IP:2379/pd/api/v1/stats/region | jq

# 2. Check store capacity
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state, available, capacity}'

# 3. Check pending operators
curl http://PD_IP:2379/pd/api/v1/operators | jq
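
# Sketch of the free-space ratio used when judging "disk full" elsewhere in
# this runbook (under ~10% free is the threshold; numbers are made-up examples):
available=48   # GB free, example value
capacity=960   # GB total, example value
echo "free: $(( available * 100 / capacity ))% (investigate if < 10%)"   # prints: free: 5% (investigate if < 10%)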
# 4. Check PD scheduler config
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq
```

**Resolution:**

**If store is down:**

```bash
# Identify down store
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.state!="Up")'

# Fix or remove down store
ssh DOWN_STORE_NODE "sudo systemctl restart flaredb"

# If cannot recover, remove store:
curl -X DELETE http://PD_IP:2379/pd/api/v1/store/DOWN_STORE_ID
```

**If disk full:**

```bash
# Identify full stores
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select((.available / .capacity) < 0.1)'

# Add more storage or scale out with new stores
# See scale-out.md for adding stores
```

**If scheduler disabled:**

```bash
# Check scheduler status
curl http://PD_IP:2379/pd/api/v1/config/schedule | jq '.schedulers'

# Enable schedulers if disabled
curl -X POST http://PD_IP:2379/pd/api/v1/config/schedule \
  -d '{"max-snapshot-count": 3, "max-pending-peer-count": 16}'
```

### Issue: Read/Write Timeout

**Symptoms:**
- Client operations time out after 30s
- Logs show "context deadline exceeded"
- No leader election issues visible

**Diagnosis:**

```bash
# 1. Check client timeout config
# Default timeout is 30s

# 2. Check store responsiveness
time flaredb-client --endpoint http://STORE_IP:2379 get test-key

# 3. Check CPU usage on stores
ssh STORE_NODE "top -bn1 | grep flaredb"

# 4. Check slow queries
ssh STORE_NODE "journalctl -u flaredb -n 500 | grep -i 'slow\|timeout'"

# 5. Check disk latency
ssh STORE_NODE "iostat -x 1 10"
```

**Resolution:**

**If disk I/O bottleneck:**

```bash
# Same as the Chainfire high-write-latency issue:
# 1. Verify SSD usage
# 2. Tune RocksDB settings
# 3. Add more stores for read distribution
```

**If CPU bottleneck:**

```bash
# Check compaction storms
ssh STORE_NODE "journalctl -u flaredb | grep -i compaction | tail -50"

# Throttle compaction if needed
# Add to flaredb config:
#    [storage]
#    max_background_compactions = 2   # Reduce from default 4
#    max_background_flushes = 1       # Reduce from default 2

sudo systemctl restart flaredb
```

**If network partition:**

```bash
# Check connectivity between store and PD
ssh STORE_NODE "ping -c 10 PD_IP"

# Check for packet loss
# If >1% loss, investigate network infrastructure
```

## TLS/mTLS Issues

### Issue: TLS Handshake Failures

**Symptoms:**
- Logs show "tls: bad certificate" or "certificate verify failed"
- Connections fail immediately
- curl commands fail with SSL errors

**Diagnosis:**

```bash
# 1. Verify certificate files exist
ls -l /etc/centra-cloud/certs/

# 2. Check certificate validity
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -dates

# 3. Verify CA matches
openssl x509 -in /etc/centra-cloud/certs/ca.crt -noout -subject
openssl x509 -in /etc/centra-cloud/certs/chainfire-node-1.crt -noout -issuer
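
# The subject and issuer printed above should name the same CA; a sketch of the
# comparison with placeholder values (real values come from the openssl output):
ca_subject="subject=CN=centra-cloud-ca"    # placeholder
cert_issuer="issuer=CN=centra-cloud-ca"    # placeholder
[ "${ca_subject#subject=}" = "${cert_issuer#issuer=}" ] && echo "CA names match"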
# 4. Test TLS connection
openssl s_client -connect NODE_IP:2379 \
  -CAfile /etc/centra-cloud/certs/ca.crt \
  -cert /etc/centra-cloud/certs/chainfire-node-1.crt \
  -key /etc/centra-cloud/certs/chainfire-node-1.key
```

**Resolution:**

**If certificate expired:**

```bash
# Regenerate certificates
cd /path/to/centra-cloud
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /etc/centra-cloud/certs/* $node:/etc/centra-cloud/certs/
done

# Restart services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done
```

**If CA mismatch:**

```bash
# Ensure all nodes use same CA
# Regenerate all certs from same CA
# On CA-generating node:
./scripts/generate-dev-certs.sh /tmp/new-certs

# Distribute to all nodes
for node in node1 node2 node3; do
  scp /tmp/new-certs/* $node:/etc/centra-cloud/certs/
  ssh $node "sudo chown -R chainfire:chainfire /etc/centra-cloud/certs"
  ssh $node "sudo chmod 600 /etc/centra-cloud/certs/*.key"
done

# Restart all services
for node in node1 node2 node3; do
  ssh $node "sudo systemctl restart chainfire"
done
```

**If permissions issue:**

```bash
# Fix certificate file permissions
sudo chown chainfire:chainfire /etc/centra-cloud/certs/*
sudo chmod 644 /etc/centra-cloud/certs/*.crt
sudo chmod 600 /etc/centra-cloud/certs/*.key

# Restart service
sudo systemctl restart chainfire
```

## Performance Tuning

### Chainfire Performance Optimization

**For write-heavy workloads:**

```toml
# /etc/centra-cloud/chainfire.toml
[storage]
# Increase write buffer
write_buffer_size = 134217728   # 128MB

# More write buffers
max_write_buffer_number = 4

# Larger block cache for hot data
block_cache_size = 1073741824   # 1GB

# Reduce compaction frequency
level0_file_num_compaction_trigger = 8   # Default: 4
```

**For read-heavy workloads:**

```toml
[storage]
# Larger block cache
block_cache_size = 2147483648   # 2GB

# Enable bloom filters
bloom_filter_bits_per_key = 10

# More table cache
max_open_files = 10000   # Default: 1000
```

**For low-latency requirements:**

```toml
[raft]
# Reduce tick interval
tick_interval_ms = 50   # Default: 100

[storage]
# Enable direct I/O
use_direct_io_for_flush_and_compaction = true
```

### FlareDB Performance Optimization

**For high ingestion rate:**

```toml
# /etc/centra-cloud/flaredb.toml
[storage]
# Larger write buffers
write_buffer_size = 268435456   # 256MB
max_write_buffer_number = 6

# More background jobs
max_background_compactions = 4
max_background_flushes = 2
```

**For large query workloads:**

```toml
[storage]
# Larger block cache
block_cache_size = 4294967296   # 4GB

# Keep more files open
max_open_files = 20000
```

## Monitoring & Alerts

### Key Metrics to Monitor

**Chainfire:**
- `raft_index` - should advance steadily
- `raft_term` - should be stable (not increasing frequently)
- Write latency p50, p95, p99
- Disk I/O utilization
- Network bandwidth between nodes

**FlareDB:**
- Store state (Up/Down)
- Region count and distribution
- Pending peers count (should be near 0)
- Read/write QPS per store
- Disk space available

### Prometheus Queries

```promql
# Chainfire write latency
histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m]))

# Raft log replication lag
chainfire_raft_index{role="leader"} - chainfire_raft_index{role="follower"}

# FlareDB store health
flaredb_store_state == 1  # 1 = Up, 0 = Down

# Region rebalancing activity
rate(flaredb_pending_peers_total[5m])
```

### Alerting Rules

```yaml
# Prometheus alerting rules
groups:
  - name: chainfire
    rules:
      - alert: ChainfireNoLeader
        expr: chainfire_has_leader == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire cluster has no leader"

      - alert: ChainfireHighWriteLatency
        expr: histogram_quantile(0.99, rate(chainfire_write_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Chainfire p99 write latency >500ms"

      - alert: ChainfireNodeDown
        expr: up{job="chainfire"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Chainfire node {{ $labels.instance }} is down"

  - name: flaredb
    rules:
      - alert: FlareDBStoreDown
        expr: flaredb_store_state == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "FlareDB store {{ $labels.store_id }} is down"

      - alert: FlareDBHighPendingPeers
        expr: flaredb_pending_peers_total > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FlareDB has {{ $value }} pending peers (rebalancing stuck?)"
```

## Log Analysis

### Common Log Patterns

**Chainfire healthy operation:**

```
INFO chainfire_raft: Leader elected, term=3
INFO chainfire_storage: Committed entry, index=12345
INFO chainfire_api: Handled put request, latency=15ms
```

**Chainfire warning signs:**

```
WARN chainfire_raft: Election timeout, no heartbeat from leader
WARN chainfire_storage: RocksDB stall detected, duration=2000ms
ERROR chainfire_network: Failed to connect to peer, addr=node2:2380
```

**FlareDB healthy operation:**

```
INFO flaredb_pd_client: Registered with PD, store_id=1
INFO flaredb_raft: Applied snapshot, index=5000
INFO flaredb_service: Handled query, rows=1000, latency=50ms
```

**FlareDB warning signs:**

```
WARN flaredb_pd_client: Heartbeat to PD failed, retrying...
WARN flaredb_storage: Compaction is slow, duration=30s
ERROR flaredb_raft: Failed to replicate log, peer=store2
```

### Log Aggregation Queries

**Using journalctl:**

```bash
# Find all errors in last hour
journalctl -u chainfire --since "1 hour ago" | grep ERROR

# Count error types
journalctl -u chainfire --since "1 day ago" | grep ERROR | awk '{print $NF}' | sort | uniq -c | sort -rn

# Track leader changes
journalctl -u chainfire | grep "Leader elected" | tail -20
```

**Using grep for pattern matching:**

```bash
# Find slow operations
journalctl -u chainfire -n 10000 | grep -E 'latency=[0-9]{3,}ms'

# Find connection errors
journalctl -u chainfire -n 5000 | grep -i 'connection refused\|timeout\|unreachable'

# Find replication lag
journalctl -u chainfire | grep -i 'lag\|behind\|catch.*up'
```

## References

- Configuration: `specifications/configuration.md`
- Backup/Restore: `docs/ops/backup-restore.md`
- Scale-Out: `docs/ops/scale-out.md`
- Upgrade: `docs/ops/upgrade.md`
- RocksDB Tuning: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide