# Backup & Restore Runbook

## Overview

This runbook covers backup and restore procedures for Chainfire (distributed KV) and FlareDB (time-series DB) persistent data stored in RocksDB.

## Prerequisites

### Backup Requirements

- ✅ Sufficient disk space for the snapshot (check data dir size + 20% margin)
- ✅ Write access to the backup destination directory
- ✅ Node is healthy and reachable

### Restore Requirements

- ✅ Backup snapshot file available
- ✅ Target node stopped (for full restore)
- ✅ Data directory permissions correct (`chown` to the service user)

## Chainfire Backup

### Method 1: Hot Backup (RocksDB Checkpoint - Recommended)

**Advantages:** No downtime, consistent snapshot

```bash
# Create checkpoint backup while Chainfire is running
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"

# Trigger checkpoint via admin API (if exposed)
curl -X POST http://CHAINFIRE_IP:2379/admin/checkpoint \
  -d "{\"path\": \"$BACKUP_DIR\"}"

# OR use the RocksDB checkpoint CLI
rocksdb_checkpoint --db=/var/lib/chainfire \
  --checkpoint_dir="$BACKUP_DIR"

# Verify checkpoint
ls -lh "$BACKUP_DIR"
# Should contain: CURRENT, MANIFEST-*, *.sst, *.log files
```

### Method 2: Cold Backup (File Copy)

**Advantages:** Simple, no special tools
**Disadvantages:** Requires a service stop

```bash
# Stop Chainfire service
sudo systemctl stop chainfire

# Create backup
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"
sudo rsync -av /var/lib/chainfire/ "$BACKUP_DIR/"

# Restart service
sudo systemctl start chainfire

# Verify backup
du -sh "$BACKUP_DIR"
```

### Automated Backup Script

Create `/usr/local/bin/backup-chainfire.sh`:

```bash
#!/bin/bash
set -euo pipefail

DATA_DIR="/var/lib/chainfire"
BACKUP_ROOT="/var/backups/chainfire"
RETENTION_DAYS=7

# Create backup
BACKUP_DIR="$BACKUP_ROOT/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Use checkpoint (hot backup)
rocksdb_checkpoint --db="$DATA_DIR" --checkpoint_dir="$BACKUP_DIR"

# Compress backup
tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_ROOT" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

# Clean old backups
find "$BACKUP_ROOT" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete

echo "Backup complete: $BACKUP_DIR.tar.gz"
```

**Schedule with cron:**

```bash
# Add to crontab
0 2 * * * /usr/local/bin/backup-chainfire.sh >> /var/log/chainfire-backup.log 2>&1
```

## Chainfire Restore

### Full Restore from Backup

```bash
# Stop Chainfire service
sudo systemctl stop chainfire

# Back up current data (safety)
sudo mv /var/lib/chainfire "/var/lib/chainfire.bak.$(date +%s)"

# Extract backup
RESTORE_FROM="/var/backups/chainfire/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo chmod -R 750 /var/lib/chainfire

# Start service
sudo systemctl start chainfire

# Verify restore
chainfire-client --endpoint http://localhost:2379 status
# Check that raft_index matches the expected value from backup time
```
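The `raft_index` comparison above is easier if the index is captured when the backup is taken. The snippet below is a sketch, not stock tooling: it assumes `chainfire-client status` prints a `raft_index <value>` line (the same assumption the leader-detection `grep`/`awk` pattern later in this runbook makes), and it reuses the example backup timestamp above.

```bash
# At backup time: record the current Raft index next to the tarball
# (assumes `status` output contains a "raft_index <value>" line)
IDX=$(chainfire-client --endpoint http://localhost:2379 status \
  | grep raft_index | awk '{print $2}')
echo "$IDX" | sudo tee /var/backups/chainfire/20251210-020000.raft_index

# After restore: the restored node should report at least that index
EXPECTED=$(cat /var/backups/chainfire/20251210-020000.raft_index)
ACTUAL=$(chainfire-client --endpoint http://localhost:2379 status \
  | grep raft_index | awk '{print $2}')
if [ "$ACTUAL" -ge "$EXPECTED" ]; then
  echo "raft_index OK ($ACTUAL >= $EXPECTED)"
else
  echo "raft_index mismatch ($ACTUAL < $EXPECTED)"
fi
```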
### Point-in-Time Recovery (PITR)

**Note:** RocksDB does not natively support PITR. Use Raft log replay or a backup-at-interval strategy: the more frequent the backups, the closer you can get to the desired recovery point.

```bash
# List available backups
ls -lht /var/backups/chainfire/

# Choose the backup closest to the desired recovery point
RESTORE_FROM="/var/backups/chainfire/20251210-140000.tar.gz"

# Follow the Full Restore steps above
```

## FlareDB Backup

### Hot Backup (RocksDB Checkpoint)

```bash
# Create checkpoint backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"

# Trigger checkpoint
rocksdb_checkpoint --db=/var/lib/flaredb \
  --checkpoint_dir="$BACKUP_DIR"

# Compress
tar -czf "$BACKUP_DIR.tar.gz" -C /var/backups/flaredb "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

echo "FlareDB backup: $BACKUP_DIR.tar.gz"
```

### Namespace-Specific Backup

FlareDB stores data in RocksDB column families, one per namespace:

```bash
# Backup a specific namespace (requires RocksDB CLI tools)
rocksdb_backup --db=/var/lib/flaredb \
  --backup_dir="/var/backups/flaredb/namespace-metrics-$(date +%Y%m%d)" \
  --column_family=metrics

# List column families
rocksdb_ldb --db=/var/lib/flaredb list_column_families
```

## FlareDB Restore

### Full Restore

```bash
# Stop FlareDB service
sudo systemctl stop flaredb

# Back up current data
sudo mv /var/lib/flaredb "/var/lib/flaredb.bak.$(date +%s)"

# Extract backup
RESTORE_FROM="/var/backups/flaredb/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/flaredb
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/flaredb --strip-components=1

# Fix permissions
sudo chown -R flaredb:flaredb /var/lib/flaredb

# Start service
sudo systemctl start flaredb

# Verify
flaredb-client --endpoint http://localhost:2379 cluster-status
```

## Multi-Node Cluster Considerations

### Backup Strategy for Raft Clusters

**Important:** For Chainfire/FlareDB Raft clusters, back up from the **leader node** for the most consistent snapshot.

```bash
# Identify leader
LEADER=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep leader | awk '{print $2}')

# Backup from leader node
ssh "node-$LEADER" "/usr/local/bin/backup-chainfire.sh"
```

### Restore to a Multi-Node Cluster

**Option A: Restore Single Node (Raft will replicate)**

1. Restore the backup to one node (e.g., the leader)
2. Other nodes will catch up via Raft replication
3. Monitor replication lag: `raft_index` should converge across nodes (see the sketch below)
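Below is a minimal convergence check, a sketch that assumes three nodes named `node1`, `node2`, `node3` (matching the Option B example) and that `status` prints a `raft_index <value>` line, the same pattern the leader-detection snippet above relies on.

```bash
# Poll raft_index on every node until all report the same value
while true; do
  INDEXES=$(for node in node1 node2 node3; do
    chainfire-client --endpoint "http://$node:2379" status \
      | grep raft_index | awk '{print $2}'
  done)
  # All nodes converged when only one distinct index remains
  if [ "$(echo "$INDEXES" | sort -u | wc -l)" -eq 1 ]; then
    echo "Converged at raft_index $(echo "$INDEXES" | head -1)"
    break
  fi
  echo "Still converging:" $INDEXES
  sleep 5
done
```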
**Option B: Restore All Nodes (Disaster Recovery)**

```bash
# Stop all nodes
for node in node1 node2 node3; do
  ssh "$node" "sudo systemctl stop chainfire"
done

# Restore the same backup to all nodes
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh "$node" "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh "$node" "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done

# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Start followers
for node in node2 node3; do
  ssh "$node" "sudo systemctl start chainfire"
done

# Verify cluster
chainfire-client --endpoint http://node1:2379 member-list
```

## Verification Steps

### Post-Backup Verification

```bash
# Check backup file integrity
tar -tzf /var/backups/chainfire/BACKUP.tar.gz | head -20

# Verify backup size (should match data dir size approximately)
du -sh /var/lib/chainfire
du -sh /var/backups/chainfire/BACKUP.tar.gz

# Test restore in an isolated environment (optional)
# Use a separate VM/container to restore and verify data integrity
```

### Post-Restore Verification

```bash
# Check service health
sudo systemctl status chainfire
sudo systemctl status flaredb

# Verify data integrity
chainfire-client --endpoint http://localhost:2379 status
# Check: raft_index, raft_term, leader

# Test read operations
chainfire-client --endpoint http://localhost:2379 get test-key

# Check logs for errors
journalctl -u chainfire -n 100 --no-pager
```

## Troubleshooting

### Issue: Backup fails with "No space left on device"

**Resolution:**

```bash
# Check available space
df -h /var/backups

# Clean old backups
find /var/backups/chainfire -name "*.tar.gz" -mtime +7 -delete

# Or move backups to external storage
rsync -av --remove-source-files /var/backups/chainfire/ backup-server:/backups/chainfire/
```

### Issue: Restore fails with permission denied

**Resolution:**

```bash
# Fix ownership
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Fix SELinux context (if applicable)
sudo restorecon -R /var/lib/chainfire
```

### Issue: After restore, cluster has split-brain

**Symptoms:**

- Multiple nodes claim to be leader
- `member-list` shows inconsistent state

**Resolution:**

```bash
# Stop all nodes
for node in node1 node2 node3; do ssh "$node" "sudo systemctl stop chainfire"; done

# Wipe data on followers (keep leader data)
for node in node2 node3; do
  ssh "$node" "sudo rm -rf /var/lib/chainfire/*"
done

# Restart leader (bootstraps cluster)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
for node in node2 node3; do ssh "$node" "sudo systemctl start chainfire"; done
```

## References

- RocksDB Checkpoints: https://github.com/facebook/rocksdb/wiki/Checkpoints
- Configuration: `specifications/configuration.md`
- Storage Implementation: `chainfire/crates/chainfire-storage/`
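## Appendix: Scripted Restore Drill

The optional isolated restore test from the verification steps can be scripted. Below is a minimal sketch, assuming the `rocksdb_ldb` tool shown in the FlareDB section is installed. A successful `list_column_families` is a cheap proof that the extracted backup is an openable RocksDB directory, not a full data-integrity check.

```bash
#!/bin/bash
set -euo pipefail

BACKUP_ROOT="/var/backups/chainfire"

# Pick the newest backup tarball
LATEST=$(ls -1t "$BACKUP_ROOT"/*.tar.gz | head -1)

# Extract into a throwaway directory, cleaned up on exit
SCRATCH=$(mktemp -d)
trap 'rm -rf "$SCRATCH"' EXIT
tar -xzf "$LATEST" -C "$SCRATCH" --strip-components=1

# If RocksDB can enumerate column families, the backup is at least openable
rocksdb_ldb --db="$SCRATCH" list_column_families

echo "Restore drill OK: $LATEST"
```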