# Backup & Restore Runbook

## Overview

This runbook covers backup and restore procedures for Chainfire (distributed KV) and FlareDB (time-series DB) persistent data stored in RocksDB.

## Prerequisites

### Backup Requirements

- ✅ Sufficient disk space for the snapshot (data directory size plus a ~20% margin; see the pre-flight sketch below)
- ✅ Write access to the backup destination directory
- ✅ Node is healthy and reachable

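The disk-space requirement can be checked mechanically before each run. A minimal pre-flight sketch, assuming the Chainfire paths and service name used in the examples below (adjust `DATA_DIR`/`BACKUP_ROOT` per service and node; run as root like the backup script):

```bash
#!/bin/bash
# Pre-flight check before a hot backup: enough free space and a running service.
set -euo pipefail

DATA_DIR="${1:-/var/lib/chainfire}"
BACKUP_ROOT="${2:-/var/backups/chainfire}"
mkdir -p "$BACKUP_ROOT"

# Data size in KiB, plus a ~20% margin
data_kb=$(du -sk "$DATA_DIR" | awk '{print $1}')
need_kb=$(( data_kb + data_kb / 5 ))

# Free space on the backup filesystem in KiB
free_kb=$(df -Pk "$BACKUP_ROOT" | awk 'NR==2 {print $4}')

if [ "$free_kb" -lt "$need_kb" ]; then
  echo "Not enough space: need ~${need_kb} KiB, have ${free_kb} KiB" >&2
  exit 1
fi

# The service must be up before a hot (checkpoint) backup
systemctl is-active --quiet chainfire || { echo "chainfire is not running" >&2; exit 1; }
echo "Pre-flight OK: ${free_kb} KiB free, ${need_kb} KiB required"
```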
### Restore Requirements

- ✅ Backup snapshot file available (see the check below)
- ✅ Target node stopped (for a full restore)
- ✅ Data directory permissions correct (`chown` to the service user)

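A matching restore pre-flight sketch under the same assumptions (the `chainfire` service name comes from the examples below; swap in `flaredb` as needed):

```bash
#!/bin/bash
# Pre-flight check before a full restore: snapshot exists and the service is stopped.
set -euo pipefail

RESTORE_FROM="${1:?usage: restore-preflight.sh /path/to/backup.tar.gz}"

# Snapshot must exist, be non-empty, and be a readable archive
[ -s "$RESTORE_FROM" ] || { echo "Backup not found or empty: $RESTORE_FROM" >&2; exit 1; }
tar -tzf "$RESTORE_FROM" > /dev/null || { echo "Backup archive is corrupt" >&2; exit 1; }

# Target service must be stopped for a full restore
if systemctl is-active --quiet chainfire; then
  echo "chainfire is still running; stop it before restoring" >&2
  exit 1
fi
echo "Restore pre-flight OK"
```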
## Chainfire Backup

### Method 1: Hot Backup (RocksDB Checkpoint - Recommended)

**Advantages:** No downtime, consistent snapshot

```bash
# Create checkpoint backup while Chainfire is running
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"

# Trigger checkpoint via admin API (if exposed)
curl -X POST http://CHAINFIRE_IP:2379/admin/checkpoint \
  -d "{\"path\": \"$BACKUP_DIR\"}"

# OR use RocksDB checkpoint CLI
rocksdb_checkpoint --db=/var/lib/chainfire \
  --checkpoint_dir="$BACKUP_DIR"

# Verify checkpoint
ls -lh "$BACKUP_DIR"
# Should contain: CURRENT, MANIFEST-*, *.sst, *.log files
```

### Method 2: Cold Backup (File Copy)

**Advantages:** Simple, no special tools

**Disadvantages:** Requires service stop

```bash
# Stop Chainfire service
sudo systemctl stop chainfire

# Create backup
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"
sudo rsync -av /var/lib/chainfire/ "$BACKUP_DIR/"

# Restart service
sudo systemctl start chainfire

# Verify backup
du -sh "$BACKUP_DIR"
```

### Automated Backup Script

Create `/usr/local/bin/backup-chainfire.sh`:

```bash
#!/bin/bash
set -euo pipefail

DATA_DIR="/var/lib/chainfire"
BACKUP_ROOT="/var/backups/chainfire"
RETENTION_DAYS=7

# Create backup
BACKUP_DIR="$BACKUP_ROOT/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Use checkpoint (hot backup)
rocksdb_checkpoint --db="$DATA_DIR" --checkpoint_dir="$BACKUP_DIR"

# Compress backup
tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_ROOT" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

# Clean old backups
find "$BACKUP_ROOT" -name "*.tar.gz" -mtime +"$RETENTION_DAYS" -delete

echo "Backup complete: $BACKUP_DIR.tar.gz"
```

**Schedule with cron:**

```bash
# Add to crontab
0 2 * * * /usr/local/bin/backup-chainfire.sh >> /var/log/chainfire-backup.log 2>&1
```

## Chainfire Restore

### Full Restore from Backup

```bash
# Stop Chainfire service
sudo systemctl stop chainfire

# Backup current data (safety)
sudo mv /var/lib/chainfire /var/lib/chainfire.bak.$(date +%s)

# Extract backup
RESTORE_FROM="/var/backups/chainfire/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo chmod -R 750 /var/lib/chainfire

# Start service
sudo systemctl start chainfire

# Verify restore
chainfire-client --endpoint http://localhost:2379 status
# Check raft_index matches expected value from backup time
```

### Point-in-Time Recovery (PITR)

**Note:** RocksDB does not natively support PITR. Use Raft log replay or a backup-at-interval strategy.

```bash
# List available backups
ls -lht /var/backups/chainfire/

# Choose backup closest to desired recovery point
RESTORE_FROM="/var/backups/chainfire/20251210-140000.tar.gz"

# Follow Full Restore steps above
```
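Because backup filenames encode their creation time (`%Y%m%d-%H%M%S`, per the backup script above), picking the backup closest to a recovery point can be scripted. A minimal selection sketch, where `TARGET` is a hypothetical recovery point:

```bash
# Pick the newest backup taken at or before the desired recovery point.
TARGET="20251210-133000"   # hypothetical recovery point, same format as the filenames

RESTORE_FROM=$(ls /var/backups/chainfire/*.tar.gz 2>/dev/null \
  | sed 's#.*/##; s#\.tar\.gz$##' \
  | sort \
  | awk -v t="$TARGET" '$0 <= t' \
  | tail -n 1)

[ -n "$RESTORE_FROM" ] || { echo "No backup at or before $TARGET" >&2; exit 1; }
RESTORE_FROM="/var/backups/chainfire/$RESTORE_FROM.tar.gz"
echo "Restoring from: $RESTORE_FROM"
```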

## FlareDB Backup

### Hot Backup (RocksDB Checkpoint)

```bash
# Create checkpoint backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"

# Trigger checkpoint
rocksdb_checkpoint --db=/var/lib/flaredb \
  --checkpoint_dir="$BACKUP_DIR"

# Compress
tar -czf "$BACKUP_DIR.tar.gz" -C /var/backups/flaredb "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

echo "FlareDB backup: $BACKUP_DIR.tar.gz"
```

### Namespace-Specific Backup

FlareDB stores data in RocksDB column families per namespace:

```bash
# Backup specific namespace (requires RocksDB CLI tools)
rocksdb_backup --db=/var/lib/flaredb \
  --backup_dir=/var/backups/flaredb/namespace-metrics-$(date +%Y%m%d) \
  --column_family=metrics

# List column families
rocksdb_ldb --db=/var/lib/flaredb list_column_families
```

## FlareDB Restore

### Full Restore

```bash
# Stop FlareDB service
sudo systemctl stop flaredb

# Backup current data
sudo mv /var/lib/flaredb /var/lib/flaredb.bak.$(date +%s)

# Extract backup
RESTORE_FROM="/var/backups/flaredb/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/flaredb
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/flaredb --strip-components=1

# Fix permissions
sudo chown -R flaredb:flaredb /var/lib/flaredb

# Start service
sudo systemctl start flaredb

# Verify
flaredb-client --endpoint http://localhost:2379 cluster-status
```

## Multi-Node Cluster Considerations

### Backup Strategy for Raft Clusters

**Important:** For Chainfire/FlareDB Raft clusters, back up from the **leader node** for the most consistent snapshot.

```bash
# Identify leader
LEADER=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep leader | awk '{print $2}')

# Backup from leader node
ssh "node-$LEADER" "/usr/local/bin/backup-chainfire.sh"
```

### Restore to Multi-Node Cluster

**Option A: Restore Single Node (Raft will replicate)**

1. Restore the backup to one node (e.g., the leader)
2. Other nodes will catch up via Raft replication
3. Monitor replication lag: `raft_index` should converge across members (see the sketch below)

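A minimal convergence-watch sketch for step 3. It assumes `chainfire-client ... status` reports `raft_index` as a `key value` pair, per the verification steps later in this runbook; node names are placeholders:

```bash
# Poll raft_index on every member until the values converge (or a timeout hits).
NODES="node1 node2 node3"

for _ in $(seq 1 60); do
  indexes=$(for n in $NODES; do
    # Assumes the status output contains a line like "raft_index <value>"
    chainfire-client --endpoint "http://$n:2379" status | awk '/raft_index/ {print $2}'
  done)
  if [ "$(echo "$indexes" | sort -u | wc -l)" -eq 1 ]; then
    echo "raft_index converged: $(echo "$indexes" | head -n 1)"
    break
  fi
  echo "Waiting for convergence: $indexes"
  sleep 5
done
```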
**Option B: Restore All Nodes (Disaster Recovery)**

```bash
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Restore same backup to all nodes (move the old data aside first, as in the single-node restore)
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  ssh $node "sudo mv /var/lib/chainfire /var/lib/chainfire.bak.$(date +%s) && sudo mkdir -p /var/lib/chainfire"
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh $node "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh $node "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done

# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Start followers
for node in node2 node3; do
  ssh $node "sudo systemctl start chainfire"
done

# Verify cluster
chainfire-client --endpoint http://node1:2379 member-list
```

## Verification Steps

### Post-Backup Verification

```bash
# Check backup file integrity
tar -tzf /var/backups/chainfire/BACKUP.tar.gz | head -20

# Verify backup size (should approximately match the data dir size)
du -sh /var/lib/chainfire
du -sh /var/backups/chainfire/BACKUP.tar.gz

# Test restore in isolated environment (optional)
# Use a separate VM/container to restore and verify data integrity (see the sketch below)
```
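For that optional check, a minimal scratch-restore sketch that unpacks the archive into a throwaway directory and confirms RocksDB can open it, reusing the `rocksdb_ldb` tool referenced in the namespace backup section (flags are assumptions carried over from that section):

```bash
# Extract the backup into a scratch directory and make sure it opens as a valid RocksDB instance.
BACKUP="/var/backups/chainfire/BACKUP.tar.gz"
SCRATCH=$(mktemp -d)

tar -xzf "$BACKUP" -C "$SCRATCH" --strip-components=1

# Listing column families fails if the extracted files are not a readable RocksDB directory
rocksdb_ldb --db="$SCRATCH" list_column_families

rm -rf "$SCRATCH"
```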

### Post-Restore Verification

```bash
# Check service health
sudo systemctl status chainfire
sudo systemctl status flaredb

# Verify data integrity
chainfire-client --endpoint http://localhost:2379 status
# Check: raft_index, raft_term, leader

# Test read operations
chainfire-client --endpoint http://localhost:2379 get test-key

# Check logs for errors
journalctl -u chainfire -n 100 --no-pager
```

## Troubleshooting

### Issue: Backup fails with "No space left on device"

**Resolution:**

```bash
# Check available space
df -h /var/backups

# Clean old backups
find /var/backups/chainfire -name "*.tar.gz" -mtime +7 -delete

# Or move backups to external storage
rsync -av --remove-source-files /var/backups/chainfire/ backup-server:/backups/chainfire/
```

### Issue: Restore fails with permission denied

**Resolution:**

```bash
# Fix ownership
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Fix SELinux context (if applicable)
sudo restorecon -R /var/lib/chainfire
```

### Issue: After restore, cluster has split-brain

**Symptoms:**

- Multiple nodes claim to be leader
- `member-list` shows inconsistent state

**Resolution:**

```bash
# Stop all nodes
for node in node1 node2 node3; do ssh $node "sudo systemctl stop chainfire"; done

# Wipe data on followers (keep leader data)
for node in node2 node3; do
  ssh $node "sudo rm -rf /var/lib/chainfire/*"
done

# Restart leader (bootstraps cluster)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
for node in node2 node3; do ssh $node "sudo systemctl start chainfire"; done
```

## References

- RocksDB Backup: https://github.com/facebook/rocksdb/wiki/Checkpoints
- Configuration: `specifications/configuration.md`
- Storage Implementation: `chainfire/crates/chainfire-storage/`