# Backup & Restore Runbook
## Overview
This runbook covers backup and restore procedures for Chainfire (distributed KV) and FlareDB (time-series DB) persistent data stored in RocksDB.
## Prerequisites
### Backup Requirements
- ✅ Sufficient disk space for snapshot (check data dir size + 20% margin)
- ✅ Write access to backup destination directory
- ✅ Node is healthy and reachable
### Restore Requirements
- ✅ Backup snapshot file available
- ✅ Target node stopped (for full restore)
- ✅ Data directory permissions correct (`chown` to the service user after extraction)
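The disk-space requirement above (data dir size + 20% margin) can be scripted as a preflight check. A minimal sketch; the function name is illustrative, and it assumes GNU `du`/`df`:

```shell
#!/bin/bash
# Preflight: confirm the backup destination has room for the data
# directory plus the 20% margin this runbook recommends.
# Usage: check_backup_space DATA_DIR BACKUP_DEST
check_backup_space() {
  local data_dir="$1" dest="$2" need free
  # Bytes used by the data dir, inflated by 20%
  need=$(du -s --block-size=1 "$data_dir" | awk '{print int($1 * 1.2)}')
  # Bytes available on the destination filesystem
  free=$(df --output=avail --block-size=1 "$dest" | tail -1)
  if [ "$free" -lt "$need" ]; then
    echo "FAIL: need $need bytes (data + 20%), only $free free on $dest" >&2
    return 1
  fi
  echo "OK: $free bytes free, $need required"
}
```

Run it before any backup, e.g. `check_backup_space /var/lib/chainfire /var/backups/chainfire`.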
## Chainfire Backup
### Method 1: Hot Backup (RocksDB Checkpoint - Recommended)
**Advantages:** No downtime, consistent snapshot
```bash
# Create checkpoint backup while Chainfire is running
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"
# Trigger checkpoint via admin API (if exposed)
curl -X POST http://CHAINFIRE_IP:2379/admin/checkpoint \
  -d "{\"path\": \"$BACKUP_DIR\"}"
# OR use RocksDB checkpoint CLI
rocksdb_checkpoint --db=/var/lib/chainfire \
  --checkpoint_dir="$BACKUP_DIR"
# Verify checkpoint
ls -lh "$BACKUP_DIR"
# Should contain: CURRENT, MANIFEST-*, *.sst, *.log files
```
### Method 2: Cold Backup (File Copy)
**Advantages:** Simple, no special tools
**Disadvantages:** Requires service stop
```bash
# Stop Chainfire service
sudo systemctl stop chainfire
# Create backup
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"
sudo rsync -av /var/lib/chainfire/ "$BACKUP_DIR/"
# Restart service
sudo systemctl start chainfire
# Verify backup
du -sh "$BACKUP_DIR"
```
### Automated Backup Script
Create `/usr/local/bin/backup-chainfire.sh`:
```bash
#!/bin/bash
set -euo pipefail
DATA_DIR="/var/lib/chainfire"
BACKUP_ROOT="/var/backups/chainfire"
RETENTION_DAYS=7
# Create backup
BACKUP_DIR="$BACKUP_ROOT/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Use checkpoint (hot backup)
rocksdb_checkpoint --db="$DATA_DIR" --checkpoint_dir="$BACKUP_DIR"
# Compress backup
tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_ROOT" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"
# Clean old backups
find "$BACKUP_ROOT" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete
echo "Backup complete: $BACKUP_DIR.tar.gz"
```
**Schedule with cron:**
```bash
# Add to crontab (crontab -e): run daily at 02:00
0 2 * * * /usr/local/bin/backup-chainfire.sh >> /var/log/chainfire-backup.log 2>&1
```
## Chainfire Restore
### Full Restore from Backup
```bash
# Stop Chainfire service
sudo systemctl stop chainfire
# Backup current data (safety)
sudo mv /var/lib/chainfire /var/lib/chainfire.bak.$(date +%s)
# Extract backup
RESTORE_FROM="/var/backups/chainfire/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/chainfire --strip-components=1
# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo chmod -R 750 /var/lib/chainfire
# Start service
sudo systemctl start chainfire
# Verify restore
chainfire-client --endpoint http://localhost:2379 status
# Check raft_index matches expected value from backup time
```
### Point-in-Time Recovery (PITR)
**Note:** RocksDB does not natively support PITR. Use Raft log replay or backup-at-interval strategy.
```bash
# List available backups
ls -lht /var/backups/chainfire/
# Choose backup closest to desired recovery point
RESTORE_FROM="/var/backups/chainfire/20251210-140000.tar.gz"
# Follow Full Restore steps above
```
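The "choose the backup closest to the recovery point" step can be automated. Because the `YYYYMMDD-HHMMSS` stamps used throughout this runbook sort lexicographically, a plain string comparison suffices; the helper name is illustrative:

```shell
#!/bin/bash
# closest_backup TARGET TARBALL...
# Prints the newest backup whose YYYYMMDD-HHMMSS stamp is at or before
# TARGET. Lexicographic comparison is safe for this timestamp format.
closest_backup() {
  local target="$1" best="" stamp f
  shift
  for f in "$@"; do
    stamp=$(basename "$f" .tar.gz)
    if [[ ! "$stamp" > "$target" ]]; then   # stamp <= target
      if [[ -z "$best" || "$stamp" > "$(basename "$best" .tar.gz)" ]]; then
        best="$f"
      fi
    fi
  done
  [ -n "$best" ] && echo "$best"
}
```

Usage: `RESTORE_FROM=$(closest_backup 20251210-150000 /var/backups/chainfire/*.tar.gz)`, then follow the Full Restore steps.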
## FlareDB Backup
### Hot Backup (RocksDB Checkpoint)
```bash
# Create checkpoint backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"
# Trigger checkpoint
rocksdb_checkpoint --db=/var/lib/flaredb \
  --checkpoint_dir="$BACKUP_DIR"
# Compress
tar -czf "$BACKUP_DIR.tar.gz" -C /var/backups/flaredb "$(basename $BACKUP_DIR)"
rm -rf "$BACKUP_DIR"
echo "FlareDB backup: $BACKUP_DIR.tar.gz"
```
### Namespace-Specific Backup
FlareDB stores data in RocksDB column families per namespace:
```bash
# Backup specific namespace (requires RocksDB CLI tools)
rocksdb_backup --db=/var/lib/flaredb \
  --backup_dir=/var/backups/flaredb/namespace-metrics-$(date +%Y%m%d) \
  --column_family=metrics
# List column families
rocksdb_ldb --db=/var/lib/flaredb list_column_families
```
## FlareDB Restore
### Full Restore
```bash
# Stop FlareDB service
sudo systemctl stop flaredb
# Backup current data
sudo mv /var/lib/flaredb /var/lib/flaredb.bak.$(date +%s)
# Extract backup
RESTORE_FROM="/var/backups/flaredb/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/flaredb
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/flaredb --strip-components=1
# Fix permissions
sudo chown -R flaredb:flaredb /var/lib/flaredb
# Start service
sudo systemctl start flaredb
# Verify
flaredb-client --endpoint http://localhost:2379 cluster-status
```
## Multi-Node Cluster Considerations
### Backup Strategy for Raft Clusters
**Important:** For Chainfire/FlareDB Raft clusters, take backups from the **leader node**, which holds the most up-to-date committed state.
```bash
# Identify leader
LEADER=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep leader | awk '{print $2}')
# Backup from leader node
ssh "node-$LEADER" "/usr/local/bin/backup-chainfire.sh"
```
### Restore to Multi-Node Cluster
**Option A: Restore Single Node (Raft will replicate)**
1. Restore backup to one node (e.g., leader)
2. Other nodes will catch up via Raft replication
3. Monitor replication lag: `raft_index` should converge
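Step 3 can be scripted: the convergence check below is pure arithmetic over the `raft_index` values reported by each node. The commented polling loop uses the `chainfire-client` invocation from this runbook; the function name and lag tolerance are illustrative:

```shell
#!/bin/bash
# indices_converged TOLERANCE INDEX...
# Succeeds when the spread between the highest and lowest raft_index
# reported by the nodes is within TOLERANCE log entries.
indices_converged() {
  local tol="$1"; shift
  local min="$1" max="$1" i
  for i in "$@"; do
    if [ "$i" -lt "$min" ]; then min=$i; fi
    if [ "$i" -gt "$max" ]; then max=$i; fi
  done
  [ $((max - min)) -le "$tol" ]
}

# Example polling loop (endpoint and field name per this runbook):
# until indices_converged 5 $(for n in node1 node2 node3; do
#   chainfire-client --endpoint "http://$n:2379" status \
#     | awk '/raft_index/ {print $2}'
# done); do sleep 5; done
```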
**Option B: Restore All Nodes (Disaster Recovery)**
```bash
# Stop all nodes
for node in node1 node2 node3; do
  ssh "$node" "sudo systemctl stop chainfire"
done
# Restore same backup to all nodes
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  # Move aside existing data so stale SST files don't mix with the restore
  ssh "$node" 'sudo mv /var/lib/chainfire /var/lib/chainfire.bak.$(date +%s) && sudo mkdir -p /var/lib/chainfire'
  ssh "$node" "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh "$node" "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done
# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10
# Start followers
for node in node2 node3; do
  ssh "$node" "sudo systemctl start chainfire"
done
# Verify cluster
chainfire-client --endpoint http://node1:2379 member-list
```
## Verification Steps
### Post-Backup Verification
```bash
# Check backup file integrity
tar -tzf /var/backups/chainfire/BACKUP.tar.gz | head -20
# Verify backup size (should match data dir size approximately)
du -sh /var/lib/chainfire
du -sh /var/backups/chainfire/BACKUP.tar.gz
# Test restore in isolated environment (optional)
# Use separate VM/container to restore and verify data integrity
```
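The optional isolated-restore test can be partly automated without a dedicated VM: extract the tarball into a scratch directory and check for the RocksDB marker files named earlier (`CURRENT`, `MANIFEST-*`). A sketch; the function name is illustrative:

```shell
#!/bin/bash
# verify_backup_tarball TARBALL
# Extracts the backup into a throwaway directory and confirms the
# RocksDB marker files (CURRENT, MANIFEST-*) are present.
verify_backup_tarball() {
  local tarball="$1" tmp db
  tmp=$(mktemp -d)
  tar -xzf "$tarball" -C "$tmp"
  # Locate the directory holding CURRENT (backups nest one level deep)
  db=$(find "$tmp" -name CURRENT | head -1)
  db=${db%/CURRENT}
  if [ -z "$db" ] || ! ls "$db"/MANIFEST-* >/dev/null 2>&1; then
    echo "FAIL: $tarball is missing RocksDB marker files" >&2
    rm -rf "$tmp"
    return 1
  fi
  echo "OK: $tarball looks like a RocksDB snapshot"
  rm -rf "$tmp"
}
```

This only checks structure; a full integrity test still means starting the service against the restored directory in an isolated environment.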
### Post-Restore Verification
```bash
# Check service health
sudo systemctl status chainfire
sudo systemctl status flaredb
# Verify data integrity
chainfire-client --endpoint http://localhost:2379 status
# Check: raft_index, raft_term, leader
# Test read operations
chainfire-client --endpoint http://localhost:2379 get test-key
# Check logs for errors
journalctl -u chainfire -n 100 --no-pager
```
## Troubleshooting
### Issue: Backup fails with "No space left on device"
**Resolution:**
```bash
# Check available space
df -h /var/backups
# Clean old backups
find /var/backups/chainfire -name "*.tar.gz" -mtime +7 -delete
# Or move backups to external storage
rsync -av --remove-source-files /var/backups/chainfire/ backup-server:/backups/chainfire/
```
### Issue: Restore fails with permission denied
**Resolution:**
```bash
# Fix ownership
sudo chown -R chainfire:chainfire /var/lib/chainfire
# Fix SELinux context (if applicable)
sudo restorecon -R /var/lib/chainfire
```
### Issue: After restore, cluster has split-brain
**Symptoms:**
- Multiple nodes claim to be leader
- `member-list` shows inconsistent state
**Resolution:**
```bash
# Stop all nodes
for node in node1 node2 node3; do ssh $node "sudo systemctl stop chainfire"; done
# Wipe data on followers (keep leader data)
for node in node2 node3; do
  ssh "$node" "sudo rm -rf /var/lib/chainfire/*"
done
# Restart leader (bootstraps cluster)
ssh node1 "sudo systemctl start chainfire"
sleep 10
# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380
# Start followers
for node in node2 node3; do ssh $node "sudo systemctl start chainfire"; done
```
## References
- RocksDB Backup: https://github.com/facebook/rocksdb/wiki/Checkpoints
- Configuration: `specifications/configuration.md`
- Storage Implementation: `chainfire/crates/chainfire-storage/`