
Backup & Restore Runbook

Overview

This runbook covers backup and restore procedures for the persistent data of Chainfire (distributed KV store) and FlareDB (time-series database), both stored in RocksDB.

Prerequisites

Backup Requirements

  • Sufficient disk space for the snapshot (check data dir size + 20% margin; see the pre-flight check after this list)
  • Write access to backup destination directory
  • Node is healthy and reachable
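
A minimal pre-flight sketch, assuming the Chainfire paths used elsewhere in this runbook (adjust paths for FlareDB):

# Compare free space on the backup destination with data dir size + 20%
DATA_BYTES=$(sudo du -sb /var/lib/chainfire | awk '{print $1}')
FREE_BYTES=$(df --output=avail -B1 /var/backups | tail -n 1)
NEEDED_BYTES=$((DATA_BYTES + DATA_BYTES / 5))
if [ "$FREE_BYTES" -ge "$NEEDED_BYTES" ]; then
  echo "OK: $FREE_BYTES bytes free, $NEEDED_BYTES needed"
else
  echo "WARNING: not enough space for a snapshot plus margin"
fi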

Restore Requirements

  • Backup snapshot file available (see the quick check after this list)
  • Target node stopped (for full restore)
  • Data directory permissions correct (chown as service user)
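
A quick pre-restore sketch (the backup path is a placeholder; swap the service name for FlareDB):

# Confirm the backup exists and the service is stopped before a full restore
RESTORE_FROM="/var/backups/chainfire/BACKUP.tar.gz"
[ -f "$RESTORE_FROM" ] || { echo "backup not found: $RESTORE_FROM"; exit 1; }
if systemctl is-active --quiet chainfire; then
  echo "chainfire is still running; stop it before a full restore"
  exit 1
fi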

Chainfire Backup

Method 1: Hot Backup (RocksDB Checkpoint)

Advantages: No downtime, consistent snapshot
Disadvantages: Requires the admin checkpoint API or RocksDB checkpoint tooling on the node

# Create checkpoint backup while Chainfire is running
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"

# Trigger checkpoint via admin API (if exposed)
curl -X POST http://CHAINFIRE_IP:2379/admin/checkpoint \
  -d "{\"path\": \"$BACKUP_DIR\"}"

# OR use RocksDB checkpoint CLI
rocksdb_checkpoint --db=/var/lib/chainfire \
  --checkpoint_dir="$BACKUP_DIR"

# Verify checkpoint
ls -lh "$BACKUP_DIR"
# Should contain: CURRENT, MANIFEST-*, *.sst, *.log files

Method 2: Cold Backup (File Copy)

Advantages: Simple, no special tools
Disadvantages: Requires service stop

# Stop Chainfire service
sudo systemctl stop chainfire

# Create backup
BACKUP_DIR="/var/backups/chainfire/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"
sudo rsync -av /var/lib/chainfire/ "$BACKUP_DIR/"

# Restart service
sudo systemctl start chainfire

# Verify backup
du -sh "$BACKUP_DIR"

Automated Backup Script

Create /usr/local/bin/backup-chainfire.sh:

#!/bin/bash
set -euo pipefail

DATA_DIR="/var/lib/chainfire"
BACKUP_ROOT="/var/backups/chainfire"
RETENTION_DAYS=7

# Create backup
BACKUP_DIR="$BACKUP_ROOT/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Use checkpoint (hot backup)
rocksdb_checkpoint --db="$DATA_DIR" --checkpoint_dir="$BACKUP_DIR"

# Compress backup
tar -czf "$BACKUP_DIR.tar.gz" -C "$BACKUP_ROOT" "$(basename $BACKUP_DIR)"
rm -rf "$BACKUP_DIR"

# Clean old backups
find "$BACKUP_ROOT" -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete

echo "Backup complete: $BACKUP_DIR.tar.gz"

Schedule with cron:

# Add to crontab
0 2 * * * /usr/local/bin/backup-chainfire.sh >> /var/log/chainfire-backup.log 2>&1

Chainfire Restore

Full Restore from Backup

# Stop Chainfire service
sudo systemctl stop chainfire

# Backup current data (safety)
sudo mv /var/lib/chainfire /var/lib/chainfire.bak.$(date +%s)

# Extract backup
RESTORE_FROM="/var/backups/chainfire/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/chainfire --strip-components=1

# Fix permissions
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo chmod -R 750 /var/lib/chainfire

# Start service
sudo systemctl start chainfire

# Verify restore
chainfire-client --endpoint http://localhost:2379 status
# Check raft_index matches expected value from backup time

Point-in-Time Recovery (PITR)

Note: RocksDB does not natively support PITR. Use Raft log replay or a backup-at-interval strategy: take frequent backups and restore the one closest to the desired recovery point.

# List available backups
ls -lht /var/backups/chainfire/

# Choose backup closest to desired recovery point
RESTORE_FROM="/var/backups/chainfire/20251210-140000.tar.gz"

# Follow Full Restore steps above
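
With many archives on disk, a small helper can pick the newest backup at or before the desired recovery point; a sketch assuming the YYYYMMDD-HHMMSS.tar.gz names produced by the backup script above:

# Select the newest backup taken at or before TARGET
TARGET="20251210-133000"   # desired recovery point, same format as the backup names
RESTORE_FROM=$(ls /var/backups/chainfire/*.tar.gz | sort \
  | awk -v t="$TARGET" -F/ '{ ts = substr($NF, 1, 15); if (ts <= t) last = $0 } END { print last }')
echo "Closest backup at or before $TARGET: $RESTORE_FROM"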

FlareDB Backup

Hot Backup (RocksDB Checkpoint)

# Create checkpoint backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
sudo mkdir -p "$BACKUP_DIR"

# Trigger checkpoint
rocksdb_checkpoint --db=/var/lib/flaredb \
  --checkpoint_dir="$BACKUP_DIR"

# Compress
tar -czf "$BACKUP_DIR.tar.gz" -C /var/backups/flaredb "$(basename $BACKUP_DIR)"
rm -rf "$BACKUP_DIR"

echo "FlareDB backup: $BACKUP_DIR.tar.gz"

Namespace-Specific Backup

FlareDB stores data in RocksDB column families per namespace:

# Backup specific namespace (requires RocksDB CLI tools)
rocksdb_backup --db=/var/lib/flaredb \
  --backup_dir=/var/backups/flaredb/namespace-metrics-$(date +%Y%m%d) \
  --column_family=metrics

# List column families
rocksdb_ldb --db=/var/lib/flaredb list_column_families

FlareDB Restore

Full Restore

# Stop FlareDB service
sudo systemctl stop flaredb

# Backup current data
sudo mv /var/lib/flaredb /var/lib/flaredb.bak.$(date +%s)

# Extract backup
RESTORE_FROM="/var/backups/flaredb/20251210-020000.tar.gz"
sudo mkdir -p /var/lib/flaredb
sudo tar -xzf "$RESTORE_FROM" -C /var/lib/flaredb --strip-components=1

# Fix permissions
sudo chown -R flaredb:flaredb /var/lib/flaredb

# Start service
sudo systemctl start flaredb

# Verify
flaredb-client --endpoint http://localhost:2379 cluster-status

Multi-Node Cluster Considerations

Backup Strategy for Raft Clusters

Important: For Chainfire/FlareDB Raft clusters, back up from the leader node to get the most consistent snapshot.

# Identify leader
LEADER=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep leader | awk '{print $2}')

# Backup from leader node
ssh "node-$LEADER" "/usr/local/bin/backup-chainfire.sh"

Restore to Multi-Node Cluster

Option A: Restore Single Node (Raft will replicate)

  1. Restore backup to one node (e.g., leader)
  2. Other nodes will catch up via Raft replication
  3. Monitor replication lag: raft_index should converge on all nodes (see the sketch below)
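
A quick convergence check, assuming the status output includes a raft_index line (field layout is illustrative):

# Print raft_index on every node; re-run until the values match
for node in node1 node2 node3; do
  idx=$(chainfire-client --endpoint "http://$node:2379" status | awk '/raft_index/ {print $NF}')
  echo "$node raft_index=$idx"
done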

Option B: Restore All Nodes (Disaster Recovery)

# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Restore the same backup to all nodes (move any existing data aside first)
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh $node "sudo mv /var/lib/chainfire /var/lib/chainfire.bak.\$(date +%s) && sudo mkdir -p /var/lib/chainfire"
  ssh $node "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh $node "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done

# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Start followers
for node in node2 node3; do
  ssh $node "sudo systemctl start chainfire"
done

# Verify cluster
chainfire-client --endpoint http://node1:2379 member-list

Verification Steps

Post-Backup Verification

# Check backup file integrity
tar -tzf /var/backups/chainfire/BACKUP.tar.gz | head -20

# Compare sizes: the compressed archive will be smaller than the data dir,
# but it should not be implausibly small relative to it
du -sh /var/lib/chainfire
du -sh /var/backups/chainfire/BACKUP.tar.gz

# Test restore in isolated environment (optional)
# Use separate VM/container to restore and verify data integrity
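
A lighter-weight spot check than a full VM/container restore, assuming the rocksdb_ldb tool referenced elsewhere in this runbook is available:

# Extract into a scratch directory and read a few keys back
SCRATCH=$(mktemp -d)
tar -xzf /var/backups/chainfire/BACKUP.tar.gz -C "$SCRATCH" --strip-components=1
rocksdb_ldb --db="$SCRATCH" scan --max_keys=10   # should print keys without errors
rm -rf "$SCRATCH"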

Post-Restore Verification

# Check service health
sudo systemctl status chainfire
sudo systemctl status flaredb

# Verify data integrity
chainfire-client --endpoint http://localhost:2379 status
# Check: raft_index, raft_term, leader

# Test read operations
chainfire-client --endpoint http://localhost:2379 get test-key

# Check logs for errors
journalctl -u chainfire -n 100 --no-pager

Troubleshooting

Issue: Backup fails with "No space left on device"

Resolution:

# Check available space
df -h /var/backups

# Clean old backups
find /var/backups/chainfire -name "*.tar.gz" -mtime +7 -delete

# Or move backups to external storage
rsync -av --remove-source-files /var/backups/chainfire/ backup-server:/backups/chainfire/

Issue: Restore fails with permission denied

Resolution:

# Fix ownership
sudo chown -R chainfire:chainfire /var/lib/chainfire

# Fix SELinux context (if applicable)
sudo restorecon -R /var/lib/chainfire

Issue: After restore, cluster has split-brain

Symptoms:

  • Multiple nodes claim to be leader
  • member-list shows inconsistent state

Resolution:

# Stop all nodes
for node in node1 node2 node3; do ssh $node "sudo systemctl stop chainfire"; done

# Wipe data on followers (keep leader data)
for node in node2 node3; do
  ssh $node "sudo rm -rf /var/lib/chainfire/*"
done

# Restart leader (bootstraps cluster)
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Re-add followers via member-add
chainfire-client --endpoint http://node1:2379 member-add --node-id 2 --peer-url node2:2380
chainfire-client --endpoint http://node1:2379 member-add --node-id 3 --peer-url node3:2380

# Start followers
for node in node2 node3; do ssh $node "sudo systemctl start chainfire"; done

References