# Rolling Upgrade Runbook

## Overview

This runbook describes rolling upgrade procedures for Chainfire and FlareDB clusters that minimize downtime and maintain data availability during version upgrades.

## Prerequisites

### Pre-Upgrade Checklist

- ✅ New version tested in staging environment
- ✅ Backup of all nodes completed (see `backup-restore.md`)
- ✅ Release notes reviewed for breaking changes
- ✅ Rollback plan prepared
- ✅ Maintenance window scheduled (if required)

### Compatibility Requirements

- ✅ New version is compatible with current version (check release notes)
- ✅ Proto changes are backward-compatible (if applicable)
- ✅ Database schema migrations documented

### Infrastructure

- ✅ New binary built and available on all nodes
- ✅ Sufficient disk space for new binaries and data
- ✅ Monitoring and alerting functional

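Parts of the infrastructure checklist can be automated. A minimal pre-flight sketch for the disk-space item (the path and the 2 GiB headroom figure below are illustrative choices, not mandated by this runbook):

```shell
# Pre-flight disk check: succeed if the filesystem holding $1
# has at least $2 KiB available (per POSIX `df -Pk` output).
has_free_kib() {
  local path="$1" need_kib="$2"
  local avail_kib
  avail_kib=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  [ "$avail_kib" -ge "$need_kib" ]
}

# Example: require ~2 GiB of headroom for the new binary plus data growth.
if has_free_kib /var/lib 2097152; then
  echo "disk ok"
else
  echo "insufficient disk space" >&2
fi
```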
## Chainfire Rolling Upgrade

### Pre-Upgrade Checks

```bash
# Check cluster health
chainfire-client --endpoint http://LEADER_IP:2379 status

# Verify all nodes are healthy
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Check current version
chainfire-server --version

# Verify no ongoing operations
chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
# Wait for the index to stabilize (no rapid changes)

# Create backup
/usr/local/bin/backup-chainfire.sh
```

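The "wait for the index to stabilize" step can be made mechanical by sampling `raft_index` twice. A sketch, where `get_raft_index` is a stand-in for the `grep raft_index` query above:

```shell
# Report whether the raft index changed between two samples $1 seconds apart.
index_stable() {
  local a b
  a=$(get_raft_index)
  sleep "$1"
  b=$(get_raft_index)
  if [ "$a" -eq "$b" ]; then echo "stable"; else echo "active ($a -> $b)"; fi
}

# Stub for illustration; real use would run:
#   chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}'
get_raft_index() { echo 1042; }

index_stable 2
# prints: stable
```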
### Upgrade Sequence

**Important:** Upgrade followers first and the leader last to minimize leadership changes.

#### Step 1: Identify Leader

```bash
# Get cluster status
chainfire-client --endpoint http://NODE1_IP:2379 status

# Note the leader node ID
LEADER_ID=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"
```

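The extraction above can be wrapped in a function that parses `status` output from stdin, which also makes it easy to test against captured output (the `leader: <id>` line format is assumed from the command above):

```shell
# Read `chainfire-client ... status` output on stdin and print the leader ID.
leader_from_status() {
  grep 'leader:' | awk '{print $2}' | head -1
}

# Example against captured status output:
printf 'raft_term: 7\nraft_index: 1042\nleader: 3\n' | leader_from_status
# prints: 3
```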
#### Step 2: Upgrade Follower Nodes

**For each follower node (non-leader):**

```bash
# SSH to follower node
ssh follower-node-2

# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
  https://releases.centra.cloud/chainfire-server-v0.2.0

# Verify checksum (note: sha256sum -c expects two spaces before the filename)
echo "EXPECTED_SHA256  /usr/local/bin/chainfire-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new

# Stop service
sudo systemctl stop chainfire

# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak

# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify upgrade
chainfire-server --version
# Should show the new version

# Check that the node rejoined the cluster
chainfire-client --endpoint http://localhost:2379 status
# Verify: raft_index is catching up

# Wait for catchup
while true; do
  LEADER_INDEX=$(chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}')
  FOLLOWER_INDEX=$(chainfire-client --endpoint http://localhost:2379 status | grep raft_index | awk '{print $2}')
  DIFF=$((LEADER_INDEX - FOLLOWER_INDEX))

  if [ "$DIFF" -lt 10 ]; then
    echo "Follower caught up (diff: $DIFF)"
    break
  fi

  echo "Waiting for catchup... (diff: $DIFF)"
  sleep 5
done
```

**Wait 5 minutes between follower upgrades** to ensure stability.

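The catchup loop above spins forever if a follower never converges. A variant with a timeout fails the run instead (a sketch; `get_leader_index` and `get_follower_index` are stand-ins for the two `chainfire-client` queries in the loop):

```shell
# Wait until the follower lags the leader by fewer than 10 entries,
# or give up after $1 seconds, polling every $2 seconds.
wait_for_catchup() {
  local max_secs="$1" interval="$2" deadline diff
  deadline=$(( $(date +%s) + max_secs ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    diff=$(( $(get_leader_index) - $(get_follower_index) ))
    if [ "$diff" -lt 10 ]; then
      echo "caught up (diff: $diff)"
      return 0
    fi
    echo "waiting... (diff: $diff)" >&2
    sleep "$interval"
  done
  echo "timed out waiting for catchup" >&2
  return 1
}

# Stubs for illustration; real use queries the cluster as shown above.
get_leader_index()   { echo 1000; }
get_follower_index() { echo 995; }

wait_for_catchup 30 5
# prints: caught up (diff: 5)
```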
#### Step 3: Upgrade Leader Node

```bash
# SSH to leader node
ssh leader-node-1

# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
  https://releases.centra.cloud/chainfire-server-v0.2.0

# Verify checksum
echo "EXPECTED_SHA256  /usr/local/bin/chainfire-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new

# Stop service (triggers leader election)
sudo systemctl stop chainfire

# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak

# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify a new leader was elected
chainfire-client --endpoint http://FOLLOWER_IP:2379 status | grep leader
# The leader should now be one of the upgraded followers

# Verify this node rejoined
chainfire-client --endpoint http://localhost:2379 status
```

### Post-Upgrade Verification

```bash
# Check that all nodes are on the new version
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "chainfire-server --version"
done

# Verify cluster health
chainfire-client --endpoint http://ANY_NODE_IP:2379 member-list
# All nodes should show IsLearner=false, Status=healthy

# Test write operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
  put upgrade-test "upgraded-at-$(date +%s)"

# Test read operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
  get upgrade-test

# Check logs for errors
for node in node1 node2 node3; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u chainfire -n 50 --no-pager | grep -i error"
done
```

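Version checks across nodes are easy to get wrong by eyeballing; a small helper can compare them mechanically (a sketch; in real use the versions come from the `ssh ... --version` loop above):

```shell
# Succeed only if every reported version equals the expected one.
all_on_version() {
  local expected="$1" v
  shift
  for v in "$@"; do
    if [ "$v" != "$expected" ]; then
      echo "version mismatch: $v" >&2
      return 1
    fi
  done
  echo "all nodes on $expected"
}

all_on_version v0.2.0 v0.2.0 v0.2.0 v0.2.0
# prints: all nodes on v0.2.0
```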
## FlareDB Rolling Upgrade

### Pre-Upgrade Checks

```bash
# Check cluster status
flaredb-client --endpoint http://PD_IP:2379 cluster-status

# Verify all stores are online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state}'

# Check current version
flaredb-server --version

# Create backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
rocksdb_checkpoint --db=/var/lib/flaredb --checkpoint_dir="$BACKUP_DIR"
```

### Upgrade Sequence

**FlareDB supports hot upgrades** due to PD-managed placement. Upgrade stores one at a time.

#### For Each FlareDB Store:

```bash
# SSH to store node
ssh flaredb-node-1

# Download new binary
sudo wget -O /usr/local/bin/flaredb-server.new \
  https://releases.centra.cloud/flaredb-server-v0.2.0

# Verify checksum
echo "EXPECTED_SHA256  /usr/local/bin/flaredb-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/flaredb-server.new

# Stop service
sudo systemctl stop flaredb

# Backup old binary
sudo cp /usr/local/bin/flaredb-server /usr/local/bin/flaredb-server.bak

# Replace binary
sudo mv /usr/local/bin/flaredb-server.new /usr/local/bin/flaredb-server

# Start service
sudo systemctl start flaredb

# Verify the store comes back online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.id==STORE_ID) | .state'
# Should show: "Up"

# Check version
flaredb-server --version
```

**Wait for rebalancing to complete** before upgrading the next store:

```bash
# Check region health
curl http://PD_IP:2379/pd/api/v1/stats/region | jq '.count'

# Wait until there are no pending peers
while true; do
  PENDING=$(curl -s http://PD_IP:2379/pd/api/v1/stats/region | jq '.pending_peers')
  if [ "$PENDING" -eq 0 ]; then
    echo "No pending peers, safe to continue"
    break
  fi
  echo "Waiting for rebalancing... (pending: $PENDING)"
  sleep 10
done
```

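If `jq` is not installed on the operator host, the `pending_peers` field can be pulled out with `sed` instead (a sketch; the flat JSON shape is assumed from the API calls above):

```shell
# Print the integer value of "pending_peers" from JSON on stdin.
pending_peers_of() {
  sed -n 's/.*"pending_peers"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

echo '{"count": 128, "pending_peers": 0}' | pending_peers_of
# prints: 0
```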
### Post-Upgrade Verification

```bash
# Check that all stores are on the new version
for node in flaredb-node-{1..3}; do
  echo "=== $node ==="
  ssh $node "flaredb-server --version"
done

# Verify cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status

# Test write operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
  put upgrade-test "upgraded-at-$(date +%s)"

# Test read operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
  get upgrade-test

# Check logs for errors
for node in flaredb-node-{1..3}; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u flaredb -n 50 --no-pager | grep -i error"
done
```

## Automated Upgrade Script

Create `/usr/local/bin/rolling-upgrade-chainfire.sh`:

```bash
#!/bin/bash
set -euo pipefail

NEW_VERSION="$1"
EXPECTED_SHA256="$2"
BINARY_URL="https://releases.centra.cloud/chainfire-server-${NEW_VERSION}"

NODES=("node1" "node2" "node3")
LEADER_IP="node1"  # Any reachable member; used only to query cluster status

# Detect leader
echo "Detecting leader..."
LEADER_ID=$(chainfire-client --endpoint http://${LEADER_IP}:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"

# Upgrade followers first
for node in "${NODES[@]}"; do
  NODE_ID=$(ssh $node "grep 'id =' /etc/centra-cloud/chainfire.toml | head -1 | awk '{print \$3}'")

  if [ "$NODE_ID" == "$LEADER_ID" ]; then
    echo "Skipping $node (leader) for now"
    LEADER_NODE=$node
    continue
  fi

  echo "=== Upgrading $node (follower) ==="

  # Download and verify
  ssh $node "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
  ssh $node "echo '$EXPECTED_SHA256  /usr/local/bin/chainfire-server.new' | sha256sum -c"

  # Replace binary
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
  ssh $node "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
  ssh $node "sudo chmod +x /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"

  # Wait for catchup
  echo "Waiting for $node to catch up..."
  sleep 30

  # Verify
  NEW_VER=$(ssh $node "chainfire-server --version")
  echo "$node upgraded to: $NEW_VER"
done

# Upgrade leader last
echo "=== Upgrading $LEADER_NODE (leader) ==="
ssh $LEADER_NODE "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
ssh $LEADER_NODE "echo '$EXPECTED_SHA256  /usr/local/bin/chainfire-server.new' | sha256sum -c"
ssh $LEADER_NODE "sudo systemctl stop chainfire"
ssh $LEADER_NODE "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
ssh $LEADER_NODE "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo chmod +x /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo systemctl start chainfire"

echo "=== Upgrade complete ==="
echo "Verifying cluster health..."

sleep 10
chainfire-client --endpoint http://${NODES[0]}:2379 member-list

echo "All nodes upgraded successfully!"
```

**Usage:**

```bash
chmod +x /usr/local/bin/rolling-upgrade-chainfire.sh
/usr/local/bin/rolling-upgrade-chainfire.sh v0.2.0 <sha256-checksum>
```

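The script trusts its arguments; validating them before touching any node catches a mistyped or truncated checksum early. A defensive sketch (the 64-hex-character format is standard for SHA-256 digests):

```shell
# True iff $1 looks like a SHA-256 digest: exactly 64 hex characters.
valid_sha256() {
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;
  esac
  [ "${#1}" -eq 64 ]
}

# Example (this is the well-known SHA-256 of the empty string):
if valid_sha256 "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"; then
  echo "checksum format ok"
fi
# prints: checksum format ok
```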
## Rollback Procedure

If the upgrade fails or causes issues, roll back to the previous version:

### Rollback Single Node

```bash
# SSH to problematic node
ssh failing-node

# Stop service
sudo systemctl stop chainfire

# Restore old binary
sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify
chainfire-server --version
chainfire-client --endpoint http://localhost:2379 status
```

### Rollback Entire Cluster

```bash
# Roll back all nodes (reverse order: leader first, then followers)
for node in node1 node2 node3; do
  echo "=== Rolling back $node ==="
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"
  sleep 10
done

# Verify cluster health
chainfire-client --endpoint http://node1:2379 member-list
```

### Restore from Backup (Disaster Recovery)

If rollback fails, restore from backup (see `backup-restore.md`):

```bash
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Restore backup to all nodes
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh $node "sudo rm -rf /var/lib/chainfire/*"
  ssh $node "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh $node "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done

# Restore old binaries
for node in node1 node2 node3; do
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
done

# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Start followers
for node in node2 node3; do
  ssh $node "sudo systemctl start chainfire"
done

# Verify
chainfire-client --endpoint http://node1:2379 member-list
```

## Troubleshooting

### Issue: Node fails to start after upgrade

**Symptoms:**
- `systemctl status chainfire` shows failed state
- Logs show "incompatible data format" errors

**Resolution:**
```bash
# Check logs
journalctl -u chainfire -n 100 --no-pager

# If the data format is incompatible, restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.failed
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf /var/backups/chainfire/LATEST.tar.gz -C /var/lib/chainfire --strip-components=1
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo systemctl start chainfire
```

### Issue: Cluster loses quorum during upgrade

**Symptoms:**
- Writes fail with "no leader" errors
- Multiple nodes show different leaders

**Resolution:**
```bash
# Immediately roll back the in-progress upgrade
ssh UPGRADED_NODE "sudo systemctl stop chainfire"
ssh UPGRADED_NODE "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
ssh UPGRADED_NODE "sudo systemctl start chainfire"

# Wait for the cluster to stabilize
sleep 30

# Verify quorum is restored
chainfire-client --endpoint http://node1:2379 status
```

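Quorum loss during a rolling upgrade is usually a sequencing error: the procedure only tolerates one member being down at a time. The arithmetic behind that rule, as a quick sketch:

```shell
# Majority quorum for an N-member Raft cluster, and how many members
# can be down while the cluster remains available.
quorum()   { echo $(( $1 / 2 + 1 )); }
max_down() { echo $(( $1 - ($1 / 2 + 1) )); }

echo "3 nodes: quorum=$(quorum 3), tolerates $(max_down 3) down"
echo "5 nodes: quorum=$(quorum 5), tolerates $(max_down 5) down"
# prints:
# 3 nodes: quorum=2, tolerates 1 down
# 5 nodes: quorum=3, tolerates 2 down
```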
### Issue: Performance degradation after upgrade

**Symptoms:**
- Increased write latency
- Higher CPU/memory usage

**Resolution:**
```bash
# Check resource usage
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "top -bn1 | head -20"
done

# Check Raft metrics
chainfire-client --endpoint http://node1:2379 status

# If degradation is severe, consider rolling back
# If acceptable, monitor for 24 hours before proceeding
```

## Maintenance Windows

### Zero-Downtime Upgrade (Recommended)

For clusters with 3+ nodes and applications using client-side retry:
- No maintenance window required
- Upgrade during normal business hours
- Monitor closely

### Scheduled Maintenance Window

For critical production systems or clusters with fewer than 3 nodes:

```bash
# 1. Notify users 24 hours in advance
# 2. Schedule a 2-hour maintenance window
# 3. Set the service to read-only mode (if supported):
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only true

# 4. Perform the upgrade (faster without writes)

# 5. Disable read-only mode:
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only false
```

## References

- Configuration: `specifications/configuration.md`
- Backup/Restore: `docs/ops/backup-restore.md`
- Scale-Out: `docs/ops/scale-out.md`
- Release Notes: https://github.com/centra-cloud/chainfire/releases