# Rolling Upgrade Runbook

## Overview

This runbook covers rolling upgrade procedures for Chainfire and FlareDB clusters to minimize downtime and maintain data availability during version upgrades.

## Prerequisites

### Pre-Upgrade Checklist

- ✅ New version tested in staging environment
- ✅ Backup of all nodes completed (see `backup-restore.md`)
- ✅ Release notes reviewed for breaking changes
- ✅ Rollback plan prepared
- ✅ Maintenance window scheduled (if required)

### Compatibility Requirements

- ✅ New version is compatible with current version (check release notes)
- ✅ Proto changes are backward-compatible (if applicable)
- ✅ Database schema migrations documented

### Infrastructure

- ✅ New binary built and available on all nodes
- ✅ Sufficient disk space for new binaries and data
- ✅ Monitoring and alerting functional

## Chainfire Rolling Upgrade

### Pre-Upgrade Checks

```bash
# Check cluster health
chainfire-client --endpoint http://LEADER_IP:2379 status

# Verify all nodes are healthy
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Check current version
chainfire-server --version

# Verify no ongoing operations
chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
# Wait for index to stabilize (no rapid changes)

# Create backup
/usr/local/bin/backup-chainfire.sh
```

### Upgrade Sequence

**Important:** Upgrade followers first, then the leader last to minimize leadership changes.

#### Step 1: Identify Leader

```bash
# Get cluster status
chainfire-client --endpoint http://NODE1_IP:2379 status

# Note the leader node ID
LEADER_ID=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"
```

#### Step 2: Upgrade Follower Nodes

**For each follower node (non-leader):**

```bash
# SSH to follower node
ssh follower-node-2

# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
  https://releases.centra.cloud/chainfire-server-v0.2.0

# Verify checksum (note: sha256sum -c requires two spaces between hash and path)
echo "EXPECTED_SHA256  /usr/local/bin/chainfire-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new

# Stop service
sudo systemctl stop chainfire

# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak

# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify upgrade
chainfire-server --version
# Should show new version

# Check node rejoined cluster
chainfire-client --endpoint http://localhost:2379 status
# Verify: raft_index is catching up

# Wait for catchup
while true; do
  LEADER_INDEX=$(chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}')
  FOLLOWER_INDEX=$(chainfire-client --endpoint http://localhost:2379 status | grep raft_index | awk '{print $2}')
  DIFF=$((LEADER_INDEX - FOLLOWER_INDEX))
  if [ $DIFF -lt 10 ]; then
    echo "Follower caught up (diff: $DIFF)"
    break
  fi
  echo "Waiting for catchup... (diff: $DIFF)"
  sleep 5
done
```

**Wait 5 minutes between follower upgrades** to ensure stability.
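If you script these pauses, it is safer to gate on cluster state than on the clock alone. Below is a hypothetical helper (not part of the `chainfire-client` CLI) that reuses the `raft_index` field from `status` output shown above; it also works for the "wait for index to stabilize" step in the pre-upgrade checks. The 30-second sample window and 100-entry threshold are illustrative defaults, not documented values:

```bash
# Hypothetical stability probe: samples raft_index twice, 30s apart, and
# treats fewer than THRESHOLD new entries as "quiet". Assumes the
# `status | grep raft_index` output format used throughout this runbook.
raft_index() {
  chainfire-client --endpoint "http://$1:2379" status | grep raft_index | awk '{print $2}'
}

wait_until_quiet() {
  local node="$1" threshold="${2:-100}"
  while true; do
    local before after
    before=$(raft_index "$node")
    sleep 30
    after=$(raft_index "$node")
    if [ $(( after - before )) -lt "$threshold" ]; then
      echo "raft_index quiet on $node ($before -> $after)"
      return 0
    fi
    echo "raft_index still moving on $node ($before -> $after); waiting..."
  done
}

# Example: wait_until_quiet node2 100
```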
#### Step 3: Upgrade Leader Node

```bash
# SSH to leader node
ssh leader-node-1

# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
  https://releases.centra.cloud/chainfire-server-v0.2.0

# Verify checksum
echo "EXPECTED_SHA256  /usr/local/bin/chainfire-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new

# Stop service (triggers leader election)
sudo systemctl stop chainfire

# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak

# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify new leader elected
chainfire-client --endpoint http://FOLLOWER_IP:2379 status | grep leader
# Leader should be one of the upgraded followers

# Verify this node rejoined
chainfire-client --endpoint http://localhost:2379 status
```

### Post-Upgrade Verification

```bash
# Check all nodes are on new version
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "chainfire-server --version"
done

# Verify cluster health
chainfire-client --endpoint http://ANY_NODE_IP:2379 member-list
# All nodes should show IsLearner=false, Status=healthy

# Test write operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
  put upgrade-test "upgraded-at-$(date +%s)"

# Test read operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
  get upgrade-test

# Check logs for errors
for node in node1 node2 node3; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u chainfire -n 50 --no-pager | grep -i error"
done
```

## FlareDB Rolling Upgrade

### Pre-Upgrade Checks

```bash
# Check cluster status
flaredb-client --endpoint http://PD_IP:2379 cluster-status

# Verify all stores are online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state}'

# Check current version
flaredb-server --version

# Create backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
rocksdb_checkpoint --db=/var/lib/flaredb --checkpoint_dir="$BACKUP_DIR"
```

### Upgrade Sequence

**FlareDB supports hot upgrades** due to PD-managed placement. Upgrade stores one at a time.

#### For Each FlareDB Store:

```bash
# SSH to store node
ssh flaredb-node-1

# Download new binary
sudo wget -O /usr/local/bin/flaredb-server.new \
  https://releases.centra.cloud/flaredb-server-v0.2.0

# Verify checksum
echo "EXPECTED_SHA256  /usr/local/bin/flaredb-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/flaredb-server.new

# Stop service
sudo systemctl stop flaredb

# Backup old binary
sudo cp /usr/local/bin/flaredb-server /usr/local/bin/flaredb-server.bak

# Replace binary
sudo mv /usr/local/bin/flaredb-server.new /usr/local/bin/flaredb-server

# Start service
sudo systemctl start flaredb

# Verify store comes back online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.id==STORE_ID) | .state'
# Should show: "Up"

# Check version
flaredb-server --version
```

**Wait for rebalancing to complete** before upgrading the next store:

```bash
# Check region health
curl http://PD_IP:2379/pd/api/v1/stats/region | jq '.count'

# Wait until no pending peers
while true; do
  PENDING=$(curl -s http://PD_IP:2379/pd/api/v1/stats/region | jq '.pending_peers')
  if [ "$PENDING" -eq 0 ]; then
    echo "No pending peers, safe to continue"
    break
  fi
  echo "Waiting for rebalancing... (pending: $PENDING)"
  sleep 10
done
```
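One caveat with the loop above: it will spin forever if rebalancing stalls. If you run it as part of a script, a deadline is worth adding. A sketch, assuming the same `/pd/api/v1/stats/region` response shape; the `// -1` fallback guards against a missing field, and the 30-minute deadline is an arbitrary example value:

```bash
# Hedged variant of the wait loop: bails out after a deadline instead of
# spinning forever if rebalancing never settles.
DEADLINE=$(( $(date +%s) + 1800 ))
while true; do
  PENDING=$(curl -s http://PD_IP:2379/pd/api/v1/stats/region | jq '.pending_peers // -1')
  if [ "$PENDING" -eq 0 ]; then
    echo "No pending peers, safe to continue"
    break
  fi
  if [ $(date +%s) -ge $DEADLINE ]; then
    echo "Rebalancing did not settle within 30 minutes; investigate before continuing" >&2
    exit 1
  fi
  echo "Waiting for rebalancing... (pending: $PENDING)"
  sleep 10
done
```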
### Post-Upgrade Verification

```bash
# Check all stores are on new version
for node in flaredb-node-{1..3}; do
  echo "=== $node ==="
  ssh $node "flaredb-server --version"
done

# Verify cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status

# Test write operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
  put upgrade-test "upgraded-at-$(date +%s)"

# Test read operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
  get upgrade-test

# Check logs for errors
for node in flaredb-node-{1..3}; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u flaredb -n 50 --no-pager | grep -i error"
done
```

## Automated Upgrade Script

Create `/usr/local/bin/rolling-upgrade-chainfire.sh`:

```bash
#!/bin/bash
set -euo pipefail

NEW_VERSION="$1"
EXPECTED_SHA256="$2"
BINARY_URL="https://releases.centra.cloud/chainfire-server-${NEW_VERSION}"

NODES=("node1" "node2" "node3")
LEADER_IP="node1"  # any reachable node; used only to detect the actual leader

# Detect leader
echo "Detecting leader..."
LEADER_ID=$(chainfire-client --endpoint http://${LEADER_IP}:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"

# Upgrade followers first
for node in "${NODES[@]}"; do
  NODE_ID=$(ssh $node "grep 'id =' /etc/centra-cloud/chainfire.toml | head -1 | awk '{print \$3}'")

  if [ "$NODE_ID" == "$LEADER_ID" ]; then
    echo "Skipping $node (leader) for now"
    LEADER_NODE=$node
    continue
  fi

  echo "=== Upgrading $node (follower) ==="

  # Download and verify
  ssh $node "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
  ssh $node "echo '$EXPECTED_SHA256  /usr/local/bin/chainfire-server.new' | sha256sum -c"

  # Replace binary
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
  ssh $node "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
  ssh $node "sudo chmod +x /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"

  # Wait for catchup
  echo "Waiting for $node to catch up..."
  sleep 30

  # Verify
  NEW_VER=$(ssh $node "chainfire-server --version")
  echo "$node upgraded to: $NEW_VER"
done

# Upgrade leader last
echo "=== Upgrading $LEADER_NODE (leader) ==="
ssh $LEADER_NODE "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
ssh $LEADER_NODE "echo '$EXPECTED_SHA256  /usr/local/bin/chainfire-server.new' | sha256sum -c"
ssh $LEADER_NODE "sudo systemctl stop chainfire"
ssh $LEADER_NODE "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
ssh $LEADER_NODE "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo chmod +x /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo systemctl start chainfire"

echo "=== Upgrade complete ==="
echo "Verifying cluster health..."
sleep 10
chainfire-client --endpoint http://${NODES[0]}:2379 member-list

echo "All nodes upgraded successfully!"
```

**Usage:**

```bash
chmod +x /usr/local/bin/rolling-upgrade-chainfire.sh
/usr/local/bin/rolling-upgrade-chainfire.sh v0.2.0 EXPECTED_SHA256
```
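The script's flat `sleep 30` after each follower restart is its weakest point. Below is a sketch of a replacement that reuses the `raft_index` comparison from Step 2; the function name, 10-entry threshold, and retry budget are illustrative, not part of the tooling. The reference node must be a healthy node other than the one just restarted (its `raft_index` approximates the leader's):

```bash
# Hypothetical drop-in for the `sleep 30` above: poll until the restarted
# node's raft_index is within 10 entries of a healthy reference node's,
# or give up after max_tries polls.
wait_for_catchup() {
  local node="$1" reference="$2" max_tries=60
  local i ref_idx node_idx
  for (( i = 0; i < max_tries; i++ )); do
    ref_idx=$(chainfire-client --endpoint "http://${reference}:2379" status | grep raft_index | awk '{print $2}')
    node_idx=$(chainfire-client --endpoint "http://${node}:2379" status | grep raft_index | awk '{print $2}')
    if [ $(( ref_idx - node_idx )) -lt 10 ]; then
      echo "$node caught up"
      return 0
    fi
    sleep 5
  done
  echo "$node failed to catch up; aborting" >&2
  return 1
}

# In the follower loop, replace `sleep 30` with something like:
# wait_for_catchup "$node" "$REFERENCE_NODE" || exit 1
```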
## Rollback Procedure

If the upgrade fails or causes issues, roll back to the previous version:

### Rollback Single Node

```bash
# SSH to problematic node
ssh failing-node

# Stop service
sudo systemctl stop chainfire

# Restore old binary
sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify
chainfire-server --version
chainfire-client --endpoint http://localhost:2379 status
```

### Rollback Entire Cluster

```bash
# Rollback all nodes (reverse order: leader first, then followers)
for node in node1 node2 node3; do
  echo "=== Rolling back $node ==="
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"
  sleep 10
done

# Verify cluster health
chainfire-client --endpoint http://node1:2379 member-list
```

### Restore from Backup (Disaster Recovery)

If rollback fails, restore from backup (see `backup-restore.md`):

```bash
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Restore backup to all nodes
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh $node "sudo rm -rf /var/lib/chainfire/*"
  ssh $node "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh $node "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done

# Restore old binaries
for node in node1 node2 node3; do
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
done

# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Start followers
for node in node2 node3; do
  ssh $node "sudo systemctl start chainfire"
done

# Verify
chainfire-client --endpoint http://node1:2379 member-list
```

## Troubleshooting

### Issue: Node fails to start after upgrade

**Symptoms:**

- `systemctl status chainfire` shows failed state
- Logs show "incompatible data format" errors

**Resolution:**

```bash
# Check logs
journalctl -u chainfire -n 100 --no-pager

# If data format incompatible, restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.failed
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf /var/backups/chainfire/LATEST.tar.gz -C /var/lib/chainfire --strip-components=1
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo systemctl start chainfire
```

### Issue: Cluster loses quorum during upgrade

**Symptoms:**

- Writes fail with "no leader" errors
- Multiple nodes show different leaders

**Resolution:**

```bash
# Immediately roll back the in-progress upgrade
ssh UPGRADED_NODE "sudo systemctl stop chainfire"
ssh UPGRADED_NODE "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
ssh UPGRADED_NODE "sudo systemctl start chainfire"

# Wait for cluster to stabilize
sleep 30

# Verify quorum restored
chainfire-client --endpoint http://node1:2379 status
```

### Issue: Performance degradation after upgrade

**Symptoms:**

- Increased write latency
- Higher CPU/memory usage

**Resolution:**

```bash
# Check resource usage
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "top -bn1 | head -20"
done

# Check Raft metrics
chainfire-client --endpoint http://node1:2379 status

# If severe, consider rollback
# If acceptable, monitor for 24 hours before proceeding
```
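For the performance-degradation case above, a before/after latency sample is often more telling than `top` output. A rough probe, under the assumption that `put` latency is representative of your write path; the key prefix and sample size are arbitrary:

```bash
# Rough write-latency probe: time 50 sequential puts and report the average.
# The latency-test/ key prefix and sample count are arbitrary choices.
N=50
START=$(date +%s%N)
for i in $(seq 1 $N); do
  chainfire-client --endpoint http://node1:2379 \
    put "latency-test/$i" "probe" > /dev/null
done
END=$(date +%s%N)
echo "avg write latency: $(( (END - START) / N / 1000000 )) ms over $N puts"
```

Run it once before the upgrade and once after, and compare the averages rather than absolute numbers, since they include client and network overhead.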
## Maintenance Windows

### Zero-Downtime Upgrade (Recommended)

For clusters with 3+ nodes and applications using client-side retry:

- No maintenance window required
- Upgrade during normal business hours
- Monitor closely

### Scheduled Maintenance Window

For critical production systems or clusters with fewer than 3 nodes:

```bash
# 1. Notify users 24 hours in advance
# 2. Schedule 2-hour maintenance window
# 3. Set service to read-only mode (if supported):
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only true

# 4. Perform upgrade (faster without writes)

# 5. Disable read-only mode:
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only false
```

## References

- Configuration: `specifications/configuration.md`
- Backup/Restore: `docs/ops/backup-restore.md`
- Scale-Out: `docs/ops/scale-out.md`
- Release Notes: https://github.com/centra-cloud/chainfire/releases