
Rolling Upgrade Runbook

Overview

This runbook covers rolling upgrade procedures for Chainfire and FlareDB clusters to minimize downtime and maintain data availability during version upgrades.

Prerequisites

Pre-Upgrade Checklist

  • New version tested in staging environment
  • Backup of all nodes completed (see backup-restore.md; a freshness check is sketched after this list)
  • Release notes reviewed for breaking changes
  • Rollback plan prepared
  • Maintenance window scheduled (if required)
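
For the backup item, a quick freshness check can confirm a recent archive exists. This sketch assumes the /var/backups/chainfire path used later in this runbook; the 24-hour threshold is only an example:

# Confirm the newest Chainfire backup archive is under 24 hours old (example threshold)
LATEST=$(ls -t /var/backups/chainfire/*.tar.gz 2>/dev/null | head -1)
if [ -n "$LATEST" ] && [ -n "$(find "$LATEST" -mmin -1440)" ]; then
  echo "Backup OK: $LATEST"
else
  echo "ERROR: no backup newer than 24 hours found" >&2
fi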

Compatibility Requirements

  • New version is compatible with current version (check release notes; a version pre-flight is sketched after this list)
  • Proto changes are backward-compatible (if applicable)
  • Database schema migrations documented
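
Before introducing the new version, it is worth confirming every node currently runs the same version. The node names match those used elsewhere in this runbook; the expected version string below is a hypothetical example:

# Confirm all nodes run the same current version (version string is an example)
EXPECTED="chainfire-server 0.1.0"
for node in node1 node2 node3; do
  ACTUAL=$(ssh "$node" "chainfire-server --version")
  [ "$ACTUAL" = "$EXPECTED" ] || echo "WARNING: $node reports '$ACTUAL'" >&2
done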

Infrastructure

  • New binary built and available on all nodes
  • Sufficient disk space for new binaries and data (see the disk check after this list)
  • Monitoring and alerting functional
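
For the disk-space item, a minimal per-node check; the 80% threshold and the /var/lib/chainfire mount point are assumptions to adapt:

# Warn if any node's data filesystem is over 80% full (threshold is an example)
for node in node1 node2 node3; do
  USAGE=$(ssh "$node" "df --output=pcent /var/lib/chainfire | tail -1 | tr -dc '0-9'")
  echo "$node: ${USAGE}% used"
  [ "$USAGE" -lt 80 ] || echo "WARNING: $node is low on disk space" >&2
done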

Chainfire Rolling Upgrade

Pre-Upgrade Checks

# Check cluster health
chainfire-client --endpoint http://LEADER_IP:2379 status

# Verify all nodes are healthy
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Check current version
chainfire-server --version

# Verify no ongoing operations
chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
# Wait for index to stabilize (no rapid changes)

# Create backup
/usr/local/bin/backup-chainfire.sh

Upgrade Sequence

Important: Upgrade followers first and the leader last to minimize leadership changes.

Step 1: Identify Leader

# Get cluster status
chainfire-client --endpoint http://NODE1_IP:2379 status

# Note the leader node ID
LEADER_ID=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"

Step 2: Upgrade Follower Nodes

For each follower node (non-leader):

# SSH to follower node
ssh follower-node-2

# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
  https://releases.centra.cloud/chainfire-server-v0.2.0

# Verify checksum (note the two spaces between the hash and the path)
echo "EXPECTED_SHA256  /usr/local/bin/chainfire-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new

# Stop service
sudo systemctl stop chainfire

# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak

# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify upgrade
chainfire-server --version
# Should show new version

# Check node rejoined cluster
chainfire-client --endpoint http://localhost:2379 status
# Verify: raft_index is catching up

# Wait for catchup
while true; do
  LEADER_INDEX=$(chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}')
  FOLLOWER_INDEX=$(chainfire-client --endpoint http://localhost:2379 status | grep raft_index | awk '{print $2}')
  DIFF=$((LEADER_INDEX - FOLLOWER_INDEX))

  if [ $DIFF -lt 10 ]; then
    echo "Follower caught up (diff: $DIFF)"
    break
  fi

  echo "Waiting for catchup... (diff: $DIFF)"
  sleep 5
done

Wait 5 minutes between follower upgrades to ensure stability.
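
Instead of an unconditional pause, you can gate on a health probe; a minimal sketch using the status command from the steps above:

# Pause between follower upgrades, then confirm the upgraded node still responds
sleep 300
chainfire-client --endpoint http://localhost:2379 status || {
  echo "ERROR: node unhealthy after upgrade; investigate before continuing" >&2
  exit 1
}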

Step 3: Upgrade Leader Node

# SSH to leader node
ssh leader-node-1

# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
  https://releases.centra.cloud/chainfire-server-v0.2.0

# Verify checksum (note the two spaces between the hash and the path)
echo "EXPECTED_SHA256  /usr/local/bin/chainfire-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new

# Stop service (triggers leader election)
sudo systemctl stop chainfire

# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak

# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify new leader elected
chainfire-client --endpoint http://FOLLOWER_IP:2379 status | grep leader
# Leader should be one of the upgraded followers

# Verify this node rejoined
chainfire-client --endpoint http://localhost:2379 status

Post-Upgrade Verification

# Check all nodes are on new version
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "chainfire-server --version"
done

# Verify cluster health
chainfire-client --endpoint http://ANY_NODE_IP:2379 member-list
# All nodes should show IsLearner=false, Status=healthy

# Test write operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
  put upgrade-test "upgraded-at-$(date +%s)"

# Test read operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
  get upgrade-test

# Check logs for errors
for node in node1 node2 node3; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u chainfire -n 50 --no-pager | grep -i error"
done

FlareDB Rolling Upgrade

Pre-Upgrade Checks

# Check cluster status
flaredb-client --endpoint http://PD_IP:2379 cluster-status

# Verify all stores are online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state}'

# Check current version
flaredb-server --version

# Create backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
rocksdb_checkpoint --db=/var/lib/flaredb --checkpoint_dir="$BACKUP_DIR"

Upgrade Sequence

FlareDB supports hot upgrades due to PD-managed placement. Upgrade stores one at a time.
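
To drive that loop, first enumerate the store IDs from PD; this assumes the same /pd/api/v1/stores response shape used in the checks above:

# List the store IDs to upgrade, one at a time
curl -s http://PD_IP:2379/pd/api/v1/stores | jq -r '.stores[].id'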

For Each FlareDB Store:

# SSH to store node
ssh flaredb-node-1

# Download new binary
sudo wget -O /usr/local/bin/flaredb-server.new \
  https://releases.centra.cloud/flaredb-server-v0.2.0

# Verify checksum (note the two spaces between the hash and the path)
echo "EXPECTED_SHA256  /usr/local/bin/flaredb-server.new" | sha256sum -c

# Make executable
sudo chmod +x /usr/local/bin/flaredb-server.new

# Stop service
sudo systemctl stop flaredb

# Backup old binary
sudo cp /usr/local/bin/flaredb-server /usr/local/bin/flaredb-server.bak

# Replace binary
sudo mv /usr/local/bin/flaredb-server.new /usr/local/bin/flaredb-server

# Start service
sudo systemctl start flaredb

# Verify store comes back online (replace STORE_ID with this node's numeric store ID)
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.id==STORE_ID) | .state'
# Should show: "Up"

# Check version
flaredb-server --version

Wait for rebalancing to complete before upgrading next store:

# Check region health
curl http://PD_IP:2379/pd/api/v1/stats/region | jq '.count'

# Wait until no pending peers
while true; do
  PENDING=$(curl -s http://PD_IP:2379/pd/api/v1/stats/region | jq '.pending_peers // 0')
  if [ "$PENDING" -eq 0 ]; then
    echo "No pending peers, safe to continue"
    break
  fi
  echo "Waiting for rebalancing... (pending: $PENDING)"
  sleep 10
done

Post-Upgrade Verification

# Check all stores are on new version
for node in flaredb-node-{1..3}; do
  echo "=== $node ==="
  ssh $node "flaredb-server --version"
done

# Verify cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status

# Test write operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
  put upgrade-test "upgraded-at-$(date +%s)"

# Test read operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
  get upgrade-test

# Check logs for errors
for node in flaredb-node-{1..3}; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u flaredb -n 50 --no-pager | grep -i error"
done

Automated Upgrade Script

Create /usr/local/bin/rolling-upgrade-chainfire.sh:

#!/bin/bash
set -euo pipefail

NEW_VERSION="$1"
BINARY_URL="https://releases.centra.cloud/chainfire-server-${NEW_VERSION}"
EXPECTED_SHA256="$2"

NODES=("node1" "node2" "node3")
LEADER_IP="node1"  # Will be detected dynamically

# Detect leader
echo "Detecting leader..."
LEADER_ID=$(chainfire-client --endpoint http://${LEADER_IP}:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"

# Upgrade followers first
for node in "${NODES[@]}"; do
  NODE_ID=$(ssh $node "grep 'id =' /etc/centra-cloud/chainfire.toml | head -1 | awk '{print \$3}'")

  if [ "$NODE_ID" == "$LEADER_ID" ]; then
    echo "Skipping $node (leader) for now"
    LEADER_NODE=$node
    continue
  fi

  echo "=== Upgrading $node (follower) ==="

  # Download and verify
  ssh $node "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
  ssh $node "echo '$EXPECTED_SHA256 /usr/local/bin/chainfire-server.new' | sha256sum -c"

  # Replace binary
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
  ssh $node "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
  ssh $node "sudo chmod +x /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"

  # Wait for catchup
  echo "Waiting for $node to catch up..."
  sleep 30

  # Verify
  NEW_VER=$(ssh $node "chainfire-server --version")
  echo "$node upgraded to: $NEW_VER"
done

# Upgrade leader last
if [ -z "$LEADER_NODE" ]; then
  echo "ERROR: leader not found among NODES; aborting" >&2
  exit 1
fi
echo "=== Upgrading $LEADER_NODE (leader) ==="
ssh $LEADER_NODE "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
ssh $LEADER_NODE "echo '$EXPECTED_SHA256 /usr/local/bin/chainfire-server.new' | sha256sum -c"
ssh $LEADER_NODE "sudo systemctl stop chainfire"
ssh $LEADER_NODE "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
ssh $LEADER_NODE "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo chmod +x /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo systemctl start chainfire"

echo "=== Upgrade complete ==="
echo "Verifying cluster health..."

sleep 10
chainfire-client --endpoint http://${NODES[0]}:2379 member-list

echo "All nodes upgraded successfully!"

Usage:

chmod +x /usr/local/bin/rolling-upgrade-chainfire.sh
/usr/local/bin/rolling-upgrade-chainfire.sh v0.2.0 <sha256-checksum>

Rollback Procedure

If the upgrade fails or causes issues, roll back to the previous version:

Rollback Single Node

# SSH to problematic node
ssh failing-node

# Stop service
sudo systemctl stop chainfire

# Restore old binary
sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server

# Start service
sudo systemctl start chainfire

# Verify
chainfire-server --version
chainfire-client --endpoint http://localhost:2379 status

Rollback Entire Cluster

# Roll back all nodes in reverse upgrade order (leader first, then followers);
# adjust the node order below so the leader comes first
for node in node1 node2 node3; do
  echo "=== Rolling back $node ==="
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"
  sleep 10
done

# Verify cluster health
chainfire-client --endpoint http://node1:2379 member-list

Restore from Backup (Disaster Recovery)

If rollback fails, restore from backup (see backup-restore.md):

# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done

# Restore backup to all nodes
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh $node "sudo rm -rf /var/lib/chainfire/*"
  ssh $node "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh $node "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done

# Restore old binaries
for node in node1 node2 node3; do
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
done

# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10

# Start followers
for node in node2 node3; do
  ssh $node "sudo systemctl start chainfire"
done

# Verify
chainfire-client --endpoint http://node1:2379 member-list

Troubleshooting

Issue: Node fails to start after upgrade

Symptoms:

  • systemctl status chainfire shows failed state
  • Logs show "incompatible data format" errors

Resolution:

# Check logs
journalctl -u chainfire -n 100 --no-pager

# If data format incompatible, restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.failed
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf /var/backups/chainfire/LATEST.tar.gz -C /var/lib/chainfire --strip-components=1
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo systemctl start chainfire

Issue: Cluster loses quorum during upgrade

Symptoms:

  • Writes fail with "no leader" errors
  • Multiple nodes show different leaders

Resolution:

# Immediately roll back the in-progress upgrade
ssh UPGRADED_NODE "sudo systemctl stop chainfire"
ssh UPGRADED_NODE "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
ssh UPGRADED_NODE "sudo systemctl start chainfire"

# Wait for cluster to stabilize
sleep 30

# Verify quorum restored
chainfire-client --endpoint http://node1:2379 status

Issue: Performance degradation after upgrade

Symptoms:

  • Increased write latency
  • Higher CPU/memory usage

Resolution:

# Check resource usage
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "top -bn1 | head -20"
done

# Check Raft metrics
chainfire-client --endpoint http://node1:2379 status

# If severe, consider rollback
# If acceptable, monitor for 24 hours before proceeding
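
One rough way to monitor after the upgrade is to sample write latency with a throwaway key; the key name and one-minute interval below are illustrative:

# Sample write latency once a minute with a throwaway key; Ctrl-C to stop
while true; do
  START=$(date +%s%N)
  chainfire-client --endpoint http://node1:2379 put latency-probe "$(date +%s)"
  END=$(date +%s%N)
  echo "$(date -Is) put latency: $(( (END - START) / 1000000 )) ms"
  sleep 60
done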

Maintenance Windows

No Maintenance Window

For clusters with 3+ nodes and applications that use client-side retry:

  • No maintenance window required
  • Upgrade during normal business hours
  • Monitor closely (a watch loop is sketched below)
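
A simple way to monitor during an in-hours upgrade is to keep a live view of cluster membership open in a second terminal:

# Refresh cluster membership every 30 seconds during the upgrade; Ctrl-C to stop
watch -n 30 "chainfire-client --endpoint http://node1:2379 member-list"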

Scheduled Maintenance Window

For critical production systems or clusters with fewer than 3 nodes:

# 1. Notify users 24 hours in advance
# 2. Schedule 2-hour maintenance window
# 3. Set service to read-only mode (if supported):
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only true

# 4. Perform upgrade (faster without writes)

# 5. Disable read-only mode:
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only false

References