# Rolling Upgrade Runbook
## Overview
This runbook covers rolling upgrade procedures for Chainfire and FlareDB clusters to minimize downtime and maintain data availability during version upgrades.
## Prerequisites
### Pre-Upgrade Checklist
- ✅ New version tested in staging environment
- ✅ Backup of all nodes completed (see `backup-restore.md`)
- ✅ Release notes reviewed for breaking changes
- ✅ Rollback plan prepared
- ✅ Maintenance window scheduled (if required)
### Compatibility Requirements
- ✅ New version is compatible with current version (check release notes)
- ✅ Proto changes are backward-compatible (if applicable)
- ✅ Database schema migrations documented
### Infrastructure
- ✅ New binary built and available on all nodes
- ✅ Sufficient disk space for new binaries and data
- ✅ Monitoring and alerting functional (see the preflight sketch below)
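A minimal preflight sketch for the items above, run from an operator host. Node names, paths, and the checks themselves are illustrative:
```bash
# Spot-check each node before touching anything.
for node in node1 node2 node3; do
  echo "=== $node ==="
  # Free space where binaries and data live
  ssh "$node" "df -h /usr/local/bin /var/lib"
  # Service should already be healthy before the upgrade starts
  ssh "$node" "systemctl is-active chainfire"
done
```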
## Chainfire Rolling Upgrade
### Pre-Upgrade Checks
```bash
# Check cluster health
chainfire-client --endpoint http://LEADER_IP:2379 status
# Verify all nodes are healthy
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# Check current version
chainfire-server --version
# Verify no ongoing operations
chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
# Wait for index to stabilize (no rapid changes)
# Create backup
/usr/local/bin/backup-chainfire.sh
```
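The "wait for the index to stabilize" step above can be scripted. A sketch that samples `raft_index` twice; the 5-second window and the threshold of 100 are illustrative:
```bash
# Two samples of raft_index; a small delta means little ongoing write traffic.
IDX1=$(chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}')
sleep 5
IDX2=$(chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}')
if [ $((IDX2 - IDX1)) -lt 100 ]; then  # threshold is illustrative
  echo "Index stable (delta: $((IDX2 - IDX1))), safe to proceed"
else
  echo "Index moving quickly, wait for write load to subside" >&2
fi
```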
### Upgrade Sequence
**Important:** Upgrade the followers first and the leader last, to minimize leadership changes.
#### Step 1: Identify Leader
```bash
# Get cluster status
chainfire-client --endpoint http://NODE1_IP:2379 status
# Note the leader node ID
LEADER_ID=$(chainfire-client --endpoint http://NODE1_IP:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"
```
#### Step 2: Upgrade Follower Nodes
**For each follower node (non-leader):**
```bash
# SSH to follower node
ssh follower-node-2
# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
https://releases.centra.cloud/chainfire-server-v0.2.0
# Verify checksum
echo "EXPECTED_SHA256 /usr/local/bin/chainfire-server.new" | sha256sum -c
# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new
# Stop service
sudo systemctl stop chainfire
# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak
# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server
# Start service
sudo systemctl start chainfire
# Verify upgrade
chainfire-server --version
# Should show new version
# Check node rejoined cluster
chainfire-client --endpoint http://localhost:2379 status
# Verify: raft_index is catching up
# Wait for catchup
while true; do
  LEADER_INDEX=$(chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index | awk '{print $2}')
  FOLLOWER_INDEX=$(chainfire-client --endpoint http://localhost:2379 status | grep raft_index | awk '{print $2}')
  DIFF=$((LEADER_INDEX - FOLLOWER_INDEX))
  if [ "$DIFF" -lt 10 ]; then
    echo "Follower caught up (diff: $DIFF)"
    break
  fi
  echo "Waiting for catchup... (diff: $DIFF)"
  sleep 5
done
```
**Wait 5 minutes between follower upgrades** to ensure stability.
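One way to script that pause, re-checking membership before moving to the next follower (endpoint placeholder as above):
```bash
# Pause between follower upgrades, then confirm the cluster still reports healthy.
sleep 300
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# Continue only if every member shows Status=healthy
```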
#### Step 3: Upgrade Leader Node
```bash
# SSH to leader node
ssh leader-node-1
# Download new binary
sudo wget -O /usr/local/bin/chainfire-server.new \
https://releases.centra.cloud/chainfire-server-v0.2.0
# Verify checksum
echo "EXPECTED_SHA256 /usr/local/bin/chainfire-server.new" | sha256sum -c
# Make executable
sudo chmod +x /usr/local/bin/chainfire-server.new
# Stop service (triggers leader election)
sudo systemctl stop chainfire
# Backup old binary
sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak
# Replace binary
sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server
# Start service
sudo systemctl start chainfire
# Verify new leader elected
chainfire-client --endpoint http://FOLLOWER_IP:2379 status | grep leader
# Leader should be one of the upgraded followers
# Verify this node rejoined
chainfire-client --endpoint http://localhost:2379 status
```
### Post-Upgrade Verification
```bash
# Check all nodes are on new version
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "chainfire-server --version"
done
# Verify cluster health
chainfire-client --endpoint http://ANY_NODE_IP:2379 member-list
# All nodes should show IsLearner=false, Status=healthy
# Test write operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
put upgrade-test "upgraded-at-$(date +%s)"
# Test read operation
chainfire-client --endpoint http://ANY_NODE_IP:2379 \
get upgrade-test
# Check logs for errors
for node in node1 node2 node3; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u chainfire -n 50 --no-pager | grep -i error"
done
```
## FlareDB Rolling Upgrade
### Pre-Upgrade Checks
```bash
# Check cluster status
flaredb-client --endpoint http://PD_IP:2379 cluster-status
# Verify all stores are online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | {id, state}'
# Check current version
flaredb-server --version
# Create backup
BACKUP_DIR="/var/backups/flaredb/$(date +%Y%m%d-%H%M%S)"
rocksdb_checkpoint --db=/var/lib/flaredb --checkpoint_dir="$BACKUP_DIR"
```
### Upgrade Sequence
**FlareDB supports hot upgrades** due to PD-managed placement. Upgrade stores one at a time.
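The store IDs that drive the per-store loop below can be listed from PD up front, using the same stores API as the pre-upgrade checks:
```bash
# Enumerate stores and their states; upgrade them one at a time in this order.
curl -s http://PD_IP:2379/pd/api/v1/stores | jq -r '.stores[] | "\(.id) \(.state)"'
```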
#### For Each FlareDB Store:
```bash
# SSH to store node
ssh flaredb-node-1
# Download new binary
sudo wget -O /usr/local/bin/flaredb-server.new \
https://releases.centra.cloud/flaredb-server-v0.2.0
# Verify checksum
echo "EXPECTED_SHA256 /usr/local/bin/flaredb-server.new" | sha256sum -c
# Make executable
sudo chmod +x /usr/local/bin/flaredb-server.new
# Stop service
sudo systemctl stop flaredb
# Backup old binary
sudo cp /usr/local/bin/flaredb-server /usr/local/bin/flaredb-server.bak
# Replace binary
sudo mv /usr/local/bin/flaredb-server.new /usr/local/bin/flaredb-server
# Start service
sudo systemctl start flaredb
# Verify store comes back online
curl http://PD_IP:2379/pd/api/v1/stores | jq '.stores[] | select(.id==STORE_ID) | .state'
# Should show: "Up"
# Check version
flaredb-server --version
```
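The `STORE_ID` placeholder above has to be resolved per node. A hypothetical lookup, assuming the PD stores payload also exposes an `address` field (not shown in the earlier output):
```bash
# Find this node's store ID by matching its address (field name is an assumption).
STORE_ID=$(curl -s http://PD_IP:2379/pd/api/v1/stores \
  | jq -r '.stores[] | select(.address | startswith("NODE_IP")) | .id')
echo "Store ID for this node: $STORE_ID"
```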
**Wait for rebalancing to complete** before upgrading next store:
```bash
# Check region health
curl http://PD_IP:2379/pd/api/v1/stats/region | jq '.count'
# Wait until no pending peers
while true; do
  PENDING=$(curl -s http://PD_IP:2379/pd/api/v1/stats/region | jq '.pending_peers')
  if [ "$PENDING" -eq 0 ]; then
    echo "No pending peers, safe to continue"
    break
  fi
  echo "Waiting for rebalancing... (pending: $PENDING)"
  sleep 10
done
```
### Post-Upgrade Verification
```bash
# Check all stores are on new version
for node in flaredb-node-{1..3}; do
  echo "=== $node ==="
  ssh $node "flaredb-server --version"
done
# Verify cluster health
flaredb-client --endpoint http://PD_IP:2379 cluster-status
# Test write operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
put upgrade-test "upgraded-at-$(date +%s)"
# Test read operation
flaredb-client --endpoint http://ANY_STORE_IP:2379 \
get upgrade-test
# Check logs for errors
for node in flaredb-node-{1..3}; do
  echo "=== $node logs ==="
  ssh $node "journalctl -u flaredb -n 50 --no-pager | grep -i error"
done
```
## Automated Upgrade Script
Create `/usr/local/bin/rolling-upgrade-chainfire.sh`:
```bash
#!/bin/bash
set -euo pipefail
NEW_VERSION="$1"
BINARY_URL="https://releases.centra.cloud/chainfire-server-${NEW_VERSION}"
EXPECTED_SHA256="$2"
NODES=("node1" "node2" "node3")
LEADER_IP="node1"  # Any reachable node; only used to query cluster status
LEADER_NODE=""
# Detect leader
echo "Detecting leader..."
LEADER_ID=$(chainfire-client --endpoint http://${LEADER_IP}:2379 status | grep 'leader:' | awk '{print $2}')
echo "Leader is node $LEADER_ID"
# Upgrade followers first
for node in "${NODES[@]}"; do
  NODE_ID=$(ssh "$node" "grep 'id =' /etc/centra-cloud/chainfire.toml | head -1 | awk '{print \$3}'")
  if [ "$NODE_ID" == "$LEADER_ID" ]; then
    echo "Skipping $node (leader) for now"
    LEADER_NODE=$node
    continue
  fi
  echo "=== Upgrading $node (follower) ==="
  # Download and verify
  ssh "$node" "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
  ssh "$node" "echo '$EXPECTED_SHA256 /usr/local/bin/chainfire-server.new' | sha256sum -c"
  # Replace binary
  ssh "$node" "sudo systemctl stop chainfire"
  ssh "$node" "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
  ssh "$node" "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
  ssh "$node" "sudo chmod +x /usr/local/bin/chainfire-server"
  ssh "$node" "sudo systemctl start chainfire"
  # Wait for catchup
  echo "Waiting for $node to catch up..."
  sleep 30
  # Verify
  NEW_VER=$(ssh "$node" "chainfire-server --version")
  echo "$node upgraded to: $NEW_VER"
done
# Upgrade leader last
if [ -z "$LEADER_NODE" ]; then
  echo "ERROR: leader not detected among NODES" >&2
  exit 1
fi
echo "=== Upgrading $LEADER_NODE (leader) ==="
ssh $LEADER_NODE "sudo wget -q -O /usr/local/bin/chainfire-server.new '$BINARY_URL'"
ssh $LEADER_NODE "echo '$EXPECTED_SHA256 /usr/local/bin/chainfire-server.new' | sha256sum -c"
ssh $LEADER_NODE "sudo systemctl stop chainfire"
ssh $LEADER_NODE "sudo cp /usr/local/bin/chainfire-server /usr/local/bin/chainfire-server.bak"
ssh $LEADER_NODE "sudo mv /usr/local/bin/chainfire-server.new /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo chmod +x /usr/local/bin/chainfire-server"
ssh $LEADER_NODE "sudo systemctl start chainfire"
echo "=== Upgrade complete ==="
echo "Verifying cluster health..."
sleep 10
chainfire-client --endpoint http://${NODES[0]}:2379 member-list
echo "All nodes upgraded successfully!"
```
**Usage:**
```bash
chmod +x /usr/local/bin/rolling-upgrade-chainfire.sh
/usr/local/bin/rolling-upgrade-chainfire.sh v0.2.0 <sha256-checksum>
```
## Rollback Procedure
If upgrade fails or causes issues, rollback to previous version:
### Rollback Single Node
```bash
# SSH to problematic node
ssh failing-node
# Stop service
sudo systemctl stop chainfire
# Restore old binary
sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server
# Start service
sudo systemctl start chainfire
# Verify
chainfire-server --version
chainfire-client --endpoint http://localhost:2379 status
```
### Rollback Entire Cluster
```bash
# Roll back all nodes, current leader first, then followers
# (reverse of the upgrade order; put the current leader first in the list)
for node in node1 node2 node3; do
  echo "=== Rolling back $node ==="
  ssh $node "sudo systemctl stop chainfire"
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
  ssh $node "sudo systemctl start chainfire"
  sleep 10
done
# Verify cluster health
chainfire-client --endpoint http://node1:2379 member-list
```
### Restore from Backup (Disaster Recovery)
If rollback fails, restore from backup (see `backup-restore.md`):
```bash
# Stop all nodes
for node in node1 node2 node3; do
  ssh $node "sudo systemctl stop chainfire"
done
# Restore backup to all nodes
BACKUP="/var/backups/chainfire/20251210-020000.tar.gz"
for node in node1 node2 node3; do
  scp "$BACKUP" "$node:/tmp/restore.tar.gz"
  ssh $node "sudo rm -rf /var/lib/chainfire/*"
  ssh $node "sudo tar -xzf /tmp/restore.tar.gz -C /var/lib/chainfire --strip-components=1"
  ssh $node "sudo chown -R chainfire:chainfire /var/lib/chainfire"
done
# Restore old binaries
for node in node1 node2 node3; do
  ssh $node "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
done
# Start leader first
ssh node1 "sudo systemctl start chainfire"
sleep 10
# Start followers
for node in node2 node3; do
  ssh $node "sudo systemctl start chainfire"
done
# Verify
chainfire-client --endpoint http://node1:2379 member-list
```
## Troubleshooting
### Issue: Node fails to start after upgrade
**Symptoms:**
- `systemctl status chainfire` shows failed state
- Logs show "incompatible data format" errors
**Resolution:**
```bash
# Check logs
journalctl -u chainfire -n 100 --no-pager
# If data format incompatible, restore from backup
sudo systemctl stop chainfire
sudo mv /var/lib/chainfire /var/lib/chainfire.failed
sudo mkdir -p /var/lib/chainfire
sudo tar -xzf /var/backups/chainfire/LATEST.tar.gz -C /var/lib/chainfire --strip-components=1
sudo chown -R chainfire:chainfire /var/lib/chainfire
sudo systemctl start chainfire
```
### Issue: Cluster loses quorum during upgrade
**Symptoms:**
- Writes fail with "no leader" errors
- Multiple nodes show different leaders
**Resolution:**
```bash
# Immediately rollback in-progress upgrade
ssh UPGRADED_NODE "sudo systemctl stop chainfire"
ssh UPGRADED_NODE "sudo cp /usr/local/bin/chainfire-server.bak /usr/local/bin/chainfire-server"
ssh UPGRADED_NODE "sudo systemctl start chainfire"
# Wait for cluster to stabilize
sleep 30
# Verify quorum restored
chainfire-client --endpoint http://node1:2379 status
```
### Issue: Performance degradation after upgrade
**Symptoms:**
- Increased write latency
- Higher CPU/memory usage
**Resolution:**
```bash
# Check resource usage
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "top -bn1 | head -20"
done
# Check Raft metrics
chainfire-client --endpoint http://node1:2379 status
# If severe, consider rollback
# If acceptable, monitor for 24 hours before proceeding
```
## Maintenance Windows
### Zero-Downtime Upgrade (Recommended)
For clusters with 3+ nodes and applications using client-side retry (a minimal retry sketch follows the list):
- No maintenance window required
- Upgrade during normal business hours
- Monitor closely
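A minimal sketch of client-side retry around a write, for illustration only; the function name, attempt count, and backoff are arbitrary:
```bash
# Retry a put through a brief leader election instead of failing outright.
put_with_retry() {
  local key="$1" value="$2" attempt=0
  until chainfire-client --endpoint http://node1:2379 put "$key" "$value"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge 5 ]; then
      echo "put failed after $attempt attempts" >&2
      return 1
    fi
    sleep 2  # elections typically resolve within seconds
  done
}
put_with_retry upgrade-canary "ok-$(date +%s)"
```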
### Scheduled Maintenance Window
For critical production systems or <3 node clusters:
```bash
# 1. Notify users 24 hours in advance
# 2. Schedule 2-hour maintenance window
# 3. Set service to read-only mode (if supported):
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only true
# 4. Perform upgrade (faster without writes)
# 5. Disable read-only mode:
chainfire-client --endpoint http://LEADER_IP:2379 set-read-only false
```
## References
- Configuration: `specifications/configuration.md`
- Backup/Restore: `docs/ops/backup-restore.md`
- Scale-Out: `docs/ops/scale-out.md`
- Release Notes: https://github.com/centra-cloud/chainfire/releases