
Scale-Out Runbook

Overview

This runbook covers adding new nodes to Chainfire (distributed key-value store) and FlareDB (time-series database) clusters to increase capacity and fault tolerance.

Prerequisites

Infrastructure

  • New server/VM provisioned with network access to the existing cluster
  • Ports open: API (2379), Raft (2380), Gossip (2381)
  • NixOS or a compatible environment with the Rust toolchain

Certificates (if TLS enabled)

# Generate TLS certificates for new node
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Copy to new node
scp -r /etc/centra-cloud/certs/chainfire-node-N.{crt,key} new-node:/etc/centra-cloud/certs/
scp /etc/centra-cloud/certs/ca.crt new-node:/etc/centra-cloud/certs/

Configuration

  • Node ID assigned (must be unique cluster-wide)
  • Config file prepared (/etc/centra-cloud/chainfire.toml or /etc/centra-cloud/flaredb.toml)
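
Node ID uniqueness can be sanity-checked before running member-add. A minimal sketch; the id_unique helper and the space-separated ID list are illustrative assumptions, with the existing IDs taken from member-list output:

```shell
# Return 0 if the candidate node ID does not collide with any existing ID.
# Usage: id_unique NEW_ID "EXISTING_IDS"
id_unique() {
  local new_id="$1"; shift
  for id in $@; do
    if [ "$id" = "$new_id" ]; then
      echo "node id $new_id already in use"
      return 1
    fi
  done
  echo "node id $new_id is free"
}

# Example: cluster currently has nodes 1, 2, 3
id_unique 4 "1 2 3"   # prints: node id 4 is free
```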

Chainfire Scale-Out

Step 1: Prepare New Node Configuration

Create /etc/centra-cloud/chainfire.toml on the new node:

[node]
id = 4  # NEW NODE ID (must be unique)
name = "chainfire-node-4"
role = "control_plane"

[cluster]
id = 1
bootstrap = false  # IMPORTANT: Do not bootstrap
initial_members = []  # Leave empty for join flow

[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"

[network.tls]  # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/chainfire-node-4.crt"
key_file = "/etc/centra-cloud/certs/chainfire-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[storage]
data_dir = "/var/lib/chainfire"

[raft]
role = "voter"  # or "learner" for non-voting replica

Step 2: Start New Node Server

# On new node
cd /path/to/chainfire
nix develop -c cargo run --release --bin chainfire-server -- \
  --config /etc/centra-cloud/chainfire.toml

# Verify server is listening
ss -tlnp | grep -E '2379|2380'
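
Running the server in a nix develop shell works for a first boot, but the troubleshooting and rollback steps later in this runbook use systemctl stop chainfire and journalctl -u chainfire, which assume a systemd unit. A minimal unit sketch; the binary path and unit layout are assumptions, adjust to where your build actually lands:

```ini
# /etc/systemd/system/chainfire.service (sketch; binary path is an assumption)
[Unit]
Description=Chainfire KV node
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/chainfire-server --config /etc/centra-cloud/chainfire.toml
Restart=on-failure
StateDirectory=chainfire

[Install]
WantedBy=multi-user.target
```

After installing the unit: systemctl daemon-reload && systemctl enable --now chainfire.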

Step 3: Add Node to Cluster via Leader

# On existing cluster node or via chainfire-client
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-add \
  --node-id 4 \
  --peer-url NEW_NODE_IP:2380 \
  --voter  # or --learner

# Expected output:
# Node added: id=4, peer_urls=["NEW_NODE_IP:2380"]

Step 4: Verification

# Check cluster membership
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Expected output should include new node:
# ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false

# Check new node status
chainfire-client --endpoint http://NEW_NODE_IP:2379 status

# Verify:
# - leader: (should show leader node ID, e.g., 1)
# - raft_term: (should match leader)
# - raft_index: (should be catching up to leader's index)

Step 5: Promote Learner to Voter (if added as learner)

# If node was added as learner, promote after data sync
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-promote \
  --node-id 4

# Verify voting status
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# IsLearner should now be false

FlareDB Scale-Out

Step 1: Prepare New Node Configuration

Create /etc/centra-cloud/flaredb.toml on the new node:

store_id = 4  # NEW STORE ID (must be unique)
addr = "0.0.0.0:2379"
data_dir = "/var/lib/flaredb"
pd_addr = "PD_SERVER_IP:2379"  # Placement Driver address
log_level = "info"

[tls]  # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/flaredb-node-4.crt"
key_file = "/etc/centra-cloud/certs/flaredb-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[peers]
# Empty for new node - will be populated by PD

[namespace_modes]
default = "eventual"  # or "strong"

Step 2: Start New FlareDB Node

# On new node
cd /path/to/flaredb
nix develop -c cargo run --release --bin flaredb-server -- \
  --config /etc/centra-cloud/flaredb.toml

# Verify server is listening
ss -tlnp | grep 2379

Step 3: Register with Placement Driver

# PD should auto-discover the new store
# Check PD logs for registration:
journalctl -u placement-driver -f | grep "store_id=4"

# Verify store registration
curl http://PD_SERVER_IP:2379/pd/api/v1/stores

# Expected: store_id=4 should appear in list
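
The stores response can also be checked non-interactively, e.g. from a provisioning script. A sketch that assumes the PD API returns JSON containing "store_id" fields (the exact response shape is an assumption):

```shell
# Succeed if the given store_id appears in the PD /stores JSON read from stdin.
# Usage: curl -s http://PD_SERVER_IP:2379/pd/api/v1/stores | store_registered 4
store_registered() {
  grep -Eq "\"store_id\"[[:space:]]*:[[:space:]]*$1([^0-9]|$)"
}

# Example against a canned response:
echo '{"stores":[{"store_id":1},{"store_id":4}]}' | store_registered 4 \
  && echo "store 4 registered"
```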

Step 4: Verification

# Check cluster status
flaredb-client --endpoint http://PD_SERVER_IP:2379 cluster-status

# Verify new store is online:
# store_id=4, state=Up, capacity=..., available=...

# Test write/read
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  put test-key test-value
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  get test-key
# Should return: test-value

Troubleshooting

Issue: Node fails to join cluster

Symptoms:

  • member-add command hangs or times out
  • New node logs show "connection refused" errors

Resolution:

  1. Verify network connectivity:

    # From leader node
    nc -zv NEW_NODE_IP 2380
    
  2. Check firewall rules:

    # On new node
    sudo iptables -L -n | grep 2380
    
  3. Verify Raft server is listening:

    # On new node
    ss -tlnp | grep 2380
    
  4. Check TLS configuration mismatch:

    # Ensure TLS settings match between nodes
    # If leader has TLS enabled, new node must too
    

Issue: New node stuck as learner

Symptoms:

  • member-list shows IsLearner=true after expected promotion time
  • Raft index not catching up

Resolution:

  1. Check replication lag:

    # Compare leader vs new node
    chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
    chainfire-client --endpoint http://NEW_NODE_IP:2379 status | grep raft_index
    
  2. If the lag is large, wait for the new node to catch up before promoting

  3. If stuck, check new node logs for errors:

    journalctl -u chainfire -n 100
    

Issue: Cluster performance degradation after adding node

Symptoms:

  • Increased write latency after new node joins
  • Leader election instability

Resolution:

  1. Check node resources (CPU, memory, disk I/O):

    # On new node
    top
    iostat -x 1
    
  2. Verify network latency between nodes:

    # From leader to new node
    ping -c 100 NEW_NODE_IP
    # Latency should be < 10 ms within the same datacenter
    
  3. Consider adding the node as a learner first, then promoting it once replication is stable
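
The ping summary from step 2 can be parsed rather than read by hand. A sketch that reads the rtt summary line on stdin; the 10 ms budget mirrors the guidance above, and the latency_ok helper is an assumption, not part of the tooling:

```shell
# Read a ping "rtt min/avg/max/mdev = ..." line on stdin and check avg latency.
# Usage: ping -c 100 NEW_NODE_IP | tail -1 | latency_ok [BUDGET_MS]
latency_ok() {
  local budget="${1:-10}"
  local avg
  avg=$(awk -F'/' '/min\/avg\/max/ {print $5}')
  awk -v a="$avg" -v b="$budget" 'BEGIN { exit !(a+0 <= b+0) }' \
    && echo "avg ${avg} ms within ${budget} ms budget" \
    || { echo "avg ${avg} ms exceeds ${budget} ms budget"; return 1; }
}

# Example against a canned summary line:
echo 'rtt min/avg/max/mdev = 0.210/0.480/1.120/0.140 ms' | latency_ok
```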

Rollback Procedure

If scale-out causes issues, remove the new node:

# Remove node from cluster
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-remove \
  --node-id 4

# Stop server on new node
systemctl stop chainfire

# Clean up data (if needed)
rm -rf /var/lib/chainfire/*

References

  • Configuration: specifications/configuration.md
  • TLS Setup: docs/ops/troubleshooting.md#tls-issues
  • Cluster API: chainfire/proto/chainfire.proto (Cluster service)