Scale-Out Runbook
Overview
This runbook covers adding new nodes to Chainfire (distributed KV) and FlareDB (time-series DB) clusters to increase capacity and fault tolerance.
Prerequisites
Infrastructure
- ✅ New server/VM provisioned with network access to existing cluster
- ✅ Ports open: API (2379), Raft (2380), Gossip (2381)
- ✅ NixOS or compatible environment with Rust toolchain
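Before proceeding, it can help to confirm the three required ports are actually reachable from the new node. A minimal sketch, assuming `nc` is available; `EXISTING_NODE_IP` is a placeholder for any current cluster member:

```shell
# Sketch: check that the API, Raft, and Gossip ports on an existing
# cluster member are reachable from the new node.
check_ports() {
  local host="$1"; shift
  local port
  for port in "$@"; do
    # -z: scan without sending data; -w 3: three-second timeout
    if nc -z -w 3 "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port closed"
    fi
  done
}

check_ports EXISTING_NODE_IP 2379 2380 2381
```

Run this from the new node; any `closed` line points at a firewall or routing problem to fix before the join.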
Certificates (if TLS enabled)
# Generate TLS certificates for new node
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs
# Copy to new node
scp -r /etc/centra-cloud/certs/chainfire-node-N.{crt,key} new-node:/etc/centra-cloud/certs/
scp /etc/centra-cloud/certs/ca.crt new-node:/etc/centra-cloud/certs/
Configuration
- ✅ Node ID assigned (must be unique cluster-wide)
- ✅ Config file prepared (/etc/centra-cloud/chainfire.toml or /etc/centra-cloud/flaredb.toml)
Chainfire Scale-Out
Step 1: Prepare New Node Configuration
Create /etc/centra-cloud/chainfire.toml on the new node:
[node]
id = 4 # NEW NODE ID (must be unique)
name = "chainfire-node-4"
role = "control_plane"
[cluster]
id = 1
bootstrap = false # IMPORTANT: Do not bootstrap
initial_members = [] # Leave empty for join flow
[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"
[network.tls] # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/chainfire-node-4.crt"
key_file = "/etc/centra-cloud/certs/chainfire-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true
[storage]
data_dir = "/var/lib/chainfire"
[raft]
role = "voter" # or "learner" for non-voting replica
Step 2: Start New Node Server
# On new node
cd /path/to/chainfire
nix develop -c cargo run --release --bin chainfire-server -- \
--config /etc/centra-cloud/chainfire.toml
# Verify server is listening
netstat -tlnp | grep -E '2379|2380'
Step 3: Add Node to Cluster via Leader
# On existing cluster node or via chainfire-client
chainfire-client --endpoint http://LEADER_IP:2379 \
member-add \
--node-id 4 \
--peer-url NEW_NODE_IP:2380 \
--voter # or --learner
# Expected output:
# Node added: id=4, peer_urls=["NEW_NODE_IP:2380"]
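Membership changes are asynchronous, so it is worth polling until the new node actually shows up. A hedged wrapper around the two client invocations used in this runbook; the retry count and sleep interval are illustrative:

```shell
# Sketch: add the node, then poll member-list until it appears.
# Flags mirror the member-add/member-list commands shown above.
LEADER=http://LEADER_IP:2379
NODE_ID=4
PEER_URL=NEW_NODE_IP:2380

add_and_wait() {
  chainfire-client --endpoint "$LEADER" member-add \
    --node-id "$NODE_ID" --peer-url "$PEER_URL" --voter || return 1
  local i
  for i in 1 2 3 4 5 6; do
    # member-list output includes "ID=<n>" per the verification step below
    if chainfire-client --endpoint "$LEADER" member-list | grep -q "ID=$NODE_ID"; then
      echo "node $NODE_ID joined"
      return 0
    fi
    sleep 5
  done
  echo "node $NODE_ID not visible after 30s" >&2
  return 1
}
```

A non-zero exit here means the join stalled; see the troubleshooting section below.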
Step 4: Verification
# Check cluster membership
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# Expected output should include new node:
# ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false
# Check new node status
chainfire-client --endpoint http://NEW_NODE_IP:2379 status
# Verify:
# - leader: (should show leader node ID, e.g., 1)
# - raft_term: (should match leader)
# - raft_index: (should be catching up to leader's index)
Step 5: Promote Learner to Voter (if added as learner)
# If node was added as learner, promote after data sync
chainfire-client --endpoint http://LEADER_IP:2379 \
member-promote \
--node-id 4
# Verify voting status
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# IsLearner should now be false
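Promoting a learner that is still far behind the leader can destabilize the quorum, so the status check and the promotion can be combined into one guarded step. A sketch, assuming `status` prints a `raft_index` line as shown in Step 4; the lag threshold of 100 entries is an illustrative default, not a product recommendation:

```shell
# Sketch: promote a learner only once its raft_index is close to the leader's.
raft_index() {
  # Extract the numeric raft_index from the status output
  chainfire-client --endpoint "$1" status | grep raft_index | tr -dc '0-9'
}

promote_when_caught_up() {
  local leader="$1" node="$2" node_id="$3" max_lag="${4:-100}"
  local li ni
  li="$(raft_index "$leader")"
  ni="$(raft_index "$node")"
  if [ $(( li - ni )) -le "$max_lag" ]; then
    chainfire-client --endpoint "$leader" member-promote --node-id "$node_id"
  else
    echo "lag $(( li - ni )) > $max_lag; not promoting yet" >&2
    return 1
  fi
}
```

Usage: `promote_when_caught_up http://LEADER_IP:2379 http://NEW_NODE_IP:2379 4`. Re-run until it succeeds, or lower the threshold for stricter catch-up.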
FlareDB Scale-Out
Step 1: Prepare New Node Configuration
Create /etc/centra-cloud/flaredb.toml on the new node:
store_id = 4 # NEW STORE ID (must be unique)
addr = "0.0.0.0:2379"
data_dir = "/var/lib/flaredb"
pd_addr = "PD_SERVER_IP:2379" # Placement Driver address
log_level = "info"
[tls] # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/flaredb-node-4.crt"
key_file = "/etc/centra-cloud/certs/flaredb-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true
[peers]
# Empty for new node - will be populated by PD
[namespace_modes]
default = "eventual" # or "strong"
Step 2: Start New FlareDB Node
# On new node
cd /path/to/flaredb
nix develop -c cargo run --release --bin flaredb-server -- \
--config /etc/centra-cloud/flaredb.toml
# Verify server is listening
netstat -tlnp | grep 2379
Step 3: Register with Placement Driver
# PD should auto-discover the new store
# Check PD logs for registration:
journalctl -u placement-driver -f | grep "store_id=4"
# Verify store registration
curl http://PD_SERVER_IP:2379/pd/api/v1/stores
# Expected: store_id=4 should appear in list
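Since registration is driven by PD rather than by the operator, a small polling loop avoids checking the endpoint by hand. A sketch only: the exact JSON shape of `/pd/api/v1/stores` is assumed, so the match is kept loose with `grep`:

```shell
# Sketch: poll the PD stores endpoint until the new store is listed.
wait_for_store() {
  local pd="$1" store_id="$2" tries="${3:-12}"
  local i
  for i in $(seq "$tries"); do
    # -s: quiet; -f: fail on HTTP errors so a down PD counts as "not yet"
    if curl -sf "http://$pd/pd/api/v1/stores" | grep -q "\"store_id\": *$store_id"; then
      echo "store $store_id registered"
      return 0
    fi
    sleep 5
  done
  echo "store $store_id not registered" >&2
  return 1
}
```

Usage: `wait_for_store PD_SERVER_IP:2379 4`.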
Step 4: Verification
# Check cluster status
flaredb-client --endpoint http://PD_SERVER_IP:2379 cluster-status
# Verify new store is online:
# store_id=4, state=Up, capacity=..., available=...
# Test write/read
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
put test-key test-value
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
get test-key
# Should return: test-value
Troubleshooting
Issue: Node fails to join cluster
Symptoms:
- member-add command hangs or times out
- New node logs show "connection refused" errors
Resolution:
- Verify network connectivity:
# From leader node
nc -zv NEW_NODE_IP 2380
- Check firewall rules:
# On new node
sudo iptables -L -n | grep 2380
- Verify Raft server is listening:
# On new node
ss -tlnp | grep 2380
- Check for TLS configuration mismatch:
# Ensure TLS settings match between nodes
# If leader has TLS enabled, new node must too
Issue: New node stuck as learner
Symptoms:
- member-list shows IsLearner=true after the expected promotion time
- Raft index not catching up
Resolution:
- Check replication lag:
# Compare leader vs new node
chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
chainfire-client --endpoint http://NEW_NODE_IP:2379 status | grep raft_index
- If the lag is large, wait for catch-up before promoting
- If stuck, check the new node logs for errors:
journalctl -u chainfire -n 100
Issue: Cluster performance degradation after adding node
Symptoms:
- Increased write latency after new node joins
- Leader election instability
Resolution:
- Check node resources (CPU, memory, disk I/O):
# On new node
top
iostat -x 1
- Verify network latency between nodes:
# From leader to new node
ping -c 100 NEW_NODE_IP
# Latency should be < 10ms within the same datacenter
- Consider adding the node as a learner first, then promoting it once replication is stable
Rollback Procedure
If scale-out causes issues, remove the new node:
# Remove node from cluster
chainfire-client --endpoint http://LEADER_IP:2379 \
member-remove \
--node-id 4
# Stop server on new node
systemctl stop chainfire
# Clean up data (if needed)
rm -rf /var/lib/chainfire/*
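The three rollback steps above can be wrapped in one function. A sketch: `chainfire-client` and the `chainfire` systemd unit are taken from earlier in this runbook, and the destructive data wipe is gated behind an explicit `wipe` argument:

```shell
# Sketch: remove the node from the cluster, stop its server, and
# optionally wipe its data directory. Run on/against the new node.
rollback_node() {
  local leader="$1" node_id="$2" wipe="${3:-no}"
  chainfire-client --endpoint "$leader" member-remove --node-id "$node_id" || return 1
  systemctl stop chainfire
  if [ "$wipe" = "wipe" ]; then
    # Destructive: only after confirming the node is out of the cluster
    rm -rf /var/lib/chainfire/*
  fi
}
```

Usage: `rollback_node http://LEADER_IP:2379 4 wipe`. Omit the last argument to keep the data directory for post-mortem inspection.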
References
- Configuration: specifications/configuration.md
- TLS Setup: docs/ops/troubleshooting.md#tls-issues
- Cluster API: chainfire/proto/chainfire.proto (Cluster service)