# Scale-Out Runbook

## Overview

This runbook covers adding new nodes to Chainfire (distributed KV) and FlareDB (time-series DB) clusters to increase capacity and fault tolerance.
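
How much fault tolerance a scale-out actually buys follows from Raft quorum arithmetic: a cluster of `n` voters needs `floor(n/2) + 1` members to commit writes, so going from 3 to 4 voters adds capacity but no extra failure headroom. A quick sketch:

```shell
# Quorum size and failure tolerance for a cluster of n voting members
for n in 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( (n - 1) / 2 ))
  echo "n=$n quorum=$quorum tolerates=$tolerated"
done
```

This is why odd-sized voter sets are preferred, and why adding a node as a learner (non-voter) first avoids temporarily running an even-sized quorum.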

## Prerequisites

### Infrastructure

- ✅ New server/VM provisioned with network access to existing cluster
- ✅ Ports open: API (2379), Raft (2380), Gossip (2381)
- ✅ NixOS or compatible environment with Rust toolchain
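
It is worth confirming the three ports are actually reachable before going further. A minimal preflight sketch using bash's built-in `/dev/tcp` probe (assumes bash, so it works even where `nc` is not installed; `NEW_NODE_IP` is the placeholder used throughout this runbook):

```shell
# Report open/closed for each cluster port on a target host (bash /dev/tcp probe)
check_port() {
  local host=$1 port=$2
  if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed"
  fi
}

# Run from an existing cluster member against the new node
for p in 2379 2380 2381; do check_port NEW_NODE_IP "$p"; done
```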

### Certificates (if TLS enabled)

```bash
# Generate TLS certificates for the new node
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Copy them to the new node
scp /etc/centra-cloud/certs/chainfire-node-N.{crt,key} new-node:/etc/centra-cloud/certs/
scp /etc/centra-cloud/certs/ca.crt new-node:/etc/centra-cloud/certs/
```

### Configuration

- ✅ Node ID assigned (must be unique cluster-wide)
- ✅ Config file prepared (`/etc/centra-cloud/chainfire.toml` or `/etc/centra-cloud/flaredb.toml`)

## Chainfire Scale-Out

### Step 1: Prepare New Node Configuration

Create `/etc/centra-cloud/chainfire.toml` on the new node:

```toml
[node]
id = 4                      # NEW NODE ID (must be unique)
name = "chainfire-node-4"
role = "control_plane"

[cluster]
id = 1
bootstrap = false           # IMPORTANT: do not bootstrap
initial_members = []        # leave empty for the join flow

[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"

[network.tls]               # optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/chainfire-node-4.crt"
key_file = "/etc/centra-cloud/certs/chainfire-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[storage]
data_dir = "/var/lib/chainfire"

[raft]
role = "voter"              # or "learner" for a non-voting replica
```
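
Only a handful of fields vary between nodes (`id`, `name`, and the cert paths), so when adding several nodes it can help to template the config rather than hand-edit it. An illustrative sketch (abbreviated to the varying sections; the file is written to the current directory here, then copied to `/etc/centra-cloud/chainfire.toml` on the node):

```shell
# Generate a per-node config fragment from the node ID (illustrative helper)
render_config() {
  local id=$1
  cat > "chainfire-node-$id.toml" <<EOF
[node]
id = $id
name = "chainfire-node-$id"
role = "control_plane"

[cluster]
id = 1
bootstrap = false
initial_members = []
EOF
}

render_config 5
grep '^id = 5' chainfire-node-5.toml   # sanity-check the rendered node ID
```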

### Step 2: Start New Node Server

```bash
# On the new node
cd /path/to/chainfire
nix develop -c cargo run --release --bin chainfire-server -- \
  --config /etc/centra-cloud/chainfire.toml

# Verify the server is listening
netstat -tlnp | grep -E '2379|2380'
```

### Step 3: Add Node to Cluster via Leader

```bash
# On an existing cluster node, or via chainfire-client
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-add \
  --node-id 4 \
  --peer-url NEW_NODE_IP:2380 \
  --voter          # or --learner

# Expected output:
# Node added: id=4, peer_urls=["NEW_NODE_IP:2380"]
```

### Step 4: Verification

```bash
# Check cluster membership
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Expected output should include the new node:
# ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false

# Check new node status
chainfire-client --endpoint http://NEW_NODE_IP:2379 status

# Verify:
# - leader:     should show the leader node ID, e.g. 1
# - raft_term:  should match the leader
# - raft_index: should be catching up to the leader's index
```

### Step 5: Promote Learner to Voter (if added as learner)

```bash
# If the node was added as a learner, promote it after data sync
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-promote \
  --node-id 4

# Verify voting status
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# IsLearner should now be false
```
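
In automation, the promotion check can be scripted by parsing `member-list` output. A sketch, assuming the output format shown in Step 4 (the real pipeline would be `chainfire-client ... member-list | grep 'ID=4'`):

```shell
# Succeed once the member-list line for the node no longer reports it as a learner
is_promoted() {
  case "$1" in
    *"IsLearner=false"*) return 0 ;;
    *)                   return 1 ;;
  esac
}

# Demo against a sample line in the Step 4 format
line='ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false'
is_promoted "$line" && echo "node 4 promoted"
```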

## FlareDB Scale-Out

### Step 1: Prepare New Node Configuration

Create `/etc/centra-cloud/flaredb.toml` on the new node:

```toml
store_id = 4                    # NEW STORE ID (must be unique)
addr = "0.0.0.0:2379"
data_dir = "/var/lib/flaredb"
pd_addr = "PD_SERVER_IP:2379"   # Placement Driver address
log_level = "info"

[tls]                           # optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/flaredb-node-4.crt"
key_file = "/etc/centra-cloud/certs/flaredb-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[peers]
# Empty for a new node - will be populated by PD

[namespace_modes]
default = "eventual"            # or "strong"
```

### Step 2: Start New FlareDB Node

```bash
# On the new node
cd /path/to/flaredb
nix develop -c cargo run --release --bin flaredb-server -- \
  --config /etc/centra-cloud/flaredb.toml

# Verify the server is listening
netstat -tlnp | grep 2379
```

### Step 3: Register with Placement Driver

```bash
# PD should auto-discover the new store.
# Check PD logs for the registration:
journalctl -u placement-driver -f | grep "store_id=4"

# Verify store registration
curl http://PD_SERVER_IP:2379/pd/api/v1/stores

# Expected: store_id=4 should appear in the list
```
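
The registration check above can be made mechanical; a sketch using a plain `grep` to stay dependency-free (the JSON shape below is an assumption for illustration, not the documented PD response format):

```shell
# Sample /pd/api/v1/stores payload (assumed shape); in practice pipe curl output in
response='{"stores":[{"store_id":1,"state":"Up"},{"store_id":4,"state":"Up"}]}'

if echo "$response" | grep -q '"store_id":4'; then
  echo "store 4 registered"
else
  echo "store 4 missing" >&2
fi
```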

### Step 4: Verification

```bash
# Check cluster status
flaredb-client --endpoint http://PD_SERVER_IP:2379 cluster-status

# Verify the new store is online:
# store_id=4, state=Up, capacity=..., available=...

# Test write/read
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  put test-key test-value
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  get test-key
# Should return: test-value
```

## Troubleshooting

### Issue: Node fails to join cluster

**Symptoms:**
- `member-add` command hangs or times out
- New node logs show "connection refused" errors

**Resolution:**
1. Verify network connectivity:
   ```bash
   # From the leader node
   nc -zv NEW_NODE_IP 2380
   ```
2. Check firewall rules:
   ```bash
   # On the new node
   sudo iptables -L -n | grep 2380
   ```
3. Verify the Raft server is listening:
   ```bash
   # On the new node
   ss -tlnp | grep 2380
   ```
4. Check for a TLS configuration mismatch: if the leader has TLS enabled, the new node must have it enabled too, with certificates issued by the same CA.

### Issue: New node stuck as learner

**Symptoms:**
- `member-list` shows `IsLearner=true` after the expected promotion time
- Raft index not catching up

**Resolution:**
1. Check replication lag:
   ```bash
   # Compare leader vs. new node
   chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
   chainfire-client --endpoint http://NEW_NODE_IP:2379 status | grep raft_index
   ```
2. If the lag is large, wait for catch-up before promoting.
3. If stuck, check the new node's logs for errors:
   ```bash
   journalctl -u chainfire -n 100
   ```
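
The "wait for catch-up" decision in step 2 can be made concrete with a lag threshold. A sketch with sample index values (the 100-entry threshold is an assumption; tune it per workload):

```shell
# Promote only when the follower is within `threshold` Raft entries of the leader
leader_index=10500   # from: chainfire-client --endpoint http://LEADER_IP:2379 status
node_index=10480     # from: chainfire-client --endpoint http://NEW_NODE_IP:2379 status
threshold=100

lag=$(( leader_index - node_index ))
if [ "$lag" -le "$threshold" ]; then
  echo "safe to promote (lag=$lag)"
else
  echo "keep waiting (lag=$lag)"
fi
```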

### Issue: Cluster performance degradation after adding node

**Symptoms:**
- Increased write latency after the new node joins
- Leader election instability

**Resolution:**
1. Check node resources (CPU, memory, disk I/O):
   ```bash
   # On the new node
   top
   iostat -x 1
   ```
2. Verify network latency between nodes:
   ```bash
   # From the leader to the new node
   ping -c 100 NEW_NODE_IP
   # Latency should be < 10 ms within the same datacenter
   ```
3. Consider adding the node as a learner first and promoting it once replication is stable.

## Rollback Procedure

If scale-out causes issues, remove the new node:

```bash
# Remove the node from the cluster
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-remove \
  --node-id 4

# Stop the server on the new node
systemctl stop chainfire

# Clean up data (if needed)
rm -rf /var/lib/chainfire/*
```

## References

- Configuration: `specifications/configuration.md`
- TLS Setup: `docs/ops/troubleshooting.md#tls-issues`
- Cluster API: `chainfire/proto/chainfire.proto` (Cluster service)