# Scale-Out Runbook
## Overview
This runbook covers adding new nodes to Chainfire (distributed KV) and FlareDB (time-series DB) clusters to increase capacity and fault tolerance.
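Adding a voter does not always improve fault tolerance. The sketch below works through the arithmetic, assuming Chainfire uses the usual Raft majority quorum (this runbook does not state its quorum policy). The helper names `quorum` and `tolerated_failures` are illustrative only.

```bash
# Majority quorum for a cluster of N voters: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Voters that can fail while a majority is still reachable: N - quorum(N).
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
  echo "voters=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Under that assumption, going from 3 to 4 voters raises the quorum to 3 without tolerating an extra failure, while 5 voters tolerate 2. This is one reason to prefer odd voter counts or to add the extra node as a learner.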
## Prerequisites
### Infrastructure
- ✅ New server/VM provisioned with network access to existing cluster
- ✅ Ports open: API (2379), Raft (2380), Gossip (2381)
- ✅ NixOS or compatible environment with Rust toolchain
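
A rough way to probe the ports listed above is a connect-only check with `nc`, which this runbook already uses in Troubleshooting. The helper name `closed_ports` is hypothetical. The probe is TCP-only and is meaningful only once something is listening on the target; a refused connection before the server starts does not prove the firewall is closed.

```bash
# Print each port on HOST that does not accept a TCP connection.
# Uses `nc -z` (connect without sending data) with a 2-second timeout.
closed_ports() {
  host=$1
  shift
  for port in "$@"; do
    if ! nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
      echo "$port"
    fi
  done
}

# Example, run from an existing cluster node (NEW_NODE_IP is a placeholder):
#   closed_ports NEW_NODE_IP 2379 2380 2381
```

No output means every probed port accepted a connection.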
### Certificates (if TLS enabled)
```bash
# Generate TLS certificates for new node
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs
# Copy to new node
scp -r /etc/centra-cloud/certs/chainfire-node-N.{crt,key} new-node:/etc/centra-cloud/certs/
scp /etc/centra-cloud/certs/ca.crt new-node:/etc/centra-cloud/certs/
```
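After copying, it can help to confirm on the new node that all three files arrived and are non-empty. This is a minimal sketch; `missing_cert_files` is a hypothetical helper, and the file naming follows the `chainfire-node-N.{crt,key}` pattern used above.

```bash
# Print the path of every expected TLS file that is missing or empty.
# DIR is the certificate directory, NAME the node's file prefix.
missing_cert_files() {
  dir=$1
  name=$2
  for f in "$dir/$name.crt" "$dir/$name.key" "$dir/ca.crt"; do
    [ -s "$f" ] || echo "$f"
  done
}

# Example (on the new node):
#   missing_cert_files /etc/centra-cloud/certs chainfire-node-4
```

No output means the certificate, key, and CA file are all present.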
### Configuration
- ✅ Node ID assigned (must be unique cluster-wide)
- ✅ Config file prepared (`/etc/centra-cloud/chainfire.toml` or `/etc/centra-cloud/flaredb.toml`)
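
One way to check that a candidate node ID is unused is to scan the `member-list` output before joining. The sketch assumes one member per line in the `ID=<n>, Name=...` form shown in the verification step later in this runbook; `node_id_in_use` is a hypothetical helper.

```bash
# Succeeds (exit 0) if node ID $1 already appears in member-list output on stdin.
# Matches "ID=<n>" as a whole token so that ID=4 does not match ID=40 or ID=14.
node_id_in_use() {
  grep -q -E "(^|[^A-Za-z0-9_])ID=$1([^0-9]|\$)"
}

# Example (LEADER_IP is a placeholder):
#   chainfire-client --endpoint http://LEADER_IP:2379 member-list \
#     | node_id_in_use 4 && echo "node ID 4 is already taken"
```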
## Chainfire Scale-Out
### Step 1: Prepare New Node Configuration
Create `/etc/centra-cloud/chainfire.toml` on the new node:
```toml
[node]
id = 4 # NEW NODE ID (must be unique)
name = "chainfire-node-4"
role = "control_plane"
[cluster]
id = 1
bootstrap = false # IMPORTANT: Do not bootstrap
initial_members = [] # Leave empty for join flow
[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"
[network.tls] # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/chainfire-node-4.crt"
key_file = "/etc/centra-cloud/certs/chainfire-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true
[storage]
data_dir = "/var/lib/chainfire"
[raft]
role = "voter" # or "learner" for non-voting replica
```
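The two settings most likely to break a join are `bootstrap` and `initial_members`. A quick text-level check is sketched below. It is not a TOML parser, it assumes each key sits on one line as in the example above, and `check_join_config` is a hypothetical helper.

```bash
# Print a warning for each join-flow setting that looks wrong in a chainfire.toml.
check_join_config() {
  cfg=$1
  # bootstrap must be the literal value false (a trailing comment is allowed).
  grep -q -E '^[[:space:]]*bootstrap[[:space:]]*=[[:space:]]*false([[:space:]]|#|$)' "$cfg" \
    || echo "bootstrap must be false when joining an existing cluster"
  # initial_members must be an empty array: [].
  grep -q -E '^[[:space:]]*initial_members[[:space:]]*=[[:space:]]*\[[[:space:]]*\]' "$cfg" \
    || echo "initial_members should be empty for the join flow"
}

# Example:
#   check_join_config /etc/centra-cloud/chainfire.toml
```

No output means both settings match the join flow described above.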
### Step 2: Start New Node Server
```bash
# On new node
cd /path/to/chainfire
nix develop -c cargo run --release --bin chainfire-server -- \
--config /etc/centra-cloud/chainfire.toml
# Verify server is listening
netstat -tlnp | grep -E '2379|2380'
```
### Step 3: Add Node to Cluster via Leader
```bash
# From any host that has chainfire-client and can reach the leader
chainfire-client --endpoint http://LEADER_IP:2379 \
member-add \
--node-id 4 \
--peer-url NEW_NODE_IP:2380 \
--voter # or --learner
# Expected output:
# Node added: id=4, peer_urls=["NEW_NODE_IP:2380"]
```
### Step 4: Verification
```bash
# Check cluster membership
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# Expected output should include new node:
# ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false
# Check new node status
chainfire-client --endpoint http://NEW_NODE_IP:2379 status
# Verify:
# - leader: (should show leader node ID, e.g., 1)
# - raft_term: (should match leader)
# - raft_index: (should be catching up to leader's index)
```
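To put a number on "catching up", the leader's and the new node's `raft_index` can be compared. The sketch assumes `status` prints a line such as `raft_index: 1234`; the exact output format is not documented here, so adjust the parsing if it differs. `raft_index_of` and `raft_lag` are hypothetical helpers.

```bash
# Print the integer that follows "raft_index" in status output read from stdin.
raft_index_of() {
  awk '/raft_index/ {
    sub(/.*raft_index[^0-9]*/, "")
    if (match($0, /^[0-9]+/)) { print substr($0, RSTART, RLENGTH); exit }
  }'
}

# Print how many entries the node ($2) is behind the leader ($1), never negative.
raft_lag() {
  lag=$(( $1 - $2 ))
  if [ "$lag" -lt 0 ]; then lag=0; fi
  echo "$lag"
}

# Example (LEADER_IP and NEW_NODE_IP are placeholders):
#   leader_idx=$(chainfire-client --endpoint http://LEADER_IP:2379 status | raft_index_of)
#   node_idx=$(chainfire-client --endpoint http://NEW_NODE_IP:2379 status | raft_index_of)
#   raft_lag "$leader_idx" "$node_idx"
```

A lag that shrinks across repeated runs suggests the node is replicating normally.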
### Step 5: Promote Learner to Voter (if added as learner)
```bash
# If node was added as learner, promote after data sync
chainfire-client --endpoint http://LEADER_IP:2379 \
member-promote \
--node-id 4
# Verify voting status
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# IsLearner should now be false
```
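The `IsLearner` check can be scripted by pulling that field out of the `member-list` line for one node. The sketch assumes the one-line-per-member format shown in Step 4; `learner_state` is a hypothetical helper.

```bash
# Print the IsLearner value ("true" or "false") for node ID $1,
# reading member-list output from stdin. Prints nothing if the ID is absent.
learner_state() {
  awk -v id="$1" '
    $0 ~ ("(^|[^A-Za-z0-9_])ID=" id "([^0-9]|$)") {
      if (match($0, /IsLearner=(true|false)/)) {
        # Skip the 10 characters of "IsLearner=" to keep only the value.
        print substr($0, RSTART + 10, RLENGTH - 10)
        exit
      }
    }'
}

# Example (LEADER_IP is a placeholder):
#   chainfire-client --endpoint http://LEADER_IP:2379 member-list | learner_state 4
```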
## FlareDB Scale-Out
### Step 1: Prepare New Node Configuration
Create `/etc/centra-cloud/flaredb.toml` on the new node:
```toml
store_id = 4 # NEW STORE ID (must be unique)
addr = "0.0.0.0:2379"
data_dir = "/var/lib/flaredb"
pd_addr = "PD_SERVER_IP:2379" # Placement Driver address
log_level = "info"
[tls] # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/flaredb-node-4.crt"
key_file = "/etc/centra-cloud/certs/flaredb-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true
[peers]
# Empty for new node - will be populated by PD
[namespace_modes]
default = "eventual" # or "strong"
```
### Step 2: Start New FlareDB Node
```bash
# On new node
cd /path/to/flaredb
nix develop -c cargo run --release --bin flaredb-server -- \
--config /etc/centra-cloud/flaredb.toml
# Verify server is listening
netstat -tlnp | grep 2379
```
### Step 3: Register with Placement Driver
```bash
# PD should auto-discover the new store
# Check PD logs for registration:
journalctl -u placement-driver -f | grep "store_id=4"
# Verify store registration
curl http://PD_SERVER_IP:2379/pd/api/v1/stores
# Expected: store_id=4 should appear in list
```
### Step 4: Verification
```bash
# Check cluster status
flaredb-client --endpoint http://PD_SERVER_IP:2379 cluster-status
# Verify new store is online:
# store_id=4, state=Up, capacity=..., available=...
# Test write/read
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
put test-key test-value
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
get test-key
# Should return: test-value
```
## Troubleshooting
### Issue: Node fails to join cluster
**Symptoms:**
- `member-add` command hangs or times out
- New node logs show "connection refused" errors
**Resolution:**
1. Verify network connectivity:
```bash
# From leader node
nc -zv NEW_NODE_IP 2380
```
2. Check firewall rules:
```bash
# On new node
sudo iptables -L -n | grep 2380
```
3. Verify Raft server is listening:
```bash
# On new node
ss -tlnp | grep 2380
```
4. Check for a TLS configuration mismatch:
```bash
# Ensure TLS settings match between nodes
# If leader has TLS enabled, new node must too
```
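A coarse way to spot a TLS mismatch is to compare whether both configs declare the `[network.tls]` table. This assumes TLS is enabled exactly when that table is present, as the Step 1 example suggests, and that a copy of the leader's config is available locally. `tls_enabled` and `tls_settings_match` are hypothetical helpers.

```bash
# Succeeds if the config file has an uncommented [network.tls] table header.
tls_enabled() {
  grep -q -E '^[[:space:]]*\[network\.tls\]' "$1"
}

# Succeeds when both config files agree on whether TLS is configured.
tls_settings_match() {
  if tls_enabled "$1"; then first_tls=1; else first_tls=0; fi
  if tls_enabled "$2"; then second_tls=1; else second_tls=0; fi
  [ "$first_tls" = "$second_tls" ]
}

# Example (leader.toml is a local copy of the leader's config):
#   tls_settings_match leader.toml /etc/centra-cloud/chainfire.toml \
#     || echo "TLS is configured on only one side"
```

This only detects an on/off mismatch. It does not verify that the certificates chain to the same CA.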
### Issue: New node stuck as learner
**Symptoms:**
- `member-list` still shows `IsLearner=true` after promotion was expected
- The new node's Raft index is not catching up to the leader's
**Resolution:**
1. Check replication lag:
```bash
# Compare leader vs new node
chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
chainfire-client --endpoint http://NEW_NODE_IP:2379 status | grep raft_index
```
2. If the lag is large, wait for the node to catch up before promoting it
3. If stuck, check new node logs for errors:
```bash
journalctl -u chainfire -n 100
```
### Issue: Cluster performance degradation after adding node
**Symptoms:**
- Increased write latency after new node joins
- Leader election instability
**Resolution:**
1. Check node resources (CPU, memory, disk I/O):
```bash
# On new node
top
iostat -x 1
```
2. Verify network latency between nodes:
```bash
# From leader to new node
ping -c 100 NEW_NODE_IP
# Latency should be < 10ms for same datacenter
```
3. Consider adding the node as a learner first, then promoting it once it is stable
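
The 10 ms guideline can be checked mechanically by reading the average from ping's summary line. The sketch assumes the `min/avg/max` summary format printed by common `ping` implementations (for example `rtt min/avg/max/mdev = ... ms` on Linux); `ping_avg_ms` and `latency_ok` are hypothetical helpers.

```bash
# Print the average round-trip time in ms from ping output on stdin.
ping_avg_ms() {
  awk -F'=' '/min\/avg\/max/ { split($2, parts, "/"); print parts[2] + 0 }'
}

# Succeeds if the average RTT read from stdin is below the limit in ms (default 10).
latency_ok() {
  avg=$(ping_avg_ms)
  [ -n "$avg" ] && awk -v a="$avg" -v l="${1:-10}" 'BEGIN { exit !(a < l) }'
}

# Example (NEW_NODE_IP is a placeholder):
#   ping -c 100 NEW_NODE_IP | latency_ok 10 || echo "average RTT is 10 ms or more"
```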
## Rollback Procedure
If scale-out causes issues, remove the new node:
```bash
# Remove node from cluster
chainfire-client --endpoint http://LEADER_IP:2379 \
member-remove \
--node-id 4
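# Stop the server on the new node (assumes a systemd unit named chainfire;
# if it was started in the foreground as in Step 2, stop that process instead)
systemctl stop chainfire
# Clean up data on the new node (if needed)
rm -rf /var/lib/chainfire/*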
```
## References
- Configuration: `specifications/configuration.md`
- TLS Setup: `docs/ops/troubleshooting.md#tls-issues`
- Cluster API: `chainfire/proto/chainfire.proto` (Cluster service)