# Scale-Out Runbook

## Overview

This runbook covers adding new nodes to Chainfire (distributed KV) and FlareDB (time-series DB) clusters to increase capacity and fault tolerance.

## Prerequisites

### Infrastructure
- ✅ New server/VM provisioned with network access to existing cluster
- ✅ Ports open: API (2379), Raft (2380), Gossip (2381)
- ✅ NixOS or compatible environment with Rust toolchain

### Certificates (if TLS enabled)
```bash
# Generate TLS certificates for new node
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Copy to the new node (replace N with the new node's ID)
scp -r /etc/centra-cloud/certs/chainfire-node-N.{crt,key} new-node:/etc/centra-cloud/certs/
scp /etc/centra-cloud/certs/ca.crt new-node:/etc/centra-cloud/certs/
```

### Configuration
- ✅ Node ID assigned (must be unique cluster-wide)
- ✅ Config file prepared (`/etc/centra-cloud/chainfire.toml` or `/etc/centra-cloud/flaredb.toml`)
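
The uniqueness requirement can be checked against the live cluster before the new node is started. The sketch below is a hypothetical helper, not part of the Chainfire tooling: `node_id_in_use` is an invented name, and it assumes `member-list` prints one member per line in the `ID=<n>, Name=...` form shown in the verification step of this runbook. Adjust the pattern if the real output differs.

```bash
# Hypothetical helper: succeeds if node ID $1 appears in member-list
# output read from stdin.
# ASSUMPTION: one member per line, formatted as "ID=<n>, Name=...".
node_id_in_use() {
  # Match "ID=<n>" as a whole token so that ID=4 does not match ID=40.
  grep -Eq "(^|[^0-9A-Za-z_])ID=$1(,|[[:space:]]|\$)"
}

# Usage against a live cluster (not executed here):
#   if chainfire-client --endpoint http://LEADER_IP:2379 member-list \
#        | node_id_in_use 4; then
#     echo "node ID 4 is already taken" >&2
#   fi
```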

## Chainfire Scale-Out

### Step 1: Prepare New Node Configuration

Create `/etc/centra-cloud/chainfire.toml` on the new node:

```toml
[node]
id = 4  # NEW NODE ID (must be unique)
name = "chainfire-node-4"
role = "control_plane"

[cluster]
id = 1
bootstrap = false  # IMPORTANT: Do not bootstrap
initial_members = []  # Leave empty for join flow

[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"

[network.tls]  # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/chainfire-node-4.crt"
key_file = "/etc/centra-cloud/certs/chainfire-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[storage]
data_dir = "/var/lib/chainfire"

[raft]
role = "voter"  # or "learner" for non-voting replica
```
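
Because a node that bootstraps by mistake would form its own cluster instead of joining, it can be worth checking the two join-critical settings before starting the server. This is a minimal sketch using plain `grep`, not a real TOML parser; `check_join_config` is a hypothetical name, and the check assumes each key sits on its own line as in the example above.

```bash
# Hypothetical pre-flight check for a join-flow config file.
# ASSUMPTION: keys appear one per line, as in the example config;
# this is a text match, not a TOML parse.
check_join_config() {
  cfg=$1
  [ -r "$cfg" ] || { echo "cannot read $cfg" >&2; return 1; }

  # bootstrap must be explicitly false for the join flow
  grep -Eq '^[[:space:]]*bootstrap[[:space:]]*=[[:space:]]*false([[:space:]]|#|$)' "$cfg" \
    || { echo "bootstrap is not set to false in $cfg" >&2; return 1; }

  # initial_members must be an empty list
  grep -Eq '^[[:space:]]*initial_members[[:space:]]*=[[:space:]]*\[[[:space:]]*\]' "$cfg" \
    || { echo "initial_members is not empty in $cfg" >&2; return 1; }
}

# Usage (not executed here):
#   check_join_config /etc/centra-cloud/chainfire.toml && echo "config OK"
```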

### Step 2: Start New Node Server

```bash
# On new node
cd /path/to/chainfire
nix develop -c cargo run --release --bin chainfire-server -- \
  --config /etc/centra-cloud/chainfire.toml

# Verify the server is listening on all three ports
# (run from a second shell; the command above stays in the foreground)
ss -tlnp | grep -E '2379|2380|2381'
```
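
When this step is scripted, the listener check usually needs a retry loop, since the server may take a moment to bind its ports. The following is a small generic sketch; `wait_until` is a hypothetical helper name, and the attempt count and delay are arbitrary values to tune.

```bash
# Hypothetical retry helper.
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (not executed here): wait up to ~30s for the Raft port.
#   wait_until 30 1 nc -z NEW_NODE_IP 2380 || echo "Raft port never opened" >&2
```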

### Step 3: Add Node to Cluster via Leader

```bash
# From any host that has chainfire-client and can reach the leader
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-add \
  --node-id 4 \
  --peer-url NEW_NODE_IP:2380 \
  --voter  # or --learner

# Expected output:
# Node added: id=4, peer_urls=["NEW_NODE_IP:2380"]
```

### Step 4: Verification

```bash
# Check cluster membership
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Expected output should include new node:
# ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false

# Check new node status
chainfire-client --endpoint http://NEW_NODE_IP:2379 status

# Verify:
# - leader: (should show leader node ID, e.g., 1)
# - raft_term: (should match leader)
# - raft_index: (should be catching up to leader's index)
```
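
"Catching up" is easier to judge as a number. The sketch below extracts `raft_index` from `status` output and computes the gap between leader and new node. Both function names are hypothetical, and the parsing assumes `status` prints a line of the form `raft_index: <integer>`; the exact output format is not specified in this runbook, so verify it before relying on the helper.

```bash
# Hypothetical helper: print the raft_index value from `status` output on stdin.
# ASSUMPTION: the output contains a line like "raft_index: 1523".
raft_index_of() {
  awk '/raft_index/ {
    sub(/.*raft_index[^0-9]*/, "")        # drop everything up to the number
    if (match($0, /^[0-9]+/)) {
      print substr($0, RSTART, RLENGTH)
      exit
    }
  }'
}

# Hypothetical helper: entries the follower is behind the leader.
# raft_lag LEADER_INDEX FOLLOWER_INDEX
raft_lag() {
  echo $(( $1 - $2 ))
}

# Usage (not executed here):
#   leader=$(chainfire-client --endpoint http://LEADER_IP:2379 status | raft_index_of)
#   new=$(chainfire-client --endpoint http://NEW_NODE_IP:2379 status | raft_index_of)
#   echo "lag: $(raft_lag "$leader" "$new") entries"
```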

### Step 5: Promote Learner to Voter (if added as learner)

```bash
# If node was added as learner, promote after data sync
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-promote \
  --node-id 4

# Verify voting status
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# IsLearner should now be false
```

## FlareDB Scale-Out

### Step 1: Prepare New Node Configuration

Create `/etc/centra-cloud/flaredb.toml` on the new node:

```toml
store_id = 4  # NEW STORE ID (must be unique)
addr = "0.0.0.0:2379"
data_dir = "/var/lib/flaredb"
pd_addr = "PD_SERVER_IP:2379"  # Placement Driver address
log_level = "info"

[tls]  # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/flaredb-node-4.crt"
key_file = "/etc/centra-cloud/certs/flaredb-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[peers]
# Empty for new node - will be populated by PD

[namespace_modes]
default = "eventual"  # or "strong"
```

### Step 2: Start New FlareDB Node

```bash
# On new node
cd /path/to/flaredb
nix develop -c cargo run --release --bin flaredb-server -- \
  --config /etc/centra-cloud/flaredb.toml

# Verify the server is listening
# (run from a second shell; the command above stays in the foreground)
ss -tlnp | grep 2379
```

### Step 3: Register with Placement Driver

```bash
# PD should auto-discover the new store
# Check PD logs for registration:
journalctl -u placement-driver -f | grep "store_id=4"

# Verify store registration
curl http://PD_SERVER_IP:2379/pd/api/v1/stores

# Expected: store_id=4 should appear in list
```

### Step 4: Verification

```bash
# Check cluster status
flaredb-client --endpoint http://PD_SERVER_IP:2379 cluster-status

# Verify new store is online:
# store_id=4, state=Up, capacity=..., available=...

# Test write/read
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  put test-key test-value
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  get test-key
# Should return: test-value
```
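
The write/read check can be wrapped so that it fails loudly in automation. This is a sketch under assumptions: `flaredb_smoke_test` and the `FLAREDB_CLIENT` variable are hypothetical names, and the comparison assumes `get` prints only the stored value, as the expected output above suggests.

```bash
# Client binary to invoke; overridable for testing. Hypothetical variable.
FLAREDB_CLIENT=${FLAREDB_CLIENT:-flaredb-client}

# Hypothetical helper: write a key, read it back, and compare.
# ASSUMPTION: `get` prints exactly the stored value on stdout.
flaredb_smoke_test() {
  endpoint=$1
  key="scaleout-smoke-key-$$"      # $$ (shell PID) keeps the key unique per run
  want="scaleout-smoke-value-$$"

  "$FLAREDB_CLIENT" --endpoint "$endpoint" put "$key" "$want" >/dev/null || return 1
  got=$("$FLAREDB_CLIENT" --endpoint "$endpoint" get "$key") || return 1

  [ "$got" = "$want" ]
}

# Usage (not executed here):
#   flaredb_smoke_test http://NEW_NODE_IP:2379 || echo "round-trip failed" >&2
```

The helper leaves the test key behind, because this runbook does not document a delete command for `flaredb-client`.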

## Troubleshooting

### Issue: Node fails to join cluster

**Symptoms:**
- `member-add` command hangs or times out
- New node logs show "connection refused" errors

**Resolution:**
1. Verify network connectivity:
   ```bash
   # From leader node
   nc -zv NEW_NODE_IP 2380
   ```

2. Check firewall rules:
   ```bash
   # On new node
   sudo iptables -L -n | grep 2380
   ```

3. Verify Raft server is listening:
   ```bash
   # On new node
   ss -tlnp | grep 2380
   ```

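4. Check for a TLS configuration mismatch:
   ```bash
   # TLS settings must match between nodes: if the leader has TLS enabled,
   # the new node must enable it too.
   # On new node, confirm the node certificate chains to the cluster CA:
   openssl verify -CAfile /etc/centra-cloud/certs/ca.crt \
     /etc/centra-cloud/certs/chainfire-node-4.crt
   ```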

### Issue: New node stuck as learner

**Symptoms:**
- `member-list` still shows `IsLearner=true` after the node was expected to be promoted
- Raft index not catching up

**Resolution:**
1. Check replication lag:
   ```bash
   # Compare leader vs new node
   chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
   chainfire-client --endpoint http://NEW_NODE_IP:2379 status | grep raft_index
   ```

2. If the lag is large, wait for the new node to catch up before promoting it.

3. If stuck, check new node logs for errors:
   ```bash
   journalctl -u chainfire -n 100
   ```

### Issue: Cluster performance degradation after adding node

**Symptoms:**
- Increased write latency after new node joins
- Leader election instability

**Resolution:**
1. Check node resources (CPU, memory, disk I/O):
   ```bash
   # On new node
   top
   iostat -x 1
   ```

2. Verify network latency between nodes:
   ```bash
   # From leader to new node
   ping -c 100 NEW_NODE_IP
   # Latency should be < 10ms for same datacenter
   ```

3. Consider adding the node as a learner first, then promoting it once it is stable.
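
The latency check in step 2 can be scripted by reading the average round-trip time from the `ping` summary line. The helpers below are hypothetical names; the parsing assumes the summary format printed by Linux iputils (`rtt min/avg/max/mdev = ...`) or BSD `ping` (`round-trip min/avg/max/stddev = ...`).

```bash
# Hypothetical helper: print the average RTT in ms from ping output on stdin.
# ASSUMPTION: summary line looks like
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.006 ms
ping_avg_ms() {
  # Splitting on "/" puts the average value in field 5.
  awk -F'/' '/^(rtt|round-trip)/ { print $5; exit }'
}

# Hypothetical helper: succeeds if AVG_MS is below LIMIT_MS.
# latency_ok AVG_MS LIMIT_MS
latency_ok() {
  awk -v avg="$1" -v max="$2" 'BEGIN { exit !(avg + 0 < max + 0) }'
}

# Usage (not executed here), with the 10 ms same-datacenter guideline:
#   avg=$(ping -c 100 NEW_NODE_IP | ping_avg_ms)
#   latency_ok "$avg" 10 || echo "average RTT ${avg} ms exceeds 10 ms" >&2
```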

## Rollback Procedure

If the scale-out causes issues, remove the new Chainfire node:

```bash
# Remove node from cluster
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-remove \
  --node-id 4

# Stop the server on the new node
systemctl stop chainfire

# Clean up data on the new node (if needed); this deletes the node's local state
rm -rf /var/lib/chainfire/*
```
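
Since the cleanup step is destructive, a scripted rollback may want a guard around the `rm -rf`. The following is a minimal sketch; `wipe_data_dir` is a hypothetical helper, not part of the Chainfire tooling. Like the command above, it removes only non-hidden entries inside the directory and keeps the directory itself.

```bash
# Hypothetical guard around the data cleanup step.
# Refuses obviously dangerous arguments before deleting anything.
wipe_data_dir() {
  dir=$1
  case "$dir" in
    ""|"/"|"."|"..")
      echo "refusing to wipe '$dir'" >&2
      return 1
      ;;
  esac
  [ -d "$dir" ] || { echo "$dir is not a directory" >&2; return 1; }

  # ${dir:?} aborts if dir is empty, so this can never expand to "rm -rf /*".
  rm -rf "${dir:?}"/*
}

# Usage (not executed here), after the node has been removed and stopped:
#   wipe_data_dir /var/lib/chainfire
```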

## References

- Configuration: `specifications/configuration.md`
- TLS Setup: `docs/ops/troubleshooting.md#tls-issues`
- Cluster API: `chainfire/proto/chainfire.proto` (Cluster service)