# Scale-Out Runbook

## Overview

This runbook covers adding new nodes to Chainfire (distributed KV) and FlareDB (time-series DB) clusters to increase capacity and fault tolerance.

## Prerequisites

### Infrastructure

- ✅ New server/VM provisioned with network access to existing cluster
- ✅ Ports open: API (2379), Raft (2380), Gossip (2381)
- ✅ NixOS or compatible environment with Rust toolchain

### Certificates (if TLS enabled)

```bash
# Generate TLS certificates for new node
./scripts/generate-dev-certs.sh /etc/centra-cloud/certs

# Copy to new node
scp /etc/centra-cloud/certs/chainfire-node-N.{crt,key} new-node:/etc/centra-cloud/certs/
scp /etc/centra-cloud/certs/ca.crt new-node:/etc/centra-cloud/certs/
```

### Configuration

- ✅ Node ID assigned (must be unique cluster-wide)
- ✅ Config file prepared (`/etc/centra-cloud/chainfire.toml` or `/etc/centra-cloud/flaredb.toml`)

## Chainfire Scale-Out

### Step 1: Prepare New Node Configuration

Create `/etc/centra-cloud/chainfire.toml` on the new node:

```toml
[node]
id = 4  # NEW NODE ID (must be unique)
name = "chainfire-node-4"
role = "control_plane"

[cluster]
id = 1
bootstrap = false  # IMPORTANT: Do not bootstrap
initial_members = []  # Leave empty for join flow

[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"

[network.tls]  # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/chainfire-node-4.crt"
key_file = "/etc/centra-cloud/certs/chainfire-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[storage]
data_dir = "/var/lib/chainfire"

[raft]
role = "voter"  # or "learner" for non-voting replica
```

### Step 2: Start New Node Server

```bash
# On new node
cd /path/to/chainfire
nix develop -c cargo run --release --bin chainfire-server -- \
  --config /etc/centra-cloud/chainfire.toml

# Verify server is listening
netstat -tlnp | grep -E '2379|2380'
```

### Step 3: Add Node to Cluster via Leader

```bash
# On existing cluster node or via
# chainfire-client
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-add \
  --node-id 4 \
  --peer-url NEW_NODE_IP:2380 \
  --voter  # or --learner

# Expected output:
# Node added: id=4, peer_urls=["NEW_NODE_IP:2380"]
```

### Step 4: Verification

```bash
# Check cluster membership
chainfire-client --endpoint http://LEADER_IP:2379 member-list

# Expected output should include new node:
# ID=4, Name=chainfire-node-4, PeerURLs=[NEW_NODE_IP:2380], IsLearner=false

# Check new node status
chainfire-client --endpoint http://NEW_NODE_IP:2379 status

# Verify:
# - leader: (should show leader node ID, e.g., 1)
# - raft_term: (should match leader)
# - raft_index: (should be catching up to leader's index)
```

### Step 5: Promote Learner to Voter (if added as learner)

```bash
# If node was added as learner, promote after data sync
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-promote \
  --node-id 4

# Verify voting status
chainfire-client --endpoint http://LEADER_IP:2379 member-list
# IsLearner should now be false
```

## FlareDB Scale-Out

### Step 1: Prepare New Node Configuration

Create `/etc/centra-cloud/flaredb.toml` on the new node:

```toml
store_id = 4  # NEW STORE ID (must be unique)
addr = "0.0.0.0:2379"
data_dir = "/var/lib/flaredb"
pd_addr = "PD_SERVER_IP:2379"  # Placement Driver address
log_level = "info"

[tls]  # Optional, if TLS enabled
cert_file = "/etc/centra-cloud/certs/flaredb-node-4.crt"
key_file = "/etc/centra-cloud/certs/flaredb-node-4.key"
ca_file = "/etc/centra-cloud/certs/ca.crt"
require_client_cert = true

[peers]
# Empty for new node - will be populated by PD

[namespace_modes]
default = "eventual"  # or "strong"
```

### Step 2: Start New FlareDB Node

```bash
# On new node
cd /path/to/flaredb
nix develop -c cargo run --release --bin flaredb-server -- \
  --config /etc/centra-cloud/flaredb.toml

# Verify server is listening
netstat -tlnp | grep 2379
```

### Step 3: Register with Placement Driver

```bash
# PD should auto-discover the new store
# Check PD
# logs for registration:
journalctl -u placement-driver -f | grep "store_id=4"

# Verify store registration
curl http://PD_SERVER_IP:2379/pd/api/v1/stores
# Expected: store_id=4 should appear in list
```

### Step 4: Verification

```bash
# Check cluster status
flaredb-client --endpoint http://PD_SERVER_IP:2379 cluster-status

# Verify new store is online:
# store_id=4, state=Up, capacity=..., available=...

# Test write/read
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  put test-key test-value
flaredb-client --endpoint http://NEW_NODE_IP:2379 \
  get test-key
# Should return: test-value
```

## Troubleshooting

### Issue: Node fails to join cluster

**Symptoms:**

- `member-add` command hangs or times out
- New node logs show "connection refused" errors

**Resolution:**

1. Verify network connectivity:
   ```bash
   # From leader node
   nc -zv NEW_NODE_IP 2380
   ```
2. Check firewall rules:
   ```bash
   # On new node
   sudo iptables -L -n | grep 2380
   ```
3. Verify Raft server is listening:
   ```bash
   # On new node
   ss -tlnp | grep 2380
   ```
4. Check for a TLS configuration mismatch: if the leader has TLS enabled, the new node must have matching TLS settings (same CA, same client-certificate requirements).

### Issue: New node stuck as learner

**Symptoms:**

- `member-list` shows `IsLearner=true` after expected promotion time
- Raft index not catching up

**Resolution:**

1. Check replication lag:
   ```bash
   # Compare leader vs new node
   chainfire-client --endpoint http://LEADER_IP:2379 status | grep raft_index
   chainfire-client --endpoint http://NEW_NODE_IP:2379 status | grep raft_index
   ```
2. If lag is large, wait for the node to catch up before promoting
3. If stuck, check new node logs for errors:
   ```bash
   journalctl -u chainfire -n 100
   ```

### Issue: Cluster performance degradation after adding node

**Symptoms:**

- Increased write latency after new node joins
- Leader election instability

**Resolution:**

1. Check node resources (CPU, memory, disk I/O):
   ```bash
   # On new node
   top
   iostat -x 1
   ```
2.
Verify network latency between nodes:
   ```bash
   # From leader to new node
   ping -c 100 NEW_NODE_IP
   # Latency should be < 10ms for same datacenter
   ```
3. Consider adding the node as a learner first, then promoting once replication is stable

## Rollback Procedure

If scale-out causes issues, remove the new node:

```bash
# Remove node from cluster
chainfire-client --endpoint http://LEADER_IP:2379 \
  member-remove \
  --node-id 4

# Stop server on new node
systemctl stop chainfire

# Clean up data (if needed)
rm -rf /var/lib/chainfire/*
```

## References

- Configuration: `specifications/configuration.md`
- TLS Setup: `docs/ops/troubleshooting.md#tls-issues`
- Cluster API: `chainfire/proto/chainfire.proto` (Cluster service)
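## Appendix: Voter Count vs. Fault Tolerance

When deciding whether to add node 4 as a `voter` or a `learner`, it helps to remember the general Raft quorum arithmetic (this is standard Raft math, not Chainfire-specific tooling): a cluster of n voters needs floor(n/2) + 1 nodes for quorum and tolerates floor((n-1)/2) voter failures. A quick sketch:

```shell
# Raft quorum math: quorum for n voters is floor(n/2) + 1, and the cluster
# survives floor((n-1)/2) voter failures. Integer division in bash gives us
# the floor for free.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 3 4 5; do
  printf 'voters=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerated "$n")"
done
# Output:
# voters=3 quorum=2 tolerated_failures=1
# voters=4 quorum=3 tolerated_failures=1
# voters=5 quorum=3 tolerated_failures=2
```

Note that going from 3 to 4 voters does not improve fault tolerance; it only raises the quorum size (and thus write latency). This is one reason the troubleshooting section suggests adding the node as a learner first: prefer odd voter counts, and add extra capacity as learners.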