# Bare-Metal Provisioning Operator Runbook
**Document Version:** 1.0
**Last Updated:** 2025-12-10
**Status:** Production Ready
**Author:** PlasmaCloud Infrastructure Team
## 1. Overview
### 1.1 What This Runbook Covers
This runbook provides comprehensive, step-by-step instructions for deploying PlasmaCloud infrastructure on bare-metal servers using automated PXE-based provisioning. By following this guide, operators will be able to:
- Deploy a complete PlasmaCloud cluster from bare hardware to running services
- Bootstrap a 3-node Raft cluster (Chainfire + FlareDB)
- Add additional nodes to an existing cluster
- Validate cluster health and troubleshoot common issues
- Perform operational tasks (updates, maintenance, recovery)
### 1.2 Prerequisites
**Required Access and Permissions:**
- Root/sudo access on provisioning server
- Physical or IPMI/BMC access to bare-metal servers
- Network access to provisioning VLAN
- SSH key pair for nixos-anywhere
**Required Tools:**
- NixOS with flakes enabled (provisioning workstation)
- curl, jq, ssh client
- ipmitool (optional, for remote management)
- Serial console access tool (optional)
**Required Knowledge:**
- Basic understanding of PXE boot process
- Linux system administration
- Network configuration (DHCP, DNS, firewall)
- NixOS basics (declarative configuration, flakes)
### 1.3 Architecture Diagram
```
                        Bare-Metal Provisioning Flow
                        ============================

Phase 1: PXE Boot
  ┌──────────────┐  1. DHCP request           ┌──────────────────┐
  │  Bare-metal  ├───────────────────────────>│  DHCP server     │
  │  server      │  2. TFTP get bootloader    │  (PXE server)    │
  │  (powered on,│<───────────────────────────┤                  │
  │  PXE enabled)│                            └──────────────────┘
  │              │  4. HTTP GET boot.ipxe     ┌──────────────────┐
  │  3. iPXE     ├───────────────────────────>│  HTTP server     │
  │     loads    │  6. HTTP GET kernel+initrd │  (nginx)         │
  │  5. iPXE menu│<───────────────────────────┤                  │
  │  7. Boot     │                            └──────────────────┘
  │     NixOS    │
  │     installer│
  └──────┬───────┘
         │ 8. SSH connection                  ┌──────────────────┐
         └─────────────────────────────────── │  Provisioning    │
                                              │  workstation     │
                                              │                  │
                                              │  9. Run          │
                                              │     nixos-       │
                                              │     anywhere     │
                                              └────────┬─────────┘
Phase 2: Installation                                  │
             ┌─────────────────────────────────────────┤
             v                                         v
  ┌──────────────────────────┐          ┌──────────────────────────┐
  │ 10. Partition disks      │          │ 11. Install NixOS        │
  │     (disko)              │          │     - Build system       │
  │     - GPT/LVM/LUKS       │          │     - Copy closures      │
  │     - Format filesystems │          │     - Install bootloader │
  │     - Mount /mnt         │          │     - Inject secrets     │
  └──────────────────────────┘          └──────────────────────────┘

Phase 3: First Boot
  ┌──────────────┐  12. Reboot                ┌──────────────────┐
  │  Bare-metal  ├───────────────────────────>│  NixOS system    │
  │  server      │                            │  (from disk)     │
  └──────────────┘                            └────────┬─────────┘
                                                       │
             ┌─────────────────────────────────────────┴─┐
             │ 13. First-boot automation                 │
             │     - Chainfire cluster join/bootstrap    │
             │     - FlareDB cluster join/bootstrap      │
             │     - IAM initialization                  │
             │     - Health checks                       │
             └───────────────────┬───────────────────────┘
Phase 4: Running Cluster         │
                                 v
                      ┌──────────────────┐
                      │ Running cluster  │
                      │ - All services   │
                      │   healthy        │
                      │ - Raft quorum    │
                      │ - TLS enabled    │
                      └──────────────────┘
```
## 2. Hardware Requirements
### 2.1 Minimum Specifications Per Node
**Control Plane Nodes (3-5 recommended):**
- CPU: 8 cores / 16 threads (Intel Xeon or AMD EPYC)
- RAM: 32 GB DDR4 ECC
- Storage: 500 GB SSD (NVMe preferred)
- Network: 2x 10 GbE (bonded/redundant)
- BMC: IPMI 2.0 or Redfish compatible
**Worker Nodes:**
- CPU: 16+ cores / 32+ threads
- RAM: 64 GB+ DDR4 ECC
- Storage: 1 TB+ NVMe SSD
- Network: 2x 10 GbE or 2x 25 GbE
- BMC: IPMI 2.0 or Redfish compatible
**All-in-One (Development/Testing):**
- CPU: 16 cores / 32 threads
- RAM: 64 GB DDR4
- Storage: 1 TB SSD
- Network: 1x 10 GbE (minimum)
- BMC: Optional but recommended
### 2.2 Recommended Production Specifications
**Control Plane Nodes:**
- CPU: 16-32 cores (Intel Xeon Gold/Platinum or AMD EPYC)
- RAM: 64-128 GB DDR4 ECC
- Storage: 1-2 TB NVMe SSD (RAID1 for redundancy)
- Network: 2x 25 GbE (active/active bonding)
- BMC: Redfish with SOL (Serial-over-LAN)
**Worker Nodes:**
- CPU: 32-64 cores
- RAM: 128-256 GB DDR4 ECC
- Storage: 2-4 TB NVMe SSD
- Network: 2x 25 GbE or 2x 100 GbE
- GPU: Optional (NVIDIA/AMD for ML workloads)
### 2.3 Hardware Compatibility Matrix
| Vendor | Model | Tested | BIOS | UEFI | Notes |
|-----------|---------------|--------|------|------|--------------------------------|
| Dell | PowerEdge R640| Yes | Yes | Yes | Requires BIOS A19+ |
| Dell | PowerEdge R650| Yes | Yes | Yes | Best PXE compatibility |
| HPE | ProLiant DL360| Yes | Yes | Yes | Disable Secure Boot |
| HPE | ProLiant DL380| Yes | Yes | Yes | Latest firmware recommended |
| Supermicro| SYS-2029U | Yes | Yes | Yes | Requires BMC 1.73+ |
| Lenovo | ThinkSystem | Partial| Yes | Yes | Some NIC issues on older models|
| Generic | Whitebox x86 | Partial| Yes | Maybe| UEFI support varies |
### 2.4 BIOS/UEFI Settings
**Required Settings:**
- Boot Mode: UEFI (preferred) or Legacy BIOS
- PXE/Network Boot: Enabled on primary NIC
- Boot Order: Network → Disk
- Secure Boot: Disabled (for PXE boot)
- Virtualization: Enabled (VT-x/AMD-V)
- SR-IOV: Enabled (if using advanced networking)
**Dell-Specific (iDRAC):**
```
System BIOS → Boot Settings:
Boot Mode: UEFI
UEFI Network Stack: Enabled
PXE Device 1: Integrated NIC 1
System BIOS → System Profile:
Profile: Performance
```
**HPE-Specific (iLO):**
```
System Configuration → BIOS/Platform:
Boot Mode: UEFI Mode
Network Boot: Enabled
PXE Support: UEFI Only
System Configuration → UEFI Boot Order:
1. Network Adapter (NIC 1)
2. Hard Disk
```
**Supermicro-Specific (IPMI):**
```
BIOS Setup → Boot:
Boot mode select: UEFI
UEFI Network Stack: Enabled
Boot Option #1: UEFI Network
BIOS Setup → Advanced → CPU Configuration:
Intel Virtualization Technology: Enabled
```
### 2.5 BMC/IPMI Requirements
**Mandatory Features:**
- Remote power control (on/off/reset)
- Boot device selection (PXE/disk)
- Remote console access (KVM-over-IP or SOL)
**Recommended Features:**
- Virtual media mounting
- Sensor monitoring (temperature, fans, PSU)
- Event logging
- SMTP alerting
**Network Configuration:**
- Dedicated BMC network (separate VLAN recommended)
- Static IP or DHCP reservation
- HTTPS access enabled
- Default credentials changed
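Before signing off on the BMC checklist, it helps to confirm every BMC actually answers over the network. A minimal sketch, assuming `ipmitool` with lanplus support; the host list and credentials are placeholders for your environment, and with `DRY_RUN=1` the helper only prints the command it would run (omitting the password):

```shell
#!/usr/bin/env bash
# Sketch: verify each BMC answers before provisioning starts.
# Hosts and credentials below are placeholders.
set -euo pipefail

BMC_USER="${BMC_USER:-admin}"
BMC_PASS="${BMC_PASS:-password}"
BMC_HOSTS=(10.0.10.50 10.0.10.51 10.0.10.52)

# Build the ipmitool invocation for one BMC; with DRY_RUN=1 just print it.
bmc_cmd() {
  local host="$1"; shift
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo ipmitool -I lanplus -H "$host" -U "$BMC_USER" "$@"
  else
    ipmitool -I lanplus -H "$host" -U "$BMC_USER" -P "$BMC_PASS" "$@"
  fi
}

# Set RUN_CHECKS=1 to actually query the BMCs.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  for host in "${BMC_HOSTS[@]}"; do
    bmc_cmd "$host" chassis status   # power state and faults
    bmc_cmd "$host" sel list         # hardware event log
  done
fi
```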
## 3. Network Setup
### 3.1 Network Topology
**Single-Segment Topology (Simple):**
```
┌─────────────────────────────────────────────────────┐
│ Provisioning server (PXE/DHCP/HTTP)                 │
│ 10.0.100.10                                         │
└──────────────────────────┬──────────────────────────┘
                           │ Layer 2 switch (unmanaged)
          ┌────────────────┼────────────────┐
          │                │                │
     ┌────┴─────┐     ┌────┴─────┐     ┌────┴─────┐
     │  Node01  │     │  Node02  │     │  Node03  │
     │10.0.100  │     │10.0.100  │     │10.0.100  │
     │   .50    │     │   .51    │     │   .52    │
     └──────────┘     └──────────┘     └──────────┘
```
**Multi-VLAN Topology (Production):**
```
┌──────────────────────────────────────────────────────┐
│ Management network (VLAN 10)                         │
│ - Provisioning server: 10.0.10.10                    │
│ - BMC/IPMI:            10.0.10.50-99                 │
└──────────────────┬───────────────────────────────────┘
┌──────────────────┴───────────────────────────────────┐
│ Provisioning network (VLAN 100)                      │
│ - PXE boot:   10.0.100.0/24                          │
│ - DHCP range: 10.0.100.100-200                       │
└──────────────────┬───────────────────────────────────┘
┌──────────────────┴───────────────────────────────────┐
│ Production network (VLAN 200)                        │
│ - Static IPs: 10.0.200.10-99                         │
│ - Service traffic                                    │
└──────────────────┬───────────────────────────────────┘
          ┌────────┴────────┐
          │    L3 switch    │
          │ (VLANs, routing)│
          └────────┬────────┘
      ┌────────────┼─────────────┐
      │            │             │
 ┌────┴─────┐ ┌────┴─────┐      ...
 │  Node01  │ │  Node02  │
 │ eth0:    │ │ eth0:    │
 │  VLAN 100│ │  VLAN 100│
 │ eth1:    │ │ eth1:    │
 │  VLAN 200│ │  VLAN 200│
 └──────────┘ └──────────┘
```
### 3.2 DHCP Server Configuration
**ISC DHCP Configuration (`/etc/dhcp/dhcpd.conf`):**
```dhcp
# Global options
option architecture-type code 93 = unsigned integer 16;

default-lease-time 600;
max-lease-time 7200;
authoritative;

# Provisioning subnet
subnet 10.0.100.0 netmask 255.255.255.0 {
  range 10.0.100.100 10.0.100.200;
  option routers 10.0.100.1;
  option domain-name-servers 10.0.100.1, 8.8.8.8;
  option domain-name "prov.example.com";

  # PXE boot server
  next-server 10.0.100.10;

  # Architecture-specific boot file selection
  if exists user-class and option user-class = "iPXE" {
    # iPXE already loaded, provide boot script via HTTP
    filename "http://10.0.100.10:8080/boot/ipxe/boot.ipxe";
  } elsif option architecture-type = 00:00 {
    # BIOS (legacy) - load iPXE via TFTP
    filename "undionly.kpxe";
  } elsif option architecture-type = 00:07 {
    # UEFI x86_64 - load iPXE via TFTP
    filename "ipxe.efi";
  } elsif option architecture-type = 00:09 {
    # UEFI x86_64 (alternate) - load iPXE via TFTP
    filename "ipxe.efi";
  } else {
    # Fallback to UEFI
    filename "ipxe.efi";
  }
}

# Static reservations for control plane nodes
host node01 {
  hardware ethernet 52:54:00:12:34:56;
  fixed-address 10.0.100.50;
  option host-name "node01";
}
host node02 {
  hardware ethernet 52:54:00:12:34:57;
  fixed-address 10.0.100.51;
  option host-name "node02";
}
host node03 {
  hardware ethernet 52:54:00:12:34:58;
  fixed-address 10.0.100.52;
  option host-name "node03";
}
```
**Validation Commands:**
```bash
# Test DHCP configuration syntax
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf
# Start DHCP server
sudo systemctl start isc-dhcp-server
sudo systemctl enable isc-dhcp-server
# Monitor DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases
# Test DHCP response
sudo nmap --script broadcast-dhcp-discover -e eth0
```
### 3.3 DNS Requirements
**Forward DNS Zone (`example.com`):**
```zone
; Control plane nodes
node01.example.com. IN A 10.0.200.10
node02.example.com. IN A 10.0.200.11
node03.example.com. IN A 10.0.200.12
; Worker nodes
worker01.example.com. IN A 10.0.200.20
worker02.example.com. IN A 10.0.200.21
; Service VIPs (optional, for load balancing)
chainfire.example.com. IN A 10.0.200.100
flaredb.example.com. IN A 10.0.200.101
iam.example.com. IN A 10.0.200.102
```
**Reverse DNS Zone (`200.0.10.in-addr.arpa`):**
```zone
; Control plane nodes
10.200.0.10.in-addr.arpa. IN PTR node01.example.com.
11.200.0.10.in-addr.arpa. IN PTR node02.example.com.
12.200.0.10.in-addr.arpa. IN PTR node03.example.com.
```
**Validation:**
```bash
# Test forward resolution
dig +short node01.example.com
# Test reverse resolution
dig +short -x 10.0.200.10
# Test from target node after provisioning
ssh root@10.0.100.50 'hostname -f'
```
### 3.4 Firewall Rules
**Service Port Matrix (see NETWORK.md for complete reference):**
| Service | API Port | Raft Port | Additional | Protocol |
|--------------|----------|-----------|------------|----------|
| Chainfire | 2379 | 2380 | 2381 (gossip) | TCP |
| FlareDB | 2479 | 2480 | - | TCP |
| IAM | 8080 | - | - | TCP |
| PlasmaVMC | 9090 | - | - | TCP |
| PrismNET | 9091 | - | - | TCP |
| FlashDNS | 53 | - | - | TCP/UDP |
| FiberLB | 9092 | - | - | TCP |
| K8sHost | 10250 | - | - | TCP |
**iptables Rules (Provisioning Server):**
```bash
#!/bin/bash
# Provisioning server firewall rules
# Allow DHCP
iptables -A INPUT -p udp --dport 67 -j ACCEPT
iptables -A INPUT -p udp --dport 68 -j ACCEPT
# Allow TFTP
iptables -A INPUT -p udp --dport 69 -j ACCEPT
# Allow HTTP (boot server)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
# Allow SSH (for nixos-anywhere)
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
```
**iptables Rules (Cluster Nodes):**
```bash
#!/bin/bash
# Cluster node firewall rules
# Allow SSH (management)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
# Allow Chainfire (from cluster subnet only)
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2381 -s 10.0.200.0/24 -j ACCEPT
# Allow FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2480 -s 10.0.200.0/24 -j ACCEPT
# Allow IAM (from cluster and client subnets)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT
# Drop all other traffic
iptables -A INPUT -j DROP
```
**nftables Rules (Modern Alternative):**
```nft
#!/usr/sbin/nft -f
flush ruleset

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    # Allow established connections
    ct state established,related accept

    # Allow loopback
    iif lo accept

    # Allow SSH
    tcp dport 22 ip saddr 10.0.0.0/8 accept

    # Allow cluster services from cluster subnet
    tcp dport { 2379, 2380, 2381, 2479, 2480 } ip saddr 10.0.200.0/24 accept

    # Allow IAM from internal network
    tcp dport 8080 ip saddr 10.0.0.0/8 accept
  }
}
```
### 3.5 Static IP Allocation Strategy
**IP Allocation Plan:**
```
10.0.100.0/24 - Provisioning network (DHCP during install)
.1 - Gateway
.10 - PXE/DHCP/HTTP server
.50-.79 - Control plane nodes (static reservations)
.80-.99 - Worker nodes (static reservations)
.100-.200 - DHCP pool (temporary during provisioning)
10.0.200.0/24 - Production network (static IPs)
.1 - Gateway
.10-.19 - Control plane nodes
.20-.99 - Worker nodes
.100-.199 - Service VIPs
```
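When writing reservations by hand, a quick membership check guards against typos that place a node outside its intended block. A pure-bash sketch (no external tools); `ip_to_int` and `in_subnet` are local helper names:

```shell
#!/usr/bin/env bash
# Sketch: verify a host IP falls inside the intended allocation block
# before committing DHCP reservations or static configs.
set -euo pipefail

# Convert dotted-quad to a 32-bit integer.
ip_to_int() {
  local IFS=.
  read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_subnet IP CIDR -> exit 0 if IP is inside CIDR
in_subnet() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net="${2%/*}"
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Example: a control-plane static IP must be in the production /24.
in_subnet 10.0.200.10 10.0.200.0/24 && echo "10.0.200.10 OK"
```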
### 3.6 Network Bandwidth Requirements
**Per-Node During Provisioning:**
- PXE boot: ~200-500 MB (kernel + initrd)
- nixos-anywhere: ~1-5 GB (NixOS closures)
- Time: 5-15 minutes on 1 Gbps link
**Production Cluster:**
- Control plane: 1 Gbps minimum, 10 Gbps recommended
- Workers: 10 Gbps minimum, 25 Gbps recommended
- Inter-node latency: <1ms ideal, <5ms acceptable
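The latency ceiling above can be checked straight from ping's summary line. A sketch: `parse_avg_rtt` and `rtt_ok` are local helper names, and the actual ping loop only runs when `RUN_CHECKS=1`:

```shell
#!/usr/bin/env bash
# Sketch: extract the average RTT from ping's summary and compare it to
# the <5 ms ceiling from the requirements above.
set -euo pipefail

# Reads ping output on stdin; prints the avg RTT in ms.
# Summary line looks like: rtt min/avg/max/mdev = 0.045/0.061/0.089/0.012 ms
parse_avg_rtt() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# rtt_ok AVG_MS MAX_MS -> exit 0 if within budget (float compare via awk)
rtt_ok() {
  awk -v r="$1" -v m="$2" 'BEGIN { exit (r <= m) ? 0 : 1 }'
}

# Set RUN_CHECKS=1 to measure real nodes.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  for node in node02 node03; do
    avg=$(ping -c 5 -q "$node.example.com" | parse_avg_rtt)
    if rtt_ok "$avg" 5; then
      echo "$node avg ${avg} ms: OK"
    else
      echo "$node avg ${avg} ms: TOO HIGH"
    fi
  done
fi
```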
## 4. Pre-Deployment Checklist
Complete this checklist before beginning deployment:
### 4.1 Hardware Checklist
- [ ] All servers racked and powered
- [ ] All network cables connected (data + BMC)
- [ ] All power supplies connected (redundant if available)
- [ ] BMC/IPMI network configured
- [ ] BMC credentials documented
- [ ] BIOS/UEFI settings configured per section 2.4
- [ ] PXE boot enabled and first in boot order
- [ ] Secure Boot disabled (if using UEFI)
- [ ] Hardware inventory recorded (MAC addresses, serial numbers)
### 4.2 Network Checklist
- [ ] Network switches configured (VLANs, trunking)
- [ ] DHCP server configured and tested
- [ ] DNS forward/reverse zones created
- [ ] Firewall rules configured
- [ ] Network connectivity verified (ping tests)
- [ ] Bandwidth validated (iperf between nodes)
- [ ] DHCP relay configured (if multi-subnet)
- [ ] NTP server configured for time sync
### 4.3 PXE Server Checklist
- [ ] PXE server deployed (see T032.S2)
- [ ] DHCP service running and healthy
- [ ] TFTP service running and healthy
- [ ] HTTP service running and healthy
- [ ] iPXE bootloaders downloaded (undionly.kpxe, ipxe.efi)
- [ ] NixOS netboot images built and uploaded (see T032.S3)
- [ ] Boot script configured (boot.ipxe)
- [ ] Health endpoints responding
**Validation:**
```bash
# On PXE server
sudo systemctl status isc-dhcp-server
sudo systemctl status atftpd
sudo systemctl status nginx
# Test HTTP access
curl http://10.0.100.10:8080/boot/ipxe/boot.ipxe
curl http://10.0.100.10:8080/health
# Test TFTP access
tftp 10.0.100.10 -c get undionly.kpxe /tmp/test.kpxe
```
### 4.4 Node Configuration Checklist
- [ ] Per-node NixOS configurations created (`/srv/provisioning/nodes/`)
- [ ] Hardware configurations generated or templated
- [ ] Disko disk layouts defined
- [ ] Network settings configured (static IPs, VLANs)
- [ ] Service selections defined (control-plane vs worker)
- [ ] Cluster configuration JSON files created
- [ ] Node inventory documented (MAC, hostname, role)
### 4.5 TLS Certificates Checklist
- [ ] CA certificate generated
- [ ] Per-node certificates generated
- [ ] Certificate files copied to secrets directories
- [ ] Certificate permissions set (0400 for private keys)
- [ ] Certificate expiry dates documented
- [ ] Rotation procedure documented
**Generate Certificates:**
```bash
# Generate CA (if not already done)
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365
done
```
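To satisfy the "expiry dates documented" item above, the generated certificates can be listed with their `notAfter` dates. A small sketch using standard openssl output; the file names follow the generation loop:

```shell
#!/usr/bin/env bash
# Sketch: record the expiry date of each generated certificate so the
# checklist item has a concrete artifact to file.
set -euo pipefail

# cert_expiry FILE -> "FILE <notAfter date>"
cert_expiry() {
  printf '%s %s\n' "$1" "$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)"
}

for cert in ca-cert.pem node01-cert.pem node02-cert.pem node03-cert.pem; do
  if [ -f "$cert" ]; then
    cert_expiry "$cert"
  fi
done
```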
### 4.6 Provisioning Workstation Checklist
- [ ] NixOS or Nix package manager installed
- [ ] Nix flakes enabled
- [ ] SSH key pair generated for provisioning
- [ ] SSH public key added to netboot images
- [ ] Network access to provisioning VLAN
- [ ] Git repository cloned (if using version control)
- [ ] nixos-anywhere installed: `nix profile install github:nix-community/nixos-anywhere`
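The SSH key pair item can be completed non-interactively. A sketch; the key path and comment are local conventions, not anything nixos-anywhere requires:

```shell
#!/usr/bin/env bash
# Sketch: generate a dedicated provisioning key pair if one does not
# already exist, then print the public key to embed in netboot images.
set -euo pipefail

KEY="${PROVISION_KEY:-$HOME/.ssh/provisioning_ed25519}"

mkdir -p "$(dirname "$KEY")"
if [ ! -f "$KEY" ]; then
  # -N "" = no passphrase, so nixos-anywhere can use it unattended
  ssh-keygen -t ed25519 -N "" -C "plasmacloud-provisioning" -f "$KEY"
fi

echo "Public key to embed in the netboot image:"
cat "${KEY}.pub"
```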
## 5. Deployment Workflow
### 5.1 Phase 1: PXE Server Setup
**Reference:** See `/home/centra/cloud/chainfire/baremetal/pxe-server/` (T032.S2)
**Step 1.1: Deploy PXE Server Using NixOS Module**
Create PXE server configuration:
```nix
# /etc/nixos/pxe-server.nix
{ config, pkgs, lib, ... }:
{
  imports = [
    /path/to/chainfire/baremetal/pxe-server/nixos-module.nix
  ];

  services.centra-pxe-server = {
    enable = true;
    interface = "eth0";
    serverAddress = "10.0.100.10";

    dhcp = {
      subnet = "10.0.100.0";
      netmask = "255.255.255.0";
      broadcast = "10.0.100.255";
      range = {
        start = "10.0.100.100";
        end = "10.0.100.200";
      };
      router = "10.0.100.1";
      domainNameServers = [ "10.0.100.1" "8.8.8.8" ];
    };

    nodes = {
      "52:54:00:12:34:56" = {
        profile = "control-plane";
        hostname = "node01";
        ipAddress = "10.0.100.50";
      };
      "52:54:00:12:34:57" = {
        profile = "control-plane";
        hostname = "node02";
        ipAddress = "10.0.100.51";
      };
      "52:54:00:12:34:58" = {
        profile = "control-plane";
        hostname = "node03";
        ipAddress = "10.0.100.52";
      };
    };
  };
}
```
Apply configuration:
```bash
sudo nixos-rebuild switch -I nixos-config=/etc/nixos/pxe-server.nix
```
**Step 1.2: Verify PXE Services**
```bash
# Check all services are running
sudo systemctl status dhcpd4.service
sudo systemctl status atftpd.service
sudo systemctl status nginx.service
# Test DHCP server
sudo journalctl -u dhcpd4 -f &
# Power on a test server and watch for DHCP requests
# Test TFTP server
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
ls -lh /tmp/test.kpxe # Should show ~100KB file
# Test HTTP server
curl http://localhost:8080/health
# Expected: {"status":"healthy","services":{"dhcp":"running","tftp":"running","http":"running"}}
curl http://localhost:8080/boot/ipxe/boot.ipxe
# Expected: iPXE boot script content
```
### 5.2 Phase 2: Build Netboot Images
**Reference:** See `/home/centra/cloud/baremetal/image-builder/` (T032.S3)
**Step 2.1: Build Images for All Profiles**
```bash
cd /home/centra/cloud/baremetal/image-builder
# Build all profiles
./build-images.sh
# Or build specific profile
./build-images.sh --profile control-plane
./build-images.sh --profile worker
./build-images.sh --profile all-in-one
```
**Expected Output:**
```
Building netboot image for control-plane...
Building initrd...
[... Nix build output ...]
✓ Build complete: artifacts/control-plane/initrd (234 MB)
✓ Build complete: artifacts/control-plane/bzImage (12 MB)
```
**Step 2.2: Copy Images to PXE Server**
```bash
# Automatic (if PXE server directory exists)
./build-images.sh --deploy
# Manual copy
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
sudo cp artifacts/all-in-one/* /var/lib/pxe-boot/nixos/all-in-one/
```
**Step 2.3: Verify Image Integrity**
```bash
# Check file sizes (should be reasonable)
ls -lh /var/lib/pxe-boot/nixos/*/
# Verify images are accessible via HTTP
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/bzImage
# Expected: HTTP/1.1 200 OK, Content-Length: ~12000000
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/initrd
# Expected: HTTP/1.1 200 OK, Content-Length: ~234000000
```
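The HTTP HEAD checks confirm the files are served, but not that they are intact. A sketch that writes a SHA-256 manifest at build time and verifies it after the copy; the `bzImage`/`initrd` names follow the layout above:

```shell
#!/usr/bin/env bash
# Sketch: checksum manifest for netboot images, so the copy to the PXE
# server (or an HTTP download) can be verified end to end.
set -euo pipefail

# make_manifest DIR : record checksums next to the images at build time.
make_manifest() {
  ( cd "$1" && sha256sum bzImage initrd > SHA256SUMS )
}

# check_manifest DIR : verify the images after copying them.
check_manifest() {
  ( cd "$1" && sha256sum -c SHA256SUMS )
}
```

Run `make_manifest artifacts/control-plane` after the build, copy `SHA256SUMS` along with the images, and run `check_manifest /var/lib/pxe-boot/nixos/control-plane` on the PXE server.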
### 5.3 Phase 3: Prepare Node Configurations
**Step 3.1: Generate Node-Specific NixOS Configs**
Create directory structure:
```bash
mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets
```
**Node Configuration Template (`nodes/node01.example.com/configuration.nix`):**
```nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  # Hostname and domain
  networking = {
    hostName = "node01";
    domain = "example.com";
    usePredictableInterfaceNames = false; # Use eth0, eth1

    # Provisioning interface (temporary)
    interfaces.eth0 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.100.50";
        prefixLength = 24;
      }];
    };

    # Production interface
    interfaces.eth1 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.200.10";
        prefixLength = 24;
      }];
    };

    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" "8.8.8.8" ];
  };

  # Enable PlasmaCloud services
  services.chainfire = {
    enable = true;
    port = 2379;
    raftPort = 2380;
    gossipPort = 2381;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.flaredb = {
    enable = true;
    port = 2479;
    raftPort = 2480;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      chainfire_endpoint = "https://localhost:2379";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.iam = {
    enable = true;
    port = 8080;
    settings = {
      flaredb_endpoint = "https://localhost:2479";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  # Enable first-boot automation
  services.first-boot-automation = {
    enable = true;
    configFile = "/etc/nixos/secrets/cluster-config.json";
  };

  system.stateVersion = "24.11";
}
```
**Step 3.2: Create cluster-config.json for Each Node**
**Bootstrap Node (node01):**
```json
{
  "node_id": "node01",
  "node_role": "control-plane",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ],
  "flaredb_peers": [
    "node01.example.com:2480",
    "node02.example.com:2480",
    "node03.example.com:2480"
  ]
}
```
Copy to secrets:
```bash
cp cluster-config-node01.json /srv/provisioning/nodes/node01.example.com/secrets/cluster-config.json
cp cluster-config-node02.json /srv/provisioning/nodes/node02.example.com/secrets/cluster-config.json
cp cluster-config-node03.json /srv/provisioning/nodes/node03.example.com/secrets/cluster-config.json
```
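Rather than maintaining three JSON files by hand, the per-node configs can be stamped out from one template, since only `node_id`, `raft_addr`, and the `bootstrap` flag differ. A sketch (the `flaredb_peers` list from the example above could be added the same way); `gen_config` and `OUT` are local conventions:

```shell
#!/usr/bin/env bash
# Sketch: generate per-node cluster-config.json files from one template.
set -euo pipefail

OUT="${OUT:-/srv/provisioning/nodes}"

# gen_config NODE_ID RAFT_IP BOOTSTRAP(true|false)
gen_config() {
  cat <<EOF
{
  "node_id": "$1",
  "node_role": "control-plane",
  "bootstrap": $3,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "$2:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ]
}
EOF
}

# Set RUN_GEN=1 to write the files into the secrets directories.
if [[ "${RUN_GEN:-0}" == "1" ]]; then
  gen_config node01 10.0.200.10 true  > "$OUT/node01.example.com/secrets/cluster-config.json"
  gen_config node02 10.0.200.11 false > "$OUT/node02.example.com/secrets/cluster-config.json"
  gen_config node03 10.0.200.12 false > "$OUT/node03.example.com/secrets/cluster-config.json"
fi
```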
**Step 3.3: Generate Disko Disk Layouts**
**Simple Single-Disk Layout (`nodes/node01.example.com/disko.nix`):**
```nix
{ disks ? [ "/dev/sda" ], ... }:
{
  disko.devices = {
    disk = {
      main = {
        type = "disk";
        device = builtins.head disks;
        content = {
          type = "gpt";
          partitions = {
            ESP = {
              size = "1G";
              type = "EF00";
              content = {
                type = "filesystem";
                format = "vfat";
                mountpoint = "/boot";
              };
            };
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "ext4";
                mountpoint = "/";
              };
            };
          };
        };
      };
    };
  };
}
```
**Step 3.4: Pre-Generate TLS Certificates**
```bash
# Copy per-node certificates
cp ca-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-key.pem /srv/provisioning/nodes/node01.example.com/secrets/
# Set permissions
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/*-cert.pem
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/ca-cert.pem
chmod 600 /srv/provisioning/nodes/node01.example.com/secrets/*-key.pem
```
### 5.4 Phase 4: Bootstrap First 3 Nodes
**Step 4.1: Power On Nodes via BMC**
```bash
# Using ipmitool (example for Dell/HP/Supermicro)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
ipmitool -I lanplus -H $ip -U admin -P password chassis bootdev pxe options=persistent
ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done
```
**Step 4.2: Verify PXE Boot Success**
Watch serial console (if available):
```bash
# Connect via IPMI SOL
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol activate
# Expected output:
# ... DHCP discovery ...
# ... TFTP download undionly.kpxe or ipxe.efi ...
# ... iPXE menu appears ...
# ... Kernel and initrd download ...
# ... NixOS installer boots ...
# ... SSH server starts ...
```
Verify installer is ready:
```bash
# Wait for nodes to appear in DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases
# Test SSH connectivity
ssh root@10.0.100.50 'uname -a'
# Expected: Linux node01 ... nixos
```
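Instead of a fixed sleep, provisioning scripts can poll until the installer's SSH port actually answers. A sketch built around a generic `retry` helper (a local convention), with the SSH loop gated behind `RUN_CHECKS=1`:

```shell
#!/usr/bin/env bash
# Sketch: poll until each installer answers on SSH instead of sleeping
# a fixed interval.
set -euo pipefail

# retry TRIES DELAY CMD... : rerun CMD until it succeeds or TRIES is spent.
retry() {
  local tries="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= tries; i++ )); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Set RUN_CHECKS=1 to wait on the real installer IPs.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  for ip in 10.0.100.50 10.0.100.51 10.0.100.52; do
    retry 60 5 ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no \
      "root@$ip" true && echo "$ip: installer ready"
  done
fi
```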
**Step 4.3: Run nixos-anywhere Simultaneously on All 3**
Create provisioning script:
```bash
#!/bin/bash
# /srv/provisioning/scripts/provision-bootstrap-nodes.sh
set -euo pipefail

NODES=("node01" "node02" "node03")
PROVISION_IPS=("10.0.100.50" "10.0.100.51" "10.0.100.52")
FLAKE_ROOT="/srv/provisioning"

for i in "${!NODES[@]}"; do
  node="${NODES[$i]}"
  ip="${PROVISION_IPS[$i]}"
  echo "Provisioning $node at $ip..."
  nix run github:nix-community/nixos-anywhere -- \
    --flake "$FLAKE_ROOT#$node" \
    --build-on-remote \
    "root@$ip" &
done

wait
echo "All nodes provisioned successfully!"
```
Run provisioning:
```bash
chmod +x /srv/provisioning/scripts/provision-bootstrap-nodes.sh
./provision-bootstrap-nodes.sh
```
**Expected output per node:**
```
Provisioning node01 at 10.0.100.50...
Connecting via SSH...
Running disko to partition disks...
Building NixOS system...
Installing bootloader...
Copying secrets...
Installation complete. Rebooting...
```
**Step 4.4: Wait for First-Boot Automation**
After reboot, nodes will boot from disk and run first-boot automation. Monitor progress:
```bash
# Watch logs on node01 (via SSH after it reboots)
ssh root@10.0.200.10 # Note: now on production network
# Check cluster join services
journalctl -u chainfire-cluster-join.service -f
journalctl -u flaredb-cluster-join.service -f
# Expected log output:
# {"level":"INFO","message":"Waiting for local chainfire service..."}
# {"level":"INFO","message":"Local chainfire healthy"}
# {"level":"INFO","message":"Bootstrap node, cluster initialized"}
# {"level":"INFO","message":"Cluster join complete"}
```
**Step 4.5: Verify Cluster Health**
```bash
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq
# Expected output:
# {
# "members": [
# {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
# {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
# {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
# ]
# }
# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq
# Check IAM service
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected"}
```
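When scripting the bootstrap end to end, it is useful to block until all three members report healthy. A jq-free sketch that counts `"status":"healthy"` entries in the member-list payload shown above; `healthy_count` is a local helper name and the polling loop only runs when `RUN_CHECKS=1`:

```shell
#!/usr/bin/env bash
# Sketch: wait for every Chainfire member to report healthy before
# declaring bootstrap done.
set -euo pipefail

# Count occurrences of "status":"healthy" in a members payload on stdin.
healthy_count() {
  grep -o '"status":"healthy"' | wc -l
}

# Set RUN_CHECKS=1 to poll the real cluster endpoint.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  until [ "$(curl -sk https://node01.example.com:2379/admin/cluster/members | healthy_count)" -ge 3 ]; do
    echo "waiting for 3 healthy Chainfire members..."
    sleep 5
  done
  echo "Chainfire cluster healthy"
fi
```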
### 5.5 Phase 5: Add Additional Nodes
**Step 5.1: Prepare Join-Mode Configurations**
Create configuration for node04 (worker profile):
```json
{
  "node_id": "node04",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.20:2380"
}
```
**Step 5.2: Power On and Provision Nodes**
```bash
# Power on node via BMC
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on
# Wait for PXE boot and SSH ready
sleep 60
# Provision node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  --build-on-remote \
  root@10.0.100.60
```
**Step 5.3: Verify Cluster Join via API**
```bash
# Check cluster members (should include node04)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node04")'
# Expected:
# {"id":"node04","raft_addr":"10.0.200.20:2380","status":"healthy","role":"follower"}
```
**Step 5.4: Validate Replication and Service Distribution**
```bash
# Write test data on leader
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'
# Read from follower (should be replicated)
curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
```
## 6. Verification & Validation
### 6.1 Health Check Commands for All Services
**Chainfire:**
```bash
curl -k https://node01.example.com:2379/health | jq
# Expected: {"status":"healthy","raft":"leader","cluster_size":3}
# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 3 (for initial bootstrap)
```
**FlareDB:**
```bash
curl -k https://node01.example.com:2479/health | jq
# Expected: {"status":"healthy","raft":"leader","chainfire":"connected"}
# Query test metric
curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"up{job=\"node\"}","time":"now"}'
```
**IAM:**
```bash
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected","version":"1.0.0"}
# List users (requires authentication)
curl -k https://node01.example.com:8080/api/users \
  -H "Authorization: Bearer $IAM_TOKEN" | jq
```
**PlasmaVMC:**
```bash
curl -k https://node01.example.com:9090/health | jq
# Expected: {"status":"healthy","vms_running":0}
# List VMs
curl -k https://node01.example.com:9090/api/vms | jq
```
**PrismNET:**
```bash
curl -k https://node01.example.com:9091/health | jq
# Expected: {"status":"healthy","networks":0}
```
**FlashDNS:**
```bash
dig @node01.example.com example.com
# Expected: DNS response with ANSWER section
# Health check
curl -k https://node01.example.com:853/health | jq
```
**FiberLB:**
```bash
curl -k https://node01.example.com:9092/health | jq
# Expected: {"status":"healthy","backends":0}
```
**K8sHost:**
```bash
kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes
# Expected: Node list including this node
```
### 6.2 Cluster Membership Verification
```bash
#!/bin/bash
# /srv/provisioning/scripts/verify-cluster.sh
echo "Checking Chainfire cluster..."
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | {id, status, role}'
echo ""
echo "Checking FlareDB cluster..."
curl -k https://node01.example.com:2479/admin/cluster/members | jq '.members[] | {id, status, role}'
echo ""
echo "Cluster health summary:"
echo " Chainfire nodes: $(curl -sk https://node01.example.com:2379/admin/cluster/members | jq '.members | length')"
echo " FlareDB nodes: $(curl -sk https://node01.example.com:2479/admin/cluster/members | jq '.members | length')"
echo " Raft leaders: Chainfire=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id'), FlareDB=$(curl -sk https://node01.example.com:2479/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')"
```
### 6.3 Raft Leader Election Check
```bash
# Identify current leader
LEADER=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')
echo "Current Chainfire leader: $LEADER"
# Verify all followers can reach leader
for node in node01 node02 node03; do
echo "Checking $node..."
curl -sk https://$node.example.com:2379/admin/cluster/leader | jq
done
```
### 6.4 TLS Certificate Validation
```bash
# Check certificate expiry
for node in node01 node02 node03; do
echo "Checking $node certificate..."
echo | openssl s_client -connect $node.example.com:2379 2>/dev/null | openssl x509 -noout -dates
done
# Verify certificate chain
echo | openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem -verify 1
# Expected: Verify return code: 0 (ok)
```
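The `-dates` output can be turned into a days-remaining figure, which is handier for alerting. A self-contained sketch (it generates a throwaway certificate so nothing here touches real node certs; substitute the real PEM path, and note it assumes GNU `date -d`):

```shell
# Generate a throwaway cert so the snippet is safe to run anywhere
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 90 -subj "/CN=demo" 2>/dev/null
# Extract notAfter and convert to whole days remaining
end=$(openssl x509 -in /tmp/demo-cert.pem -noout -enddate | cut -d= -f2)
days=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
echo "expires in $days days"
```

Wiring this into monitoring (e.g. alert when `days < 30`) catches expiring certificates before section 8.6 becomes relevant.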
### 6.5 Network Connectivity Tests
```bash
# Test inter-node connectivity (from node01)
ssh root@node01.example.com '
for node in node02 node03; do
echo "Testing connectivity to $node..."
nc -zv $node.example.com 2379
nc -zv $node.example.com 2380
done
'
# Test bandwidth (iperf3)
ssh root@node02.example.com 'iperf3 -s' &
ssh root@node01.example.com 'iperf3 -c node02.example.com -t 10'
# Expected: ~10 Gbps on 10GbE, ~1 Gbps on 1GbE
```
### 6.6 Performance Smoke Tests
**Chainfire Write Performance:**
```bash
# 1000 writes
time for i in {1..1000}; do
curl -sk -X PUT https://node01.example.com:2379/v1/kv/test$i \
-H "Content-Type: application/json" \
-d "{\"value\":\"test data $i\"}" > /dev/null
done
# Expected: <10 seconds on healthy cluster
```
**FlareDB Query Performance:**
```bash
# Insert test metrics
curl -k -X POST https://node01.example.com:2479/v1/write \
-H "Content-Type: application/json" \
-d '{"metric":"test_metric","value":42,"timestamp":"'$(date -Iseconds)'"}'
# Query performance
time curl -k https://node01.example.com:2479/v1/query \
-H "Content-Type: application/json" \
-d '{"query":"test_metric","start":"1h","end":"now"}'
# Expected: <100ms response time
```
## 7. Common Operations
### 7.1 Adding a New Node
**Step 1: Prepare Node Configuration**
```bash
# Create node directory
mkdir -p /srv/provisioning/nodes/node05.example.com/secrets
# Copy template configuration
cp /srv/provisioning/nodes/node01.example.com/configuration.nix \
/srv/provisioning/nodes/node05.example.com/
# Edit for new node
vim /srv/provisioning/nodes/node05.example.com/configuration.nix
# Update: hostName, ipAddresses, node_id
```
**Step 2: Generate Cluster Config (Join Mode)**
```json
{
"node_id": "node05",
"node_role": "worker",
"bootstrap": false,
"cluster_name": "prod-cluster",
"leader_url": "https://node01.example.com:2379",
"raft_addr": "10.0.200.21:2380"
}
```
**Step 3: Provision Node**
```bash
# Power on and PXE boot
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis power on
# Wait for SSH
sleep 60
# Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node05 \
root@10.0.100.65
```
**Step 4: Verify Join**
```bash
# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node05")'
```
### 7.2 Replacing a Failed Node
**Step 1: Remove Failed Node from Cluster**
```bash
# Remove from Chainfire cluster
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
# Remove from FlareDB cluster
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02
```
**Step 2: Physically Replace Hardware**
- Power off old node
- Remove from rack
- Install new node
- Connect all cables
- Configure BMC
**Step 3: Provision Replacement Node**
```bash
# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node02 \
root@10.0.100.51
```
**Step 4: Verify Rejoin**
```bash
# Cluster should automatically add node during first-boot
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```
### 7.3 Updating Node Configuration
**Step 1: Edit Configuration**
```bash
vim /srv/provisioning/nodes/node01.example.com/configuration.nix
# Make changes (e.g., add service, change network config)
```
**Step 2: Build and Deploy**
```bash
# Build the system closure locally
nix build /srv/provisioning#nixosConfigurations.node01.config.system.build.toplevel
# Deploy from the provisioning workstation
nixos-rebuild switch --flake /srv/provisioning#node01 --target-host root@node01.example.com
```
**Step 3: Verify Changes**
```bash
# Check active configuration
ssh root@node01.example.com 'nixos-rebuild list-generations'
# Test services still healthy
curl -k https://node01.example.com:2379/health | jq
```
### 7.4 Rolling Updates
**Update Process (One Node at a Time):**
```bash
#!/bin/bash
# /srv/provisioning/scripts/rolling-update.sh
NODES=("node01" "node02" "node03")
for node in "${NODES[@]}"; do
echo "Updating $node..."
# Build the new system closure
nix build /srv/provisioning#nixosConfigurations.$node.config.system.build.toplevel
# Deploy (test mode first, pushed from the provisioning workstation)
nixos-rebuild test --flake /srv/provisioning#$node --target-host root@$node.example.com
# Verify health
if ! curl -sk https://$node.example.com:2379/health | jq -e '.status == "healthy"' > /dev/null; then
echo "ERROR: $node unhealthy after test, aborting"
ssh root@$node.example.com "nixos-rebuild switch --rollback"
exit 1
fi
# Apply permanently
nixos-rebuild switch --flake /srv/provisioning#$node --target-host root@$node.example.com
# Wait for reboot if kernel changed
echo "Waiting 30s for stabilization..."
sleep 30
# Final health check
curl -k https://$node.example.com:2379/health | jq
echo "$node updated successfully"
done
```
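The fixed `sleep 30` in the script above can either waste time or miss a slow restart. A generic retry helper (the `wait_for` name is illustrative, not part of any PlasmaCloud tooling) polls until the check succeeds or a budget is exhausted:

```shell
# Retry a command until it succeeds, up to $1 attempts, one second apart
wait_for() {
  local tries=$1; shift
  local i
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# In the rolling update, replace the fixed sleep with e.g.:
#   wait_for 60 curl -skf https://$node.example.com:2379/health
wait_for 3 true && echo "healthy"
```

The helper returns the command's own exit status semantics, so it composes with the existing `if !` health gate.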
### 7.5 Draining a Node for Maintenance
**Step 1: Mark Node for Drain**
```bash
# Disable node in load balancer (if using one)
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
-d '{"status":"drain"}'
```
**Step 2: Migrate VMs (PlasmaVMC)**
```bash
# List VMs on node
ssh root@node02.example.com 'systemctl list-units | grep plasmavmc-vm@'
# Migrate each VM
curl -k -X POST https://node01.example.com:9090/api/vms/vm-001/migrate \
-d '{"target_node":"node03"}'
```
**Step 3: Stop Services**
```bash
ssh root@node02.example.com '
systemctl stop plasmavmc.service
systemctl stop chainfire.service
systemctl stop flaredb.service
'
```
**Step 4: Perform Maintenance**
```bash
# Reboot for kernel update, hardware maintenance, etc.
ssh root@node02.example.com 'reboot'
```
**Step 5: Re-enable Node**
```bash
# Verify all services healthy
ssh root@node02.example.com 'systemctl status chainfire flaredb plasmavmc'
# Re-enable in load balancer
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
-d '{"status":"active"}'
```
### 7.6 Decommissioning a Node
**Step 1: Drain Node (see 7.5)**
**Step 2: Remove from Cluster**
```bash
# Remove from Chainfire
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
# Remove from FlareDB
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02
# Verify removal
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```
**Step 3: Power Off**
```bash
# Via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin -P password chassis power off
# Or via SSH
ssh root@node02.example.com 'poweroff'
```
**Step 4: Update Inventory**
```bash
# Remove from node inventory
vim /srv/provisioning/inventory.json
# Remove node02 entry
# Remove from DNS
# Update DNS zone to remove node02.example.com
# Remove from monitoring
# Update Prometheus targets to remove node02
```
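If `inventory.json` is machine-readable, the node entry can be pruned with `jq` instead of hand-editing. The `{"nodes":[{"id":…}]}` schema below is an assumption for illustration, not the actual inventory format:

```shell
# Sample inventory with an assumed schema; adjust the filter to the real layout
cat > /tmp/inventory.json <<'EOF'
{"nodes":[{"id":"node01"},{"id":"node02"},{"id":"node03"}]}
EOF
# Delete the node02 entry; write the result back once it looks correct
jq 'del(.nodes[] | select(.id == "node02"))' /tmp/inventory.json
```

Redirect the output to a temp file and `mv` it over the original only after reviewing it.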
## 8. Troubleshooting
### 8.1 PXE Boot Failures
**Symptom:** Server does not obtain IP address or does not boot from network
**Diagnosis:**
```bash
# Monitor DHCP server logs
sudo journalctl -u dhcpd4 -f
# Monitor TFTP requests
sudo tcpdump -i eth0 -n port 69
# Check PXE server services
sudo systemctl status dhcpd4 atftpd nginx
```
**Common Causes:**
1. **DHCP server not running:** `sudo systemctl start dhcpd4`
2. **Wrong network interface:** Check `interfaces` in dhcpd.conf
3. **Firewall blocking DHCP/TFTP:** `sudo iptables -L -n | grep -E "67|68|69"`
4. **PXE not enabled in BIOS:** Enter BIOS and enable Network Boot
5. **Network cable disconnected:** Check physical connection
**Solution:**
```bash
# Restart all PXE services
sudo systemctl restart dhcpd4 atftpd nginx
# Verify DHCP configuration
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf
# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
# Power cycle server
ipmitool -I lanplus -H <bmc-ip> -U admin chassis power cycle
```
### 8.2 Installation Failures (nixos-anywhere)
**Symptom:** nixos-anywhere fails during disk partitioning, installation, or bootloader setup
**Diagnosis:**
```bash
# Check nixos-anywhere output for errors
# Common errors: disk not found, partition table errors, out of space
# SSH to installer for manual inspection
ssh root@10.0.100.50
# Check disk status
lsblk
dmesg | grep -i error
```
**Common Causes:**
1. **Disk device wrong:** Update disko.nix with correct device (e.g., /dev/nvme0n1)
2. **Disk not wiped:** Previous partition table conflicts
3. **Out of disk space:** Insufficient storage for Nix closures
4. **Network issues:** Cannot download packages from binary cache
**Solution:**
```bash
# Manual disk wipe (on installer)
ssh root@10.0.100.50 '
wipefs -a /dev/sda
sgdisk --zap-all /dev/sda
'
# Retry nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node01 \
--debug \
root@10.0.100.50
```
### 8.3 Cluster Join Failures
**Symptom:** Node boots successfully but does not join cluster
**Diagnosis:**
```bash
# Check first-boot logs on the joining node (node04 in this example)
ssh root@node04.example.com 'journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service'
# Common errors:
# - "Health check timeout after 120s"
# - "Join request failed: connection refused"
# - "Configuration file not found"
```
**Bootstrap Mode vs Join Mode:**
- **Bootstrap:** Node expects to create new cluster with peers
- **Join:** Node expects to connect to existing leader
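For contrast with the join-mode example in section 7.1, a bootstrap-mode `cluster-config.json` looks roughly like this (field values, including `node_role`, are illustrative; `initial_peers` follows the naming used in section 9.3):

```json
{
  "node_id": "node01",
  "node_role": "server",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": ["10.0.200.10:2380", "10.0.200.11:2380", "10.0.200.12:2380"]
}
```

A node with `bootstrap: true` but an unreachable peer set will wait rather than join an existing cluster, so mismatched flags are the first thing to rule out.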
**Common Causes:**
1. **Wrong bootstrap flag:** Check cluster-config.json
2. **Leader unreachable:** Network/firewall issue
3. **TLS certificate errors:** Verify cert paths and validity
4. **Service not starting:** Check main service (chainfire.service)
**Solution:**
```bash
# Verify cluster-config.json
ssh root@node04.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'
# Test leader connectivity
ssh root@node04.example.com 'curl -k https://node01.example.com:2379/health'
# Check TLS certificates
ssh root@node04.example.com 'ls -l /etc/nixos/secrets/*.pem'
# Manual cluster join (if automation fails)
curl -k -X POST https://node01.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node04","raft_addr":"10.0.200.20:2380"}'
```
### 8.4 Service Start Failures
**Symptom:** Service fails to start after boot
**Diagnosis:**
```bash
# Check service status
ssh root@node01.example.com 'systemctl status chainfire.service'
# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'
# Common errors:
# - "bind: address already in use" (port conflict)
# - "certificate verify failed" (TLS issue)
# - "permission denied" (file permissions)
```
**Common Causes:**
1. **Port already in use:** Another service using same port
2. **Missing dependencies:** Required service not running
3. **Configuration error:** Invalid config file
4. **File permissions:** Cannot read secrets
**Solution:**
```bash
# Check port usage
ssh root@node01.example.com 'ss -tlnp | grep 2379'
# Verify dependencies
ssh root@node01.example.com 'systemctl list-dependencies chainfire.service'
# Test configuration manually
ssh root@node01.example.com 'chainfire-server --config /etc/nixos/chainfire.toml --check-config'
# Fix permissions
ssh root@node01.example.com 'chmod 600 /etc/nixos/secrets/*-key.pem'
```
### 8.5 Network Connectivity Issues
**Symptom:** Nodes cannot communicate with each other or external services
**Diagnosis:**
```bash
# Test basic connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'
# Test specific ports
ssh root@node01.example.com 'nc -zv node02.example.com 2379'
# Check firewall rules
ssh root@node01.example.com 'iptables -L -n | grep 2379'
# Check routing
ssh root@node01.example.com 'ip route show'
```
**Common Causes:**
1. **Firewall blocking traffic:** Missing iptables rules
2. **Wrong IP address:** Configuration mismatch
3. **Network interface down:** Interface not configured
4. **DNS resolution failure:** Cannot resolve hostnames
**Solution:**
```bash
# Add firewall rules
ssh root@node01.example.com '
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables-save > /etc/iptables/rules.v4
'
# Fix DNS resolution
ssh root@node01.example.com '
echo "10.0.200.11 node02.example.com node02" >> /etc/hosts
'
# Restart networking
ssh root@node01.example.com 'systemctl restart systemd-networkd'
```
### 8.6 TLS Certificate Errors
**Symptom:** Services cannot establish TLS connections
**Diagnosis:**
```bash
# Test TLS connection
openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem
# Check certificate validity
ssh root@node01.example.com '
openssl x509 -in /etc/nixos/secrets/node01-cert.pem -noout -dates
'
# Common errors:
# - "certificate verify failed" (wrong CA)
# - "certificate has expired" (cert expired)
# - "certificate subject name mismatch" (wrong CN)
```
**Common Causes:**
1. **Expired certificate:** Regenerate certificate
2. **Wrong CA certificate:** Verify CA cert is correct
3. **Hostname mismatch:** CN does not match hostname
4. **File permissions:** Cannot read certificate files
**Solution:**
```bash
# Regenerate certificate
openssl req -new -key /srv/provisioning/secrets/node01-key.pem \
-out /srv/provisioning/secrets/node01-csr.pem \
-subj "/CN=node01.example.com"
openssl x509 -req -in /srv/provisioning/secrets/node01-csr.pem \
-CA /srv/provisioning/ca-cert.pem \
-CAkey /srv/provisioning/ca-key.pem \
-CAcreateserial \
-out /srv/provisioning/secrets/node01-cert.pem \
-days 365
# Copy to node
scp /srv/provisioning/secrets/node01-cert.pem root@node01.example.com:/etc/nixos/secrets/
# Restart service
ssh root@node01.example.com 'systemctl restart chainfire.service'
```
### 8.7 Performance Degradation
**Symptom:** Services are slow or unresponsive
**Diagnosis:**
```bash
# Check system load
ssh root@node01.example.com 'uptime'
ssh root@node01.example.com 'top -bn1 | head -20'
# Check disk I/O
ssh root@node01.example.com 'iostat -x 1 5'
# Check network bandwidth
ssh root@node01.example.com 'iftop -t -s 10 -i eth1'
# Check Raft logs for slow operations
ssh root@node01.example.com 'journalctl -u chainfire.service | grep "slow operation"'
```
**Common Causes:**
1. **High CPU usage:** Too many requests, inefficient queries
2. **Disk I/O bottleneck:** Slow disk, too many writes
3. **Network saturation:** Bandwidth exhausted
4. **Memory pressure:** OOM killer active
5. **Raft slow commits:** Network latency between nodes
**Solution:**
```bash
# Add more resources (vertical scaling)
# Or add more nodes (horizontal scaling)
# Check for resource leaks
ssh root@node01.example.com 'systemctl status chainfire | grep Memory'
# Restart service to clear memory leaks (temporary)
ssh root@node01.example.com 'systemctl restart chainfire.service'
# Optimize disk I/O (enable write caching if safe)
ssh root@node01.example.com 'hdparm -W1 /dev/sda'
```
## 9. Rollback & Recovery
### 9.1 NixOS Generation Rollback
NixOS provides atomic rollback capability via generations:
**List Available Generations:**
```bash
ssh root@node01.example.com 'nixos-rebuild list-generations'
# Example output:
# 1 2025-12-10 10:30:00
# 2 2025-12-10 12:45:00 (current)
```
**Rollback to Previous Generation:**
```bash
# Rollback and reboot
ssh root@node01.example.com 'nixos-rebuild switch --rollback'
# Or stage the previous generation as the default boot entry, then reboot into it
ssh root@node01.example.com 'nixos-rebuild boot --rollback && reboot'
```
**Rollback to Specific Generation:**
```bash
ssh root@node01.example.com 'nix-env --switch-generation 1 -p /nix/var/nix/profiles/system'
ssh root@node01.example.com 'reboot'
```
### 9.2 Re-Provisioning from PXE
Complete re-provisioning wipes all data and reinstalls from scratch:
**Step 1: Remove Node from Cluster**
```bash
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02
```
**Step 2: Set Boot to PXE**
```bash
ipmitool -I lanplus -H 10.0.10.51 -U admin chassis bootdev pxe
```
**Step 3: Reboot Node**
```bash
ssh root@node02.example.com 'reboot'
# Or via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin chassis power cycle
```
**Step 4: Run nixos-anywhere**
```bash
# Wait for PXE boot and SSH ready
sleep 90
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node02 \
root@10.0.100.51
```
### 9.3 Disaster Recovery Procedures
**Complete Cluster Loss (All Nodes Down):**
**Step 1: Restore from Backup (if available)**
```bash
# Restore Chainfire data
ssh root@node01.example.com '
systemctl stop chainfire.service
rm -rf /var/lib/chainfire/*
tar -xzf /backup/chainfire-$(date +%Y%m%d).tar.gz -C /  # archive paths are relative to /
systemctl start chainfire.service
'
```
**Step 2: Bootstrap New Cluster**
If no backup, re-provision all nodes as bootstrap:
```bash
# Update cluster-config.json for all nodes
# Set bootstrap=true, same initial_peers
# Provision all 3 nodes
for node in node01 node02 node03; do
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#$node \
root@<node-ip> &
done
wait
```
**Single Node Failure:**
**Step 1: Verify Cluster Quorum**
```bash
# Check remaining nodes have quorum
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 2 (if 3-node cluster with 1 failure)
```
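The quorum rule behind this check: a Raft cluster of N voting members stays writable only while floor(N/2)+1 of them are reachable. A quick arithmetic sketch:

```shell
# Raft quorum: floor(N/2) + 1 voting members must be reachable
for total in 3 5 7; do
  quorum=$(( total / 2 + 1 ))
  echo "cluster=$total quorum=$quorum tolerates=$(( total - quorum )) failures"
done
```

This is why a 3-node cluster survives exactly one failure, and why losing two nodes requires the disaster-recovery path above rather than a simple member removal.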
**Step 2: Remove Failed Node**
```bash
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
```
**Step 3: Provision Replacement**
```bash
# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node02 \
root@10.0.100.51
```
### 9.4 Backup and Restore
**Automated Backup Script:**
```bash
#!/bin/bash
# /srv/provisioning/scripts/backup-cluster.sh
BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Backup Chainfire data
for node in node01 node02 node03; do
ssh root@$node.example.com \
"tar -czf - /var/lib/chainfire" > "$BACKUP_DIR/chainfire-$node.tar.gz"
done
# Backup FlareDB data
for node in node01 node02 node03; do
ssh root@$node.example.com \
"tar -czf - /var/lib/flaredb" > "$BACKUP_DIR/flaredb-$node.tar.gz"
done
# Backup configurations
cp -r /srv/provisioning/nodes "$BACKUP_DIR/configs"
echo "Backup complete: $BACKUP_DIR"
```
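Timestamped backup directories accumulate indefinitely, so a retention sweep is worth appending to the script. The 7-copy limit below is an example policy, not a project default, and the sketch runs against a throwaway directory so it is safe to try:

```shell
# Create ten fake backup directories to demonstrate against
mkdir -p /tmp/backup-demo
for i in 01 02 03 04 05 06 07 08 09 10; do
  mkdir -p "/tmp/backup-demo/cluster-202512${i}-000000"
done
# Keep the 7 newest cluster-* directories, delete the rest
ls -1dt /tmp/backup-demo/cluster-* | tail -n +8 | xargs -r rm -rf
ls -1 /tmp/backup-demo | wc -l
```

For the real script, point the `ls -1dt … | tail -n +8` pipeline at `/backup/cluster-*` after verifying the glob matches only backup directories.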
**Restore Script:**
```bash
#!/bin/bash
# /srv/provisioning/scripts/restore-cluster.sh
BACKUP_DIR="$1"
if [ -z "$BACKUP_DIR" ]; then
echo "Usage: $0 <backup-dir>"
exit 1
fi
# Stop services on all nodes
for node in node01 node02 node03; do
ssh root@$node.example.com 'systemctl stop chainfire flaredb'
done
# Restore Chainfire data
for node in node01 node02 node03; do
cat "$BACKUP_DIR/chainfire-$node.tar.gz" | \
ssh root@$node.example.com "cd / && tar -xzf -"
done
# Restore FlareDB data
for node in node01 node02 node03; do
cat "$BACKUP_DIR/flaredb-$node.tar.gz" | \
ssh root@$node.example.com "cd / && tar -xzf -"
done
# Restart services
for node in node01 node02 node03; do
ssh root@$node.example.com 'systemctl start chainfire flaredb'
done
echo "Restore complete"
```
## 10. Security Best Practices
### 10.1 SSH Key Management
**Generate Dedicated Provisioning Key:**
```bash
ssh-keygen -t ed25519 -C "provisioning@example.com" -f ~/.ssh/id_ed25519_provisioning
```
**Add to Netboot Image:**
```nix
# In netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3Nza... provisioning@example.com"
];
```
**Rotate Keys Regularly:**
```bash
# Generate new key
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_provisioning_new
# Add to all nodes
for node in node01 node02 node03; do
ssh-copy-id -i ~/.ssh/id_ed25519_provisioning_new.pub root@$node.example.com
done
# Remove old key from authorized_keys
# Update netboot image with new key
```
### 10.2 TLS Certificate Rotation
**Automated Rotation Script:**
```bash
#!/bin/bash
# /srv/provisioning/scripts/rotate-certs.sh
# Generate new certificates
for node in node01 node02 node03; do
openssl genrsa -out ${node}-key-new.pem 4096
openssl req -new -key ${node}-key-new.pem -out ${node}-csr.pem \
-subj "/CN=${node}.example.com"
openssl x509 -req -in ${node}-csr.pem \
-CA ca-cert.pem -CAkey ca-key.pem \
-CAcreateserial -out ${node}-cert-new.pem -days 365
done
# Deploy new certificates (without restarting services yet)
for node in node01 node02 node03; do
scp ${node}-cert-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-cert-new.pem
scp ${node}-key-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-key-new.pem
done
# Update configuration to use new certs
# ... (NixOS configuration update) ...
# Rolling restart to apply new certificates
for node in node01 node02 node03; do
ssh root@${node}.example.com 'systemctl restart chainfire flaredb iam'
sleep 30 # Wait for stabilization
done
echo "Certificate rotation complete"
```
### 10.3 Secrets Management
**Best Practices:**
- Store secrets outside Nix store (use `/etc/nixos/secrets/`)
- Set restrictive permissions (0600 for private keys, 0400 for passwords)
- Use environment variables for runtime secrets
- Never commit secrets to Git
- Use encrypted secrets (sops-nix or agenix)
**Example with sops-nix:**
```nix
# In configuration.nix
{
imports = [ <sops-nix/modules/sops> ];
sops.defaultSopsFile = ./secrets.yaml;
sops.secrets."node01/tls-key" = {
owner = "chainfire";
mode = "0400";
};
services.chainfire.settings.tls.key_path = config.sops.secrets."node01/tls-key".path;
}
```
### 10.4 Network Isolation
**VLAN Segmentation:**
- Management VLAN (10): BMC/IPMI, provisioning workstation
- Provisioning VLAN (100): PXE boot, temporary
- Production VLAN (200): Cluster services, inter-node communication
- Client VLAN (300): External clients accessing services
**Firewall Zones:**
```bash
# Example nftables rules
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
# Management from trusted subnet only
iifname "eth0" ip saddr 10.0.10.0/24 tcp dport 22 accept
# Cluster traffic from cluster subnet only
iifname "eth1" ip saddr 10.0.200.0/24 tcp dport { 2379, 2380, 2479, 2480 } accept
# Client traffic from client subnet only
    iifname "eth2" ip saddr 10.0.30.0/24 tcp dport { 8080, 9090 } accept
}
}
```
### 10.5 Audit Logging
**Enable Structured Logging:**
```nix
# In configuration.nix
services.chainfire.settings.logging = {
level = "info";
format = "json";
output = "journal";
};
# Enable journald forwarding to SIEM
services.journald.extraConfig = ''
ForwardToSyslog=yes
Storage=persistent
MaxRetentionSec=7days
'';
```
**Audit Key Events:**
- Cluster membership changes
- Node joins/leaves
- Authentication failures
- Configuration changes
- TLS certificate errors
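With JSON-format logging enabled as above, membership changes can be pulled out with a `jq` filter. The `event` and `node` field names below are assumptions about the log schema, demonstrated against a synthetic sample rather than real journal output:

```shell
# Synthetic audit log; real entries would come from: journalctl -u chainfire.service -o cat
cat > /tmp/audit-sample.log <<'EOF'
{"ts":"2025-12-10T10:00:00Z","event":"member_added","node":"node04"}
{"ts":"2025-12-10T10:05:00Z","event":"kv_write","key":"foo"}
{"ts":"2025-12-10T10:10:00Z","event":"member_removed","node":"node02"}
EOF
# Keep only membership-change events, one line per event
jq -r 'select(.event | startswith("member_")) | "\(.ts) \(.event) \(.node)"' /tmp/audit-sample.log
```

The same filter can feed a SIEM rule or a daily cron report once adjusted to the actual field names.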
**Log Aggregation:**
```bash
# Forward logs to central logging server
# Example: rsyslog configuration
cat > /etc/rsyslog.d/50-remote.conf <<EOF
*.* @@logging-server.example.com:514
EOF
systemctl restart rsyslog
```
---
## Appendix A: Service Port Reference
See [NETWORK.md](NETWORK.md) for complete port matrix.
## Appendix B: Hardware Vendor Commands
See [HARDWARE.md](HARDWARE.md) for vendor-specific BIOS configurations and IPMI commands.
## Appendix C: Complete Command Reference
See [COMMANDS.md](COMMANDS.md) for all commands organized by task.
## Appendix D: Quick Reference Cards
See [QUICKSTART.md](QUICKSTART.md) for condensed deployment guide.
## Appendix E: Deployment Flow Diagrams
See [diagrams/deployment-flow.md](diagrams/deployment-flow.md) for visual workflow.
## Appendix F: Related Documentation
- **Design Document:** `/home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md`
- **PXE Server:** `/home/centra/cloud/chainfire/baremetal/pxe-server/README.md`
- **Image Builder:** `/home/centra/cloud/baremetal/image-builder/README.md`
- **First-Boot Automation:** `/home/centra/cloud/baremetal/first-boot/README.md`
---
**End of Operator Runbook**