
Bare-Metal Provisioning Operator Runbook

Document Version: 1.0
Last Updated: 2025-12-10
Status: Production Ready
Author: PlasmaCloud Infrastructure Team

1. Overview

1.1 What This Runbook Covers

This runbook provides comprehensive, step-by-step instructions for deploying PlasmaCloud infrastructure on bare-metal servers using automated PXE-based provisioning. By following this guide, operators will be able to:

  • Deploy a complete PlasmaCloud cluster from bare hardware to running services
  • Bootstrap a 3-node Raft cluster (Chainfire + FlareDB)
  • Add additional nodes to an existing cluster
  • Validate cluster health and troubleshoot common issues
  • Perform operational tasks (updates, maintenance, recovery)

1.2 Prerequisites

Required Access and Permissions:

  • Root/sudo access on provisioning server
  • Physical or IPMI/BMC access to bare-metal servers
  • Network access to provisioning VLAN
  • SSH key pair for nixos-anywhere

Required Tools:

  • NixOS with flakes enabled (provisioning workstation)
  • curl, jq, ssh client
  • ipmitool (optional, for remote management)
  • Serial console access tool (optional)

Required Knowledge:

  • Basic understanding of PXE boot process
  • Linux system administration
  • Network configuration (DHCP, DNS, firewall)
  • NixOS basics (declarative configuration, flakes)

1.3 Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                    Bare-Metal Provisioning Flow                         │
└─────────────────────────────────────────────────────────────────────────┘

Phase 1: PXE Boot                Phase 2: Installation
┌──────────────┐                  ┌──────────────────┐
│  Bare-Metal  │  1. DHCP Request │   DHCP Server    │
│   Server     ├─────────────────>│  (PXE Server)    │
│              │                  └──────────────────┘
│  (powered    │  2. TFTP Get                │
│   on, PXE    │     bootloader             │
│   enabled)   │<───────────────────────────┘
│              │
│  3. iPXE     │  4. HTTP Get      ┌──────────────────┐
│     loads    │     boot.ipxe     │   HTTP Server    │
│              ├──────────────────>│   (nginx)        │
│              │                   └──────────────────┘
│  5. iPXE     │  6. HTTP Get               │
│     menu     │     kernel+initrd          │
│              │<───────────────────────────┘
│              │
│  7. Boot     │
│     NixOS    │
│     Installer│
└──────┬───────┘
       │
       │  8. SSH Connection         ┌──────────────────┐
       └───────────────────────────>│  Provisioning    │
                                    │  Workstation     │
                                    │                  │
                                    │  9. Run          │
                                    │     nixos-       │
                                    │     anywhere     │
                                    └──────┬───────────┘
                                           │
                      ┌────────────────────┴────────────────────┐
                      │                                          │
                      v                                          v
       ┌──────────────────────────┐          ┌──────────────────────────┐
       │  10. Partition disks     │          │  11. Install NixOS       │
       │      (disko)             │          │      - Build system      │
       │  - GPT/LVM/LUKS          │          │      - Copy closures     │
       │  - Format filesystems    │          │      - Install bootloader│
       │  - Mount /mnt            │          │      - Inject secrets    │
       └──────────────────────────┘          └──────────────────────────┘

Phase 3: First Boot              Phase 4: Running Cluster
┌──────────────┐                 ┌──────────────────┐
│  Bare-Metal  │  12. Reboot     │   NixOS System   │
│   Server     │  ────────────>  │   (from disk)    │
└──────────────┘                 └──────────────────┘
                                          │
                      ┌───────────────────┴────────────────────┐
                      │  13. First-boot automation             │
                      │  - Chainfire cluster join/bootstrap    │
                      │  - FlareDB cluster join/bootstrap      │
                      │  - IAM initialization                  │
                      │  - Health checks                       │
                      └───────────────────┬────────────────────┘
                                          │
                                          v
                                 ┌──────────────────┐
                                 │  Running Cluster │
                                 │  - All services  │
                                 │    healthy       │
                                 │  - Raft quorum   │
                                 │  - TLS enabled   │
                                 └──────────────────┘

2. Hardware Requirements

2.1 Minimum Specifications Per Node

Control Plane Nodes (3-5 recommended):

  • CPU: 8 cores / 16 threads (Intel Xeon or AMD EPYC)
  • RAM: 32 GB DDR4 ECC
  • Storage: 500 GB SSD (NVMe preferred)
  • Network: 2x 10 GbE (bonded/redundant)
  • BMC: IPMI 2.0 or Redfish compatible

Worker Nodes:

  • CPU: 16+ cores / 32+ threads
  • RAM: 64 GB+ DDR4 ECC
  • Storage: 1 TB+ NVMe SSD
  • Network: 2x 10 GbE or 2x 25 GbE
  • BMC: IPMI 2.0 or Redfish compatible

All-in-One (Development/Testing):

  • CPU: 16 cores / 32 threads
  • RAM: 64 GB DDR4
  • Storage: 1 TB SSD
  • Network: 1x 10 GbE (minimum)
  • BMC: Optional but recommended

2.2 Recommended Specifications Per Node

Control Plane Nodes:

  • CPU: 16-32 cores (Intel Xeon Gold/Platinum or AMD EPYC)
  • RAM: 64-128 GB DDR4 ECC
  • Storage: 1-2 TB NVMe SSD (RAID1 for redundancy)
  • Network: 2x 25 GbE (active/active bonding)
  • BMC: Redfish with SOL (Serial-over-LAN)

Worker Nodes:

  • CPU: 32-64 cores
  • RAM: 128-256 GB DDR4 ECC
  • Storage: 2-4 TB NVMe SSD
  • Network: 2x 25 GbE or 2x 100 GbE
  • GPU: Optional (NVIDIA/AMD for ML workloads)

2.3 Hardware Compatibility Matrix

Vendor      Model            Tested   BIOS   UEFI    Notes
Dell        PowerEdge R640   Yes      Yes    Yes     Requires BIOS A19+
Dell        PowerEdge R650   Yes      Yes    Yes     Best PXE compatibility
HPE         ProLiant DL360   Yes      Yes    Yes     Disable Secure Boot
HPE         ProLiant DL380   Yes      Yes    Yes     Latest firmware recommended
Supermicro  SYS-2029U        Yes      Yes    Yes     Requires BMC 1.73+
Lenovo      ThinkSystem      Partial  Yes    Yes     Some NIC issues on older models
Generic     Whitebox x86     Partial  Yes    Maybe   UEFI support varies

2.4 BIOS/UEFI Settings

Required Settings:

  • Boot Mode: UEFI (preferred) or Legacy BIOS
  • PXE/Network Boot: Enabled on primary NIC
  • Boot Order: Network → Disk
  • Secure Boot: Disabled (for PXE boot)
  • Virtualization: Enabled (VT-x/AMD-V)
  • SR-IOV: Enabled (if using advanced networking)

Dell-Specific (iDRAC):

System BIOS → Boot Settings:
  Boot Mode: UEFI
  UEFI Network Stack: Enabled
  PXE Device 1: Integrated NIC 1

System BIOS → System Profile:
  Profile: Performance

HPE-Specific (iLO):

System Configuration → BIOS/Platform:
  Boot Mode: UEFI Mode
  Network Boot: Enabled
  PXE Support: UEFI Only

System Configuration → UEFI Boot Order:
  1. Network Adapter (NIC 1)
  2. Hard Disk

Supermicro-Specific (IPMI):

BIOS Setup → Boot:
  Boot mode select: UEFI
  UEFI Network Stack: Enabled
  Boot Option #1: UEFI Network

BIOS Setup → Advanced → CPU Configuration:
  Intel Virtualization Technology: Enabled

2.5 BMC/IPMI Requirements

Mandatory Features:

  • Remote power control (on/off/reset)
  • Boot device selection (PXE/disk)
  • Remote console access (KVM-over-IP or SOL)

Recommended Features:

  • Virtual media mounting
  • Sensor monitoring (temperature, fans, PSU)
  • Event logging
  • SMTP alerting

Network Configuration:

  • Dedicated BMC network (separate VLAN recommended)
  • Static IP or DHCP reservation
  • HTTPS access enabled
  • Default credentials changed

3. Network Setup

3.1 Network Topology

Single-Segment Topology (Simple):

┌─────────────────────────────────────────────────────┐
│  Provisioning Server    PXE/DHCP/HTTP              │
│  10.0.100.10                                        │
└──────────────┬──────────────────────────────────────┘
               │
               │  Layer 2 Switch (unmanaged)
               │
    ┌──────────┴──────────┬─────────────┐
    │                     │             │
┌───┴────┐          ┌────┴─────┐  ┌───┴────┐
│ Node01 │          │  Node02  │  │ Node03 │
│10.0.100│          │ 10.0.100 │  │10.0.100│
│  .50   │          │   .51    │  │  .52   │
└────────┘          └──────────┘  └────────┘

Multi-VLAN Topology (Production):

┌──────────────────────────────────────────────────────┐
│  Management Network (VLAN 10)                        │
│  - Provisioning Server: 10.0.10.10                   │
│  - BMC/IPMI: 10.0.10.50-99                          │
└──────────────────┬───────────────────────────────────┘
                   │
┌──────────────────┴───────────────────────────────────┐
│  Provisioning Network (VLAN 100)                     │
│  - PXE Boot: 10.0.100.0/24                          │
│  - DHCP Range: 10.0.100.100-200                     │
└──────────────────┬───────────────────────────────────┘
                   │
┌──────────────────┴───────────────────────────────────┐
│  Production Network (VLAN 200)                       │
│  - Static IPs: 10.0.200.10-99                       │
│  - Service Traffic                                   │
└──────────────────┬───────────────────────────────────┘
                   │
          ┌────────┴────────┐
          │  L3 Switch      │
          │  (VLANs, Routing)│
          └────────┬─────────┘
                   │
       ┌───────────┴──────────┬─────────┐
       │                      │         │
  ┌────┴────┐           ┌────┴────┐   │
  │ Node01  │           │ Node02  │   │...
  │ eth0:   │           │ eth0:   │
  │  VLAN100│           │  VLAN100│
  │ eth1:   │           │ eth1:   │
  │  VLAN200│           │  VLAN200│
  └─────────┘           └─────────┘

3.2 DHCP Server Configuration

ISC DHCP Configuration (/etc/dhcp/dhcpd.conf):

# Global options
option architecture-type code 93 = unsigned integer 16;
default-lease-time 600;
max-lease-time 7200;
authoritative;

# Provisioning subnet
subnet 10.0.100.0 netmask 255.255.255.0 {
    range 10.0.100.100 10.0.100.200;
    option routers 10.0.100.1;
    option domain-name-servers 10.0.100.1, 8.8.8.8;
    option domain-name "prov.example.com";

    # PXE boot server
    next-server 10.0.100.10;

    # Architecture-specific boot file selection
    if exists user-class and option user-class = "iPXE" {
        # iPXE already loaded, provide boot script via HTTP
        filename "http://10.0.100.10:8080/boot/ipxe/boot.ipxe";
    } elsif option architecture-type = 00:00 {
        # BIOS (legacy) - load iPXE via TFTP
        filename "undionly.kpxe";
    } elsif option architecture-type = 00:07 {
        # UEFI x86_64 - load iPXE via TFTP
        filename "ipxe.efi";
    } elsif option architecture-type = 00:09 {
        # UEFI x86_64 (alternate) - load iPXE via TFTP
        filename "ipxe.efi";
    } else {
        # Fallback to UEFI
        filename "ipxe.efi";
    }
}

# Static reservations for control plane nodes
host node01 {
    hardware ethernet 52:54:00:12:34:56;
    fixed-address 10.0.100.50;
    option host-name "node01";
}

host node02 {
    hardware ethernet 52:54:00:12:34:57;
    fixed-address 10.0.100.51;
    option host-name "node02";
}

host node03 {
    hardware ethernet 52:54:00:12:34:58;
    fixed-address 10.0.100.52;
    option host-name "node03";
}
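The static reservations above follow a mechanical pattern, so for larger fleets they can be generated from the hardware inventory instead of hand-edited. A minimal sketch: the whitespace-separated inventory format (MAC, hostname, IP per line) is an illustrative assumption, not a project convention.

```shell
# Generate dhcpd host blocks from an inventory of "<mac> <hostname> <ip>"
# lines. The inventory format here is an assumption for illustration.
gen_reservations() {
  while read -r mac host ip; do
    [ -z "$mac" ] && continue
    printf 'host %s {\n' "$host"
    printf '    hardware ethernet %s;\n' "$mac"
    printf '    fixed-address %s;\n' "$ip"
    printf '    option host-name "%s";\n' "$host"
    printf '}\n\n'
  done
}

gen_reservations <<'EOF'
52:54:00:12:34:56 node01 10.0.100.50
52:54:00:12:34:57 node02 10.0.100.51
52:54:00:12:34:58 node03 10.0.100.52
EOF
```

Append the output to dhcpd.conf and re-run the syntax check (`dhcpd -t`) before restarting the service.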

Validation Commands:

# Test DHCP configuration syntax
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Start DHCP server
sudo systemctl start isc-dhcp-server
sudo systemctl enable isc-dhcp-server

# Monitor DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases

# Test DHCP response
sudo nmap --script broadcast-dhcp-discover -e eth0

3.3 DNS Requirements

Forward DNS Zone (example.com):

; Control plane nodes
node01.example.com.    IN  A    10.0.200.10
node02.example.com.    IN  A    10.0.200.11
node03.example.com.    IN  A    10.0.200.12

; Worker nodes
worker01.example.com.  IN  A    10.0.200.20
worker02.example.com.  IN  A    10.0.200.21

; Service VIPs (optional, for load balancing)
chainfire.example.com. IN  A    10.0.200.100
flaredb.example.com.   IN  A    10.0.200.101
iam.example.com.       IN  A    10.0.200.102

Reverse DNS Zone (200.0.10.in-addr.arpa):

; Control plane nodes
10.200.0.10.in-addr.arpa.  IN  PTR  node01.example.com.
11.200.0.10.in-addr.arpa.  IN  PTR  node02.example.com.
12.200.0.10.in-addr.arpa.  IN  PTR  node03.example.com.

Validation:

# Test forward resolution
dig +short node01.example.com

# Test reverse resolution
dig +short -x 10.0.200.10

# Test from target node after provisioning
ssh root@10.0.100.50 'hostname -f'
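The PTR records are the A records with their octets reversed, so the reverse zone can be derived from the forward zone rather than maintained by hand; drift between the two zones is a common cause of `hostname -f` failures. A sketch, assuming the zone file layout shown above:

```shell
# Derive /24 reverse-zone PTR records from forward A records on stdin.
gen_ptr() {
  awk '$2 == "IN" && $3 == "A" {
    split($4, o, ".")
    # Reverse the octets to form the in-addr.arpa owner name.
    printf "%s.%s.%s.%s.in-addr.arpa.  IN  PTR  %s\n", o[4], o[3], o[2], o[1], $1
  }'
}

gen_ptr <<'EOF'
node01.example.com.    IN  A    10.0.200.10
node02.example.com.    IN  A    10.0.200.11
EOF
```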

3.4 Firewall Rules

Service Port Matrix (see NETWORK.md for complete reference):

Service API Port Raft Port Additional Protocol
Chainfire 2379 2380 2381 (gossip) TCP
FlareDB 2479 2480 - TCP
IAM 8080 - - TCP
PlasmaVMC 9090 - - TCP
NovaNET 9091 - - TCP
FlashDNS 53 - - TCP/UDP
FiberLB 9092 - - TCP
K8sHost 10250 - - TCP

iptables Rules (Provisioning Server):

#!/bin/bash
# Provisioning server firewall rules

# Allow DHCP
iptables -A INPUT -p udp --dport 67 -j ACCEPT
iptables -A INPUT -p udp --dport 68 -j ACCEPT

# Allow TFTP
iptables -A INPUT -p udp --dport 69 -j ACCEPT

# Allow HTTP (boot server)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT

# Allow SSH (for nixos-anywhere)
iptables -A INPUT -p tcp --dport 22 -j ACCEPT

iptables Rules (Cluster Nodes):

#!/bin/bash
# Cluster node firewall rules

# Allow SSH (management)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT

# Allow Chainfire (from cluster subnet only)
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2381 -s 10.0.200.0/24 -j ACCEPT

# Allow FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2480 -s 10.0.200.0/24 -j ACCEPT

# Allow IAM (from cluster and client subnets)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT

# Drop all other traffic
iptables -A INPUT -j DROP

nftables Rules (Modern Alternative):

#!/usr/sbin/nft -f

flush ruleset

table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;

        # Allow established connections
        ct state established,related accept

        # Allow loopback
        iif lo accept

        # Allow SSH
        tcp dport 22 ip saddr 10.0.0.0/8 accept

        # Allow cluster services from cluster subnet
        tcp dport { 2379, 2380, 2381, 2479, 2480 } ip saddr 10.0.200.0/24 accept

        # Allow IAM from internal network
        tcp dport 8080 ip saddr 10.0.0.0/8 accept
    }
}

3.5 Static IP Allocation Strategy

IP Allocation Plan:

10.0.100.0/24  - Provisioning network (DHCP during install)
  .1           - Gateway
  .10          - PXE/DHCP/HTTP server
  .50-.79      - Control plane nodes (static reservations)
  .80-.99      - Worker nodes (static reservations)
  .100-.200    - DHCP pool (temporary during provisioning)

10.0.200.0/24  - Production network (static IPs)
  .1           - Gateway
  .10-.19      - Control plane nodes
  .20-.99      - Worker nodes
  .100-.199    - Service VIPs
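Because the plan assigns control-plane addresses by index (node i gets 10.0.200.(10+i)), name/address pairs for ancillary files such as /etc/hosts can be generated rather than typed. The index-based mapping below is a sketch of that convention, not a hard project rule:

```shell
# Emit production-network /etc/hosts entries for the control plane,
# following the allocation plan above (10.0.200.10 onward).
gen_hosts() {
  for i in 0 1 2; do
    printf '10.0.200.%d  node%02d.example.com node%02d\n' \
      $((10 + i)) $((i + 1)) $((i + 1))
  done
}
gen_hosts
```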

3.6 Network Bandwidth Requirements

Per-Node During Provisioning:

  • PXE boot: ~200-500 MB (kernel + initrd)
  • nixos-anywhere: ~1-5 GB (NixOS closures)
  • Time: 5-15 minutes on 1 Gbps link

Production Cluster:

  • Control plane: 1 Gbps minimum, 10 Gbps recommended
  • Workers: 10 Gbps minimum, 25 Gbps recommended
  • Inter-node latency: <1ms ideal, <5ms acceptable
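The provisioning time estimate can be sanity-checked with integer arithmetic: raw transfer time is payload bytes times 8 divided by link speed, and build plus disk time comes on top, which is why the guide quotes 5-15 minutes rather than the raw figure. A quick sketch:

```shell
# Lower bound on closure copy time: payload_bytes * 8 / link_bits_per_second.
xfer_seconds() { echo $(( $1 * 8 / $2 )); }

# 5 GiB of closures over a 1 Gbps link:
echo "$(xfer_seconds $((5 * 1024**3)) $((10**9)))s minimum transfer time"
```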

4. Pre-Deployment Checklist

Complete this checklist before beginning deployment:

4.1 Hardware Checklist

  • All servers racked and powered
  • All network cables connected (data + BMC)
  • All power supplies connected (redundant if available)
  • BMC/IPMI network configured
  • BMC credentials documented
  • BIOS/UEFI settings configured per section 2.4
  • PXE boot enabled and first in boot order
  • Secure Boot disabled (if using UEFI)
  • Hardware inventory recorded (MAC addresses, serial numbers)

4.2 Network Checklist

  • Network switches configured (VLANs, trunking)
  • DHCP server configured and tested
  • DNS forward/reverse zones created
  • Firewall rules configured
  • Network connectivity verified (ping tests)
  • Bandwidth validated (iperf between nodes)
  • DHCP relay configured (if multi-subnet)
  • NTP server configured for time sync

4.3 PXE Server Checklist

  • PXE server deployed (see T032.S2)
  • DHCP service running and healthy
  • TFTP service running and healthy
  • HTTP service running and healthy
  • iPXE bootloaders downloaded (undionly.kpxe, ipxe.efi)
  • NixOS netboot images built and uploaded (see T032.S3)
  • Boot script configured (boot.ipxe)
  • Health endpoints responding

Validation:

# On PXE server
sudo systemctl status isc-dhcp-server
sudo systemctl status atftpd
sudo systemctl status nginx

# Test HTTP access
curl http://10.0.100.10:8080/boot/ipxe/boot.ipxe
curl http://10.0.100.10:8080/health

# Test TFTP access
tftp 10.0.100.10 -c get undionly.kpxe /tmp/test.kpxe

4.4 Node Configuration Checklist

  • Per-node NixOS configurations created (/srv/provisioning/nodes/)
  • Hardware configurations generated or templated
  • Disko disk layouts defined
  • Network settings configured (static IPs, VLANs)
  • Service selections defined (control-plane vs worker)
  • Cluster configuration JSON files created
  • Node inventory documented (MAC → hostname → role)

4.5 TLS Certificates Checklist

  • CA certificate generated
  • Per-node certificates generated
  • Certificate files copied to secrets directories
  • Certificate permissions set (0400 for private keys)
  • Certificate expiry dates documented
  • Rotation procedure documented

Generate Certificates:

# Generate CA (if not already done)
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates (include a subjectAltName; modern TLS
# clients reject certificates that only carry a CN)
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365 \
    -extfile <(printf "subjectAltName=DNS:%s.example.com" "${node}")
done

4.6 Provisioning Workstation Checklist

  • NixOS or Nix package manager installed
  • Nix flakes enabled
  • SSH key pair generated for provisioning
  • SSH public key added to netboot images
  • Network access to provisioning VLAN
  • Git repository cloned (if using version control)
  • nixos-anywhere installed: nix profile install github:nix-community/nixos-anywhere

5. Deployment Workflow

5.1 Phase 1: PXE Server Setup

Reference: See /home/centra/cloud/chainfire/baremetal/pxe-server/ (T032.S2)

Step 1.1: Deploy PXE Server Using NixOS Module

Create PXE server configuration:

# /etc/nixos/pxe-server.nix
{ config, pkgs, lib, ... }:

{
  imports = [
    /path/to/chainfire/baremetal/pxe-server/nixos-module.nix
  ];

  services.centra-pxe-server = {
    enable = true;
    interface = "eth0";
    serverAddress = "10.0.100.10";

    dhcp = {
      subnet = "10.0.100.0";
      netmask = "255.255.255.0";
      broadcast = "10.0.100.255";
      range = {
        start = "10.0.100.100";
        end = "10.0.100.200";
      };
      router = "10.0.100.1";
      domainNameServers = [ "10.0.100.1" "8.8.8.8" ];
    };

    nodes = {
      "52:54:00:12:34:56" = {
        profile = "control-plane";
        hostname = "node01";
        ipAddress = "10.0.100.50";
      };
      "52:54:00:12:34:57" = {
        profile = "control-plane";
        hostname = "node02";
        ipAddress = "10.0.100.51";
      };
      "52:54:00:12:34:58" = {
        profile = "control-plane";
        hostname = "node03";
        ipAddress = "10.0.100.52";
      };
    };
  };
}

Apply configuration:

sudo nixos-rebuild switch -I nixos-config=/etc/nixos/pxe-server.nix

Step 1.2: Verify PXE Services

# Check all services are running
sudo systemctl status dhcpd4.service
sudo systemctl status atftpd.service
sudo systemctl status nginx.service

# Test DHCP server
sudo journalctl -u dhcpd4 -f &
# Power on a test server and watch for DHCP requests

# Test TFTP server
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
ls -lh /tmp/test.kpxe  # Should show ~100KB file

# Test HTTP server
curl http://localhost:8080/health
# Expected: {"status":"healthy","services":{"dhcp":"running","tftp":"running","http":"running"}}

curl http://localhost:8080/boot/ipxe/boot.ipxe
# Expected: iPXE boot script content

5.2 Phase 2: Build Netboot Images

Reference: See /home/centra/cloud/baremetal/image-builder/ (T032.S3)

Step 2.1: Build Images for All Profiles

cd /home/centra/cloud/baremetal/image-builder

# Build all profiles
./build-images.sh

# Or build specific profile
./build-images.sh --profile control-plane
./build-images.sh --profile worker
./build-images.sh --profile all-in-one

Expected Output:

Building netboot image for control-plane...
Building initrd...
[... Nix build output ...]
✓ Build complete: artifacts/control-plane/initrd (234 MB)
✓ Build complete: artifacts/control-plane/bzImage (12 MB)

Step 2.2: Copy Images to PXE Server

# Automatic (if PXE server directory exists)
./build-images.sh --deploy

# Manual copy
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
sudo cp artifacts/all-in-one/* /var/lib/pxe-boot/nixos/all-in-one/

Step 2.3: Verify Image Integrity

# Check file sizes (should be reasonable)
ls -lh /var/lib/pxe-boot/nixos/*/

# Verify images are accessible via HTTP
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/bzImage
# Expected: HTTP/1.1 200 OK, Content-Length: ~12000000

curl -I http://10.0.100.10:8080/boot/nixos/control-plane/initrd
# Expected: HTTP/1.1 200 OK, Content-Length: ~234000000
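Size checks catch a missing file but not a corrupt or truncated copy; recording checksums at build time and verifying them on the PXE server closes that gap. The sketch below demonstrates the mechanism in a scratch directory; substitute the real artifact paths from the steps above:

```shell
# Verify a directory against its recorded SHA256SUMS manifest.
verify_images() {
  local dir=$1
  ( cd "$dir" && sha256sum -c SHA256SUMS )
}

# Demonstration with a scratch file standing in for a real image:
workdir=$(mktemp -d)
printf 'kernel-bytes' > "$workdir/bzImage"
( cd "$workdir" && sha256sum bzImage > SHA256SUMS )
verify_images "$workdir"
```

In practice, run `sha256sum artifacts/*/bzImage artifacts/*/initrd > SHA256SUMS` on the build host and `sha256sum -c` against the copies under /var/lib/pxe-boot/nixos/.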

5.3 Phase 3: Prepare Node Configurations

Step 3.1: Generate Node-Specific NixOS Configs

Create directory structure:

mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets

Node Configuration Template (nodes/node01.example.com/configuration.nix):

{ config, pkgs, lib, ... }:

{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  # Hostname and domain
  networking = {
    hostName = "node01";
    domain = "example.com";
    usePredictableInterfaceNames = false;  # Use eth0, eth1

    # Provisioning interface (temporary)
    interfaces.eth0 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.100.50";
        prefixLength = 24;
      }];
    };

    # Production interface
    interfaces.eth1 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.200.10";
        prefixLength = 24;
      }];
    };

    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" "8.8.8.8" ];
  };

  # Enable PlasmaCloud services
  services.chainfire = {
    enable = true;
    port = 2379;
    raftPort = 2380;
    gossipPort = 2381;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.flaredb = {
    enable = true;
    port = 2479;
    raftPort = 2480;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      chainfire_endpoint = "https://localhost:2379";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.iam = {
    enable = true;
    port = 8080;
    settings = {
      flaredb_endpoint = "https://localhost:2479";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  # Enable first-boot automation
  services.first-boot-automation = {
    enable = true;
    configFile = "/etc/nixos/secrets/cluster-config.json";
  };

  system.stateVersion = "24.11";
}
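Most of the configuration above differs between nodes only in the hostname and two addresses, so the per-node files can be stamped out from a template. The `@TOKEN@` placeholder convention below is illustrative, not part of the repository:

```shell
# Render a per-node config from a template on stdin by substituting
# hostname and addresses. The @TOKEN@ convention is an assumption.
render_node_config() {
  local host=$1 prov_ip=$2 prod_ip=$3
  sed -e "s/@HOSTNAME@/$host/g" \
      -e "s/@PROV_IP@/$prov_ip/g" \
      -e "s/@PROD_IP@/$prod_ip/g"
}

render_node_config node02 10.0.100.51 10.0.200.11 <<'EOF'
  networking.hostName = "@HOSTNAME@";
  # provisioning: @PROV_IP@  production: @PROD_IP@
EOF
```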

Step 3.2: Create cluster-config.json for Each Node

Bootstrap Node (node01):

{
  "node_id": "node01",
  "node_role": "control-plane",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ],
  "flaredb_peers": [
    "node01.example.com:2480",
    "node02.example.com:2480",
    "node03.example.com:2480"
  ]
}

Copy to secrets:

cp cluster-config-node01.json /srv/provisioning/nodes/node01.example.com/secrets/cluster-config.json
cp cluster-config-node02.json /srv/provisioning/nodes/node02.example.com/secrets/cluster-config.json
cp cluster-config-node03.json /srv/provisioning/nodes/node03.example.com/secrets/cluster-config.json

Step 3.3: Generate Disko Disk Layouts

Simple Single-Disk Layout (nodes/node01.example.com/disko.nix):

{ disks ? [ "/dev/sda" ], ... }:
{
  disko.devices = {
    disk = {
      main = {
        type = "disk";
        device = builtins.head disks;
        content = {
          type = "gpt";
          partitions = {
            ESP = {
              size = "1G";
              type = "EF00";
              content = {
                type = "filesystem";
                format = "vfat";
                mountpoint = "/boot";
              };
            };
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "ext4";
                mountpoint = "/";
              };
            };
          };
        };
      };
    };
  };
}

Step 3.4: Pre-Generate TLS Certificates

# Copy per-node certificates
cp ca-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-key.pem /srv/provisioning/nodes/node01.example.com/secrets/

# Set permissions (0400 on private keys, per section 4.5)
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/*-cert.pem
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/ca-cert.pem
chmod 400 /srv/provisioning/nodes/node01.example.com/secrets/*-key.pem
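The copy-and-chmod steps repeat per node, so a small helper can distribute all certificates with the section 4.5 permission policy (0400 keys, 0644 certs) in one pass. A sketch; the source directory argument is an assumption, and the destination layout follows section 5.3:

```shell
# Copy CA and per-node certificates into each node's secrets directory,
# applying permissions as they are installed.
distribute_certs() {
  local src=$1 dest_root=$2
  for node in node01 node02 node03; do
    local dest="$dest_root/$node.example.com/secrets"
    mkdir -p "$dest"
    install -m 644 "$src/ca-cert.pem"    "$dest/"
    install -m 644 "$src/$node-cert.pem" "$dest/"
    install -m 400 "$src/$node-key.pem"  "$dest/"
  done
}
# Usage (paths illustrative):
#   distribute_certs /path/to/generated-certs /srv/provisioning/nodes
```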

5.4 Phase 4: Bootstrap First 3 Nodes

Step 4.1: Power On Nodes via BMC

# Using ipmitool (example for Dell/HP/Supermicro)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
  ipmitool -I lanplus -H $ip -U admin -P password chassis bootdev pxe options=persistent
  ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done

Step 4.2: Verify PXE Boot Success

Watch serial console (if available):

# Connect via IPMI SOL
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol activate

# Expected output:
# ... DHCP discovery ...
# ... TFTP download undionly.kpxe or ipxe.efi ...
# ... iPXE menu appears ...
# ... Kernel and initrd download ...
# ... NixOS installer boots ...
# ... SSH server starts ...

Verify installer is ready:

# Wait for nodes to appear in DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases

# Test SSH connectivity
ssh root@10.0.100.50 'uname -a'
# Expected: Linux node01 ... nixos

Step 4.3: Run nixos-anywhere Simultaneously on All 3

Create provisioning script:

#!/bin/bash
# /srv/provisioning/scripts/provision-bootstrap-nodes.sh

set -euo pipefail

NODES=("node01" "node02" "node03")
PROVISION_IPS=("10.0.100.50" "10.0.100.51" "10.0.100.52")
FLAKE_ROOT="/srv/provisioning"

pids=()
for i in "${!NODES[@]}"; do
  node="${NODES[$i]}"
  ip="${PROVISION_IPS[$i]}"

  echo "Provisioning $node at $ip..."

  nix run github:nix-community/nixos-anywhere -- \
    --flake "$FLAKE_ROOT#$node" \
    --build-on-remote \
    "root@$ip" &
  pids+=($!)
done

# Wait on each PID individually so a failed install is not masked
# by the other background jobs (a bare `wait` ignores their exit codes).
for pid in "${pids[@]}"; do
  wait "$pid"
done
echo "All nodes provisioned successfully!"

Run provisioning:

chmod +x /srv/provisioning/scripts/provision-bootstrap-nodes.sh
./provision-bootstrap-nodes.sh

Expected output per node:

Provisioning node01 at 10.0.100.50...
Connecting via SSH...
Running disko to partition disks...
Building NixOS system...
Installing bootloader...
Copying secrets...
Installation complete. Rebooting...

Step 4.4: Wait for First-Boot Automation

After reboot, nodes will boot from disk and run first-boot automation. Monitor progress:

# Watch logs on node01 (via SSH after it reboots)
ssh root@10.0.200.10  # Note: now on production network

# Check cluster join services
journalctl -u chainfire-cluster-join.service -f
journalctl -u flaredb-cluster-join.service -f

# Expected log output:
# {"level":"INFO","message":"Waiting for local chainfire service..."}
# {"level":"INFO","message":"Local chainfire healthy"}
# {"level":"INFO","message":"Bootstrap node, cluster initialized"}
# {"level":"INFO","message":"Cluster join complete"}

Step 4.5: Verify Cluster Health

# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Expected output:
# {
#   "members": [
#     {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
#     {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
#     {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
#   ]
# }

# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Check IAM service
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected"}
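For scripted verification, the member health can be counted directly from the /admin/cluster/members response instead of read by eye; the sketch below works on the response shape shown above and avoids a jq dependency:

```shell
# Count healthy members in a /admin/cluster/members response on stdin.
healthy_count() {
  grep -o '"status":"healthy"' | wc -l
}

# Demonstration against a canned response matching the example above;
# in practice pipe `curl -sk .../admin/cluster/members` into healthy_count.
sample='{"members":[
 {"id":"node01","status":"healthy","role":"leader"},
 {"id":"node02","status":"healthy","role":"follower"},
 {"id":"node03","status":"healthy","role":"follower"}]}'
n=$(printf '%s' "$sample" | healthy_count)
echo "healthy members: $n"
[ "$n" -eq 3 ] || echo "WARNING: expected 3 healthy members"
```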

5.5 Phase 5: Add Additional Nodes

Step 5.1: Prepare Join-Mode Configurations

Create configuration for node04 (worker profile):

{
  "node_id": "node04",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.20:2380"
}

Step 5.2: Power On and Provision Nodes

# Power on node via BMC
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on

# Wait for PXE boot and SSH ready
sleep 60

# Provision node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  --build-on-remote \
  root@10.0.100.60

Step 5.3: Verify Cluster Join via API

# Check cluster members (should include node04)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node04")'

# Expected:
# {"id":"node04","raft_addr":"10.0.200.20:2380","status":"healthy","role":"follower"}

Step 5.4: Validate Replication and Service Distribution

# Write test data on leader
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'

# Read from follower (should be replicated)
curl -k https://node02.example.com:2379/v1/kv/test | jq

# Expected: {"key":"test","value":"hello world"}

6. Verification & Validation

6.1 Health Check Commands for All Services

Chainfire:

curl -k https://node01.example.com:2379/health | jq
# Expected: {"status":"healthy","raft":"leader","cluster_size":3}

# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 3 (for initial bootstrap)

FlareDB:

curl -k https://node01.example.com:2479/health | jq
# Expected: {"status":"healthy","raft":"leader","chainfire":"connected"}

# Query test metric
curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"up{job=\"node\"}","time":"now"}'

IAM:

curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected","version":"1.0.0"}

# List users (requires authentication)
curl -k https://node01.example.com:8080/api/users \
  -H "Authorization: Bearer $IAM_TOKEN" | jq

PlasmaVMC:

curl -k https://node01.example.com:9090/health | jq
# Expected: {"status":"healthy","vms_running":0}

# List VMs
curl -k https://node01.example.com:9090/api/vms | jq

NovaNET:

curl -k https://node01.example.com:9091/health | jq
# Expected: {"status":"healthy","networks":0}

FlashDNS:

dig @node01.example.com example.com
# Expected: DNS response with ANSWER section

# Health check
curl -k https://node01.example.com:853/health | jq

FiberLB:

curl -k https://node01.example.com:9092/health | jq
# Expected: {"status":"healthy","backends":0}

K8sHost:

kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes
# Expected: Node list including this node

6.2 Cluster Membership Verification

#!/bin/bash
# /srv/provisioning/scripts/verify-cluster.sh

echo "Checking Chainfire cluster..."
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | {id, status, role}'

echo ""
echo "Checking FlareDB cluster..."
curl -k https://node01.example.com:2479/admin/cluster/members | jq '.members[] | {id, status, role}'

echo ""
echo "Cluster health summary:"
echo "  Chainfire nodes: $(curl -sk https://node01.example.com:2379/admin/cluster/members | jq '.members | length')"
echo "  FlareDB nodes: $(curl -sk https://node01.example.com:2479/admin/cluster/members | jq '.members | length')"
echo "  Raft leaders: Chainfire=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id'), FlareDB=$(curl -sk https://node01.example.com:2479/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')"

6.3 Raft Leader Election Check

# Identify current leader
LEADER=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')
echo "Current Chainfire leader: $LEADER"

# Verify all followers can reach leader
for node in node01 node02 node03; do
  echo "Checking $node..."
  curl -sk https://$node.example.com:2379/admin/cluster/leader | jq
done
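
A healthy cluster reports exactly one leader across all members. A sketch that asserts this against a captured members payload (payload shape taken from the examples above):

```shell
#!/usr/bin/env bash
# Count leader entries in an /admin/cluster/members payload.
# Payload shape follows this runbook's examples.
leader_count() {
  grep -o '"role":"leader"' | wc -l
}

members='[{"id":"node01","role":"leader"},{"id":"node02","role":"follower"},{"id":"node03","role":"follower"}]'
n="$(printf '%s' "$members" | leader_count)"
if [ "$n" -eq 1 ]; then
  echo "OK: exactly one leader"
else
  echo "ERROR: $n leaders (no leader, or split brain)"
fi
```

In practice, capture the payload with `curl -sk https://node01.example.com:2379/admin/cluster/members` and pipe it into `leader_count`.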

6.4 TLS Certificate Validation

# Check certificate expiry
for node in node01 node02 node03; do
  echo "Checking $node certificate..."
  echo | openssl s_client -connect $node.example.com:2379 2>/dev/null | openssl x509 -noout -dates
done

# Verify certificate chain
echo | openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem -verify 1
# Expected: Verify return code: 0 (ok)
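
To turn the expiry dates above into an actionable number, a sketch that computes days until expiry, demonstrated on a throwaway self-signed certificate (assumes GNU `date -d` and openssl):

```shell
#!/usr/bin/env bash
# Days until a certificate expires; demonstrated on a throwaway
# self-signed cert. Assumes GNU date (-d) and openssl.
set -eu

days_left() {
  local cert="$1" end
  end="$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)"
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

tmp="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" -subj "/CN=demo" 2>/dev/null
d="$(days_left "$tmp/cert.pem")"
echo "certificate expires in $d days"
if [ "$d" -lt 30 ]; then
  echo "WARN: renew soon"
fi
```

Point `days_left` at `/etc/nixos/secrets/node01-cert.pem` on each node to drive a renewal alert.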

6.5 Network Connectivity Tests

# Test inter-node connectivity (from node01)
ssh root@node01.example.com '
  for node in node02 node03; do
    echo "Testing connectivity to $node..."
    nc -zv $node.example.com 2379
    nc -zv $node.example.com 2380
  done
'

# Test bandwidth (iperf3)
ssh root@node02.example.com 'iperf3 -s' &
ssh root@node01.example.com 'iperf3 -c node02.example.com -t 10'
# Expected: ~10 Gbps on 10GbE, ~1 Gbps on 1GbE

6.6 Performance Smoke Tests

Chainfire Write Performance:

# 1000 writes
time for i in {1..1000}; do
  curl -sk -X PUT https://node01.example.com:2379/v1/kv/test$i \
    -H "Content-Type: application/json" \
    -d "{\"value\":\"test data $i\"}" > /dev/null
done

# Expected: <10 seconds on healthy cluster

FlareDB Query Performance:

# Insert test metrics
curl -k -X POST https://node01.example.com:2479/v1/write \
  -H "Content-Type: application/json" \
  -d '{"metric":"test_metric","value":42,"timestamp":"'$(date -Iseconds)'"}'

# Query performance
time curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"test_metric","start":"1h","end":"now"}'

# Expected: <100ms response time
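
A single `time curl` gives one sample; tail latency matters more. A sketch that summarizes many samples into percentiles (the percentile math is a plain ceil-rank computation, not tied to any PhotonCloud tooling):

```shell
#!/usr/bin/env bash
# Summarize per-request latencies (seconds, one value per line) into
# p50/p95/max. Feed it curl timings, e.g.:
#   for i in $(seq 1 100); do
#     curl -sk -o /dev/null -w '%{time_total}\n' https://node01.example.com:2479/health
#   done | latency_summary
latency_summary() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      i50 = int((NR * 50 + 99) / 100)   # ceil(NR * 0.50), 1-based rank
      i95 = int((NR * 95 + 99) / 100)   # ceil(NR * 0.95)
      printf "n=%d p50=%s p95=%s max=%s\n", NR, v[i50], v[i95], v[NR]
    }'
}

printf '0.4\n0.1\n0.3\n0.2\n' | latency_summary
```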

7. Common Operations

7.1 Adding a New Node

Step 1: Prepare Node Configuration

# Create node directory
mkdir -p /srv/provisioning/nodes/node05.example.com/secrets

# Copy template configuration
cp /srv/provisioning/nodes/node01.example.com/configuration.nix \
   /srv/provisioning/nodes/node05.example.com/

# Edit for new node
vim /srv/provisioning/nodes/node05.example.com/configuration.nix
# Update: hostName, ipAddresses, node_id

Step 2: Generate Cluster Config (Join Mode)

{
  "node_id": "node05",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.21:2380"
}

Step 3: Provision Node

# Power on and PXE boot
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis power on

# Wait for SSH
sleep 60

# Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node05 \
  root@10.0.100.65

Step 4: Verify Join

# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node05")'

7.2 Replacing a Failed Node

Step 1: Remove Failed Node from Cluster

# Remove from Chainfire cluster
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02

# Remove from FlareDB cluster
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02

Step 2: Physically Replace Hardware

  • Power off old node
  • Remove from rack
  • Install new node
  • Connect all cables
  • Configure BMC

Step 3: Provision Replacement Node

# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51

Step 4: Verify Rejoin

# The replacement should automatically rejoin the cluster during first boot
curl -k https://node01.example.com:2379/admin/cluster/members | jq

7.3 Updating Node Configuration

Step 1: Edit Configuration

vim /srv/provisioning/nodes/node01.example.com/configuration.nix
# Make changes (e.g., add service, change network config)

Step 2: Build and Deploy

# Build configuration locally
nix build /srv/provisioning#node01

# Deploy to node (from node or remote)
nixos-rebuild switch --flake /srv/provisioning#node01

Step 3: Verify Changes

# Check active configuration
ssh root@node01.example.com 'nixos-rebuild list-generations'

# Test services still healthy
curl -k https://node01.example.com:2379/health | jq

7.4 Rolling Updates

Update Process (One Node at a Time):

#!/bin/bash
# /srv/provisioning/scripts/rolling-update.sh

NODES=("node01" "node02" "node03")

for node in "${NODES[@]}"; do
  echo "Updating $node..."

  # Build new configuration
  nix build /srv/provisioning#$node

  # Deploy (test mode first)
  ssh root@$node.example.com "nixos-rebuild test --flake /srv/provisioning#$node"

  # Verify health
  if ! curl -sk https://$node.example.com:2379/health | jq -e '.status == "healthy"' > /dev/null; then
    echo "ERROR: $node unhealthy after test, aborting"
    ssh root@$node.example.com "nixos-rebuild switch --rollback"
    exit 1
  fi

  # Apply permanently
  ssh root@$node.example.com "nixos-rebuild switch --flake /srv/provisioning#$node"

  # Wait for reboot if kernel changed
  echo "Waiting 30s for stabilization..."
  sleep 30

  # Final health check
  curl -k https://$node.example.com:2379/health | jq

  echo "$node updated successfully"
done

7.5 Draining a Node for Maintenance

Step 1: Mark Node for Drain

# Disable node in load balancer (if using one)
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
  -d '{"status":"drain"}'

Step 2: Migrate VMs (PlasmaVMC)

# List VMs on node
ssh root@node02.example.com 'systemctl list-units | grep plasmavmc-vm@'

# Migrate each VM
curl -k -X POST https://node01.example.com:9090/api/vms/vm-001/migrate \
  -d '{"target_node":"node03"}'

Step 3: Stop Services

ssh root@node02.example.com '
  systemctl stop plasmavmc.service
  systemctl stop chainfire.service
  systemctl stop flaredb.service
'

Step 4: Perform Maintenance

# Reboot for kernel update, hardware maintenance, etc.
ssh root@node02.example.com 'reboot'

Step 5: Re-enable Node

# Verify all services healthy
ssh root@node02.example.com 'systemctl status chainfire flaredb plasmavmc'

# Re-enable in load balancer
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
  -d '{"status":"active"}'

7.6 Decommissioning a Node

Step 1: Drain Node (see 7.5)

Step 2: Remove from Cluster

# Remove from Chainfire
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02

# Remove from FlareDB
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02

# Verify removal
curl -k https://node01.example.com:2379/admin/cluster/members | jq

Step 3: Power Off

# Via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin -P password chassis power off

# Or via SSH
ssh root@node02.example.com 'poweroff'

Step 4: Update Inventory

# Remove from node inventory
vim /srv/provisioning/inventory.json
# Remove node02 entry

# Remove from DNS
# Update DNS zone to remove node02.example.com

# Remove from monitoring
# Update Prometheus targets to remove node02

8. Troubleshooting

8.1 PXE Boot Failures

Symptom: Server does not obtain IP address or does not boot from network

Diagnosis:

# Monitor DHCP server logs
sudo journalctl -u dhcpd4 -f

# Monitor TFTP requests
sudo tcpdump -i eth0 -n port 69

# Check PXE server services
sudo systemctl status dhcpd4 atftpd nginx

Common Causes:

  1. DHCP server not running: sudo systemctl start dhcpd4
  2. Wrong network interface: Check interfaces in dhcpd.conf
  3. Firewall blocking DHCP/TFTP: sudo iptables -L -n | grep -E "67|68|69"
  4. PXE not enabled in BIOS: Enter BIOS and enable Network Boot
  5. Network cable disconnected: Check physical connection

Solution:

# Restart all PXE services
sudo systemctl restart dhcpd4 atftpd nginx

# Verify DHCP configuration
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Power cycle server
ipmitool -I lanplus -H <bmc-ip> -U admin chassis power cycle

8.2 Installation Failures (nixos-anywhere)

Symptom: nixos-anywhere fails during disk partitioning, installation, or bootloader setup

Diagnosis:

# Check nixos-anywhere output for errors
# Common errors: disk not found, partition table errors, out of space

# SSH to installer for manual inspection
ssh root@10.0.100.50

# Check disk status
lsblk
dmesg | grep -i error

Common Causes:

  1. Disk device wrong: Update disko.nix with correct device (e.g., /dev/nvme0n1)
  2. Disk not wiped: Previous partition table conflicts
  3. Out of disk space: Insufficient storage for Nix closures
  4. Network issues: Cannot download packages from binary cache

Solution:

# Manual disk wipe (on installer)
ssh root@10.0.100.50 '
  wipefs -a /dev/sda
  sgdisk --zap-all /dev/sda
'

# Retry nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  root@10.0.100.50

8.3 Cluster Join Failures

Symptom: Node boots successfully but does not join cluster

Diagnosis:

# Check first-boot logs on the joining node (node04 in this example)
ssh root@node04.example.com 'journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service'

# Common errors:
# - "Health check timeout after 120s"
# - "Join request failed: connection refused"
# - "Configuration file not found"

Bootstrap Mode vs Join Mode:

  • Bootstrap: Node expects to create new cluster with peers
  • Join: Node expects to connect to existing leader
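
The bullets above are the branch the first-boot automation takes; a sketch of that decision (endpoint behavior follows this runbook's examples and is illustrative only):

```shell
#!/usr/bin/env bash
# Sketch of the decision the first-boot cluster-join service makes.
# Endpoint behavior follows this runbook's examples; illustrative only.
set -u

first_boot_mode() {
  # Prints "bootstrap" or "join" based on cluster-config.json.
  local cfg="$1"
  if grep -q '"bootstrap": true' "$cfg"; then
    echo "bootstrap"   # form a new cluster with the configured initial peers
  else
    echo "join"        # register with the leader via /admin/member/add
  fi
}

cfg="$(mktemp)"
printf '{"node_id": "node04", "bootstrap": false}\n' > "$cfg"
echo "node04 will start in $(first_boot_mode "$cfg") mode"
```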

Common Causes:

  1. Wrong bootstrap flag: Check cluster-config.json
  2. Leader unreachable: Network/firewall issue
  3. TLS certificate errors: Verify cert paths and validity
  4. Service not starting: Check main service (chainfire.service)

Solution:

# Verify cluster-config.json on the joining node
ssh root@node04.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'

# Test leader connectivity
ssh root@node04.example.com 'curl -k https://node01.example.com:2379/health'

# Check TLS certificates
ssh root@node04.example.com 'ls -l /etc/nixos/secrets/*.pem'

# Manual cluster join (if automation fails)
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.200.20:2380"}'

8.4 Service Start Failures

Symptom: Service fails to start after boot

Diagnosis:

# Check service status
ssh root@node01.example.com 'systemctl status chainfire.service'

# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'

# Common errors:
# - "bind: address already in use" (port conflict)
# - "certificate verify failed" (TLS issue)
# - "permission denied" (file permissions)

Common Causes:

  1. Port already in use: Another service using same port
  2. Missing dependencies: Required service not running
  3. Configuration error: Invalid config file
  4. File permissions: Cannot read secrets

Solution:

# Check port usage
ssh root@node01.example.com 'ss -tlnp | grep 2379'

# Verify dependencies
ssh root@node01.example.com 'systemctl list-dependencies chainfire.service'

# Test configuration manually
ssh root@node01.example.com 'chainfire-server --config /etc/nixos/chainfire.toml --check-config'

# Fix permissions
ssh root@node01.example.com 'chmod 600 /etc/nixos/secrets/*-key.pem'

8.5 Network Connectivity Issues

Symptom: Nodes cannot communicate with each other or external services

Diagnosis:

# Test basic connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'

# Test specific ports
ssh root@node01.example.com 'nc -zv node02.example.com 2379'

# Check firewall rules
ssh root@node01.example.com 'iptables -L -n | grep 2379'

# Check routing
ssh root@node01.example.com 'ip route show'

Common Causes:

  1. Firewall blocking traffic: Missing iptables rules
  2. Wrong IP address: Configuration mismatch
  3. Network interface down: Interface not configured
  4. DNS resolution failure: Cannot resolve hostnames

Solution:

# Add firewall rules (immediate fix; on NixOS, persist via networking.firewall)
ssh root@node01.example.com '
  iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
  iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
  iptables-save > /etc/iptables/rules.v4
'

# Fix DNS resolution (temporary; on NixOS, persist via networking.hosts)
ssh root@node01.example.com '
  echo "10.0.200.11 node02.example.com node02" >> /etc/hosts
'

# Restart networking
ssh root@node01.example.com 'systemctl restart systemd-networkd'

8.6 TLS Certificate Errors

Symptom: Services cannot establish TLS connections

Diagnosis:

# Test TLS connection
openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem

# Check certificate validity
ssh root@node01.example.com '
  openssl x509 -in /etc/nixos/secrets/node01-cert.pem -noout -dates
'

# Common errors:
# - "certificate verify failed" (wrong CA)
# - "certificate has expired" (cert expired)
# - "certificate subject name mismatch" (wrong CN)

Common Causes:

  1. Expired certificate: Regenerate certificate
  2. Wrong CA certificate: Verify CA cert is correct
  3. Hostname mismatch: CN does not match hostname
  4. File permissions: Cannot read certificate files

Solution:

# Regenerate certificate
openssl req -new -key /srv/provisioning/secrets/node01-key.pem \
  -out /srv/provisioning/secrets/node01-csr.pem \
  -subj "/CN=node01.example.com"

openssl x509 -req -in /srv/provisioning/secrets/node01-csr.pem \
  -CA /srv/provisioning/ca-cert.pem \
  -CAkey /srv/provisioning/ca-key.pem \
  -CAcreateserial \
  -out /srv/provisioning/secrets/node01-cert.pem \
  -days 365

# Copy to node
scp /srv/provisioning/secrets/node01-cert.pem root@node01.example.com:/etc/nixos/secrets/

# Restart service
ssh root@node01.example.com 'systemctl restart chainfire.service'

8.7 Performance Degradation

Symptom: Services are slow or unresponsive

Diagnosis:

# Check system load
ssh root@node01.example.com 'uptime'
ssh root@node01.example.com 'top -bn1 | head -20'

# Check disk I/O
ssh root@node01.example.com 'iostat -x 1 5'

# Check network bandwidth
ssh root@node01.example.com 'iftop -i eth1'

# Check Raft logs for slow operations
ssh root@node01.example.com 'journalctl -u chainfire.service | grep "slow operation"'

Common Causes:

  1. High CPU usage: Too many requests, inefficient queries
  2. Disk I/O bottleneck: Slow disk, too many writes
  3. Network saturation: Bandwidth exhausted
  4. Memory pressure: OOM killer active
  5. Raft slow commits: Network latency between nodes

Solution:

# Add more resources (vertical scaling)
# Or add more nodes (horizontal scaling)

# Check for resource leaks
ssh root@node01.example.com 'systemctl status chainfire | grep Memory'

# Restart service to clear memory leaks (temporary)
ssh root@node01.example.com 'systemctl restart chainfire.service'

# Optimize disk I/O (enable write caching if safe)
ssh root@node01.example.com 'hdparm -W1 /dev/sda'

9. Rollback & Recovery

9.1 NixOS Generation Rollback

NixOS provides atomic rollback capability via generations:

List Available Generations:

ssh root@node01.example.com 'nixos-rebuild list-generations'
# Example output:
#   1   2025-12-10 10:30:00
#   2   2025-12-10 12:45:00 (current)

Rollback to Previous Generation:

# Rollback and reboot
ssh root@node01.example.com 'nixos-rebuild switch --rollback'

# Or boot into previous generation once (no permanent change)
ssh root@node01.example.com 'nixos-rebuild boot --rollback && reboot'

Rollback to Specific Generation:

ssh root@node01.example.com '
  nix-env --switch-generation 1 -p /nix/var/nix/profiles/system
  /nix/var/nix/profiles/system/bin/switch-to-configuration boot
  reboot
'
# switch-to-configuration boot installs the bootloader entry so the
# reboot lands on generation 1

9.2 Re-Provisioning from PXE

Complete re-provisioning wipes all data and reinstalls from scratch:

Step 1: Remove Node from Cluster

curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02

Step 2: Set Boot to PXE

ipmitool -I lanplus -H 10.0.10.51 -U admin chassis bootdev pxe

Step 3: Reboot Node

ssh root@node02.example.com 'reboot'
# Or via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin chassis power cycle

Step 4: Run nixos-anywhere

# Wait for PXE boot and SSH ready
sleep 90

nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51

9.3 Disaster Recovery Procedures

Complete Cluster Loss (All Nodes Down):

Step 1: Restore from Backup (if available)

# Restore Chainfire data
ssh root@node01.example.com '
  systemctl stop chainfire.service
  rm -rf /var/lib/chainfire/*
  tar -xzf /backup/chainfire-$(date +%Y%m%d).tar.gz -C /  # archive stores var/lib/chainfire relative to /
  systemctl start chainfire.service
'

Step 2: Bootstrap a New Cluster

If no backup is available, re-provision all nodes in bootstrap mode:

# Update cluster-config.json for all nodes
# Set bootstrap=true, same initial_peers

# Provision all 3 nodes
for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#$node \
    root@<node-ip> &
done
wait

Single Node Failure:

Step 1: Verify Cluster Quorum

# Check that the remaining nodes still hold quorum (the failed node may
# still appear in the member list until it is removed)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '[.members[] | select(.status=="healthy")] | length'
# Expected: 2 healthy members in a 3-node cluster with 1 failure (quorum = 2)
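
Quorum for an n-member Raft cluster is floor(n/2) + 1, so a 3-node cluster survives one failure but not two. A quick sketch of the check:

```shell
#!/usr/bin/env bash
# Raft quorum check: writes succeed only while
# healthy members >= floor(total / 2) + 1.
quorum_ok() {
  local total="$1" healthy="$2"
  local need=$(( total / 2 + 1 ))
  if [ "$healthy" -ge "$need" ]; then
    echo "quorum OK ($healthy/$total healthy, need $need)"
  else
    echo "QUORUM LOST ($healthy/$total healthy, need $need)"
    return 1
  fi
}

quorum_ok 3 2   # one failure in a 3-node cluster: still writable
```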

Step 2: Remove Failed Node

curl -k -X DELETE https://node01.example.com:2379/admin/member/node02

Step 3: Provision Replacement

# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51

9.4 Backup and Restore

Automated Backup Script:

#!/bin/bash
# /srv/provisioning/scripts/backup-cluster.sh

BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup Chainfire data
for node in node01 node02 node03; do
  ssh root@$node.example.com \
    "tar -czf - /var/lib/chainfire" > "$BACKUP_DIR/chainfire-$node.tar.gz"
done

# Backup FlareDB data
for node in node01 node02 node03; do
  ssh root@$node.example.com \
    "tar -czf - /var/lib/flaredb" > "$BACKUP_DIR/flaredb-$node.tar.gz"
done

# Backup configurations
cp -r /srv/provisioning/nodes "$BACKUP_DIR/configs"

echo "Backup complete: $BACKUP_DIR"

Restore Script:

#!/bin/bash
# /srv/provisioning/scripts/restore-cluster.sh

BACKUP_DIR="$1"
if [ -z "$BACKUP_DIR" ]; then
  echo "Usage: $0 <backup-dir>"
  exit 1
fi

# Stop services on all nodes
for node in node01 node02 node03; do
  ssh root@$node.example.com 'systemctl stop chainfire flaredb'
done

# Restore Chainfire data
for node in node01 node02 node03; do
  cat "$BACKUP_DIR/chainfire-$node.tar.gz" | \
    ssh root@$node.example.com "cd / && tar -xzf -"
done

# Restore FlareDB data
for node in node01 node02 node03; do
  cat "$BACKUP_DIR/flaredb-$node.tar.gz" | \
    ssh root@$node.example.com "cd / && tar -xzf -"
done

# Restart services
for node in node01 node02 node03; do
  ssh root@$node.example.com 'systemctl start chainfire flaredb'
done

echo "Restore complete"

10. Security Best Practices

10.1 SSH Key Management

Generate Dedicated Provisioning Key:

ssh-keygen -t ed25519 -C "provisioning@example.com" -f ~/.ssh/id_ed25519_provisioning

Add to Netboot Image:

# In netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... provisioning@example.com"
];

Rotate Keys Regularly:

# Generate new key
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_provisioning_new

# Add to all nodes
for node in node01 node02 node03; do
  ssh-copy-id -i ~/.ssh/id_ed25519_provisioning_new.pub root@$node.example.com
done

# Remove old key from authorized_keys
# Update netboot image with new key

10.2 TLS Certificate Rotation

Automated Rotation Script:

#!/bin/bash
# /srv/provisioning/scripts/rotate-certs.sh

# Generate new certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key-new.pem 4096
  openssl req -new -key ${node}-key-new.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem \
    -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert-new.pem -days 365
done

# Deploy new certificates (without restarting services yet)
for node in node01 node02 node03; do
  scp ${node}-cert-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-cert-new.pem
  scp ${node}-key-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-key-new.pem
done

# Update configuration to use new certs
# ... (NixOS configuration update) ...

# Rolling restart to apply new certificates
for node in node01 node02 node03; do
  ssh root@${node}.example.com 'systemctl restart chainfire flaredb iam'
  sleep 30  # Wait for stabilization
done

echo "Certificate rotation complete"

10.3 Secrets Management

Best Practices:

  • Store secrets outside Nix store (use /etc/nixos/secrets/)
  • Set restrictive permissions (0600 for private keys, 0400 for passwords)
  • Use environment variables for runtime secrets
  • Never commit secrets to Git
  • Use encrypted secrets (sops-nix or agenix)
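
The permissions rule above is easy to verify mechanically; a sketch that flags anything group- or world-readable (the path and acceptable modes follow the conventions above; adjust to taste):

```shell
#!/usr/bin/env bash
# Audit a secrets directory for group/world-readable files.
# Acceptable modes follow this runbook's conventions (600/400).
audit_secret_perms() {
  local dir="$1" bad=0 f mode
  while IFS= read -r f; do
    mode="$(stat -c '%a' "$f")"
    case "$mode" in
      600|400) ;;                             # private keys / passwords
      *) echo "WARN: $f is mode $mode (expected 600 or 400)"; bad=1 ;;
    esac
  done < <(find "$dir" -type f)
  return "$bad"
}

# Usage: audit_secret_perms /etc/nixos/secrets
```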

Example with sops-nix:

# In configuration.nix
{
  imports = [ <sops-nix/modules/sops> ];

  sops.defaultSopsFile = ./secrets.yaml;
  sops.secrets."node01/tls-key" = {
    owner = "chainfire";
    mode = "0400";
  };

  services.chainfire.settings.tls.key_path = config.sops.secrets."node01/tls-key".path;
}

10.4 Network Isolation

VLAN Segmentation:

  • Management VLAN (10): BMC/IPMI, provisioning workstation
  • Provisioning VLAN (100): PXE boot, temporary
  • Production VLAN (200): Cluster services, inter-node communication
  • Client VLAN (300): External clients accessing services

Firewall Zones:

# Example nftables rules
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    # Management from trusted subnet only
    iifname "eth0" ip saddr 10.0.10.0/24 tcp dport 22 accept

    # Cluster traffic from cluster subnet only
    iifname "eth1" ip saddr 10.0.200.0/24 tcp dport { 2379, 2380, 2479, 2480 } accept

    # Client traffic from client subnet only
    iifname "eth2" ip saddr 10.0.30.0/24 tcp dport { 8080, 9090 } accept  # client subnet (VLAN 300)
  }
}

10.5 Audit Logging

Enable Structured Logging:

# In configuration.nix
services.chainfire.settings.logging = {
  level = "info";
  format = "json";
  output = "journal";
};

# Enable journald forwarding to SIEM
services.journald.extraConfig = ''
  ForwardToSyslog=yes
  Storage=persistent
  MaxRetentionSec=7days
'';

Audit Key Events:

  • Cluster membership changes
  • Node joins/leaves
  • Authentication failures
  • Configuration changes
  • TLS certificate errors
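
Those events can be pulled out of journald output with a simple filter; a sketch (the patterns are illustrative, not Chainfire's actual log strings):

```shell
#!/usr/bin/env bash
# Filter service logs for audit-relevant events: membership changes,
# auth failures, TLS errors. Patterns are illustrative.
audit_filter() {
  grep -Ei 'member (add|remov)|join|leave|auth(entication)? fail|certificate'
}

# Demo against captured log lines:
printf '%s\n' \
  'msg="member added" id=node04' \
  'msg="heartbeat ok"' \
  'msg="authentication failed" user=admin' | audit_filter
```

In practice, pipe `journalctl -u chainfire.service -o cat` into `audit_filter`.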

Log Aggregation:

# Forward logs to central logging server
# Example: rsyslog configuration
cat > /etc/rsyslog.d/50-remote.conf <<EOF
*.* @@logging-server.example.com:514
EOF
systemctl restart rsyslog

Appendix A: Service Port Reference

See NETWORK.md for complete port matrix.

Appendix B: Hardware Vendor Commands

See HARDWARE.md for vendor-specific BIOS configurations and IPMI commands.

Appendix C: Complete Command Reference

See COMMANDS.md for all commands organized by task.

Appendix D: Quick Reference Cards

See QUICKSTART.md for condensed deployment guide.

Appendix E: Deployment Flow Diagrams

See diagrams/deployment-flow.md for visual workflow.

Related Documentation

  • Design Document: /home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md
  • PXE Server: /home/centra/cloud/chainfire/baremetal/pxe-server/README.md
  • Image Builder: /home/centra/cloud/baremetal/image-builder/README.md
  • First-Boot Automation: /home/centra/cloud/baremetal/first-boot/README.md

End of Operator Runbook