
Bare-Metal Provisioning Operator Runbook

Document Version: 1.0
Last Updated: 2025-12-10
Status: Production Ready
Author: PlasmaCloud Infrastructure Team

1. Overview

1.1 What This Runbook Covers

This runbook provides comprehensive, step-by-step instructions for deploying PlasmaCloud infrastructure on bare-metal servers using automated PXE-based provisioning. By following this guide, operators will be able to:

  • Deploy a complete PlasmaCloud cluster from bare hardware to running services
  • Bootstrap a 3-node Raft cluster (Chainfire + FlareDB)
  • Add additional nodes to an existing cluster
  • Validate cluster health and troubleshoot common issues
  • Perform operational tasks (updates, maintenance, recovery)

1.2 Prerequisites

Required Access and Permissions:

  • Root/sudo access on provisioning server
  • Physical or IPMI/BMC access to bare-metal servers
  • Network access to provisioning VLAN
  • SSH key pair for nixos-anywhere

Required Tools:

  • NixOS with flakes enabled (provisioning workstation)
  • curl, jq, ssh client
  • ipmitool (optional, for remote management)
  • Serial console access tool (optional)

Required Knowledge:

  • Basic understanding of PXE boot process
  • Linux system administration
  • Network configuration (DHCP, DNS, firewall)
  • NixOS basics (declarative configuration, flakes)

1.3 Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                    Bare-Metal Provisioning Flow                         │
└─────────────────────────────────────────────────────────────────────────┘

Phase 1: PXE Boot                Phase 2: Installation
┌──────────────┐                  ┌──────────────────┐
│  Bare-Metal  │  1. DHCP Request │   DHCP Server    │
│   Server     ├─────────────────>│  (PXE Server)    │
│              │                  └──────────────────┘
│  (powered    │  2. TFTP Get                │
│   on, PXE    │     bootloader             │
│   enabled)   │<───────────────────────────┘
│              │
│  3. iPXE     │  4. HTTP Get      ┌──────────────────┐
│     loads    │     boot.ipxe     │   HTTP Server    │
│              ├──────────────────>│   (nginx)        │
│              │                   └──────────────────┘
│  5. iPXE     │  6. HTTP Get               │
│     menu     │     kernel+initrd          │
│              │<───────────────────────────┘
│              │
│  7. Boot     │
│     NixOS    │
│     Installer│
└──────┬───────┘
       │
       │  8. SSH Connection         ┌──────────────────┐
       └───────────────────────────>│  Provisioning    │
                                    │  Workstation     │
                                    │                  │
                                    │  9. Run          │
                                    │     nixos-       │
                                    │     anywhere     │
                                    └──────┬───────────┘
                                           │
                      ┌────────────────────┴────────────────────┐
                      │                                          │
                      v                                          v
       ┌──────────────────────────┐          ┌──────────────────────────┐
       │  10. Partition disks     │          │  11. Install NixOS       │
       │      (disko)             │          │      - Build system      │
       │  - GPT/LVM/LUKS          │          │      - Copy closures     │
       │  - Format filesystems    │          │      - Install bootloader│
       │  - Mount /mnt            │          │      - Inject secrets    │
       └──────────────────────────┘          └──────────────────────────┘

Phase 3: First Boot              Phase 4: Running Cluster
┌──────────────┐                 ┌──────────────────┐
│  Bare-Metal  │  12. Reboot     │   NixOS System   │
│   Server     │  ────────────>  │   (from disk)    │
└──────────────┘                 └──────────────────┘
                                          │
                      ┌───────────────────┴────────────────────┐
                      │  13. First-boot automation             │
                      │  - Chainfire cluster join/bootstrap    │
                      │  - FlareDB cluster join/bootstrap      │
                      │  - IAM initialization                  │
                      │  - Health checks                       │
                      └───────────────────┬────────────────────┘
                                          │
                                          v
                                 ┌──────────────────┐
                                 │  Running Cluster │
                                 │  - All services  │
                                 │    healthy       │
                                 │  - Raft quorum   │
                                 │  - TLS enabled   │
                                 └──────────────────┘

2. Hardware Requirements

2.1 Minimum Specifications Per Node

Control Plane Nodes (3-5 recommended):

  • CPU: 8 cores / 16 threads (Intel Xeon or AMD EPYC)
  • RAM: 32 GB DDR4 ECC
  • Storage: 500 GB SSD (NVMe preferred)
  • Network: 2x 10 GbE (bonded/redundant)
  • BMC: IPMI 2.0 or Redfish compatible

Worker Nodes:

  • CPU: 16+ cores / 32+ threads
  • RAM: 64 GB+ DDR4 ECC
  • Storage: 1 TB+ NVMe SSD
  • Network: 2x 10 GbE or 2x 25 GbE
  • BMC: IPMI 2.0 or Redfish compatible

All-in-One (Development/Testing):

  • CPU: 16 cores / 32 threads
  • RAM: 64 GB DDR4
  • Storage: 1 TB SSD
  • Network: 1x 10 GbE (minimum)
  • BMC: Optional but recommended

2.2 Recommended Specifications Per Node

Control Plane Nodes:

  • CPU: 16-32 cores (Intel Xeon Gold/Platinum or AMD EPYC)
  • RAM: 64-128 GB DDR4 ECC
  • Storage: 1-2 TB NVMe SSD (RAID1 for redundancy)
  • Network: 2x 25 GbE (active/active bonding)
  • BMC: Redfish with SOL (Serial-over-LAN)

Worker Nodes:

  • CPU: 32-64 cores
  • RAM: 128-256 GB DDR4 ECC
  • Storage: 2-4 TB NVMe SSD
  • Network: 2x 25 GbE or 2x 100 GbE
  • GPU: Optional (NVIDIA/AMD for ML workloads)

2.3 Hardware Compatibility Matrix

Vendor      Model            Tested   BIOS   UEFI    Notes
Dell        PowerEdge R640   Yes      Yes    Yes     Requires BIOS A19+
Dell        PowerEdge R650   Yes      Yes    Yes     Best PXE compatibility
HPE         ProLiant DL360   Yes      Yes    Yes     Disable Secure Boot
HPE         ProLiant DL380   Yes      Yes    Yes     Latest firmware recommended
Supermicro  SYS-2029U        Yes      Yes    Yes     Requires BMC 1.73+
Lenovo      ThinkSystem      Partial  Yes    Yes     Some NIC issues on older models
Generic     Whitebox x86     Partial  Yes    Maybe   UEFI support varies

2.4 BIOS/UEFI Settings

Required Settings:

  • Boot Mode: UEFI (preferred) or Legacy BIOS
  • PXE/Network Boot: Enabled on primary NIC
  • Boot Order: Network → Disk
  • Secure Boot: Disabled (for PXE boot)
  • Virtualization: Enabled (VT-x/AMD-V)
  • SR-IOV: Enabled (if using advanced networking)

Dell-Specific (iDRAC):

System BIOS → Boot Settings:
  Boot Mode: UEFI
  UEFI Network Stack: Enabled
  PXE Device 1: Integrated NIC 1

System BIOS → System Profile:
  Profile: Performance

HPE-Specific (iLO):

System Configuration → BIOS/Platform:
  Boot Mode: UEFI Mode
  Network Boot: Enabled
  PXE Support: UEFI Only

System Configuration → UEFI Boot Order:
  1. Network Adapter (NIC 1)
  2. Hard Disk

Supermicro-Specific (IPMI):

BIOS Setup → Boot:
  Boot mode select: UEFI
  UEFI Network Stack: Enabled
  Boot Option #1: UEFI Network

BIOS Setup → Advanced → CPU Configuration:
  Intel Virtualization Technology: Enabled

2.5 BMC/IPMI Requirements

Mandatory Features:

  • Remote power control (on/off/reset)
  • Boot device selection (PXE/disk)
  • Remote console access (KVM-over-IP or SOL)

Recommended Features:

  • Virtual media mounting
  • Sensor monitoring (temperature, fans, PSU)
  • Event logging
  • SMTP alerting

Network Configuration:

  • Dedicated BMC network (separate VLAN recommended)
  • Static IP or DHCP reservation
  • HTTPS access enabled
  • Default credentials changed

3. Network Setup

3.1 Network Topology

Single-Segment Topology (Simple):

┌─────────────────────────────────────────────────────┐
│  Provisioning Server    PXE/DHCP/HTTP              │
│  10.0.100.10                                        │
└──────────────┬──────────────────────────────────────┘
               │
               │  Layer 2 Switch (unmanaged)
               │
    ┌──────────┴──────────┬─────────────┐
    │                     │             │
┌───┴────┐          ┌────┴─────┐  ┌───┴────┐
│ Node01 │          │  Node02  │  │ Node03 │
│10.0.100│          │ 10.0.100 │  │10.0.100│
│  .50   │          │   .51    │  │  .52   │
└────────┘          └──────────┘  └────────┘

Multi-VLAN Topology (Production):

┌──────────────────────────────────────────────────────┐
│  Management Network (VLAN 10)                        │
│  - Provisioning Server: 10.0.10.10                   │
│  - BMC/IPMI: 10.0.10.50-99                          │
└──────────────────┬───────────────────────────────────┘
                   │
┌──────────────────┴───────────────────────────────────┐
│  Provisioning Network (VLAN 100)                     │
│  - PXE Boot: 10.0.100.0/24                          │
│  - DHCP Range: 10.0.100.100-200                     │
└──────────────────┬───────────────────────────────────┘
                   │
┌──────────────────┴───────────────────────────────────┐
│  Production Network (VLAN 200)                       │
│  - Static IPs: 10.0.200.10-99                       │
│  - Service Traffic                                   │
└──────────────────┬───────────────────────────────────┘
                   │
          ┌────────┴────────┐
          │  L3 Switch      │
          │  (VLANs, Routing)│
          └────────┬─────────┘
                   │
       ┌───────────┴──────────┬─────────┐
       │                      │         │
  ┌────┴────┐           ┌────┴────┐   │
  │ Node01  │           │ Node02  │   │...
  │ eth0:   │           │ eth0:   │
  │  VLAN100│           │  VLAN100│
  │ eth1:   │           │ eth1:   │
  │  VLAN200│           │  VLAN200│
  └─────────┘           └─────────┘

3.2 DHCP Server Configuration

ISC DHCP Configuration (/etc/dhcp/dhcpd.conf):

# Global options
option architecture-type code 93 = unsigned integer 16;
default-lease-time 600;
max-lease-time 7200;
authoritative;

# Provisioning subnet
subnet 10.0.100.0 netmask 255.255.255.0 {
    range 10.0.100.100 10.0.100.200;
    option routers 10.0.100.1;
    option domain-name-servers 10.0.100.1, 8.8.8.8;
    option domain-name "prov.example.com";

    # PXE boot server
    next-server 10.0.100.10;

    # Architecture-specific boot file selection
    if exists user-class and option user-class = "iPXE" {
        # iPXE already loaded, provide boot script via HTTP
        filename "http://10.0.100.10:8080/boot/ipxe/boot.ipxe";
    } elsif option architecture-type = 00:00 {
        # BIOS (legacy) - load iPXE via TFTP
        filename "undionly.kpxe";
    } elsif option architecture-type = 00:07 {
        # UEFI x86_64 - load iPXE via TFTP
        filename "ipxe.efi";
    } elsif option architecture-type = 00:09 {
        # UEFI x86_64 (alternate) - load iPXE via TFTP
        filename "ipxe.efi";
    } else {
        # Fallback to UEFI
        filename "ipxe.efi";
    }
}

# Static reservations for control plane nodes
host node01 {
    hardware ethernet 52:54:00:12:34:56;
    fixed-address 10.0.100.50;
    option host-name "node01";
}

host node02 {
    hardware ethernet 52:54:00:12:34:57;
    fixed-address 10.0.100.51;
    option host-name "node02";
}

host node03 {
    hardware ethernet 52:54:00:12:34:58;
    fixed-address 10.0.100.52;
    option host-name "node03";
}
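The static reservations above follow a mechanical pattern, so for larger fleets they can be generated from the hardware inventory instead of hand-edited. A minimal sketch: the whitespace-separated inventory format (MAC, hostname, IP per line) is an illustrative assumption, not a project convention.

```shell
# Generate dhcpd host blocks from an inventory of "<mac> <hostname> <ip>"
# lines. The inventory format here is an assumption for illustration.
gen_reservations() {
  while read -r mac host ip; do
    [ -z "$mac" ] && continue
    printf 'host %s {\n' "$host"
    printf '    hardware ethernet %s;\n' "$mac"
    printf '    fixed-address %s;\n' "$ip"
    printf '    option host-name "%s";\n' "$host"
    printf '}\n\n'
  done
}

gen_reservations <<'EOF'
52:54:00:12:34:56 node01 10.0.100.50
52:54:00:12:34:57 node02 10.0.100.51
52:54:00:12:34:58 node03 10.0.100.52
EOF
```

Append the output to dhcpd.conf and re-run the syntax check (`dhcpd -t`) before restarting the service.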

Validation Commands:

# Test DHCP configuration syntax
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Start DHCP server
sudo systemctl start isc-dhcp-server
sudo systemctl enable isc-dhcp-server

# Monitor DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases

# Test DHCP response
sudo nmap --script broadcast-dhcp-discover -e eth0

3.3 DNS Requirements

Forward DNS Zone (example.com):

; Control plane nodes
node01.example.com.    IN  A    10.0.200.10
node02.example.com.    IN  A    10.0.200.11
node03.example.com.    IN  A    10.0.200.12

; Worker nodes
worker01.example.com.  IN  A    10.0.200.20
worker02.example.com.  IN  A    10.0.200.21

; Service VIPs (optional, for load balancing)
chainfire.example.com. IN  A    10.0.200.100
flaredb.example.com.   IN  A    10.0.200.101
iam.example.com.       IN  A    10.0.200.102

Reverse DNS Zone (200.0.10.in-addr.arpa):

; Control plane nodes
10.200.0.10.in-addr.arpa.  IN  PTR  node01.example.com.
11.200.0.10.in-addr.arpa.  IN  PTR  node02.example.com.
12.200.0.10.in-addr.arpa.  IN  PTR  node03.example.com.

Validation:

# Test forward resolution
dig +short node01.example.com

# Test reverse resolution
dig +short -x 10.0.200.10

# Test from target node after provisioning
ssh root@10.0.100.50 'hostname -f'
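The PTR records are the A records with their octets reversed, so the reverse zone can be derived from the forward zone rather than maintained by hand; drift between the two zones is a common cause of `hostname -f` failures. A sketch, assuming the zone file layout shown above:

```shell
# Derive /24 reverse-zone PTR records from forward A records on stdin.
gen_ptr() {
  awk '$2 == "IN" && $3 == "A" {
    split($4, o, ".")
    # Reverse the octets to form the in-addr.arpa owner name.
    printf "%s.%s.%s.%s.in-addr.arpa.  IN  PTR  %s\n", o[4], o[3], o[2], o[1], $1
  }'
}

gen_ptr <<'EOF'
node01.example.com.    IN  A    10.0.200.10
node02.example.com.    IN  A    10.0.200.11
EOF
```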

3.4 Firewall Rules

Service Port Matrix (see NETWORK.md for complete reference):

Service API Port Raft Port Additional Protocol
Chainfire 2379 2380 2381 (gossip) TCP
FlareDB 2479 2480 - TCP
IAM 8080 - - TCP
PlasmaVMC 9090 - - TCP
NovaNET 9091 - - TCP
FlashDNS 53 - - TCP/UDP
FiberLB 9092 - - TCP
K8sHost 10250 - - TCP

iptables Rules (Provisioning Server):

#!/bin/bash
# Provisioning server firewall rules

# Allow DHCP
iptables -A INPUT -p udp --dport 67 -j ACCEPT
iptables -A INPUT -p udp --dport 68 -j ACCEPT

# Allow TFTP
iptables -A INPUT -p udp --dport 69 -j ACCEPT

# Allow HTTP (boot server)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT

# Allow SSH (for nixos-anywhere)
iptables -A INPUT -p tcp --dport 22 -j ACCEPT

iptables Rules (Cluster Nodes):

#!/bin/bash
# Cluster node firewall rules

# Allow SSH (management)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT

# Allow Chainfire (from cluster subnet only)
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2381 -s 10.0.200.0/24 -j ACCEPT

# Allow FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2480 -s 10.0.200.0/24 -j ACCEPT

# Allow IAM (from cluster and client subnets)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT

# Drop all other traffic
iptables -A INPUT -j DROP

nftables Rules (Modern Alternative):

#!/usr/sbin/nft -f

flush ruleset

table inet filter {
    chain input {
        type filter hook input priority 0; policy drop;

        # Allow established connections
        ct state established,related accept

        # Allow loopback
        iif lo accept

        # Allow SSH
        tcp dport 22 ip saddr 10.0.0.0/8 accept

        # Allow cluster services from cluster subnet
        tcp dport { 2379, 2380, 2381, 2479, 2480 } ip saddr 10.0.200.0/24 accept

        # Allow IAM from internal network
        tcp dport 8080 ip saddr 10.0.0.0/8 accept
    }
}

3.5 Static IP Allocation Strategy

IP Allocation Plan:

10.0.100.0/24  - Provisioning network (DHCP during install)
  .1           - Gateway
  .10          - PXE/DHCP/HTTP server
  .50-.79      - Control plane nodes (static reservations)
  .80-.99      - Worker nodes (static reservations)
  .100-.200    - DHCP pool (temporary during provisioning)

10.0.200.0/24  - Production network (static IPs)
  .1           - Gateway
  .10-.19      - Control plane nodes
  .20-.99      - Worker nodes
  .100-.199    - Service VIPs
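Because the plan assigns control-plane addresses by index (node i gets 10.0.200.(10+i)), name/address pairs for ancillary files such as /etc/hosts can be generated rather than typed. The index-based mapping below is a sketch of that convention, not a hard project rule:

```shell
# Emit production-network /etc/hosts entries for the control plane,
# following the allocation plan above (10.0.200.10 onward).
gen_hosts() {
  for i in 0 1 2; do
    printf '10.0.200.%d  node%02d.example.com node%02d\n' \
      $((10 + i)) $((i + 1)) $((i + 1))
  done
}
gen_hosts
```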

3.6 Network Bandwidth Requirements

Per-Node During Provisioning:

  • PXE boot: ~200-500 MB (kernel + initrd)
  • nixos-anywhere: ~1-5 GB (NixOS closures)
  • Time: 5-15 minutes on 1 Gbps link

Production Cluster:

  • Control plane: 1 Gbps minimum, 10 Gbps recommended
  • Workers: 10 Gbps minimum, 25 Gbps recommended
  • Inter-node latency: <1ms ideal, <5ms acceptable
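The provisioning time estimate can be sanity-checked with integer arithmetic: raw transfer time is payload bytes times 8 divided by link speed, and build plus disk time comes on top, which is why the guide quotes 5-15 minutes rather than the raw figure. A quick sketch:

```shell
# Lower bound on closure copy time: payload_bytes * 8 / link_bits_per_second.
xfer_seconds() { echo $(( $1 * 8 / $2 )); }

# 5 GiB of closures over a 1 Gbps link:
echo "$(xfer_seconds $((5 * 1024**3)) $((10**9)))s minimum transfer time"
```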

4. Pre-Deployment Checklist

Complete this checklist before beginning deployment:

4.1 Hardware Checklist

  • All servers racked and powered
  • All network cables connected (data + BMC)
  • All power supplies connected (redundant if available)
  • BMC/IPMI network configured
  • BMC credentials documented
  • BIOS/UEFI settings configured per section 2.4
  • PXE boot enabled and first in boot order
  • Secure Boot disabled (if using UEFI)
  • Hardware inventory recorded (MAC addresses, serial numbers)

4.2 Network Checklist

  • Network switches configured (VLANs, trunking)
  • DHCP server configured and tested
  • DNS forward/reverse zones created
  • Firewall rules configured
  • Network connectivity verified (ping tests)
  • Bandwidth validated (iperf between nodes)
  • DHCP relay configured (if multi-subnet)
  • NTP server configured for time sync

4.3 PXE Server Checklist

  • PXE server deployed (see T032.S2)
  • DHCP service running and healthy
  • TFTP service running and healthy
  • HTTP service running and healthy
  • iPXE bootloaders downloaded (undionly.kpxe, ipxe.efi)
  • NixOS netboot images built and uploaded (see T032.S3)
  • Boot script configured (boot.ipxe)
  • Health endpoints responding

Validation:

# On PXE server
sudo systemctl status isc-dhcp-server
sudo systemctl status atftpd
sudo systemctl status nginx

# Test HTTP access
curl http://10.0.100.10:8080/boot/ipxe/boot.ipxe
curl http://10.0.100.10:8080/health

# Test TFTP access
tftp 10.0.100.10 -c get undionly.kpxe /tmp/test.kpxe

4.4 Node Configuration Checklist

  • Per-node NixOS configurations created (/srv/provisioning/nodes/)
  • Hardware configurations generated or templated
  • Disko disk layouts defined
  • Network settings configured (static IPs, VLANs)
  • Service selections defined (control-plane vs worker)
  • Cluster configuration JSON files created
  • Node inventory documented (MAC → hostname → role)

4.5 TLS Certificates Checklist

  • CA certificate generated
  • Per-node certificates generated
  • Certificate files copied to secrets directories
  • Certificate permissions set (0400 for private keys)
  • Certificate expiry dates documented
  • Rotation procedure documented

Generate Certificates:

# Generate CA (if not already done)
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates (include a subjectAltName; modern TLS
# clients reject certificates that only carry a CN)
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365 \
    -extfile <(printf "subjectAltName=DNS:%s.example.com" "${node}")
done

4.6 Provisioning Workstation Checklist

  • NixOS or Nix package manager installed
  • Nix flakes enabled
  • SSH key pair generated for provisioning
  • SSH public key added to netboot images
  • Network access to provisioning VLAN
  • Git repository cloned (if using version control)
  • nixos-anywhere installed: nix profile install github:nix-community/nixos-anywhere

5. Deployment Workflow

5.1 Phase 1: PXE Server Setup

Reference: See /home/centra/cloud/chainfire/baremetal/pxe-server/ (T032.S2)

Step 1.1: Deploy PXE Server Using NixOS Module

Create PXE server configuration:

# /etc/nixos/pxe-server.nix
{ config, pkgs, lib, ... }:

{
  imports = [
    /path/to/chainfire/baremetal/pxe-server/nixos-module.nix
  ];

  services.centra-pxe-server = {
    enable = true;
    interface = "eth0";
    serverAddress = "10.0.100.10";

    dhcp = {
      subnet = "10.0.100.0";
      netmask = "255.255.255.0";
      broadcast = "10.0.100.255";
      range = {
        start = "10.0.100.100";
        end = "10.0.100.200";
      };
      router = "10.0.100.1";
      domainNameServers = [ "10.0.100.1" "8.8.8.8" ];
    };

    nodes = {
      "52:54:00:12:34:56" = {
        profile = "control-plane";
        hostname = "node01";
        ipAddress = "10.0.100.50";
      };
      "52:54:00:12:34:57" = {
        profile = "control-plane";
        hostname = "node02";
        ipAddress = "10.0.100.51";
      };
      "52:54:00:12:34:58" = {
        profile = "control-plane";
        hostname = "node03";
        ipAddress = "10.0.100.52";
      };
    };
  };
}

Apply configuration:

sudo nixos-rebuild switch -I nixos-config=/etc/nixos/pxe-server.nix

Step 1.2: Verify PXE Services

# Check all services are running
sudo systemctl status dhcpd4.service
sudo systemctl status atftpd.service
sudo systemctl status nginx.service

# Test DHCP server
sudo journalctl -u dhcpd4 -f &
# Power on a test server and watch for DHCP requests

# Test TFTP server
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
ls -lh /tmp/test.kpxe  # Should show ~100KB file

# Test HTTP server
curl http://localhost:8080/health
# Expected: {"status":"healthy","services":{"dhcp":"running","tftp":"running","http":"running"}}

curl http://localhost:8080/boot/ipxe/boot.ipxe
# Expected: iPXE boot script content

5.2 Phase 2: Build Netboot Images

Reference: See /home/centra/cloud/baremetal/image-builder/ (T032.S3)

Step 2.1: Build Images for All Profiles

cd /home/centra/cloud/baremetal/image-builder

# Build all profiles
./build-images.sh

# Or build specific profile
./build-images.sh --profile control-plane
./build-images.sh --profile worker
./build-images.sh --profile all-in-one

Expected Output:

Building netboot image for control-plane...
Building initrd...
[... Nix build output ...]
✓ Build complete: artifacts/control-plane/initrd (234 MB)
✓ Build complete: artifacts/control-plane/bzImage (12 MB)

Step 2.2: Copy Images to PXE Server

# Automatic (if PXE server directory exists)
./build-images.sh --deploy

# Manual copy
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
sudo cp artifacts/all-in-one/* /var/lib/pxe-boot/nixos/all-in-one/

Step 2.3: Verify Image Integrity

# Check file sizes (should be reasonable)
ls -lh /var/lib/pxe-boot/nixos/*/

# Verify images are accessible via HTTP
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/bzImage
# Expected: HTTP/1.1 200 OK, Content-Length: ~12000000

curl -I http://10.0.100.10:8080/boot/nixos/control-plane/initrd
# Expected: HTTP/1.1 200 OK, Content-Length: ~234000000
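Size checks catch a missing file but not a corrupt or truncated copy; recording checksums at build time and verifying them on the PXE server closes that gap. The sketch below demonstrates the mechanism in a scratch directory; substitute the real artifact paths from the steps above:

```shell
# Verify a directory against its recorded SHA256SUMS manifest.
verify_images() {
  local dir=$1
  ( cd "$dir" && sha256sum -c SHA256SUMS )
}

# Demonstration with a scratch file standing in for a real image:
workdir=$(mktemp -d)
printf 'kernel-bytes' > "$workdir/bzImage"
( cd "$workdir" && sha256sum bzImage > SHA256SUMS )
verify_images "$workdir"
```

In practice, run `sha256sum artifacts/*/bzImage artifacts/*/initrd > SHA256SUMS` on the build host and `sha256sum -c` against the copies under /var/lib/pxe-boot/nixos/.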

5.3 Phase 3: Prepare Node Configurations

Step 3.1: Generate Node-Specific NixOS Configs

Create directory structure:

mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets

Node Configuration Template (nodes/node01.example.com/configuration.nix):

{ config, pkgs, lib, ... }:

{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  # Hostname and domain
  networking = {
    hostName = "node01";
    domain = "example.com";
    usePredictableInterfaceNames = false;  # Use eth0, eth1

    # Provisioning interface (temporary)
    interfaces.eth0 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.100.50";
        prefixLength = 24;
      }];
    };

    # Production interface
    interfaces.eth1 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.200.10";
        prefixLength = 24;
      }];
    };

    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" "8.8.8.8" ];
  };

  # Enable PlasmaCloud services
  services.chainfire = {
    enable = true;
    port = 2379;
    raftPort = 2380;
    gossipPort = 2381;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.flaredb = {
    enable = true;
    port = 2479;
    raftPort = 2480;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      chainfire_endpoint = "https://localhost:2379";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.iam = {
    enable = true;
    port = 8080;
    settings = {
      flaredb_endpoint = "https://localhost:2479";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  # Enable first-boot automation
  services.first-boot-automation = {
    enable = true;
    configFile = "/etc/nixos/secrets/cluster-config.json";
  };

  system.stateVersion = "24.11";
}
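Most of the configuration above differs between nodes only in the hostname and two addresses, so the per-node files can be stamped out from a template. The `@TOKEN@` placeholder convention below is illustrative, not part of the repository:

```shell
# Render a per-node config from a template on stdin by substituting
# hostname and addresses. The @TOKEN@ convention is an assumption.
render_node_config() {
  local host=$1 prov_ip=$2 prod_ip=$3
  sed -e "s/@HOSTNAME@/$host/g" \
      -e "s/@PROV_IP@/$prov_ip/g" \
      -e "s/@PROD_IP@/$prod_ip/g"
}

render_node_config node02 10.0.100.51 10.0.200.11 <<'EOF'
  networking.hostName = "@HOSTNAME@";
  # provisioning: @PROV_IP@  production: @PROD_IP@
EOF
```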

Step 3.2: Create cluster-config.json for Each Node

Bootstrap Node (node01):

{
  "node_id": "node01",
  "node_role": "control-plane",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ],
  "flaredb_peers": [
    "node01.example.com:2480",
    "node02.example.com:2480",
    "node03.example.com:2480"
  ]
}

Copy to secrets:

cp cluster-config-node01.json /srv/provisioning/nodes/node01.example.com/secrets/cluster-config.json
cp cluster-config-node02.json /srv/provisioning/nodes/node02.example.com/secrets/cluster-config.json
cp cluster-config-node03.json /srv/provisioning/nodes/node03.example.com/secrets/cluster-config.json

Step 3.3: Generate Disko Disk Layouts

Simple Single-Disk Layout (nodes/node01.example.com/disko.nix):

{ disks ? [ "/dev/sda" ], ... }:
{
  disko.devices = {
    disk = {
      main = {
        type = "disk";
        device = builtins.head disks;
        content = {
          type = "gpt";
          partitions = {
            ESP = {
              size = "1G";
              type = "EF00";
              content = {
                type = "filesystem";
                format = "vfat";
                mountpoint = "/boot";
              };
            };
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "ext4";
                mountpoint = "/";
              };
            };
          };
        };
      };
    };
  };
}

Step 3.4: Pre-Generate TLS Certificates

# Copy per-node certificates
cp ca-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-key.pem /srv/provisioning/nodes/node01.example.com/secrets/

# Set permissions (0400 on private keys, per section 4.5)
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/*-cert.pem
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/ca-cert.pem
chmod 400 /srv/provisioning/nodes/node01.example.com/secrets/*-key.pem
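The copy-and-chmod steps repeat per node, so a small helper can distribute all certificates with the section 4.5 permission policy (0400 keys, 0644 certs) in one pass. A sketch; the source directory argument is an assumption, and the destination layout follows section 5.3:

```shell
# Copy CA and per-node certificates into each node's secrets directory,
# applying permissions as they are installed.
distribute_certs() {
  local src=$1 dest_root=$2
  for node in node01 node02 node03; do
    local dest="$dest_root/$node.example.com/secrets"
    mkdir -p "$dest"
    install -m 644 "$src/ca-cert.pem"    "$dest/"
    install -m 644 "$src/$node-cert.pem" "$dest/"
    install -m 400 "$src/$node-key.pem"  "$dest/"
  done
}
# Usage (paths illustrative):
#   distribute_certs /path/to/generated-certs /srv/provisioning/nodes
```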

5.4 Phase 4: Bootstrap First 3 Nodes

Step 4.1: Power On Nodes via BMC

# Using ipmitool (example for Dell/HP/Supermicro)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
  ipmitool -I lanplus -H $ip -U admin -P password chassis bootdev pxe options=persistent
  ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done

Step 4.2: Verify PXE Boot Success

Watch serial console (if available):

# Connect via IPMI SOL
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol activate

# Expected output:
# ... DHCP discovery ...
# ... TFTP download undionly.kpxe or ipxe.efi ...
# ... iPXE menu appears ...
# ... Kernel and initrd download ...
# ... NixOS installer boots ...
# ... SSH server starts ...

Verify installer is ready:

# Wait for nodes to appear in DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases

# Test SSH connectivity
ssh root@10.0.100.50 'uname -a'
# Expected: Linux node01 ... nixos

Step 4.3: Run nixos-anywhere Simultaneously on All 3

Create provisioning script:

#!/bin/bash
# /srv/provisioning/scripts/provision-bootstrap-nodes.sh

set -euo pipefail

NODES=("node01" "node02" "node03")
PROVISION_IPS=("10.0.100.50" "10.0.100.51" "10.0.100.52")
FLAKE_ROOT="/srv/provisioning"

pids=()
for i in "${!NODES[@]}"; do
  node="${NODES[$i]}"
  ip="${PROVISION_IPS[$i]}"

  echo "Provisioning $node at $ip..."

  nix run github:nix-community/nixos-anywhere -- \
    --flake "$FLAKE_ROOT#$node" \
    --build-on-remote \
    "root@$ip" &
  pids+=($!)
done

# Wait on each PID individually so a failed install is not masked
# by the other background jobs (a bare `wait` ignores their exit codes).
for pid in "${pids[@]}"; do
  wait "$pid"
done
echo "All nodes provisioned successfully!"

Run provisioning:

chmod +x /srv/provisioning/scripts/provision-bootstrap-nodes.sh
./provision-bootstrap-nodes.sh

Expected output per node:

Provisioning node01 at 10.0.100.50...
Connecting via SSH...
Running disko to partition disks...
Building NixOS system...
Installing bootloader...
Copying secrets...
Installation complete. Rebooting...

Step 4.4: Wait for First-Boot Automation

After reboot, nodes will boot from disk and run first-boot automation. Monitor progress:

# Watch logs on node01 (via SSH after it reboots)
ssh root@10.0.200.10  # Note: now on production network

# Check cluster join services
journalctl -u chainfire-cluster-join.service -f
journalctl -u flaredb-cluster-join.service -f

# Expected log output:
# {"level":"INFO","message":"Waiting for local chainfire service..."}
# {"level":"INFO","message":"Local chainfire healthy"}
# {"level":"INFO","message":"Bootstrap node, cluster initialized"}
# {"level":"INFO","message":"Cluster join complete"}

Step 4.5: Verify Cluster Health

# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Expected output:
# {
#   "members": [
#     {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
#     {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
#     {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
#   ]
# }

# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Check IAM service
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected"}
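For scripted verification, the member health can be counted directly from the /admin/cluster/members response instead of read by eye; the sketch below works on the response shape shown above and avoids a jq dependency:

```shell
# Count healthy members in a /admin/cluster/members response on stdin.
healthy_count() {
  grep -o '"status":"healthy"' | wc -l
}

# Demonstration against a canned response matching the example above;
# in practice pipe `curl -sk .../admin/cluster/members` into healthy_count.
sample='{"members":[
 {"id":"node01","status":"healthy","role":"leader"},
 {"id":"node02","status":"healthy","role":"follower"},
 {"id":"node03","status":"healthy","role":"follower"}]}'
n=$(printf '%s' "$sample" | healthy_count)
echo "healthy members: $n"
[ "$n" -eq 3 ] || echo "WARNING: expected 3 healthy members"
```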

5.5 Phase 5: Add Additional Nodes

Step 5.1: Prepare Join-Mode Configurations

Create configuration for node04 (worker profile):

{
  "node_id": "node04",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.20:2380"
}

Step 5.2: Power On and Provision Nodes

# Power on node via BMC
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on

# Wait for PXE boot and SSH ready
sleep 60

# Provision node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  --build-on-remote \
  root@10.0.100.60

Step 5.3: Verify Cluster Join via API

# Check cluster members (should include node04)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node04")'

# Expected:
# {"id":"node04","raft_addr":"10.0.200.20:2380","status":"healthy","role":"follower"}

Step 5.4: Validate Replication and Service Distribution

# Write test data on leader
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'

# Read from follower (should be replicated)
curl -k https://node02.example.com:2379/v1/kv/test | jq

# Expected: {"key":"test","value":"hello world"}

6. Verification & Validation

6.1 Health Check Commands for All Services

Chainfire:

curl -k https://node01.example.com:2379/health | jq
# Expected: {"status":"healthy","raft":"leader","cluster_size":3}

# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 3 (for initial bootstrap)

FlareDB:

curl -k https://node01.example.com:2479/health | jq
# Expected: {"status":"healthy","raft":"leader","chainfire":"connected"}

# Query test metric
curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"up{job=\"node\"}","time":"now"}'

IAM:

curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected","version":"1.0.0"}

# List users (requires authentication)
curl -k https://node01.example.com:8080/api/users \
  -H "Authorization: Bearer $IAM_TOKEN" | jq

PlasmaVMC:

curl -k https://node01.example.com:9090/health | jq
# Expected: {"status":"healthy","vms_running":0}

# List VMs
curl -k https://node01.example.com:9090/api/vms | jq

NovaNET:

curl -k https://node01.example.com:9091/health | jq
# Expected: {"status":"healthy","networks":0}

FlashDNS:

dig @node01.example.com example.com
# Expected: DNS response with ANSWER section

# Health check
curl -k https://node01.example.com:853/health | jq

FiberLB:

curl -k https://node01.example.com:9092/health | jq
# Expected: {"status":"healthy","backends":0}

K8sHost:

kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes
# Expected: Node list including this node

6.2 Cluster Membership Verification

#!/bin/bash
# /srv/provisioning/scripts/verify-cluster.sh

echo "Checking Chainfire cluster..."
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | {id, status, role}'

echo ""
echo "Checking FlareDB cluster..."
curl -k https://node01.example.com:2479/admin/cluster/members | jq '.members[] | {id, status, role}'

echo ""
echo "Cluster health summary:"
echo "  Chainfire nodes: $(curl -sk https://node01.example.com:2379/admin/cluster/members | jq '.members | length')"
echo "  FlareDB nodes: $(curl -sk https://node01.example.com:2479/admin/cluster/members | jq '.members | length')"
echo "  Raft leaders: Chainfire=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id'), FlareDB=$(curl -sk https://node01.example.com:2479/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')"

6.3 Raft Leader Election Check

# Identify current leader
LEADER=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')
echo "Current Chainfire leader: $LEADER"

# Verify all followers can reach leader
for node in node01 node02 node03; do
  echo "Checking $node..."
  curl -sk https://$node.example.com:2379/admin/cluster/leader | jq
done
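
A healthy cluster reports exactly one leader across all members. A sketch that asserts this against a captured members payload (payload shape taken from the examples above):

```shell
#!/usr/bin/env bash
# Count leader entries in an /admin/cluster/members payload.
# Payload shape follows this runbook's examples.
leader_count() {
  grep -o '"role":"leader"' | wc -l
}

members='[{"id":"node01","role":"leader"},{"id":"node02","role":"follower"},{"id":"node03","role":"follower"}]'
n="$(printf '%s' "$members" | leader_count)"
if [ "$n" -eq 1 ]; then
  echo "OK: exactly one leader"
else
  echo "ERROR: $n leaders (no leader, or split brain)"
fi
```

In practice, capture the payload with `curl -sk https://node01.example.com:2379/admin/cluster/members` and pipe it into `leader_count`.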

6.4 TLS Certificate Validation

# Check certificate expiry
for node in node01 node02 node03; do
  echo "Checking $node certificate..."
  echo | openssl s_client -connect $node.example.com:2379 2>/dev/null | openssl x509 -noout -dates
done

# Verify certificate chain
echo | openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem -verify 1
# Expected: Verify return code: 0 (ok)
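
To turn the expiry dates above into an actionable number, a sketch that computes days until expiry, demonstrated on a throwaway self-signed certificate (assumes GNU `date -d` and openssl):

```shell
#!/usr/bin/env bash
# Days until a certificate expires; demonstrated on a throwaway
# self-signed cert. Assumes GNU date (-d) and openssl.
set -eu

days_left() {
  local cert="$1" end
  end="$(openssl x509 -in "$cert" -noout -enddate | cut -d= -f2)"
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

tmp="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" -subj "/CN=demo" 2>/dev/null
d="$(days_left "$tmp/cert.pem")"
echo "certificate expires in $d days"
if [ "$d" -lt 30 ]; then
  echo "WARN: renew soon"
fi
```

Point `days_left` at `/etc/nixos/secrets/node01-cert.pem` on each node to drive a renewal alert.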

6.5 Network Connectivity Tests

# Test inter-node connectivity (from node01)
ssh root@node01.example.com '
  for node in node02 node03; do
    echo "Testing connectivity to $node..."
    nc -zv $node.example.com 2379
    nc -zv $node.example.com 2380
  done
'

# Test bandwidth (iperf3)
ssh root@node02.example.com 'iperf3 -s' &
ssh root@node01.example.com 'iperf3 -c node02.example.com -t 10'
# Expected: ~10 Gbps on 10GbE, ~1 Gbps on 1GbE

6.6 Performance Smoke Tests

Chainfire Write Performance:

# 1000 writes
time for i in {1..1000}; do
  curl -sk -X PUT https://node01.example.com:2379/v1/kv/test$i \
    -H "Content-Type: application/json" \
    -d "{\"value\":\"test data $i\"}" > /dev/null
done

# Expected: <10 seconds on healthy cluster

FlareDB Query Performance:

# Insert test metrics
curl -k -X POST https://node01.example.com:2479/v1/write \
  -H "Content-Type: application/json" \
  -d '{"metric":"test_metric","value":42,"timestamp":"'$(date -Iseconds)'"}'

# Query performance
time curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"test_metric","start":"1h","end":"now"}'

# Expected: <100ms response time
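
A single `time curl` gives one sample; tail latency matters more. A sketch that summarizes many samples into percentiles (the percentile math is a plain ceil-rank computation, not tied to any PhotonCloud tooling):

```shell
#!/usr/bin/env bash
# Summarize per-request latencies (seconds, one value per line) into
# p50/p95/max. Feed it curl timings, e.g.:
#   for i in $(seq 1 100); do
#     curl -sk -o /dev/null -w '%{time_total}\n' https://node01.example.com:2479/health
#   done | latency_summary
latency_summary() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      i50 = int((NR * 50 + 99) / 100)   # ceil(NR * 0.50), 1-based rank
      i95 = int((NR * 95 + 99) / 100)   # ceil(NR * 0.95)
      printf "n=%d p50=%s p95=%s max=%s\n", NR, v[i50], v[i95], v[NR]
    }'
}

printf '0.4\n0.1\n0.3\n0.2\n' | latency_summary
```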

7. Common Operations

7.1 Adding a New Node

Step 1: Prepare Node Configuration

# Create node directory
mkdir -p /srv/provisioning/nodes/node05.example.com/secrets

# Copy template configuration
cp /srv/provisioning/nodes/node01.example.com/configuration.nix \
   /srv/provisioning/nodes/node05.example.com/

# Edit for new node
vim /srv/provisioning/nodes/node05.example.com/configuration.nix
# Update: hostName, ipAddresses, node_id

Step 2: Generate Cluster Config (Join Mode)

{
  "node_id": "node05",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.21:2380"
}

Step 3: Provision Node

# Power on and PXE boot
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis power on

# Wait for SSH
sleep 60

# Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node05 \
  root@10.0.100.65

Step 4: Verify Join

# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node05")'

7.2 Replacing a Failed Node

Step 1: Remove Failed Node from Cluster

# Remove from Chainfire cluster
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02

# Remove from FlareDB cluster
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02

Step 2: Physically Replace Hardware

  • Power off old node
  • Remove from rack
  • Install new node
  • Connect all cables
  • Configure BMC

Step 3: Provision Replacement Node

# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51

Step 4: Verify Rejoin

# The replacement should automatically rejoin the cluster during first boot
curl -k https://node01.example.com:2379/admin/cluster/members | jq

7.3 Updating Node Configuration

Step 1: Edit Configuration

vim /srv/provisioning/nodes/node01.example.com/configuration.nix
# Make changes (e.g., add service, change network config)

Step 2: Build and Deploy

# Build configuration locally
nix build /srv/provisioning#node01

# Deploy to node (from node or remote)
nixos-rebuild switch --flake /srv/provisioning#node01

Step 3: Verify Changes

# Check active configuration
ssh root@node01.example.com 'nixos-rebuild list-generations'

# Test services still healthy
curl -k https://node01.example.com:2379/health | jq

7.4 Rolling Updates

Update Process (One Node at a Time):

#!/bin/bash
# /srv/provisioning/scripts/rolling-update.sh

NODES=("node01" "node02" "node03")

for node in "${NODES[@]}"; do
  echo "Updating $node..."

  # Build new configuration
  nix build /srv/provisioning#$node

  # Deploy (test mode first)
  ssh root@$node.example.com "nixos-rebuild test --flake /srv/provisioning#$node"

  # Verify health
  if ! curl -sk https://$node.example.com:2379/health | jq -e '.status == "healthy"' > /dev/null; then
    echo "ERROR: $node unhealthy after test, aborting"
    ssh root@$node.example.com "nixos-rebuild switch --rollback"
    exit 1
  fi

  # Apply permanently
  ssh root@$node.example.com "nixos-rebuild switch --flake /srv/provisioning#$node"

  # Wait for reboot if kernel changed
  echo "Waiting 30s for stabilization..."
  sleep 30

  # Final health check
  curl -k https://$node.example.com:2379/health | jq

  echo "$node updated successfully"
done

7.5 Draining a Node for Maintenance

Step 1: Mark Node for Drain

# Disable node in load balancer (if using one)
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
  -d '{"status":"drain"}'

Step 2: Migrate VMs (PlasmaVMC)

# List VMs on node
ssh root@node02.example.com 'systemctl list-units | grep plasmavmc-vm@'

# Migrate each VM
curl -k -X POST https://node01.example.com:9090/api/vms/vm-001/migrate \
  -d '{"target_node":"node03"}'

Step 3: Stop Services

ssh root@node02.example.com '
  systemctl stop plasmavmc.service
  systemctl stop chainfire.service
  systemctl stop flaredb.service
'

Step 4: Perform Maintenance

# Reboot for kernel update, hardware maintenance, etc.
ssh root@node02.example.com 'reboot'

Step 5: Re-enable Node

# Verify all services healthy
ssh root@node02.example.com 'systemctl status chainfire flaredb plasmavmc'

# Re-enable in load balancer
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
  -d '{"status":"active"}'

7.6 Decommissioning a Node

Step 1: Drain Node (see 7.5)

Step 2: Remove from Cluster

# Remove from Chainfire
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02

# Remove from FlareDB
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02

# Verify removal
curl -k https://node01.example.com:2379/admin/cluster/members | jq

Step 3: Power Off

# Via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin -P password chassis power off

# Or via SSH
ssh root@node02.example.com 'poweroff'

Step 4: Update Inventory

# Remove from node inventory
vim /srv/provisioning/inventory.json
# Remove node02 entry

# Remove from DNS
# Update DNS zone to remove node02.example.com

# Remove from monitoring
# Update Prometheus targets to remove node02

8. Troubleshooting

8.1 PXE Boot Failures

Symptom: Server does not obtain IP address or does not boot from network

Diagnosis:

# Monitor DHCP server logs
sudo journalctl -u dhcpd4 -f

# Monitor TFTP requests
sudo tcpdump -i eth0 -n port 69

# Check PXE server services
sudo systemctl status dhcpd4 atftpd nginx

Common Causes:

  1. DHCP server not running: sudo systemctl start dhcpd4
  2. Wrong network interface: Check interfaces in dhcpd.conf
  3. Firewall blocking DHCP/TFTP: sudo iptables -L -n | grep -E "67|68|69"
  4. PXE not enabled in BIOS: Enter BIOS and enable Network Boot
  5. Network cable disconnected: Check physical connection

Solution:

# Restart all PXE services
sudo systemctl restart dhcpd4 atftpd nginx

# Verify DHCP configuration
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Power cycle server
ipmitool -I lanplus -H <bmc-ip> -U admin chassis power cycle

8.2 Installation Failures (nixos-anywhere)

Symptom: nixos-anywhere fails during disk partitioning, installation, or bootloader setup

Diagnosis:

# Check nixos-anywhere output for errors
# Common errors: disk not found, partition table errors, out of space

# SSH to installer for manual inspection
ssh root@10.0.100.50

# Check disk status
lsblk
dmesg | grep -i error

Common Causes:

  1. Disk device wrong: Update disko.nix with correct device (e.g., /dev/nvme0n1)
  2. Disk not wiped: Previous partition table conflicts
  3. Out of disk space: Insufficient storage for Nix closures
  4. Network issues: Cannot download packages from binary cache

Solution:

# Manual disk wipe (on installer)
ssh root@10.0.100.50 '
  wipefs -a /dev/sda
  sgdisk --zap-all /dev/sda
'

# Retry nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  root@10.0.100.50

8.3 Cluster Join Failures

Symptom: Node boots successfully but does not join cluster

Diagnosis:

# Check first-boot logs on the joining node (node04 in this example)
ssh root@node04.example.com 'journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service'

# Common errors:
# - "Health check timeout after 120s"
# - "Join request failed: connection refused"
# - "Configuration file not found"

Bootstrap Mode vs Join Mode:

  • Bootstrap: Node expects to create new cluster with peers
  • Join: Node expects to connect to existing leader
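
The bullets above are the branch the first-boot automation takes; a sketch of that decision (endpoint behavior follows this runbook's examples and is illustrative only):

```shell
#!/usr/bin/env bash
# Sketch of the decision the first-boot cluster-join service makes.
# Endpoint behavior follows this runbook's examples; illustrative only.
set -u

first_boot_mode() {
  # Prints "bootstrap" or "join" based on cluster-config.json.
  local cfg="$1"
  if grep -q '"bootstrap": true' "$cfg"; then
    echo "bootstrap"   # form a new cluster with the configured initial peers
  else
    echo "join"        # register with the leader via /admin/member/add
  fi
}

cfg="$(mktemp)"
printf '{"node_id": "node04", "bootstrap": false}\n' > "$cfg"
echo "node04 will start in $(first_boot_mode "$cfg") mode"
```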

Common Causes:

  1. Wrong bootstrap flag: Check cluster-config.json
  2. Leader unreachable: Network/firewall issue
  3. TLS certificate errors: Verify cert paths and validity
  4. Service not starting: Check main service (chainfire.service)

Solution:

# Verify cluster-config.json on the joining node
ssh root@node04.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'

# Test leader connectivity
ssh root@node04.example.com 'curl -k https://node01.example.com:2379/health'

# Check TLS certificates
ssh root@node04.example.com 'ls -l /etc/nixos/secrets/*.pem'

# Manual cluster join (if automation fails)
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.200.20:2380"}'

8.4 Service Start Failures

Symptom: Service fails to start after boot

Diagnosis:

# Check service status
ssh root@node01.example.com 'systemctl status chainfire.service'

# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'

# Common errors:
# - "bind: address already in use" (port conflict)
# - "certificate verify failed" (TLS issue)
# - "permission denied" (file permissions)

Common Causes:

  1. Port already in use: Another service using same port
  2. Missing dependencies: Required service not running
  3. Configuration error: Invalid config file
  4. File permissions: Cannot read secrets

Solution:

# Check port usage
ssh root@node01.example.com 'ss -tlnp | grep 2379'

# Verify dependencies
ssh root@node01.example.com 'systemctl list-dependencies chainfire.service'

# Test configuration manually
ssh root@node01.example.com 'chainfire-server --config /etc/nixos/chainfire.toml --check-config'

# Fix permissions
ssh root@node01.example.com 'chmod 600 /etc/nixos/secrets/*-key.pem'

8.5 Network Connectivity Issues

Symptom: Nodes cannot communicate with each other or external services

Diagnosis:

# Test basic connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'

# Test specific ports
ssh root@node01.example.com 'nc -zv node02.example.com 2379'

# Check firewall rules
ssh root@node01.example.com 'iptables -L -n | grep 2379'

# Check routing
ssh root@node01.example.com 'ip route show'

Common Causes:

  1. Firewall blocking traffic: Missing iptables rules
  2. Wrong IP address: Configuration mismatch
  3. Network interface down: Interface not configured
  4. DNS resolution failure: Cannot resolve hostnames

Solution:

# Add firewall rules (immediate fix; on NixOS, persist via networking.firewall)
ssh root@node01.example.com '
  iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
  iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
  iptables-save > /etc/iptables/rules.v4
'

# Fix DNS resolution (temporary; on NixOS, persist via networking.hosts)
ssh root@node01.example.com '
  echo "10.0.200.11 node02.example.com node02" >> /etc/hosts
'

# Restart networking
ssh root@node01.example.com 'systemctl restart systemd-networkd'

8.6 TLS Certificate Errors

Symptom: Services cannot establish TLS connections

Diagnosis:

# Test TLS connection
openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem

# Check certificate validity
ssh root@node01.example.com '
  openssl x509 -in /etc/nixos/secrets/node01-cert.pem -noout -dates
'

# Common errors:
# - "certificate verify failed" (wrong CA)
# - "certificate has expired" (cert expired)
# - "certificate subject name mismatch" (wrong CN)

Common Causes:

  1. Expired certificate: Regenerate certificate
  2. Wrong CA certificate: Verify CA cert is correct
  3. Hostname mismatch: CN does not match hostname
  4. File permissions: Cannot read certificate files

Solution:

# Regenerate certificate
openssl req -new -key /srv/provisioning/secrets/node01-key.pem \
  -out /srv/provisioning/secrets/node01-csr.pem \
  -subj "/CN=node01.example.com"

openssl x509 -req -in /srv/provisioning/secrets/node01-csr.pem \
  -CA /srv/provisioning/ca-cert.pem \
  -CAkey /srv/provisioning/ca-key.pem \
  -CAcreateserial \
  -out /srv/provisioning/secrets/node01-cert.pem \
  -days 365

# Copy to node
scp /srv/provisioning/secrets/node01-cert.pem root@node01.example.com:/etc/nixos/secrets/

# Restart service
ssh root@node01.example.com 'systemctl restart chainfire.service'

8.7 Performance Degradation

Symptom: Services are slow or unresponsive

Diagnosis:

# Check system load
ssh root@node01.example.com 'uptime'
ssh root@node01.example.com 'top -bn1 | head -20'

# Check disk I/O
ssh root@node01.example.com 'iostat -x 1 5'

# Check network bandwidth
ssh root@node01.example.com 'iftop -i eth1'

# Check Raft logs for slow operations
ssh root@node01.example.com 'journalctl -u chainfire.service | grep "slow operation"'

Common Causes:

  1. High CPU usage: Too many requests, inefficient queries
  2. Disk I/O bottleneck: Slow disk, too many writes
  3. Network saturation: Bandwidth exhausted
  4. Memory pressure: OOM killer active
  5. Raft slow commits: Network latency between nodes

Solution:

# Add more resources (vertical scaling)
# Or add more nodes (horizontal scaling)

# Check for resource leaks
ssh root@node01.example.com 'systemctl status chainfire | grep Memory'

# Restart service to clear memory leaks (temporary)
ssh root@node01.example.com 'systemctl restart chainfire.service'

# Optimize disk I/O (enable write caching if safe)
ssh root@node01.example.com 'hdparm -W1 /dev/sda'

9. Rollback & Recovery

9.1 NixOS Generation Rollback

NixOS provides atomic rollback capability via generations:

List Available Generations:

ssh root@node01.example.com 'nixos-rebuild list-generations'
# Example output:
#   1   2025-12-10 10:30:00
#   2   2025-12-10 12:45:00 (current)

Rollback to Previous Generation:

# Rollback and reboot
ssh root@node01.example.com 'nixos-rebuild switch --rollback'

# Or boot into previous generation once (no permanent change)
ssh root@node01.example.com 'nixos-rebuild boot --rollback && reboot'

Rollback to Specific Generation:

ssh root@node01.example.com '
  nix-env --switch-generation 1 -p /nix/var/nix/profiles/system
  /nix/var/nix/profiles/system/bin/switch-to-configuration boot
  reboot
'
# switch-to-configuration boot installs the bootloader entry so the
# reboot lands on generation 1

9.2 Re-Provisioning from PXE

Complete re-provisioning wipes all data and reinstalls from scratch:

Step 1: Remove Node from Cluster

curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02

Step 2: Set Boot to PXE

ipmitool -I lanplus -H 10.0.10.51 -U admin chassis bootdev pxe

Step 3: Reboot Node

ssh root@node02.example.com 'reboot'
# Or via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin chassis power cycle

Step 4: Run nixos-anywhere

# Wait for PXE boot and SSH ready
sleep 90

nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51

9.3 Disaster Recovery Procedures

Complete Cluster Loss (All Nodes Down):

Step 1: Restore from Backup (if available)

# Restore Chainfire data
ssh root@node01.example.com '
  systemctl stop chainfire.service
  rm -rf /var/lib/chainfire/*
  tar -xzf /backup/chainfire-$(date +%Y%m%d).tar.gz -C /  # archive stores var/lib/chainfire relative to /
  systemctl start chainfire.service
'

Step 2: Bootstrap a New Cluster

If no backup is available, re-provision all nodes in bootstrap mode:

# Update cluster-config.json for all nodes
# Set bootstrap=true, same initial_peers

# Provision all 3 nodes
for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#$node \
    root@<node-ip> &
done
wait

Single Node Failure:

Step 1: Verify Cluster Quorum

# Check that the remaining nodes still hold quorum (the failed node may
# still appear in the member list until it is removed)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '[.members[] | select(.status=="healthy")] | length'
# Expected: 2 healthy members in a 3-node cluster with 1 failure (quorum = 2)
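
Quorum for an n-member Raft cluster is floor(n/2) + 1, so a 3-node cluster survives one failure but not two. A quick sketch of the check:

```shell
#!/usr/bin/env bash
# Raft quorum check: writes succeed only while
# healthy members >= floor(total / 2) + 1.
quorum_ok() {
  local total="$1" healthy="$2"
  local need=$(( total / 2 + 1 ))
  if [ "$healthy" -ge "$need" ]; then
    echo "quorum OK ($healthy/$total healthy, need $need)"
  else
    echo "QUORUM LOST ($healthy/$total healthy, need $need)"
    return 1
  fi
}

quorum_ok 3 2   # one failure in a 3-node cluster: still writable
```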

Step 2: Remove Failed Node

curl -k -X DELETE https://node01.example.com:2379/admin/member/node02

Step 3: Provision Replacement

# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51

9.4 Backup and Restore

Automated Backup Script:

#!/bin/bash
# /srv/provisioning/scripts/backup-cluster.sh

BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup Chainfire data
for node in node01 node02 node03; do
  ssh root@$node.example.com \
    "tar -czf - /var/lib/chainfire" > "$BACKUP_DIR/chainfire-$node.tar.gz"
done

# Backup FlareDB data
for node in node01 node02 node03; do
  ssh root@$node.example.com \
    "tar -czf - /var/lib/flaredb" > "$BACKUP_DIR/flaredb-$node.tar.gz"
done

# Backup configurations
cp -r /srv/provisioning/nodes "$BACKUP_DIR/configs"

echo "Backup complete: $BACKUP_DIR"

Restore Script:

#!/bin/bash
# /srv/provisioning/scripts/restore-cluster.sh

BACKUP_DIR="$1"
if [ -z "$BACKUP_DIR" ]; then
  echo "Usage: $0 <backup-dir>"
  exit 1
fi

# Stop services on all nodes
for node in node01 node02 node03; do
  ssh root@$node.example.com 'systemctl stop chainfire flaredb'
done

# Restore Chainfire data
for node in node01 node02 node03; do
  cat "$BACKUP_DIR/chainfire-$node.tar.gz" | \
    ssh root@$node.example.com "cd / && tar -xzf -"
done

# Restore FlareDB data
for node in node01 node02 node03; do
  cat "$BACKUP_DIR/flaredb-$node.tar.gz" | \
    ssh root@$node.example.com "cd / && tar -xzf -"
done

# Restart services
for node in node01 node02 node03; do
  ssh root@$node.example.com 'systemctl start chainfire flaredb'
done

echo "Restore complete"

10. Security Best Practices

10.1 SSH Key Management

Generate Dedicated Provisioning Key:

ssh-keygen -t ed25519 -C "provisioning@example.com" -f ~/.ssh/id_ed25519_provisioning

Add to Netboot Image:

# In netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
  "ssh-ed25519 AAAAC3Nza... provisioning@example.com"
];

Rotate Keys Regularly:

# Generate new key
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_provisioning_new

# Add to all nodes
for node in node01 node02 node03; do
  ssh-copy-id -i ~/.ssh/id_ed25519_provisioning_new.pub root@$node.example.com
done

# Remove old key from authorized_keys
# Update netboot image with new key

10.2 TLS Certificate Rotation

Automated Rotation Script:

#!/bin/bash
# /srv/provisioning/scripts/rotate-certs.sh

# Generate new certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key-new.pem 4096
  openssl req -new -key ${node}-key-new.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem \
    -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert-new.pem -days 365
done

# Deploy new certificates (without restarting services yet)
for node in node01 node02 node03; do
  scp ${node}-cert-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-cert-new.pem
  scp ${node}-key-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-key-new.pem
done

# Update configuration to use new certs
# ... (NixOS configuration update) ...

# Rolling restart to apply new certificates
for node in node01 node02 node03; do
  ssh root@${node}.example.com 'systemctl restart chainfire flaredb iam'
  sleep 30  # Wait for stabilization
done

echo "Certificate rotation complete"

10.3 Secrets Management

Best Practices:

  • Store secrets outside Nix store (use /etc/nixos/secrets/)
  • Set restrictive permissions (0600 for private keys, 0400 for passwords)
  • Use environment variables for runtime secrets
  • Never commit secrets to Git
  • Use encrypted secrets (sops-nix or agenix)
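
The permissions rule above is easy to verify mechanically; a sketch that flags anything group- or world-readable (the path and acceptable modes follow the conventions above; adjust to taste):

```shell
#!/usr/bin/env bash
# Audit a secrets directory for group/world-readable files.
# Acceptable modes follow this runbook's conventions (600/400).
audit_secret_perms() {
  local dir="$1" bad=0 f mode
  while IFS= read -r f; do
    mode="$(stat -c '%a' "$f")"
    case "$mode" in
      600|400) ;;                             # private keys / passwords
      *) echo "WARN: $f is mode $mode (expected 600 or 400)"; bad=1 ;;
    esac
  done < <(find "$dir" -type f)
  return "$bad"
}

# Usage: audit_secret_perms /etc/nixos/secrets
```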

Example with sops-nix:

# In configuration.nix
{
  imports = [ <sops-nix/modules/sops> ];

  sops.defaultSopsFile = ./secrets.yaml;
  sops.secrets."node01/tls-key" = {
    owner = "chainfire";
    mode = "0400";
  };

  services.chainfire.settings.tls.key_path = config.sops.secrets."node01/tls-key".path;
}

10.4 Network Isolation

VLAN Segmentation:

  • Management VLAN (10): BMC/IPMI, provisioning workstation
  • Provisioning VLAN (100): PXE boot, temporary
  • Production VLAN (200): Cluster services, inter-node communication
  • Client VLAN (300): External clients accessing services

Firewall Zones:

# Example nftables rules
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    # Management from trusted subnet only
    iifname "eth0" ip saddr 10.0.10.0/24 tcp dport 22 accept

    # Cluster traffic from cluster subnet only
    iifname "eth1" ip saddr 10.0.200.0/24 tcp dport { 2379, 2380, 2479, 2480 } accept

    # Client traffic from client subnet only
    iifname "eth2" ip saddr 10.0.30.0/24 tcp dport { 8080, 9090 } accept  # client subnet (VLAN 300)
  }
}

10.5 Audit Logging

Enable Structured Logging:

# In configuration.nix
services.chainfire.settings.logging = {
  level = "info";
  format = "json";
  output = "journal";
};

# Enable journald forwarding to SIEM
services.journald.extraConfig = ''
  ForwardToSyslog=yes
  Storage=persistent
  MaxRetentionSec=7days
'';

Audit Key Events:

  • Cluster membership changes
  • Node joins/leaves
  • Authentication failures
  • Configuration changes
  • TLS certificate errors
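
Those events can be pulled out of journald output with a simple filter; a sketch (the patterns are illustrative, not Chainfire's actual log strings):

```shell
#!/usr/bin/env bash
# Filter service logs for audit-relevant events: membership changes,
# auth failures, TLS errors. Patterns are illustrative.
audit_filter() {
  grep -Ei 'member (add|remov)|join|leave|auth(entication)? fail|certificate'
}

# Demo against captured log lines:
printf '%s\n' \
  'msg="member added" id=node04' \
  'msg="heartbeat ok"' \
  'msg="authentication failed" user=admin' | audit_filter
```

In practice, pipe `journalctl -u chainfire.service -o cat` into `audit_filter`.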

Log Aggregation:

# Forward logs to central logging server
# Example: rsyslog configuration
cat > /etc/rsyslog.d/50-remote.conf <<EOF
*.* @@logging-server.example.com:514
EOF
systemctl restart rsyslog

Appendix A: Service Port Reference

See NETWORK.md for complete port matrix.

Appendix B: Hardware Vendor Commands

See HARDWARE.md for vendor-specific BIOS configurations and IPMI commands.

Appendix C: Complete Command Reference

See COMMANDS.md for all commands organized by task.

Appendix D: Quick Reference Cards

See QUICKSTART.md for condensed deployment guide.

Appendix E: Deployment Flow Diagrams

See diagrams/deployment-flow.md for visual workflow.

Related Documentation

  • Design Document: /home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md
  • PXE Server: /home/centra/cloud/chainfire/baremetal/pxe-server/README.md
  • Image Builder: /home/centra/cloud/baremetal/image-builder/README.md
  • First-Boot Automation: /home/centra/cloud/baremetal/first-boot/README.md

End of Operator Runbook