# Bare-Metal Provisioning Operator Runbook
**Document Version:** 1.0
**Last Updated:** 2025-12-10
**Status:** Production Ready
**Author:** PlasmaCloud Infrastructure Team
## 1. Overview
### 1.1 What This Runbook Covers
This runbook provides comprehensive, step-by-step instructions for deploying PlasmaCloud infrastructure on bare-metal servers using automated PXE-based provisioning. By following this guide, operators will be able to:
- Deploy a complete PlasmaCloud cluster from bare hardware to running services
- Bootstrap a 3-node Raft cluster (Chainfire + FlareDB)
- Add additional nodes to an existing cluster
- Validate cluster health and troubleshoot common issues
- Perform operational tasks (updates, maintenance, recovery)
### 1.2 Prerequisites
**Required Access and Permissions:**
- Root/sudo access on provisioning server
- Physical or IPMI/BMC access to bare-metal servers
- Network access to provisioning VLAN
- SSH key pair for nixos-anywhere
**Required Tools:**
- NixOS with flakes enabled (provisioning workstation)
- curl, jq, ssh client
- ipmitool (optional, for remote management)
- Serial console access tool (optional)
**Required Knowledge:**
- Basic understanding of PXE boot process
- Linux system administration
- Network configuration (DHCP, DNS, firewall)
- NixOS basics (declarative configuration, flakes)
### 1.3 Architecture Diagram
```
                        Bare-Metal Provisioning Flow
                        ============================

Phase 1: PXE Boot
  ┌──────────────┐  1. DHCP request           ┌──────────────────┐
  │  Bare-metal  ├───────────────────────────>│  DHCP server     │
  │  server      │  2. TFTP get bootloader    │  (PXE server)    │
  │  (powered on,│<───────────────────────────┤                  │
  │  PXE enabled)│                            └──────────────────┘
  │              │  4. HTTP GET boot.ipxe     ┌──────────────────┐
  │  3. iPXE     ├───────────────────────────>│  HTTP server     │
  │     loads    │  6. HTTP GET kernel+initrd │  (nginx)         │
  │  5. iPXE menu│<───────────────────────────┤                  │
  │  7. Boot     │                            └──────────────────┘
  │     NixOS    │
  │     installer│
  └──────┬───────┘
         │ 8. SSH connection                  ┌──────────────────┐
         └─────────────────────────────────── │  Provisioning    │
                                              │  workstation     │
                                              │                  │
                                              │  9. Run          │
                                              │     nixos-       │
                                              │     anywhere     │
                                              └────────┬─────────┘
Phase 2: Installation                                  │
             ┌─────────────────────────────────────────┤
             v                                         v
  ┌──────────────────────────┐          ┌──────────────────────────┐
  │ 10. Partition disks      │          │ 11. Install NixOS        │
  │     (disko)              │          │     - Build system       │
  │     - GPT/LVM/LUKS       │          │     - Copy closures      │
  │     - Format filesystems │          │     - Install bootloader │
  │     - Mount /mnt         │          │     - Inject secrets     │
  └──────────────────────────┘          └──────────────────────────┘

Phase 3: First Boot
  ┌──────────────┐  12. Reboot                ┌──────────────────┐
  │  Bare-metal  ├───────────────────────────>│  NixOS system    │
  │  server      │                            │  (from disk)     │
  └──────────────┘                            └────────┬─────────┘
                                                       │
             ┌─────────────────────────────────────────┴─┐
             │ 13. First-boot automation                 │
             │     - Chainfire cluster join/bootstrap    │
             │     - FlareDB cluster join/bootstrap      │
             │     - IAM initialization                  │
             │     - Health checks                       │
             └───────────────────┬───────────────────────┘
Phase 4: Running Cluster         │
                                 v
                      ┌──────────────────┐
                      │ Running cluster  │
                      │ - All services   │
                      │   healthy        │
                      │ - Raft quorum    │
                      │ - TLS enabled    │
                      └──────────────────┘
```
## 2. Hardware Requirements
### 2.1 Minimum Specifications Per Node
**Control Plane Nodes (3-5 recommended):**
- CPU: 8 cores / 16 threads (Intel Xeon or AMD EPYC)
- RAM: 32 GB DDR4 ECC
- Storage: 500 GB SSD (NVMe preferred)
- Network: 2x 10 GbE (bonded/redundant)
- BMC: IPMI 2.0 or Redfish compatible
**Worker Nodes:**
- CPU: 16+ cores / 32+ threads
- RAM: 64 GB+ DDR4 ECC
- Storage: 1 TB+ NVMe SSD
- Network: 2x 10 GbE or 2x 25 GbE
- BMC: IPMI 2.0 or Redfish compatible
**All-in-One (Development/Testing):**
- CPU: 16 cores / 32 threads
- RAM: 64 GB DDR4
- Storage: 1 TB SSD
- Network: 1x 10 GbE (minimum)
- BMC: Optional but recommended
### 2.2 Recommended Production Specifications
**Control Plane Nodes:**
- CPU: 16-32 cores (Intel Xeon Gold/Platinum or AMD EPYC)
- RAM: 64-128 GB DDR4 ECC
- Storage: 1-2 TB NVMe SSD (RAID1 for redundancy)
- Network: 2x 25 GbE (active/active bonding)
- BMC: Redfish with SOL (Serial-over-LAN)
**Worker Nodes:**
- CPU: 32-64 cores
- RAM: 128-256 GB DDR4 ECC
- Storage: 2-4 TB NVMe SSD
- Network: 2x 25 GbE or 2x 100 GbE
- GPU: Optional (NVIDIA/AMD for ML workloads)
### 2.3 Hardware Compatibility Matrix
| Vendor | Model | Tested | BIOS | UEFI | Notes |
|-----------|---------------|--------|------|------|--------------------------------|
| Dell | PowerEdge R640| Yes | Yes | Yes | Requires BIOS A19+ |
| Dell | PowerEdge R650| Yes | Yes | Yes | Best PXE compatibility |
| HPE | ProLiant DL360| Yes | Yes | Yes | Disable Secure Boot |
| HPE | ProLiant DL380| Yes | Yes | Yes | Latest firmware recommended |
| Supermicro| SYS-2029U | Yes | Yes | Yes | Requires BMC 1.73+ |
| Lenovo | ThinkSystem | Partial| Yes | Yes | Some NIC issues on older models|
| Generic | Whitebox x86 | Partial| Yes | Maybe| UEFI support varies |
### 2.4 BIOS/UEFI Settings
**Required Settings:**
- Boot Mode: UEFI (preferred) or Legacy BIOS
- PXE/Network Boot: Enabled on primary NIC
- Boot Order: Network → Disk
- Secure Boot: Disabled (for PXE boot)
- Virtualization: Enabled (VT-x/AMD-V)
- SR-IOV: Enabled (if using advanced networking)
**Dell-Specific (iDRAC):**
```
System BIOS → Boot Settings:
Boot Mode: UEFI
UEFI Network Stack: Enabled
PXE Device 1: Integrated NIC 1
System BIOS → System Profile:
Profile: Performance
```
**HPE-Specific (iLO):**
```
System Configuration → BIOS/Platform:
Boot Mode: UEFI Mode
Network Boot: Enabled
PXE Support: UEFI Only
System Configuration → UEFI Boot Order:
1. Network Adapter (NIC 1)
2. Hard Disk
```
**Supermicro-Specific (IPMI):**
```
BIOS Setup → Boot:
Boot mode select: UEFI
UEFI Network Stack: Enabled
Boot Option #1: UEFI Network
BIOS Setup → Advanced → CPU Configuration:
Intel Virtualization Technology: Enabled
```
### 2.5 BMC/IPMI Requirements
**Mandatory Features:**
- Remote power control (on/off/reset)
- Boot device selection (PXE/disk)
- Remote console access (KVM-over-IP or SOL)
**Recommended Features:**
- Virtual media mounting
- Sensor monitoring (temperature, fans, PSU)
- Event logging
- SMTP alerting
**Network Configuration:**
- Dedicated BMC network (separate VLAN recommended)
- Static IP or DHCP reservation
- HTTPS access enabled
- Default credentials changed
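Before signing off on the BMC checklist, it helps to confirm every BMC actually answers over the network. A minimal sketch, assuming `ipmitool` with lanplus support; the host list and credentials are placeholders for your environment, and with `DRY_RUN=1` the helper only prints the command it would run (omitting the password):

```shell
#!/usr/bin/env bash
# Sketch: verify each BMC answers before provisioning starts.
# Hosts and credentials below are placeholders.
set -euo pipefail

BMC_USER="${BMC_USER:-admin}"
BMC_PASS="${BMC_PASS:-password}"
BMC_HOSTS=(10.0.10.50 10.0.10.51 10.0.10.52)

# Build the ipmitool invocation for one BMC; with DRY_RUN=1 just print it.
bmc_cmd() {
  local host="$1"; shift
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo ipmitool -I lanplus -H "$host" -U "$BMC_USER" "$@"
  else
    ipmitool -I lanplus -H "$host" -U "$BMC_USER" -P "$BMC_PASS" "$@"
  fi
}

# Set RUN_CHECKS=1 to actually query the BMCs.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  for host in "${BMC_HOSTS[@]}"; do
    bmc_cmd "$host" chassis status   # power state and faults
    bmc_cmd "$host" sel list         # hardware event log
  done
fi
```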
## 3. Network Setup
### 3.1 Network Topology
**Single-Segment Topology (Simple):**
```
┌─────────────────────────────────────────────────────┐
│ Provisioning server (PXE/DHCP/HTTP)                 │
│ 10.0.100.10                                         │
└──────────────────────────┬──────────────────────────┘
                           │ Layer 2 switch (unmanaged)
          ┌────────────────┼────────────────┐
          │                │                │
     ┌────┴─────┐     ┌────┴─────┐     ┌────┴─────┐
     │  Node01  │     │  Node02  │     │  Node03  │
     │10.0.100  │     │10.0.100  │     │10.0.100  │
     │   .50    │     │   .51    │     │   .52    │
     └──────────┘     └──────────┘     └──────────┘
```
**Multi-VLAN Topology (Production):**
```
┌──────────────────────────────────────────────────────┐
│ Management network (VLAN 10)                         │
│ - Provisioning server: 10.0.10.10                    │
│ - BMC/IPMI:            10.0.10.50-99                 │
└──────────────────┬───────────────────────────────────┘
┌──────────────────┴───────────────────────────────────┐
│ Provisioning network (VLAN 100)                      │
│ - PXE boot:   10.0.100.0/24                          │
│ - DHCP range: 10.0.100.100-200                       │
└──────────────────┬───────────────────────────────────┘
┌──────────────────┴───────────────────────────────────┐
│ Production network (VLAN 200)                        │
│ - Static IPs: 10.0.200.10-99                         │
│ - Service traffic                                    │
└──────────────────┬───────────────────────────────────┘
          ┌────────┴────────┐
          │    L3 switch    │
          │ (VLANs, routing)│
          └────────┬────────┘
      ┌────────────┼─────────────┐
      │            │             │
 ┌────┴─────┐ ┌────┴─────┐      ...
 │  Node01  │ │  Node02  │
 │ eth0:    │ │ eth0:    │
 │  VLAN 100│ │  VLAN 100│
 │ eth1:    │ │ eth1:    │
 │  VLAN 200│ │  VLAN 200│
 └──────────┘ └──────────┘
```
### 3.2 DHCP Server Configuration
**ISC DHCP Configuration (`/etc/dhcp/dhcpd.conf`):**
```dhcp
# Global options
option architecture-type code 93 = unsigned integer 16;

default-lease-time 600;
max-lease-time 7200;
authoritative;

# Provisioning subnet
subnet 10.0.100.0 netmask 255.255.255.0 {
  range 10.0.100.100 10.0.100.200;
  option routers 10.0.100.1;
  option domain-name-servers 10.0.100.1, 8.8.8.8;
  option domain-name "prov.example.com";

  # PXE boot server
  next-server 10.0.100.10;

  # Architecture-specific boot file selection
  if exists user-class and option user-class = "iPXE" {
    # iPXE already loaded, provide boot script via HTTP
    filename "http://10.0.100.10:8080/boot/ipxe/boot.ipxe";
  } elsif option architecture-type = 00:00 {
    # BIOS (legacy) - load iPXE via TFTP
    filename "undionly.kpxe";
  } elsif option architecture-type = 00:07 {
    # UEFI x86_64 - load iPXE via TFTP
    filename "ipxe.efi";
  } elsif option architecture-type = 00:09 {
    # UEFI x86_64 (alternate) - load iPXE via TFTP
    filename "ipxe.efi";
  } else {
    # Fallback to UEFI
    filename "ipxe.efi";
  }
}

# Static reservations for control plane nodes
host node01 {
  hardware ethernet 52:54:00:12:34:56;
  fixed-address 10.0.100.50;
  option host-name "node01";
}
host node02 {
  hardware ethernet 52:54:00:12:34:57;
  fixed-address 10.0.100.51;
  option host-name "node02";
}
host node03 {
  hardware ethernet 52:54:00:12:34:58;
  fixed-address 10.0.100.52;
  option host-name "node03";
}
```
**Validation Commands:**
```bash
# Test DHCP configuration syntax
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf
# Start DHCP server
sudo systemctl start isc-dhcp-server
sudo systemctl enable isc-dhcp-server
# Monitor DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases
# Test DHCP response
sudo nmap --script broadcast-dhcp-discover -e eth0
```
### 3.3 DNS Requirements
**Forward DNS Zone (`example.com`):**
```zone
; Control plane nodes
node01.example.com. IN A 10.0.200.10
node02.example.com. IN A 10.0.200.11
node03.example.com. IN A 10.0.200.12
; Worker nodes
worker01.example.com. IN A 10.0.200.20
worker02.example.com. IN A 10.0.200.21
; Service VIPs (optional, for load balancing)
chainfire.example.com. IN A 10.0.200.100
flaredb.example.com. IN A 10.0.200.101
iam.example.com. IN A 10.0.200.102
```
**Reverse DNS Zone (`200.0.10.in-addr.arpa`):**
```zone
; Control plane nodes
10.200.0.10.in-addr.arpa. IN PTR node01.example.com.
11.200.0.10.in-addr.arpa. IN PTR node02.example.com.
12.200.0.10.in-addr.arpa. IN PTR node03.example.com.
```
**Validation:**
```bash
# Test forward resolution
dig +short node01.example.com
# Test reverse resolution
dig +short -x 10.0.200.10
# Test from target node after provisioning
ssh root@10.0.100.50 'hostname -f'
```
### 3.4 Firewall Rules
**Service Port Matrix (see NETWORK.md for complete reference):**
| Service | API Port | Raft Port | Additional | Protocol |
|--------------|----------|-----------|------------|----------|
| Chainfire | 2379 | 2380 | 2381 (gossip) | TCP |
| FlareDB | 2479 | 2480 | - | TCP |
| IAM | 8080 | - | - | TCP |
| PlasmaVMC | 9090 | - | - | TCP |
| PrismNET | 9091 | - | - | TCP |
| FlashDNS | 53 | - | - | TCP/UDP |
| FiberLB | 9092 | - | - | TCP |
| K8sHost | 10250 | - | - | TCP |
**iptables Rules (Provisioning Server):**
```bash
#!/bin/bash
# Provisioning server firewall rules
# Allow DHCP
iptables -A INPUT -p udp --dport 67 -j ACCEPT
iptables -A INPUT -p udp --dport 68 -j ACCEPT
# Allow TFTP
iptables -A INPUT -p udp --dport 69 -j ACCEPT
# Allow HTTP (boot server)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
# Allow SSH (for nixos-anywhere)
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
```
**iptables Rules (Cluster Nodes):**
```bash
#!/bin/bash
# Cluster node firewall rules
# Allow SSH (management)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT
# Allow Chainfire (from cluster subnet only)
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2381 -s 10.0.200.0/24 -j ACCEPT
# Allow FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2480 -s 10.0.200.0/24 -j ACCEPT
# Allow IAM (from cluster and client subnets)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT
# Drop all other traffic
iptables -A INPUT -j DROP
```
**nftables Rules (Modern Alternative):**
```nft
#!/usr/sbin/nft -f
flush ruleset

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    # Allow established connections
    ct state established,related accept

    # Allow loopback
    iif lo accept

    # Allow SSH
    tcp dport 22 ip saddr 10.0.0.0/8 accept

    # Allow cluster services from cluster subnet
    tcp dport { 2379, 2380, 2381, 2479, 2480 } ip saddr 10.0.200.0/24 accept

    # Allow IAM from internal network
    tcp dport 8080 ip saddr 10.0.0.0/8 accept
  }
}
```
### 3.5 Static IP Allocation Strategy
**IP Allocation Plan:**
```
10.0.100.0/24 - Provisioning network (DHCP during install)
.1 - Gateway
.10 - PXE/DHCP/HTTP server
.50-.79 - Control plane nodes (static reservations)
.80-.99 - Worker nodes (static reservations)
.100-.200 - DHCP pool (temporary during provisioning)
10.0.200.0/24 - Production network (static IPs)
.1 - Gateway
.10-.19 - Control plane nodes
.20-.99 - Worker nodes
.100-.199 - Service VIPs
```
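When writing reservations by hand, a quick membership check guards against typos that place a node outside its intended block. A pure-bash sketch (no external tools); `ip_to_int` and `in_subnet` are local helper names:

```shell
#!/usr/bin/env bash
# Sketch: verify a host IP falls inside the intended allocation block
# before committing DHCP reservations or static configs.
set -euo pipefail

# Convert dotted-quad to a 32-bit integer.
ip_to_int() {
  local IFS=.
  read -r a b c d <<<"$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_subnet IP CIDR -> exit 0 if IP is inside CIDR
in_subnet() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net="${2%/*}"
  bits="${2#*/}"
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Example: a control-plane static IP must be in the production /24.
in_subnet 10.0.200.10 10.0.200.0/24 && echo "10.0.200.10 OK"
```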
### 3.6 Network Bandwidth Requirements
**Per-Node During Provisioning:**
- PXE boot: ~200-500 MB (kernel + initrd)
- nixos-anywhere: ~1-5 GB (NixOS closures)
- Time: 5-15 minutes on 1 Gbps link
**Production Cluster:**
- Control plane: 1 Gbps minimum, 10 Gbps recommended
- Workers: 10 Gbps minimum, 25 Gbps recommended
- Inter-node latency: <1ms ideal, <5ms acceptable
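The latency ceiling above can be checked straight from ping's summary line. A sketch: `parse_avg_rtt` and `rtt_ok` are local helper names, and the actual ping loop only runs when `RUN_CHECKS=1`:

```shell
#!/usr/bin/env bash
# Sketch: extract the average RTT from ping's summary and compare it to
# the <5 ms ceiling from the requirements above.
set -euo pipefail

# Reads ping output on stdin; prints the avg RTT in ms.
# Summary line looks like: rtt min/avg/max/mdev = 0.045/0.061/0.089/0.012 ms
parse_avg_rtt() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# rtt_ok AVG_MS MAX_MS -> exit 0 if within budget (float compare via awk)
rtt_ok() {
  awk -v r="$1" -v m="$2" 'BEGIN { exit (r <= m) ? 0 : 1 }'
}

# Set RUN_CHECKS=1 to measure real nodes.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  for node in node02 node03; do
    avg=$(ping -c 5 -q "$node.example.com" | parse_avg_rtt)
    if rtt_ok "$avg" 5; then
      echo "$node avg ${avg} ms: OK"
    else
      echo "$node avg ${avg} ms: TOO HIGH"
    fi
  done
fi
```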
## 4. Pre-Deployment Checklist
Complete this checklist before beginning deployment:
### 4.1 Hardware Checklist
- [ ] All servers racked and powered
- [ ] All network cables connected (data + BMC)
- [ ] All power supplies connected (redundant if available)
- [ ] BMC/IPMI network configured
- [ ] BMC credentials documented
- [ ] BIOS/UEFI settings configured per section 2.4
- [ ] PXE boot enabled and first in boot order
- [ ] Secure Boot disabled (if using UEFI)
- [ ] Hardware inventory recorded (MAC addresses, serial numbers)
### 4.2 Network Checklist
- [ ] Network switches configured (VLANs, trunking)
- [ ] DHCP server configured and tested
- [ ] DNS forward/reverse zones created
- [ ] Firewall rules configured
- [ ] Network connectivity verified (ping tests)
- [ ] Bandwidth validated (iperf between nodes)
- [ ] DHCP relay configured (if multi-subnet)
- [ ] NTP server configured for time sync
### 4.3 PXE Server Checklist
- [ ] PXE server deployed (see T032.S2)
- [ ] DHCP service running and healthy
- [ ] TFTP service running and healthy
- [ ] HTTP service running and healthy
- [ ] iPXE bootloaders downloaded (undionly.kpxe, ipxe.efi)
- [ ] NixOS netboot images built and uploaded (see T032.S3)
- [ ] Boot script configured (boot.ipxe)
- [ ] Health endpoints responding
**Validation:**
```bash
# On PXE server
sudo systemctl status isc-dhcp-server
sudo systemctl status atftpd
sudo systemctl status nginx
# Test HTTP access
curl http://10.0.100.10:8080/boot/ipxe/boot.ipxe
curl http://10.0.100.10:8080/health
# Test TFTP access
tftp 10.0.100.10 -c get undionly.kpxe /tmp/test.kpxe
```
### 4.4 Node Configuration Checklist
- [ ] Per-node NixOS configurations created (`/srv/provisioning/nodes/`)
- [ ] Hardware configurations generated or templated
- [ ] Disko disk layouts defined
- [ ] Network settings configured (static IPs, VLANs)
- [ ] Service selections defined (control-plane vs worker)
- [ ] Cluster configuration JSON files created
- [ ] Node inventory documented (MAC, hostname, role)
### 4.5 TLS Certificates Checklist
- [ ] CA certificate generated
- [ ] Per-node certificates generated
- [ ] Certificate files copied to secrets directories
- [ ] Certificate permissions set (0400 for private keys)
- [ ] Certificate expiry dates documented
- [ ] Rotation procedure documented
**Generate Certificates:**
```bash
# Generate CA (if not already done)
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365
done
```
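To satisfy the "expiry dates documented" item above, the generated certificates can be listed with their `notAfter` dates. A small sketch using standard openssl output; the file names follow the generation loop:

```shell
#!/usr/bin/env bash
# Sketch: record the expiry date of each generated certificate so the
# checklist item has a concrete artifact to file.
set -euo pipefail

# cert_expiry FILE -> "FILE <notAfter date>"
cert_expiry() {
  printf '%s %s\n' "$1" "$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)"
}

for cert in ca-cert.pem node01-cert.pem node02-cert.pem node03-cert.pem; do
  if [ -f "$cert" ]; then
    cert_expiry "$cert"
  fi
done
```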
### 4.6 Provisioning Workstation Checklist
- [ ] NixOS or Nix package manager installed
- [ ] Nix flakes enabled
- [ ] SSH key pair generated for provisioning
- [ ] SSH public key added to netboot images
- [ ] Network access to provisioning VLAN
- [ ] Git repository cloned (if using version control)
- [ ] nixos-anywhere installed: `nix profile install github:nix-community/nixos-anywhere`
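The SSH key pair item can be completed non-interactively. A sketch; the key path and comment are local conventions, not anything nixos-anywhere requires:

```shell
#!/usr/bin/env bash
# Sketch: generate a dedicated provisioning key pair if one does not
# already exist, then print the public key to embed in netboot images.
set -euo pipefail

KEY="${PROVISION_KEY:-$HOME/.ssh/provisioning_ed25519}"

mkdir -p "$(dirname "$KEY")"
if [ ! -f "$KEY" ]; then
  # -N "" = no passphrase, so nixos-anywhere can use it unattended
  ssh-keygen -t ed25519 -N "" -C "plasmacloud-provisioning" -f "$KEY"
fi

echo "Public key to embed in the netboot image:"
cat "${KEY}.pub"
```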
## 5. Deployment Workflow
### 5.1 Phase 1: PXE Server Setup
**Reference:** See `/home/centra/cloud/chainfire/baremetal/pxe-server/` (T032.S2)
**Step 1.1: Deploy PXE Server Using NixOS Module**
Create PXE server configuration:
```nix
# /etc/nixos/pxe-server.nix
{ config, pkgs, lib, ... }:
{
  imports = [
    /path/to/chainfire/baremetal/pxe-server/nixos-module.nix
  ];

  services.centra-pxe-server = {
    enable = true;
    interface = "eth0";
    serverAddress = "10.0.100.10";

    dhcp = {
      subnet = "10.0.100.0";
      netmask = "255.255.255.0";
      broadcast = "10.0.100.255";
      range = {
        start = "10.0.100.100";
        end = "10.0.100.200";
      };
      router = "10.0.100.1";
      domainNameServers = [ "10.0.100.1" "8.8.8.8" ];
    };

    nodes = {
      "52:54:00:12:34:56" = {
        profile = "control-plane";
        hostname = "node01";
        ipAddress = "10.0.100.50";
      };
      "52:54:00:12:34:57" = {
        profile = "control-plane";
        hostname = "node02";
        ipAddress = "10.0.100.51";
      };
      "52:54:00:12:34:58" = {
        profile = "control-plane";
        hostname = "node03";
        ipAddress = "10.0.100.52";
      };
    };
  };
}
```
Apply configuration:
```bash
sudo nixos-rebuild switch -I nixos-config=/etc/nixos/pxe-server.nix
```
**Step 1.2: Verify PXE Services**
```bash
# Check all services are running
sudo systemctl status dhcpd4.service
sudo systemctl status atftpd.service
sudo systemctl status nginx.service
# Test DHCP server
sudo journalctl -u dhcpd4 -f &
# Power on a test server and watch for DHCP requests
# Test TFTP server
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
ls -lh /tmp/test.kpxe # Should show ~100KB file
# Test HTTP server
curl http://localhost:8080/health
# Expected: {"status":"healthy","services":{"dhcp":"running","tftp":"running","http":"running"}}
curl http://localhost:8080/boot/ipxe/boot.ipxe
# Expected: iPXE boot script content
```
### 5.2 Phase 2: Build Netboot Images
**Reference:** See `/home/centra/cloud/baremetal/image-builder/` (T032.S3)
**Step 2.1: Build Images for All Profiles**
```bash
cd /home/centra/cloud/baremetal/image-builder
# Build all profiles
./build-images.sh
# Or build specific profile
./build-images.sh --profile control-plane
./build-images.sh --profile worker
./build-images.sh --profile all-in-one
```
**Expected Output:**
```
Building netboot image for control-plane...
Building initrd...
[... Nix build output ...]
✓ Build complete: artifacts/control-plane/initrd (234 MB)
✓ Build complete: artifacts/control-plane/bzImage (12 MB)
```
**Step 2.2: Copy Images to PXE Server**
```bash
# Automatic (if PXE server directory exists)
./build-images.sh --deploy
# Manual copy
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
sudo cp artifacts/all-in-one/* /var/lib/pxe-boot/nixos/all-in-one/
```
**Step 2.3: Verify Image Integrity**
```bash
# Check file sizes (should be reasonable)
ls -lh /var/lib/pxe-boot/nixos/*/
# Verify images are accessible via HTTP
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/bzImage
# Expected: HTTP/1.1 200 OK, Content-Length: ~12000000
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/initrd
# Expected: HTTP/1.1 200 OK, Content-Length: ~234000000
```
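The HTTP HEAD checks confirm the files are served, but not that they are intact. A sketch that writes a SHA-256 manifest at build time and verifies it after the copy; the `bzImage`/`initrd` names follow the layout above:

```shell
#!/usr/bin/env bash
# Sketch: checksum manifest for netboot images, so the copy to the PXE
# server (or an HTTP download) can be verified end to end.
set -euo pipefail

# make_manifest DIR : record checksums next to the images at build time.
make_manifest() {
  ( cd "$1" && sha256sum bzImage initrd > SHA256SUMS )
}

# check_manifest DIR : verify the images after copying them.
check_manifest() {
  ( cd "$1" && sha256sum -c SHA256SUMS )
}
```

Run `make_manifest artifacts/control-plane` after the build, copy `SHA256SUMS` along with the images, and run `check_manifest /var/lib/pxe-boot/nixos/control-plane` on the PXE server.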
### 5.3 Phase 3: Prepare Node Configurations
**Step 3.1: Generate Node-Specific NixOS Configs**
Create directory structure:
```bash
mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets
```
**Node Configuration Template (`nodes/node01.example.com/configuration.nix`):**
```nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  # Hostname and domain
  networking = {
    hostName = "node01";
    domain = "example.com";
    usePredictableInterfaceNames = false; # Use eth0, eth1

    # Provisioning interface (temporary)
    interfaces.eth0 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.100.50";
        prefixLength = 24;
      }];
    };

    # Production interface
    interfaces.eth1 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.200.10";
        prefixLength = 24;
      }];
    };

    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" "8.8.8.8" ];
  };

  # Enable PlasmaCloud services
  services.chainfire = {
    enable = true;
    port = 2379;
    raftPort = 2380;
    gossipPort = 2381;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.flaredb = {
    enable = true;
    port = 2479;
    raftPort = 2480;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      chainfire_endpoint = "https://localhost:2379";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  services.iam = {
    enable = true;
    port = 8080;
    settings = {
      flaredb_endpoint = "https://localhost:2479";
      tls = {
        cert_path = "/etc/nixos/secrets/node01-cert.pem";
        key_path = "/etc/nixos/secrets/node01-key.pem";
        ca_path = "/etc/nixos/secrets/ca-cert.pem";
      };
    };
  };

  # Enable first-boot automation
  services.first-boot-automation = {
    enable = true;
    configFile = "/etc/nixos/secrets/cluster-config.json";
  };

  system.stateVersion = "24.11";
}
```
**Step 3.2: Create cluster-config.json for Each Node**
**Bootstrap Node (node01):**
```json
{
  "node_id": "node01",
  "node_role": "control-plane",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ],
  "flaredb_peers": [
    "node01.example.com:2480",
    "node02.example.com:2480",
    "node03.example.com:2480"
  ]
}
```
Copy to secrets:
```bash
cp cluster-config-node01.json /srv/provisioning/nodes/node01.example.com/secrets/cluster-config.json
cp cluster-config-node02.json /srv/provisioning/nodes/node02.example.com/secrets/cluster-config.json
cp cluster-config-node03.json /srv/provisioning/nodes/node03.example.com/secrets/cluster-config.json
```
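Rather than maintaining three JSON files by hand, the per-node configs can be stamped out from one template, since only `node_id`, `raft_addr`, and the `bootstrap` flag differ. A sketch (the `flaredb_peers` list from the example above could be added the same way); `gen_config` and `OUT` are local conventions:

```shell
#!/usr/bin/env bash
# Sketch: generate per-node cluster-config.json files from one template.
set -euo pipefail

OUT="${OUT:-/srv/provisioning/nodes}"

# gen_config NODE_ID RAFT_IP BOOTSTRAP(true|false)
gen_config() {
  cat <<EOF
{
  "node_id": "$1",
  "node_role": "control-plane",
  "bootstrap": $3,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "$2:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ]
}
EOF
}

# Set RUN_GEN=1 to write the files into the secrets directories.
if [[ "${RUN_GEN:-0}" == "1" ]]; then
  gen_config node01 10.0.200.10 true  > "$OUT/node01.example.com/secrets/cluster-config.json"
  gen_config node02 10.0.200.11 false > "$OUT/node02.example.com/secrets/cluster-config.json"
  gen_config node03 10.0.200.12 false > "$OUT/node03.example.com/secrets/cluster-config.json"
fi
```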
**Step 3.3: Generate Disko Disk Layouts**
**Simple Single-Disk Layout (`nodes/node01.example.com/disko.nix`):**
```nix
{ disks ? [ "/dev/sda" ], ... }:
{
  disko.devices = {
    disk = {
      main = {
        type = "disk";
        device = builtins.head disks;
        content = {
          type = "gpt";
          partitions = {
            ESP = {
              size = "1G";
              type = "EF00";
              content = {
                type = "filesystem";
                format = "vfat";
                mountpoint = "/boot";
              };
            };
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "ext4";
                mountpoint = "/";
              };
            };
          };
        };
      };
    };
  };
}
```
**Step 3.4: Pre-Generate TLS Certificates**
```bash
# Copy per-node certificates
cp ca-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/
cp node01-key.pem /srv/provisioning/nodes/node01.example.com/secrets/
# Set permissions
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/*-cert.pem
chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/ca-cert.pem
chmod 600 /srv/provisioning/nodes/node01.example.com/secrets/*-key.pem
```
### 5.4 Phase 4: Bootstrap First 3 Nodes
**Step 4.1: Power On Nodes via BMC**
```bash
# Using ipmitool (example for Dell/HP/Supermicro)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
ipmitool -I lanplus -H $ip -U admin -P password chassis bootdev pxe options=persistent
ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done
```
**Step 4.2: Verify PXE Boot Success**
Watch serial console (if available):
```bash
# Connect via IPMI SOL
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol activate
# Expected output:
# ... DHCP discovery ...
# ... TFTP download undionly.kpxe or ipxe.efi ...
# ... iPXE menu appears ...
# ... Kernel and initrd download ...
# ... NixOS installer boots ...
# ... SSH server starts ...
```
Verify installer is ready:
```bash
# Wait for nodes to appear in DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases
# Test SSH connectivity
ssh root@10.0.100.50 'uname -a'
# Expected: Linux node01 ... nixos
```
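Instead of a fixed sleep, provisioning scripts can poll until the installer's SSH port actually answers. A sketch built around a generic `retry` helper (a local convention), with the SSH loop gated behind `RUN_CHECKS=1`:

```shell
#!/usr/bin/env bash
# Sketch: poll until each installer answers on SSH instead of sleeping
# a fixed interval.
set -euo pipefail

# retry TRIES DELAY CMD... : rerun CMD until it succeeds or TRIES is spent.
retry() {
  local tries="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= tries; i++ )); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Set RUN_CHECKS=1 to wait on the real installer IPs.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  for ip in 10.0.100.50 10.0.100.51 10.0.100.52; do
    retry 60 5 ssh -o ConnectTimeout=3 -o StrictHostKeyChecking=no \
      "root@$ip" true && echo "$ip: installer ready"
  done
fi
```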
**Step 4.3: Run nixos-anywhere Simultaneously on All 3**
Create provisioning script:
```bash
#!/bin/bash
# /srv/provisioning/scripts/provision-bootstrap-nodes.sh
set -euo pipefail

NODES=("node01" "node02" "node03")
PROVISION_IPS=("10.0.100.50" "10.0.100.51" "10.0.100.52")
FLAKE_ROOT="/srv/provisioning"

for i in "${!NODES[@]}"; do
  node="${NODES[$i]}"
  ip="${PROVISION_IPS[$i]}"
  echo "Provisioning $node at $ip..."
  nix run github:nix-community/nixos-anywhere -- \
    --flake "$FLAKE_ROOT#$node" \
    --build-on-remote \
    "root@$ip" &
done

wait
echo "All nodes provisioned successfully!"
```
Run provisioning:
```bash
chmod +x /srv/provisioning/scripts/provision-bootstrap-nodes.sh
./provision-bootstrap-nodes.sh
```
**Expected output per node:**
```
Provisioning node01 at 10.0.100.50...
Connecting via SSH...
Running disko to partition disks...
Building NixOS system...
Installing bootloader...
Copying secrets...
Installation complete. Rebooting...
```
**Step 4.4: Wait for First-Boot Automation**
After reboot, nodes will boot from disk and run first-boot automation. Monitor progress:
```bash
# Watch logs on node01 (via SSH after it reboots)
ssh root@10.0.200.10 # Note: now on production network
# Check cluster join services
journalctl -u chainfire-cluster-join.service -f
journalctl -u flaredb-cluster-join.service -f
# Expected log output:
# {"level":"INFO","message":"Waiting for local chainfire service..."}
# {"level":"INFO","message":"Local chainfire healthy"}
# {"level":"INFO","message":"Bootstrap node, cluster initialized"}
# {"level":"INFO","message":"Cluster join complete"}
```
**Step 4.5: Verify Cluster Health**
```bash
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq
# Expected output:
# {
# "members": [
# {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
# {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
# {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
# ]
# }
# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq
# Check IAM service
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected"}
```
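When scripting the bootstrap end to end, it is useful to block until all three members report healthy. A jq-free sketch that counts `"status":"healthy"` entries in the member-list payload shown above; `healthy_count` is a local helper name and the polling loop only runs when `RUN_CHECKS=1`:

```shell
#!/usr/bin/env bash
# Sketch: wait for every Chainfire member to report healthy before
# declaring bootstrap done.
set -euo pipefail

# Count occurrences of "status":"healthy" in a members payload on stdin.
healthy_count() {
  grep -o '"status":"healthy"' | wc -l
}

# Set RUN_CHECKS=1 to poll the real cluster endpoint.
if [[ "${RUN_CHECKS:-0}" == "1" ]]; then
  until [ "$(curl -sk https://node01.example.com:2379/admin/cluster/members | healthy_count)" -ge 3 ]; do
    echo "waiting for 3 healthy Chainfire members..."
    sleep 5
  done
  echo "Chainfire cluster healthy"
fi
```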
### 5.5 Phase 5: Add Additional Nodes
**Step 5.1: Prepare Join-Mode Configurations**
Create configuration for node04 (worker profile):
```json
{
  "node_id": "node04",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.20:2380"
}
```
**Step 5.2: Power On and Provision Nodes**
```bash
# Power on node via BMC
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on
# Wait for PXE boot and SSH ready
sleep 60
# Provision node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  --build-on-remote \
  root@10.0.100.60
```
**Step 5.3: Verify Cluster Join via API**
```bash
# Check cluster members (should include node04)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node04")'
# Expected:
# {"id":"node04","raft_addr":"10.0.200.20:2380","status":"healthy","role":"follower"}
```
**Step 5.4: Validate Replication and Service Distribution**
```bash
# Write test data on leader
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'
# Read from follower (should be replicated)
curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
```
## 6. Verification & Validation
### 6.1 Health Check Commands for All Services
**Chainfire:**
```bash
curl -k https://node01.example.com:2379/health | jq
# Expected: {"status":"healthy","raft":"leader","cluster_size":3}
# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 3 (for initial bootstrap)
```
**FlareDB:**
```bash
curl -k https://node01.example.com:2479/health | jq
# Expected: {"status":"healthy","raft":"leader","chainfire":"connected"}
# Query test metric
curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"up{job=\"node\"}","time":"now"}'
```
**IAM:**
```bash
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected","version":"1.0.0"}
# List users (requires authentication)
curl -k https://node01.example.com:8080/api/users \
  -H "Authorization: Bearer $IAM_TOKEN" | jq
```
**PlasmaVMC:**
```bash
curl -k https://node01.example.com:9090/health | jq
# Expected: {"status":"healthy","vms_running":0}
# List VMs
curl -k https://node01.example.com:9090/api/vms | jq
```
**PrismNET:**
```bash
curl -k https://node01.example.com:9091/health | jq
# Expected: {"status":"healthy","networks":0}
```
**FlashDNS:**
```bash
dig @node01.example.com example.com
# Expected: DNS response with ANSWER section
# Health check
curl -k https://node01.example.com:853/health | jq
```
**FiberLB:**
```bash
curl -k https://node01.example.com:9092/health | jq
# Expected: {"status":"healthy","backends":0}
```
**K8sHost:**
```bash
kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes
# Expected: Node list including this node
```
### 6.2 Cluster Membership Verification
```bash
#!/bin/bash
# /srv/provisioning/scripts/verify-cluster.sh
echo "Checking Chainfire cluster..."
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | {id, status, role}'
echo ""
echo "Checking FlareDB cluster..."
curl -k https://node01.example.com:2479/admin/cluster/members | jq '.members[] | {id, status, role}'
echo ""
echo "Cluster health summary:"
echo " Chainfire nodes: $(curl -sk https://node01.example.com:2379/admin/cluster/members | jq '.members | length')"
echo " FlareDB nodes: $(curl -sk https://node01.example.com:2479/admin/cluster/members | jq '.members | length')"
echo " Raft leaders: Chainfire=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id'), FlareDB=$(curl -sk https://node01.example.com:2479/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')"
```
### 6.3 Raft Leader Election Check
```bash
# Identify current leader
LEADER=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')
echo "Current Chainfire leader: $LEADER"
# Verify all followers can reach leader
for node in node01 node02 node03; do
echo "Checking $node..."
curl -sk https://$node.example.com:2379/admin/cluster/leader | jq
done
```
### 6.4 TLS Certificate Validation
```bash
# Check certificate expiry
for node in node01 node02 node03; do
echo "Checking $node certificate..."
echo | openssl s_client -connect $node.example.com:2379 2>/dev/null | openssl x509 -noout -dates
done
# Verify certificate chain
echo | openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem -verify 1
# Expected: Verify return code: 0 (ok)
```
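The `-dates` output can be turned into a days-remaining figure, which is handier for alerting. A self-contained sketch (it generates a throwaway certificate so nothing here touches real node certs; substitute the real PEM path, and note it assumes GNU `date -d`):

```shell
# Generate a throwaway cert so the snippet is safe to run anywhere
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo-key.pem \
  -out /tmp/demo-cert.pem -days 90 -subj "/CN=demo" 2>/dev/null
# Extract notAfter and convert to whole days remaining
end=$(openssl x509 -in /tmp/demo-cert.pem -noout -enddate | cut -d= -f2)
days=$(( ( $(date -d "$end" +%s) - $(date +%s) ) / 86400 ))
echo "expires in $days days"
```

Wiring this into monitoring (e.g. alert when `days < 30`) catches expiring certificates before section 8.6 becomes relevant.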
### 6.5 Network Connectivity Tests
```bash
# Test inter-node connectivity (from node01)
ssh root@node01.example.com '
for node in node02 node03; do
echo "Testing connectivity to $node..."
nc -zv $node.example.com 2379
nc -zv $node.example.com 2380
done
'
# Test bandwidth (iperf3)
ssh root@node02.example.com 'iperf3 -s' &
ssh root@node01.example.com 'iperf3 -c node02.example.com -t 10'
# Expected: ~10 Gbps on 10GbE, ~1 Gbps on 1GbE
```
### 6.6 Performance Smoke Tests
**Chainfire Write Performance:**
```bash
# 1000 writes
time for i in {1..1000}; do
curl -sk -X PUT https://node01.example.com:2379/v1/kv/test$i \
-H "Content-Type: application/json" \
-d "{\"value\":\"test data $i\"}" > /dev/null
done
# Expected: <10 seconds on healthy cluster
```
**FlareDB Query Performance:**
```bash
# Insert test metrics
curl -k -X POST https://node01.example.com:2479/v1/write \
-H "Content-Type: application/json" \
-d '{"metric":"test_metric","value":42,"timestamp":"'$(date -Iseconds)'"}'
# Query performance
time curl -k https://node01.example.com:2479/v1/query \
-H "Content-Type: application/json" \
-d '{"query":"test_metric","start":"1h","end":"now"}'
# Expected: <100ms response time
```
## 7. Common Operations
### 7.1 Adding a New Node
**Step 1: Prepare Node Configuration**
```bash
# Create node directory
mkdir -p /srv/provisioning/nodes/node05.example.com/secrets
# Copy template configuration
cp /srv/provisioning/nodes/node01.example.com/configuration.nix \
/srv/provisioning/nodes/node05.example.com/
# Edit for new node
vim /srv/provisioning/nodes/node05.example.com/configuration.nix
# Update: hostName, ipAddresses, node_id
```
**Step 2: Generate Cluster Config (Join Mode)**
```json
{
"node_id": "node05",
"node_role": "worker",
"bootstrap": false,
"cluster_name": "prod-cluster",
"leader_url": "https://node01.example.com:2379",
"raft_addr": "10.0.200.21:2380"
}
```
**Step 3: Provision Node**
```bash
# Power on and PXE boot
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis power on
# Wait for SSH
sleep 60
# Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node05 \
root@10.0.100.65
```
**Step 4: Verify Join**
```bash
# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node05")'
```
### 7.2 Replacing a Failed Node
**Step 1: Remove Failed Node from Cluster**
```bash
# Remove from Chainfire cluster
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
# Remove from FlareDB cluster
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02
```
**Step 2: Physically Replace Hardware**
- Power off old node
- Remove from rack
- Install new node
- Connect all cables
- Configure BMC
**Step 3: Provision Replacement Node**
```bash
# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node02 \
root@10.0.100.51
```
**Step 4: Verify Rejoin**
```bash
# Cluster should automatically add node during first-boot
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```
### 7.3 Updating Node Configuration
**Step 1: Edit Configuration**
```bash
vim /srv/provisioning/nodes/node01.example.com/configuration.nix
# Make changes (e.g., add service, change network config)
```
**Step 2: Build and Deploy**
```bash
# Build the system closure locally
nix build /srv/provisioning#nixosConfigurations.node01.config.system.build.toplevel
# Deploy from the provisioning workstation
nixos-rebuild switch --flake /srv/provisioning#node01 --target-host root@node01.example.com
```
**Step 3: Verify Changes**
```bash
# Check active configuration
ssh root@node01.example.com 'nixos-rebuild list-generations'
# Test services still healthy
curl -k https://node01.example.com:2379/health | jq
```
### 7.4 Rolling Updates
**Update Process (One Node at a Time):**
```bash
#!/bin/bash
# /srv/provisioning/scripts/rolling-update.sh
NODES=("node01" "node02" "node03")
for node in "${NODES[@]}"; do
echo "Updating $node..."
# Build the new system closure
nix build /srv/provisioning#nixosConfigurations.$node.config.system.build.toplevel
# Deploy (test mode first, pushed from the provisioning workstation)
nixos-rebuild test --flake /srv/provisioning#$node --target-host root@$node.example.com
# Verify health
if ! curl -sk https://$node.example.com:2379/health | jq -e '.status == "healthy"' > /dev/null; then
echo "ERROR: $node unhealthy after test, aborting"
ssh root@$node.example.com "nixos-rebuild switch --rollback"
exit 1
fi
# Apply permanently
nixos-rebuild switch --flake /srv/provisioning#$node --target-host root@$node.example.com
# Wait for reboot if kernel changed
echo "Waiting 30s for stabilization..."
sleep 30
# Final health check
curl -k https://$node.example.com:2379/health | jq
echo "$node updated successfully"
done
```
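The fixed `sleep 30` in the script above can either waste time or miss a slow restart. A generic retry helper (the `wait_for` name is illustrative, not part of any PlasmaCloud tooling) polls until the check succeeds or a budget is exhausted:

```shell
# Retry a command until it succeeds, up to $1 attempts, one second apart
wait_for() {
  local tries=$1; shift
  local i
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# In the rolling update, replace the fixed sleep with e.g.:
#   wait_for 60 curl -skf https://$node.example.com:2379/health
wait_for 3 true && echo "healthy"
```

The helper returns the command's own exit status semantics, so it composes with the existing `if !` health gate.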
### 7.5 Draining a Node for Maintenance
**Step 1: Mark Node for Drain**
```bash
# Disable node in load balancer (if using one)
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
-d '{"status":"drain"}'
```
**Step 2: Migrate VMs (PlasmaVMC)**
```bash
# List VMs on node
ssh root@node02.example.com 'systemctl list-units | grep plasmavmc-vm@'
# Migrate each VM
curl -k -X POST https://node01.example.com:9090/api/vms/vm-001/migrate \
-d '{"target_node":"node03"}'
```
**Step 3: Stop Services**
```bash
ssh root@node02.example.com '
systemctl stop plasmavmc.service
systemctl stop chainfire.service
systemctl stop flaredb.service
'
```
**Step 4: Perform Maintenance**
```bash
# Reboot for kernel update, hardware maintenance, etc.
ssh root@node02.example.com 'reboot'
```
**Step 5: Re-enable Node**
```bash
# Verify all services healthy
ssh root@node02.example.com 'systemctl status chainfire flaredb plasmavmc'
# Re-enable in load balancer
curl -k -X POST https://node01.example.com:9092/api/backend/node02 \
-d '{"status":"active"}'
```
### 7.6 Decommissioning a Node
**Step 1: Drain Node (see 7.5)**
**Step 2: Remove from Cluster**
```bash
# Remove from Chainfire
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
# Remove from FlareDB
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02
# Verify removal
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```
**Step 3: Power Off**
```bash
# Via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin -P password chassis power off
# Or via SSH
ssh root@node02.example.com 'poweroff'
```
**Step 4: Update Inventory**
```bash
# Remove from node inventory
vim /srv/provisioning/inventory.json
# Remove node02 entry
# Remove from DNS
# Update DNS zone to remove node02.example.com
# Remove from monitoring
# Update Prometheus targets to remove node02
```
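If `inventory.json` is machine-readable, the node entry can be pruned with `jq` instead of hand-editing. The `{"nodes":[{"id":…}]}` schema below is an assumption for illustration, not the actual inventory format:

```shell
# Sample inventory with an assumed schema; adjust the filter to the real layout
cat > /tmp/inventory.json <<'EOF'
{"nodes":[{"id":"node01"},{"id":"node02"},{"id":"node03"}]}
EOF
# Delete the node02 entry; write the result back once it looks correct
jq 'del(.nodes[] | select(.id == "node02"))' /tmp/inventory.json
```

Redirect the output to a temp file and `mv` it over the original only after reviewing it.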
## 8. Troubleshooting
### 8.1 PXE Boot Failures
**Symptom:** Server does not obtain IP address or does not boot from network
**Diagnosis:**
```bash
# Monitor DHCP server logs
sudo journalctl -u dhcpd4 -f
# Monitor TFTP requests
sudo tcpdump -i eth0 -n port 69
# Check PXE server services
sudo systemctl status dhcpd4 atftpd nginx
```
**Common Causes:**
1. **DHCP server not running:** `sudo systemctl start dhcpd4`
2. **Wrong network interface:** Check `interfaces` in dhcpd.conf
3. **Firewall blocking DHCP/TFTP:** `sudo iptables -L -n | grep -E "67|68|69"`
4. **PXE not enabled in BIOS:** Enter BIOS and enable Network Boot
5. **Network cable disconnected:** Check physical connection
**Solution:**
```bash
# Restart all PXE services
sudo systemctl restart dhcpd4 atftpd nginx
# Verify DHCP configuration
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf
# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
# Power cycle server
ipmitool -I lanplus -H <bmc-ip> -U admin chassis power cycle
```
### 8.2 Installation Failures (nixos-anywhere)
**Symptom:** nixos-anywhere fails during disk partitioning, installation, or bootloader setup
**Diagnosis:**
```bash
# Check nixos-anywhere output for errors
# Common errors: disk not found, partition table errors, out of space
# SSH to installer for manual inspection
ssh root@10.0.100.50
# Check disk status
lsblk
dmesg | grep -i error
```
**Common Causes:**
1. **Disk device wrong:** Update disko.nix with correct device (e.g., /dev/nvme0n1)
2. **Disk not wiped:** Previous partition table conflicts
3. **Out of disk space:** Insufficient storage for Nix closures
4. **Network issues:** Cannot download packages from binary cache
**Solution:**
```bash
# Manual disk wipe (on installer)
ssh root@10.0.100.50 '
wipefs -a /dev/sda
sgdisk --zap-all /dev/sda
'
# Retry nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node01 \
--debug \
root@10.0.100.50
```
### 8.3 Cluster Join Failures
**Symptom:** Node boots successfully but does not join cluster
**Diagnosis:**
```bash
# Check first-boot logs on the joining node (node04 in this example)
ssh root@node04.example.com 'journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service'
# Common errors:
# - "Health check timeout after 120s"
# - "Join request failed: connection refused"
# - "Configuration file not found"
```
**Bootstrap Mode vs Join Mode:**
- **Bootstrap:** Node expects to create new cluster with peers
- **Join:** Node expects to connect to existing leader
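For contrast with the join-mode example in section 7.1, a bootstrap-mode `cluster-config.json` looks roughly like this (field values, including `node_role`, are illustrative; `initial_peers` follows the naming used in section 9.3):

```json
{
  "node_id": "node01",
  "node_role": "server",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": ["10.0.200.10:2380", "10.0.200.11:2380", "10.0.200.12:2380"]
}
```

A node with `bootstrap: true` but an unreachable peer set will wait rather than join an existing cluster, so mismatched flags are the first thing to rule out.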
**Common Causes:**
1. **Wrong bootstrap flag:** Check cluster-config.json
2. **Leader unreachable:** Network/firewall issue
3. **TLS certificate errors:** Verify cert paths and validity
4. **Service not starting:** Check main service (chainfire.service)
**Solution:**
```bash
# Verify cluster-config.json
ssh root@node04.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'
# Test leader connectivity
ssh root@node04.example.com 'curl -k https://node01.example.com:2379/health'
# Check TLS certificates
ssh root@node04.example.com 'ls -l /etc/nixos/secrets/*.pem'
# Manual cluster join (if automation fails)
curl -k -X POST https://node01.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node04","raft_addr":"10.0.200.20:2380"}'
```
### 8.4 Service Start Failures
**Symptom:** Service fails to start after boot
**Diagnosis:**
```bash
# Check service status
ssh root@node01.example.com 'systemctl status chainfire.service'
# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'
# Common errors:
# - "bind: address already in use" (port conflict)
# - "certificate verify failed" (TLS issue)
# - "permission denied" (file permissions)
```
**Common Causes:**
1. **Port already in use:** Another service using same port
2. **Missing dependencies:** Required service not running
3. **Configuration error:** Invalid config file
4. **File permissions:** Cannot read secrets
**Solution:**
```bash
# Check port usage
ssh root@node01.example.com 'ss -tlnp | grep 2379'
# Verify dependencies
ssh root@node01.example.com 'systemctl list-dependencies chainfire.service'
# Test configuration manually
ssh root@node01.example.com 'chainfire-server --config /etc/nixos/chainfire.toml --check-config'
# Fix permissions
ssh root@node01.example.com 'chmod 600 /etc/nixos/secrets/*-key.pem'
```
### 8.5 Network Connectivity Issues
**Symptom:** Nodes cannot communicate with each other or external services
**Diagnosis:**
```bash
# Test basic connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'
# Test specific ports
ssh root@node01.example.com 'nc -zv node02.example.com 2379'
# Check firewall rules
ssh root@node01.example.com 'iptables -L -n | grep 2379'
# Check routing
ssh root@node01.example.com 'ip route show'
```
**Common Causes:**
1. **Firewall blocking traffic:** Missing iptables rules
2. **Wrong IP address:** Configuration mismatch
3. **Network interface down:** Interface not configured
4. **DNS resolution failure:** Cannot resolve hostnames
**Solution:**
```bash
# Add firewall rules
ssh root@node01.example.com '
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables-save > /etc/iptables/rules.v4
'
# Fix DNS resolution
ssh root@node01.example.com '
echo "10.0.200.11 node02.example.com node02" >> /etc/hosts
'
# Restart networking
ssh root@node01.example.com 'systemctl restart systemd-networkd'
```
### 8.6 TLS Certificate Errors
**Symptom:** Services cannot establish TLS connections
**Diagnosis:**
```bash
# Test TLS connection
openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem
# Check certificate validity
ssh root@node01.example.com '
openssl x509 -in /etc/nixos/secrets/node01-cert.pem -noout -dates
'
# Common errors:
# - "certificate verify failed" (wrong CA)
# - "certificate has expired" (cert expired)
# - "certificate subject name mismatch" (wrong CN)
```
**Common Causes:**
1. **Expired certificate:** Regenerate certificate
2. **Wrong CA certificate:** Verify CA cert is correct
3. **Hostname mismatch:** CN does not match hostname
4. **File permissions:** Cannot read certificate files
**Solution:**
```bash
# Regenerate certificate
openssl req -new -key /srv/provisioning/secrets/node01-key.pem \
-out /srv/provisioning/secrets/node01-csr.pem \
-subj "/CN=node01.example.com"
openssl x509 -req -in /srv/provisioning/secrets/node01-csr.pem \
-CA /srv/provisioning/ca-cert.pem \
-CAkey /srv/provisioning/ca-key.pem \
-CAcreateserial \
-out /srv/provisioning/secrets/node01-cert.pem \
-days 365
# Copy to node
scp /srv/provisioning/secrets/node01-cert.pem root@node01.example.com:/etc/nixos/secrets/
# Restart service
ssh root@node01.example.com 'systemctl restart chainfire.service'
```
### 8.7 Performance Degradation
**Symptom:** Services are slow or unresponsive
**Diagnosis:**
```bash
# Check system load
ssh root@node01.example.com 'uptime'
ssh root@node01.example.com 'top -bn1 | head -20'
# Check disk I/O
ssh root@node01.example.com 'iostat -x 1 5'
# Check network bandwidth
ssh root@node01.example.com 'iftop -t -s 10 -i eth1'
# Check Raft logs for slow operations
ssh root@node01.example.com 'journalctl -u chainfire.service | grep "slow operation"'
```
**Common Causes:**
1. **High CPU usage:** Too many requests, inefficient queries
2. **Disk I/O bottleneck:** Slow disk, too many writes
3. **Network saturation:** Bandwidth exhausted
4. **Memory pressure:** OOM killer active
5. **Raft slow commits:** Network latency between nodes
**Solution:**
```bash
# Add more resources (vertical scaling)
# Or add more nodes (horizontal scaling)
# Check for resource leaks
ssh root@node01.example.com 'systemctl status chainfire | grep Memory'
# Restart service to clear memory leaks (temporary)
ssh root@node01.example.com 'systemctl restart chainfire.service'
# Optimize disk I/O (enable write caching if safe)
ssh root@node01.example.com 'hdparm -W1 /dev/sda'
```
## 9. Rollback & Recovery
### 9.1 NixOS Generation Rollback
NixOS provides atomic rollback capability via generations:
**List Available Generations:**
```bash
ssh root@node01.example.com 'nixos-rebuild list-generations'
# Example output:
# 1 2025-12-10 10:30:00
# 2 2025-12-10 12:45:00 (current)
```
**Rollback to Previous Generation:**
```bash
# Rollback and reboot
ssh root@node01.example.com 'nixos-rebuild switch --rollback'
# Or stage the previous generation as the default boot entry, then reboot into it
ssh root@node01.example.com 'nixos-rebuild boot --rollback && reboot'
```
**Rollback to Specific Generation:**
```bash
ssh root@node01.example.com 'nix-env --switch-generation 1 -p /nix/var/nix/profiles/system'
ssh root@node01.example.com 'reboot'
```
### 9.2 Re-Provisioning from PXE
Complete re-provisioning wipes all data and reinstalls from scratch:
**Step 1: Remove Node from Cluster**
```bash
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
curl -k -X DELETE https://node01.example.com:2479/admin/member/node02
```
**Step 2: Set Boot to PXE**
```bash
ipmitool -I lanplus -H 10.0.10.51 -U admin chassis bootdev pxe
```
**Step 3: Reboot Node**
```bash
ssh root@node02.example.com 'reboot'
# Or via BMC
ipmitool -I lanplus -H 10.0.10.51 -U admin chassis power cycle
```
**Step 4: Run nixos-anywhere**
```bash
# Wait for PXE boot and SSH ready
sleep 90
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node02 \
root@10.0.100.51
```
### 9.3 Disaster Recovery Procedures
**Complete Cluster Loss (All Nodes Down):**
**Step 1: Restore from Backup (if available)**
```bash
# Restore Chainfire data
ssh root@node01.example.com '
systemctl stop chainfire.service
rm -rf /var/lib/chainfire/*
tar -xzf /backup/chainfire-$(date +%Y%m%d).tar.gz -C /  # archive paths are relative to /
systemctl start chainfire.service
'
```
**Step 2: Bootstrap New Cluster**
If no backup, re-provision all nodes as bootstrap:
```bash
# Update cluster-config.json for all nodes
# Set bootstrap=true, same initial_peers
# Provision all 3 nodes
for node in node01 node02 node03; do
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#$node \
root@<node-ip> &
done
wait
```
**Single Node Failure:**
**Step 1: Verify Cluster Quorum**
```bash
# Check remaining nodes have quorum
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 2 (if 3-node cluster with 1 failure)
```
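The quorum rule behind this check: a Raft cluster of N voting members stays writable only while floor(N/2)+1 of them are reachable. A quick arithmetic sketch:

```shell
# Raft quorum: floor(N/2) + 1 voting members must be reachable
for total in 3 5 7; do
  quorum=$(( total / 2 + 1 ))
  echo "cluster=$total quorum=$quorum tolerates=$(( total - quorum )) failures"
done
```

This is why a 3-node cluster survives exactly one failure, and why losing two nodes requires the disaster-recovery path above rather than a simple member removal.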
**Step 2: Remove Failed Node**
```bash
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
```
**Step 3: Provision Replacement**
```bash
# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node02 \
root@10.0.100.51
```
### 9.4 Backup and Restore
**Automated Backup Script:**
```bash
#!/bin/bash
# /srv/provisioning/scripts/backup-cluster.sh
BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Backup Chainfire data
for node in node01 node02 node03; do
ssh root@$node.example.com \
"tar -czf - /var/lib/chainfire" > "$BACKUP_DIR/chainfire-$node.tar.gz"
done
# Backup FlareDB data
for node in node01 node02 node03; do
ssh root@$node.example.com \
"tar -czf - /var/lib/flaredb" > "$BACKUP_DIR/flaredb-$node.tar.gz"
done
# Backup configurations
cp -r /srv/provisioning/nodes "$BACKUP_DIR/configs"
echo "Backup complete: $BACKUP_DIR"
```
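Timestamped backup directories accumulate indefinitely, so a retention sweep is worth appending to the script. The 7-copy limit below is an example policy, not a project default, and the sketch runs against a throwaway directory so it is safe to try:

```shell
# Create ten fake backup directories to demonstrate against
mkdir -p /tmp/backup-demo
for i in 01 02 03 04 05 06 07 08 09 10; do
  mkdir -p "/tmp/backup-demo/cluster-202512${i}-000000"
done
# Keep the 7 newest cluster-* directories, delete the rest
ls -1dt /tmp/backup-demo/cluster-* | tail -n +8 | xargs -r rm -rf
ls -1 /tmp/backup-demo | wc -l
```

For the real script, point the `ls -1dt … | tail -n +8` pipeline at `/backup/cluster-*` after verifying the glob matches only backup directories.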
**Restore Script:**
```bash
#!/bin/bash
# /srv/provisioning/scripts/restore-cluster.sh
BACKUP_DIR="$1"
if [ -z "$BACKUP_DIR" ]; then
echo "Usage: $0 <backup-dir>"
exit 1
fi
# Stop services on all nodes
for node in node01 node02 node03; do
ssh root@$node.example.com 'systemctl stop chainfire flaredb'
done
# Restore Chainfire data
for node in node01 node02 node03; do
cat "$BACKUP_DIR/chainfire-$node.tar.gz" | \
ssh root@$node.example.com "cd / && tar -xzf -"
done
# Restore FlareDB data
for node in node01 node02 node03; do
cat "$BACKUP_DIR/flaredb-$node.tar.gz" | \
ssh root@$node.example.com "cd / && tar -xzf -"
done
# Restart services
for node in node01 node02 node03; do
ssh root@$node.example.com 'systemctl start chainfire flaredb'
done
echo "Restore complete"
```
## 10. Security Best Practices
### 10.1 SSH Key Management
**Generate Dedicated Provisioning Key:**
```bash
ssh-keygen -t ed25519 -C "provisioning@example.com" -f ~/.ssh/id_ed25519_provisioning
```
**Add to Netboot Image:**
```nix
# In netboot-base.nix
users.users.root.openssh.authorizedKeys.keys = [
"ssh-ed25519 AAAAC3Nza... provisioning@example.com"
];
```
**Rotate Keys Regularly:**
```bash
# Generate new key
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_provisioning_new
# Add to all nodes
for node in node01 node02 node03; do
ssh-copy-id -i ~/.ssh/id_ed25519_provisioning_new.pub root@$node.example.com
done
# Remove old key from authorized_keys
# Update netboot image with new key
```
### 10.2 TLS Certificate Rotation
**Automated Rotation Script:**
```bash
#!/bin/bash
# /srv/provisioning/scripts/rotate-certs.sh
# Generate new certificates
for node in node01 node02 node03; do
openssl genrsa -out ${node}-key-new.pem 4096
openssl req -new -key ${node}-key-new.pem -out ${node}-csr.pem \
-subj "/CN=${node}.example.com"
openssl x509 -req -in ${node}-csr.pem \
-CA ca-cert.pem -CAkey ca-key.pem \
-CAcreateserial -out ${node}-cert-new.pem -days 365
done
# Deploy new certificates (without restarting services yet)
for node in node01 node02 node03; do
scp ${node}-cert-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-cert-new.pem
scp ${node}-key-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-key-new.pem
done
# Update configuration to use new certs
# ... (NixOS configuration update) ...
# Rolling restart to apply new certificates
for node in node01 node02 node03; do
ssh root@${node}.example.com 'systemctl restart chainfire flaredb iam'
sleep 30 # Wait for stabilization
done
echo "Certificate rotation complete"
```
### 10.3 Secrets Management
**Best Practices:**
- Store secrets outside Nix store (use `/etc/nixos/secrets/`)
- Set restrictive permissions (0600 for private keys, 0400 for passwords)
- Use environment variables for runtime secrets
- Never commit secrets to Git
- Use encrypted secrets (sops-nix or agenix)
**Example with sops-nix:**
```nix
# In configuration.nix
{
imports = [ <sops-nix/modules/sops> ];
sops.defaultSopsFile = ./secrets.yaml;
sops.secrets."node01/tls-key" = {
owner = "chainfire";
mode = "0400";
};
services.chainfire.settings.tls.key_path = config.sops.secrets."node01/tls-key".path;
}
```
### 10.4 Network Isolation
**VLAN Segmentation:**
- Management VLAN (10): BMC/IPMI, provisioning workstation
- Provisioning VLAN (100): PXE boot, temporary
- Production VLAN (200): Cluster services, inter-node communication
- Client VLAN (300): External clients accessing services
**Firewall Zones:**
```bash
# Example nftables rules
table inet filter {
chain input {
type filter hook input priority 0; policy drop;
# Management from trusted subnet only
iifname "eth0" ip saddr 10.0.10.0/24 tcp dport 22 accept
# Cluster traffic from cluster subnet only
iifname "eth1" ip saddr 10.0.200.0/24 tcp dport { 2379, 2380, 2479, 2480 } accept
# Client traffic from client subnet only
    iifname "eth2" ip saddr 10.0.30.0/24 tcp dport { 8080, 9090 } accept
}
}
```
### 10.5 Audit Logging
**Enable Structured Logging:**
```nix
# In configuration.nix
services.chainfire.settings.logging = {
level = "info";
format = "json";
output = "journal";
};
# Enable journald forwarding to SIEM
services.journald.extraConfig = ''
ForwardToSyslog=yes
Storage=persistent
MaxRetentionSec=7days
'';
```
**Audit Key Events:**
- Cluster membership changes
- Node joins/leaves
- Authentication failures
- Configuration changes
- TLS certificate errors
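With JSON-format logging enabled as above, membership changes can be pulled out with a `jq` filter. The `event` and `node` field names below are assumptions about the log schema, demonstrated against a synthetic sample rather than real journal output:

```shell
# Synthetic audit log; real entries would come from: journalctl -u chainfire.service -o cat
cat > /tmp/audit-sample.log <<'EOF'
{"ts":"2025-12-10T10:00:00Z","event":"member_added","node":"node04"}
{"ts":"2025-12-10T10:05:00Z","event":"kv_write","key":"foo"}
{"ts":"2025-12-10T10:10:00Z","event":"member_removed","node":"node02"}
EOF
# Keep only membership-change events, one line per event
jq -r 'select(.event | startswith("member_")) | "\(.ts) \(.event) \(.node)"' /tmp/audit-sample.log
```

The same filter can feed a SIEM rule or a daily cron report once adjusted to the actual field names.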
**Log Aggregation:**
```bash
# Forward logs to central logging server
# Example: rsyslog configuration
cat > /etc/rsyslog.d/50-remote.conf <<EOF
*.* @@logging-server.example.com:514
EOF
systemctl restart rsyslog
```
---
## Appendix A: Service Port Reference
See [NETWORK.md](NETWORK.md) for complete port matrix.
## Appendix B: Hardware Vendor Commands
See [HARDWARE.md](HARDWARE.md) for vendor-specific BIOS configurations and IPMI commands.
## Appendix C: Complete Command Reference
See [COMMANDS.md](COMMANDS.md) for all commands organized by task.
## Appendix D: Quick Reference Cards
See [QUICKSTART.md](QUICKSTART.md) for condensed deployment guide.
## Appendix E: Deployment Flow Diagrams
See [diagrams/deployment-flow.md](diagrams/deployment-flow.md) for visual workflow.
## Appendix F: Related Documentation
- **Design Document:** `/home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md`
- **PXE Server:** `/home/centra/cloud/chainfire/baremetal/pxe-server/README.md`
- **Image Builder:** `/home/centra/cloud/baremetal/image-builder/README.md`
- **First-Boot Automation:** `/home/centra/cloud/baremetal/first-boot/README.md`
---
**End of Operator Runbook**