photoncloud-monorepo/docs/por/T032-baremetal-provisioning/design.md

T032 Bare-Metal Provisioning Design Document

Status: Draft Author: peerB Created: 2025-12-10 Last Updated: 2025-12-10

1. Architecture Overview

This document outlines the design for automated bare-metal provisioning of the PlasmaCloud platform, which consists of 8 core services (Chainfire, FlareDB, IAM, PlasmaVMC, NovaNET, FlashDNS, FiberLB, and K8sHost). The provisioning system leverages NixOS's declarative configuration capabilities to enable fully automated deployment from bare hardware to a running, clustered platform.

The high-level flow follows this sequence: PXE Boot → kexec NixOS Installer → disko Disk Partitioning → nixos-anywhere Installation → First-Boot Configuration → Running Cluster. A bare-metal server performs a network boot via PXE/iPXE, which loads a minimal NixOS installer into RAM using kexec. The installer then connects to a provisioning server, which uses nixos-anywhere to declaratively partition disks (via disko), install NixOS with pre-configured services, and inject node-specific configuration (SSH keys, network settings, cluster join parameters, TLS certificates). On first boot, the system automatically joins existing Raft clusters (Chainfire/FlareDB) or bootstraps new ones, and all 8 services start with proper dependencies and TLS enabled.

The key components are:

  • PXE/iPXE Boot Server: Serves boot binaries and configuration scripts via TFTP/HTTP
  • nixos-anywhere: SSH-based remote installation tool that orchestrates the entire deployment
  • disko: Declarative disk partitioning engine integrated with nixos-anywhere
  • kexec: Linux kernel feature enabling fast boot into NixOS installer without full reboot
  • NixOS Flake (from T024): Provides all service packages and NixOS modules
  • Configuration Injection System: Manages node-specific secrets, network config, and cluster metadata
  • First-Boot Automation: Systemd units that perform cluster join and service initialization

2. PXE Boot Flow

2.1 Boot Sequence

┌─────────────┐
│ Bare Metal  │
│   Server    │
└──────┬──────┘
       │ 1. UEFI/BIOS PXE ROM
       ▼
┌──────────────┐
│ DHCP Server  │  Option 93: Client Architecture (0=BIOS, 7=UEFI x64)
│              │  Option 67: Boot filename (undionly.kpxe or ipxe.efi)
│              │  Option 66: TFTP server address
└──────┬───────┘
       │ 2. DHCP OFFER with boot parameters
       ▼
┌──────────────┐
│ TFTP/HTTP    │
│   Server     │  Serves: undionly.kpxe (BIOS) or ipxe.efi (UEFI)
└──────┬───────┘
       │ 3. Download iPXE bootloader
       ▼
┌──────────────┐
│ iPXE Running │  User-class="iPXE" in DHCP request
│  (in RAM)    │
└──────┬───────┘
       │ 4. Second DHCP request (now with iPXE user-class)
       ▼
┌──────────────┐
│ DHCP Server  │  Detects user-class="iPXE"
│              │  Option 67: http://boot.server/boot.ipxe
└──────┬───────┘
       │ 5. DHCP OFFER with script URL
       ▼
┌──────────────┐
│ HTTP Server  │  Serves: boot.ipxe (iPXE script)
└──────┬───────┘
       │ 6. Download and execute boot script
       ▼
┌──────────────┐
│ iPXE Script  │  Loads: NixOS kernel + initrd + kexec
│  Execution   │
└──────┬───────┘
       │ 7. kexec into NixOS installer
       ▼
┌──────────────┐
│ NixOS Live   │  SSH enabled, waiting for nixos-anywhere
│  Installer   │
└──────────────┘

2.2 DHCP Configuration Requirements

The DHCP server must support architecture-specific boot file selection and iPXE user-class detection. For ISC DHCP server (/etc/dhcp/dhcpd.conf):

# Architecture detection (RFC 4578)
option architecture-type code 93 = unsigned integer 16;

# iPXE detection
option user-class code 77 = string;

subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;

  option routers 10.0.0.1;
  option domain-name-servers 10.0.0.1;

  # Boot server
  next-server 10.0.0.2;  # TFTP/HTTP server IP

  # Chainloading logic
  if exists user-class and option user-class = "iPXE" {
    # iPXE is already loaded, provide boot script via HTTP
    filename "http://10.0.0.2:8080/boot.ipxe";
  } elsif option architecture-type = 00:00 {
    # BIOS (legacy) - load iPXE via TFTP
    filename "undionly.kpxe";
  } elsif option architecture-type = 00:07 {
    # UEFI x86_64 - load iPXE via TFTP
    filename "ipxe.efi";
  } elsif option architecture-type = 00:09 {
    # UEFI x86_64 (alternate) - load iPXE via TFTP
    filename "ipxe.efi";
  } else {
    # Fallback
    filename "ipxe.efi";
  }
}

Key Points:

  • Option 93 (architecture-type): Distinguishes BIOS (0x0000) vs UEFI (0x0007/0x0009)
  • next-server (BOOTP siaddr header field; Option 66 is the equivalent tftp-server-name option): TFTP server IP for initial boot files
  • Option 67 (filename): Boot file name, changes based on architecture and iPXE presence
  • User-class detection: Prevents infinite loop (iPXE downloading itself)
  • HTTP chainloading: After iPXE loads, switch to HTTP for faster downloads

2.3 iPXE Script Structure

The boot script (/srv/boot/boot.ipxe) provides a menu for deployment profiles:

#!ipxe

# Variables
set boot-server 10.0.0.2:8080
set nix-cache http://${boot-server}/nix-cache

# Display system info
echo System information:
echo - Platform: ${platform}
echo - Architecture: ${buildarch}
echo - MAC: ${net0/mac}
echo - IP: ${net0/ip}
echo

# Menu with timeout
:menu
menu PlasmaCloud Bare-Metal Provisioning
item --gap -- ──────────── Deployment Profiles ────────────
item control-plane   Install Control Plane Node (Chainfire + FlareDB + IAM)
item worker          Install Worker Node (PlasmaVMC + NovaNET + Storage)
item all-in-one      Install All-in-One (All 8 Services)
item shell           Boot to NixOS Installer Shell
item --gap -- ─────────────────────────────────────────────
item --key r reboot  Reboot System
choose --timeout 30000 --default all-in-one target || goto menu

# Execute selection
goto ${target}

:control-plane
echo Booting Control Plane installer...
set profile control-plane
goto boot

:worker
echo Booting Worker Node installer...
set profile worker
goto boot

:all-in-one
echo Booting All-in-One installer...
set profile all-in-one
goto boot

:shell
echo Booting to installer shell...
set profile shell
goto boot

:boot
# Load NixOS netboot artifacts (from nixos-images or custom build)
kernel http://${boot-server}/nixos/bzImage init=/nix/store/...-nixos-system/init loglevel=4 console=ttyS0 console=tty0 nixos.profile=${profile} || goto failed
initrd http://${boot-server}/nixos/initrd || goto failed
boot || goto failed

:reboot
reboot

:failed
echo Boot failed, dropping to shell...
sleep 10
shell

Features:

  • Multi-profile support: Different service combinations per node type
  • Hardware detection: Shows MAC/IP for inventory tracking
  • Timeout with default: Unattended deployment after 30 seconds
  • Kernel parameters: Pass profile to NixOS installer for conditional configuration
  • Error handling: Falls back to shell on failure

2.4 HTTP vs TFTP Trade-offs

Aspect            TFTP                               HTTP
Speed             ~1-5 MB/s (UDP, no windowing)      ~50-100+ MB/s (TCP with pipelining)
Reliability       Low (UDP, prone to timeouts)       High (TCP with retries)
Firmware Support  Universal (all PXE ROMs)           UEFI 2.5+ only (HTTP Boot)
Complexity        Simple protocol, minimal config    Requires web server (nginx/apache)
Use Case          Initial iPXE binary (~100KB)       Kernel/initrd/images (~100-500MB)

Recommended Hybrid Approach:

  1. TFTP for initial iPXE binary delivery (universal compatibility)
  2. HTTP for all subsequent artifacts (kernel, initrd, scripts, packages)
  3. Configure iPXE with embedded HTTP support
  4. NixOS netboot images served via HTTP with range request support for resumability

UEFI HTTP Boot Alternative: For pure UEFI environments, skip TFTP entirely by using DHCP Option 60 (Vendor Class = "HTTPClient") and Option 67 (HTTP URI). However, this lacks BIOS compatibility and requires newer firmware (2015+).

3. Image Generation Strategy

3.1 Building NixOS Netboot Images

NixOS provides built-in netboot image generation. We extend this to include PlasmaCloud services:

Option 1: Custom Netboot Configuration (Recommended)

Create nix/images/netboot.nix:

{ config, pkgs, lib, modulesPath, ... }:

{
  imports = [
    "${modulesPath}/installer/netboot/netboot-minimal.nix"
    ../../nix/modules  # PlasmaCloud service modules
  ];

  # Networking for installer phase
  networking = {
    usePredictableInterfaceNames = false;  # Use eth0 instead of enpXsY
    useDHCP = true;
    firewall.enable = false;  # Open during installation
  };

  # SSH for nixos-anywhere
  services.openssh = {
    enable = true;
    settings = {
      PermitRootLogin = "yes";
      PasswordAuthentication = false;  # Key-based only
    };
  };

  # Authorized keys for provisioning server
  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIProvisioning Server Key..."
  ];

  # Minimal kernel for hardware support
  boot.kernelPackages = pkgs.linuxPackages_latest;
  boot.supportedFilesystems = [ "ext4" "xfs" "btrfs" "zfs" ];

  # Include disko for disk management
  environment.systemPackages = with pkgs; [
    disko
    parted
    cryptsetup
    lvm2
  ];

  # Disable unnecessary services for installer
  documentation.enable = false;
  documentation.nixos.enable = false;
  sound.enable = false;

  # Build artifacts needed for netboot
  system.build = {
    netbootRamdisk = config.system.build.initialRamdisk;
    kernel = config.system.build.kernel;
    netbootIpxeScript = pkgs.writeText "netboot.ipxe" ''
      #!ipxe
      kernel ''${boot-url}/bzImage init=${config.system.build.toplevel}/init ${toString config.boot.kernelParams}
      initrd ''${boot-url}/initrd
      boot
    '';
  };
}

Build the netboot artifacts:

nix build .#nixosConfigurations.netboot.config.system.build.netbootRamdisk -o ramdisk
nix build .#nixosConfigurations.netboot.config.system.build.kernel -o kernel

# Copy to HTTP server (separate out-links so the second build doesn't overwrite ./result)
cp kernel/bzImage /srv/boot/nixos/
cp ramdisk/initrd /srv/boot/nixos/

Option 2: Use Pre-built Images (Faster Development)

The nix-community/nixos-images project provides pre-built netboot images:

# Use their iPXE chainload directly
chain https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/netboot-x86_64-linux.ipxe

# Or download artifacts
curl -L https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/bzImage -o /srv/boot/nixos/bzImage
curl -L https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/initrd -o /srv/boot/nixos/initrd

3.2 Configuration Injection Approach

Configuration must be injected at installation time (not baked into netboot image) to support:

  • Node-specific networking (static IPs, VLANs)
  • Cluster join parameters (existing Raft leader addresses)
  • TLS certificates (unique per node)
  • Hardware-specific disk layouts

Three-Phase Configuration Model:

Phase 1: Netboot Image (Generic)

  • Universal kernel with broad hardware support
  • SSH server with provisioning key
  • disko + installer tools
  • No node-specific data

Phase 2: nixos-anywhere Deployment (Node-Specific)

  • Pull node configuration from provisioning server based on MAC/hostname
  • Partition disks per disko spec
  • Install NixOS with flake: github:yourorg/plasmacloud#node-hostname
  • Inject secrets: /etc/nixos/secrets/ (TLS certs, cluster tokens)

Phase 3: First Boot (Service Initialization)

  • systemd service reads /etc/nixos/secrets/cluster-config.json
  • Auto-join Chainfire cluster (or bootstrap if first node)
  • FlareDB joins after Chainfire is healthy
  • IAM initializes with FlareDB backend
  • Other services start with proper dependencies

Configuration Repository Structure:

/srv/provisioning/
├── nodes/
│   ├── node01.example.com/
│   │   ├── hardware.nix          # Generated from nixos-generate-config
│   │   ├── configuration.nix     # Node-specific service config
│   │   ├── disko.nix              # Disk layout
│   │   └── secrets/
│   │       ├── tls-cert.pem
│   │       ├── tls-key.pem
│   │       ├── tls-ca.pem
│   │       └── cluster-config.json
│   └── node02.example.com/
│       └── ...
├── profiles/
│   ├── control-plane.nix         # Chainfire + FlareDB + IAM
│   ├── worker.nix                # PlasmaVMC + storage
│   └── all-in-one.nix            # All 8 services
└── common/
    ├── base.nix                  # Common settings (SSH, users, firewall)
    └── networking.nix            # Network defaults
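
Adding a node to this layout is mechanical enough to script; a sketch (`scaffold_node` is a hypothetical helper; the generated configuration.nix mirrors the node configuration example in this section):

```shell
# Scaffold a new node directory in the provisioning repo layout shown above.
# PROVISIONING_REPO defaults to the path used in this document.
scaffold_node() {
  local fqdn="$1" profile="$2"
  local repo="${PROVISIONING_REPO:-/srv/provisioning}"
  local node_dir="$repo/nodes/$fqdn"

  mkdir -p "$node_dir/secrets"
  chmod 700 "$node_dir/secrets"    # secrets stay root-readable only

  # Minimal configuration.nix pointing at the chosen profile; hardware.nix
  # and disko.nix are filled in later during provisioning.
  cat > "$node_dir/configuration.nix" <<EOF
{ config, pkgs, lib, ... }:

{
  imports = [
    ../../profiles/$profile.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  networking.hostName = "${fqdn%%.*}";
  system.stateVersion = "24.11";
}
EOF
}
```

Usage: `scaffold_node node03.example.com worker`.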

Node Configuration Example (nodes/node01.example.com/configuration.nix):

{ config, pkgs, lib, ... }:

{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  networking = {
    hostName = "node01";
    domain = "example.com";
    interfaces.eth0 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.1.10";
        prefixLength = 24;
      }];
    };
    defaultGateway = "10.0.1.1";
    nameservers = [ "10.0.1.1" ];
  };

  # Service configuration
  services.chainfire = {
    enable = true;
    port = 2379;
    raftPort = 2380;
    gossipPort = 2381;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      # Initial cluster peers (for bootstrap)
      initial_peers = [
        "node01.example.com:2380"
        "node02.example.com:2380"
        "node03.example.com:2380"
      ];
      tls = {
        cert_path = "/etc/nixos/secrets/tls-cert.pem";
        key_path = "/etc/nixos/secrets/tls-key.pem";
        ca_path = "/etc/nixos/secrets/tls-ca.pem";
      };
    };
  };

  services.flaredb = {
    enable = true;
    port = 2479;
    raftPort = 2480;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      chainfire_endpoint = "https://localhost:2379";
      tls = {
        cert_path = "/etc/nixos/secrets/tls-cert.pem";
        key_path = "/etc/nixos/secrets/tls-key.pem";
        ca_path = "/etc/nixos/secrets/tls-ca.pem";
      };
    };
  };

  services.iam = {
    enable = true;
    port = 8080;
    settings = {
      flaredb_endpoint = "https://localhost:2479";
      tls = {
        cert_path = "/etc/nixos/secrets/tls-cert.pem";
        key_path = "/etc/nixos/secrets/tls-key.pem";
      };
    };
  };

  system.stateVersion = "24.11";
}

3.3 Hardware Detection vs Explicit Hardware Config

Hardware Detection (Automatic):

During installation, nixos-generate-config scans hardware and creates hardware-configuration.nix:

# On live installer, after disk setup
nixos-generate-config --root /mnt --show-hardware-config > /tmp/hardware.nix

# Upload to provisioning server
curl -X POST -F "file=@/tmp/hardware.nix" http://provisioning-server/api/hardware/node01

Explicit Hardware Config (Declarative):

For homogeneous hardware (e.g., fleet of identical servers), use a template:

# profiles/hardware/dell-r640.nix
{ config, lib, pkgs, modulesPath, ... }:

{
  imports = [ (modulesPath + "/installer/scan/not-detected.nix") ];

  boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "nvme" "usbhid" "sd_mod" ];
  boot.kernelModules = [ "kvm-intel" ];

  # Network interfaces (predictable naming)
  networking.interfaces = {
    enp59s0f0 = {};  # 10GbE Port 1
    enp59s0f1 = {};  # 10GbE Port 2
  };

  # CPU microcode updates
  hardware.cpu.intel.updateMicrocode = true;

  # Power management
  powerManagement.cpuFreqGovernor = "performance";

  nixpkgs.hostPlatform = "x86_64-linux";
}

Recommendation:

  • Phase 1 (Development): Auto-detect hardware for flexibility
  • Phase 2 (Production): Standardize on explicit hardware profiles for consistency and faster deployments

3.4 Image Size Optimization

Netboot images must fit in RAM (typically 1-4 GB available after kexec). Strategies:

1. Exclude Documentation and Locales:

documentation.enable = false;
documentation.nixos.enable = false;
i18n.supportedLocales = [ "en_US.UTF-8/UTF-8" ];

2. Minimal Kernel:

boot.kernelPackages = pkgs.linuxPackages_latest;
boot.kernelParams = [ "modprobe.blacklist=nouveau" ];  # Exclude unused drivers

3. Squashfs Compression: NixOS netboot uses squashfs for the Nix store, achieving ~2.5x compression:

# Automatically applied by netboot-minimal.nix
system.build.squashfsStore = ...;  # Default: gzip compression

4. On-Demand Package Fetching: Instead of bundling all packages, fetch from HTTP substituter during installation:

nix.settings.substituters = [ "http://10.0.0.2:8080/nix-cache" ];
nix.settings.trusted-public-keys = [ "cache-key-here" ];

Expected Sizes:

  • Minimal installer (no services): ~150-250 MB (initrd)
  • Installer + PlasmaCloud packages: ~400-600 MB (with on-demand fetch)
  • Full offline installer: ~1-2 GB (includes all service closures)

4. Installation Flow

4.1 Step-by-Step Process

1. PXE Boot to NixOS Installer (Automated)

  • Server powers on, sends DHCP request
  • DHCP provides iPXE binary (via TFTP)
  • iPXE loads, sends second DHCP request with user-class
  • DHCP provides boot script URL (via HTTP)
  • iPXE downloads script, executes, loads kernel+initrd
  • kexec into NixOS installer (in RAM, ~30-60 seconds)
  • Installer boots, acquires IP via DHCP, starts SSH server

2. Provisioning Server Detects Node (Semi-Automated)

Provisioning server monitors DHCP leases or receives webhook from installer:

# Installer sends registration on boot (custom init script)
curl -X POST http://provisioning-server/api/register \
  -d '{"mac":"aa:bb:cc:dd:ee:ff","ip":"10.0.0.100","hostname":"node01"}'

Provisioning server looks up node in inventory:

# /srv/provisioning/inventory.json
{
  "nodes": {
    "aa:bb:cc:dd:ee:ff": {
      "hostname": "node01.example.com",
      "profile": "control-plane",
      "config_path": "/srv/provisioning/nodes/node01.example.com"
    }
  }
}
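
Resolving a registering node against this inventory is a small jq lookup; a sketch (assumes jq is available on the provisioning server; `lookup_node` is illustrative):

```shell
# Look up a node's provisioning entry in inventory.json by MAC address.
# Prints "<hostname> <profile> <config_path>"; fails if the MAC is unknown.
lookup_node() {
  local inventory="$1" mac="$2"
  jq -er --arg mac "$mac" \
    '.nodes[$mac] // error("unknown MAC: \($mac)")
     | "\(.hostname) \(.profile) \(.config_path)"' \
    "$inventory"
}
```

Usage: `lookup_node /srv/provisioning/inventory.json aa:bb:cc:dd:ee:ff`.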

3. Run nixos-anywhere (Automated)

Provisioning server executes nixos-anywhere:

#!/bin/bash
# /srv/provisioning/scripts/provision-node.sh

NODE_MAC="$1"
NODE_IP=$(get_ip_from_dhcp "$NODE_MAC")
NODE_HOSTNAME=$(lookup_hostname "$NODE_MAC")
CONFIG_PATH="/srv/provisioning/nodes/$NODE_HOSTNAME"

# Copy secrets to installer (will be injected during install)
ssh root@$NODE_IP "mkdir -p /tmp/secrets"
scp $CONFIG_PATH/secrets/* root@$NODE_IP:/tmp/secrets/

# Run nixos-anywhere with disko
nix run github:nix-community/nixos-anywhere -- \
  --flake "/srv/provisioning#$NODE_HOSTNAME" \
  --build-on-remote \
  --disk-encryption-keys /tmp/disk.key <(cat $CONFIG_PATH/secrets/disk-encryption.key) \
  root@$NODE_IP

nixos-anywhere performs:

  • Detects existing OS (if any)
  • Loads kexec if needed (already done via PXE)
  • Runs disko to partition disks (based on $CONFIG_PATH/disko.nix)
  • Builds NixOS system closure (either locally or on target)
  • Copies closure to /mnt (mounted root)
  • Installs bootloader (GRUB/systemd-boot)
  • Copies secrets to /mnt/etc/nixos/secrets/
  • Unmounts, reboots
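
provision-node.sh above calls two placeholder helpers; a minimal `get_ip_from_dhcp`, scanning an ISC dhcpd leases file (default path assumed), could look like:

```shell
# Resolve a MAC address to its most recent lease IP from dhcpd.leases.
# ISC dhcpd appends leases, so the last matching entry wins.
get_ip_from_dhcp() {
  local mac="$1" leases="${2:-/var/lib/dhcp/dhcpd.leases}"
  awk -v mac="$mac" '
    $1 == "lease"    { ip = $2 }          # remember the lease being parsed
    $1 == "hardware" && $2 == "ethernet" {
      gsub(/;/, "", $3)                   # strip trailing semicolon
      if ($3 == mac) last = ip            # later entries override earlier ones
    }
    END { if (last) print last; else exit 1 }
  ' "$leases"
}
```

`lookup_hostname` would follow the same pattern against the inventory file.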

4. First Boot into Installed System (Automated)

Server reboots from disk (GRUB/systemd-boot), loads NixOS:

  • systemd starts
  • chainfire.service starts (waits 30s for network)
  • If initial_peers matches only self → bootstrap new cluster
  • If initial_peers includes others → attempt to join existing cluster
  • flaredb.service starts after chainfire is healthy
  • iam.service starts after flaredb is healthy
  • Other services start based on profile

First-boot cluster join logic (systemd unit):

# /etc/nixos/first-boot-cluster-join.nix
{ config, lib, pkgs, ... }:

{
  systemd.services.chainfire-cluster-join = {
    description = "Chainfire Cluster Join";
    after = [ "network-online.target" "chainfire.service" ];
    wants = [ "network-online.target" ];
    wantedBy = [ "multi-user.target" ];

    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };

    # Read cluster-config.json at runtime, not at Nix evaluation time: a
    # builtins.readFile on the secret would copy it into the world-readable
    # Nix store.
    script = ''
      CONFIG=/etc/nixos/secrets/cluster-config.json

      # Wait for local chainfire to be ready (-f makes HTTP errors retry)
      until ${pkgs.curl}/bin/curl -fsk https://localhost:2379/health; do
        echo "Waiting for local chainfire..."
        sleep 5
      done

      # Check if this is the first node (bootstrap)
      if [ "$(${pkgs.jq}/bin/jq -r .bootstrap "$CONFIG")" = "true" ]; then
        echo "Bootstrap node, cluster already initialized"
        exit 0
      fi

      # Join existing cluster
      LEADER_URL=$(${pkgs.jq}/bin/jq -r .leader_url "$CONFIG")
      NODE_ID=$(${pkgs.jq}/bin/jq -r .node_id "$CONFIG")
      RAFT_ADDR=$(${pkgs.jq}/bin/jq -r .raft_addr "$CONFIG")

      ${pkgs.curl}/bin/curl -fk -X POST "$LEADER_URL/admin/member/add" \
        -H "Content-Type: application/json" \
        -d "{\"id\":\"$NODE_ID\",\"raft_addr\":\"$RAFT_ADDR\"}"

      echo "Cluster join initiated"
    '';
  };

  # Similar for flaredb
  systemd.services.flaredb-cluster-join = {
    description = "FlareDB Cluster Join";
    after = [ "chainfire-cluster-join.service" "flaredb.service" ];
    requires = [ "chainfire-cluster-join.service" ];
    # ... similar logic
  };
}

5. Validation (Manual/Automated)

Provisioning server polls health endpoints:

# Health check script
curl -k https://10.0.1.10:2379/health  # Chainfire
curl -k https://10.0.1.10:2479/health  # FlareDB
curl -k https://10.0.1.10:8080/health  # IAM

# Cluster status
curl -k https://10.0.1.10:2379/admin/cluster/members | jq
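
These spot checks can be wrapped in a polling loop for unattended validation; in this sketch the fetch command is a parameter so the helper can be exercised without a live cluster (in production it would be `curl -fsk`):

```shell
# Poll a list of health endpoints until all succeed or attempts run out.
# Usage: wait_healthy <fetch-command> <attempts> <delay-seconds> <url>...
wait_healthy() {
  local fetch="$1" attempts="$2" delay="$3"; shift 3
  local i url ok
  for ((i = 1; i <= attempts; i++)); do
    ok=1
    for url in "$@"; do
      # $fetch is intentionally unquoted so "curl -fsk" splits into words
      $fetch "$url" > /dev/null 2>&1 || { ok=0; break; }
    done
    [ "$ok" = 1 ] && return 0
    sleep "$delay"
  done
  return 1
}
```

Usage: `wait_healthy "curl -fsk" 60 5 https://10.0.1.10:2379/health https://10.0.1.10:2479/health https://10.0.1.10:8080/health`.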

4.2 Error Handling and Recovery

Boot Failures:

  • Symptom: Server stuck in PXE boot loop
  • Diagnosis: Check DHCP server logs, verify TFTP/HTTP server accessibility
  • Recovery: Fix DHCP config, restart services, retry boot

Disk Partitioning Failures:

  • Symptom: nixos-anywhere fails during disko phase
  • Diagnosis: SSH to installer, run dmesg | grep -i error, check disk accessibility
  • Recovery: Adjust disko config (e.g., wrong disk device), re-run nixos-anywhere

Installation Failures:

  • Symptom: nixos-anywhere fails during installation phase
  • Diagnosis: Check nixos-anywhere output, SSH to /mnt to inspect
  • Recovery: Fix configuration errors, re-run nixos-anywhere (will reformat)

Cluster Join Failures:

  • Symptom: Service starts but not in cluster
  • Diagnosis: journalctl -u chainfire-cluster-join, check leader reachability
  • Recovery: Manually run join command, verify TLS certs, check firewall

Rollback Strategy:

  • NixOS generations provide atomic rollback: nixos-rebuild switch --rollback
  • For catastrophic failure: Re-provision from PXE (data loss if not replicated)

4.3 Network Requirements

DHCP:

  • Option 66/67 for PXE boot
  • Option 93 for architecture detection
  • User-class filtering for iPXE chainload
  • Static reservations for production nodes (optional)

DNS:

  • Forward and reverse DNS for all nodes (required for TLS cert CN verification)
  • Example: node01.example.com → 10.0.1.10 (forward) and 10.0.1.10 → node01.example.com (reverse)

Firewall:

  • Allow TFTP (UDP 69) from nodes to boot server
  • Allow HTTP (TCP 80/8080) from nodes to boot/provisioning server
  • Allow SSH (TCP 22) from provisioning server to nodes
  • Allow service ports (2379-2381, 2479-2480, 8080, etc.) between cluster nodes

Internet Access:

  • During installation: Required for Nix binary cache (cache.nixos.org) unless using local cache
  • After installation: Optional (recommended for updates), can run air-gapped with local cache
  • Workaround: Set up local binary cache: nix-serve + nginx

Bandwidth:

  • PXE boot: ~200 MB (kernel + initrd) per node, sequential is acceptable
  • Installation: ~1-5 GB (Nix closures) per node, parallel ok if cache is local
  • Recommendation: 1 Gbps link between provisioning server and nodes

5. Integration Points

5.1 T024 NixOS Modules

The NixOS modules from T024 (nix/modules/*.nix) provide declarative service configuration. They are included in node configurations:

{ config, pkgs, lib, inputs, ... }:  # "inputs" must be passed in via specialArgs

{
  imports = [
    # Import PlasmaCloud service modules
    inputs.plasmacloud.nixosModules.default
  ];

  # Enable services declaratively
  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.iam.enable = true;
  # ... etc
}

Module Integration Strategy:

  1. Flake Inputs: Node configurations reference the PlasmaCloud flake:

    # flake.nix for provisioning repo
    inputs.plasmacloud.url = "github:yourorg/plasmacloud";
    # or path-based for development
    inputs.plasmacloud.url = "path:/path/to/plasmacloud/repo";
    
  2. Service Packages: Packages are injected via overlay:

    nixpkgs.overlays = [ inputs.plasmacloud.overlays.default ];
    # Now pkgs.chainfire-server, pkgs.flaredb-server, etc. are available
    
  3. Dependency Graph: systemd units respect T024 dependencies:

    chainfire.service
      ↓ requires/after
    flaredb.service
      ↓ requires/after
    iam.service
      ↓ requires/after
    plasmavmc.service, flashdns.service, ... (parallel)
    
  4. Configuration Schema: Use services.<name>.settings for service-specific config:

    services.chainfire.settings = {
      node_id = "node01";
      cluster_name = "prod";
      tls = { ... };
    };
    

5.2 T027 Config Unification

T027 established a unified configuration approach (clap + config file/env). This integrates with NixOS in two ways:

1. NixOS Module → Config File Generation:

The NixOS module translates services.<name>.settings to a config file:

# In nix/modules/chainfire.nix
systemd.services.chainfire = {
  preStart = ''
    # Generate config file from settings
    cat > /var/lib/chainfire/config.toml <<EOF
    node_id = "${cfg.settings.node_id}"
    cluster_name = "${cfg.settings.cluster_name}"

    [tls]
    cert_path = "${cfg.settings.tls.cert_path}"
    key_path = "${cfg.settings.tls.key_path}"
    ca_path = "${cfg.settings.tls.ca_path or ""}"
    EOF
  '';

  serviceConfig.ExecStart = "${cfg.package}/bin/chainfire-server --config /var/lib/chainfire/config.toml";
};

2. Environment Variable Injection:

For secrets not suitable for Nix store:

systemd.services.chainfire.serviceConfig = {
  EnvironmentFile = "/etc/nixos/secrets/chainfire.env";
  # File contains: CHAINFIRE_API_TOKEN=secret123
};

Best Practices:

  • Public config: Use services.<name>.settings (stored in Nix store, world-readable)
  • Secrets: Use EnvironmentFile or systemd credentials
  • Hybrid: Config file with placeholders, secrets injected at runtime

5.3 T031 TLS Certificates

T031 added TLS to all 8 services. Provisioning must handle certificate distribution:

Certificate Provisioning Strategies:

Option 1: Pre-Generated Certificates (Simple)

  1. Generate certs on provisioning server per node:

    # /srv/provisioning/scripts/generate-certs.sh node01.example.com
    # -addext: modern TLS clients verify the SAN extension, not the CN
    openssl req -x509 -newkey rsa:4096 -nodes \
      -keyout node01-key.pem -out node01-cert.pem \
      -days 365 -subj "/CN=node01.example.com" \
      -addext "subjectAltName=DNS:node01.example.com"
    
  2. Copy to node secrets directory:

    cp node01-*.pem /srv/provisioning/nodes/node01.example.com/secrets/
    
  3. nixos-anywhere installs them to /etc/nixos/secrets/ (mode 0400, owner root)

  4. NixOS module references them:

    services.chainfire.settings.tls = {
      cert_path = "/etc/nixos/secrets/tls-cert.pem";
      key_path = "/etc/nixos/secrets/tls-key.pem";
      ca_path = "/etc/nixos/secrets/tls-ca.pem";
    };
    

Option 2: ACME (Let's Encrypt) for External Services

For internet-facing services (e.g., PlasmaVMC API):

security.acme = {
  acceptTerms = true;
  defaults.email = "admin@example.com";
};

services.plasmavmc.settings.tls = {
  cert_path = config.security.acme.certs."plasmavmc.example.com".directory + "/cert.pem";
  key_path = config.security.acme.certs."plasmavmc.example.com".directory + "/key.pem";
};

security.acme.certs."plasmavmc.example.com" = {
  domain = "plasmavmc.example.com";
  # Use DNS-01 challenge for internal servers
  dnsProvider = "cloudflare";
  credentialsFile = "/etc/nixos/secrets/cloudflare-api-token";
};

Option 3: Internal CA with Cert-Manager (Advanced)

  1. Deploy cert-manager as a service on control plane
  2. Generate per-node CSRs during first boot
  3. Cert-manager signs and distributes certs
  4. Systemd timer renews certs before expiry

Recommendation:

  • Phase 1 (MVP): Pre-generated certs (Option 1)
  • Phase 2 (Production): ACME for external + internal CA for internal (Option 2+3)

5.4 Chainfire/FlareDB Cluster Join

Bootstrap (First 3 Nodes):

First node (node01):

services.chainfire.settings = {
  node_id = "node01";
  initial_peers = [
    "node01.example.com:2380"
    "node02.example.com:2380"
    "node03.example.com:2380"
  ];
  bootstrap = true;  # This node starts the cluster
};

Subsequent nodes (node02, node03):

services.chainfire.settings = {
  node_id = "node02";
  initial_peers = [
    "node01.example.com:2380"
    "node02.example.com:2380"
    "node03.example.com:2380"
  ];
  bootstrap = false;  # Join existing cluster
};

Runtime Join (After Bootstrap):

New nodes added to running cluster:

  1. Provision node with bootstrap = false, initial_peers = []
  2. First-boot service calls leader's admin API:
    curl -k -X POST https://node01.example.com:2379/admin/member/add \
      -H "Content-Type: application/json" \
      -d '{"id":"node04","raft_addr":"node04.example.com:2380"}'
    
  3. Node receives cluster state, starts Raft
  4. Leader replicates to new node

FlareDB Follows Same Pattern:

FlareDB depends on Chainfire for coordination but maintains its own Raft cluster:

services.flaredb.settings = {
  node_id = "node01";
  chainfire_endpoint = "https://localhost:2379";
  initial_peers = [ "node01:2480" "node02:2480" "node03:2480" ];
};

Critical: Ensure chainfire.service is healthy before starting flaredb.service (enforced by systemd requires/after).

5.5 IAM Bootstrap

IAM requires initial admin user creation. Two approaches:

Option 1: First-Boot Initialization Script

systemd.services.iam-bootstrap = {
  description = "IAM Initial Admin User";
  after = [ "iam.service" ];
  wantedBy = [ "multi-user.target" ];

  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };

  script = ''
    # Check if admin exists
    if ${pkgs.curl}/bin/curl -ks https://localhost:8080/api/users/admin | grep -q "not found"; then
      # Create admin user
      ADMIN_PASSWORD=$(cat /etc/nixos/secrets/iam-admin-password)
      ${pkgs.curl}/bin/curl -k -X POST https://localhost:8080/api/users \
        -H "Content-Type: application/json" \
        -d "{\"username\":\"admin\",\"password\":\"$ADMIN_PASSWORD\",\"role\":\"admin\"}"
      echo "Admin user created"
    else
      echo "Admin user already exists"
    fi
  '';
};

Option 2: Environment Variable for Default Admin

IAM service creates admin on first start if DB is empty:

// In iam-server main.rs
if user_count() == 0 {
    let admin_password = env::var("IAM_INITIAL_ADMIN_PASSWORD")
        .expect("IAM_INITIAL_ADMIN_PASSWORD must be set for first boot");
    create_user("admin", &admin_password, Role::Admin)?;
    info!("Initial admin user created");
}

The NixOS side supplies the secret via an environment file:

systemd.services.iam.serviceConfig = {
  EnvironmentFile = "/etc/nixos/secrets/iam.env";
  # File contains: IAM_INITIAL_ADMIN_PASSWORD=random-secure-password
};

Recommendation: Use Option 2 (environment variable) for simplicity. Generate random password during node provisioning, store in secrets.
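
The random password mentioned above can be produced during node provisioning; a sketch that writes the env file into the node's secrets directory (directory layout from section 3.2; `gen_iam_secret` is illustrative):

```shell
# Generate the per-node IAM admin secret as a systemd EnvironmentFile.
# Usage: gen_iam_secret <node-secrets-dir>
gen_iam_secret() {
  local secrets_dir="$1"
  local password
  umask 077                              # env file is created 0600
  mkdir -p "$secrets_dir"
  # 32 alphanumeric chars from the kernel RNG (avoids an openssl dependency)
  password="$(tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 32)"
  printf 'IAM_INITIAL_ADMIN_PASSWORD=%s\n' "$password" > "$secrets_dir/iam.env"
}
```

Usage: `gen_iam_secret /srv/provisioning/nodes/node01.example.com/secrets`.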

6. Alternatives Considered

6.1 nixos-anywhere vs Custom Installer

nixos-anywhere (Chosen):

  • Pros:
    • Mature, actively maintained by nix-community
    • Handles kexec, disko integration, bootloader install automatically
    • SSH-based, works from any OS (no need for NixOS on provisioning server)
    • Supports remote builds and disk encryption out of box
    • Well-documented with many examples
  • Cons:
    • Requires SSH access (not suitable for zero-touch provisioning without PXE+SSH)
    • Opinionated workflow (less flexible than custom scripts)
    • Dependency on external project (but very stable)

Custom Installer (Rejected):

  • Pros:
    • Full control over installation flow
    • Could implement zero-touch (e.g., installer pulls config from server without SSH)
    • Tailored to PlasmaCloud-specific needs
  • Cons:
    • Significant development effort (partitioning, bootloader, error handling)
    • Reinvents well-tested code (disko, kexec integration)
    • Maintenance burden (keep up with NixOS changes)
    • Higher risk of bugs (partitioning is error-prone)

Decision: Use nixos-anywhere for reliability and speed. The SSH requirement is acceptable since PXE boot already provides network access, and adding SSH keys to the netboot image is straightforward.

6.2 Disk Management Tools

disko (Chosen):

  • Pros:
    • Declarative, fits NixOS philosophy
    • Integrates with nixos-anywhere out of box
    • Supports complex layouts (RAID, LVM, LUKS, ZFS, btrfs)
    • Idempotent (can reformat or verify existing layout)
  • Cons:
    • Nix-based DSL (learning curve)
    • Limited to Linux filesystems (no Windows support, not relevant here)

Kickstart/Preseed (Rejected):

  • Used by Fedora/Debian installers
  • Not NixOS-native, would require custom integration

Terraform with Libvirt (Rejected):

  • Good for VMs, not bare metal
  • Doesn't handle disk partitioning directly

Decision: disko is the clear choice for NixOS deployments.

6.3 Boot Methods

iPXE over TFTP/HTTP (Chosen):

  • Pros:
    • Universal support (BIOS + UEFI)
    • Flexible scripting (boot menus, conditional logic)
    • HTTP support for fast downloads
    • Open source, widely deployed
  • Cons:
    • Requires DHCP configuration (Option 66/67 setup)
    • Chainloading adds complexity (but solved problem)

UEFI HTTP Boot (Rejected):

  • Pros:
    • Native UEFI, no TFTP needed
    • Simpler DHCP config (just Option 60/67)
  • Cons:
    • UEFI only (no BIOS support)
    • Firmware support inconsistent (pre-2015 servers)
    • Less flexible than iPXE scripting

Bootable USB Media (Rejected):

  • Manual, not scalable for fleet deployment
  • Useful for one-off installs only

Decision: iPXE for flexibility and compatibility. UEFI HTTP Boot could be considered later for pure UEFI fleets.
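The DHCP side of the chosen iPXE chainload can be sketched as a dnsmasq fragment. The boot-server address (10.0.0.10) and binary names mirror Appendix C; the boot.ipxe script path is an assumption. The option-175 match is the standard trick that stops undionly.kpxe from chainloading itself forever, and option 93 distinguishes UEFI clients.

```shell
#!/usr/bin/env bash
# Write a dnsmasq fragment for two-stage iPXE boot.  Addresses and file
# names are illustrative; directives are standard dnsmasq syntax.
set -euo pipefail

write_dnsmasq_pxe() {
  local out=$1 boot_server=$2
  cat > "$out" <<EOF
# iPXE sends DHCP option 175; tag those requests so they skip chainloading
dhcp-match=set:ipxe,175
# Client architecture 7 = x86-64 UEFI
dhcp-match=set:efi,option:client-arch,7
# First pass: hand plain firmware the right iPXE binary over TFTP
dhcp-boot=tag:!ipxe,tag:efi,ipxe.efi,,${boot_server}
dhcp-boot=tag:!ipxe,tag:!efi,undionly.kpxe,,${boot_server}
# Second pass: iPXE itself fetches the boot script over HTTP
dhcp-boot=tag:ipxe,http://${boot_server}:8080/scripts/boot.ipxe
enable-tftp
tftp-root=/srv/boot/tftp
EOF
}

write_dnsmasq_pxe /tmp/pxe.conf 10.0.0.10
```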

6.4 Configuration Management

NixOS Flakes (Chosen):

  • Pros:
    • Native to NixOS, declarative
    • Reproducible builds with lock files
    • Git-based, version controlled
    • No external agent needed (systemd handles state)
  • Cons:
    • Steep learning curve for operators unfamiliar with Nix
    • Less dynamic than Ansible (changes require rebuild)

Ansible (Rejected for Provisioning, Useful for Orchestration):

  • Pros:
    • Agentless, SSH-based
    • Large ecosystem of modules
    • Dynamic, easy to patch running systems
  • Cons:
    • Imperative (harder to guarantee state)
    • Doesn't integrate with NixOS packages/modules
    • Adds another tool to stack

Terraform (Rejected):

  • Infrastructure-as-code, not config management
  • Better for cloud VMs than bare metal

Decision: Use NixOS flakes for provisioning and base config. Ansible may be added later for operational tasks (e.g., rolling updates, health checks) that don't fit NixOS's declarative model.

7. Open Questions / Decisions Needed

7.1 Hardware Inventory Management

Question: How do we map MAC addresses to node roles and configurations?

Options:

  1. Manual Inventory File: Operator maintains JSON/YAML with MAC → hostname → config mapping
  2. Auto-Discovery: First boot prompts operator to assign role (e.g., via serial console or web UI)
  3. External CMDB: Integrate with existing Configuration Management Database (e.g., NetBox, Nautobot)

Recommendation: Start with manual inventory file (simple), migrate to CMDB integration in Phase 2.
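The manual inventory option can be sketched as a JSON file mapping MAC address to hostname and role, plus a lookup helper a boot script could call. The field names and file path are illustrative, and jq is assumed to be available on the provisioning server.

```shell
#!/usr/bin/env bash
# Minimal MAC -> node mapping for the manual-inventory option.
set -euo pipefail

cat > /tmp/inventory.json <<'EOF'
{
  "52:54:00:aa:bb:01": { "hostname": "node01", "role": "bootstrap" },
  "52:54:00:aa:bb:02": { "hostname": "node02", "role": "join" }
}
EOF

lookup_host() {
  # Unknown MACs fall back to "unknown" rather than an empty string
  jq -r --arg mac "$1" '.[$mac].hostname // "unknown"' /tmp/inventory.json
}

lookup_host 52:54:00:aa:bb:01   # prints "node01"
```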

7.2 Secrets Management

Question: How are secrets (TLS keys, passwords) generated, stored, and rotated?

Options:

  1. File-Based (Current): Secrets in /srv/provisioning/nodes/*/secrets/, copied during install
  2. Vault Integration: Fetch secrets from HashiCorp Vault at boot time
  3. systemd Credentials: Use systemd's encrypted credentials feature (requires systemd 250+)

Recommendation: Phase 1 uses file-based (simple, works today). Phase 2 adds Vault for production (centralized, auditable, rotation support).

7.3 Network Boot Security

Question: How do we prevent rogue nodes from joining the cluster?

Concerns:

  • Attacker boots unauthorized server on network
  • Installer has SSH key, could be accessed
  • Node joins cluster with malicious intent

Mitigations:

  1. MAC Whitelist: DHCP only serves known MAC addresses
  2. Network Segmentation: PXE boot on isolated provisioning VLAN
  3. SSH Key Per Node: Each node has unique authorized_keys in netboot image (complex)
  4. Cluster Authentication: Raft join requires cluster token (not yet implemented)

Recommendation: Use MAC whitelist + provisioning VLAN for Phase 1. Add cluster join tokens in Phase 2 (requires Chainfire/FlareDB changes).
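The MAC-whitelist mitigation maps directly onto dnsmasq directives: requests from machines without an explicit dhcp-host entry are ignored entirely. The MACs and IPs below are placeholders; the directive names are standard dnsmasq.

```shell
#!/usr/bin/env bash
# Emit a dnsmasq fragment that only serves DHCP to whitelisted MACs.
set -euo pipefail

write_whitelist() {
  local out=$1
  shift
  {
    # Hosts without a dhcp-host line are never tagged "known" and are dropped
    echo "dhcp-ignore=tag:!known"
    local entry
    for entry in "$@"; do   # each entry is "mac,ip"
      echo "dhcp-host=${entry}"
    done
  } > "$out"
}

write_whitelist /tmp/whitelist.conf \
  52:54:00:aa:bb:01,10.0.0.101 \
  52:54:00:aa:bb:02,10.0.0.102
```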

7.4 Multi-Datacenter Deployment

Question: How does provisioning work across geographically distributed datacenters?

Challenges:

  • WAN latency for Nix cache fetches
  • PXE boot requires local DHCP/TFTP
  • Cluster join across WAN (Raft latency)

Options:

  1. Replicated Provisioning Server: Deploy boot server in each datacenter, sync configs
  2. Central Provisioning with Local Cache: Single source of truth, local Nix cache mirrors
  3. Per-DC Clusters: Each datacenter is independent cluster, federated at application layer

Recommendation: Defer to Phase 2. Phase 1 assumes single datacenter or low-latency LAN.

7.5 Disk Encryption

Question: Should disks be encrypted at rest?

Trade-offs:

  • Pros: Compliance (GDPR, PCI-DSS), protection against physical theft
  • Cons: Key management complexity, can't auto-reboot (manual unlock), performance overhead (~5-10%)

Options:

  1. No Encryption: Rely on physical security
  2. LUKS with Network Unlock: Tang/Clevis for automated unlocking (requires network on boot)
  3. LUKS with Manual Unlock: Operator enters passphrase via KVM/IPMI

Recommendation: Optional, configurable per deployment. Provide disko template for LUKS, let operator decide.

7.6 Rolling Updates

Question: How do we update a running cluster without downtime?

Challenges:

  • Raft requires quorum (can't update majority simultaneously)
  • Service dependencies (Chainfire → FlareDB → others)
  • NixOS rebuild requires reboot (for kernel/init changes)

Strategy:

  1. Update one node at a time (rolling)
  2. Verify health before proceeding to next
  3. Use nixos-rebuild test first (activates without bootloader change), then switch after validation

Tooling:

  • Ansible playbook for orchestration
  • Health check scripts (curl endpoints + check Raft status)
  • Rollback plan (NixOS generations + Raft snapshot restore)

Recommendation: Document as runbook in Phase 1, implement automated rolling update in Phase 2 (T033?).
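The runbook above can be sketched as a script. The node list, root SSH access, the /health endpoint, and the 60-second timeout are all assumptions; HEALTH_CMD is parameterised so the wait loop can be exercised without a live cluster.

```shell
#!/usr/bin/env bash
# Rolling-update sketch: test-activate, verify health, then switch.
set -euo pipefail

# Health probe, overridable for dry runs; $node is expanded inside wait_healthy
HEALTH_CMD=${HEALTH_CMD:-'curl -fsk "https://$node:8080/health"'}

wait_healthy() {
  local node=$1 tries=0
  until eval "$HEALTH_CMD" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then
      echo "FAIL $node"
      return 1
    fi
    sleep 2
  done
  echo "OK $node"
}

rolling_update() {
  for node in "$@"; do
    # 1) activate the new configuration without touching the bootloader
    ssh "root@$node" nixos-rebuild test --flake ".#$node"
    # 2) only proceed once this node reports healthy again (quorum preserved)
    wait_healthy "$node"
    # 3) make the new generation the boot default
    ssh "root@$node" nixos-rebuild switch --flake ".#$node"
  done
}

HEALTH_CMD=true wait_healthy demo-node   # prints "OK demo-node"
```

Because nixos-rebuild test does not alter the bootloader, a node that fails its health check can simply be rebooted into the previous generation.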

7.7 Monitoring and Alerting

Question: How do we monitor provisioning success/failure?

Options:

  1. Manual: Operator watches terminal, checks health endpoints
  2. Log Aggregation: Collect installer logs, index in Loki/Elasticsearch
  3. Event Webhook: Installer posts events to monitoring system (Grafana, PagerDuty)

Recommendation: Phase 1 uses manual monitoring. Phase 2 adds structured logging + webhooks for fleet deployments.
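The event-webhook option could look like the sketch below: the installer emits one small JSON event per phase and POSTs it to a collector. The endpoint URL and the event fields are illustrative, not a fixed schema.

```shell
#!/usr/bin/env bash
# Emit and post provisioning events as one-line JSON documents.
set -euo pipefail

provision_event() {
  local node=$1 phase=$2 status=$3
  printf '{"node":"%s","phase":"%s","status":"%s","ts":"%s"}' \
    "$node" "$phase" "$status" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

post_event() {
  # Collector endpoint is an assumption; --fail surfaces HTTP errors
  provision_event "$@" | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "${WEBHOOK_URL:-http://provision.example.com/events}"
}

provision_event node01 install success
```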

7.8 Compatibility with Existing Infrastructure

Question: Can this provisioning system coexist with existing PXE infrastructure (e.g., for other OS deployments)?

Concerns:

  • Existing DHCP config may conflict
  • TFTP server may serve other boot files
  • Network team may control PXE infrastructure

Solutions:

  1. Dedicated Provisioning VLAN: PlasmaCloud nodes on separate network
  2. Conditional DHCP: Use vendor-class or subnet matching to route to correct boot server
  3. Multi-Boot Menu: iPXE menu includes options for PlasmaCloud and other OSes

Recommendation: Document network requirements, provide example DHCP config for common scenarios (dedicated VLAN, shared infrastructure). Coordinate with network team.


Appendices

A. Example Disko Configuration

Single Disk with GPT and ext4:

# nodes/node01/disko.nix
{ disks ? [ "/dev/sda" ], ... }:
{
  disko.devices = {
    disk = {
      main = {
        type = "disk";
        device = builtins.head disks;
        content = {
          type = "gpt";
          partitions = {
            ESP = {
              size = "512M";
              type = "EF00";
              content = {
                type = "filesystem";
                format = "vfat";
                mountpoint = "/boot";
              };
            };
            root = {
              size = "100%";
              content = {
                type = "filesystem";
                format = "ext4";
                mountpoint = "/";
              };
            };
          };
        };
      };
    };
  };
}

RAID1 with LUKS Encryption:

{ disks ? [ "/dev/sda" "/dev/sdb" ], ... }:
{
  disko.devices = {
    disk = {
      disk1 = {
        device = builtins.elemAt disks 0;
        type = "disk";
        content = {
          type = "gpt";
          partitions = {
            boot = {
              size = "1M";
              type = "EF02"; # BIOS boot
            };
            mdraid = {
              size = "100%";
              content = {
                type = "mdraid";
                name = "raid1";
              };
            };
          };
        };
      };
      disk2 = {
        device = builtins.elemAt disks 1;
        type = "disk";
        content = {
          type = "gpt";
          partitions = {
            boot = {
              size = "1M";
              type = "EF02";
            };
            mdraid = {
              size = "100%";
              content = {
                type = "mdraid";
                name = "raid1";
              };
            };
          };
        };
      };
    };
    mdadm = {
      raid1 = {
        type = "mdadm";
        level = 1;
        content = {
          type = "luks";
          name = "cryptroot";
          settings.allowDiscards = true;
          content = {
            type = "filesystem";
            format = "ext4";
            mountpoint = "/";
          };
        };
      };
    };
  };
}

B. Complete nixos-anywhere Command Examples

Basic Deployment:

nix run github:nix-community/nixos-anywhere -- \
  --flake .#node01 \
  root@10.0.0.100

With Build on Remote (Slow Local Machine):

nix run github:nix-community/nixos-anywhere -- \
  --flake .#node01 \
  --build-on-remote \
  root@10.0.0.100

With Disk Encryption Key:

nix run github:nix-community/nixos-anywhere -- \
  --flake .#node01 \
  --disk-encryption-keys /tmp/luks.key <(cat /secrets/node01-luks.key) \
  root@10.0.0.100

Debug Mode (Keep Installer After Failure):

nix run github:nix-community/nixos-anywhere -- \
  --flake .#node01 \
  --debug \
  --no-reboot \
  root@10.0.0.100

C. Provisioning Server Setup Script

#!/bin/bash
# /srv/provisioning/scripts/setup-provisioning-server.sh

set -euo pipefail

# Install dependencies
apt-get update
apt-get install -y nginx tftpd-hpa dnsmasq curl

# Configure TFTP
cat > /etc/default/tftpd-hpa <<EOF
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/boot/tftp"
TFTP_ADDRESS="0.0.0.0:69"
TFTP_OPTIONS="--secure"
EOF

mkdir -p /srv/boot/tftp
systemctl restart tftpd-hpa

# Download iPXE binaries
curl -L https://boot.ipxe.org/undionly.kpxe -o /srv/boot/tftp/undionly.kpxe
curl -L https://boot.ipxe.org/ipxe.efi -o /srv/boot/tftp/ipxe.efi

# Configure nginx for HTTP boot
cat > /etc/nginx/sites-available/pxe <<EOF
server {
  listen 8080;
  server_name _;
  root /srv/boot;

  location / {
    autoindex on;
    try_files \$uri \$uri/ =404;
  }

  # Enable range requests for large files
  location ~* \.(iso|img|bin|efi|kpxe)$ {
    add_header Accept-Ranges bytes;
  }
}
EOF

ln -sf /etc/nginx/sites-available/pxe /etc/nginx/sites-enabled/
systemctl restart nginx

# Create directory structure
mkdir -p /srv/boot/{nixos,nix-cache,scripts}
mkdir -p /srv/provisioning/{nodes,profiles,common,scripts}

echo "Provisioning server setup complete!"
echo "Next steps:"
echo "1. Configure DHCP server (see design doc Section 2.2)"
echo "2. Build NixOS netboot image (see Section 3.1)"
echo "3. Create node configurations (see Section 3.2)"

D. First-Boot Cluster Config JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Cluster Configuration",
  "type": "object",
  "properties": {
    "node_id": {
      "type": "string",
      "description": "Unique identifier for this node"
    },
    "bootstrap": {
      "type": "boolean",
      "description": "True if this node should bootstrap a new cluster"
    },
    "leader_url": {
      "type": "string",
      "format": "uri",
      "description": "URL of existing cluster leader (for join)"
    },
    "raft_addr": {
      "type": "string",
      "description": "This node's Raft address (host:port)"
    },
    "cluster_token": {
      "type": "string",
      "description": "Shared secret for cluster authentication (future)"
    }
  },
  "required": ["node_id", "bootstrap"],
  "if": {
    "properties": { "bootstrap": { "const": false } }
  },
  "then": {
    "required": ["leader_url", "raft_addr"]
  }
}

Example for bootstrap node:

{
  "node_id": "node01",
  "bootstrap": true,
  "raft_addr": "node01.example.com:2380"
}

Example for joining node:

{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "node04.example.com:2380"
}


Revision History

| Version | Date       | Author | Changes       |
|---------|------------|--------|---------------|
| 0.1     | 2025-12-10 | peerB  | Initial draft |

End of Design Document