# T032 Bare-Metal Provisioning Design Document
**Status:** Draft
**Author:** peerB
**Created:** 2025-12-10
**Last Updated:** 2025-12-10
## 1. Architecture Overview
This document outlines the design for automated bare-metal provisioning of the PlasmaCloud platform, which consists of 8 core services (Chainfire, FlareDB, IAM, PlasmaVMC, PrismNET, FlashDNS, FiberLB, and K8sHost). The provisioning system leverages NixOS's declarative configuration capabilities to enable fully automated deployment from bare hardware to a running, clustered platform.
The high-level flow follows this sequence: **PXE Boot → kexec NixOS Installer → disko Disk Partitioning → nixos-anywhere Installation → First-Boot Configuration → Running Cluster**. A bare-metal server performs a network boot via PXE/iPXE, which loads a minimal NixOS installer into RAM using kexec. The installer then connects to a provisioning server, which uses nixos-anywhere to declaratively partition disks (via disko), install NixOS with pre-configured services, and inject node-specific configuration (SSH keys, network settings, cluster join parameters, TLS certificates). On first boot, the system automatically joins existing Raft clusters (Chainfire/FlareDB) or bootstraps new ones, and all 8 services start with proper dependencies and TLS enabled.
The key components are:
- **PXE/iPXE Boot Server**: Serves boot binaries and configuration scripts via TFTP/HTTP
- **nixos-anywhere**: SSH-based remote installation tool that orchestrates the entire deployment
- **disko**: Declarative disk partitioning engine integrated with nixos-anywhere
- **kexec**: Linux kernel feature enabling fast boot into NixOS installer without full reboot
- **NixOS Flake** (from T024): Provides all service packages and NixOS modules
- **Configuration Injection System**: Manages node-specific secrets, network config, and cluster metadata
- **First-Boot Automation**: Systemd units that perform cluster join and service initialization
## 2. PXE Boot Flow
### 2.1 Boot Sequence
```
┌─────────────┐
│ Bare Metal  │
│   Server    │
└──────┬──────┘
       │ 1. UEFI/BIOS PXE ROM
┌──────────────┐
│ DHCP Server  │  Option 93: Client Architecture (0=BIOS, 7=UEFI x64)
│              │  Option 67: Boot filename (undionly.kpxe or ipxe.efi)
│              │  Option 66: TFTP server address
└──────┬───────┘
       │ 2. DHCP OFFER with boot parameters
┌──────────────┐
│ TFTP/HTTP    │
│ Server       │  Serves: undionly.kpxe (BIOS) or ipxe.efi (UEFI)
└──────┬───────┘
       │ 3. Download iPXE bootloader
┌──────────────┐
│ iPXE Running │  User-class="iPXE" in DHCP request
│ (in RAM)     │
└──────┬───────┘
       │ 4. Second DHCP request (now with iPXE user-class)
┌──────────────┐
│ DHCP Server  │  Detects user-class="iPXE"
│              │  Option 67: http://boot.server/boot.ipxe
└──────┬───────┘
       │ 5. DHCP OFFER with script URL
┌──────────────┐
│ HTTP Server  │  Serves: boot.ipxe (iPXE script)
└──────┬───────┘
       │ 6. Download and execute boot script
┌──────────────┐
│ iPXE Script  │  Loads: NixOS kernel + initrd + kexec
│ Execution    │
└──────┬───────┘
       │ 7. kexec into NixOS installer
┌──────────────┐
│ NixOS Live   │  SSH enabled, waiting for nixos-anywhere
│ Installer    │
└──────────────┘
```
### 2.2 DHCP Configuration Requirements
The DHCP server must support architecture-specific boot file selection and iPXE user-class detection. For ISC DHCP server (`/etc/dhcp/dhcpd.conf`):
```dhcp
# Architecture detection (RFC 4578)
option architecture-type code 93 = unsigned integer 16;
# iPXE detection
option user-class code 77 = string;
subnet 10.0.0.0 netmask 255.255.255.0 {
  range 10.0.0.100 10.0.0.200;
  option routers 10.0.0.1;
  option domain-name-servers 10.0.0.1;

  # Boot server
  next-server 10.0.0.2;  # TFTP/HTTP server IP

  # Chainloading logic
  if exists user-class and option user-class = "iPXE" {
    # iPXE is already loaded, provide boot script via HTTP
    filename "http://10.0.0.2:8080/boot.ipxe";
  } elsif option architecture-type = 00:00 {
    # BIOS (legacy) - load iPXE via TFTP
    filename "undionly.kpxe";
  } elsif option architecture-type = 00:07 {
    # UEFI x86_64 - load iPXE via TFTP
    filename "ipxe.efi";
  } elsif option architecture-type = 00:09 {
    # UEFI x86_64 (alternate) - load iPXE via TFTP
    filename "ipxe.efi";
  } else {
    # Fallback
    filename "ipxe.efi";
  }
}
```
**Key Points:**
- **Option 93** (architecture-type): Distinguishes BIOS (0x0000) vs UEFI (0x0007/0x0009)
- **Option 66** (next-server): TFTP server IP for initial boot files
- **Option 67** (filename): Boot file name, changes based on architecture and iPXE presence
- **User-class detection**: Prevents infinite loop (iPXE downloading itself)
- **HTTP chainloading**: After iPXE loads, switch to HTTP for faster downloads
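The selection matrix above (user-class first, then architecture) can be expressed as a small pure function, which is convenient for testing the cases outside a DHCP server. This is a sketch mirroring the `dhcpd.conf` logic above, not part of the deployment itself:

```shell
#!/usr/bin/env bash
# Sketch of the boot-file selection logic from the dhcpd.conf above.
# arch is the RFC 4578 option 93 value as a decimal string.
select_boot_file() {
  local user_class="$1" arch="$2"
  if [ "$user_class" = "iPXE" ]; then
    echo "http://10.0.0.2:8080/boot.ipxe"   # chainload script over HTTP
  elif [ "$arch" = "0" ]; then
    echo "undionly.kpxe"                    # legacy BIOS
  else
    echo "ipxe.efi"                         # UEFI (0x07/0x09) and fallback
  fi
}

select_boot_file ""     "0"   # → undionly.kpxe
select_boot_file ""     "7"   # → ipxe.efi
select_boot_file "iPXE" "7"   # → http://10.0.0.2:8080/boot.ipxe
```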
### 2.3 iPXE Script Structure
The boot script (`/srv/boot/boot.ipxe`) provides a menu for deployment profiles:
```ipxe
#!ipxe
# Variables
set boot-server 10.0.0.2:8080
set nix-cache http://${boot-server}/nix-cache
# Display system info
echo System information:
echo - Platform: ${platform}
echo - Architecture: ${buildarch}
echo - MAC: ${net0/mac}
echo - IP: ${net0/ip}
echo
# Menu with timeout
:menu
menu PlasmaCloud Bare-Metal Provisioning
item --gap -- ──────────── Deployment Profiles ────────────
item control-plane Install Control Plane Node (Chainfire + FlareDB + IAM)
item worker Install Worker Node (PlasmaVMC + PrismNET + Storage)
item all-in-one Install All-in-One (All 8 Services)
item shell Boot to NixOS Installer Shell
item --gap -- ─────────────────────────────────────────────
item --key r reboot Reboot System
choose --timeout 30000 --default all-in-one target || goto menu
# Execute selection
goto ${target}
:control-plane
echo Booting Control Plane installer...
set profile control-plane
goto boot
:worker
echo Booting Worker Node installer...
set profile worker
goto boot
:all-in-one
echo Booting All-in-One installer...
set profile all-in-one
goto boot
:shell
echo Booting to installer shell...
set profile shell
goto boot
:boot
# Load NixOS netboot artifacts (from nixos-images or custom build)
kernel http://${boot-server}/nixos/bzImage init=/nix/store/...-nixos-system/init loglevel=4 console=ttyS0 console=tty0 nixos.profile=${profile}
initrd http://${boot-server}/nixos/initrd
boot
:reboot
reboot
:failed
echo Boot failed, dropping to shell...
sleep 10
shell
```
**Features:**
- **Multi-profile support**: Different service combinations per node type
- **Hardware detection**: Shows MAC/IP for inventory tracking
- **Timeout with default**: Unattended deployment after 30 seconds
- **Kernel parameters**: Pass profile to NixOS installer for conditional configuration
- **Error handling**: Falls back to shell on failure
### 2.4 HTTP vs TFTP Trade-offs
| Aspect | TFTP | HTTP |
|--------|------|------|
| **Speed** | ~1-5 MB/s (UDP, no windowing) | ~50-100+ MB/s (TCP with pipelining) |
| **Reliability** | Low (UDP, prone to timeouts) | High (TCP with retries) |
| **Firmware Support** | Universal (all PXE ROMs) | UEFI 2.5+ only (HTTP Boot) |
| **Complexity** | Simple protocol, minimal config | Requires web server (nginx/apache) |
| **Use Case** | Initial iPXE binary (~100KB) | Kernel/initrd/images (~100-500MB) |
**Recommended Hybrid Approach:**
1. **TFTP** for initial iPXE binary delivery (universal compatibility)
2. **HTTP** for all subsequent artifacts (kernel, initrd, scripts, packages)
3. Configure iPXE with embedded HTTP support
4. NixOS netboot images served via HTTP with range request support for resumability
**UEFI HTTP Boot Alternative:**
For pure UEFI environments, skip TFTP entirely by using DHCP Option 60 (Vendor Class = "HTTPClient") and Option 67 (HTTP URI). However, this lacks BIOS compatibility and requires newer firmware (2015+).
## 3. Image Generation Strategy
### 3.1 Building NixOS Netboot Images
NixOS provides built-in netboot image generation. We extend this to include PlasmaCloud services:
**Option 1: Custom Netboot Configuration (Recommended)**
Create `nix/images/netboot.nix`:
```nix
{ config, pkgs, lib, modulesPath, ... }:
{
  imports = [
    "${modulesPath}/installer/netboot/netboot-minimal.nix"
    ../../nix/modules # PlasmaCloud service modules
  ];

  # Networking for installer phase
  networking = {
    usePredictableInterfaceNames = false; # Use eth0 instead of enpXsY
    useDHCP = true;
    firewall.enable = false; # Open during installation
  };

  # SSH for nixos-anywhere
  services.openssh = {
    enable = true;
    settings = {
      PermitRootLogin = "yes";
      PasswordAuthentication = false; # Key-based only
    };
  };

  # Authorized keys for provisioning server
  users.users.root.openssh.authorizedKeys.keys = [
    "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIProvisioning Server Key..."
  ];

  # Minimal kernel for hardware support
  boot.kernelPackages = pkgs.linuxPackages_latest;
  boot.supportedFilesystems = [ "ext4" "xfs" "btrfs" "zfs" ];

  # Include disko for disk management
  environment.systemPackages = with pkgs; [
    disko
    parted
    cryptsetup
    lvm2
  ];

  # Disable unnecessary services for installer
  documentation.enable = false;
  documentation.nixos.enable = false;
  sound.enable = false;

  # Build artifacts needed for netboot
  system.build = {
    netbootRamdisk = config.system.build.initialRamdisk;
    kernel = config.system.build.kernel;
    # Note: ''${ escapes the interpolation so iPXE (not Nix) expands boot-url
    netbootIpxeScript = pkgs.writeText "netboot.ipxe" ''
      #!ipxe
      kernel ''${boot-url}/bzImage init=${config.system.build.toplevel}/init ${toString config.boot.kernelParams}
      initrd ''${boot-url}/initrd
      boot
    '';
  };
}
```
Build the netboot artifacts:
```bash
# Use separate out-links so the two builds don't clobber each other's ./result
nix build .#nixosConfigurations.netboot.config.system.build.kernel -o kernel
nix build .#nixosConfigurations.netboot.config.system.build.netbootRamdisk -o ramdisk
# Copy to HTTP server
cp kernel/bzImage /srv/boot/nixos/
cp ramdisk/initrd /srv/boot/nixos/
```
**Option 2: Use Pre-built Images (Faster Development)**
The [nix-community/nixos-images](https://github.com/nix-community/nixos-images) project provides pre-built netboot images:
```bash
# Use their iPXE chainload directly
chain https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/netboot-x86_64-linux.ipxe
# Or download artifacts
curl -L https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/bzImage -o /srv/boot/nixos/bzImage
curl -L https://github.com/nix-community/nixos-images/releases/download/nixos-unstable/initrd -o /srv/boot/nixos/initrd
```
### 3.2 Configuration Injection Approach
Configuration must be injected at installation time (not baked into netboot image) to support:
- Node-specific networking (static IPs, VLANs)
- Cluster join parameters (existing Raft leader addresses)
- TLS certificates (unique per node)
- Hardware-specific disk layouts
**Three-Phase Configuration Model:**
**Phase 1: Netboot Image (Generic)**
- Universal kernel with broad hardware support
- SSH server with provisioning key
- disko + installer tools
- No node-specific data
**Phase 2: nixos-anywhere Deployment (Node-Specific)**
- Pull node configuration from provisioning server based on MAC/hostname
- Partition disks per disko spec
- Install NixOS with flake: `github:yourorg/plasmacloud#node-hostname`
- Inject secrets: `/etc/nixos/secrets/` (TLS certs, cluster tokens)
**Phase 3: First Boot (Service Initialization)**
- systemd service reads `/etc/nixos/secrets/cluster-config.json`
- Auto-join Chainfire cluster (or bootstrap if first node)
- FlareDB joins after Chainfire is healthy
- IAM initializes with FlareDB backend
- Other services start with proper dependencies
**Configuration Repository Structure:**
```
/srv/provisioning/
├── nodes/
│   ├── node01.example.com/
│   │   ├── hardware.nix          # Generated from nixos-generate-config
│   │   ├── configuration.nix     # Node-specific service config
│   │   ├── disko.nix             # Disk layout
│   │   └── secrets/
│   │       ├── tls-cert.pem
│   │       ├── tls-key.pem
│   │       ├── tls-ca.pem
│   │       └── cluster-config.json
│   └── node02.example.com/
│       └── ...
├── profiles/
│   ├── control-plane.nix         # Chainfire + FlareDB + IAM
│   ├── worker.nix                # PlasmaVMC + storage
│   └── all-in-one.nix            # All 8 services
└── common/
    ├── base.nix                  # Common settings (SSH, users, firewall)
    └── networking.nix            # Network defaults
```
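For reference, a minimal `cluster-config.json` for a joining node might look like the following. The field names are taken from the first-boot join script in section 4.1 (`bootstrap`, `node_id`, `raft_addr`, `leader_url`); the values are illustrative:

```json
{
  "bootstrap": false,
  "node_id": "node04",
  "raft_addr": "node04.example.com:2380",
  "leader_url": "https://node01.example.com:2379"
}
```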
**Node Configuration Example (`nodes/node01.example.com/configuration.nix`):**
```nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./hardware.nix
    ./disko.nix
  ];

  networking = {
    hostName = "node01";
    domain = "example.com";
    interfaces.eth0 = {
      useDHCP = false;
      ipv4.addresses = [{
        address = "10.0.1.10";
        prefixLength = 24;
      }];
    };
    defaultGateway = "10.0.1.1";
    nameservers = [ "10.0.1.1" ];
  };

  # Service configuration
  services.chainfire = {
    enable = true;
    port = 2379;
    raftPort = 2380;
    gossipPort = 2381;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      # Initial cluster peers (for bootstrap)
      initial_peers = [
        "node01.example.com:2380"
        "node02.example.com:2380"
        "node03.example.com:2380"
      ];
      tls = {
        cert_path = "/etc/nixos/secrets/tls-cert.pem";
        key_path = "/etc/nixos/secrets/tls-key.pem";
        ca_path = "/etc/nixos/secrets/tls-ca.pem";
      };
    };
  };

  services.flaredb = {
    enable = true;
    port = 2479;
    raftPort = 2480;
    settings = {
      node_id = "node01";
      cluster_name = "prod-cluster";
      chainfire_endpoint = "https://localhost:2379";
      tls = {
        cert_path = "/etc/nixos/secrets/tls-cert.pem";
        key_path = "/etc/nixos/secrets/tls-key.pem";
        ca_path = "/etc/nixos/secrets/tls-ca.pem";
      };
    };
  };

  services.iam = {
    enable = true;
    port = 8080;
    settings = {
      flaredb_endpoint = "https://localhost:2479";
      tls = {
        cert_path = "/etc/nixos/secrets/tls-cert.pem";
        key_path = "/etc/nixos/secrets/tls-key.pem";
      };
    };
  };

  system.stateVersion = "24.11";
}
```
### 3.3 Hardware Detection vs Explicit Hardware Config
**Hardware Detection (Automatic):**
During installation, `nixos-generate-config` scans hardware and creates `hardware-configuration.nix`:
```bash
# On live installer, after disk setup
nixos-generate-config --root /mnt --show-hardware-config > /tmp/hardware.nix
# Upload to provisioning server
curl -X POST -F "file=@/tmp/hardware.nix" http://provisioning-server/api/hardware/node01
```
**Explicit Hardware Config (Declarative):**
For homogeneous hardware (e.g., fleet of identical servers), use a template:
```nix
# profiles/hardware/dell-r640.nix
{ config, lib, pkgs, modulesPath, ... }:
{
  imports = [ (modulesPath + "/installer/scan/not-detected.nix") ];

  boot.initrd.availableKernelModules = [ "xhci_pci" "ahci" "nvme" "usbhid" "sd_mod" ];
  boot.kernelModules = [ "kvm-intel" ];

  # Network interfaces (predictable naming)
  networking.interfaces = {
    enp59s0f0 = {}; # 10GbE Port 1
    enp59s0f1 = {}; # 10GbE Port 2
  };

  # CPU microcode updates
  hardware.cpu.intel.updateMicrocode = true;

  # Power management
  powerManagement.cpuFreqGovernor = "performance";

  nixpkgs.hostPlatform = "x86_64-linux";
}
```
**Recommendation:**
- **Phase 1 (Development):** Auto-detect hardware for flexibility
- **Phase 2 (Production):** Standardize on explicit hardware profiles for consistency and faster deployments
### 3.4 Image Size Optimization
Netboot images must fit in RAM (typically 1-4 GB available after kexec). Strategies:
**1. Exclude Documentation and Locales:**
```nix
documentation.enable = false;
documentation.nixos.enable = false;
i18n.supportedLocales = [ "en_US.UTF-8/UTF-8" ];
```
**2. Minimal Kernel:**
```nix
boot.kernelPackages = pkgs.linuxPackages_latest;
boot.kernelParams = [ "modprobe.blacklist=nouveau" ]; # Exclude unused drivers
```
**3. Squashfs Compression:**
NixOS netboot uses squashfs for the Nix store, achieving ~2.5x compression:
```nix
# Automatically applied by netboot-minimal.nix
system.build.squashfsStore = ...; # Default: gzip compression
```
**4. On-Demand Package Fetching:**
Instead of bundling all packages, fetch from HTTP substituter during installation:
```nix
nix.settings.substituters = [ "http://10.0.0.2:8080/nix-cache" ];
nix.settings.trusted-public-keys = [ "cache-key-here" ];
```
**Expected Sizes:**
- **Minimal installer (no services):** ~150-250 MB (initrd)
- **Installer + PlasmaCloud packages:** ~400-600 MB (with on-demand fetch)
- **Full offline installer:** ~1-2 GB (includes all service closures)
## 4. Installation Flow
### 4.1 Step-by-Step Process
**1. PXE Boot to NixOS Installer (Automated)**
- Server powers on, sends DHCP request
- DHCP provides iPXE binary (via TFTP)
- iPXE loads, sends second DHCP request with user-class
- DHCP provides boot script URL (via HTTP)
- iPXE downloads script, executes, loads kernel+initrd
- kexec into NixOS installer (in RAM, ~30-60 seconds)
- Installer boots, acquires IP via DHCP, starts SSH server
**2. Provisioning Server Detects Node (Semi-Automated)**
Provisioning server monitors DHCP leases or receives webhook from installer:
```bash
# Installer sends registration on boot (custom init script)
curl -X POST http://provisioning-server/api/register \
-d '{"mac":"aa:bb:cc:dd:ee:ff","ip":"10.0.0.100","hostname":"node01"}'
```
Provisioning server looks up node in inventory:
```bash
# /srv/provisioning/inventory.json
{
  "nodes": {
    "aa:bb:cc:dd:ee:ff": {
      "hostname": "node01.example.com",
      "profile": "control-plane",
      "config_path": "/srv/provisioning/nodes/node01.example.com"
    }
  }
}
```
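The lookup itself is a one-line `jq` query against that inventory file. A hedged sketch (the function name `lookup_config_path` is ours; the inventory structure matches the example above):

```shell
#!/usr/bin/env bash
# Resolve a registering node's config directory from inventory.json by MAC.
# INVENTORY can be overridden for testing.
INVENTORY="${INVENTORY:-/srv/provisioning/inventory.json}"

lookup_config_path() {
  local mac="$1"
  # Prints the config_path, or nothing if the MAC is not in the inventory
  jq -r --arg mac "$mac" '.nodes[$mac].config_path // empty' "$INVENTORY"
}

# Usage: lookup_config_path "aa:bb:cc:dd:ee:ff"
```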
**3. Run nixos-anywhere (Automated)**
Provisioning server executes nixos-anywhere:
```bash
#!/bin/bash
# /srv/provisioning/scripts/provision-node.sh
set -euo pipefail

NODE_MAC="$1"
NODE_IP=$(get_ip_from_dhcp "$NODE_MAC")
NODE_HOSTNAME=$(lookup_hostname "$NODE_MAC")
CONFIG_PATH="/srv/provisioning/nodes/$NODE_HOSTNAME"

# Copy secrets to installer (will be injected during install)
ssh "root@$NODE_IP" "mkdir -p /tmp/secrets"
scp "$CONFIG_PATH"/secrets/* "root@$NODE_IP:/tmp/secrets/"

# Run nixos-anywhere with disko
nix run github:nix-community/nixos-anywhere -- \
  --flake "/srv/provisioning#$NODE_HOSTNAME" \
  --build-on-remote \
  --disk-encryption-keys /tmp/disk.key <(cat "$CONFIG_PATH/secrets/disk-encryption.key") \
  "root@$NODE_IP"
```
nixos-anywhere performs:
- Detects existing OS (if any)
- Loads kexec if needed (already done via PXE)
- Runs disko to partition disks (based on `$CONFIG_PATH/disko.nix`)
- Builds NixOS system closure (either locally or on target)
- Copies closure to `/mnt` (mounted root)
- Installs bootloader (GRUB/systemd-boot)
- Copies secrets to `/mnt/etc/nixos/secrets/`
- Unmounts, reboots
**4. First Boot into Installed System (Automated)**
Server reboots from disk (GRUB/systemd-boot), loads NixOS:
- systemd starts
- `chainfire.service` starts (waits 30s for network)
- If `initial_peers` matches only self → bootstrap new cluster
- If `initial_peers` includes others → attempt to join existing cluster
- `flaredb.service` starts after chainfire is healthy
- `iam.service` starts after flaredb is healthy
- Other services start based on profile
**First-boot cluster join logic** (systemd unit):
```nix
# /etc/nixos/first-boot-cluster-join.nix
{ config, lib, pkgs, ... }:
let
  clusterConfig = builtins.fromJSON (builtins.readFile /etc/nixos/secrets/cluster-config.json);
in
{
  systemd.services.chainfire-cluster-join = {
    description = "Chainfire Cluster Join";
    after = [ "network-online.target" "chainfire.service" ];
    wants = [ "network-online.target" ];
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = true;
    };
    script = ''
      # Wait for local chainfire to be ready
      until ${pkgs.curl}/bin/curl -k https://localhost:2379/health; do
        echo "Waiting for local chainfire..."
        sleep 5
      done

      # Check if this is the first node (bootstrap is a JSON boolean,
      # so render it as a string before comparing)
      if [ "${lib.boolToString clusterConfig.bootstrap}" = "true" ]; then
        echo "Bootstrap node, cluster already initialized"
        exit 0
      fi

      # Join existing cluster
      LEADER_URL="${clusterConfig.leader_url}"
      NODE_ID="${clusterConfig.node_id}"
      RAFT_ADDR="${clusterConfig.raft_addr}"
      ${pkgs.curl}/bin/curl -k -X POST "$LEADER_URL/admin/member/add" \
        -H "Content-Type: application/json" \
        -d "{\"id\":\"$NODE_ID\",\"raft_addr\":\"$RAFT_ADDR\"}"
      echo "Cluster join initiated"
    '';
  };

  # Similar for flaredb
  systemd.services.flaredb-cluster-join = {
    description = "FlareDB Cluster Join";
    after = [ "chainfire-cluster-join.service" "flaredb.service" ];
    requires = [ "chainfire-cluster-join.service" ];
    # ... similar logic
  };
}
```
**5. Validation (Manual/Automated)**
Provisioning server polls health endpoints:
```bash
# Health check script
curl -k https://10.0.1.10:2379/health # Chainfire
curl -k https://10.0.1.10:2479/health # FlareDB
curl -k https://10.0.1.10:8080/health # IAM
# Cluster status
curl -k https://10.0.1.10:2379/admin/cluster/members | jq
```
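Polling once is rarely enough right after a reboot, so the validation step usually wraps these checks in a retry loop. A hedged sketch (the `wait_healthy` helper is ours; endpoints match the commands above):

```shell
#!/usr/bin/env bash
# Retry a health endpoint until it answers, or give up after N tries.
wait_healthy() {
  local url="$1" tries="${2:-30}"
  local i
  for i in $(seq "$tries"); do
    # -f: treat HTTP errors as failures; -k: self-signed certs during bring-up
    if curl -ksf --max-time 2 "$url" >/dev/null; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# Usage (against node01's endpoints shown above):
#   wait_healthy https://10.0.1.10:2379/health && echo "chainfire up"
```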
### 4.2 Error Handling and Recovery
**Boot Failures:**
- **Symptom:** Server stuck in PXE boot loop
- **Diagnosis:** Check DHCP server logs, verify TFTP/HTTP server accessibility
- **Recovery:** Fix DHCP config, restart services, retry boot
**Disk Partitioning Failures:**
- **Symptom:** nixos-anywhere fails during disko phase
- **Diagnosis:** SSH to installer, run `dmesg | grep -i error`, check disk accessibility
- **Recovery:** Adjust disko config (e.g., wrong disk device), re-run nixos-anywhere
**Installation Failures:**
- **Symptom:** nixos-anywhere fails during installation phase
- **Diagnosis:** Check nixos-anywhere output, SSH to `/mnt` to inspect
- **Recovery:** Fix configuration errors, re-run nixos-anywhere (will reformat)
**Cluster Join Failures:**
- **Symptom:** Service starts but not in cluster
- **Diagnosis:** `journalctl -u chainfire-cluster-join`, check leader reachability
- **Recovery:** Manually run join command, verify TLS certs, check firewall
**Rollback Strategy:**
- NixOS generations provide atomic rollback: `nixos-rebuild switch --rollback`
- For catastrophic failure: Re-provision from PXE (data loss if not replicated)
### 4.3 Network Requirements
**DHCP:**
- Option 66/67 for PXE boot
- Option 93 for architecture detection
- User-class filtering for iPXE chainload
- Static reservations for production nodes (optional)
**DNS:**
- Forward and reverse DNS for all nodes (required for TLS certificate hostname verification against CN/SAN)
- Example: `node01.example.com` → `10.0.1.10` (forward) and `10.0.1.10` → `node01.example.com` (reverse)
**Firewall:**
- Allow TFTP (UDP 69) from nodes to boot server
- Allow HTTP (TCP 80/8080) from nodes to boot/provisioning server
- Allow SSH (TCP 22) from provisioning server to nodes
- Allow service ports (2379-2381, 2479-2480, 8080, etc.) between cluster nodes
**Internet Access:**
- **During installation:** Required for Nix binary cache (cache.nixos.org) unless using local cache
- **After installation:** Optional (recommended for updates), can run air-gapped with local cache
- **Workaround:** Set up local binary cache: `nix-serve` + nginx
**Bandwidth:**
- **PXE boot:** ~200 MB (kernel + initrd) per node, sequential is acceptable
- **Installation:** ~1-5 GB (Nix closures) per node, parallel ok if cache is local
- **Recommendation:** 1 Gbps link between provisioning server and nodes
## 5. Integration Points
### 5.1 T024 NixOS Modules
The NixOS modules from T024 (`nix/modules/*.nix`) provide declarative service configuration. They are included in node configurations:
```nix
{ config, pkgs, lib, inputs, ... }:
{
  imports = [
    # Import PlasmaCloud service modules
    # (inputs must be passed to the module via specialArgs in the flake)
    inputs.plasmacloud.nixosModules.default
  ];

  # Enable services declaratively
  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.iam.enable = true;
  # ... etc
}
```
**Module Integration Strategy:**
1. **Flake Inputs:** Node configurations reference the PlasmaCloud flake:
```nix
# flake.nix for provisioning repo
inputs.plasmacloud.url = "github:yourorg/plasmacloud";
# or path-based for development
inputs.plasmacloud.url = "path:/path/to/plasmacloud/repo";
```
2. **Service Packages:** Packages are injected via overlay:
```nix
nixpkgs.overlays = [ inputs.plasmacloud.overlays.default ];
# Now pkgs.chainfire-server, pkgs.flaredb-server, etc. are available
```
3. **Dependency Graph:** systemd units respect T024 dependencies:
```
chainfire.service
    ↓ requires/after
flaredb.service
    ↓ requires/after
iam.service
    ↓ requires/after
plasmavmc.service, flashdns.service, ... (parallel)
```
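Each edge in this graph corresponds to a pair of systemd unit options in the T024 modules. A sketch of one edge, using the standard NixOS `systemd.services` options:

```nix
# Sketch: the flaredb → chainfire edge as systemd unit options
systemd.services.flaredb = {
  requires = [ "chainfire.service" ]; # stop/fail flaredb if chainfire cannot run
  after = [ "chainfire.service" ];    # order flaredb's start after chainfire
};
```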
4. **Configuration Schema:** Use `services.<name>.settings` for service-specific config:
```nix
services.chainfire.settings = {
  node_id = "node01";
  cluster_name = "prod";
  tls = { ... };
};
```
### 5.2 T027 Config Unification
T027 established a unified configuration approach (clap + config file/env). This integrates with NixOS in two ways:
**1. NixOS Module → Config File Generation:**
The NixOS module translates `services.<name>.settings` to a config file:
```nix
# In nix/modules/chainfire.nix
systemd.services.chainfire = {
  preStart = ''
    # Generate config file from settings
    cat > /var/lib/chainfire/config.toml <<EOF
    node_id = "${cfg.settings.node_id}"
    cluster_name = "${cfg.settings.cluster_name}"
    [tls]
    cert_path = "${cfg.settings.tls.cert_path}"
    key_path = "${cfg.settings.tls.key_path}"
    ca_path = "${cfg.settings.tls.ca_path or ""}"
    EOF
  '';
  serviceConfig.ExecStart = "${cfg.package}/bin/chainfire-server --config /var/lib/chainfire/config.toml";
};
```
**2. Environment Variable Injection:**
For secrets not suitable for Nix store:
```nix
systemd.services.chainfire.serviceConfig = {
  EnvironmentFile = "/etc/nixos/secrets/chainfire.env";
  # File contains: CHAINFIRE_API_TOKEN=secret123
};
```
**Best Practices:**
- **Public config:** Use `services.<name>.settings` (stored in Nix store, world-readable)
- **Secrets:** Use `EnvironmentFile` or systemd credentials
- **Hybrid:** Config file with placeholders, secrets injected at runtime
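As a concrete sketch of the systemd credentials option mentioned above (the `$CREDENTIALS_DIRECTORY` path is provided by systemd; the secret path is an example):

```nix
systemd.services.chainfire.serviceConfig = {
  # systemd exposes the file to the service as $CREDENTIALS_DIRECTORY/api-token,
  # readable only by the service, without passing it through the Nix store
  LoadCredential = [ "api-token:/etc/nixos/secrets/chainfire-api-token" ];
};
```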
### 5.3 T031 TLS Certificates
T031 added TLS to all 8 services. Provisioning must handle certificate distribution:
**Certificate Provisioning Strategies:**
**Option 1: Pre-Generated Certificates (Simple)**
1. Generate certs on provisioning server per node:
```bash
# /srv/provisioning/scripts/generate-certs.sh node01.example.com
# Include a SAN: modern TLS clients verify subjectAltName, not CN alone
openssl req -x509 -newkey rsa:4096 -nodes \
  -keyout node01-key.pem -out node01-cert.pem \
  -days 365 -subj "/CN=node01.example.com" \
  -addext "subjectAltName=DNS:node01.example.com"
```
2. Copy to node secrets directory:
```bash
cp node01-*.pem /srv/provisioning/nodes/node01.example.com/secrets/
```
3. nixos-anywhere installs them to `/etc/nixos/secrets/` (mode 0400, owner root)
4. NixOS module references them:
```nix
services.chainfire.settings.tls = {
  cert_path = "/etc/nixos/secrets/tls-cert.pem";
  key_path = "/etc/nixos/secrets/tls-key.pem";
  ca_path = "/etc/nixos/secrets/tls-ca.pem";
};
```
**Option 2: ACME (Let's Encrypt) for External Services**
For internet-facing services (e.g., PlasmaVMC API):
```nix
security.acme = {
  acceptTerms = true;
  defaults.email = "admin@example.com";
};

services.plasmavmc.settings.tls = {
  cert_path = config.security.acme.certs."plasmavmc.example.com".directory + "/cert.pem";
  key_path = config.security.acme.certs."plasmavmc.example.com".directory + "/key.pem";
};

security.acme.certs."plasmavmc.example.com" = {
  domain = "plasmavmc.example.com";
  # Use DNS-01 challenge for internal servers
  dnsProvider = "cloudflare";
  credentialsFile = "/etc/nixos/secrets/cloudflare-api-token";
};
```
**Option 3: Internal CA with Cert-Manager (Advanced)**
1. Deploy cert-manager as a service on control plane
2. Generate per-node CSRs during first boot
3. Cert-manager signs and distributes certs
4. Systemd timer renews certs before expiry
**Recommendation:**
- **Phase 1 (MVP):** Pre-generated certs (Option 1)
- **Phase 2 (Production):** ACME for external + internal CA for internal (Option 2+3)
### 5.4 Chainfire/FlareDB Cluster Join
**Bootstrap (First 3 Nodes):**
First node (`node01`):
```nix
services.chainfire.settings = {
  node_id = "node01";
  initial_peers = [
    "node01.example.com:2380"
    "node02.example.com:2380"
    "node03.example.com:2380"
  ];
  bootstrap = true; # This node starts the cluster
};
```
Subsequent nodes (`node02`, `node03`):
```nix
services.chainfire.settings = {
  node_id = "node02";
  initial_peers = [
    "node01.example.com:2380"
    "node02.example.com:2380"
    "node03.example.com:2380"
  ];
  bootstrap = false; # Join existing cluster
};
```
**Runtime Join (After Bootstrap):**
New nodes added to running cluster:
1. Provision node with `bootstrap = false`, `initial_peers = []`
2. First-boot service calls leader's admin API:
```bash
curl -k -X POST https://node01.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node04","raft_addr":"node04.example.com:2380"}'
```
3. Node receives cluster state, starts Raft
4. Leader replicates to new node
**FlareDB Follows Same Pattern:**
FlareDB depends on Chainfire for coordination but maintains its own Raft cluster:
```nix
services.flaredb.settings = {
  node_id = "node01";
  chainfire_endpoint = "https://localhost:2379";
  initial_peers = [ "node01:2480" "node02:2480" "node03:2480" ];
};
```
**Critical:** Ensure `chainfire.service` is healthy before starting `flaredb.service` (enforced by systemd `requires`/`after`).
### 5.5 IAM Bootstrap
IAM requires initial admin user creation. Two approaches:
**Option 1: First-Boot Initialization Script**
```nix
systemd.services.iam-bootstrap = {
  description = "IAM Initial Admin User";
  after = [ "iam.service" ];
  wantedBy = [ "multi-user.target" ];
  serviceConfig = {
    Type = "oneshot";
    RemainAfterExit = true;
  };
  script = ''
    # Check if admin exists
    if ${pkgs.curl}/bin/curl -k https://localhost:8080/api/users/admin 2>&1 | grep -q "not found"; then
      # Create admin user
      ADMIN_PASSWORD=$(cat /etc/nixos/secrets/iam-admin-password)
      ${pkgs.curl}/bin/curl -k -X POST https://localhost:8080/api/users \
        -H "Content-Type: application/json" \
        -d "{\"username\":\"admin\",\"password\":\"$ADMIN_PASSWORD\",\"role\":\"admin\"}"
      echo "Admin user created"
    else
      echo "Admin user already exists"
    fi
  '';
};
```
**Option 2: Environment Variable for Default Admin**
IAM service creates admin on first start if DB is empty:
```rust
// In iam-server main.rs
if user_count() == 0 {
    let admin_password = env::var("IAM_INITIAL_ADMIN_PASSWORD")
        .expect("IAM_INITIAL_ADMIN_PASSWORD must be set for first boot");
    create_user("admin", &admin_password, Role::Admin)?;
    info!("Initial admin user created");
}
```
```nix
systemd.services.iam.serviceConfig = {
  EnvironmentFile = "/etc/nixos/secrets/iam.env";
  # File contains: IAM_INITIAL_ADMIN_PASSWORD=random-secure-password
};
```
**Recommendation:** Use Option 2 (environment variable) for simplicity. Generate random password during node provisioning, store in secrets.
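The password generation step can be sketched as follows. The helper name `gen_iam_env` and the character filtering are ours; the env file format matches the `iam.env` example above:

```shell
#!/usr/bin/env bash
# Generate the per-node IAM admin secret during provisioning (Option 2).
gen_iam_env() {
  local out="$1"
  umask 077  # secrets file should not be group/world readable
  local pw
  # 32 random bytes, base64'd, stripped of URL-unfriendly chars, truncated
  pw=$(head -c 32 /dev/urandom | base64 | tr -d '/+=' | cut -c1-24)
  printf 'IAM_INITIAL_ADMIN_PASSWORD=%s\n' "$pw" > "$out"
}

# Usage during node provisioning:
#   gen_iam_env /srv/provisioning/nodes/node01.example.com/secrets/iam.env
```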
## 6. Alternatives Considered
### 6.1 nixos-anywhere vs Custom Installer
**nixos-anywhere (Chosen):**
- **Pros:**
- Mature, actively maintained by nix-community
- Handles kexec, disko integration, bootloader install automatically
- SSH-based, works from any OS (no need for NixOS on provisioning server)
- Supports remote builds and disk encryption out of box
- Well-documented with many examples
- **Cons:**
- Requires SSH access (not suitable for zero-touch provisioning without PXE+SSH)
- Opinionated workflow (less flexible than custom scripts)
- Dependency on external project (but very stable)
**Custom Installer (Rejected):**
- **Pros:**
- Full control over installation flow
- Could implement zero-touch (e.g., installer pulls config from server without SSH)
- Tailored to PlasmaCloud-specific needs
- **Cons:**
- Significant development effort (partitioning, bootloader, error handling)
- Reinvents well-tested code (disko, kexec integration)
- Maintenance burden (keep up with NixOS changes)
- Higher risk of bugs (partitioning is error-prone)
**Decision:** Use nixos-anywhere for reliability and speed. The SSH requirement is acceptable since PXE boot already provides network access, and adding SSH keys to the netboot image is straightforward.
### 6.2 Disk Management Tools
**disko (Chosen):**
- **Pros:**
- Declarative, fits NixOS philosophy
- Integrates with nixos-anywhere out of box
- Supports complex layouts (RAID, LVM, LUKS, ZFS, btrfs)
- Idempotent (can reformat or verify existing layout)
- **Cons:**
- Nix-based DSL (learning curve)
- Limited to Linux filesystems (no Windows support, not relevant here)
**Kickstart/Preseed (Rejected):**
- Used by Fedora/Debian installers
- Not NixOS-native, would require custom integration
**Terraform with Libvirt (Rejected):**
- Good for VMs, not bare metal
- Doesn't handle disk partitioning directly
**Decision:** disko is the clear choice for NixOS deployments.
### 6.3 Boot Methods
**iPXE over TFTP/HTTP (Chosen):**
- **Pros:**
- Universal support (BIOS + UEFI)
- Flexible scripting (boot menus, conditional logic)
- HTTP support for fast downloads
- Open source, widely deployed
- **Cons:**
- Requires DHCP configuration (Option 66/67 setup)
- Chainloading adds complexity (but solved problem)
**UEFI HTTP Boot (Rejected):**
- **Pros:**
- Native UEFI, no TFTP needed
- Simpler DHCP config (just Option 60/67)
- **Cons:**
- UEFI only (no BIOS support)
- Firmware support inconsistent (pre-2015 servers)
- Less flexible than iPXE scripting
**Bootable USB Installer (Rejected):**
- Manual, not scalable for fleet deployment
- Useful for one-off installs only
**Decision:** iPXE for flexibility and compatibility. UEFI HTTP Boot could be considered later for pure UEFI fleets.
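The scripting flexibility cited above looks roughly like this in practice — a minimal iPXE boot menu. Server address, paths, and the `init=` store path are placeholders; the real `init=` argument is produced by the netboot image build.
```
#!ipxe
# Hypothetical boot.ipxe served from the provisioning server
menu PlasmaCloud Provisioning
item nixos  NixOS installer (netboot)
item local  Boot from local disk
choose --default local --timeout 5000 target && goto ${target}

:nixos
# init= must point at the store path baked into the netboot image
kernel http://10.0.0.1:8080/nixos/bzImage init=/nix/store/<hash>-nixos-system/init loglevel=4
initrd http://10.0.0.1:8080/nixos/initrd
boot

:local
exit
```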
### 6.4 Configuration Management
**NixOS Flakes (Chosen):**
- **Pros:**
- Native to NixOS, declarative
- Reproducible builds with lock files
- Git-based, version controlled
- No external agent needed (systemd handles state)
- **Cons:**
- Steep learning curve for operators unfamiliar with Nix
- Less dynamic than Ansible (changes require rebuild)
**Ansible (Rejected for Provisioning, Useful for Orchestration):**
- **Pros:**
- Agentless, SSH-based
- Large ecosystem of modules
- Dynamic, easy to patch running systems
- **Cons:**
- Imperative (harder to guarantee state)
- Doesn't integrate with NixOS packages/modules
- Adds another tool to stack
**Terraform (Rejected):**
- Infrastructure-as-code, not config management
- Better for cloud VMs than bare metal
**Decision:** Use NixOS flakes for provisioning and base config. Ansible may be added later for operational tasks (e.g., rolling updates, health checks) that don't fit NixOS's declarative model.
## 7. Open Questions / Decisions Needed
### 7.1 Hardware Inventory Management
**Question:** How do we map MAC addresses to node roles and configurations?
**Options:**
1. **Manual Inventory File:** Operator maintains JSON/YAML with MAC → hostname → config mapping
2. **Auto-Discovery:** First boot prompts operator to assign role (e.g., via serial console or web UI)
3. **External CMDB:** Integrate with existing Configuration Management Database (e.g., NetBox, Nautobot)
**Recommendation:** Start with manual inventory file (simple), migrate to CMDB integration in Phase 2.
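A minimal manual inventory file might look like the following; field names and values are illustrative, not a fixed schema.
```yaml
# inventory.yaml — MAC → node mapping maintained by the operator (sketch)
nodes:
  - mac: "aa:bb:cc:dd:ee:01"
    hostname: node01
    role: bootstrap        # bootstraps the Chainfire/FlareDB Raft clusters
    ip: 10.0.0.101
  - mac: "aa:bb:cc:dd:ee:02"
    hostname: node02
    role: member           # joins the existing cluster
    ip: 10.0.0.102
```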
### 7.2 Secrets Management
**Question:** How are secrets (TLS keys, passwords) generated, stored, and rotated?
**Options:**
1. **File-Based (Current):** Secrets in `/srv/provisioning/nodes/*/secrets/`, copied during install
2. **Vault Integration:** Fetch secrets from HashiCorp Vault at boot time
3. **systemd Credentials:** Use systemd's encrypted credentials feature (requires systemd 250+)
**Recommendation:** Phase 1 uses file-based (simple, works today). Phase 2 adds Vault for production (centralized, auditable, rotation support).
### 7.3 Network Boot Security
**Question:** How do we prevent rogue nodes from joining the cluster?
**Concerns:**
- Attacker boots unauthorized server on network
- Installer has SSH key, could be accessed
- Node joins cluster with malicious intent
**Mitigations:**
1. **MAC Whitelist:** DHCP only serves known MAC addresses
2. **Network Segmentation:** PXE boot on isolated provisioning VLAN
3. **SSH Key Per Node:** Each node has unique authorized_keys in netboot image (complex)
4. **Cluster Authentication:** Raft join requires cluster token (not yet implemented)
**Recommendation:** Use MAC whitelist + provisioning VLAN for Phase 1. Add cluster join tokens in Phase 2 (requires Chainfire/FlareDB changes).
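As a sketch of mitigation 1, a dnsmasq-based DHCP server can be restricted to known MACs; addresses and the boot-server IP are placeholders.
```
# dnsmasq: tag known machines and ignore everything else (sketch)
dhcp-host=aa:bb:cc:dd:ee:01,set:plasmacloud
dhcp-host=aa:bb:cc:dd:ee:02,set:plasmacloud
dhcp-ignore=tag:!plasmacloud
dhcp-boot=tag:plasmacloud,undionly.kpxe,,10.0.0.1
```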
### 7.4 Multi-Datacenter Deployment
**Question:** How does provisioning work across geographically distributed datacenters?
**Challenges:**
- WAN latency for Nix cache fetches
- PXE boot requires local DHCP/TFTP
- Cluster join across WAN (Raft latency)
**Options:**
1. **Replicated Provisioning Server:** Deploy boot server in each datacenter, sync configs
2. **Central Provisioning with Local Cache:** Single source of truth, local Nix cache mirrors
3. **Per-DC Clusters:** Each datacenter is independent cluster, federated at application layer
**Recommendation:** Defer to Phase 2. Phase 1 assumes single datacenter or low-latency LAN.
### 7.5 Disk Encryption
**Question:** Should disks be encrypted at rest?
**Trade-offs:**
- **Pros:** Compliance (GDPR, PCI-DSS), protection against physical theft
- **Cons:** Key management complexity, unattended reboots require network-bound unlock (Tang/Clevis) or manual intervention, performance overhead (~5-10%)
**Options:**
1. **No Encryption:** Rely on physical security
2. **LUKS with Network Unlock:** Tang/Clevis for automated unlocking (requires network on boot)
3. **LUKS with Manual Unlock:** Operator enters passphrase via KVM/IPMI
**Recommendation:** Optional, configurable per deployment. Provide disko template for LUKS, let operator decide.
### 7.6 Rolling Updates
**Question:** How do we update a running cluster without downtime?
**Challenges:**
- Raft requires quorum (can't update majority simultaneously)
- Service dependencies (Chainfire → FlareDB → others)
- NixOS rebuild requires reboot (for kernel/init changes)
**Strategy:**
1. Update one node at a time (rolling)
2. Verify health before proceeding to next
3. Use `nixos-rebuild test` first (activates without bootloader change), then `switch` after validation
**Tooling:**
- Ansible playbook for orchestration
- Health check scripts (curl service health endpoints, verify Raft status)
- Rollback plan (NixOS generations + Raft snapshot restore)
**Recommendation:** Document as runbook in Phase 1, implement automated rolling update in Phase 2 (T033?).
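The strategy above can be sketched as a loop. `update_node` and `healthy` are stubs standing in for `nixos-rebuild test --target-host …` and a health probe (curl against service endpoints plus a Raft status check); a real runbook or playbook would replace them.
```bash
#!/usr/bin/env sh
# Rolling-update sketch: one node at a time, verify health before moving on.
set -eu
NODES="node01 node02 node03"
UPDATED=""

update_node() { echo "would run: nixos-rebuild test --flake .#$1 --target-host root@$1"; }
healthy()     { true; }  # stub for: curl health endpoints + Raft status check

for node in $NODES; do
  update_node "$node"
  # Never touch the next node until this one is healthy again, so Raft
  # quorum is never put at risk by two simultaneous restarts.
  tries=0
  until healthy "$node"; do
    tries=$((tries + 1))
    [ "$tries" -lt 30 ] || { echo "abort: $node unhealthy"; exit 1; }
    sleep 2
  done
  UPDATED="${UPDATED}${node} "
done
echo "updated: $UPDATED"
```
Aborting on the first unhealthy node (rather than continuing) is deliberate: a partially updated cluster with quorum intact is recoverable; losing quorum is not.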
### 7.7 Monitoring and Alerting
**Question:** How do we monitor provisioning success/failure?
**Options:**
1. **Manual:** Operator watches terminal, checks health endpoints
2. **Log Aggregation:** Collect installer logs, index in Loki/Elasticsearch
3. **Event Webhook:** Installer posts events to monitoring system (Grafana, PagerDuty)
**Recommendation:** Phase 1 uses manual monitoring. Phase 2 adds structured logging + webhooks for fleet deployments.
### 7.8 Compatibility with Existing Infrastructure
**Question:** Can this provisioning system coexist with existing PXE infrastructure (e.g., for other OS deployments)?
**Concerns:**
- Existing DHCP config may conflict
- TFTP server may serve other boot files
- Network team may control PXE infrastructure
**Solutions:**
1. **Dedicated Provisioning VLAN:** PlasmaCloud nodes on separate network
2. **Conditional DHCP:** Use vendor-class or subnet matching to route to correct boot server
3. **Multi-Boot Menu:** iPXE menu includes options for PlasmaCloud and other OSes
**Recommendation:** Document network requirements, provide example DHCP config for common scenarios (dedicated VLAN, shared infrastructure). Coordinate with network team.
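Solution 2 can be expressed as a dnsmasq proxy-DHCP sketch that coexists with an existing DHCP server, selects the boot file by firmware architecture, and detects iPXE (via option 175, which iPXE sends) to break the chainload loop; addresses are placeholders.
```
# dnsmasq proxy-DHCP sketch: answers only PXE requests on this subnet,
# leaving address assignment to the existing DHCP server
dhcp-range=10.0.0.0,proxy
dhcp-match=set:ipxe,175                    # option 175 => client is iPXE
dhcp-match=set:efi64,option:client-arch,7  # x86-64 UEFI firmware
dhcp-boot=tag:!ipxe,tag:!efi64,undionly.kpxe
dhcp-boot=tag:!ipxe,tag:efi64,ipxe.efi
dhcp-boot=tag:ipxe,http://10.0.0.1:8080/boot.ipxe
```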
---
## Appendices
### A. Example Disko Configuration
**Single Disk with GPT and ext4:**
```nix
# nodes/node01/disko.nix
{ disks ? [ "/dev/sda" ], ... }:
{
disko.devices = {
disk = {
main = {
type = "disk";
device = builtins.head disks;
content = {
type = "gpt";
partitions = {
ESP = {
size = "512M";
type = "EF00";
content = {
type = "filesystem";
format = "vfat";
mountpoint = "/boot";
};
};
root = {
size = "100%";
content = {
type = "filesystem";
format = "ext4";
mountpoint = "/";
};
};
};
};
};
};
};
}
```
**RAID1 with LUKS Encryption:**
```nix
{ disks ? [ "/dev/sda" "/dev/sdb" ], ... }:
{
disko.devices = {
disk = {
disk1 = {
device = builtins.elemAt disks 0;
type = "disk";
content = {
type = "gpt";
partitions = {
boot = {
size = "1M";
type = "EF02"; # BIOS boot
};
mdraid = {
size = "100%";
content = {
type = "mdraid";
name = "raid1";
};
};
};
};
};
disk2 = {
device = builtins.elemAt disks 1;
type = "disk";
content = {
type = "gpt";
partitions = {
boot = {
size = "1M";
type = "EF02";
};
mdraid = {
size = "100%";
content = {
type = "mdraid";
name = "raid1";
};
};
};
};
};
};
mdadm = {
raid1 = {
type = "mdadm";
level = 1;
content = {
type = "luks";
name = "cryptroot";
settings.allowDiscards = true;
content = {
type = "filesystem";
format = "ext4";
mountpoint = "/";
};
};
};
};
};
}
```
### B. Complete nixos-anywhere Command Examples
**Basic Deployment:**
```bash
nix run github:nix-community/nixos-anywhere -- \
--flake .#node01 \
root@10.0.0.100
```
**With Build on Remote (Slow Local Machine):**
```bash
nix run github:nix-community/nixos-anywhere -- \
--flake .#node01 \
--build-on-remote \
root@10.0.0.100
```
**With Disk Encryption Key:**
```bash
nix run github:nix-community/nixos-anywhere -- \
--flake .#node01 \
--disk-encryption-keys /tmp/luks.key <(cat /secrets/node01-luks.key) \
root@10.0.0.100
```
**Debug Mode (Keep Installer After Failure):**
```bash
nix run github:nix-community/nixos-anywhere -- \
--flake .#node01 \
--debug \
--no-reboot \
root@10.0.0.100
```
### C. Provisioning Server Setup Script
```bash
#!/bin/bash
# /srv/provisioning/scripts/setup-provisioning-server.sh
set -euo pipefail
# Install dependencies
apt-get update
apt-get install -y nginx tftpd-hpa dnsmasq curl
# Configure TFTP
cat > /etc/default/tftpd-hpa <<EOF
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/srv/boot/tftp"
TFTP_ADDRESS="0.0.0.0:69"
TFTP_OPTIONS="--secure"
EOF
mkdir -p /srv/boot/tftp
systemctl restart tftpd-hpa
# Download iPXE binaries
curl -L https://boot.ipxe.org/undionly.kpxe -o /srv/boot/tftp/undionly.kpxe
curl -L https://boot.ipxe.org/ipxe.efi -o /srv/boot/tftp/ipxe.efi
# Configure nginx for HTTP boot
cat > /etc/nginx/sites-available/pxe <<EOF
server {
listen 8080;
server_name _;
root /srv/boot;
location / {
autoindex on;
try_files \$uri \$uri/ =404;
}
# Enable range requests for large files
location ~* \.(iso|img|bin|efi|kpxe)$ {
add_header Accept-Ranges bytes;
}
}
EOF
ln -sf /etc/nginx/sites-available/pxe /etc/nginx/sites-enabled/
systemctl restart nginx
# Create directory structure
mkdir -p /srv/boot/{nixos,nix-cache,scripts}
mkdir -p /srv/provisioning/{nodes,profiles,common,scripts}
echo "Provisioning server setup complete!"
echo "Next steps:"
echo "1. Configure DHCP server (see design doc Section 2.2)"
echo "2. Build NixOS netboot image (see Section 3.1)"
echo "3. Create node configurations (see Section 3.2)"
```
### D. First-Boot Cluster Config JSON Schema
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Cluster Configuration",
"type": "object",
"properties": {
"node_id": {
"type": "string",
"description": "Unique identifier for this node"
},
"bootstrap": {
"type": "boolean",
"description": "True if this node should bootstrap a new cluster"
},
"leader_url": {
"type": "string",
"format": "uri",
"description": "URL of existing cluster leader (for join)"
},
"raft_addr": {
"type": "string",
"description": "This node's Raft address (host:port)"
},
"cluster_token": {
"type": "string",
"description": "Shared secret for cluster authentication (future)"
}
},
"required": ["node_id", "bootstrap"],
"if": {
"properties": { "bootstrap": { "const": false } }
},
"then": {
"required": ["leader_url", "raft_addr"]
}
}
```
**Example for bootstrap node:**
```json
{
"node_id": "node01",
"bootstrap": true,
"raft_addr": "node01.example.com:2380"
}
```
**Example for joining node:**
```json
{
"node_id": "node04",
"bootstrap": false,
"leader_url": "https://node01.example.com:2379",
"raft_addr": "node04.example.com:2380"
}
```
### E. References and Further Reading
**Primary Documentation:**
- [nixos-anywhere Quickstart](https://nix-community.github.io/nixos-anywhere/quickstart.html)
- [disko Documentation](https://github.com/nix-community/disko)
- [iPXE Examples](https://ipxe.org/examples)
- [NixOS Netboot](https://nixos.wiki/wiki/Netboot)
**Technical Specifications:**
- [RFC 4578 - DHCP Options for PXE](https://www.rfc-editor.org/rfc/rfc4578)
- [UEFI HTTP Boot Specification](https://uefi.org/specs/UEFI/2.10/32_Network_Protocols.html#http-boot)
**Community Resources:**
- [NixOS Discourse - Netboot Discussions](https://discourse.nixos.org/tag/netboot)
- [nixos-anywhere Examples](https://github.com/nix-community/nixos-anywhere/tree/main/docs)
**Related Blog Posts:**
- [iPXE Booting with NixOS (2024)](https://carlosvaz.com/posts/ipxe-booting-with-nixos/)
- [Remote Deployment with nixos-anywhere and disko](https://mich-murphy.com/nixos-anywhere-and-disko/)
---
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 0.1 | 2025-12-10 | peerB | Initial draft |
---
**End of Design Document**