PlasmaCloud Netboot Image Builder - Technical Overview

Introduction

This document provides a technical overview of the PlasmaCloud NixOS Image Builder, which generates bootable netboot images for bare-metal provisioning. This is part of T032 (Bare-Metal Provisioning) and specifically implements deliverable S3 (NixOS Image Builder).

System Architecture

High-Level Flow

┌─────────────────────┐
│  Nix Flake          │
│  (flake.nix)        │
└──────────┬──────────┘
           │
           ├─── nixosConfigurations
           │    ├── netboot-control-plane
           │    ├── netboot-worker
           │    └── netboot-all-in-one
           │
           ├─── packages (T024)
           │    ├── chainfire-server
           │    ├── flaredb-server
           │    └── ... (8 services)
           │
           └─── modules (T024)
                ├── chainfire.nix
                ├── flaredb.nix
                └── ... (8 modules)

         Build Process
              ↓

┌─────────────────────┐
│  build-images.sh    │
└──────────┬──────────┘
           │
           ├─── nix build netbootRamdisk
           ├─── nix build kernel
           └─── copy to artifacts/

         Output
              ↓

┌─────────────────────┐
│  Netboot Artifacts  │
├─────────────────────┤
│  bzImage (kernel)   │
│  initrd (ramdisk)   │
│  netboot.ipxe       │
└─────────────────────┘
           │
           ├─── PXE Server
           │    (HTTP/TFTP)
           │
           └─── Target Machine
                (PXE Boot)

Component Breakdown

1. Netboot Configurations

Located in nix/images/, these NixOS configurations define the netboot environment:

netboot-base.nix

Purpose: Common base configuration for all profiles

Key Features:

  • Extends netboot-minimal.nix from nixpkgs
  • SSH server with root login (key-based only)
  • Generic kernel with broad hardware support
  • Disk management tools (disko, parted, cryptsetup, lvm2)
  • Network configuration (DHCP, predictable interface names)
  • Serial console support (ttyS0, tty0)
  • Minimal system (no docs, no sound)

Package Inclusions:

disko, parted, gptfdisk     # Disk management
cryptsetup, lvm2            # Encryption and LVM
e2fsprogs, xfsprogs         # Filesystem tools
iproute2, curl, tcpdump     # Network tools
vim, tmux, htop             # System tools

Kernel Configuration:

boot.kernelPackages = pkgs.linuxPackages_latest;
boot.kernelParams = [
  "console=ttyS0,115200"
  "console=tty0"
  "loglevel=4"
];

netboot-control-plane.nix

Purpose: Full control plane deployment

Imports:

  • netboot-base.nix (base configuration)
  • ../modules (PlasmaCloud service modules)

Service Inclusions:

  • Chainfire (ports 2379, 2380, 2381)
  • FlareDB (ports 2479, 2480)
  • IAM (port 8080)
  • PlasmaVMC (port 8081)
  • NovaNET (port 8082)
  • FlashDNS (port 53)
  • FiberLB (port 8083)
  • LightningStor (port 8084)
  • K8sHost (port 8085)

Service State: All services disabled by default via lib.mkDefault false

Resource Limits (for netboot environment):

# systemd serviceConfig attributes
MemoryMax = "512M";
CPUQuota = "50%";

netboot-worker.nix

Purpose: Compute-focused worker nodes

Imports:

  • netboot-base.nix
  • ../modules

Service Inclusions:

  • PlasmaVMC (VM management)
  • NovaNET (SDN)

Additional Features:

  • KVM virtualization support
  • Open vSwitch for SDN
  • QEMU and libvirt tools
  • Optimized sysctl for VM workloads
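The sysctl tuning for this profile can be spot-checked on a booted worker. The following is a minimal sketch, assuming the three values listed under Performance Tuning; `check_sysctl`, `check_worker_sysctls`, and the `SYSCTL_ROOT` override are hypothetical names introduced here and are not part of the repository.

```shell
# Compare one kernel setting against an expected value.
# SYSCTL_ROOT is a hypothetical override so the check can run against a test tree.
check_sysctl() {
  local key="$1" expected="$2" path actual
  # fs.file-max -> fs/file-max (only valid for keys without dots inside a component)
  path="${SYSCTL_ROOT:-/proc/sys}/$(printf '%s' "$key" | tr . /)"
  if [ ! -r "$path" ]; then
    echo "cannot read $path" >&2
    return 2
  fi
  actual=$(cat "$path") || return 2
  if [ "$actual" != "$expected" ]; then
    echo "$key: expected $expected, got $actual" >&2
    return 1
  fi
}

# Check the three values the worker profile documents.
check_worker_sysctls() {
  check_sysctl fs.file-max 1000000 &&
    check_sysctl net.ipv4.ip_forward 1 &&
    check_sysctl net.core.netdev_max_backlog 5000
}
```

Running `check_worker_sysctls` over SSH on a netbooted worker would report the first setting that differs from the documented value.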

Performance Tuning:

boot.kernel.sysctl = {
  "fs.file-max" = 1000000;
  "net.ipv4.ip_forward" = 1;
  "net.core.netdev_max_backlog" = 5000;
};

netboot-all-in-one.nix

Purpose: Single-node deployment with all services

Imports:

  • netboot-base.nix
  • ../modules

Combines: All features from control-plane + worker

Use Cases:

  • Development environments
  • Small deployments
  • Edge locations
  • POC installations

2. Flake Integration

The main flake.nix exposes netboot configurations:

nixosConfigurations = {
  netboot-control-plane = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./nix/images/netboot-control-plane.nix ];
  };

  netboot-worker = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./nix/images/netboot-worker.nix ];
  };

  netboot-all-in-one = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./nix/images/netboot-all-in-one.nix ];
  };
};

3. Build Script

build-images.sh orchestrates the build process:

Workflow:

  1. Parse command-line arguments (--profile, --output-dir)
  2. Create output directories
  3. For each profile:
    • Build netboot ramdisk: nix build ...netbootRamdisk
    • Build kernel: nix build ...kernel
    • Copy artifacts (bzImage, initrd)
    • Generate iPXE boot script
    • Calculate and display sizes
  4. Verify outputs (file existence, size sanity checks)
  5. Copy to PXE server (if available)
  6. Print summary
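Step 1 of the workflow can be sketched as a small argument parser. This is an illustration only, not the actual contents of build-images.sh: the function name `parse_args`, the variables `PROFILES` and `OUTPUT_DIR`, and both default values are assumptions.

```shell
# Parse --profile and --output-dir into PROFILES and OUTPUT_DIR.
# Defaults are assumptions: build every profile into ./artifacts.
parse_args() {
  PROFILES="control-plane worker all-in-one"
  OUTPUT_DIR="./artifacts"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --profile|--output-dir)
        if [ "$#" -lt 2 ]; then
          echo "error: $1 requires a value" >&2
          return 2
        fi
        if [ "$1" = "--profile" ]; then
          PROFILES="$2"
        else
          OUTPUT_DIR="$2"
        fi
        shift 2
        ;;
      *)
        echo "error: unknown argument: $1" >&2
        return 2
        ;;
    esac
  done
}
```

Rejecting unknown flags with a non-zero status keeps a typo such as `--profiles` from silently building every profile.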

Build Commands:

nix build .#nixosConfigurations.netboot-$profile.config.system.build.netbootRamdisk
nix build .#nixosConfigurations.netboot-$profile.config.system.build.kernel
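Run back to back, these two commands both write the default `./result` symlink, so the second replaces the first. A minimal sketch of a per-profile build that avoids this with `--out-link` follows. It assumes the link names `initrd-link` and `kernel-link` from the output listing and that the build results contain `initrd` and `bzImage`; the function name `build_profile` is hypothetical.

```shell
# Build one profile and copy its artifacts next to the result symlinks.
build_profile() {
  local profile="$1" out_dir="$2" dir attr
  dir="$out_dir/$profile"
  attr=".#nixosConfigurations.netboot-${profile}.config.system.build"

  mkdir -p "$dir" || return 1

  # Separate out-links keep the two results from sharing ./result.
  nix build "${attr}.netbootRamdisk" --out-link "$dir/initrd-link" || return 1
  nix build "${attr}.kernel" --out-link "$dir/kernel-link" || return 1

  # install copies file contents (following symlinks) and makes the copies
  # writable, so a later rebuild can overwrite them.
  install -m 0644 "$dir/initrd-link/initrd" "$dir/initrd" || return 1
  install -m 0644 "$dir/kernel-link/bzImage" "$dir/bzImage" || return 1
}
```

A call such as `build_profile worker ./artifacts` would then populate `artifacts/worker/`.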

Output Structure:

artifacts/
├── control-plane/
│   ├── bzImage          # ~10-30 MB
│   ├── initrd           # ~100-300 MB
│   ├── netboot.ipxe     # iPXE script
│   ├── build.log        # Build log
│   ├── initrd-link      # Nix result symlink
│   └── kernel-link      # Nix result symlink
├── worker/
│   └── ... (same structure)
└── all-in-one/
    └── ... (same structure)
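The verification step (file existence and size sanity checks) could look like the sketch below. The function names and the `MIN_KERNEL_BYTES` / `MIN_INITRD_BYTES` thresholds are placeholders chosen for illustration, not values taken from build-images.sh.

```shell
# Fail if an artifact is missing or smaller than a minimum size in bytes.
verify_artifact() {
  local path="$1" min_bytes="$2" size
  if [ ! -f "$path" ]; then
    echo "missing artifact: $path" >&2
    return 1
  fi
  size=$(wc -c < "$path") || return 1
  if [ "$size" -lt "$min_bytes" ]; then
    echo "suspiciously small: $path ($size bytes, expected >= $min_bytes)" >&2
    return 1
  fi
}

# Check both artifacts of one profile directory.
# Default thresholds (1 MiB kernel, 10 MiB initrd) are assumptions.
verify_profile() {
  local dir="$1"
  verify_artifact "$dir/bzImage" "${MIN_KERNEL_BYTES:-1048576}" &&
    verify_artifact "$dir/initrd" "${MIN_INITRD_BYTES:-10485760}"
}
```

A lower bound catches truncated copies and empty files; an upper bound is a separate concern that belongs with size tracking.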

Integration Points

T024 NixOS Modules

The netboot configurations leverage T024 service modules:

Module Structure (example: chainfire.nix):

{ config, lib, pkgs, ... }:
let
  cfg = config.services.chainfire;  # shorthand used by mkIf below
in
{
  options.services.chainfire = {
    enable = lib.mkEnableOption "chainfire service";
    port = lib.mkOption { ... };
    raftPort = lib.mkOption { ... };
    package = lib.mkOption { ... };
  };

  config = lib.mkIf cfg.enable {
    users.users.chainfire = { ... };
    systemd.services.chainfire = { ... };
  };
}

Package Availability:

# In netboot-control-plane.nix
environment.systemPackages = with pkgs; [
  chainfire-server  # From flake overlay
  flaredb-server    # From flake overlay
  # ...
];

T032.S2 PXE Infrastructure

The build script integrates with the PXE server:

Copy Workflow:

# Build script copies to:
chainfire/baremetal/pxe-server/assets/nixos/
├── control-plane/
│   ├── bzImage
│   └── initrd
├── worker/
│   ├── bzImage
│   └── initrd
└── all-in-one/
    ├── bzImage
    └── initrd
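The copy step is described as conditional ("if available"). One way to sketch that behaviour, assuming the directory layout above; `copy_to_pxe` is a hypothetical name, and the caller supplies the PXE asset directory.

```shell
# Copy bzImage and initrd for every built profile into the PXE asset tree.
# A missing PXE directory is treated as "no PXE server here" and skipped.
copy_to_pxe() {
  local artifacts_dir="$1" pxe_dir="$2" profile
  if [ ! -d "$pxe_dir" ]; then
    echo "PXE asset directory not found, skipping copy: $pxe_dir" >&2
    return 0
  fi
  for profile in control-plane worker all-in-one; do
    # Only copy profiles that were actually built.
    [ -d "$artifacts_dir/$profile" ] || continue
    mkdir -p "$pxe_dir/$profile" || return 1
    install -m 0644 "$artifacts_dir/$profile/bzImage" "$pxe_dir/$profile/bzImage" || return 1
    install -m 0644 "$artifacts_dir/$profile/initrd" "$pxe_dir/$profile/initrd" || return 1
  done
}
```

Returning success when the directory is absent lets the same script run on build machines that do not host the PXE server.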

iPXE Boot Script (generated):

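#!ipxe
# The * below stands for the system closure's store path, which the build
# writes out in full; the kernel command line does not expand globs.
kernel ${boot-server}/control-plane/bzImage init=/nix/store/*/init console=ttyS0,115200
initrd ${boot-server}/control-plane/initrd
boot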
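Generating this script from the build could be sketched as follows. It assumes the caller passes the store path of the profile's system closure (`config.system.build.toplevel`) as the second argument; `generate_ipxe` is a hypothetical helper name.

```shell
# Print an iPXE script for one profile.
# $1: profile name, $2: store path of the system closure.
generate_ipxe() {
  local profile="$1" toplevel="$2"
  # \${boot-server} stays literal: iPXE, not the shell, expands it at boot time.
  cat <<EOF
#!ipxe
kernel \${boot-server}/${profile}/bzImage init=${toplevel}/init console=ttyS0,115200
initrd \${boot-server}/${profile}/initrd
boot
EOF
}
```

Redirecting the output, for example `generate_ipxe worker "$toplevel" > artifacts/worker/netboot.ipxe`, would produce the `netboot.ipxe` file shown in the output structure.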

Build Process Deep Dive

NixOS Netboot Build Internals

  1. netboot-minimal.nix (from nixpkgs):

    • Provides base netboot functionality
    • Configures initrd with kexec support
    • Sets up squashfs for Nix store
  2. Our Extensions:

    • Add PlasmaCloud service packages
    • Configure SSH for nixos-anywhere
    • Include provisioning tools (disko, etc.)
    • Customize kernel and modules
  3. Build Outputs:

    • bzImage: Compressed Linux kernel
    • initrd: Initial ramdisk (cpio archive) embedding a squashfs image of the Nix store, containing:
      • Minimal NixOS system
      • Nix store with service packages
      • Init scripts for booting

Size Optimization Strategies

Current Optimizations:

documentation.enable = false;          # -50MB
documentation.nixos.enable = false;    # -20MB
i18n.supportedLocales = [ "en_US.UTF-8/UTF-8" ];   # -100MB

Additional Strategies (if needed):

  • Evaluate linuxPackages_hardened (may yield a smaller kernel; measure before adopting)
  • Remove unused kernel modules
  • Compress with xz instead of gzip
  • On-demand package fetching from HTTP substituter

Expected Sizes:

  • Control Plane: ~250-350 MB (initrd)
  • Worker: ~150-250 MB (initrd)
  • All-in-One: ~300-400 MB (initrd)
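These ranges can serve as a size budget in the build. The sketch below checks an initrd against a budget supplied by the caller. `check_initrd_budget` is a hypothetical helper, and it counts in MiB, which differs slightly from the MB figures above.

```shell
# Report an initrd's size and fail if it exceeds the given budget (in MiB).
check_initrd_budget() {
  local initrd="$1" budget_mb="$2" bytes size_mb
  if [ ! -f "$initrd" ]; then
    echo "missing: $initrd" >&2
    return 2
  fi
  bytes=$(wc -c < "$initrd") || return 2
  # Round up to whole MiB so a file slightly over budget still fails.
  size_mb=$(( (bytes + 1048575) / 1048576 ))
  echo "$initrd: ${size_mb} MiB (budget ${budget_mb} MiB)"
  [ "$size_mb" -le "$budget_mb" ]
}
```

For example, `check_initrd_budget artifacts/worker/initrd 250` would flag a worker image that has grown past the upper end of its expected range.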

Boot Flow

From PXE to Running System

1. PXE Boot
   ├─ DHCP discovers boot server
   ├─ TFTP loads iPXE binary
   └─ iPXE executes boot script

2. Netboot Download
   ├─ HTTP downloads bzImage (~20MB)
   ├─ HTTP downloads initrd (~200MB)
   └─ iPXE boots the kernel into the NixOS installer

3. NixOS Installer (in RAM)
   ├─ Init system starts
   ├─ Network configuration (DHCP)
   ├─ SSH server starts
   └─ Ready for nixos-anywhere

4. Installation (nixos-anywhere)
   ├─ SSH connection established
   ├─ Disk partitioning (disko)
   ├─ NixOS system installation
   ├─ Secret injection
   └─ Bootloader installation

5. First Boot (from disk)
   ├─ GRUB/systemd-boot loads
   ├─ Services start (enabled)
   ├─ Cluster join (if configured)
   └─ Running PlasmaCloud node
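Between stages 3 and 4, the provisioning host has to wait until the installer's SSH server answers. A minimal polling sketch follows; `ssh_probe` and `wait_for_ssh` are hypothetical names. Host key checking is relaxed here on the assumption that the in-RAM installer generates a fresh host key on every boot, which is only acceptable on an isolated provisioning network.

```shell
# Return 0 if a key-based root login to the installer succeeds.
ssh_probe() {
  ssh -o BatchMode=yes -o ConnectTimeout=5 \
      -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
      "root@$1" true 2>/dev/null
}

# Poll until SSH answers or the attempts run out.
# $1: host, $2: max attempts (default 60), $3: delay in seconds (default 5)
wait_for_ssh() {
  local host="$1" attempts="${2:-60}" delay="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    if ssh_probe "$host"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for SSH on $host" >&2
  return 1
}
```

With the defaults, `wait_for_ssh <host>` gives the target about five minutes to download the images and boot before giving up.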

Customization Guide

Adding a New Service

Step 1: Create NixOS module

# nix/modules/myservice.nix
{ config, lib, pkgs, ... }:
let
  cfg = config.services.myservice;  # shorthand used by mkIf below
in
{
  options.services.myservice = {
    enable = lib.mkEnableOption "myservice";
  };

  config = lib.mkIf cfg.enable {
    systemd.services.myservice = { ... };
  };
}

Step 2: Add to flake packages

# flake.nix
packages.myservice-server = buildRustWorkspace { ... };

Step 3: Include in netboot profile

# nix/images/netboot-control-plane.nix
environment.systemPackages = with pkgs; [
  myservice-server
];

services.myservice = {
  enable = lib.mkDefault false;
};

Creating a Custom Profile

Step 1: Create new netboot configuration

# nix/images/netboot-custom.nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ./netboot-base.nix
    ../modules
  ];

  # Your customizations
  environment.systemPackages = [ ... ];
}

Step 2: Add to flake

# flake.nix
nixosConfigurations.netboot-custom = nixpkgs.lib.nixosSystem {
  system = "x86_64-linux";
  modules = [ ./nix/images/netboot-custom.nix ];
};

Step 3: Update build script

# build-images.sh
profiles_to_build=("control-plane" "worker" "all-in-one" "custom")

Security Model

Netboot Phase

Risk: Netboot image has root SSH access enabled

Mitigations:

  1. Key-based authentication only (no passwords)
  2. Isolated provisioning VLAN
  3. MAC address whitelist in DHCP
  4. Firewall disabled only during install
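Mitigation 1 can be verified on a booted installer from the effective sshd configuration. The sketch below reads the output of `sshd -T` on standard input; `check_sshd_hardening` is a hypothetical name, and the check covers only the two settings shown, not every authentication method.

```shell
# Succeed only if password logins are off and root may log in by key only.
# Expects `sshd -T` output (lowercase keyword, then value) on stdin.
check_sshd_hardening() {
  awk '
    $1 == "passwordauthentication" { pw = $2 }
    $1 == "permitrootlogin"        { root = $2 }
    END {
      # sshd -T may print either spelling for key-only root login.
      ok = (pw == "no") && (root == "prohibit-password" || root == "without-password")
      exit (ok ? 0 : 1)
    }'
}
```

A typical invocation would be `ssh root@target sshd -T | check_sshd_hardening`, run as part of the PXE boot test.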

Post-Installation

Services remain disabled until final configuration enables them:

# In installed system configuration
services.chainfire.enable = true;  # Overrides lib.mkDefault false

Secret Management

Secrets are NOT embedded in netboot images:

# During nixos-anywhere installation (shell), staged on the installer:
scp secrets/* root@target:/tmp/secrets/

# Installed system references (Nix); the staged files must end up at this path:
services.chainfire.settings.tls = {
  cert_path = "/etc/nixos/secrets/tls-cert.pem";
};
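One way to get secrets to the path the installed system references is to stage them in a local directory tree that mirrors the target filesystem. This is a sketch under assumptions: `stage_secrets` is a hypothetical helper, and handing the tree to nixos-anywhere through its `--extra-files` option is an assumed workflow, not one the build script is documented to use.

```shell
# Copy secrets into <staging_root>/etc/nixos/secrets with tight permissions.
stage_secrets() {
  local src_dir="$1" staging_root="$2" dest f
  dest="$staging_root/etc/nixos/secrets"
  mkdir -p "$dest" || return 1
  # Restrict only the secrets directory; parent directories keep normal
  # modes so the copy does not tighten /etc on the target.
  chmod 0700 "$dest" || return 1
  for f in "$src_dir"/*; do
    [ -f "$f" ] || continue
    install -m 0600 "$f" "$dest/$(basename "$f")" || return 1
  done
}
```

The staged tree never enters the Nix store or the netboot image, which keeps the "secrets are not embedded" property intact.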

Performance Characteristics

Build Times

  • First build: 30-60 minutes (downloads all dependencies)
  • Incremental builds: 5-15 minutes (reuses cached artifacts)
  • With local cache: 2-5 minutes

Network Requirements

  • Initial download: ~2GB (nixpkgs + dependencies)
  • Netboot download: ~200-400MB per node
  • Installation: ~500MB-2GB (depending on services)

Hardware Requirements

Build Machine:

  • CPU: 4+ cores recommended
  • RAM: 8GB minimum, 16GB recommended
  • Disk: 50GB free space
  • Network: Broadband connection

Target Machine:

  • RAM: 4GB minimum for netboot (8GB+ for production)
  • Network: PXE boot support, DHCP
  • Disk: Depends on disko configuration

Testing Strategy

Verification Steps

  1. Syntax Validation:

    nix flake check
    
  2. Build Test:

    ./build-images.sh --profile control-plane
    
  3. Artifact Verification:

    file artifacts/control-plane/bzImage  # Should be Linux kernel
    file artifacts/control-plane/initrd   # Should be compressed data
    
  4. PXE Boot Test:

    • Boot VM from netboot image
    • Verify SSH access
    • Check available tools (disko, parted, etc.)
  5. Installation Test:

    • Run nixos-anywhere on test target
    • Verify successful installation
    • Check service availability

Troubleshooting Matrix

| Symptom | Possible Cause | Solution |
|---|---|---|
| Build fails | Missing flakes | Enable experimental-features |
| Large initrd | Too many packages | Remove unused packages |
| SSH fails | Wrong SSH key | Update authorized_keys |
| Boot hangs | Wrong kernel params | Check console= settings |
| No network | DHCP issues | Verify useDHCP = true |
| Service missing | Package not built | Check flake overlay |

Future Enhancements

Planned Improvements

  1. Image Variants:

    • Minimal installer (no services)
    • Debug variant (with extra tools)
    • Rescue mode (for recovery)
  2. Build Optimizations:

    • Parallel profile builds
    • Incremental rebuild detection
    • Binary cache integration
  3. Security Enhancements:

    • Per-node SSH keys
    • TPM-based secrets
    • Measured boot support
  4. Monitoring:

    • Build metrics collection
    • Size trend tracking
    • Performance benchmarking

Revision History

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-12-10 | T032.S3 | Initial implementation |