# PlasmaCloud Netboot Image Builder - Technical Overview
## Introduction
This document provides a technical overview of the PlasmaCloud NixOS Image Builder, which generates bootable netboot images for bare-metal provisioning. This is part of T032 (Bare-Metal Provisioning) and specifically implements deliverable S3 (NixOS Image Builder).
## System Architecture
### High-Level Flow
```
┌─────────────────────┐
│ Nix Flake           │
│ (flake.nix)         │
└──────────┬──────────┘
           ├─── nixosConfigurations
           │    ├── netboot-control-plane
           │    ├── netboot-worker
           │    └── netboot-all-in-one
           ├─── packages (T024)
           │    ├── chainfire-server
           │    ├── flaredb-server
           │    └── ... (8 services)
           └─── modules (T024)
                ├── chainfire.nix
                ├── flaredb.nix
                └── ... (8 modules)

Build Process
┌─────────────────────┐
│ build-images.sh     │
└──────────┬──────────┘
           ├─── nix build netbootRamdisk
           ├─── nix build kernel
           └─── copy to artifacts/

Output
┌─────────────────────┐
│ Netboot Artifacts   │
├─────────────────────┤
│ bzImage (kernel)    │
│ initrd (ramdisk)    │
│ netboot.ipxe        │
└─────────────────────┘
           ├─── PXE Server
           │    (HTTP/TFTP)
           └─── Target Machine
                (PXE Boot)
```
## Component Breakdown
### 1. Netboot Configurations
Located in `nix/images/`, these NixOS configurations define the netboot environment:
#### `netboot-base.nix`
**Purpose**: Common base configuration for all profiles
**Key Features**:
- Extends `netboot-minimal.nix` from nixpkgs
- SSH server with root login (key-based only)
- Generic kernel with broad hardware support
- Disk management tools (disko, parted, cryptsetup, lvm2)
- Network configuration (DHCP, predictable interface names)
- Serial console support (ttyS0, tty0)
- Minimal system (no docs, no sound)
**Package Inclusions**:
```nix
environment.systemPackages = with pkgs; [
  disko parted gptfdisk      # Disk management
  cryptsetup lvm2            # Encryption and LVM
  e2fsprogs xfsprogs         # Filesystem tools
  iproute2 curl tcpdump      # Network tools
  vim tmux htop              # System tools
];
```
**Kernel Configuration**:
```nix
boot.kernelPackages = pkgs.linuxPackages_latest;
boot.kernelParams = [
  "console=ttyS0,115200"
  "console=tty0"
  "loglevel=4"
];
```
#### `netboot-control-plane.nix`
**Purpose**: Full control plane deployment
**Imports**:
- `netboot-base.nix` (base configuration)
- `../modules` (PlasmaCloud service modules)
**Service Inclusions**:
- Chainfire (ports 2379, 2380, 2381)
- FlareDB (ports 2479, 2480)
- IAM (port 8080)
- PlasmaVMC (port 8081)
- NovaNET (port 8082)
- FlashDNS (port 53)
- FiberLB (port 8083)
- LightningStor (port 8084)
- K8sHost (port 8085)
**Service State**: All services **disabled** by default via `lib.mkDefault false`
**Resource Limits** (for netboot environment):
```nix
# systemd resource-control settings, set in each service's serviceConfig
MemoryMax = "512M";
CPUQuota = "50%";
```
#### `netboot-worker.nix`
**Purpose**: Compute-focused worker nodes
**Imports**:
- `netboot-base.nix`
- `../modules`
**Service Inclusions**:
- PlasmaVMC (VM management)
- NovaNET (SDN)
**Additional Features**:
- KVM virtualization support
- Open vSwitch for SDN
- QEMU and libvirt tools
- Optimized sysctl for VM workloads
**Performance Tuning**:
```nix
boot.kernel.sysctl = {
  "fs.file-max" = 1000000;
  "net.ipv4.ip_forward" = 1;
  "net.core.netdev_max_backlog" = 5000;
};
```
#### `netboot-all-in-one.nix`
**Purpose**: Single-node deployment with all services
**Imports**:
- `netboot-base.nix`
- `../modules`
**Combines**: All features from control-plane + worker
**Use Cases**:
- Development environments
- Small deployments
- Edge locations
- POC installations
### 2. Flake Integration
The main `flake.nix` exposes netboot configurations:
```nix
nixosConfigurations = {
  netboot-control-plane = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./nix/images/netboot-control-plane.nix ];
  };
  netboot-worker = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./nix/images/netboot-worker.nix ];
  };
  netboot-all-in-one = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./nix/images/netboot-all-in-one.nix ];
  };
};
```
### 3. Build Script
`build-images.sh` orchestrates the build process:
**Workflow**:
1. Parse command-line arguments (--profile, --output-dir)
2. Create output directories
3. For each profile:
   - Build netboot ramdisk: `nix build ...netbootRamdisk`
   - Build kernel: `nix build ...kernel`
   - Copy artifacts (bzImage, initrd)
   - Generate iPXE boot script
   - Calculate and display sizes
4. Verify outputs (file existence, size sanity checks)
5. Copy to PXE server (if available)
6. Print summary
**Build Commands**:
```bash
nix build .#nixosConfigurations.netboot-$profile.config.system.build.netbootRamdisk
nix build .#nixosConfigurations.netboot-$profile.config.system.build.kernel
```
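The per-profile part of this workflow could look roughly like the sketch below. It is not the actual contents of `build-images.sh`: the function name `build_profile` and the `NIX` override variable are hypothetical, and the copy step assumes the ramdisk and kernel outputs contain files named `initrd` and `bzImage`, which is how the NixOS netboot outputs are laid out on x86_64.

```bash
#!/usr/bin/env bash
# Sketch only: build one netboot profile and copy its artifacts.
# NIX is a hypothetical override so the function can be exercised
# without a real Nix installation (it defaults to the real `nix`).
set -eu

NIX="${NIX:-nix}"

build_profile() {
    local profile="$1" out_dir="$2/$1"
    local attr=".#nixosConfigurations.netboot-$profile.config.system.build"

    mkdir -p "$out_dir"

    # Keep the result symlinks next to the artifacts (initrd-link, kernel-link).
    "$NIX" build "$attr.netbootRamdisk" --out-link "$out_dir/initrd-link"
    "$NIX" build "$attr.kernel" --out-link "$out_dir/kernel-link"

    # Copy real files out of the store so the PXE server does not
    # depend on the build machine's /nix/store.
    cp -L "$out_dir/initrd-link/initrd" "$out_dir/initrd"
    cp -L "$out_dir/kernel-link/bzImage" "$out_dir/bzImage"
}
```

A caller would loop over the selected profiles, for example `for p in control-plane worker all-in-one; do build_profile "$p" artifacts; done`.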
**Output Structure**:
```
artifacts/
├── control-plane/
│   ├── bzImage        # ~10-30 MB
│   ├── initrd         # ~100-300 MB
│   ├── netboot.ipxe   # iPXE script
│   ├── build.log      # Build log
│   ├── initrd-link    # Nix result symlink
│   └── kernel-link    # Nix result symlink
├── worker/
│   └── ... (same structure)
└── all-in-one/
    └── ... (same structure)
```
## Integration Points
### T024 NixOS Modules
The netboot configurations leverage T024 service modules:
**Module Structure** (example: chainfire.nix):
```nix
{ config, lib, ... }:
let
  cfg = config.services.chainfire;
in
{
  options.services.chainfire = {
    enable = lib.mkEnableOption "chainfire service";
    port = lib.mkOption { ... };
    raftPort = lib.mkOption { ... };
    package = lib.mkOption { ... };
  };
  config = lib.mkIf cfg.enable {
    users.users.chainfire = { ... };
    systemd.services.chainfire = { ... };
  };
}
```
**Package Availability**:
```nix
# In netboot-control-plane.nix
environment.systemPackages = with pkgs; [
  chainfire-server   # From flake overlay
  flaredb-server     # From flake overlay
  # ...
];
```
### T032.S2 PXE Infrastructure
The build script integrates with the PXE server:
**Copy Workflow**:
```bash
# Build script copies to:
chainfire/baremetal/pxe-server/assets/nixos/
├── control-plane/
│   ├── bzImage
│   └── initrd
├── worker/
│   ├── bzImage
│   └── initrd
└── all-in-one/
    ├── bzImage
    └── initrd
```
**iPXE Boot Script** (generated):
```ipxe
#!ipxe
# `*` stands for the concrete system store path; the kernel does not expand globs in init=
kernel ${boot-server}/control-plane/bzImage init=/nix/store/*/init console=ttyS0,115200
initrd ${boot-server}/control-plane/initrd
boot
```
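Generating this script can be sketched as a small shell function. The function name `gen_ipxe` is hypothetical, and how the real build script obtains the init path is not described here; the sketch simply takes it as an argument (one plausible source is the store path of the profile's `system.build.toplevel` output).

```bash
# Sketch only: emit an iPXE script for one profile on stdout.
#   $1 = profile name (e.g. control-plane)
#   $2 = absolute init path inside the image (a concrete /nix/store/.../init)
gen_ipxe() {
    local profile="$1" init_path="$2"
    printf '#!ipxe\n'
    # ${boot-server} is an iPXE variable, so it is single-quoted here
    # to reach the output file unexpanded by the shell.
    printf 'kernel ${boot-server}/%s/bzImage init=%s console=ttyS0,115200\n' \
        "$profile" "$init_path"
    printf 'initrd ${boot-server}/%s/initrd\n' "$profile"
    printf 'boot\n'
}
```

It would be used as `gen_ipxe control-plane "$init_path" > artifacts/control-plane/netboot.ipxe`.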
## Build Process Deep Dive
### NixOS Netboot Build Internals
1. **netboot-minimal.nix** (from nixpkgs):
   - Provides base netboot functionality
   - Configures initrd with kexec support
   - Sets up squashfs for Nix store
2. **Our Extensions**:
   - Add PlasmaCloud service packages
   - Configure SSH for nixos-anywhere
   - Include provisioning tools (disko, etc.)
   - Customize kernel and modules
3. **Build Outputs**:
   - **bzImage**: Compressed Linux kernel
   - **initrd**: Compressed initial ramdisk that embeds a squashfs image of the Nix store, containing:
     - Minimal NixOS system
     - Nix store with service packages
     - Init scripts for booting
### Size Optimization Strategies
**Current Optimizations**:
```nix
documentation.enable = false; # -50MB
documentation.nixos.enable = false; # -20MB
i18n.supportedLocales = [ "en_US.UTF-8/UTF-8" ]; # -100MB
```
**Additional Strategies** (if needed):
- Evaluate `linuxPackages_hardened` or a trimmed kernel config (measure first: the hardened kernel is tuned for security, not necessarily size)
- Remove unused kernel modules
- Compress with xz instead of gzip
- On-demand package fetching from HTTP substituter
**Expected Sizes**:
- **Control Plane**: ~250-350 MB (initrd)
- **Worker**: ~150-250 MB (initrd)
- **All-in-One**: ~300-400 MB (initrd)
## Boot Flow
### From PXE to Running System
```
1. PXE Boot
   ├─ DHCP discovers boot server
   ├─ TFTP loads iPXE binary
   └─ iPXE executes boot script
2. Netboot Download
   ├─ HTTP downloads bzImage (~20MB)
   ├─ HTTP downloads initrd (~200MB)
   └─ iPXE boots the kernel into the NixOS installer
3. NixOS Installer (in RAM)
   ├─ Init system starts
   ├─ Network configuration (DHCP)
   ├─ SSH server starts
   └─ Ready for nixos-anywhere
4. Installation (nixos-anywhere)
   ├─ SSH connection established
   ├─ Disk partitioning (disko)
   ├─ NixOS system installation
   ├─ Secret injection
   └─ Bootloader installation
5. First Boot (from disk)
   ├─ GRUB/systemd-boot loads
   ├─ Services start (enabled)
   ├─ Cluster join (if configured)
   └─ Running PlasmaCloud node
```
```
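The hand-off between stage 3 and stage 4 requires knowing when the installer's SSH server is reachable. One way to sketch that wait in shell is shown below; `wait_until` and `ssh_ready` are hypothetical helper names, and disabling host key checking is an assumption based on the netboot image generating fresh host keys on each boot.

```bash
# Sketch only: poll until a probe command succeeds or attempts run out.
#   $1 = max attempts, $2 = delay in seconds, remaining args = probe command
wait_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Probe: non-interactive, key-based root login on the netboot target.
# BatchMode makes ssh fail instead of prompting for a password.
ssh_ready() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 \
        -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
        "root@$1" true
}
```

For example, `wait_until 60 5 ssh_ready 192.0.2.10` would retry for roughly five minutes (plus connection timeouts) before giving up; the address is a placeholder.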
## Customization Guide
### Adding a New Service
**Step 1**: Create NixOS module
```nix
# nix/modules/myservice.nix
{ config, lib, pkgs, ... }:
let
  cfg = config.services.myservice;
in
{
  options.services.myservice = {
    enable = lib.mkEnableOption "myservice";
  };
  config = lib.mkIf cfg.enable {
    systemd.services.myservice = { ... };
  };
}
```
**Step 2**: Add to flake packages
```nix
# flake.nix
packages.myservice-server = buildRustWorkspace { ... };
```
**Step 3**: Include in netboot profile
```nix
# nix/images/netboot-control-plane.nix
environment.systemPackages = with pkgs; [
  myservice-server
];
services.myservice = {
  enable = lib.mkDefault false;
};
```
### Creating a Custom Profile
**Step 1**: Create new netboot configuration
```nix
# nix/images/netboot-custom.nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ./netboot-base.nix
    ../modules
  ];
  # Your customizations
  environment.systemPackages = [ ... ];
}
```
**Step 2**: Add to flake
```nix
# flake.nix
nixosConfigurations.netboot-custom = nixpkgs.lib.nixosSystem {
  system = "x86_64-linux";
  modules = [ ./nix/images/netboot-custom.nix ];
};
```
**Step 3**: Update build script
```bash
# build-images.sh
profiles_to_build=("control-plane" "worker" "all-in-one" "custom")
```
## Security Model
### Netboot Phase
**Risk**: Netboot image has root SSH access enabled
**Mitigations**:
1. **Key-based authentication only** (no passwords)
2. **Isolated provisioning VLAN**
3. **MAC address whitelist in DHCP**
4. **Firewall disabled only during install**
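
The first mitigation can be spot-checked on a booted netboot image. The sketch below is a partial check under stated assumptions: `sshd_is_key_only` is a hypothetical helper that reads the effective configuration printed by `sshd -T` on stdin, and it inspects only two settings rather than auditing the whole configuration.

```bash
# Sketch only: succeed if password logins are off and root is key-only.
# Expects the output of `sshd -T` (lowercase "key value" lines) on stdin.
sshd_is_key_only() {
    local cfg
    cfg=$(cat)
    printf '%s\n' "$cfg" | grep -qi '^passwordauthentication no$' || return 1
    # OpenSSH may report key-only root login under either spelling.
    printf '%s\n' "$cfg" \
        | grep -Eqi '^permitrootlogin (prohibit-password|without-password)$' || return 1
}
```

It could be run as `ssh root@target sshd -T | sshd_is_key_only && echo "key-only"`, where `target` is a placeholder host name.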
### Post-Installation
Services remain disabled until final configuration enables them:
```nix
# In installed system configuration
services.chainfire.enable = true; # Overrides lib.mkDefault false
```
### Secret Management
Secrets are **NOT** embedded in netboot images:
```nix
# During nixos-anywhere installation (shell command, run from the provisioning host):
#   scp secrets/* root@target:/tmp/secrets/
# The installed system references the secret by path:
services.chainfire.settings.tls = {
  cert_path = "/etc/nixos/secrets/tls-cert.pem";
};
```
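Because the installed system reads secrets from `/etc/nixos/secrets/`, the files have to end up on the target disk at that path. One possible approach, sketched below, is to stage them locally in a directory tree that mirrors the target root. The helper name `stage_secrets` is hypothetical, and using nixos-anywhere's `--extra-files` option to transfer the staged tree is an assumption, not something this project is confirmed to do.

```bash
# Sketch only: stage secrets under <stage>/etc/nixos/secrets with tight permissions.
#   $1 = local directory holding the secret files
#   $2 = staging root that mirrors the target's filesystem root
stage_secrets() {
    local src="$1" stage="$2" dest
    dest="$stage/etc/nixos/secrets"
    mkdir -p "$dest"
    # Lock the directory down before any secret is copied into it.
    chmod 700 "$dest"
    cp "$src"/* "$dest"/
    chmod 600 "$dest"/*
}
```

The staging root could then be passed to nixos-anywhere, for example with `--extra-files "$stage"`, which copies its contents onto the root of the new installation.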
## Performance Characteristics
### Build Times
- **First build**: 30-60 minutes (downloads all dependencies)
- **Incremental builds**: 5-15 minutes (reuses cached artifacts)
- **With local cache**: 2-5 minutes
### Network Requirements
- **Initial download**: ~2GB (nixpkgs + dependencies)
- **Netboot download**: ~200-400MB per node
- **Installation**: ~500MB-2GB (depending on services)
### Hardware Requirements
**Build Machine**:
- CPU: 4+ cores recommended
- RAM: 8GB minimum, 16GB recommended
- Disk: 50GB free space
- Network: Broadband connection
**Target Machine**:
- RAM: 4GB minimum for netboot (8GB+ for production)
- Network: PXE boot support, DHCP
- Disk: Depends on disko configuration
## Testing Strategy
### Verification Steps
1. **Syntax Validation**:
```bash
nix flake check
```
2. **Build Test**:
```bash
./build-images.sh --profile control-plane
```
3. **Artifact Verification**:
```bash
file artifacts/control-plane/bzImage # Should be Linux kernel
file artifacts/control-plane/initrd # Should be compressed data
```
4. **PXE Boot Test**:
- Boot VM from netboot image
- Verify SSH access
- Check available tools (disko, parted, etc.)
5. **Installation Test**:
- Run nixos-anywhere on test target
- Verify successful installation
- Check service availability
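
The artifact checks in step 3 can be extended with a size sanity check. The sketch below uses hypothetical helper names (`check_size`, `verify_profile`), and its bounds are taken from the approximate sizes quoted in this document, so they would need adjusting per profile.

```bash
# Sketch only: succeed when the file exists and its size is within [min, max] MiB.
check_size() {
    local path="$1" min_mib="$2" max_mib="$3" bytes mib
    if [ ! -f "$path" ]; then
        echo "missing: $path" >&2
        return 1
    fi
    bytes=$(wc -c < "$path")
    mib=$((bytes / 1048576))
    if [ "$mib" -lt "$min_mib" ] || [ "$mib" -gt "$max_mib" ]; then
        echo "unexpected size: $path is ${mib} MiB (expected ${min_mib}-${max_mib})" >&2
        return 1
    fi
    echo "ok: $path (${mib} MiB)"
}

# Check both artifacts of one profile directory, e.g. artifacts/control-plane.
verify_profile() {
    local dir="$1"
    check_size "$dir/bzImage" 10 30 && check_size "$dir/initrd" 100 400
}
```

Running `verify_profile artifacts/control-plane` after a build reports each artifact and returns a non-zero status if one is missing or outside its range.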
## Troubleshooting Matrix
| Symptom | Possible Cause | Solution |
|---------|---------------|----------|
| Build fails | Flakes not enabled | Add `experimental-features = nix-command flakes` to `nix.conf` |
| Large initrd | Too many packages | Remove unused packages |
| SSH fails | Wrong SSH key | Update `authorized_keys` |
| Boot hangs | Wrong kernel params | Check `console=` settings |
| No network | DHCP issues | Verify `networking.useDHCP = true` |
| Service missing | Package not built | Check flake overlay |
## Future Enhancements
### Planned Improvements
1. **Image Variants**:
   - Minimal installer (no services)
   - Debug variant (with extra tools)
   - Rescue mode (for recovery)
2. **Build Optimizations**:
   - Parallel profile builds
   - Incremental rebuild detection
   - Binary cache integration
3. **Security Enhancements**:
   - Per-node SSH keys
   - TPM-based secrets
   - Measured boot support
4. **Monitoring**:
   - Build metrics collection
   - Size trend tracking
   - Performance benchmarking
## References
- **NixOS Netboot**: https://nixos.wiki/wiki/Netboot
- **nixos-anywhere**: https://github.com/nix-community/nixos-anywhere
- **disko**: https://github.com/nix-community/disko
- **T032 Design**: `docs/por/T032-baremetal-provisioning/design.md`
- **T024 Modules**: `nix/modules/`
## Revision History
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-10 | T032.S3 | Initial implementation |