Bare-Metal Provisioning Quick Start Guide
Target Audience: Experienced operators familiar with NixOS and PXE boot
Time Required: 2-4 hours for 3-node cluster
Last Updated: 2025-12-10
Prerequisites Checklist
- 3+ bare-metal servers with PXE boot enabled
- Network switch and cabling ready
- NixOS provisioning workstation with flakes enabled
- SSH key pair generated
- BMC/IPMI access configured (optional but recommended)
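If the SSH key pair on the checklist does not exist yet, it can be generated up front. The file path and comment below are just conventions, not requirements:

```shell
# Create an ed25519 key pair for provisioning access
# (empty passphrase so automation such as nixos-anywhere can use it)
mkdir -p "$HOME/.ssh"
ssh-keygen -t ed25519 -N "" -C "provisioning" -f "$HOME/.ssh/provisioning_ed25519"
ls "$HOME/.ssh/provisioning_ed25519" "$HOME/.ssh/provisioning_ed25519.pub"
```

The public key is what gets baked into the netboot image and node configurations for root SSH access.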
10-Step Deployment Process
Step 1: Deploy PXE Server (15 minutes)
# On provisioning server (NixOS)
git clone <plasmacloud-repo>
cd chainfire/baremetal/pxe-server
# Edit configuration
sudo vim /etc/nixos/pxe-config.nix
# Set: serverAddress, subnet, netmask, range, nodes (MAC addresses)
# Add the module to the imports list inside /etc/nixos/configuration.nix
# (don't append to the end of the file; that lands outside the closing brace):
#   imports = [ ./chainfire/baremetal/pxe-server/nixos-module.nix ];
# Apply configuration
sudo nixos-rebuild switch
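The fields called out in the comment above might be set like this illustrative sketch; the option names mirror that comment and may differ from the actual module's schema:

```nix
# Illustrative pxe-config.nix values (addresses and MACs are examples)
{
  serverAddress = "10.0.100.1";        # PXE/TFTP/HTTP server IP
  subnet = "10.0.100.0";
  netmask = "255.255.255.0";
  range = "10.0.100.50 10.0.100.99";   # DHCP pool for installer boots
  nodes = {
    node01 = "52:54:00:12:34:56";      # MAC address -> node mapping
    node02 = "52:54:00:12:34:57";
    node03 = "52:54:00:12:34:58";
  };
}
```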
Validate:
sudo systemctl status dhcpd4 atftpd nginx
curl http://localhost:8080/health
Step 2: Build Netboot Images (20 minutes)
cd baremetal/image-builder
# Build all profiles
./build-images.sh
# Deploy to PXE server
sudo mkdir -p /var/lib/pxe-boot/nixos/{control-plane,worker}
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
Validate:
curl -I http://localhost:8080/boot/nixos/control-plane/bzImage
ls -lh /var/lib/pxe-boot/nixos/*/
Step 3: Generate TLS Certificates (10 minutes)
# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
-out ca-cert.pem -subj "/CN=PlasmaCloud CA"
# Generate per-node certificates with a SAN entry
# (most modern TLS stacks ignore the CN alone; bash process substitution below)
for node in node01 node02 node03; do
openssl genrsa -out ${node}-key.pem 4096
openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
-subj "/CN=${node}.example.com"
openssl x509 -req -in ${node}-csr.pem \
-CA ca-cert.pem -CAkey ca-key.pem \
-CAcreateserial -out ${node}-cert.pem -days 365 \
-extfile <(printf 'subjectAltName=DNS:%s.example.com' "$node")
done
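Before copying the certificates into each node's secrets/ directory, it is worth confirming they actually chain to the CA. The sketch below is self-contained (it generates a throwaway CA and cert so it can run anywhere); in practice, point -CAfile at the real ca-cert.pem and verify each node cert:

```shell
# Self-contained demo of the chain check; swap in the real files from above.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
openssl genrsa -out ca-key.pem 2048 2>/dev/null
openssl req -x509 -new -nodes -key ca-key.pem -days 1 \
  -out ca-cert.pem -subj "/CN=Throwaway CA"
openssl genrsa -out node-key.pem 2048 2>/dev/null
openssl req -new -key node-key.pem -out node-csr.pem \
  -subj "/CN=node01.example.com"
openssl x509 -req -in node-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
  -CAcreateserial -out node-cert.pem -days 1 2>/dev/null
# Prints "node-cert.pem: OK" when the chain is valid; non-zero exit otherwise
result=$(openssl verify -CAfile ca-cert.pem node-cert.pem)
echo "$result"
```

A certificate that fails this check will produce the "Services start but cannot communicate" symptom described under Common Pitfalls.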
Step 4: Create Node Configurations (15 minutes)
mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets
# For each node, create:
# 1. configuration.nix (see template below)
# 2. disko.nix (disk layout)
# 3. secrets/cluster-config.json
# 4. Copy TLS certificates to secrets/
Minimal configuration.nix template:
{ config, pkgs, lib, ... }:
{
imports = [
../../profiles/control-plane.nix
../../common/base.nix
./disko.nix
];
networking = {
hostName = "node01";
domain = "example.com";
interfaces.eth0.ipv4.addresses = [{
address = "10.0.200.10";
prefixLength = 24;
}];
defaultGateway = "10.0.200.1";
nameservers = [ "10.0.200.1" ];
};
services.chainfire.enable = true;
services.flaredb.enable = true;
services.iam.enable = true;
services.first-boot-automation.enable = true;
system.stateVersion = "24.11";
}
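Step 4 also calls for a disko.nix per node. A minimal single-disk layout might look like the sketch below; the device path, sizes, and filesystem choice are assumptions to adjust to the actual hardware:

```nix
# Illustrative disko.nix: GPT with an ESP and a single ext4 root
{
  disko.devices.disk.main = {
    device = "/dev/sda";   # verify with lsblk from the installer first
    type = "disk";
    content = {
      type = "gpt";
      partitions = {
        ESP = {
          size = "512M";
          type = "EF00";
          content = { type = "filesystem"; format = "vfat"; mountpoint = "/boot"; };
        };
        root = {
          size = "100%";
          content = { type = "filesystem"; format = "ext4"; mountpoint = "/"; };
        };
      };
    };
  };
}
```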
cluster-config.json (bootstrap nodes):
{
"node_id": "node01",
"bootstrap": true,
"raft_addr": "10.0.200.10:2380",
"initial_peers": [
"node01.example.com:2380",
"node02.example.com:2380",
"node03.example.com:2380"
]
}
Step 5: Power On Nodes (5 minutes)
# Via BMC (example with ipmitool)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
ipmitool -I lanplus -H $ip -U admin -P password \
chassis bootdev pxe options=persistent
ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done
# Or physically: Power on servers with PXE boot enabled in BIOS
Step 6: Verify PXE Boot (5 minutes)
Watch DHCP logs:
sudo journalctl -u dhcpd4 -f
Expected output:
DHCPDISCOVER from 52:54:00:12:34:56
DHCPOFFER to 10.0.100.50
DHCPREQUEST from 52:54:00:12:34:56
DHCPACK to 10.0.100.50
Test SSH to installer:
# Wait 60-90 seconds for boot
ssh root@10.0.100.50 'uname -a'
# Expected: Linux ... nixos
Step 7: Run nixos-anywhere (30-60 minutes)
# Provision all 3 nodes in parallel (installer IPs assigned by DHCP)
i=0
for node in node01 node02 node03; do
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#${node} \
--build-on-remote \
root@10.0.100.5${i} & # node01 -> .50, node02 -> .51, node03 -> .52; adjust IPs
i=$((i + 1))
done
wait
echo "Provisioning complete. Nodes will reboot automatically."
Step 8: Wait for First Boot (10 minutes)
Nodes will reboot from disk and run first-boot automation. Monitor:
# Wait for nodes to come online (check production IPs)
for ip in 10.0.200.{10,11,12}; do
until ssh -o ConnectTimeout=5 -o BatchMode=yes root@$ip 'exit' 2>/dev/null; do
echo "Waiting for $ip..."
sleep 10
done
done
# Check cluster join logs
ssh root@10.0.200.10 'journalctl -u chainfire-cluster-join.service'
Step 9: Verify Cluster Health (5 minutes)
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq
# Expected output:
# {
# "members": [
# {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
# {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
# {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
# ]
# }
# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq
# Check IAM service
curl -k https://node01.example.com:8080/health | jq
Step 10: Final Validation (5 minutes)
# Run comprehensive health check
/srv/provisioning/scripts/verify-cluster.sh
# Test write/read
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
-H "Content-Type: application/json" \
-d '{"value":"hello world"}'
curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
Essential Commands
PXE Server Management
# Status
sudo systemctl status dhcpd4 atftpd nginx
# Restart services
sudo systemctl restart dhcpd4 atftpd nginx
# View DHCP leases
sudo cat /var/lib/dhcp/dhcpd.leases
# Monitor PXE boot
sudo tcpdump -i eth0 -n 'port 67 or port 68 or port 69'
Node Provisioning
# Single node
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node01 \
root@10.0.100.50
# With debug output
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node01 \
--debug \
--no-reboot \
root@10.0.100.50
Cluster Operations
# List cluster members
curl -k https://node01.example.com:2379/admin/cluster/members | jq
# Add new member
curl -k -X POST https://node01.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node04","raft_addr":"10.0.200.13:2380"}'
# Remove member
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04
# Check leader
curl -k https://node01.example.com:2379/admin/cluster/leader | jq
Node Management
# Check service status
ssh root@node01.example.com 'systemctl status chainfire flaredb iam'
# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -f'
# Rollback NixOS generation
ssh root@node01.example.com 'nixos-rebuild switch --rollback'
# Reboot node
ssh root@node01.example.com 'reboot'
Health Checks
# All services on one node
for port in 2379 2479 8080 9090 9091; do
curl -k https://node01.example.com:$port/health 2>/dev/null | jq -c
done
# Cluster-wide health
for node in node01 node02 node03; do
echo "$node:"
curl -k https://${node}.example.com:2379/health | jq -c
done
Quick Troubleshooting Tips
PXE Boot Not Working
# Check DHCP server
sudo systemctl status dhcpd4
sudo journalctl -u dhcpd4 -n 50
# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe
# Verify BIOS settings: PXE enabled, network first in boot order
nixos-anywhere Fails
# SSH to installer and check disks
ssh root@10.0.100.50 'lsblk'
# Wipe disk if needed
ssh root@10.0.100.50 'wipefs -a /dev/sda && sgdisk --zap-all /dev/sda'
# Retry with debug
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node01 \
--debug \
root@10.0.100.50 2>&1 | tee provision.log
Cluster Join Fails
# Check first-boot logs
ssh root@node01.example.com 'journalctl -u chainfire-cluster-join.service'
# Verify cluster-config.json
ssh root@node01.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'
# Manual join
curl -k -X POST https://node01.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node02","raft_addr":"10.0.200.11:2380"}'
Service Won't Start
# Check status and logs
ssh root@node01.example.com 'systemctl status chainfire.service'
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'
# Verify configuration
ssh root@node01.example.com 'ls -l /etc/nixos/secrets/'
# Check ports
ssh root@node01.example.com 'ss -tlnp | grep 2379'
Network Issues
# Test connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'
# Check firewall
ssh root@node01.example.com 'iptables -L -n | grep 2379'
# Test specific port
ssh root@node01.example.com 'nc -zv node02.example.com 2379'
Common Pitfalls
- Incorrect DHCP Configuration
  - Symptom: Nodes get IP but don't download bootloader
  - Fix: Verify next-server and filename options in dhcpd.conf
- Wrong Bootstrap Flag
  - Symptom: First 3 nodes fail to form cluster
  - Fix: Ensure all 3 have "bootstrap": true in cluster-config.json
- Missing TLS Certificates
  - Symptom: Services start but cannot communicate
  - Fix: Verify certificates exist in /etc/nixos/secrets/ with correct permissions
- Firewall Blocking Ports
  - Symptom: Cluster members cannot reach each other
  - Fix: Add iptables rules for ports 2379, 2380, 2479, 2480
- PXE Boot Loops
  - Symptom: Node keeps booting from network after installation
  - Fix: Change BIOS boot order (disk before network) or use BMC to set boot device
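For the first pitfall (incorrect DHCP configuration), the relevant dhcpd.conf stanza looks roughly like the sketch below; the addresses are illustrative and the bootloader filename must match what the TFTP server actually serves:

```conf
subnet 10.0.100.0 netmask 255.255.255.0 {
  range 10.0.100.50 10.0.100.99;
  next-server 10.0.100.1;      # TFTP server (the PXE host)
  filename "undionly.kpxe";    # bootloader, same file as the TFTP test above
}
```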
Adding Additional Nodes
After bootstrap cluster is healthy:
# 1. Create node configuration (worker profile)
mkdir -p /srv/provisioning/nodes/node04.example.com/secrets
# 2. cluster-config.json with bootstrap=false
echo '{
"node_id": "node04",
"bootstrap": false,
"leader_url": "https://node01.example.com:2379",
"raft_addr": "10.0.200.13:2380"
}' > /srv/provisioning/nodes/node04.example.com/secrets/cluster-config.json
# 3. Power on and provision
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on
# Wait 60s
sleep 60
# 4. Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
--flake /srv/provisioning#node04 \
root@10.0.100.60
# 5. Verify join
curl -k https://node01.example.com:2379/admin/cluster/members | jq
Rolling Updates
#!/bin/bash
set -euo pipefail
# Update one node at a time; abort the rollout if any step fails
NODES=("node01" "node02" "node03")
for node in "${NODES[@]}"; do
echo "Updating $node..."
# Deploy new configuration
ssh root@${node}.example.com \
"nixos-rebuild switch --flake /srv/provisioning#${node}"
# Wait for services to stabilize
sleep 30
# Verify health before moving on (-f makes curl fail on HTTP errors)
curl -kfsS https://${node}.example.com:2379/health | jq
echo "$node updated successfully"
done
Next Steps
After successful deployment:
- Configure Monitoring
  - Deploy Prometheus and Grafana
  - Add cluster health dashboards
  - Set up alerting rules
- Enable Backups
  - Configure automated backups for Chainfire/FlareDB data
  - Test restore procedures
  - Document backup schedule
- Security Hardening
  - Remove -k flags from curl commands (validate TLS)
  - Implement network segmentation (VLANs)
  - Enable audit logging
  - Set up log aggregation
- Documentation
  - Document node inventory (MAC addresses, IPs, roles)
  - Create runbooks for common operations
  - Update network diagrams
Reference Documentation
- Full Runbook: RUNBOOK.md
- Hardware Guide: HARDWARE.md
- Network Reference: NETWORK.md
- Command Reference: COMMANDS.md
- Design Document: design.md
Support
For detailed troubleshooting and advanced topics, see the full RUNBOOK.md.
Key Contacts:
- Infrastructure Team: infra@example.com
- Emergency Escalation: oncall@example.com
Useful Resources:
- NixOS Manual: https://nixos.org/manual/nixos/stable/
- nixos-anywhere: https://github.com/nix-community/nixos-anywhere
- iPXE Documentation: https://ipxe.org/
Document End