# Bare-Metal Provisioning Quick Start Guide

**Target Audience:** Experienced operators familiar with NixOS and PXE boot
**Time Required:** 2-4 hours for a 3-node cluster
**Last Updated:** 2025-12-10

## Prerequisites Checklist

- [ ] 3+ bare-metal servers with PXE boot enabled
- [ ] Network switch and cabling ready
- [ ] NixOS provisioning workstation with flakes enabled
- [ ] SSH key pair generated
- [ ] BMC/IPMI access configured (optional but recommended)

## 10-Step Deployment Process

### Step 1: Deploy PXE Server (15 minutes)

```bash
# On provisioning server (NixOS)
git clone <chainfire-repo-url>   # substitute your repository URL
cd chainfire/baremetal/pxe-server

# Edit configuration
sudo vim /etc/nixos/pxe-config.nix
# Set: serverAddress, subnet, netmask, range, nodes (MAC addresses)

# Add module import
echo 'imports = [ ./chainfire/baremetal/pxe-server/nixos-module.nix ];' | \
  sudo tee -a /etc/nixos/configuration.nix

# Apply configuration
sudo nixos-rebuild switch
```

**Validate:**

```bash
sudo systemctl status dhcpd4 atftpd nginx
curl http://localhost:8080/health
```

### Step 2: Build Netboot Images (20 minutes)

```bash
cd baremetal/image-builder

# Build all profiles
./build-images.sh

# Deploy to PXE server
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
```

**Validate:**

```bash
curl -I http://localhost:8080/boot/nixos/control-plane/bzImage
ls -lh /var/lib/pxe-boot/nixos/*/
```

### Step 3: Generate TLS Certificates (10 minutes)

```bash
# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem \
    -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365
done
```

### Step 4: Create Node Configurations (15 minutes)

```bash
mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets

# For each node, create:
# 1. configuration.nix (see template below)
# 2. disko.nix (disk layout)
# 3. secrets/cluster-config.json
# 4. Copy TLS certificates to secrets/
```

**Minimal configuration.nix template:**

```nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./disko.nix
  ];

  networking = {
    hostName = "node01";
    domain = "example.com";
    interfaces.eth0.ipv4.addresses = [{
      address = "10.0.200.10";
      prefixLength = 24;
    }];
    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" ];
  };

  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.iam.enable = true;
  services.first-boot-automation.enable = true;

  system.stateVersion = "24.11";
}
```

**cluster-config.json (bootstrap nodes):**

```json
{
  "node_id": "node01",
  "bootstrap": true,
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ]
}
```

### Step 5: Power On Nodes (5 minutes)

```bash
# Via BMC (example with ipmitool)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
  ipmitool -I lanplus -H $ip -U admin -P password \
    chassis bootdev pxe options=persistent
  ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done

# Or physically: power on servers with PXE boot enabled in BIOS
```

### Step 6: Verify PXE Boot (5 minutes)

Watch DHCP logs:

```bash
sudo journalctl -u dhcpd4 -f
```

Expected output:

```
DHCPDISCOVER from 52:54:00:12:34:56
DHCPOFFER to 10.0.100.50
DHCPREQUEST from 52:54:00:12:34:56
DHCPACK to 10.0.100.50
```

Test SSH to installer:

```bash
# Wait 60-90 seconds for boot
ssh root@10.0.100.50 'uname -a'
# Expected: Linux ... nixos
```

### Step 7: Run nixos-anywhere (30-60 minutes)

```bash
# Provision all 3 nodes in parallel, pairing each node with its installer IP
NODES=(node01 node02 node03)
IPS=(10.0.100.50 10.0.100.51 10.0.100.52)   # adjust installer IPs

for i in "${!NODES[@]}"; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#${NODES[$i]} \
    --build-on-remote \
    root@${IPS[$i]} &
done
wait

echo "Provisioning complete. Nodes will reboot automatically."
```

### Step 8: Wait for First Boot (10 minutes)

Nodes will reboot from disk and run first-boot automation. Monitor:

```bash
# Wait for nodes to come online (check production IPs)
for ip in 10.0.200.{10,11,12}; do
  until ssh root@$ip 'exit' 2>/dev/null; do
    echo "Waiting for $ip..."
    sleep 10
  done
done

# Check cluster join logs
ssh root@10.0.200.10 'journalctl -u chainfire-cluster-join.service'
```

### Step 9: Verify Cluster Health (5 minutes)

```bash
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Expected output:
# {
#   "members": [
#     {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
#     {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
#     {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
#   ]
# }

# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Check IAM service
curl -k https://node01.example.com:8080/health | jq
```

### Step 10: Final Validation (5 minutes)

```bash
# Run comprehensive health check
/srv/provisioning/scripts/verify-cluster.sh

# Test write/read
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'

curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
```

---

## Essential Commands

### PXE Server Management

```bash
# Status
sudo systemctl status dhcpd4 atftpd nginx

# Restart services
sudo systemctl restart dhcpd4 atftpd nginx

# View DHCP leases
sudo cat /var/lib/dhcp/dhcpd.leases

# Monitor PXE boot
sudo tcpdump -i eth0 -n port 67 or port 68 or port 69
```

### Node Provisioning

```bash
# Single node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  root@10.0.100.50

# With debug output
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  --no-reboot \
  root@10.0.100.50
```

### Cluster Operations

```bash
# List cluster members
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Add new member
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.200.13:2380"}'

# Remove member
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04

# Check leader
curl -k https://node01.example.com:2379/admin/cluster/leader | jq
```

### Node Management

```bash
# Check service status
ssh root@node01.example.com 'systemctl status chainfire flaredb iam'

# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -f'

# Roll back to the previous NixOS generation
ssh root@node01.example.com 'nixos-rebuild switch --rollback'

# Reboot node
ssh root@node01.example.com 'reboot'
```

### Health Checks

```bash
# All services on one node
for port in 2379 2479 8080 9090 9091; do
  curl -k https://node01.example.com:$port/health 2>/dev/null | jq -c
done

# Cluster-wide health
for node in node01 node02 node03; do
  echo "$node:"
  curl -k https://${node}.example.com:2379/health | jq -c
done
```

---

## Quick Troubleshooting Tips

### PXE Boot Not Working

```bash
# Check DHCP server
sudo systemctl status dhcpd4
sudo journalctl -u dhcpd4 -n 50

# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Verify BIOS settings: PXE enabled, network first in boot order
```

### nixos-anywhere Fails

```bash
# SSH to installer and check disks
ssh root@10.0.100.50 'lsblk'

# Wipe disk if needed
ssh root@10.0.100.50 'wipefs -a /dev/sda && sgdisk --zap-all /dev/sda'

# Retry with debug
nix run github:nix-community/nixos-anywhere \
  -- \
  --flake /srv/provisioning#node01 \
  --debug \
  root@10.0.100.50 2>&1 | tee provision.log
```

### Cluster Join Fails

```bash
# Check first-boot logs
ssh root@node01.example.com 'journalctl -u chainfire-cluster-join.service'

# Verify cluster-config.json
ssh root@node01.example.com 'jq . /etc/nixos/secrets/cluster-config.json'

# Manual join
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node02","raft_addr":"10.0.200.11:2380"}'
```

### Service Won't Start

```bash
# Check status and logs
ssh root@node01.example.com 'systemctl status chainfire.service'
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'

# Verify configuration
ssh root@node01.example.com 'ls -l /etc/nixos/secrets/'

# Check ports
ssh root@node01.example.com 'ss -tlnp | grep 2379'
```

### Network Issues

```bash
# Test connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'

# Check firewall
ssh root@node01.example.com 'iptables -L -n | grep 2379'

# Test specific port
ssh root@node01.example.com 'nc -zv node02.example.com 2379'
```

---

## Common Pitfalls

1. **Incorrect DHCP Configuration**
   - Symptom: Nodes get an IP but don't download the bootloader
   - Fix: Verify the `next-server` and `filename` options in dhcpd.conf

2. **Wrong Bootstrap Flag**
   - Symptom: First 3 nodes fail to form a cluster
   - Fix: Ensure all 3 have `"bootstrap": true` in cluster-config.json

3. **Missing TLS Certificates**
   - Symptom: Services start but cannot communicate
   - Fix: Verify certificates exist in `/etc/nixos/secrets/` with correct permissions

4. **Firewall Blocking Ports**
   - Symptom: Cluster members cannot reach each other
   - Fix: Add iptables rules for ports 2379, 2380, 2479, 2480

5. **PXE Boot Loops**
   - Symptom: Node keeps booting from network after installation
   - Fix: Change BIOS boot order (disk before network) or use the BMC to set the boot device

---

## Adding Additional Nodes

After the bootstrap cluster is healthy:

```bash
# 1. Create node configuration (worker profile)
mkdir -p /srv/provisioning/nodes/node04.example.com/secrets

# 2. cluster-config.json with bootstrap=false
echo '{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.13:2380"
}' > /srv/provisioning/nodes/node04.example.com/secrets/cluster-config.json

# 3. Power on and provision
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on

# Wait 60s
sleep 60

# 4. Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  root@10.0.100.60

# 5. Verify join
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```

---

## Rolling Updates

```bash
#!/bin/bash
# Update one node at a time
NODES=("node01" "node02" "node03")

for node in "${NODES[@]}"; do
  echo "Updating $node..."

  # Deploy new configuration
  ssh root@$node.example.com \
    "nixos-rebuild switch --flake /srv/provisioning#$node"

  # Wait for services to stabilize
  sleep 30

  # Verify health
  curl -k https://${node}.example.com:2379/health | jq

  echo "$node updated successfully"
done
```

---

## Next Steps

After successful deployment:

1. **Configure Monitoring**
   - Deploy Prometheus and Grafana
   - Add cluster health dashboards
   - Set up alerting rules

2. **Enable Backups**
   - Configure automated backups for Chainfire/FlareDB data
   - Test restore procedures
   - Document backup schedule

3. **Security Hardening**
   - Remove `-k` flags from curl commands (validate TLS)
   - Implement network segmentation (VLANs)
   - Enable audit logging
   - Set up log aggregation

4. **Documentation**
   - Document node inventory (MAC addresses, IPs, roles)
   - Create runbooks for common operations
   - Update network diagrams

---

## Reference Documentation

- **Full Runbook:** [RUNBOOK.md](RUNBOOK.md)
- **Hardware Guide:** [HARDWARE.md](HARDWARE.md)
- **Network Reference:** [NETWORK.md](NETWORK.md)
- **Command Reference:** [COMMANDS.md](COMMANDS.md)
- **Design Document:** [design.md](design.md)

---

## Support

For detailed troubleshooting and advanced topics, see the full [RUNBOOK.md](RUNBOOK.md).

**Key Contacts:**

- Infrastructure Team: infra@example.com
- Emergency Escalation: oncall@example.com

**Useful Resources:**

- NixOS Manual: https://nixos.org/manual/nixos/stable/
- nixos-anywhere: https://github.com/nix-community/nixos-anywhere
- iPXE Documentation: https://ipxe.org/

---

**Document End**