# Bare-Metal Provisioning Quick Start Guide

**Target Audience:** Experienced operators familiar with NixOS and PXE boot
**Time Required:** 2-4 hours for 3-node cluster
**Last Updated:** 2025-12-10

## Prerequisites Checklist

- [ ] 3+ bare-metal servers with PXE boot enabled
- [ ] Network switch and cabling ready
- [ ] NixOS provisioning workstation with flakes enabled
- [ ] SSH key pair generated
- [ ] BMC/IPMI access configured (optional but recommended)

## 10-Step Deployment Process

### Step 1: Deploy PXE Server (15 minutes)

```bash
# On provisioning server (NixOS)
git clone <plasmacloud-repo>
cd chainfire/baremetal/pxe-server

# Edit configuration
sudo vim /etc/nixos/pxe-config.nix
# Set: serverAddress, subnet, netmask, range, nodes (MAC addresses)

# Add module import: merge this path into the existing `imports` list in
# /etc/nixos/configuration.nix (appending a second `imports = ...;` line
# after the closing brace would be a syntax error)
#   imports = [ ./chainfire/baremetal/pxe-server/nixos-module.nix ];
sudo vim /etc/nixos/configuration.nix

# Apply configuration
sudo nixos-rebuild switch
```

**Validate:**
```bash
sudo systemctl status dhcpd4 atftpd nginx
curl http://localhost:8080/health
```

### Step 2: Build Netboot Images (20 minutes)

```bash
cd baremetal/image-builder

# Build all profiles
./build-images.sh

# Deploy to PXE server
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
```

**Validate:**
```bash
curl -I http://localhost:8080/boot/nixos/control-plane/bzImage
ls -lh /var/lib/pxe-boot/nixos/*/
```
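
The listing above is easy to misread when a file is missing or zero-length. A minimal sketch of a stricter check follows. `check_artifacts` and `REQUIRED_FILES` are hypothetical names, and the file list is an assumption: only `bzImage` appears in this guide, and `initrd` is a guess at what `build-images.sh` emits.

```bash
# Hypothetical helper: verify that every profile directory holds the
# expected, non-empty boot files.
# ASSUMPTION: adjust REQUIRED_FILES to what build-images.sh really produces.
REQUIRED_FILES=(bzImage initrd)

check_artifacts() {
  local root="$1"; shift
  local profile file missing=0
  for profile in "$@"; do
    for file in "${REQUIRED_FILES[@]}"; do
      # -s is true only for files that exist and are larger than zero bytes
      if [[ ! -s "$root/$profile/$file" ]]; then
        echo "missing or empty: $root/$profile/$file" >&2
        missing=1
      fi
    done
  done
  return "$missing"
}

# Example:
#   check_artifacts /var/lib/pxe-boot/nixos control-plane worker
```

The function returns non-zero if anything is missing, so it can gate the next step in a script.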

### Step 3: Generate TLS Certificates (10 minutes)

```bash
# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates
for node in node01 node02 node03; do
  openssl genrsa -out "${node}-key.pem" 4096
  openssl req -new -key "${node}-key.pem" -out "${node}-csr.pem" \
    -subj "/CN=${node}.example.com"
  # Add a subjectAltName: most TLS clients ignore the CN when verifying hostnames
  openssl x509 -req -in "${node}-csr.pem" \
    -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out "${node}-cert.pem" -days 365 \
    -extfile <(printf 'subjectAltName=DNS:%s.example.com' "$node")
done
```

### Step 4: Create Node Configurations (15 minutes)

```bash
mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets

# For each node, create:
# 1. configuration.nix (see template below)
# 2. disko.nix (disk layout)
# 3. secrets/cluster-config.json
# 4. Copy TLS certificates to secrets/
```

**Minimal configuration.nix template:**
```nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./disko.nix
  ];

  networking = {
    hostName = "node01";
    domain = "example.com";
    interfaces.eth0.ipv4.addresses = [{
      address = "10.0.200.10";
      prefixLength = 24;
    }];
    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" ];
  };

  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.iam.enable = true;
  services.first-boot-automation.enable = true;

  system.stateVersion = "24.11";
}
```

**cluster-config.json (bootstrap nodes):**
```json
{
  "node_id": "node01",
  "bootstrap": true,
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ]
}
```
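
Only `node_id` and `raft_addr` differ between the three bootstrap nodes, so the files can be generated rather than hand-edited. The sketch below assumes bash 4+ for associative arrays. `NODE_IP`, `render_cluster_config`, and `write_cluster_configs` are hypothetical names, and the node-to-IP mapping is taken from the addresses used elsewhere in this guide.

```bash
# Node name -> production IP (addresses as used in this guide; adjust to yours).
declare -A NODE_IP=([node01]=10.0.200.10 [node02]=10.0.200.11 [node03]=10.0.200.12)

# Print the bootstrap cluster-config.json for one node to stdout.
render_cluster_config() {
  local node="$1"
  cat <<EOF
{
  "node_id": "${node}",
  "bootstrap": true,
  "raft_addr": "${NODE_IP[$node]}:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ]
}
EOF
}

# Write one file per node under OUT_ROOT (defaults to the Step 4 layout).
write_cluster_configs() {
  local out_root="${1:-/srv/provisioning/nodes}" node
  for node in node01 node02 node03; do
    mkdir -p "$out_root/$node.example.com/secrets"
    render_cluster_config "$node" \
      > "$out_root/$node.example.com/secrets/cluster-config.json"
  done
}
```

Running `write_cluster_configs` with no argument writes into `/srv/provisioning/nodes`. Pass a scratch directory first to inspect the output before overwriting anything.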

### Step 5: Power On Nodes (5 minutes)

```bash
# Via BMC (example with ipmitool)
# Note: options=persistent keeps PXE as the boot device across reboots; if a node
# returns to the installer after Step 7, see "PXE Boot Loops" under Common Pitfalls
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
  ipmitool -I lanplus -H "$ip" -U admin -P password \
    chassis bootdev pxe options=persistent
  ipmitool -I lanplus -H "$ip" -U admin -P password chassis power on
done

# Or physically: Power on servers with PXE boot enabled in BIOS
```

### Step 6: Verify PXE Boot (5 minutes)

Watch DHCP logs:
```bash
sudo journalctl -u dhcpd4 -f
```

Expected output:
```
DHCPDISCOVER from 52:54:00:12:34:56
DHCPOFFER to 10.0.100.50
DHCPREQUEST from 52:54:00:12:34:56
DHCPACK to 10.0.100.50
```

Test SSH to installer:
```bash
# Wait 60-90 seconds for boot
ssh root@10.0.100.50 'uname -a'
# Expected: Linux ... nixos
```

### Step 7: Run nixos-anywhere (30-60 minutes)

```bash
# Provision all 3 nodes in parallel.
# Map each node to its installer IP (adjust to match your DHCP leases).
declare -A installer_ip=(
  [node01]=10.0.100.50
  [node02]=10.0.100.51
  [node03]=10.0.100.52
)

for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake "/srv/provisioning#${node}" \
    --build-on-remote \
    "root@${installer_ip[$node]}" &
done
wait

echo "Provisioning complete. Nodes will reboot automatically."
```
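
A bare `wait` discards the exit status of each background job, so a failed node can go unnoticed. The sketch below (bash 4+) records one PID per node and reports failures individually. `INSTALLER_IP`, `LOG_DIR`, `provision_node`, and `provision_all` are hypothetical names, and the node-to-installer-IP mapping is an assumption to adjust.

```bash
# ASSUMPTION: example installer leases; adjust to your DHCP assignments.
declare -A INSTALLER_IP=([node01]=10.0.100.50 [node02]=10.0.100.51 [node03]=10.0.100.52)
LOG_DIR="${LOG_DIR:-.}"

# Default per-node provisioner, using the same nixos-anywhere flags as this step.
provision_node() {
  nix run github:nix-community/nixos-anywhere -- \
    --flake "/srv/provisioning#$1" --build-on-remote "root@$2"
}

# provision_all [CMD]: run "CMD NODE IP" for every node in parallel, write one
# log per node, and return non-zero if any job failed.
provision_all() {
  local cmd="${1:-provision_node}" node failed=0
  local -A pids=()
  for node in "${!INSTALLER_IP[@]}"; do
    "$cmd" "$node" "${INSTALLER_IP[$node]}" \
      > "$LOG_DIR/provision-$node.log" 2>&1 &
    pids[$node]=$!
  done
  for node in "${!pids[@]}"; do
    # wait PID returns that job's exit status
    if wait "${pids[$node]}"; then
      echo "$node: ok"
    else
      echo "$node: FAILED (see $LOG_DIR/provision-$node.log)"
      failed=1
    fi
  done
  return "$failed"
}
```

Passing a stub command, such as a function that only echoes its arguments, gives a dry run of the fan-out without touching any node.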

### Step 8: Wait for First Boot (10 minutes)

Nodes will reboot from disk and run first-boot automation. Monitor:

```bash
# Wait for nodes to come online (check production IPs)
for ip in 10.0.200.{10,11,12}; do
  # ConnectTimeout keeps each attempt from hanging while the node is still down
  until ssh -o ConnectTimeout=5 "root@$ip" 'exit' 2>/dev/null; do
    echo "Waiting for $ip..."
    sleep 10
  done
done

# Check cluster join logs
ssh root@10.0.200.10 'journalctl -u chainfire-cluster-join.service'
```
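
The wait loop above retries forever if a node never comes back. A minimal sketch of a bounded variant follows. `wait_until` is a hypothetical helper, and the 600-second budget in the example is an assumption based on this step's 10-minute estimate.

```bash
# wait_until TIMEOUT_SECS INTERVAL_SECS CMD [ARGS...]
# Re-runs CMD until it succeeds; returns 1 once TIMEOUT_SECS have elapsed.
wait_until() {
  local timeout="$1" interval="$2"; shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (not run here):
#   for ip in 10.0.200.{10,11,12}; do
#     wait_until 600 10 ssh -o ConnectTimeout=5 "root@$ip" exit || break
#   done
```

Because the command is passed as arguments rather than a string, the same helper also works for health probes such as a `curl` call.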

### Step 9: Verify Cluster Health (5 minutes)

```bash
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Expected output:
# {
#   "members": [
#     {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
#     {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
#     {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
#   ]
# }

# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Check IAM service
curl -k https://node01.example.com:8080/health | jq
```
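
When reading the member list, the useful question is whether enough members are healthy to keep a majority. The sketch below assumes Chainfire follows the usual Raft majority rule, which the `raft_addr` fields suggest but this guide does not state. `quorum_size` and `has_quorum` are hypothetical helpers.

```bash
# Majority needed for a group of N voting members: floor(N/2) + 1.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum HEALTHY TOTAL: succeeds when HEALTHY members form a majority.
has_quorum() {
  local healthy="$1" total="$2"
  (( healthy >= total / 2 + 1 ))
}

# Example wiring (not run here), counting healthy members with jq:
#   members=$(curl -sk https://node01.example.com:2379/admin/cluster/members)
#   total=$(jq '.members | length' <<<"$members")
#   healthy=$(jq '[.members[] | select(.status == "healthy")] | length' <<<"$members")
#   has_quorum "$healthy" "$total" && echo "quorum OK ($healthy/$total)"
```

For the 3-node cluster in this guide, `quorum_size 3` is 2, so the cluster tolerates one unhealthy member.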

### Step 10: Final Validation (5 minutes)

```bash
# Run comprehensive health check
/srv/provisioning/scripts/verify-cluster.sh

# Test write/read
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'

curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
```

---

## Essential Commands

### PXE Server Management
```bash
# Status
sudo systemctl status dhcpd4 atftpd nginx

# Restart services
sudo systemctl restart dhcpd4 atftpd nginx

# View DHCP leases
sudo cat /var/lib/dhcp/dhcpd.leases

# Monitor PXE boot
sudo tcpdump -i eth0 -n port 67 or port 68 or port 69
```

### Node Provisioning
```bash
# Single node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  root@10.0.100.50

# With debug output
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  --no-reboot \
  root@10.0.100.50
```

### Cluster Operations
```bash
# List cluster members
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Add new member
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.200.13:2380"}'

# Remove member
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04

# Check leader
curl -k https://node01.example.com:2379/admin/cluster/leader | jq
```

### Node Management
```bash
# Check service status
ssh root@node01.example.com 'systemctl status chainfire flaredb iam'

# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -f'

# Rollback NixOS generation
ssh root@node01.example.com 'nixos-rebuild switch --rollback'

# Reboot node
ssh root@node01.example.com 'reboot'
```

### Health Checks
```bash
# All services on one node
for port in 2379 2479 8080 9090 9091; do
  curl -k https://node01.example.com:$port/health 2>/dev/null | jq -c
done

# Cluster-wide health
for node in node01 node02 node03; do
  echo "$node:"
  curl -k https://${node}.example.com:2379/health | jq -c
done
```

---

## Quick Troubleshooting Tips

### PXE Boot Not Working
```bash
# Check DHCP server
sudo systemctl status dhcpd4
sudo journalctl -u dhcpd4 -n 50

# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Verify BIOS settings: PXE enabled, network first in boot order
```

### nixos-anywhere Fails
```bash
# SSH to installer and check disks
ssh root@10.0.100.50 'lsblk'

# Wipe disk if needed
ssh root@10.0.100.50 'wipefs -a /dev/sda && sgdisk --zap-all /dev/sda'

# Retry with debug
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  root@10.0.100.50 2>&1 | tee provision.log
```

### Cluster Join Fails
```bash
# Check first-boot logs
ssh root@node01.example.com 'journalctl -u chainfire-cluster-join.service'

# Verify cluster-config.json
ssh root@node01.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'

# Manual join
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node02","raft_addr":"10.0.200.11:2380"}'
```

### Service Won't Start
```bash
# Check status and logs
ssh root@node01.example.com 'systemctl status chainfire.service'
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'

# Verify configuration
ssh root@node01.example.com 'ls -l /etc/nixos/secrets/'

# Check ports
ssh root@node01.example.com 'ss -tlnp | grep 2379'
```

### Network Issues
```bash
# Test connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'

# Check firewall
ssh root@node01.example.com 'iptables -L -n | grep 2379'

# Test specific port
ssh root@node01.example.com 'nc -zv node02.example.com 2379'
```
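
If `nc` is not installed on a node, bash's built-in `/dev/tcp` redirection can run the same port test. The sketch below assumes a Linux host with bash and coreutils `timeout`. `check_port` and `check_cluster_ports` are hypothetical helpers, and the port list is the one named under "Firewall Blocking Ports" in this guide.

```bash
# check_port HOST PORT [TIMEOUT_SECS]: succeeds if a TCP connection opens.
check_port() {
  local host="$1" port="$2" timeout_secs="${3:-3}"
  # The redirection fails (non-zero exit) when the connection is refused;
  # timeout covers hosts that silently drop packets.
  timeout "$timeout_secs" bash -c ': </dev/tcp/"$1"/"$2"' _ "$host" "$port" 2>/dev/null
}

# Probe the cluster ports on one host and report each result.
check_cluster_ports() {
  local host="$1" port rc=0
  for port in 2379 2380 2479 2480; do
    if check_port "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port unreachable"
      rc=1
    fi
  done
  return "$rc"
}

# Example: check_cluster_ports node02.example.com
```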

---

## Common Pitfalls

1. **Incorrect DHCP Configuration**
   - Symptom: Nodes get an IP address but don't download the bootloader
   - Fix: Verify the `next-server` and `filename` options in dhcpd.conf

2. **Wrong Bootstrap Flag**
   - Symptom: First 3 nodes fail to form a cluster
   - Fix: Ensure all 3 have `"bootstrap": true` in cluster-config.json

3. **Missing TLS Certificates**
   - Symptom: Services start but cannot communicate
   - Fix: Verify certificates exist in `/etc/nixos/secrets/` with correct permissions

4. **Firewall Blocking Ports**
   - Symptom: Cluster members cannot reach each other
   - Fix: Open ports 2379, 2380, 2479, 2480 between nodes (on NixOS, declare them in `networking.firewall.allowedTCPPorts` so the rules survive a rebuild)

5. **PXE Boot Loops**
   - Symptom: Node keeps booting from network after installation
   - Fix: Change BIOS boot order (disk before network) or use BMC to set boot device
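
The bootstrap-flag pitfall can be caught on the provisioning server before any node boots. The sketch below uses `grep` rather than `jq` so it has no extra dependencies. `check_bootstrap_flags` is a hypothetical helper, and the path layout is the one created in Step 4.

```bash
# check_bootstrap_flags NODES_ROOT NODE...
# Fails if any listed node's cluster-config.json lacks "bootstrap": true.
check_bootstrap_flags() {
  local root="$1"; shift
  local node file rc=0
  for node in "$@"; do
    file="$root/$node.example.com/secrets/cluster-config.json"
    # Tolerate arbitrary whitespace around the colon
    if grep -Eq '"bootstrap"[[:space:]]*:[[:space:]]*true' "$file" 2>/dev/null; then
      echo "$node: bootstrap=true"
    else
      echo "$node: bootstrap flag missing or false ($file)"
      rc=1
    fi
  done
  return "$rc"
}

# Example: check_bootstrap_flags /srv/provisioning/nodes node01 node02 node03
```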

---

## Adding Additional Nodes

After the bootstrap cluster is healthy:

```bash
# 1. Create node configuration (worker profile)
mkdir -p /srv/provisioning/nodes/node04.example.com/secrets

# 2. cluster-config.json with bootstrap=false
echo '{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.13:2380"
}' > /srv/provisioning/nodes/node04.example.com/secrets/cluster-config.json

# 3. Power on and provision
ipmitool -I lanplus -H 10.0.10.54 -U admin chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin chassis power on

# Wait 60s
sleep 60

# 4. Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  root@10.0.100.60

# 5. Verify join
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```

---

## Rolling Updates

```bash
#!/usr/bin/env bash
# Update one node at a time. Run from the provisioning server, where the flake lives.
set -euo pipefail

NODES=("node01" "node02" "node03")

for node in "${NODES[@]}"; do
  echo "Updating $node..."

  # Deploy new configuration (evaluate the local flake, activate on the node over SSH)
  nixos-rebuild switch --flake "/srv/provisioning#$node" \
    --target-host "root@$node.example.com"

  # Wait for services to stabilize
  sleep 30

  # Verify health (-f makes curl fail on HTTP errors, which aborts the rollout)
  curl -kf "https://${node}.example.com:2379/health" | jq

  echo "$node updated successfully"
done
```
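
A fixed `sleep 30` is either too long or too short depending on the node. The sketch below polls health with a retry budget and stops the rollout at the first node that fails, so the remaining nodes keep their current generation. `rolling_update`, `HEALTH_RETRIES`, and `HEALTH_INTERVAL` are hypothetical names. The update and health commands are passed in as functions and are not part of this guide.

```bash
# ASSUMPTION: retry budget of 6 probes, 10 seconds apart; tune as needed.
HEALTH_RETRIES="${HEALTH_RETRIES:-6}"
HEALTH_INTERVAL="${HEALTH_INTERVAL:-10}"

# rolling_update UPDATE_CMD HEALTH_CMD NODE...
# Runs "UPDATE_CMD NODE", then polls "HEALTH_CMD NODE"; returns 1 at the
# first node that fails either step.
rolling_update() {
  local update_cmd="$1" health_cmd="$2"; shift 2
  local node attempt healthy
  for node in "$@"; do
    echo "updating $node"
    if ! "$update_cmd" "$node"; then
      echo "update failed on $node; stopping" >&2
      return 1
    fi
    healthy=0
    for (( attempt = 1; attempt <= HEALTH_RETRIES; attempt++ )); do
      if "$health_cmd" "$node"; then healthy=1; break; fi
      sleep "$HEALTH_INTERVAL"
    done
    if (( ! healthy )); then
      echo "$node unhealthy after update; stopping" >&2
      return 1
    fi
    echo "$node ok"
  done
}

# Hypothetical wiring (not run here):
#   update_node()  { <nixos-rebuild invocation for "$1" from the script above>; }
#   node_healthy() { curl -kfs "https://$1.example.com:2379/health" >/dev/null; }
#   rolling_update update_node node_healthy node01 node02 node03
```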

---

## Next Steps

After successful deployment:

1. **Configure Monitoring**
   - Deploy Prometheus and Grafana
   - Add cluster health dashboards
   - Set up alerting rules

2. **Enable Backups**
   - Configure automated backups for Chainfire/FlareDB data
   - Test restore procedures
   - Document backup schedule

3. **Security Hardening**
   - Remove `-k` flags from curl commands (validate TLS)
   - Implement network segmentation (VLANs)
   - Enable audit logging
   - Set up log aggregation

4. **Documentation**
   - Document node inventory (MAC addresses, IPs, roles)
   - Create runbooks for common operations
   - Update network diagrams

---

## Reference Documentation

- **Full Runbook:** [RUNBOOK.md](RUNBOOK.md)
- **Hardware Guide:** [HARDWARE.md](HARDWARE.md)
- **Network Reference:** [NETWORK.md](NETWORK.md)
- **Command Reference:** [COMMANDS.md](COMMANDS.md)
- **Design Document:** [design.md](design.md)

---

## Support

For detailed troubleshooting and advanced topics, see the full [RUNBOOK.md](RUNBOOK.md).

**Key Contacts:**
- Infrastructure Team: infra@example.com
- Emergency Escalation: oncall@example.com

**Useful Resources:**
- NixOS Manual: https://nixos.org/manual/nixos/stable/
- nixos-anywhere: https://github.com/nix-community/nixos-anywhere
- iPXE Documentation: https://ipxe.org/

---

**Document End**