photoncloud-monorepo/docs/por/T032-baremetal-provisioning/QUICKSTART.md

# Bare-Metal Provisioning Quick Start Guide

**Target Audience:** Experienced operators familiar with NixOS and PXE boot
**Time Required:** 2-4 hours for 3-node cluster
**Last Updated:** 2025-12-10

## Prerequisites Checklist

- [ ] 3+ bare-metal servers with PXE boot enabled
- [ ] Network switch and cabling ready
- [ ] NixOS provisioning workstation with flakes enabled
- [ ] SSH key pair generated
- [ ] BMC/IPMI access configured (optional but recommended)

## 10-Step Deployment Process

### Step 1: Deploy PXE Server (15 minutes)

```bash
# On provisioning server (NixOS)
git clone <plasmacloud-repo>
cd chainfire/baremetal/pxe-server

# Edit configuration
sudo vim /etc/nixos/pxe-config.nix
# Set: serverAddress, subnet, netmask, range, nodes (MAC addresses)

# Add module import
echo 'imports = [ ./chainfire/baremetal/pxe-server/nixos-module.nix ];' | \
  sudo tee -a /etc/nixos/configuration.nix

# Apply configuration
sudo nixos-rebuild switch
```

**Validate:**
```bash
sudo systemctl status dhcpd4 atftpd nginx
curl http://localhost:8080/health
```

### Step 2: Build Netboot Images (20 minutes)

```bash
cd baremetal/image-builder

# Build all profiles
./build-images.sh

# Deploy to PXE server
sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/
sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/
```

**Validate:**
```bash
curl -I http://localhost:8080/boot/nixos/control-plane/bzImage
ls -lh /var/lib/pxe-boot/nixos/*/
```

### Step 3: Generate TLS Certificates (10 minutes)

```bash
# Generate CA
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem \
    -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365
done
```

### Step 4: Create Node Configurations (15 minutes)

```bash
mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/secrets

# For each node, create:
# 1. configuration.nix (see template below)
# 2. disko.nix (disk layout)
# 3. secrets/cluster-config.json
# 4. Copy TLS certificates to secrets/
```

**Minimal configuration.nix template:**
```nix
{ config, pkgs, lib, ... }:
{
  imports = [
    ../../profiles/control-plane.nix
    ../../common/base.nix
    ./disko.nix
  ];

  networking = {
    hostName = "node01";
    domain = "example.com";
    interfaces.eth0.ipv4.addresses = [{
      address = "10.0.200.10";
      prefixLength = 24;
    }];
    defaultGateway = "10.0.200.1";
    nameservers = [ "10.0.200.1" ];
  };

  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.iam.enable = true;
  services.first-boot-automation.enable = true;

  system.stateVersion = "24.11";
}
```

**cluster-config.json (bootstrap nodes):**
```json
{
  "node_id": "node01",
  "bootstrap": true,
  "raft_addr": "10.0.200.10:2380",
  "initial_peers": [
    "node01.example.com:2380",
    "node02.example.com:2380",
    "node03.example.com:2380"
  ]
}
```

### Step 5: Power On Nodes (5 minutes)

```bash
# Via BMC (example with ipmitool)
for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do
  ipmitool -I lanplus -H $ip -U admin -P password \
    chassis bootdev pxe options=persistent
  ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done

# Or physically: Power on servers with PXE boot enabled in BIOS
```

### Step 6: Verify PXE Boot (5 minutes)

Watch DHCP logs:
```bash
sudo journalctl -u dhcpd4 -f
```

Expected output:
```
DHCPDISCOVER from 52:54:00:12:34:56
DHCPOFFER to 10.0.100.50
DHCPREQUEST from 52:54:00:12:34:56
DHCPACK to 10.0.100.50
```

Test SSH to installer:
```bash
# Wait 60-90 seconds for boot
ssh root@10.0.100.50 'uname -a'
# Expected: Linux ... nixos
```

### Step 7: Run nixos-anywhere (30-60 minutes)

```bash
# Provision all 3 nodes in parallel
for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#${node} \
    --build-on-remote \
    root@10.0.100.5{0,1,2} &  # Adjust IPs
done
wait

echo "Provisioning complete. Nodes will reboot automatically."
```

### Step 8: Wait for First Boot (10 minutes)

Nodes will reboot from disk and run first-boot automation. Monitor:

```bash
# Wait for nodes to come online (check production IPs)
for ip in 10.0.200.{10,11,12}; do
  until ssh root@$ip 'exit' 2>/dev/null; do
    echo "Waiting for $ip..."
    sleep 10
  done
done

# Check cluster join logs
ssh root@10.0.200.10 'journalctl -u chainfire-cluster-join.service'
```

### Step 9: Verify Cluster Health (5 minutes)

```bash
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Expected output:
# {
#   "members": [
#     {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
#     {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
#     {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
#   ]
# }

# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Check IAM service
curl -k https://node01.example.com:8080/health | jq
```

### Step 10: Final Validation (5 minutes)

```bash
# Run comprehensive health check
/srv/provisioning/scripts/verify-cluster.sh

# Test write/read
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'

curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
```

---

## Essential Commands

### PXE Server Management
```bash
# Status
sudo systemctl status dhcpd4 atftpd nginx

# Restart services
sudo systemctl restart dhcpd4 atftpd nginx

# View DHCP leases
sudo cat /var/lib/dhcp/dhcpd.leases

# Monitor PXE boot
sudo tcpdump -i eth0 -n port 67 or port 68 or port 69
```

### Node Provisioning
```bash
# Single node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  root@10.0.100.50

# With debug output
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  --no-reboot \
  root@10.0.100.50
```

### Cluster Operations
```bash
# List cluster members
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Add new member
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.200.13:2380"}'

# Remove member
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04

# Check leader
curl -k https://node01.example.com:2379/admin/cluster/leader | jq
```

### Node Management
```bash
# Check service status
ssh root@node01.example.com 'systemctl status chainfire flaredb iam'

# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -f'

# Rollback NixOS generation
ssh root@node01.example.com 'nixos-rebuild switch --rollback'

# Reboot node
ssh root@node01.example.com 'reboot'
```

### Health Checks
```bash
# All services on one node
for port in 2379 2479 8080 9090 9091; do
  curl -k https://node01.example.com:$port/health 2>/dev/null | jq -c
done

# Cluster-wide health
for node in node01 node02 node03; do
  echo "$node:"
  curl -k https://${node}.example.com:2379/health | jq -c
done
```

---

## Quick Troubleshooting Tips

### PXE Boot Not Working
```bash
# Check DHCP server
sudo systemctl status dhcpd4
sudo journalctl -u dhcpd4 -n 50

# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Verify BIOS settings: PXE enabled, network first in boot order
```

### nixos-anywhere Fails
```bash
# SSH to installer and check disks
ssh root@10.0.100.50 'lsblk'

# Wipe disk if needed
ssh root@10.0.100.50 'wipefs -a /dev/sda && sgdisk --zap-all /dev/sda'

# Retry with debug
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  root@10.0.100.50 2>&1 | tee provision.log
```

### Cluster Join Fails
```bash
# Check first-boot logs
ssh root@node01.example.com 'journalctl -u chainfire-cluster-join.service'

# Verify cluster-config.json
ssh root@node01.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'

# Manual join
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node02","raft_addr":"10.0.200.11:2380"}'
```

### Service Won't Start
```bash
# Check status and logs
ssh root@node01.example.com 'systemctl status chainfire.service'
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'

# Verify configuration
ssh root@node01.example.com 'ls -l /etc/nixos/secrets/'

# Check ports
ssh root@node01.example.com 'ss -tlnp | grep 2379'
```

### Network Issues
```bash
# Test connectivity
ssh root@node01.example.com 'ping -c 3 node02.example.com'

# Check firewall
ssh root@node01.example.com 'iptables -L -n | grep 2379'

# Test specific port
ssh root@node01.example.com 'nc -zv node02.example.com 2379'
```

---

## Common Pitfalls

1. **Incorrect DHCP Configuration**
   - Symptom: Nodes get IP but don't download bootloader
   - Fix: Verify `next-server` and `filename` options in dhcpd.conf

2. **Wrong Bootstrap Flag**
   - Symptom: First 3 nodes fail to form cluster
   - Fix: Ensure all 3 have `"bootstrap": true` in cluster-config.json

3. **Missing TLS Certificates**
   - Symptom: Services start but cannot communicate
   - Fix: Verify certificates exist in `/etc/nixos/secrets/` with correct permissions

4. **Firewall Blocking Ports**
   - Symptom: Cluster members cannot reach each other
   - Fix: Add iptables rules for ports 2379, 2380, 2479, 2480

5. **PXE Boot Loops**
   - Symptom: Node keeps booting from network after installation
   - Fix: Change BIOS boot order (disk before network) or use BMC to set boot device

---

## Adding Additional Nodes

After bootstrap cluster is healthy:

```bash
# 1. Create node configuration (worker profile)
mkdir -p /srv/provisioning/nodes/node04.example.com/secrets

# 2. cluster-config.json with bootstrap=false
echo '{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.13:2380"
}' > /srv/provisioning/nodes/node04.example.com/secrets/cluster-config.json

# 3. Power on and provision
ipmitool -I lanplus -H 10.0.10.54 -U admin chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin chassis power on

# Wait 60s
sleep 60

# 4. Run nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  root@10.0.100.60

# 5. Verify join
curl -k https://node01.example.com:2379/admin/cluster/members | jq
```

---

## Rolling Updates

```bash
#!/bin/bash
# Update one node at a time

NODES=("node01" "node02" "node03")

for node in "${NODES[@]}"; do
  echo "Updating $node..."

  # Deploy new configuration
  ssh root@$node.example.com \
    "nixos-rebuild switch --flake /srv/provisioning#$node"

  # Wait for services to stabilize
  sleep 30

  # Verify health
  curl -k https://${node}.example.com:2379/health | jq

  echo "$node updated successfully"
done
```

---

## Next Steps

After successful deployment:

1. **Configure Monitoring**
   - Deploy Prometheus and Grafana
   - Add cluster health dashboards
   - Set up alerting rules

2. **Enable Backups**
   - Configure automated backups for Chainfire/FlareDB data
   - Test restore procedures
   - Document backup schedule

3. **Security Hardening**
   - Remove `-k` flags from curl commands (validate TLS)
   - Implement network segmentation (VLANs)
   - Enable audit logging
   - Set up log aggregation

4. **Documentation**
   - Document node inventory (MAC addresses, IPs, roles)
   - Create runbooks for common operations
   - Update network diagrams

---

## Reference Documentation

- **Full Runbook:** [RUNBOOK.md](RUNBOOK.md)
- **Hardware Guide:** [HARDWARE.md](HARDWARE.md)
- **Network Reference:** [NETWORK.md](NETWORK.md)
- **Command Reference:** [COMMANDS.md](COMMANDS.md)
- **Design Document:** [design.md](design.md)

---

## Support

For detailed troubleshooting and advanced topics, see the full [RUNBOOK.md](RUNBOOK.md).

**Key Contacts:**
- Infrastructure Team: infra@example.com
- Emergency Escalation: oncall@example.com

**Useful Resources:**
- NixOS Manual: https://nixos.org/manual/nixos/stable/
- nixos-anywhere: https://github.com/nix-community/nixos-anywhere
- iPXE Documentation: https://ipxe.org/

---

**Document End**