# Bare-Metal Provisioning Operator Runbook **Document Version:** 1.0 **Last Updated:** 2025-12-10 **Status:** Production Ready **Author:** PlasmaCloud Infrastructure Team ## 1. Overview ### 1.1 What This Runbook Covers This runbook provides comprehensive, step-by-step instructions for deploying PlasmaCloud infrastructure on bare-metal servers using automated PXE-based provisioning. By following this guide, operators will be able to: - Deploy a complete PlasmaCloud cluster from bare hardware to running services - Bootstrap a 3-node Raft cluster (Chainfire + FlareDB) - Add additional nodes to an existing cluster - Validate cluster health and troubleshoot common issues - Perform operational tasks (updates, maintenance, recovery) ### 1.2 Prerequisites **Required Access and Permissions:** - Root/sudo access on provisioning server - Physical or IPMI/BMC access to bare-metal servers - Network access to provisioning VLAN - SSH key pair for nixos-anywhere **Required Tools:** - NixOS with flakes enabled (provisioning workstation) - curl, jq, ssh client - ipmitool (optional, for remote management) - Serial console access tool (optional) **Required Knowledge:** - Basic understanding of PXE boot process - Linux system administration - Network configuration (DHCP, DNS, firewall) - NixOS basics (declarative configuration, flakes) ### 1.3 Architecture Diagram ``` ┌─────────────────────────────────────────────────────────────────────────┐ │ Bare-Metal Provisioning Flow │ └─────────────────────────────────────────────────────────────────────────┘ Phase 1: PXE Boot Phase 2: Installation ┌──────────────┐ ┌──────────────────┐ │ Bare-Metal │ 1. DHCP Request │ DHCP Server │ │ Server ├─────────────────>│ (PXE Server) │ │ │ └──────────────────┘ │ (powered │ 2. TFTP Get │ │ on, PXE │ bootloader │ │ enabled) │<───────────────────────────┘ │ │ │ 3. iPXE │ 4. HTTP Get ┌──────────────────┐ │ loads │ boot.ipxe │ HTTP Server │ │ ├──────────────────>│ (nginx) │ │ │ └──────────────────┘ │ 5. 
iPXE │ 6. HTTP Get │ │ menu │ kernel+initrd │ │ │<───────────────────────────┘ │ │ │ 7. Boot │ │ NixOS │ │ Installer│ └──────┬───────┘ │ │ 8. SSH Connection ┌──────────────────┐ └───────────────────────────>│ Provisioning │ │ Workstation │ │ │ │ 9. Run │ │ nixos- │ │ anywhere │ └──────┬───────────┘ │ ┌────────────────────┴────────────────────┐ │ │ v v ┌──────────────────────────┐ ┌──────────────────────────┐ │ 10. Partition disks │ │ 11. Install NixOS │ │ (disko) │ │ - Build system │ │ - GPT/LVM/LUKS │ │ - Copy closures │ │ - Format filesystems │ │ - Install bootloader│ │ - Mount /mnt │ │ - Inject secrets │ └──────────────────────────┘ └──────────────────────────┘ Phase 3: First Boot Phase 4: Running Cluster ┌──────────────┐ ┌──────────────────┐ │ Bare-Metal │ 12. Reboot │ NixOS System │ │ Server │ ────────────> │ (from disk) │ └──────────────┘ └──────────────────┘ │ ┌───────────────────┴────────────────────┐ │ 13. First-boot automation │ │ - Chainfire cluster join/bootstrap │ │ - FlareDB cluster join/bootstrap │ │ - IAM initialization │ │ - Health checks │ └───────────────────┬────────────────────┘ │ v ┌──────────────────┐ │ Running Cluster │ │ - All services │ │ healthy │ │ - Raft quorum │ │ - TLS enabled │ └──────────────────┘ ``` ## 2. 
Hardware Requirements

### 2.1 Minimum Specifications Per Node

**Control Plane Nodes (3-5 recommended):**

- CPU: 8 cores / 16 threads (Intel Xeon or AMD EPYC)
- RAM: 32 GB DDR4 ECC
- Storage: 500 GB SSD (NVMe preferred)
- Network: 2x 10 GbE (bonded/redundant)
- BMC: IPMI 2.0 or Redfish compatible

**Worker Nodes:**

- CPU: 16+ cores / 32+ threads
- RAM: 64 GB+ DDR4 ECC
- Storage: 1 TB+ NVMe SSD
- Network: 2x 10 GbE or 2x 25 GbE
- BMC: IPMI 2.0 or Redfish compatible

**All-in-One (Development/Testing):**

- CPU: 16 cores / 32 threads
- RAM: 64 GB DDR4
- Storage: 1 TB SSD
- Network: 1x 10 GbE (minimum)
- BMC: Optional but recommended

### 2.2 Recommended Production Specifications

**Control Plane Nodes:**

- CPU: 16-32 cores (Intel Xeon Gold/Platinum or AMD EPYC)
- RAM: 64-128 GB DDR4 ECC
- Storage: 1-2 TB NVMe SSD (RAID1 for redundancy)
- Network: 2x 25 GbE (active/active bonding)
- BMC: Redfish with SOL (Serial-over-LAN)

**Worker Nodes:**

- CPU: 32-64 cores
- RAM: 128-256 GB DDR4 ECC
- Storage: 2-4 TB NVMe SSD
- Network: 2x 25 GbE or 2x 100 GbE
- GPU: Optional (NVIDIA/AMD for ML workloads)

### 2.3 Hardware Compatibility Matrix

| Vendor     | Model          | Tested  | BIOS | UEFI  | Notes                           |
|------------|----------------|---------|------|-------|---------------------------------|
| Dell       | PowerEdge R640 | Yes     | Yes  | Yes   | Requires BIOS A19+              |
| Dell       | PowerEdge R650 | Yes     | Yes  | Yes   | Best PXE compatibility          |
| HPE        | ProLiant DL360 | Yes     | Yes  | Yes   | Disable Secure Boot             |
| HPE        | ProLiant DL380 | Yes     | Yes  | Yes   | Latest firmware recommended     |
| Supermicro | SYS-2029U      | Yes     | Yes  | Yes   | Requires BMC 1.73+              |
| Lenovo     | ThinkSystem    | Partial | Yes  | Yes   | Some NIC issues on older models |
| Generic    | Whitebox x86   | Partial | Yes  | Maybe | UEFI support varies             |

### 2.4 BIOS/UEFI Settings

**Required Settings:**

- Boot Mode: UEFI (preferred) or Legacy BIOS
- PXE/Network Boot: Enabled on primary NIC
- Boot Order: Network → Disk
- Secure Boot: Disabled (for PXE boot)
- Virtualization: Enabled
(VT-x/AMD-V) - SR-IOV: Enabled (if using advanced networking) **Dell-Specific (iDRAC):** ``` System BIOS → Boot Settings: Boot Mode: UEFI UEFI Network Stack: Enabled PXE Device 1: Integrated NIC 1 System BIOS → System Profile: Profile: Performance ``` **HPE-Specific (iLO):** ``` System Configuration → BIOS/Platform: Boot Mode: UEFI Mode Network Boot: Enabled PXE Support: UEFI Only System Configuration → UEFI Boot Order: 1. Network Adapter (NIC 1) 2. Hard Disk ``` **Supermicro-Specific (IPMI):** ``` BIOS Setup → Boot: Boot mode select: UEFI UEFI Network Stack: Enabled Boot Option #1: UEFI Network BIOS Setup → Advanced → CPU Configuration: Intel Virtualization Technology: Enabled ``` ### 2.5 BMC/IPMI Requirements **Mandatory Features:** - Remote power control (on/off/reset) - Boot device selection (PXE/disk) - Remote console access (KVM-over-IP or SOL) **Recommended Features:** - Virtual media mounting - Sensor monitoring (temperature, fans, PSU) - Event logging - SMTP alerting **Network Configuration:** - Dedicated BMC network (separate VLAN recommended) - Static IP or DHCP reservation - HTTPS access enabled - Default credentials changed ## 3. 
Network Setup ### 3.1 Network Topology **Single-Segment Topology (Simple):** ``` ┌─────────────────────────────────────────────────────┐ │ Provisioning Server PXE/DHCP/HTTP │ │ 10.0.100.10 │ └──────────────┬──────────────────────────────────────┘ │ │ Layer 2 Switch (unmanaged) │ ┬──────────┴──────────┬─────────────┬ │ │ │ ┌───┴────┐ ┌────┴─────┐ ┌───┴────┐ │ Node01 │ │ Node02 │ │ Node03 │ │10.0.100│ │ 10.0.100 │ │10.0.100│ │ .50 │ │ .51 │ │ .52 │ └────────┘ └──────────┘ └────────┘ ``` **Multi-VLAN Topology (Production):** ``` ┌──────────────────────────────────────────────────────┐ │ Management Network (VLAN 10) │ │ - Provisioning Server: 10.0.10.10 │ │ - BMC/IPMI: 10.0.10.50-99 │ └──────────────────┬───────────────────────────────────┘ │ ┌──────────────────┴───────────────────────────────────┐ │ Provisioning Network (VLAN 100) │ │ - PXE Boot: 10.0.100.0/24 │ │ - DHCP Range: 10.0.100.100-200 │ └──────────────────┬───────────────────────────────────┘ │ ┌──────────────────┴───────────────────────────────────┐ │ Production Network (VLAN 200) │ │ - Static IPs: 10.0.200.10-99 │ │ - Service Traffic │ └──────────────────┬───────────────────────────────────┘ │ ┌────────┴────────┐ │ L3 Switch │ │ (VLANs, Routing)│ └────────┬─────────┘ │ ┬───────────┴──────────┬─────────┬ │ │ │ ┌────┴────┐ ┌────┴────┐ │ │ Node01 │ │ Node02 │ │... 
│ eth0:   │  │ eth0:   │
│ VLAN100 │  │ VLAN100 │
│ eth1:   │  │ eth1:   │
│ VLAN200 │  │ VLAN200 │
└─────────┘  └─────────┘
```

### 3.2 DHCP Server Configuration

**ISC DHCP Configuration (`/etc/dhcp/dhcpd.conf`):**

```dhcp
# Global options
option architecture-type code 93 = unsigned integer 16;

default-lease-time 600;
max-lease-time 7200;
authoritative;

# Provisioning subnet
subnet 10.0.100.0 netmask 255.255.255.0 {
  range 10.0.100.100 10.0.100.200;
  option routers 10.0.100.1;
  option domain-name-servers 10.0.100.1, 8.8.8.8;
  option domain-name "prov.example.com";

  # PXE boot server
  next-server 10.0.100.10;

  # Architecture-specific boot file selection
  if exists user-class and option user-class = "iPXE" {
    # iPXE already loaded, provide boot script via HTTP
    filename "http://10.0.100.10:8080/boot/ipxe/boot.ipxe";
  } elsif option architecture-type = 00:00 {
    # BIOS (legacy) - load iPXE via TFTP
    filename "undionly.kpxe";
  } elsif option architecture-type = 00:07 {
    # UEFI x86_64 - load iPXE via TFTP
    filename "ipxe.efi";
  } elsif option architecture-type = 00:09 {
    # UEFI x86_64 (alternate) - load iPXE via TFTP
    filename "ipxe.efi";
  } else {
    # Fallback to UEFI
    filename "ipxe.efi";
  }
}

# Static reservations for control plane nodes
host node01 {
  hardware ethernet 52:54:00:12:34:56;
  fixed-address 10.0.100.50;
  option host-name "node01";
}

host node02 {
  hardware ethernet 52:54:00:12:34:57;
  fixed-address 10.0.100.51;
  option host-name "node02";
}

host node03 {
  hardware ethernet 52:54:00:12:34:58;
  fixed-address 10.0.100.52;
  option host-name "node03";
}
```

**Validation Commands:**

```bash
# Test DHCP configuration syntax
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Start DHCP server
sudo systemctl start isc-dhcp-server
sudo systemctl enable isc-dhcp-server

# Monitor DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases

# Test DHCP response
sudo nmap --script broadcast-dhcp-discover -e eth0
```

### 3.3 DNS Requirements

**Forward DNS Zone (`example.com`):**

```zone
; Control plane nodes
node01.example.com.
IN A 10.0.200.10
node02.example.com.    IN A 10.0.200.11
node03.example.com.    IN A 10.0.200.12

; Worker nodes
worker01.example.com.  IN A 10.0.200.20
worker02.example.com.  IN A 10.0.200.21

; Service VIPs (optional, for load balancing)
chainfire.example.com. IN A 10.0.200.100
flaredb.example.com.   IN A 10.0.200.101
iam.example.com.       IN A 10.0.200.102
```

**Reverse DNS Zone (`200.0.10.in-addr.arpa`):**

```zone
; Control plane nodes
10.200.0.10.in-addr.arpa. IN PTR node01.example.com.
11.200.0.10.in-addr.arpa. IN PTR node02.example.com.
12.200.0.10.in-addr.arpa. IN PTR node03.example.com.
```

**Validation:**

```bash
# Test forward resolution
dig +short node01.example.com

# Test reverse resolution
dig +short -x 10.0.200.10

# Test from target node after provisioning
ssh root@10.0.100.50 'hostname -f'
```

### 3.4 Firewall Rules

**Service Port Matrix (see NETWORK.md for complete reference):**

| Service   | API Port | Raft Port | Additional    | Protocol |
|-----------|----------|-----------|---------------|----------|
| Chainfire | 2379     | 2380      | 2381 (gossip) | TCP      |
| FlareDB   | 2479     | 2480      | -             | TCP      |
| IAM       | 8080     | -         | -             | TCP      |
| PlasmaVMC | 9090     | -         | -             | TCP      |
| PrismNET  | 9091     | -         | -             | TCP      |
| FlashDNS  | 53       | -         | -             | TCP/UDP  |
| FiberLB   | 9092     | -         | -             | TCP      |
| K8sHost   | 10250    | -         | -             | TCP      |

**iptables Rules (Provisioning Server):**

```bash
#!/bin/bash
# Provisioning server firewall rules

# Allow DHCP
iptables -A INPUT -p udp --dport 67 -j ACCEPT
iptables -A INPUT -p udp --dport 68 -j ACCEPT

# Allow TFTP
iptables -A INPUT -p udp --dport 69 -j ACCEPT

# Allow HTTP (boot server)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT

# Allow SSH (for nixos-anywhere)
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
```

**iptables Rules (Cluster Nodes):**

```bash
#!/bin/bash
# Cluster node firewall rules

# Allow established connections and loopback first; without these,
# the final DROP rule would break return traffic (the nftables
# variant below already includes the equivalent rules)
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -i lo -j ACCEPT

# Allow SSH (management)
iptables -A INPUT -p tcp --dport 22 -s 10.0.0.0/8 -j ACCEPT

# Allow Chainfire (from cluster subnet only)
iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2381 -s 10.0.200.0/24 -j ACCEPT

# Allow FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.200.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 2480 -s 10.0.200.0/24 -j ACCEPT

# Allow IAM (from cluster and client subnets)
iptables -A INPUT -p tcp --dport 8080 -s 10.0.0.0/8 -j ACCEPT

# Drop all other traffic
iptables -A INPUT -j DROP
```

**nftables Rules (Modern Alternative):**

```nft
#!/usr/sbin/nft -f

flush ruleset

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    # Allow established connections
    ct state established,related accept

    # Allow loopback
    iif lo accept

    # Allow SSH
    tcp dport 22 ip saddr 10.0.0.0/8 accept

    # Allow cluster services from cluster subnet
    tcp dport { 2379, 2380, 2381, 2479, 2480 } ip saddr 10.0.200.0/24 accept

    # Allow IAM from internal network
    tcp dport 8080 ip saddr 10.0.0.0/8 accept
  }
}
```

### 3.5 Static IP Allocation Strategy

**IP Allocation Plan:**

```
10.0.100.0/24 - Provisioning network (DHCP during install)
  .1        - Gateway
  .10       - PXE/DHCP/HTTP server
  .50-.79   - Control plane nodes (static reservations)
  .80-.99   - Worker nodes (static reservations)
  .100-.200 - DHCP pool (temporary during provisioning)

10.0.200.0/24 - Production network (static IPs)
  .1        - Gateway
  .10-.19   - Control plane nodes
  .20-.99   - Worker nodes
  .100-.199 - Service VIPs
```

### 3.6 Network Bandwidth Requirements

**Per-Node During Provisioning:**

- PXE boot: ~200-500 MB (kernel + initrd)
- nixos-anywhere: ~1-5 GB (NixOS closures)
- Time: 5-15 minutes on 1 Gbps link

**Production Cluster:**

- Control plane: 1 Gbps minimum, 10 Gbps recommended
- Workers: 10 Gbps minimum, 25 Gbps recommended
- Inter-node latency: <1ms ideal, <5ms acceptable

## 4.
Pre-Deployment Checklist

Complete this checklist before beginning deployment:

### 4.1 Hardware Checklist

- [ ] All servers racked and powered
- [ ] All network cables connected (data + BMC)
- [ ] All power supplies connected (redundant if available)
- [ ] BMC/IPMI network configured
- [ ] BMC credentials documented
- [ ] BIOS/UEFI settings configured per section 2.4
- [ ] PXE boot enabled and first in boot order
- [ ] Secure Boot disabled (if using UEFI)
- [ ] Hardware inventory recorded (MAC addresses, serial numbers)

### 4.2 Network Checklist

- [ ] Network switches configured (VLANs, trunking)
- [ ] DHCP server configured and tested
- [ ] DNS forward/reverse zones created
- [ ] Firewall rules configured
- [ ] Network connectivity verified (ping tests)
- [ ] Bandwidth validated (iperf between nodes)
- [ ] DHCP relay configured (if multi-subnet)
- [ ] NTP server configured for time sync

### 4.3 PXE Server Checklist

- [ ] PXE server deployed (see T032.S2)
- [ ] DHCP service running and healthy
- [ ] TFTP service running and healthy
- [ ] HTTP service running and healthy
- [ ] iPXE bootloaders downloaded (undionly.kpxe, ipxe.efi)
- [ ] NixOS netboot images built and uploaded (see T032.S3)
- [ ] Boot script configured (boot.ipxe)
- [ ] Health endpoints responding

**Validation:**

```bash
# On PXE server
sudo systemctl status isc-dhcp-server
sudo systemctl status atftpd
sudo systemctl status nginx

# Test HTTP access
curl http://10.0.100.10:8080/boot/ipxe/boot.ipxe
curl http://10.0.100.10:8080/health

# Test TFTP access
tftp 10.0.100.10 -c get undionly.kpxe /tmp/test.kpxe
```

### 4.4 Node Configuration Checklist

- [ ] Per-node NixOS configurations created (`/srv/provisioning/nodes/`)
- [ ] Hardware configurations generated or templated
- [ ] Disko disk layouts defined
- [ ] Network settings configured (static IPs, VLANs)
- [ ] Service selections defined (control-plane vs worker)
- [ ] Cluster configuration JSON files created
- [ ] Node inventory documented (MAC → hostname
→ role)

### 4.5 TLS Certificates Checklist

- [ ] CA certificate generated
- [ ] Per-node certificates generated
- [ ] Certificate files copied to secrets directories
- [ ] Certificate permissions set (0600 for private keys, matching the chmod commands in Step 3.4)
- [ ] Certificate expiry dates documented
- [ ] Rotation procedure documented

**Generate Certificates:**

```bash
# Generate CA (if not already done)
openssl genrsa -out ca-key.pem 4096
openssl req -x509 -new -nodes -key ca-key.pem -days 3650 \
  -out ca-cert.pem -subj "/CN=PlasmaCloud CA"

# Generate per-node certificate; the subjectAltName extension is
# required by clients that reject CN-only certificates
for node in node01 node02 node03; do
  openssl genrsa -out ${node}-key.pem 4096
  openssl req -new -key ${node}-key.pem -out ${node}-csr.pem \
    -subj "/CN=${node}.example.com"
  openssl x509 -req -in ${node}-csr.pem -CA ca-cert.pem -CAkey ca-key.pem \
    -CAcreateserial -out ${node}-cert.pem -days 365 \
    -extfile <(printf "subjectAltName=DNS:${node}.example.com")
done
```

### 4.6 Provisioning Workstation Checklist

- [ ] NixOS or Nix package manager installed
- [ ] Nix flakes enabled
- [ ] SSH key pair generated for provisioning
- [ ] SSH public key added to netboot images
- [ ] Network access to provisioning VLAN
- [ ] Git repository cloned (if using version control)
- [ ] nixos-anywhere installed: `nix profile install github:nix-community/nixos-anywhere`

## 5. Deployment Workflow

### 5.1 Phase 1: PXE Server Setup

**Reference:** See `/home/centra/cloud/chainfire/baremetal/pxe-server/` (T032.S2)

**Step 1.1: Deploy PXE Server Using NixOS Module**

Create PXE server configuration:

```nix
# /etc/nixos/pxe-server.nix
{ config, pkgs, lib, ...
}: { imports = [ /path/to/chainfire/baremetal/pxe-server/nixos-module.nix ]; services.centra-pxe-server = { enable = true; interface = "eth0"; serverAddress = "10.0.100.10"; dhcp = { subnet = "10.0.100.0"; netmask = "255.255.255.0"; broadcast = "10.0.100.255"; range = { start = "10.0.100.100"; end = "10.0.100.200"; }; router = "10.0.100.1"; domainNameServers = [ "10.0.100.1" "8.8.8.8" ]; }; nodes = { "52:54:00:12:34:56" = { profile = "control-plane"; hostname = "node01"; ipAddress = "10.0.100.50"; }; "52:54:00:12:34:57" = { profile = "control-plane"; hostname = "node02"; ipAddress = "10.0.100.51"; }; "52:54:00:12:34:58" = { profile = "control-plane"; hostname = "node03"; ipAddress = "10.0.100.52"; }; }; }; } ``` Apply configuration: ```bash sudo nixos-rebuild switch -I nixos-config=/etc/nixos/pxe-server.nix ``` **Step 1.2: Verify PXE Services** ```bash # Check all services are running sudo systemctl status dhcpd4.service sudo systemctl status atftpd.service sudo systemctl status nginx.service # Test DHCP server sudo journalctl -u dhcpd4 -f & # Power on a test server and watch for DHCP requests # Test TFTP server tftp localhost -c get undionly.kpxe /tmp/test.kpxe ls -lh /tmp/test.kpxe # Should show ~100KB file # Test HTTP server curl http://localhost:8080/health # Expected: {"status":"healthy","services":{"dhcp":"running","tftp":"running","http":"running"}} curl http://localhost:8080/boot/ipxe/boot.ipxe # Expected: iPXE boot script content ``` ### 5.2 Phase 2: Build Netboot Images **Reference:** See `/home/centra/cloud/baremetal/image-builder/` (T032.S3) **Step 2.1: Build Images for All Profiles** ```bash cd /home/centra/cloud/baremetal/image-builder # Build all profiles ./build-images.sh # Or build specific profile ./build-images.sh --profile control-plane ./build-images.sh --profile worker ./build-images.sh --profile all-in-one ``` **Expected Output:** ``` Building netboot image for control-plane... Building initrd... [... Nix build output ...] 
✓ Build complete: artifacts/control-plane/initrd (234 MB) ✓ Build complete: artifacts/control-plane/bzImage (12 MB) ``` **Step 2.2: Copy Images to PXE Server** ```bash # Automatic (if PXE server directory exists) ./build-images.sh --deploy # Manual copy sudo cp artifacts/control-plane/* /var/lib/pxe-boot/nixos/control-plane/ sudo cp artifacts/worker/* /var/lib/pxe-boot/nixos/worker/ sudo cp artifacts/all-in-one/* /var/lib/pxe-boot/nixos/all-in-one/ ``` **Step 2.3: Verify Image Integrity** ```bash # Check file sizes (should be reasonable) ls -lh /var/lib/pxe-boot/nixos/*/ # Verify images are accessible via HTTP curl -I http://10.0.100.10:8080/boot/nixos/control-plane/bzImage # Expected: HTTP/1.1 200 OK, Content-Length: ~12000000 curl -I http://10.0.100.10:8080/boot/nixos/control-plane/initrd # Expected: HTTP/1.1 200 OK, Content-Length: ~234000000 ``` ### 5.3 Phase 3: Prepare Node Configurations **Step 3.1: Generate Node-Specific NixOS Configs** Create directory structure: ```bash mkdir -p /srv/provisioning/nodes/{node01,node02,node03}.example.com/{secrets,} ``` **Node Configuration Template (`nodes/node01.example.com/configuration.nix`):** ```nix { config, pkgs, lib, ... 
}: { imports = [ ../../profiles/control-plane.nix ../../common/base.nix ./hardware.nix ./disko.nix ]; # Hostname and domain networking = { hostName = "node01"; domain = "example.com"; usePredictableInterfaceNames = false; # Use eth0, eth1 # Provisioning interface (temporary) interfaces.eth0 = { useDHCP = false; ipv4.addresses = [{ address = "10.0.100.50"; prefixLength = 24; }]; }; # Production interface interfaces.eth1 = { useDHCP = false; ipv4.addresses = [{ address = "10.0.200.10"; prefixLength = 24; }]; }; defaultGateway = "10.0.200.1"; nameservers = [ "10.0.200.1" "8.8.8.8" ]; }; # Enable PlasmaCloud services services.chainfire = { enable = true; port = 2379; raftPort = 2380; gossipPort = 2381; settings = { node_id = "node01"; cluster_name = "prod-cluster"; tls = { cert_path = "/etc/nixos/secrets/node01-cert.pem"; key_path = "/etc/nixos/secrets/node01-key.pem"; ca_path = "/etc/nixos/secrets/ca-cert.pem"; }; }; }; services.flaredb = { enable = true; port = 2479; raftPort = 2480; settings = { node_id = "node01"; cluster_name = "prod-cluster"; chainfire_endpoint = "https://localhost:2379"; tls = { cert_path = "/etc/nixos/secrets/node01-cert.pem"; key_path = "/etc/nixos/secrets/node01-key.pem"; ca_path = "/etc/nixos/secrets/ca-cert.pem"; }; }; }; services.iam = { enable = true; port = 8080; settings = { flaredb_endpoint = "https://localhost:2479"; tls = { cert_path = "/etc/nixos/secrets/node01-cert.pem"; key_path = "/etc/nixos/secrets/node01-key.pem"; ca_path = "/etc/nixos/secrets/ca-cert.pem"; }; }; }; # Enable first-boot automation services.first-boot-automation = { enable = true; configFile = "/etc/nixos/secrets/cluster-config.json"; }; system.stateVersion = "24.11"; } ``` **Step 3.2: Create cluster-config.json for Each Node** **Bootstrap Node (node01):** ```json { "node_id": "node01", "node_role": "control-plane", "bootstrap": true, "cluster_name": "prod-cluster", "leader_url": "https://node01.example.com:2379", "raft_addr": "10.0.200.10:2380", "initial_peers": 
[ "node01.example.com:2380", "node02.example.com:2380", "node03.example.com:2380" ], "flaredb_peers": [ "node01.example.com:2480", "node02.example.com:2480", "node03.example.com:2480" ] } ``` Copy to secrets: ```bash cp cluster-config-node01.json /srv/provisioning/nodes/node01.example.com/secrets/cluster-config.json cp cluster-config-node02.json /srv/provisioning/nodes/node02.example.com/secrets/cluster-config.json cp cluster-config-node03.json /srv/provisioning/nodes/node03.example.com/secrets/cluster-config.json ``` **Step 3.3: Generate Disko Disk Layouts** **Simple Single-Disk Layout (`nodes/node01.example.com/disko.nix`):** ```nix { disks ? [ "/dev/sda" ], ... }: { disko.devices = { disk = { main = { type = "disk"; device = builtins.head disks; content = { type = "gpt"; partitions = { ESP = { size = "1G"; type = "EF00"; content = { type = "filesystem"; format = "vfat"; mountpoint = "/boot"; }; }; root = { size = "100%"; content = { type = "filesystem"; format = "ext4"; mountpoint = "/"; }; }; }; }; }; }; }; } ``` **Step 3.4: Pre-Generate TLS Certificates** ```bash # Copy per-node certificates cp ca-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/ cp node01-cert.pem /srv/provisioning/nodes/node01.example.com/secrets/ cp node01-key.pem /srv/provisioning/nodes/node01.example.com/secrets/ # Set permissions chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/*-cert.pem chmod 644 /srv/provisioning/nodes/node01.example.com/secrets/ca-cert.pem chmod 600 /srv/provisioning/nodes/node01.example.com/secrets/*-key.pem ``` ### 5.4 Phase 4: Bootstrap First 3 Nodes **Step 4.1: Power On Nodes via BMC** ```bash # Using ipmitool (example for Dell/HP/Supermicro) for ip in 10.0.10.50 10.0.10.51 10.0.10.52; do ipmitool -I lanplus -H $ip -U admin -P password chassis bootdev pxe options=persistent ipmitool -I lanplus -H $ip -U admin -P password chassis power on done ``` **Step 4.2: Verify PXE Boot Success** Watch serial console (if available): ```bash # Connect 
via IPMI SOL
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol activate

# Expected output:
# ... DHCP discovery ...
# ... TFTP download undionly.kpxe or ipxe.efi ...
# ... iPXE menu appears ...
# ... Kernel and initrd download ...
# ... NixOS installer boots ...
# ... SSH server starts ...
```

Verify installer is ready:

```bash
# Wait for nodes to appear in DHCP leases
sudo tail -f /var/lib/dhcp/dhcpd.leases

# Test SSH connectivity
ssh root@10.0.100.50 'uname -a'
# Expected: Linux node01 ... nixos
```

**Step 4.3: Run nixos-anywhere Simultaneously on All 3 Nodes**

Create provisioning script:

```bash
#!/bin/bash
# /srv/provisioning/scripts/provision-bootstrap-nodes.sh
set -euo pipefail

NODES=("node01" "node02" "node03")
PROVISION_IPS=("10.0.100.50" "10.0.100.51" "10.0.100.52")
FLAKE_ROOT="/srv/provisioning"

for i in "${!NODES[@]}"; do
  node="${NODES[$i]}"
  ip="${PROVISION_IPS[$i]}"
  echo "Provisioning $node at $ip..."
  nix run github:nix-community/nixos-anywhere -- \
    --flake "$FLAKE_ROOT#$node" \
    --build-on-remote \
    root@$ip &
done

wait
echo "All nodes provisioned successfully!"
```

Run provisioning:

```bash
chmod +x /srv/provisioning/scripts/provision-bootstrap-nodes.sh
/srv/provisioning/scripts/provision-bootstrap-nodes.sh
```

**Expected output per node:**

```
Provisioning node01 at 10.0.100.50...
Connecting via SSH...
Running disko to partition disks...
Building NixOS system...
Installing bootloader...
Copying secrets...
Installation complete. Rebooting...
```

**Step 4.4: Wait for First-Boot Automation**

After reboot, nodes will boot from disk and run first-boot automation.
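The reboot back onto the production network can take several minutes, so fixed `sleep` calls tend to be either too short or wasteful. A small polling helper can wait for SSH instead; this is a sketch, assuming the production addresses from the allocation plan above, and the `retry_until` helper name and the 15-minute timeout are our own choices:

```shell
#!/bin/bash
# retry_until TIMEOUT INTERVAL CMD...: re-run CMD every INTERVAL seconds
# until it succeeds or TIMEOUT seconds elapse; non-zero exit on timeout.
retry_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Example: wait up to 15 minutes per node for SSH on the production network.
# for ip in 10.0.200.10 10.0.200.11 10.0.200.12; do
#   retry_until 900 10 ssh -o ConnectTimeout=5 -o BatchMode=yes "root@$ip" true \
#     && echo "$ip is back up"
# done
```

The same helper works for any readiness probe (for example, a `curl` against a health endpoint) by swapping the command after the timeout and interval arguments.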
Monitor progress:

```bash
# Watch logs on node01 (via SSH after it reboots)
ssh root@10.0.200.10   # Note: now on production network

# Check cluster join services
journalctl -u chainfire-cluster-join.service -f
journalctl -u flaredb-cluster-join.service -f

# Expected log output:
# {"level":"INFO","message":"Waiting for local chainfire service..."}
# {"level":"INFO","message":"Local chainfire healthy"}
# {"level":"INFO","message":"Bootstrap node, cluster initialized"}
# {"level":"INFO","message":"Cluster join complete"}
```

**Step 4.5: Verify Cluster Health**

```bash
# Check Chainfire cluster
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# Expected output:
# {
#   "members": [
#     {"id":"node01","raft_addr":"10.0.200.10:2380","status":"healthy","role":"leader"},
#     {"id":"node02","raft_addr":"10.0.200.11:2380","status":"healthy","role":"follower"},
#     {"id":"node03","raft_addr":"10.0.200.12:2380","status":"healthy","role":"follower"}
#   ]
# }

# Check FlareDB cluster
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Check IAM service
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected"}
```

### 5.5 Phase 5: Add Additional Nodes

**Step 5.1: Prepare Join-Mode Configurations**

Create configuration for node04 (worker profile):

```json
{
  "node_id": "node04",
  "node_role": "worker",
  "bootstrap": false,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.200.20:2380"
}
```

**Step 5.2: Power On and Provision Nodes**

```bash
# Power on node via BMC
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis bootdev pxe
ipmitool -I lanplus -H 10.0.10.54 -U admin -P password chassis power on

# Wait for PXE boot and SSH ready
sleep 60

# Provision node
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node04 \
  --build-on-remote \
  root@10.0.100.60
```

**Step 5.3: Verify Cluster Join via API**

```bash
# Check cluster members
(should include node04)
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node04")'

# Expected:
# {"id":"node04","raft_addr":"10.0.200.20:2380","status":"healthy","role":"follower"}
```

**Step 5.4: Validate Replication and Service Distribution**

```bash
# Write test data on leader
curl -k -X PUT https://node01.example.com:2379/v1/kv/test \
  -H "Content-Type: application/json" \
  -d '{"value":"hello world"}'

# Read from follower (should be replicated)
curl -k https://node02.example.com:2379/v1/kv/test | jq
# Expected: {"key":"test","value":"hello world"}
```

## 6. Verification & Validation

### 6.1 Health Check Commands for All Services

**Chainfire:**

```bash
curl -k https://node01.example.com:2379/health | jq
# Expected: {"status":"healthy","raft":"leader","cluster_size":3}

# Check cluster membership
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 3 (for initial bootstrap)
```

**FlareDB:**

```bash
curl -k https://node01.example.com:2479/health | jq
# Expected: {"status":"healthy","raft":"leader","chainfire":"connected"}

# Query test metric
curl -k https://node01.example.com:2479/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query":"up{job=\"node\"}","time":"now"}'
```

**IAM:**

```bash
curl -k https://node01.example.com:8080/health | jq
# Expected: {"status":"healthy","database":"connected","version":"1.0.0"}

# List users (requires authentication)
curl -k https://node01.example.com:8080/api/users \
  -H "Authorization: Bearer $IAM_TOKEN" | jq
```

**PlasmaVMC:**

```bash
curl -k https://node01.example.com:9090/health | jq
# Expected: {"status":"healthy","vms_running":0}

# List VMs
curl -k https://node01.example.com:9090/api/vms | jq
```

**PrismNET:**

```bash
curl -k https://node01.example.com:9091/health | jq
# Expected: {"status":"healthy","networks":0}
```

**FlashDNS:**

```bash
dig @node01.example.com example.com
# Expected: DNS response with ANSWER section

# Health
check curl -k https://node01.example.com:853/health | jq ``` **FiberLB:** ```bash curl -k https://node01.example.com:9092/health | jq # Expected: {"status":"healthy","backends":0} ``` **K8sHost:** ```bash kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes # Expected: Node list including this node ``` ### 6.2 Cluster Membership Verification ```bash #!/bin/bash # /srv/provisioning/scripts/verify-cluster.sh echo "Checking Chainfire cluster..." curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | {id, status, role}' echo "" echo "Checking FlareDB cluster..." curl -k https://node01.example.com:2479/admin/cluster/members | jq '.members[] | {id, status, role}' echo "" echo "Cluster health summary:" echo " Chainfire nodes: $(curl -sk https://node01.example.com:2379/admin/cluster/members | jq '.members | length')" echo " FlareDB nodes: $(curl -sk https://node01.example.com:2479/admin/cluster/members | jq '.members | length')" echo " Raft leaders: Chainfire=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id'), FlareDB=$(curl -sk https://node01.example.com:2479/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id')" ``` ### 6.3 Raft Leader Election Check ```bash # Identify current leader LEADER=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq -r '.members[] | select(.role=="leader") | .id') echo "Current Chainfire leader: $LEADER" # Verify all followers can reach leader for node in node01 node02 node03; do echo "Checking $node..." curl -sk https://$node.example.com:2379/admin/cluster/leader | jq done ``` ### 6.4 TLS Certificate Validation ```bash # Check certificate expiry for node in node01 node02 node03; do echo "Checking $node certificate..." 
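  # Suggested addition (not in the original loop): `openssl x509 -checkend N`
  # exits non-zero when the certificate expires within N seconds,
  # so the 30-day window below can drive an alert instead of a manual
  # reading of the notBefore/notAfter dates printed next.
  echo | openssl s_client -connect $node.example.com:2379 2>/dev/null \
    | openssl x509 -noout -checkend 2592000 \
    || echo "WARN: $node certificate expires within 30 days"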
echo | openssl s_client -connect $node.example.com:2379 2>/dev/null | openssl x509 -noout -dates done # Verify certificate chain echo | openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem -verify 1 # Expected: Verify return code: 0 (ok) ``` ### 6.5 Network Connectivity Tests ```bash # Test inter-node connectivity (from node01) ssh root@node01.example.com ' for node in node02 node03; do echo "Testing connectivity to $node..." nc -zv $node.example.com 2379 nc -zv $node.example.com 2380 done ' # Test bandwidth (iperf3) ssh root@node02.example.com 'iperf3 -s' & ssh root@node01.example.com 'iperf3 -c node02.example.com -t 10' # Expected: ~10 Gbps on 10GbE, ~1 Gbps on 1GbE ``` ### 6.6 Performance Smoke Tests **Chainfire Write Performance:** ```bash # 1000 writes time for i in {1..1000}; do curl -sk -X PUT https://node01.example.com:2379/v1/kv/test$i \ -H "Content-Type: application/json" \ -d "{\"value\":\"test data $i\"}" > /dev/null done # Expected: <10 seconds on healthy cluster ``` **FlareDB Query Performance:** ```bash # Insert test metrics curl -k -X POST https://node01.example.com:2479/v1/write \ -H "Content-Type: application/json" \ -d '{"metric":"test_metric","value":42,"timestamp":"'$(date -Iseconds)'"}' # Query performance time curl -k https://node01.example.com:2479/v1/query \ -H "Content-Type: application/json" \ -d '{"query":"test_metric","start":"1h","end":"now"}' # Expected: <100ms response time ``` ## 7. 
Common Operations ### 7.1 Adding a New Node **Step 1: Prepare Node Configuration** ```bash # Create node directory mkdir -p /srv/provisioning/nodes/node05.example.com/secrets # Copy template configuration cp /srv/provisioning/nodes/node01.example.com/configuration.nix \ /srv/provisioning/nodes/node05.example.com/ # Edit for new node vim /srv/provisioning/nodes/node05.example.com/configuration.nix # Update: hostName, ipAddresses, node_id ``` **Step 2: Generate Cluster Config (Join Mode)** ```json { "node_id": "node05", "node_role": "worker", "bootstrap": false, "cluster_name": "prod-cluster", "leader_url": "https://node01.example.com:2379", "raft_addr": "10.0.200.21:2380" } ``` **Step 3: Provision Node** ```bash # Power on and PXE boot ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis bootdev pxe ipmitool -I lanplus -H 10.0.10.55 -U admin -P password chassis power on # Wait for SSH sleep 60 # Run nixos-anywhere nix run github:nix-community/nixos-anywhere -- \ --flake /srv/provisioning#node05 \ root@10.0.100.65 ``` **Step 4: Verify Join** ```bash # Check cluster membership curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node05")' ``` ### 7.2 Replacing a Failed Node **Step 1: Remove Failed Node from Cluster** ```bash # Remove from Chainfire cluster curl -k -X DELETE https://node01.example.com:2379/admin/member/node02 # Remove from FlareDB cluster curl -k -X DELETE https://node01.example.com:2479/admin/member/node02 ``` **Step 2: Physically Replace Hardware** - Power off old node - Remove from rack - Install new node - Connect all cables - Configure BMC **Step 3: Provision Replacement Node** ```bash # Use same node ID and configuration nix run github:nix-community/nixos-anywhere -- \ --flake /srv/provisioning#node02 \ root@10.0.100.51 ``` **Step 4: Verify Rejoin** ```bash # Cluster should automatically add node during first-boot curl -k https://node01.example.com:2379/admin/cluster/members | jq ``` ### 7.3 
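Removing a member (7.2 Step 1, and again in 7.6) is only safe while the remaining voters still form a majority. A hedged pre-removal guard, assuming the `/admin/cluster/members` response shape shown in 6.2 and a voters-only cluster:

```shell
#!/bin/bash
# Guard member removal: abort if deleting one member would drop the cluster
# below the majority needed for Raft quorum.
quorum_guard() {
  local total=$1                     # current member count
  local remaining=$(( total - 1 ))   # members left after removal
  local quorum=$(( total / 2 + 1 ))  # majority of the current cluster
  [ "$remaining" -ge "$quorum" ]
}

remove_member() {
  local node_id=$1 total
  total=$(curl -sk https://node01.example.com:2379/admin/cluster/members \
    | jq '.members | length')
  if ! quorum_guard "${total:-0}"; then
    echo "ABORT: removing $node_id from a ${total:-0}-member cluster would break quorum"
    return 1
  fi
  curl -sk -X DELETE "https://node01.example.com:2379/admin/member/$node_id"
}
```

`remove_member node02` refuses the deletion on a 2-node cluster but allows it on 3 or more.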
Updating Node Configuration

**Step 1: Edit Configuration**

```bash
vim /srv/provisioning/nodes/node01.example.com/configuration.nix
# Make changes (e.g., add service, change network config)
```

**Step 2: Build and Deploy**

```bash
# Build configuration locally
nix build /srv/provisioning#node01

# Deploy to node (from node or remote)
nixos-rebuild switch --flake /srv/provisioning#node01
```

**Step 3: Verify Changes**

```bash
# Check active configuration
ssh root@node01.example.com 'nixos-rebuild list-generations'

# Test services still healthy
curl -k https://node01.example.com:2379/health | jq
```

### 7.4 Rolling Updates

**Update Process (One Node at a Time):**

```bash
#!/bin/bash
# /srv/provisioning/scripts/rolling-update.sh

NODES=("node01" "node02" "node03")

for node in "${NODES[@]}"; do
  echo "Updating $node..."

  # Build new configuration
  nix build /srv/provisioning#$node

  # Deploy (test mode first)
  ssh root@$node.example.com "nixos-rebuild test --flake /srv/provisioning#$node"

  # Verify health
  if ! curl -sk https://$node.example.com:2379/health | jq -e '.status == "healthy"'; then
    echo "ERROR: $node unhealthy after test, aborting"
    # nixos-rebuild test does not advance the system profile, so re-activate
    # the still-current generation instead of rolling back past it
    ssh root@$node.example.com "/nix/var/nix/profiles/system/bin/switch-to-configuration switch"
    exit 1
  fi

  # Apply permanently
  ssh root@$node.example.com "nixos-rebuild switch --flake /srv/provisioning#$node"

  # Give services time to settle after switching
  echo "Waiting 30s for stabilization..."
sleep 30 # Final health check curl -k https://$node.example.com:2379/health | jq echo "$node updated successfully" done ``` ### 7.5 Draining a Node for Maintenance **Step 1: Mark Node for Drain** ```bash # Disable node in load balancer (if using one) curl -k -X POST https://node01.example.com:9092/api/backend/node02 \ -d '{"status":"drain"}' ``` **Step 2: Migrate VMs (PlasmaVMC)** ```bash # List VMs on node ssh root@node02.example.com 'systemctl list-units | grep plasmavmc-vm@' # Migrate each VM curl -k -X POST https://node01.example.com:9090/api/vms/vm-001/migrate \ -d '{"target_node":"node03"}' ``` **Step 3: Stop Services** ```bash ssh root@node02.example.com ' systemctl stop plasmavmc.service systemctl stop chainfire.service systemctl stop flaredb.service ' ``` **Step 4: Perform Maintenance** ```bash # Reboot for kernel update, hardware maintenance, etc. ssh root@node02.example.com 'reboot' ``` **Step 5: Re-enable Node** ```bash # Verify all services healthy ssh root@node02.example.com 'systemctl status chainfire flaredb plasmavmc' # Re-enable in load balancer curl -k -X POST https://node01.example.com:9092/api/backend/node02 \ -d '{"status":"active"}' ``` ### 7.6 Decommissioning a Node **Step 1: Drain Node (see 7.5)** **Step 2: Remove from Cluster** ```bash # Remove from Chainfire curl -k -X DELETE https://node01.example.com:2379/admin/member/node02 # Remove from FlareDB curl -k -X DELETE https://node01.example.com:2479/admin/member/node02 # Verify removal curl -k https://node01.example.com:2379/admin/cluster/members | jq ``` **Step 3: Power Off** ```bash # Via BMC ipmitool -I lanplus -H 10.0.10.51 -U admin -P password chassis power off # Or via SSH ssh root@node02.example.com 'poweroff' ``` **Step 4: Update Inventory** ```bash # Remove from node inventory vim /srv/provisioning/inventory.json # Remove node02 entry # Remove from DNS # Update DNS zone to remove node02.example.com # Remove from monitoring # Update Prometheus targets to remove node02 ``` ## 8. 
Troubleshooting

### 8.1 PXE Boot Failures

**Symptom:** Server does not obtain an IP address or does not boot from the network

**Diagnosis:**

```bash
# Monitor DHCP server logs
sudo journalctl -u dhcpd4 -f

# Monitor TFTP requests
sudo tcpdump -i eth0 -n port 69

# Check PXE server services
sudo systemctl status dhcpd4 atftpd nginx
```

**Common Causes:**

1. **DHCP server not running:** `sudo systemctl start dhcpd4`
2. **Wrong network interface:** Check `interfaces` in dhcpd.conf
3. **Firewall blocking DHCP/TFTP:** `sudo iptables -L -n | grep -E "67|68|69"`
4. **PXE not enabled in BIOS:** Enter BIOS and enable Network Boot
5. **Network cable disconnected:** Check physical connection

**Solution:**

```bash
# Restart all PXE services
sudo systemctl restart dhcpd4 atftpd nginx

# Verify DHCP configuration
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# Test TFTP
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Power cycle server
ipmitool -I lanplus -H <bmc-ip> -U admin -P password chassis power cycle
```

### 8.2 Installation Failures (nixos-anywhere)

**Symptom:** nixos-anywhere fails during disk partitioning, installation, or bootloader setup

**Diagnosis:**

```bash
# Check nixos-anywhere output for errors
# Common errors: disk not found, partition table errors, out of space

# SSH to installer for manual inspection
ssh root@10.0.100.50

# Check disk status
lsblk
dmesg | grep -i error
```

**Common Causes:**

1. **Disk device wrong:** Update disko.nix with the correct device (e.g., /dev/nvme0n1)
2. **Disk not wiped:** Previous partition table conflicts
3. **Out of disk space:** Insufficient storage for Nix closures
4.
**Network issues:** Cannot download packages from binary cache

**Solution:**

```bash
# Manual disk wipe (on installer)
ssh root@10.0.100.50 '
  wipefs -a /dev/sda
  sgdisk --zap-all /dev/sda
'

# Retry nixos-anywhere
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  root@10.0.100.50
```

### 8.3 Cluster Join Failures

**Symptom:** Node boots successfully but does not join the cluster

**Diagnosis:**

```bash
# Check first-boot logs on the joining node
ssh root@node04.example.com 'journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service'

# Common errors:
# - "Health check timeout after 120s"
# - "Join request failed: connection refused"
# - "Configuration file not found"
```

**Bootstrap Mode vs Join Mode:**

- **Bootstrap:** Node expects to create a new cluster with its peers
- **Join:** Node expects to connect to an existing leader

**Common Causes:**

1. **Wrong bootstrap flag:** Check cluster-config.json
2. **Leader unreachable:** Network/firewall issue
3. **TLS certificate errors:** Verify cert paths and validity
4. **Service not starting:** Check the main service (chainfire.service)

**Solution:**

```bash
# Verify cluster-config.json (on the joining node)
ssh root@node04.example.com 'cat /etc/nixos/secrets/cluster-config.json | jq'

# Test leader connectivity
ssh root@node04.example.com 'curl -k https://node01.example.com:2379/health'

# Check TLS certificates
ssh root@node04.example.com 'ls -l /etc/nixos/secrets/*.pem'

# Manual cluster join (if automation fails)
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.200.20:2380"}'
```

### 8.4 Service Start Failures

**Symptom:** Service fails to start after boot

**Diagnosis:**

```bash
# Check service status
ssh root@node01.example.com 'systemctl status chainfire.service'

# View logs
ssh root@node01.example.com 'journalctl -u chainfire.service -n 100'

# Common errors:
# - "bind: address already in use" (port conflict)
# - "certificate verify failed" (TLS issue)
# - "permission denied" (file permissions)
```

**Common Causes:**

1. **Port already in use:** Another service using the same port
2. **Missing dependencies:** Required service not running
3. **Configuration error:** Invalid config file
4.
**File permissions:** Cannot read secrets **Solution:** ```bash # Check port usage ssh root@node01.example.com 'ss -tlnp | grep 2379' # Verify dependencies ssh root@node01.example.com 'systemctl list-dependencies chainfire.service' # Test configuration manually ssh root@node01.example.com 'chainfire-server --config /etc/nixos/chainfire.toml --check-config' # Fix permissions ssh root@node01.example.com 'chmod 600 /etc/nixos/secrets/*-key.pem' ``` ### 8.5 Network Connectivity Issues **Symptom:** Nodes cannot communicate with each other or external services **Diagnosis:** ```bash # Test basic connectivity ssh root@node01.example.com 'ping -c 3 node02.example.com' # Test specific ports ssh root@node01.example.com 'nc -zv node02.example.com 2379' # Check firewall rules ssh root@node01.example.com 'iptables -L -n | grep 2379' # Check routing ssh root@node01.example.com 'ip route show' ``` **Common Causes:** 1. **Firewall blocking traffic:** Missing iptables rules 2. **Wrong IP address:** Configuration mismatch 3. **Network interface down:** Interface not configured 4. 
**DNS resolution failure:** Cannot resolve hostnames **Solution:** ```bash # Add firewall rules ssh root@node01.example.com ' iptables -A INPUT -p tcp --dport 2379 -s 10.0.200.0/24 -j ACCEPT iptables -A INPUT -p tcp --dport 2380 -s 10.0.200.0/24 -j ACCEPT iptables-save > /etc/iptables/rules.v4 ' # Fix DNS resolution ssh root@node01.example.com ' echo "10.0.200.11 node02.example.com node02" >> /etc/hosts ' # Restart networking ssh root@node01.example.com 'systemctl restart systemd-networkd' ``` ### 8.6 TLS Certificate Errors **Symptom:** Services cannot establish TLS connections **Diagnosis:** ```bash # Test TLS connection openssl s_client -connect node01.example.com:2379 -CAfile /srv/provisioning/ca-cert.pem # Check certificate validity ssh root@node01.example.com ' openssl x509 -in /etc/nixos/secrets/node01-cert.pem -noout -dates ' # Common errors: # - "certificate verify failed" (wrong CA) # - "certificate has expired" (cert expired) # - "certificate subject name mismatch" (wrong CN) ``` **Common Causes:** 1. **Expired certificate:** Regenerate certificate 2. **Wrong CA certificate:** Verify CA cert is correct 3. **Hostname mismatch:** CN does not match hostname 4. 
**File permissions:** Cannot read certificate files **Solution:** ```bash # Regenerate certificate openssl req -new -key /srv/provisioning/secrets/node01-key.pem \ -out /srv/provisioning/secrets/node01-csr.pem \ -subj "/CN=node01.example.com" openssl x509 -req -in /srv/provisioning/secrets/node01-csr.pem \ -CA /srv/provisioning/ca-cert.pem \ -CAkey /srv/provisioning/ca-key.pem \ -CAcreateserial \ -out /srv/provisioning/secrets/node01-cert.pem \ -days 365 # Copy to node scp /srv/provisioning/secrets/node01-cert.pem root@node01.example.com:/etc/nixos/secrets/ # Restart service ssh root@node01.example.com 'systemctl restart chainfire.service' ``` ### 8.7 Performance Degradation **Symptom:** Services are slow or unresponsive **Diagnosis:** ```bash # Check system load ssh root@node01.example.com 'uptime' ssh root@node01.example.com 'top -bn1 | head -20' # Check disk I/O ssh root@node01.example.com 'iostat -x 1 5' # Check network bandwidth ssh root@node01.example.com 'iftop -i eth1' # Check Raft logs for slow operations ssh root@node01.example.com 'journalctl -u chainfire.service | grep "slow operation"' ``` **Common Causes:** 1. **High CPU usage:** Too many requests, inefficient queries 2. **Disk I/O bottleneck:** Slow disk, too many writes 3. **Network saturation:** Bandwidth exhausted 4. **Memory pressure:** OOM killer active 5. **Raft slow commits:** Network latency between nodes **Solution:** ```bash # Add more resources (vertical scaling) # Or add more nodes (horizontal scaling) # Check for resource leaks ssh root@node01.example.com 'systemctl status chainfire | grep Memory' # Restart service to clear memory leaks (temporary) ssh root@node01.example.com 'systemctl restart chainfire.service' # Optimize disk I/O (enable write caching if safe) ssh root@node01.example.com 'hdparm -W1 /dev/sda' ``` ## 9. 
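When the failing symptom is not yet clear, the per-section diagnostics above can be gathered in one pass before drilling into 8.1 through 8.7; a minimal sketch (the service names and port list mirror the examples in this section):

```shell
#!/bin/bash
# One-pass triage: collect the recurring checks from sections 8.4-8.7.
triage() {
  local node=$1
  ssh "root@$node.example.com" '
    echo "== failed units ==";   systemctl --failed --no-legend
    echo "== cluster ports =="; ss -tlnp | grep -E "2379|2380|2479|2480" || echo "none bound"
    echo "== load ==";          uptime
    echo "== disk ==";          df -h /var/lib
    echo "== recent errors =="; journalctl -p err -n 20 --no-pager
  '
}
```

`triage node02` prints a labelled snapshot that can be compared against a known-healthy node.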
Rollback & Recovery ### 9.1 NixOS Generation Rollback NixOS provides atomic rollback capability via generations: **List Available Generations:** ```bash ssh root@node01.example.com 'nixos-rebuild list-generations' # Example output: # 1 2025-12-10 10:30:00 # 2 2025-12-10 12:45:00 (current) ``` **Rollback to Previous Generation:** ```bash # Rollback and reboot ssh root@node01.example.com 'nixos-rebuild switch --rollback' # Or boot into previous generation once (no permanent change) ssh root@node01.example.com 'nixos-rebuild boot --rollback && reboot' ``` **Rollback to Specific Generation:** ```bash ssh root@node01.example.com 'nix-env --switch-generation 1 -p /nix/var/nix/profiles/system' ssh root@node01.example.com 'reboot' ``` ### 9.2 Re-Provisioning from PXE Complete re-provisioning wipes all data and reinstalls from scratch: **Step 1: Remove Node from Cluster** ```bash curl -k -X DELETE https://node01.example.com:2379/admin/member/node02 curl -k -X DELETE https://node01.example.com:2479/admin/member/node02 ``` **Step 2: Set Boot to PXE** ```bash ipmitool -I lanplus -H 10.0.10.51 -U admin chassis bootdev pxe ``` **Step 3: Reboot Node** ```bash ssh root@node02.example.com 'reboot' # Or via BMC ipmitool -I lanplus -H 10.0.10.51 -U admin chassis power cycle ``` **Step 4: Run nixos-anywhere** ```bash # Wait for PXE boot and SSH ready sleep 90 nix run github:nix-community/nixos-anywhere -- \ --flake /srv/provisioning#node02 \ root@10.0.100.51 ``` ### 9.3 Disaster Recovery Procedures **Complete Cluster Loss (All Nodes Down):** **Step 1: Restore from Backup (if available)** ```bash # Restore Chainfire data ssh root@node01.example.com ' systemctl stop chainfire.service rm -rf /var/lib/chainfire/* tar -xzf /backup/chainfire-$(date +%Y%m%d).tar.gz -C /var/lib/chainfire/ systemctl start chainfire.service ' ``` **Step 2: Bootstrap New Cluster** If no backup, re-provision all nodes as bootstrap: ```bash # Update cluster-config.json for all nodes # Set bootstrap=true, same 
initial_peers

# Provision all 3 nodes
for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#$node \
    root@<node-ip> &
done
wait
```

**Single Node Failure:**

**Step 1: Verify Cluster Quorum**

```bash
# Check remaining nodes have quorum
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members | length'
# Expected: 2 (if 3-node cluster with 1 failure)
```

**Step 2: Remove Failed Node**

```bash
curl -k -X DELETE https://node01.example.com:2379/admin/member/node02
```

**Step 3: Provision Replacement**

```bash
# Use same node ID and configuration
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node02 \
  root@10.0.100.51
```

### 9.4 Backup and Restore

**Automated Backup Script:**

```bash
#!/bin/bash
# /srv/provisioning/scripts/backup-cluster.sh

BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup Chainfire data
for node in node01 node02 node03; do
  ssh root@$node.example.com \
    "tar -czf - /var/lib/chainfire" > "$BACKUP_DIR/chainfire-$node.tar.gz"
done

# Backup FlareDB data
for node in node01 node02 node03; do
  ssh root@$node.example.com \
    "tar -czf - /var/lib/flaredb" > "$BACKUP_DIR/flaredb-$node.tar.gz"
done

# Backup configurations
cp -r /srv/provisioning/nodes "$BACKUP_DIR/configs"

echo "Backup complete: $BACKUP_DIR"
```

**Restore Script:**

```bash
#!/bin/bash
# /srv/provisioning/scripts/restore-cluster.sh

BACKUP_DIR="$1"
if [ -z "$BACKUP_DIR" ]; then
  echo "Usage: $0 <backup-dir>"
  exit 1
fi

# Stop services on all nodes
for node in node01 node02 node03; do
  ssh root@$node.example.com 'systemctl stop chainfire flaredb'
done

# Restore Chainfire data
for node in node01 node02 node03; do
  cat "$BACKUP_DIR/chainfire-$node.tar.gz" | \
    ssh root@$node.example.com "cd / && tar -xzf -"
done

# Restore FlareDB data
for node in node01 node02 node03; do
  cat "$BACKUP_DIR/flaredb-$node.tar.gz" | \
    ssh root@$node.example.com "cd / && tar -xzf -"
done

# Restart services
for node in
node01 node02 node03; do ssh root@$node.example.com 'systemctl start chainfire flaredb' done echo "Restore complete" ``` ## 10. Security Best Practices ### 10.1 SSH Key Management **Generate Dedicated Provisioning Key:** ```bash ssh-keygen -t ed25519 -C "provisioning@example.com" -f ~/.ssh/id_ed25519_provisioning ``` **Add to Netboot Image:** ```nix # In netboot-base.nix users.users.root.openssh.authorizedKeys.keys = [ "ssh-ed25519 AAAAC3Nza... provisioning@example.com" ]; ``` **Rotate Keys Regularly:** ```bash # Generate new key ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_provisioning_new # Add to all nodes for node in node01 node02 node03; do ssh-copy-id -i ~/.ssh/id_ed25519_provisioning_new.pub root@$node.example.com done # Remove old key from authorized_keys # Update netboot image with new key ``` ### 10.2 TLS Certificate Rotation **Automated Rotation Script:** ```bash #!/bin/bash # /srv/provisioning/scripts/rotate-certs.sh # Generate new certificates for node in node01 node02 node03; do openssl genrsa -out ${node}-key-new.pem 4096 openssl req -new -key ${node}-key-new.pem -out ${node}-csr.pem \ -subj "/CN=${node}.example.com" openssl x509 -req -in ${node}-csr.pem \ -CA ca-cert.pem -CAkey ca-key.pem \ -CAcreateserial -out ${node}-cert-new.pem -days 365 done # Deploy new certificates (without restarting services yet) for node in node01 node02 node03; do scp ${node}-cert-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-cert-new.pem scp ${node}-key-new.pem root@${node}.example.com:/etc/nixos/secrets/${node}-key-new.pem done # Update configuration to use new certs # ... (NixOS configuration update) ... 
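# Suggested addition (not part of the original rotation script): before any
# restart, verify each new certificate actually matches its private key, so a
# rolling restart cannot strand a node on a mismatched pair.
for node in node01 node02 node03; do
  cert_pub=$(openssl x509 -in ${node}-cert-new.pem -noout -pubkey 2>/dev/null)
  key_pub=$(openssl pkey -in ${node}-key-new.pem -pubout 2>/dev/null)
  if [ "$cert_pub" != "$key_pub" ]; then
    echo "ERROR: certificate/key mismatch for $node, aborting rotation"
    exit 1
  fi
done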
# Rolling restart to apply new certificates
for node in node01 node02 node03; do
  ssh root@${node}.example.com 'systemctl restart chainfire flaredb iam'
  sleep 30  # Wait for stabilization
done

echo "Certificate rotation complete"
```

### 10.3 Secrets Management

**Best Practices:**

- Store secrets outside the Nix store (use `/etc/nixos/secrets/`)
- Set restrictive permissions (0600 for private keys, 0400 for passwords)
- Use environment variables for runtime secrets
- Never commit secrets to Git
- Use encrypted secrets (sops-nix or agenix)

**Example with sops-nix:**

```nix
# In configuration.nix
{
  imports = [ <sops-nix/modules/sops> ];

  sops.defaultSopsFile = ./secrets.yaml;
  sops.secrets."node01/tls-key" = {
    owner = "chainfire";
    mode = "0400";
  };

  services.chainfire.settings.tls.key_path = config.sops.secrets."node01/tls-key".path;
}
```

### 10.4 Network Isolation

**VLAN Segmentation:**

- Management VLAN (10): BMC/IPMI, provisioning workstation
- Provisioning VLAN (100): PXE boot, temporary
- Production VLAN (200): Cluster services, inter-node communication
- Client VLAN (300): External clients accessing services

**Firewall Zones:**

```bash
# Example nftables rules
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;

    # Management from trusted subnet only
    iifname "eth0" ip saddr 10.0.10.0/24 tcp dport 22 accept

    # Cluster traffic from cluster subnet only
    iifname "eth1" ip saddr 10.0.200.0/24 tcp dport { 2379, 2380, 2479, 2480 } accept

    # Client traffic from client subnet only
    # (client VLAN 300 maps to 10.0.30.0/24; 300 is not a valid IPv4 octet)
    iifname "eth2" ip saddr 10.0.30.0/24 tcp dport { 8080, 9090 } accept
  }
}
```

### 10.5 Audit Logging

**Enable Structured Logging:**

```nix
# In configuration.nix
services.chainfire.settings.logging = {
  level = "info";
  format = "json";
  output = "journal";
};

# Enable journald forwarding to SIEM
services.journald.extraConfig = ''
  ForwardToSyslog=yes
  Storage=persistent
  MaxRetentionSec=7days
'';
```

**Audit Key Events:**

- Cluster membership changes
- Node joins/leaves
- Authentication failures
-
Configuration changes - TLS certificate errors **Log Aggregation:** ```bash # Forward logs to central logging server # Example: rsyslog configuration cat > /etc/rsyslog.d/50-remote.conf <