# Command Reference Guide

**Document Version:** 1.0
**Last Updated:** 2025-12-10

## Table of Contents

- [PXE Server Operations](#pxe-server-operations)
- [Image Building](#image-building)
- [Node Provisioning](#node-provisioning)
- [Cluster Management](#cluster-management)
- [Service Management](#service-management)
- [Health Checks](#health-checks)
- [BMC/IPMI Operations](#bmcipmi-operations)
- [Network Diagnostics](#network-diagnostics)
- [Log Querying](#log-querying)
- [Backup and Restore](#backup-and-restore)

## PXE Server Operations

### Start/Stop Services

```bash
# Start all PXE services
sudo systemctl start dhcpd4.service atftpd.service nginx.service

# Stop all PXE services
sudo systemctl stop dhcpd4.service atftpd.service nginx.service

# Restart all PXE services
sudo systemctl restart dhcpd4.service atftpd.service nginx.service

# Enable services at boot
sudo systemctl enable dhcpd4.service atftpd.service nginx.service

# Check status
sudo systemctl status dhcpd4.service
sudo systemctl status atftpd.service
sudo systemctl status nginx.service
```

### DHCP Server Management

```bash
# Test DHCP configuration syntax
sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf

# View DHCP leases
sudo cat /var/lib/dhcp/dhcpd.leases

# Watch DHCP leases in real time
sudo tail -f /var/lib/dhcp/dhcpd.leases

# View DHCP server logs
sudo journalctl -u dhcpd4.service -f

# Check the DHCP server is listening
sudo ss -ulnp | grep :67

# Send a DHCP discover (from a client)
sudo nmap --script broadcast-dhcp-discover -e eth0
```

### TFTP Server Management

```bash
# Test a TFTP download locally
tftp localhost -c get undionly.kpxe /tmp/test.kpxe

# Test TFTP from a remote host
tftp 10.0.100.10 -c get ipxe.efi /tmp/test.efi

# Check the TFTP server is listening
sudo ss -ulnp | grep :69

# View TFTP logs
sudo journalctl -u atftpd.service -f

# Monitor TFTP traffic
sudo tcpdump -i eth0 -n port 69 -vv

# List the TFTP root directory
ls -lh /var/lib/tftpboot/
```
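When a client fails to fetch its boot loader, a quick way to rule out missing or zero-length files is to check the TFTP root directly before reaching for tcpdump. A minimal sketch; the helper name is illustrative, and the file names match the examples above:

```bash
# Sketch: verify the expected boot loaders exist and are non-empty in a
# TFTP root (default path matches this guide; override with $1).
check_tftp_root() {
  local root="${1:-/var/lib/tftpboot}"
  local missing=0
  for f in undionly.kpxe ipxe.efi; do
    if [ -s "$root/$f" ]; then
      echo "OK   $f ($(wc -c < "$root/$f") bytes)"
    else
      echo "MISS $f"
      missing=1
    fi
  done
  return "$missing"
}
```

The non-zero exit on any missing file makes it usable as a pre-check in cron or CI, e.g. `check_tftp_root || alert`.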
### HTTP Server Management

```bash
# Test HTTP server
curl http://localhost:8080/health

# Test boot script availability
curl http://localhost:8080/boot/ipxe/boot.ipxe

# Test netboot image availability
curl -I http://localhost:8080/boot/nixos/control-plane/bzImage

# Check nginx configuration syntax
sudo nginx -t

# Reload nginx configuration (without restart)
sudo nginx -s reload

# View nginx access logs
sudo tail -f /var/log/nginx/access.log

# View nginx error logs
sudo tail -f /var/log/nginx/error.log

# Monitor HTTP traffic
sudo tcpdump -i eth0 -n -A 'port 80 or port 8080' | grep -E "GET|POST|HTTP"
```

### PXE Boot Debugging

```bash
# Monitor all PXE-related traffic
sudo tcpdump -i eth0 -n '(port 67 or port 68 or port 69 or port 80)' -vv

# Watch for DHCP discover packets
sudo tcpdump -i eth0 -n 'udp port 67 or udp port 68' -vv

# Watch for a specific MAC address
sudo tcpdump -i eth0 -n 'ether host 52:54:00:12:34:56'

# Check PXE server health
curl http://10.0.100.10:8080/health | jq

# View comprehensive logs (all services)
sudo journalctl -u dhcpd4 -u atftpd -u nginx -f --since "5 minutes ago"
```

## Image Building

### Build Netboot Images

```bash
# Build all profiles
cd /home/centra/cloud/baremetal/image-builder
./build-images.sh

# Build a specific profile
./build-images.sh --profile control-plane
./build-images.sh --profile worker
./build-images.sh --profile all-in-one

# Build and deploy to the PXE server
./build-images.sh --deploy

# Build with a custom output directory
./build-images.sh --output-dir /srv/pxe/images

# Build with verbose output
./build-images.sh --verbose
```

### Manual Nix Builds

```bash
# Build initrd
nix build .#nixosConfigurations.netboot-control-plane.config.system.build.netbootRamdisk

# Build kernel
nix build .#nixosConfigurations.netboot-control-plane.config.system.build.kernel

# Build complete system
nix build .#nixosConfigurations.netboot-control-plane.config.system.build.toplevel

# Check build artifacts
ls -lh result/

# Copy artifacts manually
sudo cp result/bzImage /var/lib/pxe-boot/nixos/control-plane/
sudo cp result/initrd /var/lib/pxe-boot/nixos/control-plane/
```

### Image Verification

```bash
# Check image sizes
ls -lh /var/lib/pxe-boot/nixos/*/

# Verify bzImage is a valid kernel
file /var/lib/pxe-boot/nixos/control-plane/bzImage
# Expected: Linux kernel x86 boot executable ...

# Verify initrd is compressed
file /var/lib/pxe-boot/nixos/control-plane/initrd
# Expected: gzip compressed data ...

# Check HTTP accessibility
curl -I http://10.0.100.10:8080/boot/nixos/control-plane/bzImage
# Expected: HTTP/1.1 200 OK

# Calculate checksums
sha256sum /var/lib/pxe-boot/nixos/control-plane/{bzImage,initrd}
```

## Node Provisioning

### nixos-anywhere Commands

```bash
# Basic provisioning
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  root@10.0.100.50

# Provision with remote build (faster on a slow local machine)
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --build-on-remote \
  root@10.0.100.50

# Provision with disk encryption
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --disk-encryption-keys /tmp/luks.key <(cat /srv/provisioning/secrets/node01-luks.key) \
  root@10.0.100.50

# Debug mode (verbose output, no reboot)
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --debug \
  --no-reboot \
  root@10.0.100.50

# Use a specific SSH key
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --ssh-key ~/.ssh/id_ed25519_provisioning \
  root@10.0.100.50

# Use a specific Nix binary
nix run github:nix-community/nixos-anywhere -- \
  --flake /srv/provisioning#node01 \
  --nix-path /run/current-system/sw/bin/nix \
  root@10.0.100.50
```
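A provisioning run that dies halfway is harder to clean up than one that never starts, so it can pay to verify the target first. A minimal pre-flight sketch; the `preflight` helper name and the `/dev/sda` default are illustrative assumptions, and the checks mirror the installer inspection commands in this guide:

```bash
# Hypothetical pre-flight helper: confirm SSH reachability, that the
# target disk exists, and that the installer can reach the binary cache,
# before running nixos-anywhere.
preflight() {
  local host="$1" disk="${2:-/dev/sda}"
  ssh -o ConnectTimeout=5 "root@${host}" true \
    || { echo "unreachable: $host" >&2; return 1; }
  ssh "root@${host}" "test -b ${disk}" \
    || { echo "no block device ${disk} on ${host}" >&2; return 1; }
  ssh "root@${host}" 'ping -c1 -W2 cache.nixos.org > /dev/null' \
    || { echo "no route to cache.nixos.org from ${host}" >&2; return 1; }
  echo "preflight OK: $host"
}

# Usage sketch:
# preflight 10.0.100.50 && nix run github:nix-community/nixos-anywhere -- \
#   --flake /srv/provisioning#node01 root@10.0.100.50
```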
### Batch Provisioning

```bash
# Provision multiple nodes in parallel
for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#${node} \
    --build-on-remote \
    root@${node}.example.com &
done
wait
echo "All nodes provisioned"

# Provision with logging
for node in node01 node02 node03; do
  nix run github:nix-community/nixos-anywhere -- \
    --flake /srv/provisioning#${node} \
    root@${node}.example.com 2>&1 | tee /var/log/provision-${node}.log &
done
wait
```

### SSH to Installer

```bash
# SSH to the PXE-booted installer
ssh root@10.0.100.50

# Check available disks
ssh root@10.0.100.50 'lsblk'

# Check network configuration
ssh root@10.0.100.50 'ip addr show'

# Check internet connectivity
ssh root@10.0.100.50 'ping -c 3 cache.nixos.org'

# Manual disk wipe (if needed)
ssh root@10.0.100.50 'wipefs -a /dev/sda && sgdisk --zap-all /dev/sda'

# Test disko configuration
ssh root@10.0.100.50 'nix-shell -p disko --run "disko --mode test /tmp/disko.nix"'
```

## Cluster Management

### Cluster Member Operations

```bash
# List cluster members (Chainfire)
curl -k https://node01.example.com:2379/admin/cluster/members | jq

# List cluster members (FlareDB)
curl -k https://node01.example.com:2479/admin/cluster/members | jq

# Get the cluster leader
curl -k https://node01.example.com:2379/admin/cluster/leader | jq

# Get cluster status
curl -k https://node01.example.com:2379/admin/cluster/status | jq
```

### Add Node to Cluster

```bash
# Add a member to the Chainfire cluster
curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{
    "id": "node04",
    "raft_addr": "10.0.200.13:2380"
  }'

# Add a member to the FlareDB cluster
curl -k -X POST https://node01.example.com:2479/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{
    "id": "node04",
    "raft_addr": "10.0.200.13:2480"
  }'
```

### Remove Node from Cluster

```bash
# Remove a member from the Chainfire cluster
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04

# Remove a member from the FlareDB cluster
curl -k -X DELETE https://node01.example.com:2479/admin/member/node04

# Verify removal
curl -k https://node01.example.com:2379/admin/cluster/members | jq '.members[] | select(.id=="node04")'
# Expected: empty (no output)
```
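Removing members mechanically can drop a cluster below quorum before anyone notices. A small guard, sketched against the admin API paths shown above; the helper names and the two-member floor (matching the quorum check used elsewhere in this guide) are illustrative assumptions:

```bash
# Sketch: refuse a member removal that would leave fewer than two
# members. The pure decision is split out so it is easy to test.
can_remove() {
  # $1 = current member count; allow removal only if >= 2 remain
  [ "$(( $1 - 1 ))" -ge 2 ]
}

remove_member() {
  local api="$1" member="$2" n
  n=$(curl -sk "${api}/admin/cluster/members" | jq '.members | length')
  if ! can_remove "$n"; then
    echo "refusing: only $n members; removing ${member} would break quorum" >&2
    return 1
  fi
  curl -k -X DELETE "${api}/admin/member/${member}"
}

# Usage sketch:
# remove_member https://node01.example.com:2379 node04
```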
### Cluster Health Checks

```bash
# Check the health of all nodes (Chainfire)
for node in node01 node02 node03; do
  echo "$node:"
  curl -k https://${node}.example.com:2379/health | jq -c
done

# Check the cluster has quorum
MEMBER_COUNT=$(curl -sk https://node01.example.com:2379/admin/cluster/members | jq '.members | length')
echo "Cluster has $MEMBER_COUNT members"
if [ "$MEMBER_COUNT" -ge 2 ]; then
  echo "Quorum achieved"
else
  echo "WARNING: No quorum"
fi

# Check a Raft leader exists
LEADER=$(curl -sk https://node01.example.com:2379/admin/cluster/leader | jq -r '.id')
if [ -n "$LEADER" ] && [ "$LEADER" != "null" ]; then
  echo "Leader: $LEADER"
else
  echo "ERROR: No leader elected"
fi
```

## Service Management

### Systemd Service Control

```bash
# Start a service
sudo systemctl start chainfire.service

# Stop a service
sudo systemctl stop chainfire.service

# Restart a service
sudo systemctl restart chainfire.service

# Reload configuration (without restart)
sudo systemctl reload chainfire.service

# Enable a service at boot
sudo systemctl enable chainfire.service

# Disable a service at boot
sudo systemctl disable chainfire.service

# Check service status
sudo systemctl status chainfire.service

# View service dependencies
sudo systemctl list-dependencies chainfire.service
```

### Multi-Service Operations

```bash
# Start all PlasmaCloud services
sudo systemctl start chainfire.service flaredb.service iam.service \
  plasmavmc.service prismnet.service flashdns.service

# Stop all PlasmaCloud services
sudo systemctl stop chainfire.service flaredb.service iam.service \
  plasmavmc.service prismnet.service flashdns.service

# Check the status of all services
systemctl status 'chainfire.service' 'flaredb.service' 'iam.service' \
  'plasmavmc.service' 'prismnet.service' 'flashdns.service' --no-pager

# Restart services in order
sudo systemctl restart chainfire.service && sleep 10
sudo systemctl restart flaredb.service && sleep 10
sudo systemctl restart iam.service
```
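The fixed `sleep 10` between ordered restarts works but is a guess; polling the service's health endpoint is more reliable. A sketch of that pattern; the function names are illustrative, the endpoints mirror the health checks elsewhere in this guide, and the localhost ports are assumptions:

```bash
# Sketch: restart a service, then poll its /health endpoint instead of
# sleeping a fixed interval.
wait_healthy() {
  # $1 = health URL, $2 = attempts (1 second apart, default 30)
  local url="$1" tries="${2:-30}" i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$(curl -sk "$url" | jq -r '.status' 2>/dev/null)" = "healthy" ] && return 0
    i=$((i + 1)); sleep 1
  done
  return 1
}

restart_and_wait() {
  sudo systemctl restart "$1.service"
  wait_healthy "https://localhost:$2/health" || {
    echo "ERROR: $1 did not become healthy" >&2; return 1; }
}

# Usage sketch (ports per the health-endpoint list below):
# restart_and_wait chainfire 2379 && restart_and_wait flaredb 2479
```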
### NixOS Configuration Management

```bash
# Build new configuration (test)
sudo nixos-rebuild test --flake /srv/provisioning#node01

# Build and apply new configuration
sudo nixos-rebuild switch --flake /srv/provisioning#node01

# Build and set as boot default (no activation)
sudo nixos-rebuild boot --flake /srv/provisioning#node01

# Rollback to previous generation
sudo nixos-rebuild switch --rollback

# List generations
sudo nixos-rebuild list-generations

# Boot into a specific generation (next boot only)
sudo nix-env --profile /nix/var/nix/profiles/system --switch-generation 3
sudo /nix/var/nix/profiles/system/bin/switch-to-configuration boot

# Delete old generations
sudo nix-collect-garbage --delete-older-than 30d
```

## Health Checks

### Service Health Endpoints

```bash
# Chainfire health
curl -k https://node01.example.com:2379/health | jq

# FlareDB health
curl -k https://node01.example.com:2479/health | jq

# IAM health
curl -k https://node01.example.com:8080/health | jq

# PlasmaVMC health
curl -k https://node01.example.com:9090/health | jq

# PrismNET health
curl -k https://node01.example.com:9091/health | jq

# FlashDNS health (via HTTP)
curl -k https://node01.example.com:853/health | jq

# FiberLB health
curl -k https://node01.example.com:9092/health | jq

# K8sHost health
curl -k https://node01.example.com:10250/healthz
```

### Comprehensive Health Check Script

```bash
#!/bin/bash
# /srv/provisioning/scripts/health-check-all.sh

NODES=("node01" "node02" "node03")
SERVICES=("2379:Chainfire" "2479:FlareDB" "8080:IAM" "9090:PlasmaVMC")

for node in "${NODES[@]}"; do
  echo "Checking $node..."
  for service in "${SERVICES[@]}"; do
    port=$(echo "$service" | cut -d: -f1)
    name=$(echo "$service" | cut -d: -f2)
    status=$(curl -sk https://${node}.example.com:${port}/health | jq -r '.status' 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "  ✓ $name: healthy"
    else
      echo "  ✗ $name: unhealthy or unreachable"
    fi
  done
  echo ""
done
```

### System Health Checks

```bash
# Check system load
ssh root@node01.example.com 'uptime'

# Check memory usage
ssh root@node01.example.com 'free -h'

# Check disk usage
ssh root@node01.example.com 'df -h'

# Check disk I/O
ssh root@node01.example.com 'iostat -x 1 5'

# Check network bandwidth
ssh root@node01.example.com 'iftop -i eth1 -t -s 5'

# Check process list
ssh root@node01.example.com 'ps aux --sort=-%mem | head -20'

# Check for OOM kills
ssh root@node01.example.com 'dmesg | grep -i "out of memory"'
```

## BMC/IPMI Operations

### Power Control

```bash
# Power on
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis power on

# Power off (graceful)
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis power soft

# Power off (force)
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis power off

# Power cycle
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis power cycle

# Power status
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis power status
```

### Boot Device Control

```bash
# Set next boot to PXE
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis bootdev pxe

# Set next boot to disk
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis bootdev disk

# Set next boot to CDROM
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis bootdev cdrom

# Set persistent PXE boot (all future boots)
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis bootdev pxe options=persistent

# Clear persistent boot device
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password chassis bootdev none
```

### Serial-over-LAN (SOL)

```bash
# Activate SOL session
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol activate

# Deactivate SOL session (from another terminal)
ipmitool -I \
  lanplus -H 10.0.10.50 -U admin -P password sol deactivate

# Configure SOL settings
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol set enabled true 1
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol set volatile-bit-rate 115.2 1

# View SOL configuration
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sol info 1
```

### System Information

```bash
# Get sensor readings
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sdr list

# Get a specific sensor
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sdr get "CPU Temp"

# Get the system event log
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sel list

# Clear the system event log
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password sel clear

# Get BMC info
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password bmc info

# Get FRU (Field Replaceable Unit) info
ipmitool -I lanplus -H 10.0.10.50 -U admin -P password fru print
```

### Batch Operations

```bash
# Power on all nodes
for ip in 10.0.10.{50..55}; do
  echo "Powering on $ip..."
  ipmitool -I lanplus -H $ip -U admin -P password chassis power on
done

# Check power status on all nodes
for ip in 10.0.10.{50..55}; do
  echo -n "$ip: "
  ipmitool -I lanplus -H $ip -U admin -P password chassis power status
done

# Set all nodes to PXE boot
for ip in 10.0.10.{50..55}; do
  echo "Setting $ip to PXE boot..."
  ipmitool -I lanplus -H $ip -U admin -P password chassis bootdev pxe options=persistent
done
```

## Network Diagnostics

### Connectivity Tests

```bash
# Ping test
ping -c 5 node01.example.com

# TCP port test
nc -zv node01.example.com 2379

# TCP port test with timeout
timeout 5 bash -c '</dev/tcp/node01.example.com/2379' && echo "Port open"
```

## Backup and Restore

### Manual Backups

```bash
# Backup Chainfire data
ssh root@node01.example.com 'tar -czf - /var/lib/chainfire' > chainfire-backup-$(date +%Y%m%d).tar.gz

# Backup FlareDB data
ssh root@node01.example.com 'tar -czf - /var/lib/flaredb' > flaredb-backup-$(date +%Y%m%d).tar.gz

# Backup configuration files
tar -czf provisioning-config-$(date +%Y%m%d).tar.gz /srv/provisioning/nodes/

# Backup TLS certificates
tar -czf tls-certs-$(date +%Y%m%d).tar.gz /srv/provisioning/secrets/*.pem
```

### Automated Backup Script

```bash
#!/bin/bash
# /srv/provisioning/scripts/backup-cluster.sh

BACKUP_DIR="/backup/cluster-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Backup cluster data from all nodes
for node in node01 node02 node03; do
  echo "Backing up $node..."
  ssh root@$node.example.com "tar -czf - /var/lib/chainfire" > "$BACKUP_DIR/chainfire-$node.tar.gz"
  ssh root@$node.example.com "tar -czf - /var/lib/flaredb" > "$BACKUP_DIR/flaredb-$node.tar.gz"
done

# Backup configurations
cp -r /srv/provisioning/nodes "$BACKUP_DIR/configs"

# Create manifest
cat > "$BACKUP_DIR/manifest.txt" <<EOF
Backup created: $(date)
Nodes: node01 node02 node03
EOF
```
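A backup that has never been read back is unverified. A minimal sketch that test-reads every archive a run produced; the helper name is illustrative:

```bash
# Sketch: verify each .tar.gz in a backup directory is a readable,
# valid archive; exit non-zero if any archive fails to list.
verify_backups() {
  local dir="$1" bad=0 archive
  for archive in "$dir"/*.tar.gz; do
    if tar -tzf "$archive" > /dev/null 2>&1; then
      echo "OK   $(basename "$archive")"
    else
      echo "BAD  $(basename "$archive")"
      bad=1
    fi
  done
  return "$bad"
}

# Usage sketch, against a directory produced by the script above:
# verify_backups /backup/cluster-20251210-120000 || echo "backup run is suspect"
```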