First-Boot Automation for Bare-Metal Provisioning
Automated cluster joining and service initialization for bare-metal provisioned NixOS nodes.
Table of Contents
- Overview
- Quick Start
- Configuration
- Bootstrap vs Join
- Systemd Services
- Troubleshooting
- Manual Operations
- Security
- Examples
Overview
The first-boot automation system handles automated cluster joining for distributed services (Chainfire, FlareDB, IAM) on first boot of bare-metal provisioned nodes. It supports two modes:
- Bootstrap Mode: Initialize a new Raft cluster (first 3 nodes)
- Join Mode: Join an existing cluster (additional nodes)
Features
- Automated health checking with retries
- Idempotent operations (safe to run multiple times)
- Structured JSON logging to journald
- Graceful failure handling with configurable retries
- Integration with TLS certificates (T031)
- Support for both bootstrap and runtime join scenarios
Architecture
See ARCHITECTURE.md for detailed design documentation.
Quick Start
Prerequisites
- Node provisioned via T032.S1-S3 (PXE boot and installation)
- Cluster configuration file at /etc/nixos/secrets/cluster-config.json
- TLS certificates at /etc/nixos/secrets/ (T031)
- Network connectivity to cluster leader (for join mode)
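The prerequisites can be sanity-checked before enabling the automation. A minimal pre-flight sketch (this helper script is illustrative, not part of the shipped tooling; it demonstrates against a sample config written to /tmp):

```shell
# Hypothetical pre-flight check for cluster-config.json; point CONFIG at
# /etc/nixos/secrets/cluster-config.json on a real node.
CONFIG=${CONFIG:-/tmp/cluster-config.json}

# Sample config for demonstration only.
cat > "$CONFIG" <<'EOF'
{"node_id":"node01","node_role":"control-plane","bootstrap":true,
 "cluster_name":"prod-cluster","raft_addr":"10.0.1.10:2380",
 "initial_peers":["node01:2380","node02:2380","node03:2380"]}
EOF

# Fail fast if the file is missing or is not valid JSON.
[ -f "$CONFIG" ] || { echo "missing $CONFIG" >&2; exit 1; }
jq -e . "$CONFIG" >/dev/null || { echo "invalid JSON in $CONFIG" >&2; exit 1; }

# Each required field must be present and non-null (see the field table below
# for the full list; join-mode configs additionally need leader_url).
for field in node_id node_role bootstrap cluster_name raft_addr initial_peers; do
  jq -e --arg f "$field" 'has($f) and .[$f] != null' "$CONFIG" >/dev/null \
    || { echo "missing required field: $field" >&2; exit 1; }
done
echo "config valid"
```

Running this at provisioning time catches malformed configs before the first-boot services ever start.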
Enable First-Boot Automation
In your NixOS configuration:
# /etc/nixos/configuration.nix
{
imports = [
./nix/modules/first-boot-automation.nix
];
services.first-boot-automation = {
enable = true;
configFile = "/etc/nixos/secrets/cluster-config.json";
# Optional: disable specific services
enableChainfire = true;
enableFlareDB = true;
enableIAM = true;
enableHealthCheck = true;
};
}
First Boot
After provisioning and reboot:
- Node boots from disk
- systemd starts services
- First-boot automation runs automatically
- Cluster join completes within 30-60 seconds
Check status:
systemctl status chainfire-cluster-join.service
systemctl status flaredb-cluster-join.service
systemctl status iam-initial-setup.service
systemctl status cluster-health-check.service
Configuration
cluster-config.json Format
{
"node_id": "node01",
"node_role": "control-plane",
"bootstrap": true,
"cluster_name": "prod-cluster",
"leader_url": "https://node01.prod.example.com:2379",
"raft_addr": "10.0.1.10:2380",
"initial_peers": [
"node01:2380",
"node02:2380",
"node03:2380"
],
"flaredb_peers": [
"node01:2480",
"node02:2480",
"node03:2480"
]
}
Required Fields
| Field | Type | Description |
|---|---|---|
| node_id | string | Unique identifier for this node |
| node_role | string | Node role: control-plane, worker, or all-in-one |
| bootstrap | boolean | true for first 3 nodes, false for additional nodes |
| cluster_name | string | Cluster identifier |
| leader_url | string | HTTPS URL of cluster leader (used for join) |
| raft_addr | string | This node's Raft address (IP:port) |
| initial_peers | array | List of bootstrap peer addresses |
| flaredb_peers | array | List of FlareDB peer addresses |
Optional Fields
| Field | Type | Description |
|---|---|---|
| node_ip | string | Node's primary IP address |
| node_fqdn | string | Fully qualified domain name |
| datacenter | string | Datacenter identifier |
| rack | string | Rack identifier |
| services | object | Per-service configuration |
| tls | object | TLS certificate paths |
| network | object | Network CIDR ranges |
Example Configurations
See the examples/ directory:
- cluster-config-bootstrap.json - Bootstrap node (first 3)
- cluster-config-join.json - Join node (additional)
- cluster-config-all-in-one.json - Single-node deployment
Bootstrap vs Join
Bootstrap Mode (bootstrap: true)
When to use:
- First 3 nodes in a new cluster
- Nodes configured with matching initial_peers
- No existing cluster to join
Behavior:
- Services start with the --initial-cluster configuration
- Raft consensus automatically elects a leader
- Cluster join service detects bootstrap mode and exits immediately
- Marker file created: /var/lib/first-boot-automation/.chainfire-initialized
Example:
{
"node_id": "node01",
"bootstrap": true,
"initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
Join Mode (bootstrap: false)
When to use:
- Nodes joining an existing cluster
- Expansion or replacement nodes
- Leader is known and reachable
Behavior:
- Service starts with no initial cluster config
- Waits for local service to be healthy (max 120s)
- POSTs to the leader's /admin/member/add endpoint
- Retries up to 5 times with a 10s delay
- Marker file created: /var/lib/first-boot-automation/.chainfire-joined
Example:
{
"node_id": "node04",
"bootstrap": false,
"leader_url": "https://node01.prod.example.com:2379",
"raft_addr": "10.0.1.13:2380"
}
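The join-mode retry behavior described above can be sketched as a small wrapper that retries an arbitrary command. The probe here is a stub standing in for the curl POST to the leader's /admin/member/add endpoint; function and variable names are illustrative, and the real logic lives in cluster-join.sh:

```shell
# Illustrative retry wrapper: run a command up to MAX_ATTEMPTS times.
# In production the command is the curl POST to the leader, and RETRY_DELAY=10.
try_join() {
  max=${MAX_ATTEMPTS:-5}
  delay=${RETRY_DELAY:-0}
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      echo "joined on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "failed to join cluster after $max attempts" >&2
  return 1
}

# Stub probe that fails twice, then succeeds (stands in for the real join call).
n=0
flaky() { n=$((n + 1)); [ "$n" -ge 3 ]; }
try_join flaky   # prints: joined on attempt 3
```

Keeping the per-attempt command pluggable makes the retry logic easy to test without a live leader.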
Decision Matrix
| Scenario | bootstrap | initial_peers | leader_url |
|---|---|---|---|
| Node 1 (first) | true | all 3 nodes | self |
| Node 2 (first) | true | all 3 nodes | self |
| Node 3 (first) | true | all 3 nodes | self |
| Node 4+ (join) | false | all 3 nodes | node 1 |
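At runtime the decision reduces to a single check on the bootstrap field. A sketch of that check (sample config written to /tmp for illustration; the real script reads /etc/nixos/secrets/cluster-config.json):

```shell
# Illustrative bootstrap-vs-join mode switch.
CONFIG=${CONFIG:-/tmp/cc-mode.json}
cat > "$CONFIG" <<'EOF'
{"node_id":"node01","bootstrap":true}
EOF

if [ "$(jq -r '.bootstrap' "$CONFIG")" = "true" ]; then
  echo "bootstrap mode: cluster forms from initial_peers; join service exits"
else
  echo "join mode: contact leader_url and POST to /admin/member/add"
fi
```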
Systemd Services
chainfire-cluster-join.service
Description: Joins Chainfire cluster on first boot
Dependencies:
- After: network-online.target, chainfire.service
- Before: flaredb-cluster-join.service
Configuration:
- Type: oneshot
- RemainAfterExit: true
- Restart: on-failure
Logs:
journalctl -u chainfire-cluster-join.service
flaredb-cluster-join.service
Description: Joins FlareDB cluster after Chainfire
Dependencies:
- After: chainfire-cluster-join.service, flaredb.service
- Requires: chainfire-cluster-join.service
Configuration:
- Type: oneshot
- RemainAfterExit: true
- Restart: on-failure
Logs:
journalctl -u flaredb-cluster-join.service
iam-initial-setup.service
Description: IAM initial setup and admin user creation
Dependencies:
- After: flaredb-cluster-join.service, iam.service
Configuration:
- Type: oneshot
- RemainAfterExit: true
Logs:
journalctl -u iam-initial-setup.service
cluster-health-check.service
Description: Validates cluster health on first boot
Dependencies:
- After: all cluster-join services
Configuration:
- Type: oneshot
- RemainAfterExit: false
Logs:
journalctl -u cluster-health-check.service
Troubleshooting
Check Service Status
# Overall status
systemctl status chainfire-cluster-join.service
systemctl status flaredb-cluster-join.service
# Detailed logs with JSON output
journalctl -u chainfire-cluster-join.service -o json-pretty
# Follow logs in real-time
journalctl -u chainfire-cluster-join.service -f
Common Issues
1. Health Check Timeout
Symptom:
{"level":"ERROR","message":"Health check timeout after 120s"}
Causes:
- Service not starting (check main service logs)
- Port conflict
- TLS certificate issues
Solutions:
# Check main service
systemctl status chainfire.service
journalctl -u chainfire.service
# Test health endpoint manually
curl -k https://localhost:2379/health
# Restart services
systemctl restart chainfire.service
systemctl restart chainfire-cluster-join.service
2. Leader Unreachable
Symptom:
{"level":"ERROR","message":"Join request failed: connection error"}
Causes:
- Network connectivity issues
- Firewall blocking ports
- Leader not running
- Wrong leader URL in config
Solutions:
# Test network connectivity
ping node01.prod.example.com
curl -k https://node01.prod.example.com:2379/health
# Check firewall
iptables -L -n | grep 2379
# Verify configuration
jq '.leader_url' /etc/nixos/secrets/cluster-config.json
# Try manual join (see below)
3. Invalid Configuration
Symptom:
{"level":"ERROR","message":"Configuration file not found"}
Causes:
- Missing configuration file
- Wrong file path
- Invalid JSON syntax
- Missing required fields
Solutions:
# Check file exists
ls -la /etc/nixos/secrets/cluster-config.json
# Validate JSON syntax
jq . /etc/nixos/secrets/cluster-config.json
# Check required fields
jq '.node_id, .bootstrap, .leader_url' /etc/nixos/secrets/cluster-config.json
# Fix and restart
systemctl restart chainfire-cluster-join.service
4. Already Member (Reboot)
Symptom:
{"level":"WARN","message":"Already member of cluster (HTTP 409)"}
Explanation:
- This is normal on reboots
- Marker file prevents duplicate joins
- No action needed
Verify:
# Check marker file
cat /var/lib/first-boot-automation/.chainfire-joined
# Should show timestamp: 2025-12-10T10:30:45+00:00
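The marker logic that makes reboots a no-op can be sketched like this (marker written under /tmp here for demonstration; the real path is /var/lib/first-boot-automation/.chainfire-joined):

```shell
# Idempotency guard sketch: the join step is skipped when the marker exists.
MARKER=/tmp/.chainfire-joined

date -Iseconds > "$MARKER"   # simulate a prior successful join

if [ -f "$MARKER" ]; then
  echo "already joined at $(cat "$MARKER"); skipping"
else
  echo "no marker found; attempting cluster join"
fi
```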
5. Join Retry Exhausted
Symptom:
{"level":"ERROR","message":"Failed to join cluster after 5 attempts"}
Causes:
- Persistent network issues
- Leader down or overloaded
- Invalid node configuration
- Cluster at capacity
Solutions:
# Check cluster status on leader
curl -k https://node01.prod.example.com:2379/admin/cluster/members | jq
# Verify this node's configuration
jq '.node_id, .raft_addr' /etc/nixos/secrets/cluster-config.json
# Increase retry attempts (edit NixOS config)
# Or perform manual join (see below)
Verify Cluster Membership
On leader node:
# Chainfire members
curl -k https://localhost:2379/admin/cluster/members | jq
# FlareDB members
curl -k https://localhost:2479/admin/cluster/members | jq
Expected output:
{
"members": [
{"id": "node01", "raft_addr": "10.0.1.10:2380", "status": "healthy"},
{"id": "node02", "raft_addr": "10.0.1.11:2380", "status": "healthy"},
{"id": "node03", "raft_addr": "10.0.1.12:2380", "status": "healthy"}
]
}
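A members response like the one above can be checked mechanically rather than by eye. A sketch that counts unhealthy members with jq (sample JSON written to /tmp; on a live cluster, pipe the curl output into jq instead):

```shell
# Sample of the members response shown above.
cat > /tmp/members.json <<'EOF'
{"members":[
 {"id":"node01","raft_addr":"10.0.1.10:2380","status":"healthy"},
 {"id":"node02","raft_addr":"10.0.1.11:2380","status":"healthy"},
 {"id":"node03","raft_addr":"10.0.1.12:2380","status":"healthy"}]}
EOF

# Count members whose status is anything other than "healthy".
unhealthy=$(jq '[.members[] | select(.status != "healthy")] | length' /tmp/members.json)
total=$(jq '.members | length' /tmp/members.json)
if [ "$unhealthy" -eq 0 ]; then
  echo "cluster healthy ($total members)"
else
  echo "WARNING: $unhealthy of $total members unhealthy" >&2
fi
```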
Check Marker Files
# List all marker files
ls -la /var/lib/first-boot-automation/
# View timestamps
cat /var/lib/first-boot-automation/.chainfire-joined
cat /var/lib/first-boot-automation/.flaredb-joined
Reset and Re-join
Warning: This will remove the node from the cluster and rejoin.
# Stop services
systemctl stop chainfire.service flaredb.service
# Remove data and markers
rm -rf /var/lib/chainfire/*
rm -rf /var/lib/flaredb/*
rm /var/lib/first-boot-automation/.chainfire-*
rm /var/lib/first-boot-automation/.flaredb-*
# Restart (will auto-join)
systemctl start chainfire.service
systemctl restart chainfire-cluster-join.service
Manual Operations
Manual Cluster Join
If automation fails, perform manual join:
Chainfire:
# On joining node, ensure service is running and healthy
curl -k https://localhost:2379/health
# From any node, add member to cluster
curl -k -X POST https://node01.prod.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{
"id": "node04",
"raft_addr": "10.0.1.13:2380"
}'
# Create marker to prevent auto-retry
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined
FlareDB:
curl -k -X POST https://node01.prod.example.com:2479/admin/member/add \
-H "Content-Type: application/json" \
-d '{
"id": "node04",
"raft_addr": "10.0.1.13:2480"
}'
date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined
Remove Node from Cluster
On leader:
# Chainfire
curl -k -X DELETE https://node01.prod.example.com:2379/admin/member/node04
# FlareDB
curl -k -X DELETE https://node01.prod.example.com:2479/admin/member/node04
On removed node:
# Stop services
systemctl stop chainfire.service flaredb.service
# Clean up data
rm -rf /var/lib/chainfire/*
rm -rf /var/lib/flaredb/*
rm /var/lib/first-boot-automation/.chainfire-*
rm /var/lib/first-boot-automation/.flaredb-*
Disable First-Boot Automation
If you need to disable automation:
# In NixOS configuration
services.first-boot-automation.enable = false;
Or stop services temporarily:
systemctl stop chainfire-cluster-join.service
systemctl disable chainfire-cluster-join.service
Re-enable After Manual Operations
After manual cluster operations:
# Create marker files to indicate join complete
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined
date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined
# Or re-enable automation (will skip if markers exist)
systemctl enable --now chainfire-cluster-join.service
Security
TLS Certificates
Requirements:
- All cluster communication uses TLS
- Certificates must exist before first boot
- Generated by T031 TLS automation
Certificate Paths:
/etc/nixos/secrets/
├── ca.crt # CA certificate
├── node01.crt # Node certificate
└── node01.key # Node private key (mode 0600)
Permissions:
chmod 600 /etc/nixos/secrets/node01.key
chmod 644 /etc/nixos/secrets/node01.crt
chmod 644 /etc/nixos/secrets/ca.crt
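These permissions can also be verified programmatically. A sketch using GNU stat, demonstrated on a scratch file (substitute the real paths under /etc/nixos/secrets/):

```shell
# Scratch key file standing in for /etc/nixos/secrets/node01.key.
KEY=/tmp/node01.key
touch "$KEY"
chmod 600 "$KEY"

# The private key must be readable by its owner only.
mode=$(stat -c '%a' "$KEY")
if [ "$mode" = "600" ]; then
  echo "key permissions OK ($mode)"
else
  echo "insecure key permissions on $KEY: $mode (expected 600)" >&2
fi
```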
Configuration File Security
Cluster configuration contains sensitive data:
- IP addresses and network topology
- Service URLs
- Node identifiers
Recommended permissions:
chmod 600 /etc/nixos/secrets/cluster-config.json
chown root:root /etc/nixos/secrets/cluster-config.json
Network Security
Required firewall rules:
# Chainfire
iptables -A INPUT -p tcp --dport 2379 -s 10.0.1.0/24 -j ACCEPT # API
iptables -A INPUT -p tcp --dport 2380 -s 10.0.1.0/24 -j ACCEPT # Raft
iptables -A INPUT -p tcp --dport 2381 -s 10.0.1.0/24 -j ACCEPT # Gossip
# FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.1.0/24 -j ACCEPT # API
iptables -A INPUT -p tcp --dport 2480 -s 10.0.1.0/24 -j ACCEPT # Raft
# IAM
iptables -A INPUT -p tcp --dport 8080 -s 10.0.1.0/24 -j ACCEPT # API
Production Considerations
For production deployments:
- Remove the -k flag from curl (validate TLS certificates)
- Implement mTLS for client authentication
- Rotate credentials regularly
- Audit logs with structured logging
- Monitor health endpoints continuously
- Backup cluster state before changes
Examples
Example 1: 3-Node Bootstrap Cluster
Node 1:
{
"node_id": "node01",
"bootstrap": true,
"raft_addr": "10.0.1.10:2380",
"initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
Node 2:
{
"node_id": "node02",
"bootstrap": true,
"raft_addr": "10.0.1.11:2380",
"initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
Node 3:
{
"node_id": "node03",
"bootstrap": true,
"raft_addr": "10.0.1.12:2380",
"initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
Provisioning:
# Provision all 3 nodes simultaneously
for i in {1..3}; do
nixos-anywhere --flake .#node0$i root@node0$i.example.com &
done
wait
# Nodes will bootstrap automatically on first boot
Example 2: Join Existing Cluster
Node 4 (joining):
{
"node_id": "node04",
"bootstrap": false,
"leader_url": "https://node01.prod.example.com:2379",
"raft_addr": "10.0.1.13:2380"
}
Provisioning:
nixos-anywhere --flake .#node04 root@node04.example.com
# Node will automatically join on first boot
Example 3: Single-Node All-in-One
For development/testing:
{
"node_id": "aio01",
"bootstrap": true,
"raft_addr": "10.0.2.10:2380",
"initial_peers": ["aio01:2380"],
"flaredb_peers": ["aio01:2480"]
}
Provisioning:
nixos-anywhere --flake .#aio01 root@aio01.example.com
Integration with Other Systems
T024 NixOS Modules
First-boot automation integrates with service modules:
{
imports = [
./nix/modules/chainfire.nix
./nix/modules/flaredb.nix
./nix/modules/first-boot-automation.nix
];
services.chainfire.enable = true;
services.flaredb.enable = true;
services.first-boot-automation.enable = true;
}
T025 Observability
Health checks integrate with Prometheus:
# prometheus.yml
scrape_configs:
- job_name: 'cluster-health'
static_configs:
- targets: ['node01:2379', 'node02:2379', 'node03:2379']
metrics_path: '/health'
T031 TLS Certificates
Certificates generated by T031 are used automatically:
# On provisioning server
./tls/generate-node-cert.sh node01.example.com 10.0.1.10
# Copied during nixos-anywhere
# First-boot automation reads from /etc/nixos/secrets/
Logs and Debugging
Structured Logging
All logs are JSON-formatted:
{
"timestamp": "2025-12-10T10:30:45+00:00",
"level": "INFO",
"service": "chainfire",
"operation": "cluster-join",
"message": "Successfully joined cluster"
}
Query Examples
All first-boot logs:
journalctl -u "*cluster-join*" -u "*initial-setup*" -u "*health-check*"
Errors only:
journalctl -u chainfire-cluster-join.service | grep '"level":"ERROR"'
Last boot only:
journalctl -b -u chainfire-cluster-join.service
JSON output for parsing:
journalctl -u chainfire-cluster-join.service -o json | jq '.MESSAGE'
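Because every entry is JSON, logs can be filtered structurally instead of with grep. A sketch against a sample log file (with journalctl, the equivalent input is the message stream from `-o cat`):

```shell
# Two sample entries in the structured format shown above.
cat > /tmp/first-boot.log <<'EOF'
{"timestamp":"2025-12-10T10:30:40+00:00","level":"INFO","service":"chainfire","operation":"cluster-join","message":"Waiting for service health"}
{"timestamp":"2025-12-10T10:30:45+00:00","level":"ERROR","service":"chainfire","operation":"cluster-join","message":"Join request failed: connection error"}
EOF

# Keep only ERROR entries and print a compact one-line summary per entry.
jq -r 'select(.level == "ERROR") | "\(.timestamp) \(.service): \(.message)"' /tmp/first-boot.log
# → 2025-12-10T10:30:45+00:00 chainfire: Join request failed: connection error
```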
Performance Tuning
Timeout Configuration
Adjust timeouts in NixOS module:
services.first-boot-automation = {
enable = true;
# Override default ports if needed
chainfirePort = 2379;
flaredbPort = 2479;
};
Retry Configuration
Modify retry logic in scripts:
# baremetal/first-boot/cluster-join.sh
MAX_ATTEMPTS=10 # Increase from 5
RETRY_DELAY=15 # Increase from 10s
Health Check Interval
Adjust polling interval:
# In service scripts
sleep 10 # Increase from 5s for less aggressive polling
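The polling pattern behind the health checks can be sketched as a generic wait loop (function and variable names are illustrative; in production the probe is a curl call against the service's /health endpoint, with TIMEOUT=120 and INTERVAL=5 matching the defaults described above):

```shell
# Generic health-wait loop: run a probe until it succeeds or a deadline passes.
wait_healthy() {
  timeout=${TIMEOUT:-120}
  interval=${INTERVAL:-5}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if "$@"; then
      echo "healthy after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "health check timeout after ${timeout}s" >&2
  return 1
}

# Production probe: wait_healthy curl -fsk https://localhost:2379/health
# Demonstration probe that reports healthy immediately:
TIMEOUT=1 INTERVAL=1 wait_healthy true   # prints: healthy after 0s
```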
Support and Contributing
Getting Help
- Check logs: journalctl -u chainfire-cluster-join.service
- Review the troubleshooting section above
- Consult ARCHITECTURE.md for design details
- Check cluster status on leader node
Reporting Issues
Include in bug reports:
# Gather diagnostic information
journalctl -u chainfire-cluster-join.service > cluster-join.log
systemctl status chainfire-cluster-join.service > service-status.txt
cat /etc/nixos/secrets/cluster-config.json > config.json # Redact sensitive data!
ls -la /var/lib/first-boot-automation/ > markers.txt
Development
See ARCHITECTURE.md for contributing guidelines.
References
- ARCHITECTURE.md: Detailed design documentation
- T024: NixOS service modules
- T025: Observability and monitoring
- T031: TLS certificate automation
- T032.S1-S3: PXE boot and provisioning
- Design Document:
/home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md
License
Internal use only - Centra Cloud Platform