# First-Boot Automation for Bare-Metal Provisioning

Automated cluster joining and service initialization for bare-metal provisioned NixOS nodes.

## Table of Contents

- [Overview](#overview)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Bootstrap vs Join](#bootstrap-vs-join)
- [Systemd Services](#systemd-services)
- [Troubleshooting](#troubleshooting)
- [Manual Operations](#manual-operations)
- [Security](#security)
- [Examples](#examples)
- [Integration with Other Systems](#integration-with-other-systems)
- [Logs and Debugging](#logs-and-debugging)
- [Performance Tuning](#performance-tuning)
- [Support and Contributing](#support-and-contributing)
- [References](#references)
## Overview

The first-boot automation system handles automated cluster joining for distributed services (Chainfire, FlareDB, IAM) on first boot of bare-metal provisioned nodes. It supports two modes:

- **Bootstrap Mode**: Initialize a new Raft cluster (first 3 nodes)
- **Join Mode**: Join an existing cluster (additional nodes)

### Features

- Automated health checking with retries
- Idempotent operations (safe to run multiple times)
- Structured JSON logging to journald
- Graceful failure handling with configurable retries
- Integration with TLS certificates (T031)
- Support for both bootstrap and runtime join scenarios

### Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed design documentation.

## Quick Start

### Prerequisites

1. Node provisioned via T032.S1-S3 (PXE boot and installation)
2. Cluster configuration file at `/etc/nixos/secrets/cluster-config.json`
3. TLS certificates at `/etc/nixos/secrets/` (T031)
4. Network connectivity to the cluster leader (for join mode)
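
Prerequisites 2 and 3 can be checked with a small preflight step before enabling the automation. This is an illustrative sketch (the `check_prereqs` helper is not part of the shipped tooling); it only verifies local files, not leader reachability:

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper: verifies the local-file prerequisites above.
# Paths default to the locations used throughout this document.
check_prereqs() {
  local config="${1:-/etc/nixos/secrets/cluster-config.json}"
  local secrets_dir="${2:-/etc/nixos/secrets}"
  local status=0

  [ -r "$config" ] || { echo "MISSING: $config" >&2; status=1; }
  [ -r "$secrets_dir/ca.crt" ] || { echo "MISSING: $secrets_dir/ca.crt" >&2; status=1; }
  return "$status"
}
```

Run it on the freshly provisioned node before the first reboot; a non-zero exit means a prerequisite is missing.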

### Enable First-Boot Automation

In your NixOS configuration:

```nix
# /etc/nixos/configuration.nix
{
  imports = [
    ./nix/modules/first-boot-automation.nix
  ];

  services.first-boot-automation = {
    enable = true;
    configFile = "/etc/nixos/secrets/cluster-config.json";

    # Optional: disable specific services
    enableChainfire = true;
    enableFlareDB = true;
    enableIAM = true;
    enableHealthCheck = true;
  };
}
```

### First Boot

After provisioning and reboot:

1. Node boots from disk
2. systemd starts services
3. First-boot automation runs automatically
4. Cluster join completes within 30-60 seconds

Check status:

```bash
systemctl status chainfire-cluster-join.service
systemctl status flaredb-cluster-join.service
systemctl status iam-initial-setup.service
systemctl status cluster-health-check.service
```

## Configuration

### cluster-config.json Format

```json
{
  "node_id": "node01",
  "node_role": "control-plane",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.prod.example.com:2379",
  "raft_addr": "10.0.1.10:2380",
  "initial_peers": [
    "node01:2380",
    "node02:2380",
    "node03:2380"
  ],
  "flaredb_peers": [
    "node01:2480",
    "node02:2480",
    "node03:2480"
  ]
}
```

### Required Fields

| Field | Type | Description |
|-------|------|-------------|
| `node_id` | string | Unique identifier for this node |
| `node_role` | string | Node role: `control-plane`, `worker`, or `all-in-one` |
| `bootstrap` | boolean | `true` for the first 3 nodes, `false` for additional nodes |
| `cluster_name` | string | Cluster identifier |
| `leader_url` | string | HTTPS URL of the cluster leader (used for join) |
| `raft_addr` | string | This node's Raft address (IP:port) |
| `initial_peers` | array | List of bootstrap peer addresses |
| `flaredb_peers` | array | List of FlareDB peer addresses |

### Optional Fields

| Field | Type | Description |
|-------|------|-------------|
| `node_ip` | string | Node's primary IP address |
| `node_fqdn` | string | Fully qualified domain name |
| `datacenter` | string | Datacenter identifier |
| `rack` | string | Rack identifier |
| `services` | object | Per-service configuration |
| `tls` | object | TLS certificate paths |
| `network` | object | Network CIDR ranges |

### Example Configurations

See the [examples/](examples/) directory:

- `cluster-config-bootstrap.json` - Bootstrap node (first 3)
- `cluster-config-join.json` - Join node (additional)
- `cluster-config-all-in-one.json` - Single-node deployment

## Bootstrap vs Join

### Bootstrap Mode (bootstrap: true)

**When to use:**

- First 3 nodes in a new cluster
- Nodes configured with matching `initial_peers`
- No existing cluster to join

**Behavior:**

1. Services start with the `--initial-cluster` configuration
2. Raft consensus automatically elects a leader
3. The cluster join service detects bootstrap mode and exits immediately
4. Marker file created: `/var/lib/first-boot-automation/.chainfire-initialized`

**Example:**

```json
{
  "node_id": "node01",
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
```

### Join Mode (bootstrap: false)

**When to use:**

- Nodes joining an existing cluster
- Expansion or replacement nodes
- Leader is known and reachable

**Behavior:**

1. Service starts with no initial cluster config
2. Waits for the local service to become healthy (max 120s)
3. POSTs to the leader's `/admin/member/add` endpoint
4. Retries up to 5 times with a 10s delay
5. Marker file created: `/var/lib/first-boot-automation/.chainfire-joined`

**Example:**

```json
{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.prod.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}
```
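
The join behavior above can be pictured as a small retry loop. This is a conceptual sketch, not the shipped join script, though `MAX_ATTEMPTS` and `RETRY_DELAY` mirror the knobs shown under Performance Tuning:

```shell
# Conceptual outline of the join-with-retries behavior described above.
join_with_retries() {
  local leader_url="$1" payload="$2"
  local max_attempts="${MAX_ATTEMPTS:-5}" delay="${RETRY_DELAY:-10}" attempt
  for attempt in $(seq 1 "$max_attempts"); do
    if curl -fsSk -X POST "$leader_url/admin/member/add" \
         -H "Content-Type: application/json" -d "$payload"; then
      return 0  # joined; the caller records the marker file
    fi
    echo "join attempt $attempt/$max_attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}
```

On success the real service writes the `.chainfire-joined` marker so reboots skip the join.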

### Decision Matrix

| Scenario | bootstrap | initial_peers | leader_url |
|----------|-----------|---------------|------------|
| Node 1 (first) | `true` | all 3 nodes | self |
| Node 2 (first) | `true` | all 3 nodes | self |
| Node 3 (first) | `true` | all 3 nodes | self |
| Node 4+ (join) | `false` | all 3 nodes | node 1 |

## Systemd Services

### chainfire-cluster-join.service

**Description:** Joins the Chainfire cluster on first boot

**Dependencies:**

- After: `network-online.target`, `chainfire.service`
- Before: `flaredb-cluster-join.service`

**Configuration:**

- Type: `oneshot`
- RemainAfterExit: `true`
- Restart: `on-failure`

**Logs:**

```bash
journalctl -u chainfire-cluster-join.service
```

### flaredb-cluster-join.service

**Description:** Joins the FlareDB cluster after Chainfire

**Dependencies:**

- After: `chainfire-cluster-join.service`, `flaredb.service`
- Requires: `chainfire-cluster-join.service`

**Configuration:**

- Type: `oneshot`
- RemainAfterExit: `true`
- Restart: `on-failure`

**Logs:**

```bash
journalctl -u flaredb-cluster-join.service
```

### iam-initial-setup.service

**Description:** IAM initial setup and admin user creation

**Dependencies:**

- After: `flaredb-cluster-join.service`, `iam.service`

**Configuration:**

- Type: `oneshot`
- RemainAfterExit: `true`

**Logs:**

```bash
journalctl -u iam-initial-setup.service
```

### cluster-health-check.service

**Description:** Validates cluster health on first boot

**Dependencies:**

- After: all cluster-join services

**Configuration:**

- Type: `oneshot`
- RemainAfterExit: `false`

**Logs:**

```bash
journalctl -u cluster-health-check.service
```

## Troubleshooting

### Check Service Status

```bash
# Overall status
systemctl status chainfire-cluster-join.service
systemctl status flaredb-cluster-join.service

# Detailed logs with JSON output
journalctl -u chainfire-cluster-join.service -o json-pretty

# Follow logs in real-time
journalctl -u chainfire-cluster-join.service -f
```

### Common Issues

#### 1. Health Check Timeout

**Symptom:**

```json
{"level":"ERROR","message":"Health check timeout after 120s"}
```

**Causes:**

- Service not starting (check the main service logs)
- Port conflict
- TLS certificate issues

**Solutions:**

```bash
# Check the main service
systemctl status chainfire.service
journalctl -u chainfire.service

# Test the health endpoint manually
curl -k https://localhost:2379/health

# Restart services
systemctl restart chainfire.service
systemctl restart chainfire-cluster-join.service
```

#### 2. Leader Unreachable

**Symptom:**

```json
{"level":"ERROR","message":"Join request failed: connection error"}
```

**Causes:**

- Network connectivity issues
- Firewall blocking ports
- Leader not running
- Wrong leader URL in the config

**Solutions:**

```bash
# Test network connectivity
ping node01.prod.example.com
curl -k https://node01.prod.example.com:2379/health

# Check the firewall
iptables -L -n | grep 2379

# Verify the configuration
jq '.leader_url' /etc/nixos/secrets/cluster-config.json

# Try a manual join (see below)
```

#### 3. Invalid Configuration

**Symptom:**

```json
{"level":"ERROR","message":"Configuration file not found"}
```

**Causes:**

- Missing configuration file
- Wrong file path
- Invalid JSON syntax
- Missing required fields

**Solutions:**

```bash
# Check the file exists
ls -la /etc/nixos/secrets/cluster-config.json

# Validate JSON syntax
jq . /etc/nixos/secrets/cluster-config.json

# Check required fields
jq '.node_id, .bootstrap, .leader_url' /etc/nixos/secrets/cluster-config.json

# Fix and restart
systemctl restart chainfire-cluster-join.service
```

#### 4. Already Member (Reboot)

**Symptom:**

```json
{"level":"WARN","message":"Already member of cluster (HTTP 409)"}
```

**Explanation:**

- This is **normal** on reboots
- The marker file prevents duplicate joins
- No action needed

**Verify:**

```bash
# Check the marker file
cat /var/lib/first-boot-automation/.chainfire-joined

# Should show a timestamp: 2025-12-10T10:30:45+00:00
```
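
The marker-file guard that makes the join idempotent can be sketched as follows. The `join_once` helper is illustrative; the real services implement the same pattern internally:

```shell
# Run a join command at most once, recording a timestamp marker on success.
join_once() {
  local marker="$1"; shift
  if [ -f "$marker" ]; then
    echo "skip: already joined at $(cat "$marker")"
    return 0
  fi
  "$@" && date -Iseconds > "$marker"
}

# Example:
#   join_once /var/lib/first-boot-automation/.chainfire-joined some-join-command
```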

#### 5. Join Retry Exhausted

**Symptom:**

```json
{"level":"ERROR","message":"Failed to join cluster after 5 attempts"}
```

**Causes:**

- Persistent network issues
- Leader down or overloaded
- Invalid node configuration
- Cluster at capacity

**Solutions:**

```bash
# Check cluster status on the leader
curl -k https://node01.prod.example.com:2379/admin/cluster/members | jq

# Verify this node's configuration
jq '.node_id, .raft_addr' /etc/nixos/secrets/cluster-config.json

# Increase retry attempts (edit the NixOS config)
# Or perform a manual join (see below)
```

### Verify Cluster Membership

**On the leader node:**

```bash
# Chainfire members
curl -k https://localhost:2379/admin/cluster/members | jq

# FlareDB members
curl -k https://localhost:2479/admin/cluster/members | jq
```

**Expected output:**

```json
{
  "members": [
    {"id": "node01", "raft_addr": "10.0.1.10:2380", "status": "healthy"},
    {"id": "node02", "raft_addr": "10.0.1.11:2380", "status": "healthy"},
    {"id": "node03", "raft_addr": "10.0.1.12:2380", "status": "healthy"}
  ]
}
```
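
For scripting, the same check can be reduced to a single jq predicate over that output. A sketch — `all_healthy` is not a shipped helper:

```shell
# Exit 0 iff every member in the JSON on stdin reports status "healthy".
all_healthy() {
  jq -e 'all(.members[]; .status == "healthy")' >/dev/null
}

# Example:
#   curl -ks https://localhost:2379/admin/cluster/members | all_healthy && echo OK
```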

### Check Marker Files

```bash
# List all marker files
ls -la /var/lib/first-boot-automation/

# View timestamps
cat /var/lib/first-boot-automation/.chainfire-joined
cat /var/lib/first-boot-automation/.flaredb-joined
```

### Reset and Re-join

**Warning:** This removes the node from the cluster and rejoins it from a clean state.

```bash
# Stop services
systemctl stop chainfire.service flaredb.service

# Remove data and markers
rm -rf /var/lib/chainfire/*
rm -rf /var/lib/flaredb/*
rm /var/lib/first-boot-automation/.chainfire-*
rm /var/lib/first-boot-automation/.flaredb-*

# Restart (will auto-join)
systemctl start chainfire.service
systemctl restart chainfire-cluster-join.service
```

## Manual Operations

### Manual Cluster Join

If automation fails, perform a manual join:

**Chainfire:**

```bash
# On the joining node, ensure the service is running and healthy
curl -k https://localhost:2379/health

# From any node, add the member to the cluster
curl -k -X POST https://node01.prod.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{
    "id": "node04",
    "raft_addr": "10.0.1.13:2380"
  }'

# Create a marker to prevent auto-retry
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined
```

**FlareDB:**

```bash
curl -k -X POST https://node01.prod.example.com:2479/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{
    "id": "node04",
    "raft_addr": "10.0.1.13:2480"
  }'

date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined
```

### Remove Node from Cluster

**On the leader:**

```bash
# Chainfire
curl -k -X DELETE https://node01.prod.example.com:2379/admin/member/node04

# FlareDB
curl -k -X DELETE https://node01.prod.example.com:2479/admin/member/node04
```

**On the removed node:**

```bash
# Stop services
systemctl stop chainfire.service flaredb.service

# Clean up data
rm -rf /var/lib/chainfire/*
rm -rf /var/lib/flaredb/*
rm /var/lib/first-boot-automation/.chainfire-*
rm /var/lib/first-boot-automation/.flaredb-*
```

### Disable First-Boot Automation

If you need to disable automation:

```nix
# In NixOS configuration
services.first-boot-automation.enable = false;
```

Or stop the services temporarily:

```bash
systemctl stop chainfire-cluster-join.service
systemctl disable chainfire-cluster-join.service
```

### Re-enable After Manual Operations

After manual cluster operations:

```bash
# Create marker files to indicate the join is complete
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined
date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined

# Or re-enable automation (it will skip if markers exist)
systemctl enable --now chainfire-cluster-join.service
```

## Security

### TLS Certificates

**Requirements:**

- All cluster communication uses TLS
- Certificates must exist before first boot
- Generated by the T031 TLS automation

**Certificate Paths:**

```
/etc/nixos/secrets/
├── ca.crt       # CA certificate
├── node01.crt   # Node certificate
└── node01.key   # Node private key (mode 0600)
```

**Permissions:**

```bash
chmod 600 /etc/nixos/secrets/node01.key
chmod 644 /etc/nixos/secrets/node01.crt
chmod 644 /etc/nixos/secrets/ca.crt
```

### Configuration File Security

**The cluster configuration contains sensitive data:**

- IP addresses and network topology
- Service URLs
- Node identifiers

**Recommended permissions:**

```bash
chmod 600 /etc/nixos/secrets/cluster-config.json
chown root:root /etc/nixos/secrets/cluster-config.json
```

### Network Security

**Required firewall rules:**

```bash
# Chainfire
iptables -A INPUT -p tcp --dport 2379 -s 10.0.1.0/24 -j ACCEPT  # API
iptables -A INPUT -p tcp --dport 2380 -s 10.0.1.0/24 -j ACCEPT  # Raft
iptables -A INPUT -p tcp --dport 2381 -s 10.0.1.0/24 -j ACCEPT  # Gossip

# FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.1.0/24 -j ACCEPT  # API
iptables -A INPUT -p tcp --dport 2480 -s 10.0.1.0/24 -j ACCEPT  # Raft

# IAM
iptables -A INPUT -p tcp --dport 8080 -s 10.0.1.0/24 -j ACCEPT  # API
```

### Production Considerations

**For production deployments:**

1. **Remove the `-k` flag from curl** (validate TLS certificates)
2. **Implement mTLS** for client authentication
3. **Rotate credentials** regularly
4. **Audit logs** with structured logging
5. **Monitor health endpoints** continuously
6. **Back up cluster state** before changes
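
For point 1, the `curl -k` invocations used throughout this document can be replaced by pinning the cluster CA from the certificate layout above. A sketch — `secure_curl` and `CLUSTER_CA` are illustrative names, not part of the shipped tooling:

```shell
# Wrapper replacing `curl -k`: validates the server certificate against
# the cluster CA instead of skipping verification.
secure_curl() {
  curl --cacert "${CLUSTER_CA:-/etc/nixos/secrets/ca.crt}" "$@"
}

# Example: secure_curl https://node01.prod.example.com:2379/health
```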

## Examples

### Example 1: 3-Node Bootstrap Cluster

**Node 1:**

```json
{
  "node_id": "node01",
  "bootstrap": true,
  "raft_addr": "10.0.1.10:2380",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
```

**Node 2:**

```json
{
  "node_id": "node02",
  "bootstrap": true,
  "raft_addr": "10.0.1.11:2380",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
```

**Node 3:**

```json
{
  "node_id": "node03",
  "bootstrap": true,
  "raft_addr": "10.0.1.12:2380",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
```

**Provisioning:**

```bash
# Provision all 3 nodes simultaneously
for i in {1..3}; do
  nixos-anywhere --flake .#node0$i root@node0$i.example.com &
done
wait

# The nodes will bootstrap automatically on first boot
```

### Example 2: Join Existing Cluster

**Node 4 (joining):**

```json
{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.prod.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}
```

**Provisioning:**

```bash
nixos-anywhere --flake .#node04 root@node04.example.com

# The node will automatically join on first boot
```

### Example 3: Single-Node All-in-One

**For development/testing:**

```json
{
  "node_id": "aio01",
  "bootstrap": true,
  "raft_addr": "10.0.2.10:2380",
  "initial_peers": ["aio01:2380"],
  "flaredb_peers": ["aio01:2480"]
}
```

**Provisioning:**

```bash
nixos-anywhere --flake .#aio01 root@aio01.example.com
```

## Integration with Other Systems

### T024 NixOS Modules

First-boot automation integrates with the service modules:

```nix
{
  imports = [
    ./nix/modules/chainfire.nix
    ./nix/modules/flaredb.nix
    ./nix/modules/first-boot-automation.nix
  ];

  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.first-boot-automation.enable = true;
}
```

### T025 Observability

Health checks integrate with Prometheus:

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'cluster-health'
    static_configs:
      - targets: ['node01:2379', 'node02:2379', 'node03:2379']
    metrics_path: '/health'
```

### T031 TLS Certificates

Certificates generated by T031 are used automatically:

```bash
# On the provisioning server
./tls/generate-node-cert.sh node01.example.com 10.0.1.10

# Copied during nixos-anywhere
# First-boot automation reads from /etc/nixos/secrets/
```

## Logs and Debugging

### Structured Logging

All logs are JSON-formatted:

```json
{
  "timestamp": "2025-12-10T10:30:45+00:00",
  "level": "INFO",
  "service": "chainfire",
  "operation": "cluster-join",
  "message": "Successfully joined cluster"
}
```
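
Because each log line is a JSON object, errors can also be filtered structurally rather than with grep. A possible helper (the `errors_only` name is illustrative):

```shell
# Keep only ERROR-level entries from JSON-lines log input on stdin.
errors_only() {
  jq -c 'select(.level == "ERROR")'
}

# Example:
#   journalctl -u chainfire-cluster-join.service -o cat | errors_only
```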

### Query Examples

**All first-boot logs:**

```bash
journalctl -u "*cluster-join*" -u "*initial-setup*" -u "*health-check*"
```

**Errors only:**

```bash
journalctl -u chainfire-cluster-join.service | grep '"level":"ERROR"'
```

**Last boot only:**

```bash
journalctl -b -u chainfire-cluster-join.service
```

**JSON output for parsing:**

```bash
journalctl -u chainfire-cluster-join.service -o json | jq '.MESSAGE'
```

## Performance Tuning

### Timeout Configuration

Ports and timeouts are configured in the NixOS module; for example, to override the default ports:

```nix
services.first-boot-automation = {
  enable = true;

  # Override default ports if needed
  chainfirePort = 2379;
  flaredbPort = 2479;
};
```

### Retry Configuration

Modify the retry logic in the scripts:

```bash
# baremetal/first-boot/cluster-join.sh
MAX_ATTEMPTS=10   # Increase from 5
RETRY_DELAY=15    # Increase from 10s
```

### Health Check Interval

Adjust the polling interval:

```bash
# In the service scripts
sleep 10  # Increase from 5s for less aggressive polling
```

## Support and Contributing

### Getting Help

1. Check logs: `journalctl -u chainfire-cluster-join.service`
2. Review the troubleshooting section above
3. Consult [ARCHITECTURE.md](ARCHITECTURE.md) for design details
4. Check cluster status on the leader node

### Reporting Issues

Include in bug reports:

```bash
# Gather diagnostic information
journalctl -u chainfire-cluster-join.service > cluster-join.log
systemctl status chainfire-cluster-join.service > service-status.txt
cat /etc/nixos/secrets/cluster-config.json > config.json  # Redact sensitive data!
ls -la /var/lib/first-boot-automation/ > markers.txt
```
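
One way to honor the redaction note above is to strip topology fields mechanically before attaching the file. This is a suggestion only; the field list is drawn from the Required/Optional fields tables, and `redact_config` is not a shipped helper:

```shell
# Drop addresses and URLs from the config before sharing it in a bug report.
redact_config() {
  jq 'del(.leader_url, .raft_addr, .node_ip, .initial_peers, .flaredb_peers)'
}

# Example:
#   redact_config < /etc/nixos/secrets/cluster-config.json > config.json
```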

### Development

See [ARCHITECTURE.md](ARCHITECTURE.md) for contributing guidelines.

## References

- **ARCHITECTURE.md**: Detailed design documentation
- **T024**: NixOS service modules
- **T025**: Observability and monitoring
- **T031**: TLS certificate automation
- **T032.S1-S3**: PXE boot and provisioning
- **Design Document**: `/home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md`

## License

Internal use only - Centra Cloud Platform