
First-Boot Automation for Bare-Metal Provisioning

Automated cluster joining and service initialization for bare-metal-provisioned NixOS nodes.

Overview

The first-boot automation system handles cluster joining for distributed services (Chainfire, FlareDB, IAM) on the first boot of bare-metal-provisioned nodes. It supports two modes:

  • Bootstrap Mode: Initialize a new Raft cluster (first 3 nodes)
  • Join Mode: Join an existing cluster (additional nodes)
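The choice between the two modes is driven by the bootstrap flag in the cluster configuration. A minimal sketch of that decision (illustrative only; the real scripts may parse the file with jq rather than a plain string match):

```shell
#!/bin/sh
# Illustrative sketch: derive the join mode from cluster-config.json.
# A plain string match stands in for proper JSON parsing.
select_mode() {
  config="$1"
  if grep -q '"bootstrap": *true' "$config"; then
    echo "bootstrap"
  else
    echo "join"
  fi
}
```

Usage: `select_mode /etc/nixos/secrets/cluster-config.json` prints either `bootstrap` or `join`.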

Features

  • Automated health checking with retries
  • Idempotent operations (safe to run multiple times)
  • Structured JSON logging to journald
  • Graceful failure handling with configurable retries
  • Integration with TLS certificates (T031)
  • Support for both bootstrap and runtime join scenarios
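The idempotency guarantee rests on per-service marker files (described under Bootstrap vs Join). A condensed sketch of that pattern, with the marker directory parameterized so the behavior is easy to exercise:

```shell
#!/bin/sh
# Sketch of the marker-file idempotency pattern used by the join services.
# MARKER_DIR defaults to the documented path; the join step itself is elided.
MARKER_DIR="${MARKER_DIR:-/var/lib/first-boot-automation}"

join_once() {
  marker="$MARKER_DIR/.chainfire-joined"
  if [ -f "$marker" ]; then
    # Marker exists: a previous boot already joined, so do nothing.
    echo "already joined at $(cat "$marker")"
    return 0
  fi
  # ... perform the actual cluster join here ...
  mkdir -p "$MARKER_DIR"
  date -Iseconds > "$marker"
  echo "joined"
}
```

Running `join_once` a second time is a no-op, which is what makes reboots safe.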

Architecture

See ARCHITECTURE.md for detailed design documentation.

Quick Start

Prerequisites

  1. Node provisioned via T032.S1-S3 (PXE boot and installation)
  2. Cluster configuration file at /etc/nixos/secrets/cluster-config.json
  3. TLS certificates at /etc/nixos/secrets/ (T031)
  4. Network connectivity to cluster leader (for join mode)

Enable First-Boot Automation

In your NixOS configuration:

# /etc/nixos/configuration.nix
{
  imports = [
    ./nix/modules/first-boot-automation.nix
  ];

  services.first-boot-automation = {
    enable = true;
    configFile = "/etc/nixos/secrets/cluster-config.json";

    # Optional: disable specific services
    enableChainfire = true;
    enableFlareDB = true;
    enableIAM = true;
    enableHealthCheck = true;
  };
}

First Boot

After provisioning and reboot:

  1. Node boots from disk
  2. systemd starts services
  3. First-boot automation runs automatically
  4. Cluster join completes within 30-60 seconds
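Step 3 includes a bounded wait for each local service to report healthy before any join request is sent. A condensed sketch of that wait loop (check_health stands in for the real curl against the service's /health endpoint):

```shell
#!/bin/sh
# Sketch of a bounded wait-for-healthy loop (120s timeout, 5s poll by default).
# check_health is a placeholder for the real health-endpoint probe.
wait_healthy() {
  timeout="${1:-120}"
  interval="${2:-5}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if check_health; then
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "Health check timeout after ${timeout}s" >&2
  return 1
}
```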

Check status:

systemctl status chainfire-cluster-join.service
systemctl status flaredb-cluster-join.service
systemctl status iam-initial-setup.service
systemctl status cluster-health-check.service

Configuration

cluster-config.json Format

{
  "node_id": "node01",
  "node_role": "control-plane",
  "bootstrap": true,
  "cluster_name": "prod-cluster",
  "leader_url": "https://node01.prod.example.com:2379",
  "raft_addr": "10.0.1.10:2380",
  "initial_peers": [
    "node01:2380",
    "node02:2380",
    "node03:2380"
  ],
  "flaredb_peers": [
    "node01:2480",
    "node02:2480",
    "node03:2480"
  ]
}

Required Fields

Field          Type     Description
node_id        string   Unique identifier for this node
node_role      string   Node role: control-plane, worker, or all-in-one
bootstrap      boolean  true for first 3 nodes, false for additional nodes
cluster_name   string   Cluster identifier
leader_url     string   HTTPS URL of cluster leader (used for join)
raft_addr      string   This node's Raft address (IP:port)
initial_peers  array    List of bootstrap peer addresses
flaredb_peers  array    List of FlareDB peer addresses
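A quick sanity check that every required field is present might look like the following (a naive string match for illustration; jq, used elsewhere in this README, gives a stricter check):

```shell
#!/bin/sh
# Naive required-field check for cluster-config.json (sketch, not the real validator).
check_required() {
  config="$1"
  missing=0
  for key in node_id node_role bootstrap cluster_name leader_url raft_addr initial_peers flaredb_peers; do
    if ! grep -q "\"$key\"" "$config"; then
      echo "missing required field: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```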

Optional Fields

Field       Type    Description
node_ip     string  Node's primary IP address
node_fqdn   string  Fully qualified domain name
datacenter  string  Datacenter identifier
rack        string  Rack identifier
services    object  Per-service configuration
tls         object  TLS certificate paths
network     object  Network CIDR ranges

Example Configurations

See examples/ directory:

  • cluster-config-bootstrap.json - Bootstrap node (first 3)
  • cluster-config-join.json - Join node (additional)
  • cluster-config-all-in-one.json - Single-node deployment

Bootstrap vs Join

Bootstrap Mode (bootstrap: true)

When to use:

  • First 3 nodes in a new cluster
  • Nodes configured with matching initial_peers
  • No existing cluster to join

Behavior:

  1. Services start with --initial-cluster configuration
  2. Raft consensus automatically elects leader
  3. Cluster join service detects bootstrap mode and exits immediately
  4. Marker file created: /var/lib/first-boot-automation/.chainfire-initialized

Example:

{
  "node_id": "node01",
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}

Join Mode (bootstrap: false)

When to use:

  • Nodes joining an existing cluster
  • Expansion or replacement nodes
  • Leader is known and reachable

Behavior:

  1. Service starts with no initial cluster config
  2. Waits for local service to be healthy (max 120s)
  3. POST to leader's /admin/member/add endpoint
  4. Retries up to 5 times with 10s delay
  5. Marker file created: /var/lib/first-boot-automation/.chainfire-joined

Example:

{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.prod.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}
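The retry behavior in steps 2-4 above amounts to a bounded loop. A sketch, where attempt_join is a placeholder for the real health wait plus POST to the leader's /admin/member/add endpoint:

```shell
#!/bin/sh
# Sketch of the join retry loop: up to MAX_ATTEMPTS tries, RETRY_DELAY seconds apart.
# attempt_join is a placeholder for the real join request to the leader.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"
RETRY_DELAY="${RETRY_DELAY:-10}"

join_with_retries() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    if attempt_join; then
      echo "joined on attempt $attempt"
      return 0
    fi
    echo "attempt $attempt failed, retrying in ${RETRY_DELAY}s" >&2
    sleep "$RETRY_DELAY"
    attempt=$((attempt + 1))
  done
  echo "failed to join after $MAX_ATTEMPTS attempts" >&2
  return 1
}
```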

Decision Matrix

Scenario        bootstrap  initial_peers  leader_url
Node 1 (first)  true       all 3 nodes    self
Node 2 (first)  true       all 3 nodes    self
Node 3 (first)  true       all 3 nodes    self
Node 4+ (join)  false      all 3 nodes    node 1

Systemd Services

chainfire-cluster-join.service

Description: Joins Chainfire cluster on first boot

Dependencies:

  • After: network-online.target, chainfire.service
  • Before: flaredb-cluster-join.service

Configuration:

  • Type: oneshot
  • RemainAfterExit: true
  • Restart: on-failure

Logs:

journalctl -u chainfire-cluster-join.service
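Taken together, the options above correspond to a unit roughly like the following. This is a hypothetical rendering for orientation only: the actual unit is generated by the Nix module, and the ExecStart path shown here is a placeholder.

```ini
[Unit]
Description=Joins Chainfire cluster on first boot
After=network-online.target chainfire.service
Before=flaredb-cluster-join.service
; Wants= is assumed; After=network-online.target alone does not pull the target in
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=true
Restart=on-failure
; Placeholder: the module supplies the real join script path
ExecStart=/run/current-system/sw/bin/cluster-join
```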

flaredb-cluster-join.service

Description: Joins FlareDB cluster after Chainfire

Dependencies:

  • After: chainfire-cluster-join.service, flaredb.service
  • Requires: chainfire-cluster-join.service

Configuration:

  • Type: oneshot
  • RemainAfterExit: true
  • Restart: on-failure

Logs:

journalctl -u flaredb-cluster-join.service

iam-initial-setup.service

Description: IAM initial setup and admin user creation

Dependencies:

  • After: flaredb-cluster-join.service, iam.service

Configuration:

  • Type: oneshot
  • RemainAfterExit: true

Logs:

journalctl -u iam-initial-setup.service

cluster-health-check.service

Description: Validates cluster health on first boot

Dependencies:

  • After: all cluster-join services

Configuration:

  • Type: oneshot
  • RemainAfterExit: false

Logs:

journalctl -u cluster-health-check.service

Troubleshooting

Check Service Status

# Overall status
systemctl status chainfire-cluster-join.service
systemctl status flaredb-cluster-join.service

# Detailed logs with JSON output
journalctl -u chainfire-cluster-join.service -o json-pretty

# Follow logs in real-time
journalctl -u chainfire-cluster-join.service -f

Common Issues

1. Health Check Timeout

Symptom:

{"level":"ERROR","message":"Health check timeout after 120s"}

Causes:

  • Service not starting (check main service logs)
  • Port conflict
  • TLS certificate issues

Solutions:

# Check main service
systemctl status chainfire.service
journalctl -u chainfire.service

# Test health endpoint manually
curl -k https://localhost:2379/health

# Restart services
systemctl restart chainfire.service
systemctl restart chainfire-cluster-join.service

2. Leader Unreachable

Symptom:

{"level":"ERROR","message":"Join request failed: connection error"}

Causes:

  • Network connectivity issues
  • Firewall blocking ports
  • Leader not running
  • Wrong leader URL in config

Solutions:

# Test network connectivity
ping node01.prod.example.com
curl -k https://node01.prod.example.com:2379/health

# Check firewall
iptables -L -n | grep 2379

# Verify configuration
jq '.leader_url' /etc/nixos/secrets/cluster-config.json

# Try manual join (see below)

3. Invalid Configuration

Symptom:

{"level":"ERROR","message":"Configuration file not found"}

Causes:

  • Missing configuration file
  • Wrong file path
  • Invalid JSON syntax
  • Missing required fields

Solutions:

# Check file exists
ls -la /etc/nixos/secrets/cluster-config.json

# Validate JSON syntax
jq . /etc/nixos/secrets/cluster-config.json

# Check required fields
jq '.node_id, .bootstrap, .leader_url' /etc/nixos/secrets/cluster-config.json

# Fix and restart
systemctl restart chainfire-cluster-join.service

4. Already Member (Reboot)

Symptom:

{"level":"WARN","message":"Already member of cluster (HTTP 409)"}

Explanation:

  • This is normal on reboots
  • Marker file prevents duplicate joins
  • No action needed

Verify:

# Check marker file
cat /var/lib/first-boot-automation/.chainfire-joined

# Should show timestamp: 2025-12-10T10:30:45+00:00

5. Join Retry Exhausted

Symptom:

{"level":"ERROR","message":"Failed to join cluster after 5 attempts"}

Causes:

  • Persistent network issues
  • Leader down or overloaded
  • Invalid node configuration
  • Cluster at capacity

Solutions:

# Check cluster status on leader
curl -k https://node01.prod.example.com:2379/admin/cluster/members | jq

# Verify this node's configuration
jq '.node_id, .raft_addr' /etc/nixos/secrets/cluster-config.json

# Increase retry attempts (edit NixOS config)
# Or perform manual join (see below)

Verify Cluster Membership

On leader node:

# Chainfire members
curl -k https://localhost:2379/admin/cluster/members | jq

# FlareDB members
curl -k https://localhost:2479/admin/cluster/members | jq

Expected output:

{
  "members": [
    {"id": "node01", "raft_addr": "10.0.1.10:2380", "status": "healthy"},
    {"id": "node02", "raft_addr": "10.0.1.11:2380", "status": "healthy"},
    {"id": "node03", "raft_addr": "10.0.1.12:2380", "status": "healthy"}
  ]
}

Check Marker Files

# List all marker files
ls -la /var/lib/first-boot-automation/

# View timestamps
cat /var/lib/first-boot-automation/.chainfire-joined
cat /var/lib/first-boot-automation/.flaredb-joined

Reset and Re-join

Warning: This will remove the node from the cluster and rejoin.

# Stop services
systemctl stop chainfire.service flaredb.service

# Remove data and markers
rm -rf /var/lib/chainfire/*
rm -rf /var/lib/flaredb/*
rm /var/lib/first-boot-automation/.chainfire-*
rm /var/lib/first-boot-automation/.flaredb-*

# Restart (will auto-join)
systemctl start chainfire.service flaredb.service
systemctl restart chainfire-cluster-join.service flaredb-cluster-join.service

Manual Operations

Manual Cluster Join

If automation fails, perform manual join:

Chainfire:

# On joining node, ensure service is running and healthy
curl -k https://localhost:2379/health

# From any node, add member to cluster
curl -k -X POST https://node01.prod.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{
    "id": "node04",
    "raft_addr": "10.0.1.13:2380"
  }'

# Create marker to prevent auto-retry
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined

FlareDB:

curl -k -X POST https://node01.prod.example.com:2479/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{
    "id": "node04",
    "raft_addr": "10.0.1.13:2480"
  }'

date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined

Remove Node from Cluster

On leader:

# Chainfire
curl -k -X DELETE https://node01.prod.example.com:2379/admin/member/node04

# FlareDB
curl -k -X DELETE https://node01.prod.example.com:2479/admin/member/node04

On removed node:

# Stop services
systemctl stop chainfire.service flaredb.service

# Clean up data
rm -rf /var/lib/chainfire/*
rm -rf /var/lib/flaredb/*
rm /var/lib/first-boot-automation/.chainfire-*
rm /var/lib/first-boot-automation/.flaredb-*

Disable First-Boot Automation

If you need to disable automation:

# In NixOS configuration
services.first-boot-automation.enable = false;

Or stop services temporarily:

systemctl stop chainfire-cluster-join.service
systemctl disable chainfire-cluster-join.service

Re-enable After Manual Operations

After manual cluster operations:

# Create marker files to indicate join complete
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined
date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined

# Or re-enable automation (will skip if markers exist)
systemctl enable --now chainfire-cluster-join.service

Security

TLS Certificates

Requirements:

  • All cluster communication uses TLS
  • Certificates must exist before first boot
  • Generated by T031 TLS automation

Certificate Paths:

/etc/nixos/secrets/
├── ca.crt              # CA certificate
├── node01.crt          # Node certificate
└── node01.key          # Node private key (mode 0600)

Permissions:

chmod 600 /etc/nixos/secrets/node01.key
chmod 644 /etc/nixos/secrets/node01.crt
chmod 644 /etc/nixos/secrets/ca.crt

Configuration File Security

Cluster configuration contains sensitive data:

  • IP addresses and network topology
  • Service URLs
  • Node identifiers

Recommended permissions:

chmod 600 /etc/nixos/secrets/cluster-config.json
chown root:root /etc/nixos/secrets/cluster-config.json

Network Security

Required firewall rules:

# Chainfire
iptables -A INPUT -p tcp --dport 2379 -s 10.0.1.0/24 -j ACCEPT  # API
iptables -A INPUT -p tcp --dport 2380 -s 10.0.1.0/24 -j ACCEPT  # Raft
iptables -A INPUT -p tcp --dport 2381 -s 10.0.1.0/24 -j ACCEPT  # Gossip

# FlareDB
iptables -A INPUT -p tcp --dport 2479 -s 10.0.1.0/24 -j ACCEPT  # API
iptables -A INPUT -p tcp --dport 2480 -s 10.0.1.0/24 -j ACCEPT  # Raft

# IAM
iptables -A INPUT -p tcp --dport 8080 -s 10.0.1.0/24 -j ACCEPT  # API

Production Considerations

For production deployments:

  1. Remove -k flag from curl (validate TLS certificates)
  2. Implement mTLS for client authentication
  3. Rotate credentials regularly
  4. Audit logs with structured logging
  5. Monitor health endpoints continuously
  6. Backup cluster state before changes
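For point 1, a small helper that swaps -k for explicit certificate validation might look like this (the certificate layout follows the Security section; the helper name is ours):

```shell
#!/bin/sh
# Build curl TLS arguments from the documented secrets layout instead of using -k.
# SECRETS and the per-node file names follow the Certificate Paths section.
SECRETS="${SECRETS:-/etc/nixos/secrets}"

tls_curl_args() {
  node="$1"
  echo "--cacert $SECRETS/ca.crt --cert $SECRETS/$node.crt --key $SECRETS/$node.key"
}

# Usage:
#   curl $(tls_curl_args node01) https://node01.prod.example.com:2379/health
```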

Examples

Example 1: 3-Node Bootstrap Cluster

Node 1:

{
  "node_id": "node01",
  "bootstrap": true,
  "raft_addr": "10.0.1.10:2380",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}

Node 2:

{
  "node_id": "node02",
  "bootstrap": true,
  "raft_addr": "10.0.1.11:2380",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}

Node 3:

{
  "node_id": "node03",
  "bootstrap": true,
  "raft_addr": "10.0.1.12:2380",
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}

Provisioning:

# Provision all 3 nodes simultaneously
for i in {1..3}; do
  nixos-anywhere --flake .#node0$i root@node0$i.example.com &
done
wait

# Nodes will bootstrap automatically on first boot

Example 2: Join Existing Cluster

Node 4 (joining):

{
  "node_id": "node04",
  "bootstrap": false,
  "leader_url": "https://node01.prod.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}

Provisioning:

nixos-anywhere --flake .#node04 root@node04.example.com

# Node will automatically join on first boot

Example 3: Single-Node All-in-One

For development/testing:

{
  "node_id": "aio01",
  "bootstrap": true,
  "raft_addr": "10.0.2.10:2380",
  "initial_peers": ["aio01:2380"],
  "flaredb_peers": ["aio01:2480"]
}

Provisioning:

nixos-anywhere --flake .#aio01 root@aio01.example.com

Integration with Other Systems

T024 NixOS Modules

First-boot automation integrates with service modules:

{
  imports = [
    ./nix/modules/chainfire.nix
    ./nix/modules/flaredb.nix
    ./nix/modules/first-boot-automation.nix
  ];

  services.chainfire.enable = true;
  services.flaredb.enable = true;
  services.first-boot-automation.enable = true;
}

T025 Observability

Health checks integrate with Prometheus:

# prometheus.yml
scrape_configs:
  - job_name: 'cluster-health'
    static_configs:
      - targets: ['node01:2379', 'node02:2379', 'node03:2379']
    metrics_path: '/health'

T031 TLS Certificates

Certificates generated by T031 are used automatically:

# On provisioning server
./tls/generate-node-cert.sh node01.example.com 10.0.1.10

# Copied during nixos-anywhere
# First-boot automation reads from /etc/nixos/secrets/

Logs and Debugging

Structured Logging

All logs are JSON-formatted:

{
  "timestamp": "2025-12-10T10:30:45+00:00",
  "level": "INFO",
  "service": "chainfire",
  "operation": "cluster-join",
  "message": "Successfully joined cluster"
}
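A helper that emits lines in this shape could be as small as the following sketch (the real scripts may hard-code the service and operation fields; here they are parameters):

```shell
#!/bin/sh
# Emit one structured log line in the format shown above.
log_json() {
  level="$1"; service="$2"; operation="$3"; shift 3
  printf '{"timestamp":"%s","level":"%s","service":"%s","operation":"%s","message":"%s"}\n' \
    "$(date -Iseconds)" "$level" "$service" "$operation" "$*"
}

# Usage:
#   log_json INFO chainfire cluster-join "Successfully joined cluster"
```

Because the output goes to stdout, journald captures it and the JSON can be recovered with the journalctl queries below.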

Query Examples

All first-boot logs:

journalctl -u "*cluster-join*" -u "*initial-setup*" -u "*health-check*"

Errors only:

journalctl -u chainfire-cluster-join.service | grep '"level":"ERROR"'

Last boot only:

journalctl -b -u chainfire-cluster-join.service

JSON output for parsing:

journalctl -u chainfire-cluster-join.service -o json | jq '.MESSAGE'

Performance Tuning

Port Configuration

Override the default service ports in the NixOS module if your deployment uses others:

services.first-boot-automation = {
  enable = true;

  # Override default ports if needed
  chainfirePort = 2379;
  flaredbPort = 2479;
};

Retry Configuration

Modify retry logic in scripts:

# baremetal/first-boot/cluster-join.sh
MAX_ATTEMPTS=10      # Increase from 5
RETRY_DELAY=15       # Increase from 10s

Health Check Interval

Adjust polling interval:

# In service scripts
sleep 10  # Increase from 5s for less aggressive polling

Support and Contributing

Getting Help

  1. Check logs: journalctl -u chainfire-cluster-join.service
  2. Review troubleshooting section above
  3. Consult ARCHITECTURE.md for design details
  4. Check cluster status on leader node

Reporting Issues

Include in bug reports:

# Gather diagnostic information
journalctl -u chainfire-cluster-join.service > cluster-join.log
systemctl status chainfire-cluster-join.service > service-status.txt
cat /etc/nixos/secrets/cluster-config.json > config.json  # Redact sensitive data!
ls -la /var/lib/first-boot-automation/ > markers.txt

Development

See ARCHITECTURE.md for contributing guidelines.

References

  • ARCHITECTURE.md: Detailed design documentation
  • T024: NixOS service modules
  • T025: Observability and monitoring
  • T031: TLS certificate automation
  • T032.S1-S3: PXE boot and provisioning
  • Design Document: /home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md

License

Internal use only - Centra Cloud Platform