
First-Boot Automation Architecture

Overview

The first-boot automation system provides automated cluster joining and service initialization for bare-metal provisioned nodes. It handles two critical scenarios:

  1. Bootstrap Mode: First 3 nodes initialize a new Raft cluster
  2. Join Mode: Additional nodes join an existing cluster

This document describes the architecture, design decisions, and implementation details.

System Architecture

Component Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                    NixOS Boot Process                        │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              systemd.target: multi-user.target               │
└────────────────────┬────────────────────────────────────────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
     ▼               ▼               ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│chainfire │  │ flaredb  │  │   iam    │
│.service  │  │.service  │  │.service  │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │               │
     ▼             ▼               ▼
┌──────────────────────────────────────────┐
│   chainfire-cluster-join.service         │
│   - Waits for local chainfire health     │
│   - Checks bootstrap flag                │
│   - Joins cluster if bootstrap=false     │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│   flaredb-cluster-join.service           │
│   - Requires chainfire-cluster-join      │
│   - Waits for local flaredb health       │
│   - Joins FlareDB cluster                │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│   iam-initial-setup.service              │
│   - Waits for IAM health                 │
│   - Creates admin user if needed         │
│   - Generates initial tokens             │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│   cluster-health-check.service           │
│   - Polls all service health endpoints   │
│   - Verifies cluster membership          │
│   - Reports to journald                  │
└──────────────────────────────────────────┘

Configuration Flow

┌─────────────────────────────────────────┐
│  Provisioning Server                    │
│  - Generates cluster-config.json        │
│  - Copies to /etc/nixos/secrets/        │
└────────────────┬────────────────────────┘
                 │
                 │ nixos-anywhere
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Target Node                            │
│  /etc/nixos/secrets/cluster-config.json │
└────────────────┬────────────────────────┘
                 │
                 │ Read by NixOS module
                 │
                 ▼
┌─────────────────────────────────────────┐
│  first-boot-automation.nix              │
│  - Parses JSON config                   │
│  - Creates systemd services             │
│  - Sets up dependencies                 │
└────────────────┬────────────────────────┘
                 │
                 │ systemd activation
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Cluster Join Services                  │
│  - Execute join logic                   │
│  - Create marker files                  │
│  - Log to journald                      │
└─────────────────────────────────────────┘

Bootstrap vs Join Decision Logic

Decision Tree

                    ┌─────────────────┐
                    │  Node Boots     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Read cluster-   │
                    │ config.json     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ bootstrap=true? │
                    └────────┬────────┘
                             │
                ┌────────────┴────────────┐
                │                         │
           YES  ▼                         ▼  NO
    ┌─────────────────┐       ┌─────────────────┐
    │ Bootstrap Mode  │       │  Join Mode      │
    │                 │       │                 │
    │ - Skip cluster  │       │ - Wait for      │
    │   join API      │       │   local health  │
    │ - Raft cluster  │       │ - Contact       │
    │   initializes   │       │   leader        │
    │   internally    │       │ - POST to       │
    │ - Create marker │       │   /member/add   │
    │ - Exit success  │       │ - Retry 5x      │
    └─────────────────┘       └─────────────────┘
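The decision tree above can be sketched as a small shell function (the function name and jq usage are illustrative; the shipped detector embeds equivalent logic). The exit codes match the Bootstrap Detector codes documented under Error Codes below: 0 = bootstrap, 1 = join, 2 = configuration error.

```shell
# Sketch of the bootstrap/join decision (illustrative helper name).
detect_mode() {
    local config="$1" bootstrap
    # Missing or unreadable configuration is a hard error.
    [[ -r "$config" ]] || { echo "config-error"; return 2; }
    bootstrap=$(jq -r '.bootstrap' "$config" 2>/dev/null) \
        || { echo "config-error"; return 2; }
    case "$bootstrap" in
        true)  echo "bootstrap"; return 0 ;;  # skip join API; Raft self-initializes
        false) echo "join";      return 1 ;;  # contact the leader's /admin/member/add
        *)     echo "config-error"; return 2 ;;  # key missing or not a boolean
    esac
}

# Usage: detect_mode /etc/nixos/secrets/cluster-config.json
```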

Bootstrap Mode (bootstrap: true)

When to use:

  • First 3 nodes in a new cluster
  • Nodes configured with matching initial_peers
  • No existing cluster to join

Behavior:

  1. Service starts with --initial-cluster parameter containing all bootstrap peers
  2. Raft consensus protocol automatically elects leader
  3. Cluster join service detects bootstrap mode and exits immediately
  4. No API calls to leader (cluster doesn't exist yet)

Configuration:

{
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}

Marker file: /var/lib/first-boot-automation/.chainfire-initialized

Join Mode (bootstrap: false)

When to use:

  • Nodes joining an existing cluster
  • Expansion or replacement nodes
  • Leader URL is known and reachable

Behavior:

  1. Service starts with no initial cluster configuration
  2. Cluster join service waits for local service health
  3. POST to leader's /admin/member/add with node info
  4. Leader adds member to Raft configuration
  5. Node joins cluster and synchronizes state

Configuration:

{
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}

Marker file: /var/lib/first-boot-automation/.chainfire-joined

Idempotency and State Management

Marker Files

The system uses marker files to track initialization state:

/var/lib/first-boot-automation/
├── .chainfire-initialized    # Bootstrap node initialized
├── .chainfire-joined          # Node joined cluster
├── .flaredb-initialized       # FlareDB bootstrap
├── .flaredb-joined            # FlareDB joined
└── .iam-initialized           # IAM setup complete

Purpose:

  • Prevent duplicate join attempts on reboot
  • Support idempotent operations
  • Enable troubleshooting (check timestamps)

Format: ISO8601 timestamp of initialization

2025-12-10T10:30:45+00:00
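The guard that makes each join service idempotent can be sketched as two helpers (the helper names are illustrative; the real units embed equivalent logic). Marker names and the state directory match the layout shown above.

```shell
# State directory used by all first-boot services (see tree above).
MARKER_DIR="/var/lib/first-boot-automation"

already_done() {
    # Exit 0 (skip the work) if the marker exists.
    [[ -f "$MARKER_DIR/$1" ]]
}

mark_done() {
    # Record completion as an ISO8601 timestamp, per the format above.
    mkdir -p "$MARKER_DIR"
    date -Iseconds > "$MARKER_DIR/$1"
}

# Usage:
#   already_done .chainfire-joined || { do_join; mark_done .chainfire-joined; }
```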

State Transitions

┌──────────────┐
│ First Boot   │
│ (no marker)  │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Check Config │
│ bootstrap=?  │
└──────┬───────┘
       │
       ├─(true)──▶ Bootstrap ──▶ Create .initialized ──▶ Done
       │
       └─(false)─▶ Join ──▶ Create .joined ──▶ Done
                     │
                     │ (reboot)
                     ▼
                  ┌──────────────┐
                  │ Marker Exists│
                  │ Skip Join    │
                  └──────────────┘

Retry Logic and Error Handling

Health Check Retry

Parameters:

  • Timeout: 120 seconds (configurable)
  • Retry Interval: 5 seconds
  • Max Elapsed: 300 seconds

Logic:

# TIMEOUT and HEALTH_URL are supplied by the service environment;
# defaults shown here for clarity.
TIMEOUT="${TIMEOUT:-120}"
HEALTH_URL="${HEALTH_URL:-https://localhost:2379/health}"

START_TIME=$(date +%s)
while true; do
    ELAPSED=$(($(date +%s) - START_TIME))
    if [[ $ELAPSED -ge $TIMEOUT ]]; then
        exit 1  # Timeout
    fi

    # -k accepts self-signed certs (see Security Considerations);
    # --max-time keeps a hung connection from eating the retry budget.
    HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" --max-time 5 "$HEALTH_URL")
    if [[ "$HTTP_CODE" == "200" ]]; then
        exit 0  # Success
    fi

    sleep 5  # Retry interval (see parameters above)
done

Cluster Join Retry

Parameters:

  • Max Attempts: 5 (configurable)
  • Retry Delay: 10 seconds
  • Exponential Backoff: Optional (not implemented)

Logic:

# LEADER_URL and PAYLOAD are derived from cluster-config.json.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"
RETRY_DELAY="${RETRY_DELAY:-10}"

for ATTEMPT in $(seq 1 "$MAX_ATTEMPTS"); do
    # Capture only the HTTP status code, not the response body
    HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" \
        -X POST "$LEADER_URL/admin/member/add" \
        -H "Content-Type: application/json" \
        -d "$PAYLOAD")

    if [[ "$HTTP_CODE" == "200" || "$HTTP_CODE" == "201" ]]; then
        exit 0  # Success
    elif [[ "$HTTP_CODE" == "409" ]]; then
        exit 2  # Already member
    fi

    sleep "$RETRY_DELAY"
done

exit 1  # Max attempts exhausted

Error Codes

Health Check:

  • 0: Service healthy
  • 1: Timeout or unhealthy

Cluster Join:

  • 0: Successfully joined
  • 1: Failed after max attempts
  • 2: Already joined (idempotent)
  • 3: Invalid arguments

Bootstrap Detector:

  • 0: Should bootstrap
  • 1: Should join existing
  • 2: Configuration error

Security Considerations

TLS Certificate Handling

Requirements:

  • All inter-node communication uses TLS
  • Self-signed certificates supported via -k flag to curl
  • Certificate validation in production (remove -k)

Certificate Paths:

{
  "tls": {
    "enabled": true,
    "ca_cert_path": "/etc/nixos/secrets/ca.crt",
    "node_cert_path": "/etc/nixos/secrets/node01.crt",
    "node_key_path": "/etc/nixos/secrets/node01.key"
  }
}
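When certificate validation is enabled in production, the -k flag gives way to explicit CA and client-certificate options. A hedged sketch using the paths from the config above (the commented-out target URL is illustrative):

```shell
# Production-mode curl arguments: validate the server against the
# cluster CA and present the node's client certificate (mTLS),
# instead of passing -k. Paths match the tls section above.
CURL_TLS_ARGS=(
  --cacert /etc/nixos/secrets/ca.crt      # verify server against cluster CA
  --cert   /etc/nixos/secrets/node01.crt  # client certificate for mTLS
  --key    /etc/nixos/secrets/node01.key  # client private key
)

# Example (URL illustrative):
# curl "${CURL_TLS_ARGS[@]}" -s -o /dev/null -w "%{http_code}" \
#      https://node01.example.com:2379/health
```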

Integration with T031:

  • Certificates generated by T031 TLS automation
  • Copied to target during provisioning
  • Read by services at startup

Secrets Management

Cluster Configuration:

  • Stored in /etc/nixos/secrets/cluster-config.json
  • Permissions: 0600 root:root (recommended)
  • Contains sensitive data: URLs, IPs, topology

API Credentials:

  • IAM admin credentials (future implementation)
  • Stored in separate file: /etc/nixos/secrets/iam-admin.json
  • Never logged to journald

Attack Surface

Mitigations:

  1. Network-level: Firewall rules restrict cluster API ports
  2. Application-level: mTLS for authenticated requests
  3. Access control: systemd service isolation
  4. Audit: All operations logged to journald with structured JSON

Integration Points

T024 NixOS Modules

The first-boot automation module imports and extends service modules:

# Example: netboot-control-plane.nix
{
  imports = [
    ../modules/chainfire.nix
    ../modules/flaredb.nix
    ../modules/iam.nix
    ../modules/first-boot-automation.nix
  ];

  services.first-boot-automation.enable = true;
}

T031 TLS Certificates

Dependencies:

  • TLS certificates must exist before first boot
  • Provisioning script copies certificates to /etc/nixos/secrets/
  • Services read certificates at startup

Certificate Generation:

# On provisioning server (T031)
./tls/generate-node-cert.sh node01.example.com 10.0.1.10

# Copied to target
scp ca.crt node01.crt node01.key root@10.0.1.10:/etc/nixos/secrets/

T032.S1-S3 PXE/Netboot

Boot Flow:

  1. PXE boot loads iPXE firmware
  2. iPXE chainloads NixOS kernel/initrd
  3. NixOS installer runs (nixos-anywhere)
  4. System installed to disk with first-boot automation
  5. Reboot into installed system
  6. First-boot automation executes

Configuration Injection:

# During nixos-anywhere provisioning
mkdir -p /mnt/etc/nixos/secrets
cp cluster-config.json /mnt/etc/nixos/secrets/
chmod 600 /mnt/etc/nixos/secrets/cluster-config.json

Service Dependencies

Systemd Ordering

Chainfire:

After:  network-online.target, chainfire.service
Before: flaredb-cluster-join.service
Wants:  network-online.target

FlareDB:

After:  chainfire-cluster-join.service, flaredb.service
Requires: chainfire-cluster-join.service
Before: iam-initial-setup.service

IAM:

After:  flaredb-cluster-join.service, iam.service
Before: cluster-health-check.service

Health Check:

After:  chainfire-cluster-join, flaredb-cluster-join, iam-initial-setup
Type:   oneshot (no RemainAfterExit)

Dependency Graph

network-online.target
        │
        ├──▶ chainfire.service
        │         │
        │         ▼
        │    chainfire-cluster-join.service
        │         │
        ├──▶ flaredb.service
        │         │
        │         ▼
        └────▶ flaredb-cluster-join.service
                  │
             ┌────┴────┐
             │         │
        iam.service    │
             │         │
             ▼         │
        iam-initial-setup.service
             │         │
             └────┬────┘
                  │
                  ▼
        cluster-health-check.service

Logging and Observability

Structured Logging

All scripts output JSON-formatted logs:

{
  "timestamp": "2025-12-10T10:30:45+00:00",
  "level": "INFO",
  "service": "chainfire",
  "operation": "cluster-join",
  "message": "Successfully joined cluster"
}

Benefits:

  • Machine-readable for log aggregation (T025)
  • Easy filtering with journalctl -o json
  • Includes context (service, operation, timestamp)
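One way to emit this format from the shell scripts is a small jq-backed helper (a sketch; the helper name is illustrative, but the field set matches the example above, and jq handles escaping of message text):

```shell
# Emit one compact JSON log line in the documented format.
log_json() {
    local level="$1" service="$2" operation="$3" message="$4"
    jq -cn \
        --arg timestamp "$(date -Iseconds)" \
        --arg level "$level" \
        --arg service "$service" \
        --arg operation "$operation" \
        --arg message "$message" \
        '{timestamp: $timestamp, level: $level, service: $service,
          operation: $operation, message: $message}'
}

# Usage: log_json INFO chainfire cluster-join "Successfully joined cluster"
```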

Querying Logs

View all first-boot automation logs:

journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service \
           -u iam-initial-setup.service -u cluster-health-check.service

Filter by log level:

journalctl -u chainfire-cluster-join.service | grep -E '"level": ?"ERROR"'

Follow live:

journalctl -u chainfire-cluster-join.service -f

Health Check Integration

T025 Observability:

  • Health check service can POST to metrics endpoint
  • Prometheus scraping of /health endpoints
  • Alerts on cluster join failures

Future:

  • Webhook to provisioning server on completion
  • Slack/email notifications on errors
  • Dashboard showing cluster join status

Performance Characteristics

Boot Time Analysis

Typical Timeline (3-node cluster):

T+0s    : systemd starts
T+5s    : network-online.target reached
T+10s   : chainfire.service starts
T+15s   : chainfire healthy
T+15s   : chainfire-cluster-join runs (bootstrap, immediate exit)
T+20s   : flaredb.service starts
T+25s   : flaredb healthy
T+25s   : flaredb-cluster-join runs (bootstrap, immediate exit)
T+30s   : iam.service starts
T+35s   : iam healthy
T+35s   : iam-initial-setup runs
T+40s   : cluster-health-check runs
T+40s   : Node fully operational

Join Mode (node joining existing cluster):

T+0s    : systemd starts
T+5s    : network-online.target reached
T+10s   : chainfire.service starts
T+15s   : chainfire healthy
T+15s   : chainfire-cluster-join runs
T+20s   : POST to leader, wait for response
T+25s   : Successfully joined chainfire cluster
T+25s   : flaredb.service starts
T+30s   : flaredb healthy
T+30s   : flaredb-cluster-join runs
T+35s   : Successfully joined flaredb cluster
T+40s   : iam-initial-setup (skips, already initialized)
T+45s   : cluster-health-check runs
T+45s   : Node fully operational

Bottlenecks

Health Check Polling:

  • 5-second intervals may be too aggressive
  • Recommendation: Exponential backoff

Network Latency:

  • Join requests block on network RTT
  • Mitigation: Ensure low-latency cluster network

Raft Synchronization:

  • New member must catch up on Raft log
  • Time depends on log size (seconds to minutes)

Failure Modes and Recovery

Common Failures

1. Leader Unreachable

Symptom:

{"level":"ERROR","message":"Join request failed: connection error"}

Diagnosis:

  • Check network connectivity: ping node01.example.com
  • Verify firewall rules: iptables -L
  • Check leader service status: systemctl status chainfire.service

Recovery:

# Fix network/firewall, then restart join service
systemctl restart chainfire-cluster-join.service

2. Invalid Configuration

Symptom:

{"level":"ERROR","message":"Configuration file not found"}

Diagnosis:

  • Verify file exists: ls -la /etc/nixos/secrets/cluster-config.json
  • Check JSON syntax: jq . /etc/nixos/secrets/cluster-config.json

Recovery:

# Fix configuration, then restart
systemctl restart chainfire-cluster-join.service

3. Service Not Healthy

Symptom:

{"level":"ERROR","message":"Health check timeout"}

Diagnosis:

  • Check service logs: journalctl -u chainfire.service
  • Verify service is running: systemctl status chainfire.service
  • Test health endpoint: curl -k https://localhost:2379/health

Recovery:

# Restart the main service
systemctl restart chainfire.service

# Join service will auto-retry after RestartSec

4. Already Member

Symptom:

{"level":"WARN","message":"Node already member of cluster (HTTP 409)"}

Diagnosis:

  • This is normal on reboots
  • Marker file created to prevent future attempts

Recovery:

  • No action needed (idempotent behavior)

Manual Cluster Join

If automation fails, manual join:

Chainfire:

curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.1.13:2380"}'

# Create marker to prevent auto-retry
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined

FlareDB:

curl -k -X POST https://node01.example.com:2479/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.1.13:2480"}'

date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined

Rollback Procedure

Remove from cluster:

# On leader
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04

# On node being removed
systemctl stop chainfire.service
rm -rf /var/lib/chainfire/*
rm /var/lib/first-boot-automation/.chainfire-joined

# Re-enable automation
systemctl restart chainfire-cluster-join.service

Future Enhancements

Planned Improvements

1. Exponential Backoff

  • Current: Fixed 10-second delay
  • Future: 1s, 2s, 4s, 8s, 16s exponential backoff
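The planned sequence is a power of two per attempt; a minimal sketch (function name illustrative):

```shell
# Delay (seconds) before attempt N: 1, 2, 4, 8, 16, ...
backoff_delay() {
    local attempt="$1"
    echo $(( 1 << (attempt - 1) ))  # 2^(attempt-1)
}

# A join loop would then use: sleep "$(backoff_delay "$ATTEMPT")"
```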

2. Leader Discovery

  • Current: Static leader URL in config
  • Future: DNS SRV records for dynamic discovery

3. Webhook Notifications

  • POST to provisioning server on completion
  • Include node info, join time, cluster health

4. Pre-flight Checks

  • Validate network connectivity before attempting join
  • Check TLS certificate validity
  • Verify disk space, memory, CPU requirements
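A pre-flight check along these lines could be sketched as follows (function name, thresholds, and the /dev/tcp reachability probe are all illustrative, not part of the current implementation):

```shell
# Sketch of planned pre-flight checks: leader reachability and disk space.
preflight() {
    local leader_host="$1" leader_port="$2" min_free_mb="${3:-1024}"

    # 1. Network: can we open a TCP connection to the leader's API port?
    timeout 5 bash -c "exec 3<>/dev/tcp/$leader_host/$leader_port" 2>/dev/null \
        || { echo "leader unreachable"; return 1; }

    # 2. Disk: enough free space on the state directory's filesystem?
    local free_mb
    free_mb=$(df -Pm /var/lib | awk 'NR==2 {print $4}')
    [[ "$free_mb" -ge "$min_free_mb" ]] \
        || { echo "insufficient disk: ${free_mb}MB free"; return 1; }

    echo "preflight ok"
}
```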

5. Automated Testing

  • Integration tests with real cluster
  • Simulate failures (network partitions, leader crashes)
  • Validate idempotency

6. Configuration Validation

  • JSON schema validation at boot
  • Fail fast on invalid configuration
  • Provide clear error messages
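A fail-fast guard in this spirit could be sketched with jq (illustrative; it checks only a subset of the schema's required keys plus the bootstrap type, where a full JSON Schema validator would be stricter):

```shell
# Fail fast if cluster-config.json is missing required keys or has
# a non-boolean "bootstrap" value. Returns nonzero on invalid config.
validate_config() {
    local config="$1"
    jq -e '
        has("node_id") and has("node_role") and has("bootstrap")
        and has("cluster_name") and (.bootstrap | type == "boolean")
    ' "$config" > /dev/null 2>&1
}

# Usage: validate_config /etc/nixos/secrets/cluster-config.json || exit 2
```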

References

  • T024: NixOS service modules
  • T025: Observability and monitoring
  • T031: TLS certificate automation
  • T032.S1-S3: PXE boot, netboot images, provisioning
  • Design Document: /home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md

Appendix: Configuration Schema

cluster-config.json Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["node_id", "node_role", "bootstrap", "cluster_name", "leader_url", "raft_addr"],
  "properties": {
    "node_id": {
      "type": "string",
      "description": "Unique node identifier"
    },
    "node_role": {
      "type": "string",
      "enum": ["control-plane", "worker", "all-in-one"]
    },
    "bootstrap": {
      "type": "boolean",
      "description": "True for first 3 nodes, false for join"
    },
    "cluster_name": {
      "type": "string"
    },
    "leader_url": {
      "type": "string",
      "format": "uri"
    },
    "raft_addr": {
      "type": "string",
      "pattern": "^[0-9.]+:[0-9]+$"
    },
    "initial_peers": {
      "type": "array",
      "items": {"type": "string"}
    },
    "flaredb_peers": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}