
First-Boot Automation Architecture

Overview

The first-boot automation system provides automated cluster joining and service initialization for bare-metal provisioned nodes. It handles two critical scenarios:

  1. Bootstrap Mode: First 3 nodes initialize a new Raft cluster
  2. Join Mode: Additional nodes join an existing cluster

This document describes the architecture, design decisions, and implementation details.

System Architecture

Component Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                    NixOS Boot Process                        │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              systemd.target: multi-user.target               │
└────────────────────┬────────────────────────────────────────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
     ▼               ▼               ▼
┌──────────┐  ┌──────────┐  ┌──────────┐
│chainfire │  │ flaredb  │  │   iam    │
│.service  │  │.service  │  │.service  │
└────┬─────┘  └────┬─────┘  └────┬─────┘
     │             │               │
     ▼             ▼               ▼
┌──────────────────────────────────────────┐
│   chainfire-cluster-join.service         │
│   - Waits for local chainfire health     │
│   - Checks bootstrap flag                │
│   - Joins cluster if bootstrap=false     │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│   flaredb-cluster-join.service           │
│   - Requires chainfire-cluster-join      │
│   - Waits for local flaredb health       │
│   - Joins FlareDB cluster                │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│   iam-initial-setup.service              │
│   - Waits for IAM health                 │
│   - Creates admin user if needed         │
│   - Generates initial tokens             │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│   cluster-health-check.service           │
│   - Polls all service health endpoints   │
│   - Verifies cluster membership          │
│   - Reports to journald                  │
└──────────────────────────────────────────┘

Configuration Flow

┌─────────────────────────────────────────┐
│  Provisioning Server                    │
│  - Generates cluster-config.json        │
│  - Copies to /etc/nixos/secrets/        │
└────────────────┬────────────────────────┘
                 │
                 │ nixos-anywhere
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Target Node                            │
│  /etc/nixos/secrets/cluster-config.json │
└────────────────┬────────────────────────┘
                 │
                 │ Read by NixOS module
                 │
                 ▼
┌─────────────────────────────────────────┐
│  first-boot-automation.nix              │
│  - Parses JSON config                   │
│  - Creates systemd services             │
│  - Sets up dependencies                 │
└────────────────┬────────────────────────┘
                 │
                 │ systemd activation
                 │
                 ▼
┌─────────────────────────────────────────┐
│  Cluster Join Services                  │
│  - Execute join logic                   │
│  - Create marker files                  │
│  - Log to journald                      │
└─────────────────────────────────────────┘

Bootstrap vs Join Decision Logic

Decision Tree

                    ┌─────────────────┐
                    │  Node Boots     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Read cluster-   │
                    │ config.json     │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ bootstrap=true? │
                    └────────┬────────┘
                             │
                ┌────────────┴────────────┐
                │                         │
           YES  ▼                         ▼  NO
    ┌─────────────────┐       ┌─────────────────┐
    │ Bootstrap Mode  │       │  Join Mode      │
    │                 │       │                 │
    │ - Skip cluster  │       │ - Wait for      │
    │   join API      │       │   local health  │
    │ - Raft cluster  │       │ - Contact       │
    │   initializes   │       │   leader        │
    │   internally    │       │ - POST to       │
    │ - Create marker │       │   /member/add   │
    │ - Exit success  │       │ - Retry 5x      │
    └─────────────────┘       └─────────────────┘
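The decision tree above can be sketched as a small shell function (the function name and jq usage are illustrative; the shipped detector embeds equivalent logic). The exit codes match the Bootstrap Detector codes documented under Error Codes below: 0 = bootstrap, 1 = join, 2 = configuration error.

```shell
# Sketch of the bootstrap/join decision (illustrative helper name).
detect_mode() {
    local config="$1" bootstrap
    # Missing or unreadable configuration is a hard error.
    [[ -r "$config" ]] || { echo "config-error"; return 2; }
    bootstrap=$(jq -r '.bootstrap' "$config" 2>/dev/null) \
        || { echo "config-error"; return 2; }
    case "$bootstrap" in
        true)  echo "bootstrap"; return 0 ;;  # skip join API; Raft self-initializes
        false) echo "join";      return 1 ;;  # contact the leader's /admin/member/add
        *)     echo "config-error"; return 2 ;;  # key missing or not a boolean
    esac
}

# Usage: detect_mode /etc/nixos/secrets/cluster-config.json
```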

Bootstrap Mode (bootstrap: true)

When to use:

  • First 3 nodes in a new cluster
  • Nodes configured with matching initial_peers
  • No existing cluster to join

Behavior:

  1. Service starts with --initial-cluster parameter containing all bootstrap peers
  2. Raft consensus protocol automatically elects leader
  3. Cluster join service detects bootstrap mode and exits immediately
  4. No API calls to leader (cluster doesn't exist yet)

Configuration:

{
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}

Marker file: /var/lib/first-boot-automation/.chainfire-initialized

Join Mode (bootstrap: false)

When to use:

  • Nodes joining an existing cluster
  • Expansion or replacement nodes
  • Leader URL is known and reachable

Behavior:

  1. Service starts with no initial cluster configuration
  2. Cluster join service waits for local service health
  3. POST to leader's /admin/member/add with node info
  4. Leader adds member to Raft configuration
  5. Node joins cluster and synchronizes state

Configuration:

{
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}

Marker file: /var/lib/first-boot-automation/.chainfire-joined

Idempotency and State Management

Marker Files

The system uses marker files to track initialization state:

/var/lib/first-boot-automation/
├── .chainfire-initialized    # Bootstrap node initialized
├── .chainfire-joined          # Node joined cluster
├── .flaredb-initialized       # FlareDB bootstrap
├── .flaredb-joined            # FlareDB joined
└── .iam-initialized           # IAM setup complete

Purpose:

  • Prevent duplicate join attempts on reboot
  • Support idempotent operations
  • Enable troubleshooting (check timestamps)

Format: ISO8601 timestamp of initialization

2025-12-10T10:30:45+00:00
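The guard that makes each join service idempotent can be sketched as two helpers (the helper names are illustrative; the real units embed equivalent logic). Marker names and the state directory match the layout shown above.

```shell
# State directory used by all first-boot services (see tree above).
MARKER_DIR="/var/lib/first-boot-automation"

already_done() {
    # Exit 0 (skip the work) if the marker exists.
    [[ -f "$MARKER_DIR/$1" ]]
}

mark_done() {
    # Record completion as an ISO8601 timestamp, per the format above.
    mkdir -p "$MARKER_DIR"
    date -Iseconds > "$MARKER_DIR/$1"
}

# Usage:
#   already_done .chainfire-joined || { do_join; mark_done .chainfire-joined; }
```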

State Transitions

┌──────────────┐
│ First Boot   │
│ (no marker)  │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ Check Config │
│ bootstrap=?  │
└──────┬───────┘
       │
       ├─(true)──▶ Bootstrap ──▶ Create .initialized ──▶ Done
       │
       └─(false)─▶ Join ──▶ Create .joined ──▶ Done
                     │
                     │ (reboot)
                     ▼
                  ┌──────────────┐
                  │ Marker Exists│
                  │ Skip Join    │
                  └──────────────┘

Retry Logic and Error Handling

Health Check Retry

Parameters:

  • Timeout: 120 seconds (configurable)
  • Retry Interval: 5 seconds
  • Max Elapsed: 300 seconds

Logic:

# TIMEOUT and HEALTH_URL are supplied by the service environment;
# defaults shown here for clarity.
TIMEOUT="${TIMEOUT:-120}"
HEALTH_URL="${HEALTH_URL:-https://localhost:2379/health}"

START_TIME=$(date +%s)
while true; do
    ELAPSED=$(($(date +%s) - START_TIME))
    if [[ $ELAPSED -ge $TIMEOUT ]]; then
        exit 1  # Timeout
    fi

    # -k accepts self-signed certs (see Security Considerations);
    # --max-time keeps a hung connection from eating the retry budget.
    HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" --max-time 5 "$HEALTH_URL")
    if [[ "$HTTP_CODE" == "200" ]]; then
        exit 0  # Success
    fi

    sleep 5  # Retry interval (see parameters above)
done

Cluster Join Retry

Parameters:

  • Max Attempts: 5 (configurable)
  • Retry Delay: 10 seconds
  • Exponential Backoff: Optional (not implemented)

Logic:

# LEADER_URL and PAYLOAD are derived from cluster-config.json.
MAX_ATTEMPTS="${MAX_ATTEMPTS:-5}"
RETRY_DELAY="${RETRY_DELAY:-10}"

for ATTEMPT in $(seq 1 "$MAX_ATTEMPTS"); do
    # Capture only the HTTP status code, not the response body
    HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" \
        -X POST "$LEADER_URL/admin/member/add" \
        -H "Content-Type: application/json" \
        -d "$PAYLOAD")

    if [[ "$HTTP_CODE" == "200" || "$HTTP_CODE" == "201" ]]; then
        exit 0  # Success
    elif [[ "$HTTP_CODE" == "409" ]]; then
        exit 2  # Already member
    fi

    sleep "$RETRY_DELAY"
done

exit 1  # Max attempts exhausted

Error Codes

Health Check:

  • 0: Service healthy
  • 1: Timeout or unhealthy

Cluster Join:

  • 0: Successfully joined
  • 1: Failed after max attempts
  • 2: Already joined (idempotent)
  • 3: Invalid arguments

Bootstrap Detector:

  • 0: Should bootstrap
  • 1: Should join existing
  • 2: Configuration error

Security Considerations

TLS Certificate Handling

Requirements:

  • All inter-node communication uses TLS
  • Self-signed certificates supported via -k flag to curl
  • Certificate validation in production (remove -k)

Certificate Paths:

{
  "tls": {
    "enabled": true,
    "ca_cert_path": "/etc/nixos/secrets/ca.crt",
    "node_cert_path": "/etc/nixos/secrets/node01.crt",
    "node_key_path": "/etc/nixos/secrets/node01.key"
  }
}
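When certificate validation is enabled in production, the -k flag gives way to explicit CA and client-certificate options. A hedged sketch using the paths from the config above (the commented-out target URL is illustrative):

```shell
# Production-mode curl arguments: validate the server against the
# cluster CA and present the node's client certificate (mTLS),
# instead of passing -k. Paths match the tls section above.
CURL_TLS_ARGS=(
  --cacert /etc/nixos/secrets/ca.crt      # verify server against cluster CA
  --cert   /etc/nixos/secrets/node01.crt  # client certificate for mTLS
  --key    /etc/nixos/secrets/node01.key  # client private key
)

# Example (URL illustrative):
# curl "${CURL_TLS_ARGS[@]}" -s -o /dev/null -w "%{http_code}" \
#      https://node01.example.com:2379/health
```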

Integration with T031:

  • Certificates generated by T031 TLS automation
  • Copied to target during provisioning
  • Read by services at startup

Secrets Management

Cluster Configuration:

  • Stored in /etc/nixos/secrets/cluster-config.json
  • Permissions: 0600 root:root (recommended)
  • Contains sensitive data: URLs, IPs, topology

API Credentials:

  • IAM admin credentials (future implementation)
  • Stored in separate file: /etc/nixos/secrets/iam-admin.json
  • Never logged to journald

Attack Surface

Mitigations:

  1. Network-level: Firewall rules restrict cluster API ports
  2. Application-level: mTLS for authenticated requests
  3. Access control: systemd service isolation
  4. Audit: All operations logged to journald with structured JSON

Integration Points

T024 NixOS Modules

The first-boot automation module imports and extends service modules:

# Example: netboot-control-plane.nix
{
  imports = [
    ../modules/chainfire.nix
    ../modules/flaredb.nix
    ../modules/iam.nix
    ../modules/first-boot-automation.nix
  ];

  services.first-boot-automation.enable = true;
}

T031 TLS Certificates

Dependencies:

  • TLS certificates must exist before first boot
  • Provisioning script copies certificates to /etc/nixos/secrets/
  • Services read certificates at startup

Certificate Generation:

# On provisioning server (T031)
./tls/generate-node-cert.sh node01.example.com 10.0.1.10

# Copied to target
scp ca.crt node01.crt node01.key root@10.0.1.10:/etc/nixos/secrets/

T032.S1-S3 PXE/Netboot

Boot Flow:

  1. PXE boot loads iPXE firmware
  2. iPXE chainloads NixOS kernel/initrd
  3. NixOS installer runs (nixos-anywhere)
  4. System installed to disk with first-boot automation
  5. Reboot into installed system
  6. First-boot automation executes

Configuration Injection:

# During nixos-anywhere provisioning
mkdir -p /mnt/etc/nixos/secrets
cp cluster-config.json /mnt/etc/nixos/secrets/
chmod 600 /mnt/etc/nixos/secrets/cluster-config.json

Service Dependencies

Systemd Ordering

Chainfire:

After:  network-online.target, chainfire.service
Before: flaredb-cluster-join.service
Wants:  network-online.target

FlareDB:

After:  chainfire-cluster-join.service, flaredb.service
Requires: chainfire-cluster-join.service
Before: iam-initial-setup.service

IAM:

After:  flaredb-cluster-join.service, iam.service
Before: cluster-health-check.service

Health Check:

After:  chainfire-cluster-join, flaredb-cluster-join, iam-initial-setup
Type:   oneshot (no RemainAfterExit)

Dependency Graph

network-online.target
        │
        ├──▶ chainfire.service
        │         │
        │         ▼
        │    chainfire-cluster-join.service
        │         │
        ├──▶ flaredb.service
        │         │
        │         ▼
        └────▶ flaredb-cluster-join.service
                  │
             ┌────┴────┐
             │         │
        iam.service    │
             │         │
             ▼         │
        iam-initial-setup.service
             │         │
             └────┬────┘
                  │
                  ▼
        cluster-health-check.service

Logging and Observability

Structured Logging

All scripts output JSON-formatted logs:

{
  "timestamp": "2025-12-10T10:30:45+00:00",
  "level": "INFO",
  "service": "chainfire",
  "operation": "cluster-join",
  "message": "Successfully joined cluster"
}

Benefits:

  • Machine-readable for log aggregation (T025)
  • Easy filtering with journalctl -o json
  • Includes context (service, operation, timestamp)
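One way to emit this format from the shell scripts is a small jq-backed helper (a sketch; the helper name is illustrative, but the field set matches the example above, and jq handles escaping of message text):

```shell
# Emit one compact JSON log line in the documented format.
log_json() {
    local level="$1" service="$2" operation="$3" message="$4"
    jq -cn \
        --arg timestamp "$(date -Iseconds)" \
        --arg level "$level" \
        --arg service "$service" \
        --arg operation "$operation" \
        --arg message "$message" \
        '{timestamp: $timestamp, level: $level, service: $service,
          operation: $operation, message: $message}'
}

# Usage: log_json INFO chainfire cluster-join "Successfully joined cluster"
```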

Querying Logs

View all first-boot automation logs:

journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service \
           -u iam-initial-setup.service -u cluster-health-check.service

Filter by log level:

journalctl -u chainfire-cluster-join.service | grep -E '"level": ?"ERROR"'

Follow live:

journalctl -u chainfire-cluster-join.service -f

Health Check Integration

T025 Observability:

  • Health check service can POST to metrics endpoint
  • Prometheus scraping of /health endpoints
  • Alerts on cluster join failures

Future:

  • Webhook to provisioning server on completion
  • Slack/email notifications on errors
  • Dashboard showing cluster join status

Performance Characteristics

Boot Time Analysis

Typical Timeline (3-node cluster):

T+0s    : systemd starts
T+5s    : network-online.target reached
T+10s   : chainfire.service starts
T+15s   : chainfire healthy
T+15s   : chainfire-cluster-join runs (bootstrap, immediate exit)
T+20s   : flaredb.service starts
T+25s   : flaredb healthy
T+25s   : flaredb-cluster-join runs (bootstrap, immediate exit)
T+30s   : iam.service starts
T+35s   : iam healthy
T+35s   : iam-initial-setup runs
T+40s   : cluster-health-check runs
T+40s   : Node fully operational

Join Mode (node joining existing cluster):

T+0s    : systemd starts
T+5s    : network-online.target reached
T+10s   : chainfire.service starts
T+15s   : chainfire healthy
T+15s   : chainfire-cluster-join runs
T+20s   : POST to leader, wait for response
T+25s   : Successfully joined chainfire cluster
T+25s   : flaredb.service starts
T+30s   : flaredb healthy
T+30s   : flaredb-cluster-join runs
T+35s   : Successfully joined flaredb cluster
T+40s   : iam-initial-setup (skips, already initialized)
T+45s   : cluster-health-check runs
T+45s   : Node fully operational

Bottlenecks

Health Check Polling:

  • 5-second intervals may be too aggressive
  • Recommendation: Exponential backoff

Network Latency:

  • Join requests block on network RTT
  • Mitigation: Ensure low-latency cluster network

Raft Synchronization:

  • New member must catch up on Raft log
  • Time depends on log size (seconds to minutes)

Failure Modes and Recovery

Common Failures

1. Leader Unreachable

Symptom:

{"level":"ERROR","message":"Join request failed: connection error"}

Diagnosis:

  • Check network connectivity: ping node01.example.com
  • Verify firewall rules: iptables -L
  • Check leader service status: systemctl status chainfire.service

Recovery:

# Fix network/firewall, then restart join service
systemctl restart chainfire-cluster-join.service

2. Invalid Configuration

Symptom:

{"level":"ERROR","message":"Configuration file not found"}

Diagnosis:

  • Verify file exists: ls -la /etc/nixos/secrets/cluster-config.json
  • Check JSON syntax: jq . /etc/nixos/secrets/cluster-config.json

Recovery:

# Fix configuration, then restart
systemctl restart chainfire-cluster-join.service

3. Service Not Healthy

Symptom:

{"level":"ERROR","message":"Health check timeout"}

Diagnosis:

  • Check service logs: journalctl -u chainfire.service
  • Verify service is running: systemctl status chainfire.service
  • Test health endpoint: curl -k https://localhost:2379/health

Recovery:

# Restart the main service
systemctl restart chainfire.service

# Join service will auto-retry after RestartSec

4. Already Member

Symptom:

{"level":"WARN","message":"Node already member of cluster (HTTP 409)"}

Diagnosis:

  • This is normal on reboots
  • Marker file created to prevent future attempts

Recovery:

  • No action needed (idempotent behavior)

Manual Cluster Join

If automation fails, manual join:

Chainfire:

curl -k -X POST https://node01.example.com:2379/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.1.13:2380"}'

# Create marker to prevent auto-retry
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined

FlareDB:

curl -k -X POST https://node01.example.com:2479/admin/member/add \
  -H "Content-Type: application/json" \
  -d '{"id":"node04","raft_addr":"10.0.1.13:2480"}'

date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined

Rollback Procedure

Remove from cluster:

# On leader
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04

# On node being removed
systemctl stop chainfire.service
rm -rf /var/lib/chainfire/*
rm /var/lib/first-boot-automation/.chainfire-joined

# Re-enable automation
systemctl restart chainfire-cluster-join.service

Future Enhancements

Planned Improvements

1. Exponential Backoff

  • Current: Fixed 10-second delay
  • Future: 1s, 2s, 4s, 8s, 16s exponential backoff
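The planned sequence is a power of two per attempt; a minimal sketch (function name illustrative):

```shell
# Delay (seconds) before attempt N: 1, 2, 4, 8, 16, ...
backoff_delay() {
    local attempt="$1"
    echo $(( 1 << (attempt - 1) ))  # 2^(attempt-1)
}

# A join loop would then use: sleep "$(backoff_delay "$ATTEMPT")"
```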

2. Leader Discovery

  • Current: Static leader URL in config
  • Future: DNS SRV records for dynamic discovery

3. Webhook Notifications

  • POST to provisioning server on completion
  • Include node info, join time, cluster health

4. Pre-flight Checks

  • Validate network connectivity before attempting join
  • Check TLS certificate validity
  • Verify disk space, memory, CPU requirements
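A pre-flight check along these lines could be sketched as follows (function name, thresholds, and the /dev/tcp reachability probe are all illustrative, not part of the current implementation):

```shell
# Sketch of planned pre-flight checks: leader reachability and disk space.
preflight() {
    local leader_host="$1" leader_port="$2" min_free_mb="${3:-1024}"

    # 1. Network: can we open a TCP connection to the leader's API port?
    timeout 5 bash -c "exec 3<>/dev/tcp/$leader_host/$leader_port" 2>/dev/null \
        || { echo "leader unreachable"; return 1; }

    # 2. Disk: enough free space on the state directory's filesystem?
    local free_mb
    free_mb=$(df -Pm /var/lib | awk 'NR==2 {print $4}')
    [[ "$free_mb" -ge "$min_free_mb" ]] \
        || { echo "insufficient disk: ${free_mb}MB free"; return 1; }

    echo "preflight ok"
}
```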

5. Automated Testing

  • Integration tests with real cluster
  • Simulate failures (network partitions, leader crashes)
  • Validate idempotency

6. Configuration Validation

  • JSON schema validation at boot
  • Fail fast on invalid configuration
  • Provide clear error messages
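A fail-fast guard in this spirit could be sketched with jq (illustrative; it checks only a subset of the schema's required keys plus the bootstrap type, where a full JSON Schema validator would be stricter):

```shell
# Fail fast if cluster-config.json is missing required keys or has
# a non-boolean "bootstrap" value. Returns nonzero on invalid config.
validate_config() {
    local config="$1"
    jq -e '
        has("node_id") and has("node_role") and has("bootstrap")
        and has("cluster_name") and (.bootstrap | type == "boolean")
    ' "$config" > /dev/null 2>&1
}

# Usage: validate_config /etc/nixos/secrets/cluster-config.json || exit 2
```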

References

  • T024: NixOS service modules
  • T025: Observability and monitoring
  • T031: TLS certificate automation
  • T032.S1-S3: PXE boot, netboot images, provisioning
  • Design Document: /home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md

Appendix: Configuration Schema

cluster-config.json Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["node_id", "node_role", "bootstrap", "cluster_name", "leader_url", "raft_addr"],
  "properties": {
    "node_id": {
      "type": "string",
      "description": "Unique node identifier"
    },
    "node_role": {
      "type": "string",
      "enum": ["control-plane", "worker", "all-in-one"]
    },
    "bootstrap": {
      "type": "boolean",
      "description": "True for first 3 nodes, false for join"
    },
    "cluster_name": {
      "type": "string"
    },
    "leader_url": {
      "type": "string",
      "format": "uri"
    },
    "raft_addr": {
      "type": "string",
      "pattern": "^[0-9.]+:[0-9]+$"
    },
    "initial_peers": {
      "type": "array",
      "items": {"type": "string"}
    },
    "flaredb_peers": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}