# First-Boot Automation Architecture
## Overview
The first-boot automation system provides automated cluster joining and service initialization for bare-metal provisioned nodes. It handles two critical scenarios:

1. **Bootstrap Mode**: First 3 nodes initialize a new Raft cluster
2. **Join Mode**: Additional nodes join an existing cluster

This document describes the architecture, design decisions, and implementation details.
## System Architecture
### Component Hierarchy
```
┌─────────────────────────────────────────────────────────────┐
│                     NixOS Boot Process                      │
└────────────────────┬────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────────┐
│              systemd.target: multi-user.target              │
└────────────────────┬────────────────────────────────────────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
     ▼               ▼               ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│chainfire │   │ flaredb  │   │   iam    │
│ .service │   │ .service │   │ .service │
└────┬─────┘   └────┬─────┘   └────┬─────┘
     │              │              │
     ▼              ▼              ▼
┌──────────────────────────────────────────┐
│      chainfire-cluster-join.service      │
│  - Waits for local chainfire health      │
│  - Checks bootstrap flag                 │
│  - Joins cluster if bootstrap=false      │
└────────────────┬─────────────────────────┘
                 │
┌────────────────▼─────────────────────────┐
│       flaredb-cluster-join.service       │
│  - Requires chainfire-cluster-join       │
│  - Waits for local flaredb health        │
│  - Joins FlareDB cluster                 │
└────────────────┬─────────────────────────┘
                 │
┌────────────────▼─────────────────────────┐
│        iam-initial-setup.service         │
│  - Waits for IAM health                  │
│  - Creates admin user if needed          │
│  - Generates initial tokens              │
└────────────────┬─────────────────────────┘
                 │
┌────────────────▼─────────────────────────┐
│       cluster-health-check.service       │
│  - Polls all service health endpoints    │
│  - Verifies cluster membership           │
│  - Reports to journald                   │
└──────────────────────────────────────────┘
```
### Configuration Flow
```
┌─────────────────────────────────────────┐
│           Provisioning Server           │
│  - Generates cluster-config.json        │
│  - Copies to /etc/nixos/secrets/        │
└────────────────┬────────────────────────┘
                 │ nixos-anywhere
┌────────────────▼────────────────────────┐
│               Target Node               │
│ /etc/nixos/secrets/cluster-config.json  │
└────────────────┬────────────────────────┘
                 │ Read by NixOS module
┌────────────────▼────────────────────────┐
│        first-boot-automation.nix        │
│  - Parses JSON config                   │
│  - Creates systemd services             │
│  - Sets up dependencies                 │
└────────────────┬────────────────────────┘
                 │ systemd activation
┌────────────────▼────────────────────────┐
│          Cluster Join Services          │
│  - Execute join logic                   │
│  - Create marker files                  │
│  - Log to journald                      │
└─────────────────────────────────────────┘
```
## Bootstrap vs Join Decision Logic
### Decision Tree
```
           ┌─────────────────┐
           │   Node Boots    │
           └────────┬────────┘
                    │
           ┌────────▼────────┐
           │  Read cluster-  │
           │  config.json    │
           └────────┬────────┘
                    │
           ┌────────▼────────┐
           │ bootstrap=true? │
           └────────┬────────┘
                    │
         ┌──────────┴──────────┐
     YES ▼                     ▼ NO
┌─────────────────┐   ┌─────────────────┐
│ Bootstrap Mode  │   │    Join Mode    │
│                 │   │                 │
│ - Skip cluster  │   │ - Wait for      │
│   join API      │   │   local health  │
│ - Raft cluster  │   │ - Contact       │
│   initializes   │   │   leader        │
│   internally    │   │ - POST to       │
│ - Create marker │   │   /member/add   │
│ - Exit success  │   │ - Retry 5x      │
└─────────────────┘   └─────────────────┘
```
### Bootstrap Mode (bootstrap: true)

**When to use:**

- First 3 nodes in a new cluster
- Nodes configured with matching `initial_peers`
- No existing cluster to join

**Behavior:**

1. Service starts with an `--initial-cluster` parameter containing all bootstrap peers
2. The Raft consensus protocol automatically elects a leader
3. The cluster join service detects bootstrap mode and exits immediately
4. No API calls to the leader (the cluster doesn't exist yet)

**Configuration:**

```json
{
  "bootstrap": true,
  "initial_peers": ["node01:2380", "node02:2380", "node03:2380"]
}
```

**Marker file:** `/var/lib/first-boot-automation/.chainfire-initialized`
### Join Mode (bootstrap: false)

**When to use:**

- Nodes joining an existing cluster
- Expansion or replacement nodes
- The leader URL is known and reachable

**Behavior:**

1. Service starts with no initial cluster configuration
2. The cluster join service waits for local service health
3. POST to the leader's `/admin/member/add` with node info
4. The leader adds the member to the Raft configuration
5. The node joins the cluster and synchronizes state

**Configuration:**

```json
{
  "bootstrap": false,
  "leader_url": "https://node01.example.com:2379",
  "raft_addr": "10.0.1.13:2380"
}
```

**Marker file:** `/var/lib/first-boot-automation/.chainfire-joined`
## Idempotency and State Management

### Marker Files

The system uses marker files to track initialization state:

```
/var/lib/first-boot-automation/
├── .chainfire-initialized   # Bootstrap node initialized
├── .chainfire-joined        # Node joined cluster
├── .flaredb-initialized     # FlareDB bootstrap
├── .flaredb-joined          # FlareDB joined
└── .iam-initialized         # IAM setup complete
```

**Purpose:**

- Prevent duplicate join attempts on reboot
- Support idempotent operations
- Enable troubleshooting (check timestamps)

**Format:** ISO8601 timestamp of initialization

```
2025-12-10T10:30:45+00:00
```
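The marker convention above can be expressed as a pair of small shell helpers. This is an illustrative sketch, not the actual implementation; only the directory path and the ISO8601 timestamp format come from this document, and the helper names are hypothetical:

```shell
# Sketch of the marker-file convention. STATE_DIR matches the path above;
# the helper names are hypothetical.
STATE_DIR="${STATE_DIR:-/var/lib/first-boot-automation}"

marker_exists() {            # e.g. marker_exists .chainfire-joined
  [[ -f "$STATE_DIR/$1" ]]
}

create_marker() {            # records an ISO8601 timestamp, as specified above
  mkdir -p "$STATE_DIR"
  date -Iseconds > "$STATE_DIR/$1"
}
```

A join script would call `marker_exists` at startup and exit successfully when the marker is already present, which is what makes reboots idempotent.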
### State Transitions
```
     ┌──────────────┐
     │  First Boot  │
     │ (no marker)  │
     └──────┬───────┘
            │
     ┌──────▼───────┐
     │ Check Config │
     │ bootstrap=?  │
     └──────┬───────┘
            │
            ├─(true)──▶ Bootstrap ──▶ Create .initialized ──▶ Done
            └─(false)─▶ Join ──────▶ Create .joined ───────▶ Done
                                         │ (reboot)
                                  ┌──────▼───────┐
                                  │ Marker Exists│
                                  │  Skip Join   │
                                  └──────────────┘
```
## Retry Logic and Error Handling

### Health Check Retry

**Parameters:**

- Timeout: 120 seconds (configurable)
- Retry Interval: 5 seconds
- Max Elapsed: 300 seconds

**Logic:**

```bash
HEALTH_URL=${HEALTH_URL:-https://localhost:2379/health}
TIMEOUT=${TIMEOUT:-120}

START_TIME=$(date +%s)
while true; do
  ELAPSED=$(( $(date +%s) - START_TIME ))
  if [[ $ELAPSED -ge $TIMEOUT ]]; then
    exit 1  # Timeout
  fi
  HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")
  if [[ "$HTTP_CODE" == "200" ]]; then
    exit 0  # Success
  fi
  sleep 5
done
```
### Cluster Join Retry

**Parameters:**

- Max Attempts: 5 (configurable)
- Retry Delay: 10 seconds
- Exponential Backoff: optional (not implemented)

**Logic:**

```bash
for ATTEMPT in $(seq 1 "$MAX_ATTEMPTS"); do
  # -s -o /dev/null -w "%{http_code}" captures the status code, not the body
  HTTP_CODE=$(curl -k -s -o /dev/null -w "%{http_code}" \
    -X POST "$LEADER_URL/admin/member/add" -d "$PAYLOAD")
  if [[ "$HTTP_CODE" == "200" || "$HTTP_CODE" == "201" ]]; then
    exit 0  # Success
  elif [[ "$HTTP_CODE" == "409" ]]; then
    exit 2  # Already a member
  fi
  sleep "$RETRY_DELAY"
done
exit 1  # Max attempts exhausted
```
### Error Codes

**Health Check:**

- `0`: Service healthy
- `1`: Timeout or unhealthy

**Cluster Join:**

- `0`: Successfully joined
- `1`: Failed after max attempts
- `2`: Already joined (idempotent)
- `3`: Invalid arguments

**Bootstrap Detector:**

- `0`: Should bootstrap
- `1`: Should join existing
- `2`: Configuration error
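The bootstrap detector's exit codes map naturally onto a small shell function. The sketch below uses a crude `grep` over the JSON for brevity (the real detector presumably parses it properly, e.g. with `jq`); the function name is hypothetical:

```shell
detect_mode() {   # 0 = should bootstrap, 1 = should join existing, 2 = config error
  local cfg=$1
  [[ -r "$cfg" ]] || return 2
  if grep -Eq '"bootstrap"[[:space:]]*:[[:space:]]*true' "$cfg"; then
    return 0
  elif grep -Eq '"bootstrap"[[:space:]]*:[[:space:]]*false' "$cfg"; then
    return 1
  fi
  return 2          # bootstrap key missing or malformed
}
```

The join service would run this against `/etc/nixos/secrets/cluster-config.json` and branch on the exit code.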
## Security Considerations

### TLS Certificate Handling

**Requirements:**

- All inter-node communication uses TLS
- Self-signed certificates supported via the `-k` flag to curl
- Certificate validation in production (remove `-k`)

**Certificate Paths:**

```json
{
  "tls": {
    "enabled": true,
    "ca_cert_path": "/etc/nixos/secrets/ca.crt",
    "node_cert_path": "/etc/nixos/secrets/node01.crt",
    "node_key_path": "/etc/nixos/secrets/node01.key"
  }
}
```
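In production, `-k` would give way to explicit certificate flags built from the paths above. A sketch (the helper name is hypothetical; the flags themselves are standard curl options):

```shell
# Assemble curl TLS options from the config paths shown above.
tls_curl_args() {
  local secrets=${1:-/etc/nixos/secrets}
  echo "--cacert $secrets/ca.crt --cert $secrets/node01.crt --key $secrets/node01.key"
}

# Example usage (mutual TLS instead of -k):
#   curl $(tls_curl_args) https://node01.example.com:2379/health
```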
**Integration with T031:**

- Certificates generated by T031 TLS automation
- Copied to target during provisioning
- Read by services at startup

### Secrets Management

**Cluster Configuration:**

- Stored in `/etc/nixos/secrets/cluster-config.json`
- Permissions: `0600 root:root` (recommended)
- Contains sensitive data: URLs, IPs, topology

**API Credentials:**

- IAM admin credentials (future implementation)
- Stored in a separate file: `/etc/nixos/secrets/iam-admin.json`
- Never logged to journald

### Attack Surface

**Mitigations:**

1. **Network-level**: Firewall rules restrict cluster API ports
2. **Application-level**: mTLS for authenticated requests
3. **Access control**: systemd service isolation
4. **Audit**: All operations logged to journald as structured JSON
## Integration Points

### T024 NixOS Modules

The first-boot automation module imports and extends service modules:

```nix
# Example: netboot-control-plane.nix
{
  imports = [
    ../modules/chainfire.nix
    ../modules/flaredb.nix
    ../modules/iam.nix
    ../modules/first-boot-automation.nix
  ];

  services.first-boot-automation.enable = true;
}
```
### T031 TLS Certificates

**Dependencies:**

- TLS certificates must exist before first boot
- Provisioning script copies certificates to `/etc/nixos/secrets/`
- Services read certificates at startup

**Certificate Generation:**

```bash
# On provisioning server (T031)
./tls/generate-node-cert.sh node01.example.com 10.0.1.10

# Copied to target
scp ca.crt node01.crt node01.key root@10.0.1.10:/etc/nixos/secrets/
```

### T032.S1-S3 PXE/Netboot

**Boot Flow:**

1. PXE boot loads iPXE firmware
2. iPXE chainloads NixOS kernel/initrd
3. NixOS installer runs (nixos-anywhere)
4. System installed to disk with first-boot automation
5. Reboot into installed system
6. First-boot automation executes

**Configuration Injection:**

```bash
# During nixos-anywhere provisioning
mkdir -p /mnt/etc/nixos/secrets
cp cluster-config.json /mnt/etc/nixos/secrets/
chmod 600 /mnt/etc/nixos/secrets/cluster-config.json
```
## Service Dependencies
### Systemd Ordering
**Chainfire:**
```
After: network-online.target, chainfire.service
Before: flaredb-cluster-join.service
Wants: network-online.target
```
**FlareDB:**
```
After: chainfire-cluster-join.service, flaredb.service
Requires: chainfire-cluster-join.service
Before: iam-initial-setup.service
```
**IAM:**
```
After: flaredb-cluster-join.service, iam.service
Before: cluster-health-check.service
```
**Health Check:**
```
After: chainfire-cluster-join, flaredb-cluster-join, iam-initial-setup
Type: oneshot (no RemainAfterExit)
```
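Rendered as a concrete unit file, the Chainfire ordering above might look like the following. This is a hedged sketch: only the ordering directives come from this document; the `Description`, `ExecStart` path, and the `Type`/`RemainAfterExit` choice are assumptions about what the NixOS module generates.

```ini
[Unit]
Description=Join the chainfire cluster on first boot
After=network-online.target chainfire.service
Wants=network-online.target
Before=flaredb-cluster-join.service

[Service]
# oneshot with RemainAfterExit is assumed here so dependents can order
# themselves After= this unit and see it as "active" once joined
Type=oneshot
RemainAfterExit=yes
ExecStart=/run/current-system/sw/bin/chainfire-cluster-join
```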
### Dependency Graph
```
network-online.target
  ├──▶ chainfire.service
  │            │
  │            ▼
  │    chainfire-cluster-join.service
  │            │
  ├──▶ flaredb.service
  │            │
  │            ▼
  └──▶ flaredb-cluster-join.service
               │
          ┌────┴────┐
          │         │
    iam.service     │
          │         │
          ▼         │
    iam-initial-setup.service
          │         │
          └────┬────┘
               │
               ▼
    cluster-health-check.service
```
## Logging and Observability

### Structured Logging

All scripts output JSON-formatted logs:

```json
{
  "timestamp": "2025-12-10T10:30:45+00:00",
  "level": "INFO",
  "service": "chainfire",
  "operation": "cluster-join",
  "message": "Successfully joined cluster"
}
```
**Benefits:**
- Machine-readable for log aggregation (T025)
- Easy filtering with `journalctl -o json`
- Includes context (service, operation, timestamp)
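Emitting that shape from a shell script is a one-liner. A sketch (the helper name is hypothetical; the field names match the example above):

```shell
log_json() {   # usage: log_json LEVEL SERVICE OPERATION MESSAGE
  printf '{"timestamp":"%s","level":"%s","service":"%s","operation":"%s","message":"%s"}\n' \
    "$(date -Iseconds)" "$1" "$2" "$3" "$4"
}
```

Messages containing double quotes would need escaping; a real implementation might build the object with `jq -n` instead of `printf`.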
### Querying Logs
**View all first-boot automation logs:**
```bash
journalctl -u chainfire-cluster-join.service -u flaredb-cluster-join.service \
-u iam-initial-setup.service -u cluster-health-check.service
```
**Filter by log level:**
```bash
journalctl -u chainfire-cluster-join.service | grep '"level":"ERROR"'
```
**Follow live:**
```bash
journalctl -u chainfire-cluster-join.service -f
```
### Health Check Integration
**T025 Observability:**

- Health check service can POST to a metrics endpoint
- Prometheus scraping of `/health` endpoints
- Alerts on cluster join failures

**Future:**

- Webhook to provisioning server on completion
- Slack/email notifications on errors
- Dashboard showing cluster join status
## Performance Characteristics
### Boot Time Analysis
**Typical Timeline (3-node cluster):**
```
T+0s : systemd starts
T+5s : network-online.target reached
T+10s : chainfire.service starts
T+15s : chainfire healthy
T+15s : chainfire-cluster-join runs (bootstrap, immediate exit)
T+20s : flaredb.service starts
T+25s : flaredb healthy
T+25s : flaredb-cluster-join runs (bootstrap, immediate exit)
T+30s : iam.service starts
T+35s : iam healthy
T+35s : iam-initial-setup runs
T+40s : cluster-health-check runs
T+40s : Node fully operational
```
**Join Mode (node joining existing cluster):**
```
T+0s : systemd starts
T+5s : network-online.target reached
T+10s : chainfire.service starts
T+15s : chainfire healthy
T+15s : chainfire-cluster-join runs
T+20s : POST to leader, wait for response
T+25s : Successfully joined chainfire cluster
T+25s : flaredb.service starts
T+30s : flaredb healthy
T+30s : flaredb-cluster-join runs
T+35s : Successfully joined flaredb cluster
T+40s : iam-initial-setup (skips, already initialized)
T+45s : cluster-health-check runs
T+45s : Node fully operational
```
### Bottlenecks

**Health Check Polling:**

- 5-second intervals may be too aggressive
- Recommendation: exponential backoff

**Network Latency:**

- Join requests block on network RTT
- Mitigation: ensure a low-latency cluster network

**Raft Synchronization:**

- A new member must catch up on the Raft log
- Time depends on log size (seconds to minutes)
## Failure Modes and Recovery
### Common Failures

**1. Leader Unreachable**

**Symptom:**

```json
{"level":"ERROR","message":"Join request failed: connection error"}
```

**Diagnosis:**

- Check network connectivity: `ping node01.example.com`
- Verify firewall rules: `iptables -L`
- Check leader service status: `systemctl status chainfire.service`

**Recovery:**

```bash
# Fix network/firewall, then restart the join service
systemctl restart chainfire-cluster-join.service
```

**2. Invalid Configuration**

**Symptom:**

```json
{"level":"ERROR","message":"Configuration file not found"}
```

**Diagnosis:**

- Verify the file exists: `ls -la /etc/nixos/secrets/cluster-config.json`
- Check JSON syntax: `jq . /etc/nixos/secrets/cluster-config.json`

**Recovery:**

```bash
# Fix the configuration, then restart
systemctl restart chainfire-cluster-join.service
```

**3. Service Not Healthy**

**Symptom:**

```json
{"level":"ERROR","message":"Health check timeout"}
```

**Diagnosis:**

- Check service logs: `journalctl -u chainfire.service`
- Verify the service is running: `systemctl status chainfire.service`
- Test the health endpoint: `curl -k https://localhost:2379/health`

**Recovery:**

```bash
# Restart the main service
systemctl restart chainfire.service
# The join service will auto-retry after RestartSec
```

**4. Already a Member**

**Symptom:**

```json
{"level":"WARN","message":"Node already member of cluster (HTTP 409)"}
```

**Diagnosis:**

- This is normal on reboots
- A marker file is created to prevent future attempts

**Recovery:**

- No action needed (idempotent behavior)
### Manual Cluster Join

If the automation fails, join the cluster manually:

**Chainfire:**
```bash
curl -k -X POST https://node01.example.com:2379/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node04","raft_addr":"10.0.1.13:2380"}'
# Create marker to prevent auto-retry
mkdir -p /var/lib/first-boot-automation
date -Iseconds > /var/lib/first-boot-automation/.chainfire-joined
```
**FlareDB:**
```bash
curl -k -X POST https://node01.example.com:2479/admin/member/add \
-H "Content-Type: application/json" \
-d '{"id":"node04","raft_addr":"10.0.1.13:2480"}'
date -Iseconds > /var/lib/first-boot-automation/.flaredb-joined
```
### Rollback Procedure
**Remove from cluster:**
```bash
# On leader
curl -k -X DELETE https://node01.example.com:2379/admin/member/node04
# On node being removed
systemctl stop chainfire.service
rm -rf /var/lib/chainfire/*
rm /var/lib/first-boot-automation/.chainfire-joined
# Re-enable automation
systemctl restart chainfire-cluster-join.service
```
## Future Enhancements

### Planned Improvements

**1. Exponential Backoff**

- Current: Fixed 10-second delay
- Future: 1s, 2s, 4s, 8s, 16s exponential backoff
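The proposed schedule is a capped doubling, which is a few lines of shell arithmetic. A sketch (the function name and default cap are illustrative):

```shell
backoff_delay() {   # attempt number (1-based) -> delay in seconds, capped
  local attempt=$1 base=${2:-1} cap=${3:-16}
  local delay=$(( base << (attempt - 1) ))
  if (( delay > cap )); then
    delay=$cap
  fi
  echo "$delay"
}

# Attempts 1..5 would sleep 1, 2, 4, 8, 16 seconds respectively.
```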
**2. Leader Discovery**

- Current: Static leader URL in config
- Future: DNS SRV records for dynamic discovery

**3. Webhook Notifications**

- POST to provisioning server on completion
- Include node info, join time, cluster health

**4. Pre-flight Checks**

- Validate network connectivity before attempting join
- Check TLS certificate validity
- Verify disk space, memory, CPU requirements
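A minimal pre-flight check along these lines could look as follows. The threshold, the checked path, and the function name are illustrative assumptions, not values from the implementation:

```shell
preflight() {   # usage: preflight LEADER_HOST [STATE_PATH] [MIN_FREE_KB]
  local leader_host=$1 state_path=${2:-/var/lib} min_free_kb=${3:-1048576}

  # The leader must at least resolve before a join is attempted
  getent hosts "$leader_host" >/dev/null || {
    echo "pre-flight: cannot resolve $leader_host" >&2; return 1; }

  # The state partition needs headroom for the Raft log
  local free_kb
  free_kb=$(df -Pk "$state_path" | awk 'NR==2 {print $4}')
  if (( free_kb < min_free_kb )); then
    echo "pre-flight: only ${free_kb}KB free on $state_path" >&2; return 1
  fi
}
```

Certificate validity could be checked the same way with `openssl x509 -checkend` before the join request is sent.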
**5. Automated Testing**

- Integration tests with a real cluster
- Simulate failures (network partitions, leader crashes)
- Validate idempotency

**6. Configuration Validation**

- JSON schema validation at boot
- Fail fast on invalid configuration
- Provide clear error messages
## References
- **T024**: NixOS service modules
- **T025**: Observability and monitoring
- **T031**: TLS certificate automation
- **T032.S1-S3**: PXE boot, netboot images, provisioning
- **Design Document**: `/home/centra/cloud/docs/por/T032-baremetal-provisioning/design.md`
## Appendix: Configuration Schema

### cluster-config.json Schema

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["node_id", "node_role", "bootstrap", "cluster_name", "leader_url", "raft_addr"],
  "properties": {
    "node_id": {
      "type": "string",
      "description": "Unique node identifier"
    },
    "node_role": {
      "type": "string",
      "enum": ["control-plane", "worker", "all-in-one"]
    },
    "bootstrap": {
      "type": "boolean",
      "description": "True for first 3 nodes, false for join"
    },
    "cluster_name": {
      "type": "string"
    },
    "leader_url": {
      "type": "string",
      "format": "uri"
    },
    "raft_addr": {
      "type": "string",
      "pattern": "^[0-9.]+:[0-9]+$"
    },
    "initial_peers": {
      "type": "array",
      "items": {"type": "string"}
    },
    "flaredb_peers": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}
```
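The `required` list above can be enforced at boot even before a full JSON Schema validator is in place. A crude textual sketch (a real implementation would use `jq` or a schema validator; the function name is hypothetical):

```shell
validate_required() {   # usage: validate_required CONFIG_FILE -> 0 if all keys present
  local cfg=$1 key missing=0
  for key in node_id node_role bootstrap cluster_name leader_url raft_addr; do
    if ! grep -q "\"$key\"" "$cfg"; then
      echo "cluster-config.json: missing required key: $key" >&2
      missing=1
    fi
  done
  return $missing
}
```

Failing fast here, with one message per missing key, matches the "clear error messages" goal listed under Future Enhancements.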