photoncloud-monorepo/plans/baremetal-maas-simplification-2026-04-04.md
2026-04-04 16:33:03 +09:00

# Bare Metal / MaaS-like Simplification Plan (2026-04-04)
## Summary
UltraCloud already has many of the right building blocks:
- `Nix` modules and flake outputs for host configuration
- `deployer` for bootstrap, enrollment, and inventory
- `nix-agent` for host OS reconciliation
- `fleet-scheduler` and `node-agent` for native service placement/runtime
The problem is not that everything is missing. The problem is that the boundaries are still muddy:
- source of truth is duplicated
- install-time and runtime configuration are mixed together
- registration, inventory, credential issuance, and install-plan rendering are coupled
- bootstrap and scheduling are conceptually separate but still feel entangled in the repo
This document proposes a simpler target architecture for bare metal and MaaS-like provisioning, based on both the current repo and patterns used by existing systems.
## What Existing Systems Consistently Separate
### MAAS
Useful pattern:
- machine lifecycle is explicit
- commissioning, testing, deployment, release, rescue, and broken states are operator-visible
- registration/inventory is not the same thing as workload placement
Relevant docs:
- https://discourse.maas.io/t/about-maas/5511
- https://discourse.maas.io/t/machines-do-the-heavy-lifting/5080
### Ironic and Metal3
Useful pattern:
- enrollment, manageable, available, deploy, clean, rescue are explicit provisioning states
- inspection and cleaning are first-class lifecycle steps
- root device selection is modeled explicitly instead of relying on `/dev/sdX`
Relevant docs:
- https://docs.openstack.org/ironic/latest/install/enrollment.html
- https://book.metal3.io/bmo/automated_cleaning
- https://book.metal3.io/bmo/root_device_hints
### Tinkerbell
Useful pattern:
- hardware inventory, workflow/template, and install worker are separate concepts
- the installer environment is generic
- the workflow engine is distinct from hardware registration
Relevant docs:
- https://tinkerbell.org/docs/services/tink-worker/
- https://tinkerbell.org/docs/v0.22/services/tink-controller/
### Talos and Omni
Useful pattern:
- a minimal boot medium is used only to join management
- machine classes and labels drive config selection
- machine configuration is acquired over an API instead of being hard-coded into per-node install media
Relevant docs:
- https://omni.siderolabs.com/how-to-guides/registering-machines
- https://docs.siderolabs.com/talos/v1.10/overview/what-is-talos
### NixOS deployment tools
Useful pattern:
- installation and host rollout are separate concerns
- unattended install should be repeatable from declarative config
- activation needs timeout, health gate, and rollback semantics
Relevant docs:
- https://github.com/nix-community/nixos-anywhere
- https://github.com/serokell/deploy-rs
- https://colmena.cli.rs/0.4/reference/cli.html
## Current UltraCloud Pain Points
### 1. Source of truth is duplicated
Today the repo has overlapping schema and generation paths:
- `ultracloud.cluster` generates per-node cluster config, `nix-nos` topology, and deployer cluster state
- `nix-nos` still has its own cluster schema and `generateClusterConfig`
- `deployer-types::ClusterStateSpec` is another whole-cluster model on the Rust side
This makes it too easy to author the same concept twice.
Current references:
- `nix/modules/ultracloud-cluster.nix`
- `nix-nos/modules/topology.nix`
- `nix-nos/lib/cluster-config-lib.nix`
- `deployer/crates/deployer-types/src/lib.rs`
### 2. Install-time and runtime configuration are mixed
`NodeConfig` currently contains all of the following:
- hostname and IP
- labels, pool, node class
- services
- Nix profile
- install plan
That is too much for a single object. A bootstrap/install contract should not be the same object as runtime scheduling hints.
Current references:
- `deployer/crates/deployer-types/src/lib.rs`
- `deployer/crates/deployer-server/src/phone_home.rs`
### 3. `phone_home` is carrying too much responsibility
The current flow combines:
- machine identity lookup
- enrollment-rule matching
- node assignment
- inventory summarization
- cluster node record persistence
- SSH/TLS issuance
- install-plan return
This works, but it is difficult to reason about and difficult to evolve.
### 4. ISO bootstrap is still node-path oriented
The generic ISO still falls back to node-specific paths like:
- `nix/nodes/vm-cluster/$NODE_ID/disko.nix`
That prevents profile/class-based provisioning from becoming the main path.
### 5. Host rollout and runtime scheduling are separated in code but not in the mental model
The repo already has:
- `nix-agent` for host OS state
- host deployment reconciliation for writing `desired-system`
- `fleet-scheduler` for native service placement
- `node-agent` for process/container reconcile
These are the right components, but the naming and schema boundaries do not make the split obvious.
## Design Goal
The simplest viable target is not "build all of MAAS".
The simplest viable target is:
1. `Nix` is the only authoring surface for static cluster intent.
2. bootstrap deals only with discovery, assignment, credentials, and install plans.
3. host rollout is a separate controller/agent path.
4. service scheduling is entirely downstream of host rollout.
5. BMC and PXE are optional extensions, not required for the base design.
For your current 6-machine, no-BMC environment, this is the right scope.
## Proposed Target Model
### Layer 1: Static model in Nix
Create a single Nix library as the canonical schema. Do not create a fourth schema; promote the existing `cluster-config-lib` into the canonical one.
Recommended file:
- `nix/lib/cluster-schema.nix`
Practical migration:
- move or copy `nix-nos/lib/cluster-config-lib.nix` to `nix/lib/cluster-schema.nix`
- make `ultracloud-cluster.nix` and `nix-nos/modules/topology.nix` thin wrappers over it
- stop adding new schema logic anywhere else
This library should define only stable declarative objects:
- cluster
- networks
- install profiles
- disk policies
- node classes
- pools
- enrollment rules
- nodes
- host deployments
- service policies
From that one schema, generate these artifacts:
- `nixosConfigurations.<node>`
- bootstrap install-plan data
- deployer cluster-state JSON
- test-cluster topology
## Recommended Nix Object Split
### `installProfiles`
Purpose:
- reusable OS install targets
- used during discovery/bootstrap
Fields:
- flake attribute or system profile reference
- disk policy reference
- network policy reference
- bootstrap package set / image bundle reference
### `diskPolicies`
Purpose:
- stable root-disk selection
- avoid hardcoding `/dev/sda` or node-specific Disko paths
Fields:
- root device hints
- partition layout
- wipe/cleaning policy
Borrowed directly from Ironic/Metal3 thinking: disk choice must be modeled, not guessed.
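As a sketch of that idea, the following snippet shows hint-based root-disk selection that resolves to a stable `by-id` path. All type and field names here are illustrative, not the repo's actual disk-policy schema:

```rust
// Sketch only: Ironic-style root device hints. Hypothetical types,
// not the real `diskPolicies` schema.

#[derive(Debug, Clone)]
struct DiskFacts {
    by_id_path: String, // stable /dev/disk/by-id/... path, never /dev/sdX
    size_bytes: u64,
    rotational: bool,
    model: String,
}

#[derive(Debug, Clone)]
struct RootDeviceHints {
    min_size_bytes: Option<u64>,
    rotational: Option<bool>,
    model_contains: Option<String>,
}

/// Pick the first discovered disk that satisfies every provided hint.
/// Unspecified hints act as wildcards.
fn select_root_disk<'a>(disks: &'a [DiskFacts], hints: &RootDeviceHints) -> Option<&'a DiskFacts> {
    disks.iter().find(|d| {
        hints.min_size_bytes.map_or(true, |min| d.size_bytes >= min)
            && hints.rotational.map_or(true, |r| d.rotational == r)
            && hints
                .model_contains
                .as_ref()
                .map_or(true, |m| d.model.contains(m.as_str()))
    })
}

fn main() {
    let disks = vec![
        DiskFacts {
            by_id_path: "/dev/disk/by-id/ata-HDD-1".into(),
            size_bytes: 4_000_000_000_000,
            rotational: true,
            model: "HDD".into(),
        },
        DiskFacts {
            by_id_path: "/dev/disk/by-id/nvme-SSD-1".into(),
            size_bytes: 1_000_000_000_000,
            rotational: false,
            model: "NVMe SSD".into(),
        },
    ];
    let hints = RootDeviceHints {
        min_size_bytes: Some(500_000_000_000),
        rotational: Some(false),
        model_contains: None,
    };
    let root = select_root_disk(&disks, &hints).expect("no disk matched hints");
    println!("root disk: {}", root.by_id_path);
}
```

The important design point is the return value: a `by-id` path that survives reboots and enumeration-order changes, which is exactly what the node-specific Disko paths do not guarantee.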
### `nodeClasses`
Purpose:
- describe intended hardware/software role
Fields:
- install profile
- default labels
- runtime capabilities
- minimum hardware traits
### `enrollmentRules`
Purpose:
- match discovered machines to class/pool/labels
Fields:
- selectors on machine-id, MAC, DMI, disk traits, NIC traits
- assigned node class
- assigned pool
- optional hostname/node-id policy
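The selector matching above can be sketched as follows. All type and field names are illustrative stand-ins, not the deployer's real schema:

```rust
// Sketch only: a hypothetical enrollment-rule matcher. First match
// wins, so rule order is part of the configuration.

struct MachineFacts {
    macs: Vec<String>,
    dmi_product: String,
    disk_count: usize,
}

struct EnrollmentRule {
    mac_prefix: Option<String>,
    dmi_product: Option<String>,
    min_disks: Option<usize>,
    node_class: String,
    pool: String,
}

/// A rule matches when every selector it specifies is satisfied;
/// unspecified selectors are wildcards.
fn rule_matches(rule: &EnrollmentRule, m: &MachineFacts) -> bool {
    rule.mac_prefix
        .as_ref()
        .map_or(true, |p| m.macs.iter().any(|mac| mac.starts_with(p.as_str())))
        && rule.dmi_product.as_ref().map_or(true, |p| &m.dmi_product == p)
        && rule.min_disks.map_or(true, |n| m.disk_count >= n)
}

fn assign<'a>(rules: &'a [EnrollmentRule], m: &MachineFacts) -> Option<&'a EnrollmentRule> {
    rules.iter().find(|r| rule_matches(r, m))
}

fn main() {
    let rules = vec![
        EnrollmentRule {
            mac_prefix: Some("aa:bb".into()),
            dmi_product: None,
            min_disks: Some(2),
            node_class: "storage".into(),
            pool: "default".into(),
        },
        // Catch-all rule: everything else becomes a generic worker.
        EnrollmentRule {
            mac_prefix: None,
            dmi_product: None,
            min_disks: None,
            node_class: "worker".into(),
            pool: "default".into(),
        },
    ];
    let m = MachineFacts {
        macs: vec!["aa:bb:cc:dd:ee:ff".into()],
        dmi_product: "GenericPC".into(),
        disk_count: 2,
    };
    let rule = assign(&rules, &m).expect("catch-all rule should match");
    println!("assigned class {} in pool {}", rule.node_class, rule.pool);
}
```

A trailing catch-all rule gives the generic path its default class, while earlier, narrower rules capture special hardware.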
### `nodes`
Purpose:
- explicit identity for fixed nodes when you want them
Use this for:
- control plane seeds
- gateways
- special hardware
Do not require this for every worker in the generic path.
### `hostDeployments`
Purpose:
- rollout desired host OS state to already-installed machines
This is not bootstrap.
### `servicePolicies`
Purpose:
- runtime placement intent for `fleet-scheduler`
This is not host provisioning.
## Proposed Rust/API Object Split
Replace the current "fat" `NodeConfig` mental model with explicit smaller objects.
### `MachineInventory`
Owned by:
- bootstrap discovery
Contains:
- machine identity
- hardware facts
- last inventory hash
- boot method support
- optional power capability metadata
### `NodeAssignment`
Owned by:
- deployer enrollment logic
Contains:
- stable `node_id`
- hostname
- class
- pool
- labels
- failure domain
### `BootstrapSecrets`
Owned by:
- deployer credential issuer
Contains:
- SSH host key
- TLS cert/key
- bootstrap token or short-lived install token
### `InstallPlan`
Owned by:
- deployer plan renderer
Contains:
- node assignment reference
- install profile reference
- resolved flake attr or system reference
- resolved disk policy or root-device selection
- network bootstrap data
- image/bundle URL
### `DesiredSystem`
Owned by:
- host rollout controller
Contains:
- target system
- activation strategy
- health check
- rollback policy
### `ServiceSpec`
Owned by:
- runtime scheduler
Contains:
- service placement and instance policy only
It should not be returned by bootstrap APIs.
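The split above can be sketched as plain Rust types. All names and fields are illustrative stand-ins for what would live in `deployer-types`; note that `ServiceSpec` is deliberately absent from the bootstrap response:

```rust
// Sketch only: the proposed payload split, replacing one fat
// NodeConfig with three independently owned objects.

struct NodeAssignment {
    node_id: String,
    hostname: String,
    class: String,
    pool: String,
    labels: Vec<(String, String)>,
}

struct BootstrapSecrets {
    ssh_host_key: String,
    tls_cert_pem: String,
    install_token: String, // short-lived, install-scoped
}

struct InstallPlan {
    node_id: String,         // reference back to the assignment
    install_profile: String,
    system_attr: String,     // resolved flake attr / system reference
    root_disk_by_id: String, // resolved disk-policy output
    bundle_url: String,
}

/// The three objects travel together in the bootstrap response but
/// evolve independently. No ServiceSpec here: runtime placement is
/// not a bootstrap concern.
struct BootstrapResponse {
    assignment: NodeAssignment,
    secrets: BootstrapSecrets,
    plan: InstallPlan,
}

/// The plan references the assignment; it never restates it.
fn plan_is_consistent(resp: &BootstrapResponse) -> bool {
    resp.plan.node_id == resp.assignment.node_id
}

fn main() {
    let resp = BootstrapResponse {
        assignment: NodeAssignment {
            node_id: "node-01".into(),
            hostname: "node-01".into(),
            class: "worker".into(),
            pool: "default".into(),
            labels: vec![],
        },
        secrets: BootstrapSecrets {
            ssh_host_key: "<key>".into(),
            tls_cert_pem: "<cert>".into(),
            install_token: "<token>".into(),
        },
        plan: InstallPlan {
            node_id: "node-01".into(),
            install_profile: "generic-worker".into(),
            system_attr: "nixosConfigurations.node-01".into(),
            root_disk_by_id: "/dev/disk/by-id/nvme-SSD-1".into(),
            bundle_url: "https://deployer.internal/bundle".into(),
        },
    };
    assert!(plan_is_consistent(&resp));
    println!("host {} -> profile {}", resp.assignment.hostname, resp.plan.install_profile);
}
```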
## Recommended Controller Split
### 1. Deployer server
Keep responsibility limited to:
- discovery
- enrollment / assignment
- inventory storage
- credential issuance
- install-plan rendering
Do not make it the host rollout engine and do not make it the runtime scheduler.
### 2. Host deployment controller
Make this an explicit first-class component. Today that logic exists in `ultracloud-reconciler hosts`.
Responsibility:
- watch `HostDeployment`
- select nodes
- write `desired-system`
- respect rollout budget and drain policy
Recommendation:
- rename it conceptually to `host-controller`
- keep it separate from `fleet-scheduler`
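The rollout-budget responsibility can be sketched as a small selection function. The max-in-flight semantics are an assumption about what "rollout budget" would mean here, not the reconciler's current behavior:

```rust
// Sketch only: budget-limited node selection for a host rollout
// controller, assuming the budget means a max-in-flight count.

/// Start at most (max_in_flight - in_progress) outdated nodes this
/// reconcile round, skipping nodes that are currently draining.
fn select_for_rollout<'a>(
    outdated: &[&'a str],
    draining: &[&'a str],
    in_progress: usize,
    max_in_flight: usize,
) -> Vec<&'a str> {
    let budget = max_in_flight.saturating_sub(in_progress);
    outdated
        .iter()
        .filter(|n| !draining.contains(*n))
        .take(budget)
        .copied()
        .collect()
}

fn main() {
    // One rollout already in flight, budget of two: only one more
    // node starts, and the draining node is never selected.
    let picked = select_for_rollout(&["a", "b", "c"], &["b"], 1, 2);
    println!("{:?}", picked); // ["a"]
}
```

Keeping this logic in the host controller, not the deployer server, is what makes the boundary in the list above enforceable.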
### 3. `nix-agent`
This should borrow deploy-rs style semantics:
- activation timeout
- confirmation/health gate
- rollback on failure
- staged reboot handling
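Those semantics reduce to a small control flow. This is a sketch with stubbed switch and health-probe callbacks, not deploy-rs's actual API:

```rust
// Sketch only: deploy-rs style activation with a timeout-bounded
// health gate and rollback on failure.

use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum ActivationOutcome {
    Confirmed,
    RolledBack(String),
}

/// Switch to the new system, then require a health confirmation
/// before the deadline; otherwise roll back.
fn activate(
    switch_to_new_system: impl Fn() -> Result<(), String>,
    health_probe: impl Fn() -> bool,
    timeout: Duration,
) -> ActivationOutcome {
    if let Err(e) = switch_to_new_system() {
        return ActivationOutcome::RolledBack(format!("switch failed: {e}"));
    }
    let deadline = Instant::now() + timeout;
    loop {
        if health_probe() {
            return ActivationOutcome::Confirmed;
        }
        if Instant::now() >= deadline {
            // A real agent would re-activate the previous generation
            // here before reporting failure, and would sleep between
            // probes instead of spinning.
            return ActivationOutcome::RolledBack("health gate timed out".into());
        }
    }
}

fn main() {
    let ok = activate(|| Ok(()), || true, Duration::from_secs(30));
    println!("{:?}", ok);
}
```

The key property is that "activation succeeded" is never inferred from the switch alone; it requires an explicit confirmation within the timeout, which is what makes unattended rollouts safe on hardware without BMC.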
### 4. `fleet-scheduler`
Responsibility:
- service placement only
Do not allow bootstrap/install concerns to leak here.
## Recommended Bootstrap Flow
Keep one generic installer image, but make the protocol explicit.
### Step 1: discover
Installer boots and sends:
- machine identity
- hardware facts
- observed network facts
### Step 2: assign
Deployer resolves:
- class
- pool
- hostname/node-id
- install profile
### Step 3: fetch plan
Installer receives:
- `NodeAssignment`
- `BootstrapSecrets`
- `InstallPlan`
### Step 4: install
Installer:
- fetches source bundle
- resolves disk policy
- runs Disko
- installs NixOS
- reports status
### Step 5: first boot
Installed system starts:
- core static services
- `nix-agent`
- runtime agent only if needed for that class
This is closer to Tinkerbell and Talos than to the current monolithic `node_config` flow, while remaining much smaller than MAAS or Ironic.
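The five steps can be sketched as an explicit installer phase machine. Phase names follow the steps above; transport, disk-policy resolution, and the actual install are out of scope:

```rust
// Sketch only: the installer's happy path as a linear phase machine.
// Any failed step parks the machine in Failed for operator or
// rescue-flow intervention.

#[derive(Debug, Clone, Copy, PartialEq)]
enum InstallerPhase {
    Discover,  // step 1: send identity + hardware facts
    FetchPlan, // steps 2-3: server assigns, installer pulls the plan
    Install,   // step 4: disk policy, Disko, NixOS install
    Report,    // step 4: status back to the deployer
    FirstBoot, // step 5: hand off to nix-agent
    Failed,
}

fn advance(phase: InstallerPhase, step_ok: bool) -> InstallerPhase {
    use InstallerPhase::*;
    if !step_ok {
        return Failed;
    }
    match phase {
        Discover => FetchPlan,
        FetchPlan => Install,
        Install => Report,
        Report => FirstBoot,
        FirstBoot => FirstBoot, // terminal for the installer
        Failed => Failed,
    }
}

fn main() {
    let mut p = InstallerPhase::Discover;
    while p != InstallerPhase::FirstBoot {
        p = advance(p, true);
    }
    println!("installer finished at {:?}", p);
}
```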
## Recommended Lifecycle State Model
Adopt a visible state machine. At minimum:
- `discovered`
- `inspected`
- `commissioned`
- `install-pending`
- `installing`
- `installed`
- `active`
- `draining`
- `reprovisioning`
- `rescue`
- `failed`
Keep these orthogonal to:
- power state
- host rollout state
- runtime service health
This separation is important. MAAS and Ironic both benefit from not collapsing every concern into one state field.
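The states above can be written down as an enum with an explicit transition table. The transitions shown are an illustrative subset, not a final design:

```rust
// Sketch only: the lifecycle state machine, kept deliberately
// separate from power state, rollout state, and service health.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Lifecycle {
    Discovered,
    Inspected,
    Commissioned,
    InstallPending,
    Installing,
    Installed,
    Active,
    Draining,
    Reprovisioning,
    Rescue,
    Failed,
}

/// Whitelist of legal transitions; everything else is rejected,
/// which keeps lifecycle moves operator-visible and auditable.
fn can_transition(from: Lifecycle, to: Lifecycle) -> bool {
    use Lifecycle::*;
    matches!(
        (from, to),
        (Discovered, Inspected)
            | (Inspected, Commissioned)
            | (Commissioned, InstallPending)
            | (InstallPending, Installing)
            | (Installing, Installed)
            | (Installing, Failed)
            | (Installed, Active)
            | (Active, Draining)
            | (Draining, Reprovisioning)
            | (Draining, Rescue)
            | (Reprovisioning, InstallPending)
            | (Rescue, Active)
            | (Rescue, Reprovisioning)
            | (Failed, Rescue)
            | (Failed, Reprovisioning)
    )
}

fn main() {
    assert!(can_transition(Lifecycle::Installing, Lifecycle::Installed));
    assert!(!can_transition(Lifecycle::Active, Lifecycle::Installing));
    println!("transition table ok");
}
```

Note that reprovisioning routes back through `install-pending` rather than jumping straight to `installing`, so a reinstall goes through the same plan-rendering path as a first install.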
## Concrete Repo Changes Recommended
### Phase A: schema simplification
1. Promote `nix-nos/lib/cluster-config-lib.nix` into `nix/lib/cluster-schema.nix`.
2. Remove duplicated schema logic from `nix-nos/modules/topology.nix`.
3. Keep `ultracloud-cluster.nix` as an exporter/generator module, not a second schema definition.
### Phase B: bootstrap contract simplification
1. Deprecate `NodeConfig` as the primary bootstrap payload.
2. Introduce separate Rust types for:
- assignment
- bootstrap secrets
- install plan
3. Keep `phone_home` endpoint if desired, but split the implementation internally into separate phases/functions.
### Phase C: installer simplification
1. Remove node-specific fallback logic from `nix/iso/ultracloud-iso.nix`.
2. Require a resolved install profile or disk policy in the returned install plan.
3. Resolve disk targets using stable hints or explicit by-id paths.
### Phase D: controller clarification
1. Make the host rollout controller a named subsystem.
2. Document `nix-agent` as host OS reconcile only.
3. Document `fleet-scheduler` and `node-agent` as runtime-only.
### Phase E: operator UX
1. Add an inventory/commission view to `deployer-ctl`.
2. Make lifecycle transitions explicit.
3. Add reinstall/rescue flows that work even without BMC.
## What Not To Build Yet
Do not start with:
- a full MAAS clone
- full Ironic feature parity
- mandatory PXE
- mandatory BMC
- scheduler-driven bootstrap for all control-plane services
For the current environment, that would add complexity faster than value.
## Smallest Useful End State For The 6-PC Lab
The smallest useful design is:
- one generic ISO
- hardware discovery
- rule-based assignment to class/pool/profile
- explicit install plan
- stable disk policy
- first-boot `nix-agent`
- host rollout separate from runtime service scheduling
That gives you a MaaS-like system for real hardware without forcing MAAS-scale complexity into the repo.
## Immediate Next Design Tasks
1. Write `nix/lib/cluster-schema.nix` by extracting and renaming the existing cluster library.
2. Redesign the Rust bootstrap payloads around `NodeAssignment`, `BootstrapSecrets`, and `InstallPlan`.
3. Update the ISO to consume only the new install-plan contract.
4. Write a short architecture doc that shows the four control loops:
- discovery/enrollment
- installation
- host rollout
- runtime scheduling