Bare Metal / MaaS-like Simplification Plan (2026-04-04)
Summary
UltraCloud already has many of the right building blocks:
- Nix modules and flake outputs for host configuration
- `deployer` for bootstrap, enrollment, and inventory
- `nix-agent` for host OS reconciliation
- `fleet-scheduler` and `node-agent` for native service placement/runtime
The problem is not "missing everything". The problem is that the boundaries are still muddy:
- source of truth is duplicated
- install-time and runtime configuration are mixed together
- registration, inventory, credential issuance, and install-plan rendering are coupled
- bootstrap and scheduling are conceptually separate but still feel entangled in the repo
This document proposes a simpler target architecture for bare metal and MaaS-like provisioning, based on both the current repo and patterns used by existing systems.
What Existing Systems Consistently Separate
MAAS
Useful pattern:
- machine lifecycle is explicit
- commissioning, testing, deployment, release, rescue, and broken states are operator-visible
- registration/inventory is not the same thing as workload placement
Relevant docs:
- https://discourse.maas.io/t/about-maas/5511
- https://discourse.maas.io/t/machines-do-the-heavy-lifting/5080
Ironic and Metal3
Useful pattern:
- enrollment, manageable, available, deploy, clean, rescue are explicit provisioning states
- inspection and cleaning are first-class lifecycle steps
- root device selection is modeled explicitly instead of relying on `/dev/sdX`
Relevant docs:
- https://docs.openstack.org/ironic/latest/install/enrollment.html
- https://book.metal3.io/bmo/automated_cleaning
- https://book.metal3.io/bmo/root_device_hints
Tinkerbell
Useful pattern:
- hardware inventory, workflow/template, and install worker are separate concepts
- the installer environment is generic
- the workflow engine is distinct from hardware registration
Relevant docs:
- https://tinkerbell.org/docs/services/tink-worker/
- https://tinkerbell.org/docs/v0.22/services/tink-controller/
Talos and Omni
Useful pattern:
- a minimal boot medium is used only to join management
- machine classes and labels drive config selection
- machine configuration is acquired over an API instead of being hard-coded into per-node install media
Relevant docs:
- https://omni.siderolabs.com/how-to-guides/registering-machines
- https://docs.siderolabs.com/talos/v1.10/overview/what-is-talos
NixOS deployment tools
Useful pattern:
- installation and host rollout are separate concerns
- unattended install should be repeatable from declarative config
- activation needs timeout, health gate, and rollback semantics
Relevant docs:
- https://github.com/nix-community/nixos-anywhere
- https://github.com/serokell/deploy-rs
- https://colmena.cli.rs/0.4/reference/cli.html
Current UltraCloud Pain Points
1. Source of truth is duplicated
Today the repo has overlapping schema and generation paths:
- `ultracloud.cluster` generates per-node cluster config, `nix-nos` topology, and deployer cluster state
- `nix-nos` still has its own cluster schema and `generateClusterConfig`
- `deployer-types::ClusterStateSpec` is another whole-cluster model on the Rust side
This makes it too easy to author the same concept twice.
Current references:
- `nix/modules/ultracloud-cluster.nix`
- `nix-nos/modules/topology.nix`
- `nix-nos/lib/cluster-config-lib.nix`
- `deployer/crates/deployer-types/src/lib.rs`
2. Install-time and runtime configuration are mixed
NodeConfig currently contains all of the following:
- hostname and IP
- labels, pool, node class
- services
- Nix profile
- install plan
That is too much for a single object. A bootstrap/install contract should not be the same object as runtime scheduling hints.
Current references:
- `deployer/crates/deployer-types/src/lib.rs`
- `deployer/crates/deployer-server/src/phone_home.rs`
3. phone_home is carrying too much responsibility
The current flow combines:
- machine identity lookup
- enrollment-rule matching
- node assignment
- inventory summarization
- cluster node record persistence
- SSH/TLS issuance
- install-plan return
This works, but it is difficult to reason about and difficult to evolve.
4. ISO bootstrap is still node-path oriented
The generic ISO still falls back to node-specific paths like:
`nix/nodes/vm-cluster/$NODE_ID/disko.nix`
That prevents profile/class-based provisioning from becoming the main path.
5. Host rollout and runtime scheduling are separated in code but not in the mental model
The repo already has:
- `nix-agent` for host OS state
- host deployment reconciliation for writing `desired-system`
- `fleet-scheduler` for native service placement
- `node-agent` for process/container reconcile
These are the right components, but the naming and schema boundaries do not make the split obvious.
Design Goal
The simplest viable target is not "build all of MAAS".
The simplest viable target is:
- Nix is the only authoring surface for static cluster intent.
- bootstrap deals only with discovery, assignment, credentials, and install plans.
- host rollout is a separate controller/agent path.
- service scheduling is entirely downstream of host rollout.
- BMC and PXE are optional extensions, not required for the base design.
For your current 6-machine, no-BMC environment, this is the right scope.
Proposed Target Model
Layer 1: Static model in Nix
Create a single Nix library as the canonical schema. Do not create a fourth schema; promote the existing cluster-config-lib into the canonical one.
Recommended file:
`nix/lib/cluster-schema.nix`
Practical migration:
- move or copy `nix-nos/lib/cluster-config-lib.nix` to `nix/lib/cluster-schema.nix`
- make `ultracloud-cluster.nix` and `nix-nos/modules/topology.nix` thin wrappers over it
- stop adding new schema logic anywhere else
This library should define only stable declarative objects:
- cluster
- networks
- install profiles
- disk policies
- node classes
- pools
- enrollment rules
- nodes
- host deployments
- service policies
From that one schema, generate these artifacts:
- `nixosConfigurations.<node>`
- bootstrap install-plan data
- deployer cluster-state JSON
- test-cluster topology
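To make the "one schema, many generators" idea concrete, here is a hedged Rust sketch of the fan-out. All type and artifact names are illustrative placeholders, not the repo's actual outputs:

```rust
// Hypothetical sketch: one canonical schema object fanning out into the
// per-node artifacts listed above. Names are illustrative, not real types.

#[derive(Debug, Clone)]
struct ClusterSchema {
    cluster_name: String,
    // node ids declared once, in the canonical Nix schema
    nodes: Vec<String>,
}

/// Every generator consumes the same schema object; no generator
/// re-derives cluster structure on its own.
fn generated_artifacts(schema: &ClusterSchema) -> Vec<String> {
    let mut out = Vec::new();
    for node in &schema.nodes {
        out.push(format!("nixosConfigurations.{node}")); // host closure
        out.push(format!("install-plan/{node}.json"));   // bootstrap data
    }
    out.push(format!("cluster-state/{}.json", schema.cluster_name)); // deployer state
    out.push(format!("test-topology/{}.nix", schema.cluster_name));  // test cluster
    out
}
```

The point of the sketch is the direction of data flow: generators are pure functions of the schema, never additional sources of truth.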
Recommended Nix Object Split
installProfiles
Purpose:
- reusable OS install targets
- used during discovery/bootstrap
Fields:
- flake attribute or system profile reference
- disk policy reference
- network policy reference
- bootstrap package set / image bundle reference
diskPolicies
Purpose:
- stable root-disk selection
- avoid hardcoding `/dev/sda` or node-specific Disko paths
Fields:
- root device hints
- partition layout
- wipe/cleaning policy
Borrowed directly from Ironic/Metal3 thinking: disk choice must be modeled, not guessed.
nodeClasses
Purpose:
- describe intended hardware/software role
Fields:
- install profile
- default labels
- runtime capabilities
- minimum hardware traits
enrollmentRules
Purpose:
- match discovered machines to class/pool/labels
Fields:
- selectors on machine-id, MAC, DMI, disk traits, NIC traits
- assigned node class
- assigned pool
- optional hostname/node-id policy
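A rule match can be sketched as a first-match-wins predicate over discovered facts. This is an illustrative Rust sketch with hypothetical field names; the real selector set (machine-id, MAC, DMI, disk traits, NIC traits) would be richer:

```rust
// Illustrative enrollment-rule matching; field names are hypothetical.

#[derive(Debug)]
struct MachineFacts {
    machine_id: String,
    macs: Vec<String>,
    dmi_product: String,
}

#[derive(Debug)]
struct EnrollmentRule {
    // `None` means "this selector does not constrain the match"
    mac_prefix: Option<String>,
    dmi_product: Option<String>,
    node_class: String,
    pool: String,
}

/// First matching rule wins; a machine matching no rule stays unassigned
/// (still `discovered`) rather than being guessed into a class.
fn match_rule<'a>(facts: &MachineFacts, rules: &'a [EnrollmentRule]) -> Option<&'a EnrollmentRule> {
    rules.iter().find(|r| {
        r.mac_prefix
            .as_ref()
            .map_or(true, |p| facts.macs.iter().any(|m| m.starts_with(p.as_str())))
            && r.dmi_product.as_ref().map_or(true, |d| d == &facts.dmi_product)
    })
}
```

Keeping the matcher a pure function over inventory facts makes it easy to unit-test rules before pointing real hardware at the deployer.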
nodes
Purpose:
- explicit identity for fixed nodes when you want them
Use this for:
- control plane seeds
- gateways
- special hardware
Do not require this for every worker in the generic path.
hostDeployments
Purpose:
- rollout desired host OS state to already-installed machines
This is not bootstrap.
servicePolicies
Purpose:
- runtime placement intent for `fleet-scheduler`
This is not host provisioning.
Proposed Rust/API Object Split
Replace the current "fat" NodeConfig mental model with explicit smaller objects.
MachineInventory
Owned by:
- bootstrap discovery
Contains:
- machine identity
- hardware facts
- last inventory hash
- boot method support
- optional power capability metadata
NodeAssignment
Owned by:
- deployer enrollment logic
Contains:
- stable `node_id`
- hostname
- class
- pool
- labels
- failure domain
BootstrapSecrets
Owned by:
- deployer credential issuer
Contains:
- SSH host key
- TLS cert/key
- bootstrap token or short-lived install token
InstallPlan
Owned by:
- deployer plan renderer
Contains:
- node assignment reference
- install profile reference
- resolved flake attr or system reference
- resolved disk policy or root-device selection
- network bootstrap data
- image/bundle URL
DesiredSystem
Owned by:
- host rollout controller
Contains:
- target system
- activation strategy
- health check
- rollback policy
ServiceSpec
Owned by:
- runtime scheduler
Contains:
- service placement and instance policy only
It should not be returned by bootstrap APIs.
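The split above can be sketched as plain Rust types. Field names here are illustrative, not the proposed final shape; the point is the ownership boundaries between objects:

```rust
// Sketch of the proposed split replacing one fat NodeConfig.
// All fields are illustrative placeholders.

struct MachineInventory {
    machine_id: String,
    hardware_facts: Vec<(String, String)>, // fact name -> value
    inventory_hash: String,
}

struct NodeAssignment {
    node_id: String,
    hostname: String,
    class: String,
    pool: String,
    labels: Vec<(String, String)>,
}

struct BootstrapSecrets {
    ssh_host_key: String,
    tls_cert_pem: String,
    install_token: String, // short-lived, bootstrap only
}

struct InstallPlan {
    assignment: NodeAssignment,
    install_profile: String,
    system_attr: String, // resolved flake attr or system reference
    root_device: String, // resolved from the disk policy, not /dev/sdX
    bundle_url: String,
}
```

Because each object has a single owner (discovery, enrollment, credential issuer, plan renderer), none of them needs to know about runtime `ServiceSpec` at all.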
Recommended Controller Split
1. Deployer server
Keep responsibility limited to:
- discovery
- enrollment / assignment
- inventory storage
- credential issuance
- install-plan rendering
Do not make it the host rollout engine and do not make it the runtime scheduler.
2. Host deployment controller
Make this an explicit first-class component. Today that logic exists in `ultracloud-reconciler` hosts.
Responsibility:
- watch `HostDeployment`
- select nodes
- write `desired-system`
- respect rollout budget and drain policy
Recommendation:
- rename it conceptually to `host-controller`
- keep it separate from `fleet-scheduler`
3. nix-agent
This should borrow deploy-rs style semantics:
- activation timeout
- confirmation/health gate
- rollback on failure
- staged reboot handling
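The gate-then-rollback semantics can be sketched in a few lines. This is a minimal illustration with hypothetical names, not `nix-agent` or deploy-rs code; the real agent would run `switch-to-configuration` and generation rollback where the comments indicate:

```rust
// Minimal sketch of deploy-rs-style activation gating: activate, then
// require a health confirmation within a timeout, else roll back.
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Outcome {
    Activated,
    RolledBack,
}

fn activate_with_gate(timeout: Duration, mut health_ok: impl FnMut() -> bool) -> Outcome {
    // (real agent: run switch-to-configuration here, before gating)
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if health_ok() {
            // confirmation received within the window: keep the new system
            return Outcome::Activated;
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    // no confirmation before the deadline
    // (real agent: switch back to the previous generation, maybe staged reboot)
    Outcome::RolledBack
}
```

The important property is that "no news" defaults to rollback, so a host that activates a broken network config recovers without operator access.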
4. fleet-scheduler
Responsibility:
- service placement only
Do not allow bootstrap/install concerns to leak here.
Recommended Bootstrap Flow
Keep one generic installer image, but make the protocol explicit.
Step 1: discover
Installer boots and sends:
- machine identity
- hardware facts
- observed network facts
Step 2: assign
Deployer resolves:
- class
- pool
- hostname/node-id
- install profile
Step 3: fetch plan
Installer receives:
- `NodeAssignment`
- `BootstrapSecrets`
- `InstallPlan`
Step 4: install
Installer:
- fetches source bundle
- resolves disk policy
- runs Disko
- installs NixOS
- reports status
Step 5: first boot
Installed system starts:
- core static services
- `nix-agent`
- runtime agent only if needed for that class
This is closer to Tinkerbell and Talos than to the current monolithic node_config flow, while remaining much smaller than MAAS or Ironic.
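The five steps above can be written down as an explicit installer phase machine. This is a sketch to show that the protocol has a fixed, linear shape; the names are illustrative, not an existing API:

```rust
// Hedged sketch of the installer side of the five-step bootstrap flow.

#[derive(Debug, PartialEq)]
enum InstallerPhase {
    Discover,
    Assign,
    FetchPlan,
    Install,
    FirstBoot,
}

fn next_phase(p: &InstallerPhase) -> Option<InstallerPhase> {
    use InstallerPhase::*;
    match p {
        Discover => Some(Assign),   // sent identity + hardware + network facts
        Assign => Some(FetchPlan),  // deployer resolved class/pool/profile
        FetchPlan => Some(Install), // received assignment, secrets, install plan
        Install => Some(FirstBoot), // disko + nixos-install + status report
        FirstBoot => None,          // installer done; nix-agent owns the host now
    }
}
```

Making the phases explicit (rather than one monolithic `node_config` exchange) is what allows each step to fail, retry, and be observed independently.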
Recommended Lifecycle State Model
Adopt a visible state machine. At minimum:
`discovered`, `inspected`, `commissioned`, `install-pending`, `installing`, `installed`, `active`, `draining`, `reprovisioning`, `rescue`, `failed`
Keep these orthogonal to:
- power state
- host rollout state
- runtime service health
This separation is important. MAAS and Ironic both benefit from not collapsing every concern into one state field.
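As a sketch, the lifecycle can be encoded as an enum with an explicit transition table. The state names follow the list above; the set of allowed transitions shown here is illustrative and would need to be designed properly:

```rust
// Sketch of an operator-visible lifecycle state machine with an
// explicit (illustrative) transition table.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Lifecycle {
    Discovered,
    Inspected,
    Commissioned,
    InstallPending,
    Installing,
    Installed,
    Active,
    Draining,
    Reprovisioning,
    Rescue,
    Failed,
}

fn can_transition(from: Lifecycle, to: Lifecycle) -> bool {
    use Lifecycle::*;
    matches!(
        (from, to),
        (Discovered, Inspected)
            | (Inspected, Commissioned)
            | (Commissioned, InstallPending)
            | (InstallPending, Installing)
            | (Installing, Installed)
            | (Installing, Failed)
            | (Installed, Active)
            | (Active, Draining)
            | (Active, Rescue)
            | (Rescue, Active)
            | (Draining, Reprovisioning)
            | (Reprovisioning, InstallPending)
    )
}
```

Power state, host rollout state, and service health stay as separate fields; only the provisioning lifecycle lives in this enum.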
Concrete Repo Changes Recommended
Phase A: schema simplification
- Promote `nix-nos/lib/cluster-config-lib.nix` into `nix/lib/cluster-schema.nix`.
- Remove duplicated schema logic from `nix-nos/modules/topology.nix`.
- Keep `ultracloud-cluster.nix` as an exporter/generator module, not a second schema definition.
Phase B: bootstrap contract simplification
- Deprecate `NodeConfig` as the primary bootstrap payload.
- Introduce separate Rust types for:
- assignment
- bootstrap secrets
- install plan
- Keep the `phone_home` endpoint if desired, but split the implementation internally into separate phases/functions.
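The internal split can look like a thin composition of single-purpose functions. This is a hedged sketch with placeholder types, not the existing `phone_home.rs` code:

```rust
// Sketch: one phone_home endpoint, but each phase is a separate,
// independently testable function. All types here are placeholders.

struct Inventory(String);  // machine identity + facts
struct Assignment(String); // node_id / class / pool
struct Secrets(String);    // issued credentials
struct Plan(String);       // rendered install plan

fn record_inventory(machine_id: &str) -> Inventory {
    Inventory(machine_id.to_string())
}
fn assign(inv: &Inventory) -> Assignment {
    Assignment(format!("node-for-{}", inv.0))
}
fn issue_secrets(a: &Assignment) -> Secrets {
    Secrets(format!("creds-for-{}", a.0))
}
fn render_plan(a: &Assignment) -> Plan {
    Plan(format!("plan-for-{}", a.0))
}

/// The endpoint becomes pure plumbing over the four phases.
fn phone_home(machine_id: &str) -> (Assignment, Secrets, Plan) {
    let inv = record_inventory(machine_id);
    let a = assign(&inv);
    let s = issue_secrets(&a);
    let p = render_plan(&a);
    (a, s, p)
}
```

The wire contract stays the same, but inventory storage, enrollment matching, credential issuance, and plan rendering can each evolve (and be tested) on their own.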
Phase C: installer simplification
- Remove node-specific fallback logic from `nix/iso/ultracloud-iso.nix`.
- Require a resolved install profile or disk policy in the returned install plan.
- Resolve disk targets using stable hints or explicit by-id paths.
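Hint-based resolution can be sketched as a filter over enumerated disks. This is an illustrative Rust sketch with hypothetical hint fields, in the spirit of Metal3's root device hints, not the actual installer code:

```rust
// Illustrative root-device selection by stable hints instead of
// /dev/sdX enumeration order. Hint fields are hypothetical.

#[derive(Debug)]
struct Disk {
    by_id: String, // stable /dev/disk/by-id/... path
    size_gb: u64,
    model: String,
}

#[derive(Debug)]
struct RootDeviceHints {
    min_size_gb: Option<u64>,
    model_contains: Option<String>,
}

/// Pick the smallest disk satisfying the hints, leaving larger
/// data disks untouched.
fn resolve_root<'a>(disks: &'a [Disk], hints: &RootDeviceHints) -> Option<&'a Disk> {
    disks
        .iter()
        .filter(|d| hints.min_size_gb.map_or(true, |m| d.size_gb >= m))
        .filter(|d| {
            hints
                .model_contains
                .as_ref()
                .map_or(true, |s| d.model.contains(s.as_str()))
        })
        .min_by_key(|d| d.size_gb)
}
```

Returning `None` when no disk matches is deliberate: an ambiguous or unmatchable disk policy should fail the install rather than wipe a guessed device.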
Phase D: controller clarification
- Make the host rollout controller a named subsystem.
- Document `nix-agent` as host OS reconcile only.
- Document `fleet-scheduler` and `node-agent` as runtime-only.
Phase E: operator UX
- Add an inventory/commission view to `deployer-ctl`.
- Make lifecycle transitions explicit.
- Add reinstall/rescue flows that work even without BMC.
What Not To Build Yet
Do not start with:
- a full MAAS clone
- full Ironic feature parity
- mandatory PXE
- mandatory BMC
- scheduler-driven bootstrap for all control-plane services
For the current environment, that would add complexity faster than value.
Smallest Useful End State For The 6-PC Lab
The smallest useful design is:
- one generic ISO
- hardware discovery
- rule-based assignment to class/pool/profile
- explicit install plan
- stable disk policy
- first-boot `nix-agent`
- host rollout separate from runtime service scheduling
That gives you a MaaS-like system for real hardware without forcing MAAS-scale complexity into the repo.
Immediate Next Design Tasks
- Write `nix/lib/cluster-schema.nix` by extracting and renaming the existing cluster library.
- Redesign the Rust bootstrap payloads around `NodeAssignment`, `BootstrapSecrets`, and `InstallPlan`.
- Update the ISO to consume only the new install-plan contract.
- Write a short architecture doc that shows the four control loops:
- discovery/enrollment
- installation
- host rollout
- runtime scheduling