# Bare Metal / MaaS-like Simplification Plan (2026-04-04)

## Summary

UltraCloud already has many of the right building blocks:

- `Nix` modules and flake outputs for host configuration
- `deployer` for bootstrap, enrollment, and inventory
- `nix-agent` for host OS reconciliation
- `fleet-scheduler` and `node-agent` for native service placement/runtime

The problem is not "missing everything". The problem is that the boundaries are still muddy:

- source of truth is duplicated
- install-time and runtime configuration are mixed together
- registration, inventory, credential issuance, and install-plan rendering are coupled
- bootstrap and scheduling are conceptually separate but still feel entangled in the repo

This document proposes a simpler target architecture for bare metal and MaaS-like provisioning, based on both the current repo and patterns used by existing systems.

## What Existing Systems Consistently Separate

### MAAS

Useful pattern:

- machine lifecycle is explicit
- commissioning, testing, deployment, release, rescue, and broken states are operator-visible
- registration/inventory is not the same thing as workload placement

Relevant docs:

- https://discourse.maas.io/t/about-maas/5511
- https://discourse.maas.io/t/machines-do-the-heavy-lifting/5080

### Ironic and Metal3

Useful pattern:

- enrollment, manageable, available, deploy, clean, and rescue are explicit provisioning states
- inspection and cleaning are first-class lifecycle steps
- root device selection is modeled explicitly instead of relying on `/dev/sdX`

Relevant docs:

- https://docs.openstack.org/ironic/latest/install/enrollment.html
- https://book.metal3.io/bmo/automated_cleaning
- https://book.metal3.io/bmo/root_device_hints

### Tinkerbell

Useful pattern:

- hardware inventory, workflow/template, and install worker are separate concepts
- the installer environment is generic
- the workflow engine is distinct from hardware registration

Relevant docs:

- https://tinkerbell.org/docs/services/tink-worker/
- https://tinkerbell.org/docs/v0.22/services/tink-controller/

### Talos and Omni

Useful pattern:

- a minimal boot medium is used only to join management
- machine classes and labels drive config selection
- machine configuration is acquired over an API instead of being hard-coded into per-node install media

Relevant docs:

- https://omni.siderolabs.com/how-to-guides/registering-machines
- https://docs.siderolabs.com/talos/v1.10/overview/what-is-talos

### NixOS deployment tools

Useful pattern:

- installation and host rollout are separate concerns
- unattended install should be repeatable from declarative config
- activation needs timeout, health gate, and rollback semantics

Relevant docs:

- https://github.com/nix-community/nixos-anywhere
- https://github.com/serokell/deploy-rs
- https://colmena.cli.rs/0.4/reference/cli.html

## Current UltraCloud Pain Points

### 1. Source of truth is duplicated

Today the repo has overlapping schema and generation paths:

- `ultracloud.cluster` generates per-node cluster config, `nix-nos` topology, and deployer cluster state
- `nix-nos` still has its own cluster schema and `generateClusterConfig`
- `deployer-types::ClusterStateSpec` is another whole-cluster model on the Rust side

This makes it too easy to author the same concept twice.

Current references:

- `nix/modules/ultracloud-cluster.nix`
- `nix-nos/modules/topology.nix`
- `nix-nos/lib/cluster-config-lib.nix`
- `deployer/crates/deployer-types/src/lib.rs`

### 2. Install-time and runtime configuration are mixed

`NodeConfig` currently contains all of the following:

- hostname and IP
- labels, pool, node class
- services
- Nix profile
- install plan

That is too much for a single object. A bootstrap/install contract should not be the same object as runtime scheduling hints.

Current references:

- `deployer/crates/deployer-types/src/lib.rs`
- `deployer/crates/deployer-server/src/phone_home.rs`
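To make the mixing concrete, here is an illustrative Rust sketch of that shape. The field names are assumptions for illustration, not the actual `deployer-types` definition; the point is that one object spans several lifecycles with different owners and different change rates.

```rust
// Illustrative only: field names are assumed, not copied from
// `deployer-types`. One struct mixes enrollment identity, runtime
// scheduling intent, host OS state, and the install-time contract.
struct NodeConfig {
    // identity / assignment: stable after enrollment
    hostname: String,
    ip: String,
    labels: Vec<String>,
    pool: String,
    node_class: String,
    // runtime scheduling intent: changes as services move
    services: Vec<String>,
    // host OS state: changes on every rollout
    nix_profile: String,
    // install-time contract: only meaningful during bootstrap
    install_plan: Option<String>,
}

// A consumer that only needs the install contract still has to accept
// (and version) every runtime field above.
fn is_bootstrap_payload(cfg: &NodeConfig) -> bool {
    cfg.install_plan.is_some()
}
```

Any schema change to one concern forces a contract change on consumers of all the others, which is the coupling the later object split removes.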
### 3. `phone_home` is carrying too much responsibility

The current flow combines:

- machine identity lookup
- enrollment-rule matching
- node assignment
- inventory summarization
- cluster node record persistence
- SSH/TLS issuance
- install-plan return

This works, but it is difficult to reason about and difficult to evolve.

### 4. ISO bootstrap is still node-path oriented

The generic ISO still falls back to node-specific paths like:

- `nix/nodes/vm-cluster/$NODE_ID/disko.nix`

That prevents profile/class-based provisioning from becoming the main path.

### 5. Host rollout and runtime scheduling are separated in code but not in the mental model

The repo already has:

- `nix-agent` for host OS state
- host deployment reconciliation for writing `desired-system`
- `fleet-scheduler` for native service placement
- `node-agent` for process/container reconcile

These are the right components, but the naming and schema boundaries do not make the split obvious.

## Design Goal

The simplest viable target is not "build all of MAAS". The simplest viable target is:

1. `Nix` is the only authoring surface for static cluster intent.
2. Bootstrap deals only with discovery, assignment, credentials, and install plans.
3. Host rollout is a separate controller/agent path.
4. Service scheduling is entirely downstream of host rollout.
5. BMC and PXE are optional extensions, not required for the base design.

For your current 6-machine, no-BMC environment, this is the right scope.

## Proposed Target Model

### Layer 1: Static model in Nix

Create a single Nix library as the canonical schema. Do not create a fourth schema; promote the existing `cluster-config-lib` into the canonical one.
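The intended direction, one schema with several generated views, can be sketched as below. The type and function names here are hypothetical; in the target design the schema itself lives in Nix, and anything on the Rust side only consumes data generated from it, never defines a second schema.

```rust
// Sketch of "one schema, many derived views". Names are assumptions.
// The canonical source of truth stays in `nix/lib/cluster-schema.nix`;
// each artifact below is a projection of it, not a parallel model.
struct NodeDef {
    node_id: String,
    class: String,
    pool: String,
}

struct ClusterSchema {
    name: String,
    nodes: Vec<NodeDef>,
}

// Deployer cluster-state view: node-id -> pool.
fn cluster_state_view(s: &ClusterSchema) -> Vec<(String, String)> {
    s.nodes.iter().map(|n| (n.node_id.clone(), n.pool.clone())).collect()
}

// Bootstrap install-plan view: node-id -> class (profile selection key).
fn install_plan_view(s: &ClusterSchema) -> Vec<(String, String)> {
    s.nodes.iter().map(|n| (n.node_id.clone(), n.class.clone())).collect()
}
```

The design rule the sketch encodes: if a downstream consumer needs a new field, the field is added once to the schema and every view picks it up, instead of being authored twice.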
Recommended file:

- `nix/lib/cluster-schema.nix`

Practical migration:

- move or copy `nix-nos/lib/cluster-config-lib.nix` to `nix/lib/cluster-schema.nix`
- make `ultracloud-cluster.nix` and `nix-nos/modules/topology.nix` thin wrappers over it
- stop adding new schema logic anywhere else

This library should define only stable declarative objects:

- cluster
- networks
- install profiles
- disk policies
- node classes
- pools
- enrollment rules
- nodes
- host deployments
- service policies

From that one schema, generate these artifacts:

- `nixosConfigurations.`
- bootstrap install-plan data
- deployer cluster-state JSON
- test-cluster topology

## Recommended Nix Object Split

### `installProfiles`

Purpose:

- reusable OS install targets
- used during discovery/bootstrap

Fields:

- flake attribute or system profile reference
- disk policy reference
- network policy reference
- bootstrap package set / image bundle reference

### `diskPolicies`

Purpose:

- stable root-disk selection
- avoid hardcoding `/dev/sda` or node-specific Disko paths

Fields:

- root device hints
- partition layout
- wipe/cleaning policy

Borrowed directly from Ironic/Metal3 thinking: disk choice must be modeled, not guessed.

### `nodeClasses`

Purpose:

- describe intended hardware/software role

Fields:

- install profile
- default labels
- runtime capabilities
- minimum hardware traits

### `enrollmentRules`

Purpose:

- match discovered machines to class/pool/labels

Fields:

- selectors on machine-id, MAC, DMI, disk traits, NIC traits
- assigned node class
- assigned pool
- optional hostname/node-id policy

### `nodes`

Purpose:

- explicit identity for fixed nodes when you want them

Use this for:

- control plane seeds
- gateways
- special hardware

Do not require this for every worker in the generic path.

### `hostDeployments`

Purpose:

- roll out desired host OS state to already-installed machines

This is not bootstrap.
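The `enrollmentRules` matching described above can be sketched in Rust. All selector and field names here are illustrative assumptions, not the current deployer implementation; the rules themselves would be authored in the Nix schema and evaluated by the deployer at discovery time.

```rust
// Minimal enrollment-rule matching sketch. Field names are assumptions.
struct MachineFacts {
    mac: String,
    dmi_vendor: String,
    disk_count: usize,
}

struct EnrollmentRule {
    // selectors; `None` means "match anything"
    mac_prefix: Option<String>,
    dmi_vendor: Option<String>,
    min_disks: Option<usize>,
    // assignment
    node_class: String,
    pool: String,
}

fn matches(rule: &EnrollmentRule, m: &MachineFacts) -> bool {
    rule.mac_prefix.as_ref().map_or(true, |p| m.mac.starts_with(p.as_str()))
        && rule.dmi_vendor.as_ref().map_or(true, |v| *v == m.dmi_vendor)
        && rule.min_disks.map_or(true, |n| m.disk_count >= n)
}

// First matching rule wins, so rule ordering is part of the rule set;
// a catch-all rule at the end assigns the generic worker path.
fn assign<'a>(rules: &'a [EnrollmentRule], m: &MachineFacts) -> Option<&'a EnrollmentRule> {
    rules.iter().find(|r| matches(r, m))
}
```

Making "first match wins" explicit keeps assignment deterministic for the same hardware facts, which matters for reinstall and rescue flows.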
### `servicePolicies`

Purpose:

- runtime placement intent for `fleet-scheduler`

This is not host provisioning.

## Proposed Rust/API Object Split

Replace the current "fat" `NodeConfig` mental model with explicit smaller objects.

### `MachineInventory`

Owned by:

- bootstrap discovery

Contains:

- machine identity
- hardware facts
- last inventory hash
- boot method support
- optional power capability metadata

### `NodeAssignment`

Owned by:

- deployer enrollment logic

Contains:

- stable `node_id`
- hostname
- class
- pool
- labels
- failure domain

### `BootstrapSecrets`

Owned by:

- deployer credential issuer

Contains:

- SSH host key
- TLS cert/key
- bootstrap token or short-lived install token

### `InstallPlan`

Owned by:

- deployer plan renderer

Contains:

- node assignment reference
- install profile reference
- resolved flake attr or system reference
- resolved disk policy or root-device selection
- network bootstrap data
- image/bundle URL

### `DesiredSystem`

Owned by:

- host rollout controller

Contains:

- target system
- activation strategy
- health check
- rollback policy

### `ServiceSpec`

Owned by:

- runtime scheduler

Contains:

- service placement and instance policy only

It should not be returned by bootstrap APIs.

## Recommended Controller Split

### 1. Deployer server

Keep responsibility limited to:

- discovery
- enrollment / assignment
- inventory storage
- credential issuance
- install-plan rendering

Do not make it the host rollout engine, and do not make it the runtime scheduler.

### 2. Host deployment controller

Make this an explicit first-class component. Today that logic exists in `ultracloud-reconciler hosts`.

Responsibility:

- watch `HostDeployment`
- select nodes
- write `desired-system`
- respect rollout budget and drain policy

Recommendation:

- rename it conceptually to `host-controller`
- keep it separate from `fleet-scheduler`
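The rollout-budget behavior of the host deployment controller can be sketched as follows. The state names and the `max_unavailable` budget are assumptions for illustration; the real controller would read them from `HostDeployment` and the cluster store.

```rust
// Rollout-budget selection sketch for the host deployment controller.
// Names are illustrative, not the `ultracloud-reconciler` implementation.
#[derive(PartialEq)]
enum HostState {
    Pending,  // desired-system differs from running system; not yet started
    Updating, // rollout in flight on this host
    Failed,   // last activation failed; excluded until operator resets
}

struct Host {
    node_id: String,
    state: HostState,
}

// Pick the next hosts to update without exceeding `max_unavailable`,
// counting hosts that are already mid-rollout against the budget.
fn next_batch<'a>(hosts: &'a [Host], max_unavailable: usize) -> Vec<&'a str> {
    let in_flight = hosts.iter().filter(|h| h.state == HostState::Updating).count();
    let headroom = max_unavailable.saturating_sub(in_flight);
    hosts
        .iter()
        .filter(|h| h.state == HostState::Pending)
        .take(headroom)
        .map(|h| h.node_id.as_str())
        .collect()
}
```

Counting in-flight hosts against the budget is what keeps a crashed or restarted controller from over-committing the cluster on its next reconcile pass.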
### 3. `nix-agent`

This should borrow deploy-rs style semantics:

- activation timeout
- confirmation/health gate
- rollback on failure
- staged reboot handling

### 4. `fleet-scheduler`

Responsibility:

- service placement only

Do not allow bootstrap/install concerns to leak here.

## Recommended Bootstrap Flow

Keep one generic installer image, but make the protocol explicit.

### Step 1: discover

Installer boots and sends:

- machine identity
- hardware facts
- observed network facts

### Step 2: assign

Deployer resolves:

- class
- pool
- hostname/node-id
- install profile

### Step 3: fetch plan

Installer receives:

- `NodeAssignment`
- `BootstrapSecrets`
- `InstallPlan`

### Step 4: install

Installer:

- fetches source bundle
- resolves disk policy
- runs Disko
- installs NixOS
- reports status

### Step 5: first boot

Installed system starts:

- core static services
- `nix-agent`
- runtime agent only if needed for that class

This is closer to Tinkerbell and Talos than to the current monolithic `node_config` flow, while remaining much smaller than MAAS or Ironic.

## Recommended Lifecycle State Model

Adopt a visible state machine. At minimum:

- `discovered`
- `inspected`
- `commissioned`
- `install-pending`
- `installing`
- `installed`
- `active`
- `draining`
- `reprovisioning`
- `rescue`
- `failed`

Keep these orthogonal to:

- power state
- host rollout state
- runtime service health

This separation is important. MAAS and Ironic both benefit from not collapsing every concern into one state field.

## Concrete Repo Changes Recommended

### Phase A: schema simplification

1. Promote `nix-nos/lib/cluster-config-lib.nix` into `nix/lib/cluster-schema.nix`.
2. Remove duplicated schema logic from `nix-nos/modules/topology.nix`.
3. Keep `ultracloud-cluster.nix` as an exporter/generator module, not a second schema definition.

### Phase B: bootstrap contract simplification

1. Deprecate `NodeConfig` as the primary bootstrap payload.
2. Introduce separate Rust types for:
   - assignment
   - bootstrap secrets
   - install plan
3. Keep the `phone_home` endpoint if desired, but split the implementation internally into separate phases/functions.

### Phase C: installer simplification

1. Remove node-specific fallback logic from `nix/iso/ultracloud-iso.nix`.
2. Require a resolved install profile or disk policy in the returned install plan.
3. Resolve disk targets using stable hints or explicit by-id paths.

### Phase D: controller clarification

1. Make the host rollout controller a named subsystem.
2. Document `nix-agent` as host OS reconcile only.
3. Document `fleet-scheduler` and `node-agent` as runtime-only.

### Phase E: operator UX

1. Add an inventory/commission view to `deployer-ctl`.
2. Make lifecycle transitions explicit.
3. Add reinstall/rescue flows that work even without BMC.

## What Not To Build Yet

Do not start with:

- a full MAAS clone
- full Ironic feature parity
- mandatory PXE
- mandatory BMC
- scheduler-driven bootstrap for all control-plane services

For the current environment, that would add complexity faster than value.

## Smallest Useful End State For The 6-PC Lab

The smallest useful design is:

- one generic ISO
- hardware discovery
- rule-based assignment to class/pool/profile
- explicit install plan
- stable disk policy
- first-boot `nix-agent`
- host rollout separate from runtime service scheduling

That gives you a MaaS-like system for real hardware without forcing MAAS-scale complexity into the repo.

## Immediate Next Design Tasks

1. Write `nix/lib/cluster-schema.nix` by extracting and renaming the existing cluster library.
2. Redesign the Rust bootstrap payloads around `NodeAssignment`, `BootstrapSecrets`, and `InstallPlan`.
3. Update the ISO to consume only the new install-plan contract.
4. Write a short architecture doc that shows the four control loops:
   - discovery/enrollment
   - installation
   - host rollout
   - runtime scheduling
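As a starting point for the payload redesign, a hedged sketch of the split types follows. Every field name here is an assumption to be refined against the existing `deployer-types` code; only the type names (`NodeAssignment`, `BootstrapSecrets`, `InstallPlan`) come from this document.

```rust
// Sketch of the split bootstrap payloads. Field names are assumptions.
// Each type is small, single-purpose, and owned by one deployer phase.
struct NodeAssignment {
    node_id: String,
    hostname: String,
    class: String,
    pool: String,
}

struct BootstrapSecrets {
    ssh_host_key: String,
    tls_cert_pem: String,
    install_token: String,
}

struct InstallPlan {
    assignment_node_id: String, // reference, not a copy of NodeAssignment
    install_profile: String,
    system_reference: String,
    root_device_hint: String,
    bundle_url: String,
}

// A bootstrap response carries exactly these three objects and nothing
// runtime-related; `ServiceSpec` never appears here.
struct BootstrapResponse {
    assignment: NodeAssignment,
    secrets: BootstrapSecrets,
    plan: InstallPlan,
}
```

Because the install plan references the assignment instead of embedding it, reinstalling a node can reuse the same `NodeAssignment` while the plan and secrets are re-issued fresh.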