photoncloud-monorepo/plans/baremetal-maas-simplification-2026-04-04.md

Bare Metal / MaaS-like Simplification Plan (2026-04-04)

Summary

UltraCloud already has many of the right building blocks:

  • Nix modules and flake outputs for host configuration
  • deployer for bootstrap, enrollment, and inventory
  • nix-agent for host OS reconciliation
  • fleet-scheduler and node-agent for native service placement/runtime

The problem is not that the pieces are missing. The problem is that the boundaries between them are still muddy:

  • source of truth is duplicated
  • install-time and runtime configuration are mixed together
  • registration, inventory, credential issuance, and install-plan rendering are coupled
  • bootstrap and scheduling are conceptually separate but still feel entangled in the repo

This document proposes a simpler target architecture for bare metal and MaaS-like provisioning, based on both the current repo and patterns used by existing systems.

What Existing Systems Consistently Separate

MAAS

Useful pattern:

  • machine lifecycle is explicit
  • commissioning, testing, deployment, release, rescue, and broken states are operator-visible
  • registration/inventory is not the same thing as workload placement

Ironic and Metal3

Useful pattern:

  • enrollment, manageable, available, deploy, clean, rescue are explicit provisioning states
  • inspection and cleaning are first-class lifecycle steps
  • root device selection is modeled explicitly instead of relying on /dev/sdX

Tinkerbell

Useful pattern:

  • hardware inventory, workflow/template, and install worker are separate concepts
  • the installer environment is generic
  • the workflow engine is distinct from hardware registration

Talos and Omni

Useful pattern:

  • a minimal boot medium is used only to join management
  • machine classes and labels drive config selection
  • machine configuration is acquired over an API instead of being hard-coded into per-node install media

NixOS deployment tools

Useful pattern:

  • installation and host rollout are separate concerns
  • unattended install should be repeatable from declarative config
  • activation needs timeout, health gate, and rollback semantics

Current UltraCloud Pain Points

1. Source of truth is duplicated

Today the repo has overlapping schema and generation paths:

  • ultracloud.cluster generates per-node cluster config, nix-nos topology, and deployer cluster state
  • nix-nos still has its own cluster schema and generateClusterConfig
  • deployer-types::ClusterStateSpec is another whole-cluster model on the Rust side

This makes it too easy to author the same concept twice.

Current references:

  • nix/modules/ultracloud-cluster.nix
  • nix-nos/modules/topology.nix
  • nix-nos/lib/cluster-config-lib.nix
  • deployer/crates/deployer-types/src/lib.rs

2. Install-time and runtime configuration are mixed

NodeConfig currently contains all of the following:

  • hostname and IP
  • labels, pool, node class
  • services
  • Nix profile
  • install plan

That is too much for a single object. A bootstrap/install contract should not be the same object as runtime scheduling hints.

Current references:

  • deployer/crates/deployer-types/src/lib.rs
  • deployer/crates/deployer-server/src/phone_home.rs

3. phone_home is carrying too much responsibility

The current flow combines:

  • machine identity lookup
  • enrollment-rule matching
  • node assignment
  • inventory summarization
  • cluster node record persistence
  • SSH/TLS issuance
  • install-plan return

This works, but it is difficult to reason about and difficult to evolve.

4. ISO bootstrap is still node-path oriented

The generic ISO still falls back to node-specific paths like:

  • nix/nodes/vm-cluster/$NODE_ID/disko.nix

That prevents profile/class-based provisioning from becoming the main path.

5. Host rollout and runtime scheduling are separated in code but not in the mental model

The repo already has:

  • nix-agent for host OS state
  • host deployment reconciliation for writing desired-system
  • fleet-scheduler for native service placement
  • node-agent for process/container reconcile

These are the right components, but the naming and schema boundaries do not make the split obvious.

Design Goal

The simplest viable target is not "build all of MAAS".

The simplest viable target is:

  1. Nix is the only authoring surface for static cluster intent.
  2. bootstrap deals only with discovery, assignment, credentials, and install plans.
  3. host rollout is a separate controller/agent path.
  4. service scheduling is entirely downstream of host rollout.
  5. BMC and PXE are optional extensions, not required for the base design.

For your current 6-machine, no-BMC environment, this is the right scope.

Proposed Target Model

Layer 1: Static model in Nix

Create a single Nix library as the canonical schema. Do not create a fourth schema; promote the existing cluster-config-lib into the canonical one.

Recommended file:

  • nix/lib/cluster-schema.nix

Practical migration:

  • move or copy nix-nos/lib/cluster-config-lib.nix to nix/lib/cluster-schema.nix
  • make ultracloud-cluster.nix and nix-nos/modules/topology.nix thin wrappers over it
  • stop adding new schema logic anywhere else

This library should define only stable declarative objects:

  • cluster
  • networks
  • install profiles
  • disk policies
  • node classes
  • pools
  • enrollment rules
  • nodes
  • host deployments
  • service policies

From that one schema, generate these artifacts:

  • nixosConfigurations.<node>
  • bootstrap install-plan data
  • deployer cluster-state JSON
  • test-cluster topology

installProfiles

Purpose:

  • reusable OS install targets
  • used during discovery/bootstrap

Fields:

  • flake attribute or system profile reference
  • disk policy reference
  • network policy reference
  • bootstrap package set / image bundle reference

diskPolicies

Purpose:

  • stable root-disk selection
  • avoid hardcoding /dev/sda or node-specific Disko paths

Fields:

  • root device hints
  • partition layout
  • wipe/cleaning policy

Borrowed directly from Ironic/Metal3 thinking: disk choice must be modeled, not guessed.
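A minimal sketch of what hint-based root selection could look like. The type and field names here (`DiskCandidate`, `RootDeviceHints`, `min_size_gib`, and so on) are illustrative assumptions, not the repo's actual schema:

```rust
// Hypothetical sketch: resolve the root disk from declarative hints
// instead of a hard-coded /dev/sdX path. All names are illustrative.

#[derive(Debug, Clone)]
struct DiskCandidate {
    by_id_path: String, // stable /dev/disk/by-id/... path
    size_gib: u64,
    rotational: bool,
}

#[derive(Debug)]
struct RootDeviceHints {
    min_size_gib: u64,
    prefer_non_rotational: bool,
}

/// Pick the smallest disk that satisfies the hints, preferring
/// non-rotational disks when asked. Returns None if nothing matches.
fn select_root_disk(hints: &RootDeviceHints, disks: &[DiskCandidate]) -> Option<DiskCandidate> {
    let mut matches: Vec<&DiskCandidate> = disks
        .iter()
        .filter(|d| d.size_gib >= hints.min_size_gib)
        .collect();
    // Non-rotational disks sort first if requested, then smallest capacity.
    matches.sort_by_key(|d| {
        let rot_penalty = if hints.prefer_non_rotational && d.rotational { 1 } else { 0 };
        (rot_penalty, d.size_gib)
    });
    matches.first().map(|d| (*d).clone())
}

fn main() {
    let disks = vec![
        DiskCandidate { by_id_path: "/dev/disk/by-id/ata-HDD".into(), size_gib: 2000, rotational: true },
        DiskCandidate { by_id_path: "/dev/disk/by-id/nvme-SSD".into(), size_gib: 500, rotational: false },
    ];
    let hints = RootDeviceHints { min_size_gib: 256, prefer_non_rotational: true };
    let chosen = select_root_disk(&hints, &disks).unwrap();
    println!("{}", chosen.by_id_path); // the NVMe disk wins
}
```

The key property is that the selector emits a stable by-id path, which is what the rendered install plan should carry.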

nodeClasses

Purpose:

  • describe intended hardware/software role

Fields:

  • install profile
  • default labels
  • runtime capabilities
  • minimum hardware traits

enrollmentRules

Purpose:

  • match discovered machines to class/pool/labels

Fields:

  • selectors on machine-id, MAC, DMI, disk traits, NIC traits
  • assigned node class
  • assigned pool
  • optional hostname/node-id policy
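To make the matching semantics concrete, here is a small first-match-wins sketch. The selector fields (`mac_prefix`, `dmi_product`) are a deliberately reduced subset of the selectors listed above, and every name is hypothetical:

```rust
// Hypothetical sketch of enrollment-rule matching: selectors on
// discovered machine facts resolve to a class and pool.

struct MachineFacts {
    machine_id: String,
    macs: Vec<String>,
    dmi_product: String,
}

struct EnrollmentRule {
    mac_prefix: Option<String>,  // None means "match any MAC"
    dmi_product: Option<String>, // None means "match any product"
    node_class: String,
    pool: String,
}

/// First matching rule wins; a trailing catch-all rule gives every
/// unmatched machine a default class/pool.
fn match_rule<'a>(facts: &MachineFacts, rules: &'a [EnrollmentRule]) -> Option<&'a EnrollmentRule> {
    rules.iter().find(|r| {
        let mac_ok = r.mac_prefix.as_ref()
            .map_or(true, |p| facts.macs.iter().any(|m| m.starts_with(p.as_str())));
        let dmi_ok = r.dmi_product.as_ref()
            .map_or(true, |d| facts.dmi_product == *d);
        mac_ok && dmi_ok
    })
}

fn main() {
    let rules = vec![
        EnrollmentRule {
            mac_prefix: Some("aa:bb".into()),
            dmi_product: None,
            node_class: "gateway".into(),
            pool: "edge".into(),
        },
        EnrollmentRule { // catch-all
            mac_prefix: None,
            dmi_product: None,
            node_class: "worker".into(),
            pool: "default".into(),
        },
    ];
    let facts = MachineFacts {
        machine_id: "m-01".into(),
        macs: vec!["aa:bb:cc:dd:ee:ff".into()],
        dmi_product: "ExampleBox".into(),
    };
    let rule = match_rule(&facts, &rules).unwrap();
    println!("{}/{}", rule.node_class, rule.pool); // gateway/edge
}
```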

nodes

Purpose:

  • explicit identity for fixed nodes when you want them

Use this for:

  • control plane seeds
  • gateways
  • special hardware

Do not require this for every worker in the generic path.

hostDeployments

Purpose:

  • rollout desired host OS state to already-installed machines

This is not bootstrap.

servicePolicies

Purpose:

  • runtime placement intent for fleet-scheduler

This is not host provisioning.

Proposed Rust/API Object Split

Replace the current "fat" NodeConfig mental model with explicit smaller objects.

MachineInventory

Owned by:

  • bootstrap discovery

Contains:

  • machine identity
  • hardware facts
  • last inventory hash
  • boot method support
  • optional power capability metadata

NodeAssignment

Owned by:

  • deployer enrollment logic

Contains:

  • stable node_id
  • hostname
  • class
  • pool
  • labels
  • failure domain

BootstrapSecrets

Owned by:

  • deployer credential issuer

Contains:

  • SSH host key
  • TLS cert/key
  • bootstrap token or short-lived install token

InstallPlan

Owned by:

  • deployer plan renderer

Contains:

  • node assignment reference
  • install profile reference
  • resolved flake attr or system reference
  • resolved disk policy or root-device selection
  • network bootstrap data
  • image/bundle URL

DesiredSystem

Owned by:

  • host rollout controller

Contains:

  • target system
  • activation strategy
  • health check
  • rollback policy

ServiceSpec

Owned by:

  • runtime scheduler

Contains:

  • service placement and instance policy only

It should not be returned by bootstrap APIs.
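The object split above can be sketched as plain Rust types. These are not the actual `deployer-types` definitions; the field names are illustrative stand-ins for the fields listed in each object description:

```rust
// Illustrative sketch (not the repo's real definitions) of replacing
// the fat NodeConfig with the smaller bootstrap-facing objects.

struct NodeAssignment {
    node_id: String,
    hostname: String,
    class: String,
    pool: String,
    labels: Vec<(String, String)>,
    failure_domain: Option<String>,
}

struct BootstrapSecrets {
    ssh_host_key: String,
    tls_cert_pem: String,
    tls_key_pem: String,
    install_token: String, // short-lived
}

struct InstallPlan {
    assignment_node_id: String, // reference back to the NodeAssignment
    install_profile: String,
    system_flake_attr: String,  // resolved flake attr
    root_device: String,        // resolved by disk policy, e.g. a by-id path
    bundle_url: String,
}

/// The bootstrap response carries exactly these three objects and
/// nothing about runtime services (no ServiceSpec, no DesiredSystem).
struct BootstrapResponse {
    assignment: NodeAssignment,
    secrets: BootstrapSecrets,
    plan: InstallPlan,
}

fn main() {
    let resp = BootstrapResponse {
        assignment: NodeAssignment {
            node_id: "node-a1".into(),
            hostname: "node-a1".into(),
            class: "worker".into(),
            pool: "default".into(),
            labels: vec![("zone".into(), "rack1".into())],
            failure_domain: Some("rack1".into()),
        },
        secrets: BootstrapSecrets {
            ssh_host_key: "ssh-ed25519 ...".into(),
            tls_cert_pem: "-----BEGIN CERTIFICATE-----".into(),
            tls_key_pem: "-----BEGIN PRIVATE KEY-----".into(),
            install_token: "short-lived-token".into(),
        },
        plan: InstallPlan {
            assignment_node_id: "node-a1".into(),
            install_profile: "worker".into(),
            system_flake_attr: ".#nixosConfigurations.node-a1".into(),
            root_device: "/dev/disk/by-id/nvme-example".into(),
            bundle_url: "https://deployer.example/bundles/worker".into(),
        },
    };
    println!("{} -> {}", resp.assignment.node_id, resp.plan.install_profile);
}
```

MachineInventory, DesiredSystem, and ServiceSpec stay out of this response type on purpose: inventory is server-side state, and the latter two belong to the rollout and scheduling paths.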

Component Responsibilities

1. Deployer server

Keep responsibility limited to:

  • discovery
  • enrollment / assignment
  • inventory storage
  • credential issuance
  • install-plan rendering

Do not make it the host rollout engine and do not make it the runtime scheduler.

2. Host deployment controller

Make this an explicit first-class component. Today that logic exists in ultracloud-reconciler hosts.

Responsibility:

  • watch HostDeployment
  • select nodes
  • write desired-system
  • respect rollout budget and drain policy

Recommendation:

  • rename it conceptually to host-controller
  • keep it separate from fleet-scheduler

3. nix-agent

This should borrow deploy-rs style semantics:

  • activation timeout
  • confirmation/health gate
  • rollback on failure
  • staged reboot handling
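The activation semantics can be sketched as one function: activate, wait for a health confirmation within a timeout, and roll back otherwise. The closures stand in for real activation and health-check hooks; this is a shape sketch, not nix-agent's actual implementation:

```rust
// Sketch of deploy-rs style activation: a health gate with a timeout,
// and rollback on failure or missed confirmation. Names are illustrative.

use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum ActivationResult {
    Confirmed,
    RolledBack,
}

fn activate_with_gate(
    activate: impl Fn() -> bool,
    healthy: impl Fn() -> bool,
    rollback: impl Fn(),
    timeout: Duration,
) -> ActivationResult {
    if !activate() {
        rollback();
        return ActivationResult::RolledBack;
    }
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if healthy() {
            return ActivationResult::Confirmed; // health gate passed
        }
        std::thread::sleep(Duration::from_millis(10));
    }
    rollback(); // no confirmation in time: revert to the previous system
    ActivationResult::RolledBack
}

fn main() {
    let result = activate_with_gate(|| true, || true, || (), Duration::from_millis(50));
    println!("{:?}", result); // Confirmed
}
```

Staged reboot handling layers on top of this: the same gate applies after the reboot, with the previous generation as the rollback target.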

4. fleet-scheduler

Responsibility:

  • service placement only

Do not allow bootstrap/install concerns to leak here.

Installer Protocol

Keep one generic installer image, but make the protocol explicit.

Step 1: discover

Installer boots and sends:

  • machine identity
  • hardware facts
  • observed network facts

Step 2: assign

Deployer resolves:

  • class
  • pool
  • hostname/node-id
  • install profile

Step 3: fetch plan

Installer receives:

  • NodeAssignment
  • BootstrapSecrets
  • InstallPlan

Step 4: install

Installer:

  • fetches source bundle
  • resolves disk policy
  • runs Disko
  • installs NixOS
  • reports status

Step 5: first boot

Installed system starts:

  • core static services
  • nix-agent
  • runtime agent only if needed for that class

This is closer to Tinkerbell and Talos than to the current monolithic node_config flow, while remaining much smaller than MAAS or Ironic.
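Installer-side, steps 1 through 4 reduce to a short sequence. The function names and payload shapes below are hypothetical stand-ins for the real HTTP round trips:

```rust
// Hypothetical sketch of the installer-side sequence. Each function
// body is a stub where a real HTTP call or install action would go.

struct Facts { machine_id: String }
struct Plan { root_device: String, flake_attr: String }

// Step 1: collect identity and hardware facts.
fn discover() -> Facts {
    Facts { machine_id: "m-01".into() }
}

// Steps 2-3: the deployer resolves class/pool/profile and returns
// the assignment, secrets, and install plan (collapsed here to Plan).
fn fetch_plan(f: &Facts) -> Plan {
    Plan {
        root_device: format!("/dev/disk/by-id/virtual-{}", f.machine_id),
        flake_attr: "worker".into(),
    }
}

// Step 4: run Disko against the resolved root device, install the
// system from the resolved flake attr, report status.
fn install(p: &Plan) -> bool {
    !p.root_device.is_empty() && !p.flake_attr.is_empty()
}

fn main() {
    let facts = discover();
    let plan = fetch_plan(&facts);
    let ok = install(&plan);
    println!("install ok: {ok}");
}
```

The point of the sketch is the shape: the installer never decides its own identity, disk, or system; it only executes a plan it was handed.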

Machine Lifecycle States

Adopt a visible state machine. At minimum:

  • discovered
  • inspected
  • commissioned
  • install-pending
  • installing
  • installed
  • active
  • draining
  • reprovisioning
  • rescue
  • failed

Keep these orthogonal to:

  • power state
  • host rollout state
  • runtime service health

This separation is important. MAAS and Ironic both benefit from not collapsing every concern into one state field.
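A lifecycle like this is easiest to keep honest with an explicit transition table. The transitions below are an illustrative subset, not a finalized design; notably, rescue is reachable from any state here because there is no BMC to force a machine back:

```rust
// Sketch: lifecycle states with an explicit, checkable transition table.
// Power state, rollout state, and service health live elsewhere.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MachineState {
    Discovered, Inspected, Commissioned, InstallPending, Installing,
    Installed, Active, Draining, Reprovisioning, Rescue, Failed,
}

use MachineState::*;

/// Returns true if the transition is allowed by this (illustrative)
/// lifecycle model.
fn can_transition(from: MachineState, to: MachineState) -> bool {
    matches!(
        (from, to),
        (Discovered, Inspected)
            | (Inspected, Commissioned)
            | (Commissioned, InstallPending)
            | (InstallPending, Installing)
            | (Installing, Installed)
            | (Installing, Failed)
            | (Installed, Active)
            | (Active, Draining)
            | (Draining, Reprovisioning)
            | (Reprovisioning, InstallPending)
            | (_, Rescue)           // rescue is reachable from anywhere
            | (Rescue, InstallPending)
    )
}

fn main() {
    println!("{}", can_transition(Discovered, Inspected)); // true
    println!("{}", can_transition(Discovered, Active));    // false
}
```

Keeping the table in one function (rather than scattering checks across handlers) is what makes the lifecycle operator-visible and easy to extend.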

Migration Phases

Phase A: schema simplification

  1. Promote nix-nos/lib/cluster-config-lib.nix into nix/lib/cluster-schema.nix.
  2. Remove duplicated schema logic from nix-nos/modules/topology.nix.
  3. Keep ultracloud-cluster.nix as an exporter/generator module, not a second schema definition.

Phase B: bootstrap contract simplification

  1. Deprecate NodeConfig as the primary bootstrap payload.
  2. Introduce separate Rust types for:
    • assignment
    • bootstrap secrets
    • install plan
  3. Keep phone_home endpoint if desired, but split the implementation internally into separate phases/functions.
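The internal split in step 3 could look like the sketch below: one endpoint, four phases, each independently testable. Every name here is hypothetical; the real types would come from the redesigned deployer-types:

```rust
// Sketch: phone_home kept as one endpoint but composed from separate
// phases. All types and function names are illustrative stubs.

struct DiscoveryReport { machine_id: String }
struct Inventory { machine_id: String }
struct Assignment { node_id: String }
struct Secrets { token: String }
struct Plan { profile: String }
struct BootstrapResponse { assignment: Assignment, secrets: Secrets, plan: Plan }
type Error = String;

fn record_inventory(r: &DiscoveryReport) -> Result<Inventory, Error> {
    // Phase 1: persist identity + hardware facts.
    Ok(Inventory { machine_id: r.machine_id.clone() })
}
fn resolve_assignment(i: &Inventory) -> Result<Assignment, Error> {
    // Phase 2: run enrollment rules against the inventory.
    Ok(Assignment { node_id: format!("node-{}", i.machine_id) })
}
fn issue_credentials(_a: &Assignment) -> Result<Secrets, Error> {
    // Phase 3: mint SSH/TLS material and a short-lived install token.
    Ok(Secrets { token: "short-lived".into() })
}
fn render_install_plan(_a: &Assignment) -> Result<Plan, Error> {
    // Phase 4: resolve install profile and disk policy into a plan.
    Ok(Plan { profile: "worker".into() })
}

fn phone_home(req: DiscoveryReport) -> Result<BootstrapResponse, Error> {
    let inventory = record_inventory(&req)?;
    let assignment = resolve_assignment(&inventory)?;
    let secrets = issue_credentials(&assignment)?;
    let plan = render_install_plan(&assignment)?;
    Ok(BootstrapResponse { assignment, secrets, plan })
}

fn main() {
    let resp = phone_home(DiscoveryReport { machine_id: "abc".into() }).unwrap();
    println!("{}", resp.assignment.node_id);
}
```

The endpoint's wire contract stays stable while each phase can be reworked (or moved behind its own API) without touching the others.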

Phase C: installer simplification

  1. Remove node-specific fallback logic from nix/iso/ultracloud-iso.nix.
  2. Require a resolved install profile or disk policy in the returned install plan.
  3. Resolve disk targets using stable hints or explicit by-id paths.

Phase D: controller clarification

  1. Make the host rollout controller a named subsystem.
  2. Document nix-agent as host OS reconcile only.
  3. Document fleet-scheduler and node-agent as runtime-only.

Phase E: operator UX

  1. Add an inventory/commission view to deployer-ctl.
  2. Make lifecycle transitions explicit.
  3. Add reinstall/rescue flows that work even without BMC.

What Not To Build Yet

Do not start with:

  • a full MAAS clone
  • full Ironic feature parity
  • mandatory PXE
  • mandatory BMC
  • scheduler-driven bootstrap for all control-plane services

For the current environment, that would add complexity faster than value.

Smallest Useful End State For The 6-PC Lab

The smallest useful design is:

  • one generic ISO
  • hardware discovery
  • rule-based assignment to class/pool/profile
  • explicit install plan
  • stable disk policy
  • first-boot nix-agent
  • host rollout separate from runtime service scheduling

That gives you a MaaS-like system for real hardware without forcing MAAS-scale complexity into the repo.

Immediate Next Design Tasks

  1. Write nix/lib/cluster-schema.nix by extracting and renaming the existing cluster library.
  2. Redesign the Rust bootstrap payloads around NodeAssignment, BootstrapSecrets, and InstallPlan.
  3. Update the ISO to consume only the new install-plan contract.
  4. Write a short architecture doc that shows the four control loops:
    • discovery/enrollment
    • installation
    • host rollout
    • runtime scheduling