photoncloud-monorepo/plans/nix-nos-simplification-2026-04-04.md
2026-04-04 16:33:03 +09:00

Nix-NOS Simplification Plan (2026-04-04)

Summary

nix-nos should not remain a second cluster authoring surface.

Status update:

  • ultracloud.cluster is now the only in-repo cluster authoring path
  • services.first-boot-automation no longer has a useNixNOS mode
  • root flake.nix no longer imports nix-nos
  • topology-specific nix-nos files have been removed
  • the remaining nix-nos tree is only network/BGP/routing primitives

The right plan is:

  • keep ultracloud.cluster as the only cluster source of truth
  • keep nix-nos only as a compatibility facade for older topology-driven flows
  • eventually shrink nix-nos down to network primitives, or remove it entirely if those primitives are moved into the main Nix module tree

Current State

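When this plan was drafted, the repo was already halfway through this transition. The items below describe that starting point; the status update in the Summary records what has changed since.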

  • nix/lib/cluster-schema.nix is the actual schema/helper library
  • nix/modules/ultracloud-cluster.nix generates:
    • per-node cluster-config.json
    • nix-nos.clusters
    • deployer cluster state
  • nix-nos/modules/topology.nix no longer owns its own schema logic; it delegates to cluster-schema.nix
  • services.first-boot-automation still has a useNixNOS path and still treats nix-nos.generateClusterConfig as a real config source

The duplication is smaller than before, but the user-facing model remains confusing because there appear to be two ways to describe a cluster.

Recommendation

The recommended target is:

  1. ultracloud.cluster is the only supported cluster authoring API.
  2. nix-nos is explicitly legacy-compatibility only for topology consumers that have not been migrated yet.
  3. nix-nos should stop presenting itself as a general cluster definition layer.
  4. first-boot-automation should stop depending on nix-nos as a primary provider.

This keeps the repo simpler without forcing a big-bang removal.

What Nix-NOS Should Still Own

Only keep the parts that are actually distinct:

  • interface/VLAN primitives
  • BGP primitives
  • static routing primitives
  • any truly reusable NOS-style networking submodules

These are valid low-level modules.

What nix-nos should not own anymore:

  • whole-cluster source of truth
  • bootstrap node selection rules
  • cluster-config generation semantics
  • host inventory / deployer state generation

Those belong in ultracloud.cluster and cluster-schema.nix.

Target Shape

Primary path

  • user writes ultracloud.cluster
  • cluster-schema.nix derives:
    • node cluster config
    • deployer cluster state
    • compatibility topology objects if needed
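
To make the primary path concrete, here is a minimal sketch of deriving both views from one definition. The cluster shape (name, nodes, ip, bootstrap) and the helper name mkNodeClusterConfig are assumptions made for illustration; the real schema lives in nix/lib/cluster-schema.nix and may look quite different.

```nix
let
  # Hypothetical, heavily simplified cluster definition.
  # The field names here are illustrative, not the real ultracloud.cluster schema.
  cluster = {
    name = "demo";
    nodes = {
      node01 = { ip = "10.0.0.11"; bootstrap = true; };
      node02 = { ip = "10.0.0.12"; bootstrap = false; };
    };
  };

  nodeNames = builtins.attrNames cluster.nodes;

  # Per-node view: what a single node needs to know about the cluster.
  mkNodeClusterConfig = name: node: {
    clusterName = cluster.name;
    nodeName = name;
    inherit (node) ip bootstrap;
    # Every other node in the cluster is a peer.
    peers = builtins.filter (n: n != name) nodeNames;
  };

  # Whole-cluster view for the deployer: host name to address.
  deployerState = {
    clusterName = cluster.name;
    hosts = builtins.mapAttrs (_: node: node.ip) cluster.nodes;
  };
in
{
  nodeClusterConfig = builtins.mapAttrs mkNodeClusterConfig cluster.nodes;
  inherit deployerState;
}
```

Because both outputs are pure functions of the same attribute set, there is no second place where cluster facts can drift out of sync, which is the point of having a single source of truth.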

Compatibility path

  • nix-nos may still expose clusters and generateClusterConfig
  • but they are documented as legacy compatibility only and emit evaluation warnings when used
  • ideally they become thin read-only views over cluster-schema.nix, not an authoring API
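
One way a "thin read-only view" could look in the NixOS module system is an option declared with readOnly = true and populated from the generated config. The option path nix-nos.compat.nodeClusterConfig is hypothetical, and the sketch assumes nix/modules/ultracloud-cluster.nix is imported so that ultracloud.cluster.generated.nodeClusterConfig exists.

```nix
{ config, lib, ... }:
{
  # Hypothetical compatibility option; the name is an assumption.
  options.nix-nos.compat.nodeClusterConfig = lib.mkOption {
    type = lib.types.anything;
    # readOnly means the option may be defined only once, here,
    # so user configurations cannot author through it.
    readOnly = true;
    description = "Legacy read-only view of the UltraCloud-generated node cluster config.";
  };

  # The single definition: forward the generated value unchanged.
  config.nix-nos.compat.nodeClusterConfig =
    config.ultracloud.cluster.generated.nodeClusterConfig;
}
```

With this shape, any attempt to set the option elsewhere fails at evaluation time, so the compatibility surface cannot quietly turn back into an authoring API.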

First boot

services.first-boot-automation should eventually have only these modes:

  • use generated UltraCloud cluster config
  • use an explicit file path

It should not need a separate useNixNOS mode long-term.
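
As a rough sketch of the two-mode end state, the module could expose a single enum option plus an optional file path. The option names clusterConfigSource and clusterConfigFile are assumptions for illustration and are not taken from the existing module.

```nix
{ config, lib, ... }:
let
  cfg = config.services.first-boot-automation;
in
{
  options.services.first-boot-automation = {
    # Hypothetical option: which source first boot reads its cluster config from.
    clusterConfigSource = lib.mkOption {
      type = lib.types.enum [ "generated" "file" ];
      default = "generated";
      description = "Use the generated UltraCloud cluster config or an explicit file.";
    };

    # Hypothetical option: only consulted when clusterConfigSource is "file".
    clusterConfigFile = lib.mkOption {
      type = lib.types.nullOr lib.types.path;
      default = null;
      description = "Path to a cluster config file.";
    };
  };

  config.assertions = [
    {
      # Selecting "file" without providing a path is a configuration error.
      assertion = cfg.clusterConfigSource != "file" || cfg.clusterConfigFile != null;
      message = "services.first-boot-automation: clusterConfigSource = \"file\" requires clusterConfigFile.";
    }
  ];
}
```

An enum keeps the set of modes closed, so no third mode can be reintroduced without an explicit schema change.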

Migration Plan

Phase 1: Freeze

  • do not add new functionality to nix-nos.clusters
  • mark nix-nos topology usage as legacy in warnings/docs
  • keep all schema changes in cluster-schema.nix
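
The "mark as legacy" step can be sketched with the standard NixOS warnings option. The option declaration below exists only to keep the example self-contained; the real declaration lives in the nix-nos tree.

```nix
{ config, lib, ... }:
let
  cfg = config.nix-nos;
in
{
  # Declared here only for the sketch; the type is an assumption.
  options.nix-nos.clusters = lib.mkOption {
    type = lib.types.attrsOf lib.types.anything;
    default = { };
    description = "Legacy topology definitions (compatibility only).";
  };

  # Emit an evaluation warning whenever any legacy cluster is defined.
  config.warnings = lib.optional (cfg.clusters != { }) ''
    nix-nos.clusters is legacy compatibility only.
    Author clusters through ultracloud.cluster instead.
  '';
}
```

One caveat: while ultracloud-cluster.nix itself still generates nix-nos.clusters, a check this simple would also fire for generated values, so a real implementation would need some way to tell hand-authored definitions from generated ones.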

Phase 2: Move first-boot off Nix-NOS

  • make services.first-boot-automation prefer ultracloud.cluster.generated.nodeClusterConfig
  • keep nix-nos only as fallback/compat, not as the preferred path
  • stop using useNixNOS in normal tests/configurations
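
The preference order in this phase can be illustrated with a small pure function. The function name selectClusterConfig and its argument names are hypothetical; the sketch shows only the priority logic, not how the module would wire it in.

```nix
let
  # Pick a cluster config source: generated UltraCloud config first,
  # legacy nix-nos output only as a fallback.
  selectClusterConfig =
    { generated ? null, legacyNixNos ? null }:
    if generated != null then
      { source = "ultracloud.cluster"; value = generated; }
    else if legacyNixNos != null then
      { source = "nix-nos (legacy)"; value = legacyNixNos; }
    else
      throw "no cluster config source available";
in
# Both sources are present here, so the generated one wins.
selectClusterConfig {
  generated = { cluster = "demo"; };
  legacyNixNos = { cluster = "old"; };
}
```

Evaluating this expression yields the attribute set with source = "ultracloud.cluster", since the legacy value is consulted only when no generated config is available.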

Phase 3: Remove topology authoring role

  • deprecate direct authoring of nix-nos.clusters
  • remove nix/modules/nix-nos/cluster-config-generator.nix
  • collapse any remaining direct topology generation onto cluster-schema.nix
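
Once nothing in the repo sets nix-nos.clusters any more (including the generator in ultracloud-cluster.nix), the end of this phase could be expressed with lib.mkRemovedOptionModule from nixpkgs. This is a sketch of one possible end state, not a description of what the repo does today.

```nix
{ lib, ... }:
{
  imports = [
    # Any remaining definition of nix-nos.clusters now fails evaluation
    # with the message below instead of being silently ignored.
    (lib.mkRemovedOptionModule [ "nix-nos" "clusters" ] ''
      Direct topology authoring has been removed.
      Define the cluster through ultracloud.cluster instead.
    '')
  ];
}
```

Using a removed-option stub rather than simply deleting the option gives downstream configurations a clear migration message in place of a generic "option does not exist" error.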

Phase 4: Decide final fate

Choose one:

  • keep nix-nos as a small network-primitives library
  • or move those network primitives under nix/modules/network/* and delete nix-nos

The first option is lower risk. The second is cleaner.

Recommended decision:

  • short term: keep nix-nos, but only as a compatibility/network-primitives layer
  • medium term: remove nix-nos as a cluster authoring concept
  • long term: either rename/rehome the remaining network modules, or delete nix-nos if nothing substantial remains

Immediate Next Steps

  1. Mark nix-nos.clusters and services.first-boot-automation.useNixNOS as legacy in evaluation warnings.
  2. Reduce test usage so only one compatibility smoke test still exercises direct nix-nos authoring.
  3. Change docs/examples to author clusters through ultracloud.cluster only.
  4. After that, remove the standalone cluster-config-generator.nix path.