photoncloud-monorepo/plans/nix-nos-simplification-2026-04-04.md
2026-04-04 16:33:03 +09:00

# Nix-NOS Simplification Plan (2026-04-04)
## Summary
`nix-nos` should not remain a second cluster authoring surface.
Status update:
- `ultracloud.cluster` is now the only in-repo cluster authoring path
- `services.first-boot-automation` no longer has a `useNixNOS` mode
- root `flake.nix` no longer imports `nix-nos`
- topology-specific `nix-nos` files have been removed
- the remaining `nix-nos` tree is only network/BGP/routing primitives
The right plan is:
- keep `ultracloud.cluster` as the only cluster source of truth
- keep `nix-nos` only as a compatibility facade for older topology-driven flows
- eventually shrink `nix-nos` down to network primitives, or remove it entirely if those primitives are moved into the main Nix module tree
## Current State
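When this plan was drafted, the repo was already halfway through this transition. Where the points below differ from the status update in the Summary, the status update is the newer information.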
- `nix/lib/cluster-schema.nix` is the actual schema/helper library
- `nix/modules/ultracloud-cluster.nix` generates:
  - per-node `cluster-config.json`
  - `nix-nos.clusters`
  - deployer cluster state
- `nix-nos/modules/topology.nix` no longer owns its own schema logic; it delegates to `cluster-schema.nix`
- `services.first-boot-automation` still has a `useNixNOS` path and still treats `nix-nos.generateClusterConfig` as a real config source
The duplication is smaller than before, but the user-facing model remains confusing because there appear to be two ways to describe a cluster.
## Recommendation
The recommended target is:
1. `ultracloud.cluster` is the only supported cluster authoring API.
2. `nix-nos` is explicitly legacy-compatibility only for topology consumers that have not been migrated yet.
3. `nix-nos` should stop presenting itself as a general cluster definition layer.
4. `first-boot-automation` should stop depending on `nix-nos` as a primary provider.
This keeps the repo simpler without forcing a big-bang removal.
## What Nix-NOS Should Still Own
Only keep the parts that are actually distinct:
- interface/VLAN primitives
- BGP primitives
- static routing primitives
- any truly reusable NOS-style networking submodules
These are valid low-level modules.
What `nix-nos` should not own anymore:
- whole-cluster source of truth
- bootstrap node selection rules
- cluster-config generation semantics
- host inventory / deployer state generation
Those belong in `ultracloud.cluster` and `cluster-schema.nix`.
## Target Shape
### Primary path
- user writes `ultracloud.cluster`
- `cluster-schema.nix` derives:
  - node cluster config
  - deployer cluster state
  - compatibility topology objects if needed
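
To make the primary path concrete, here is a minimal sketch of what authoring through `ultracloud.cluster` might look like. The attribute names under `ultracloud.cluster` (`name`, `nodes`, `role`, `address`) are hypothetical placeholders and are not taken from the actual schema, which lives in `nix/lib/cluster-schema.nix`.

```nix
# Hypothetical host configuration. The field names below are illustrative
# only; consult nix/lib/cluster-schema.nix for the real schema.
{ ... }:
{
  ultracloud.cluster = {
    name = "example";
    nodes = {
      node01 = { role = "control-plane"; address = "10.0.0.11"; };
      node02 = { role = "worker"; address = "10.0.0.12"; };
    };
  };
}
```

The point of the sketch is that the user writes one attribute set, and everything else (node config, deployer state, compatibility objects) is derived from it.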
### Compatibility path
- `nix-nos` may still expose `clusters` and `generateClusterConfig`
- but they are documented as legacy compatibility only and flagged as such in warnings
- ideally they become thin read-only views over `cluster-schema.nix`, not an authoring API
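
One way the "thin read-only view" could be expressed is with a `readOnly` option whose single definition is derived from the schema library. This is a sketch under assumptions: `toLegacyTopology` is a hypothetical helper name, and both the relative import path and the way `cluster-schema.nix` is called are guesses. It would replace the existing declaration of `nix-nos.clusters`, not sit alongside it.

```nix
# Sketch only: `toLegacyTopology` is a hypothetical helper, and the import
# path and call convention for cluster-schema.nix are assumed.
{ config, lib, ... }:
let
  clusterSchema = import ../../nix/lib/cluster-schema.nix { inherit lib; };
in
{
  options.nix-nos.clusters = lib.mkOption {
    type = lib.types.attrsOf lib.types.anything;
    # readOnly allows exactly one definition, so user modules cannot
    # author or override this value.
    readOnly = true;
    description = "Legacy view derived from ultracloud.cluster; do not author directly.";
  };

  # The only definition: a projection of the primary cluster description.
  config.nix-nos.clusters =
    clusterSchema.toLegacyTopology config.ultracloud.cluster;
}
```

With `readOnly = true`, a second definition anywhere else in the module tree fails evaluation, which enforces the "not an authoring API" rule mechanically.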
### First boot
`services.first-boot-automation` should eventually have only these modes:
- use generated UltraCloud cluster config
- use an explicit file path
It should not need a separate `useNixNOS` mode long-term.
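
A possible shape for the two-mode design is sketched below. The option name `clusterConfigPath` and the `/etc/cluster-config.json` target are hypothetical, and the sketch assumes `ultracloud.cluster.generated.nodeClusterConfig` is a plain attribute set that can be serialized to JSON.

```nix
# Sketch only: `clusterConfigPath` and the /etc target are hypothetical.
{ config, lib, pkgs, ... }:
let
  cfg = config.services.first-boot-automation;
  # Render the generated per-node config into the Nix store.
  generated = pkgs.writeText "cluster-config.json"
    (builtins.toJSON config.ultracloud.cluster.generated.nodeClusterConfig);
in
{
  options.services.first-boot-automation.clusterConfigPath = lib.mkOption {
    type = lib.types.nullOr lib.types.path;
    default = null;
    description = "Explicit cluster config file; null selects the generated UltraCloud config.";
  };

  # Mode 1 (default): generated config. Mode 2: explicit file path.
  config.environment.etc."cluster-config.json".source =
    if cfg.clusterConfigPath != null then cfg.clusterConfigPath else generated;
}
```

A single nullable path keeps the two modes mutually exclusive without a separate boolean switch.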
## Migration Plan
### Phase 1: Freeze
- do not add new functionality to `nix-nos.clusters`
- mark `nix-nos` topology usage as legacy in warnings/docs
- keep all schema changes in `cluster-schema.nix`
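
Marking topology usage as legacy could be done with the standard NixOS `warnings` option, as in the sketch below. It assumes `nix-nos.clusters` is an attribute set that defaults to `{ }`. Because `ultracloud-cluster.nix` also generates `nix-nos.clusters`, a real check would need to distinguish generated definitions from hand-written ones; this sketch does not attempt that.

```nix
# Sketch only: assumes nix-nos.clusters defaults to an empty attribute set.
# It warns on ANY non-empty value, including generated ones.
{ config, lib, ... }:
{
  warnings = lib.optional (config.nix-nos.clusters != { })
    "nix-nos.clusters is legacy compatibility only; author clusters through ultracloud.cluster instead.";
}
```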
### Phase 2: Move first-boot off Nix-NOS
- make `services.first-boot-automation` prefer `ultracloud.cluster.generated.nodeClusterConfig`
- keep `nix-nos` only as fallback/compat, not as the preferred path
- stop using `useNixNOS` in normal tests/configurations
### Phase 3: Remove topology authoring role
- deprecate direct authoring of `nix-nos.clusters`
- remove `nix/modules/nix-nos/cluster-config-generator.nix`
- collapse any remaining direct topology generation onto `cluster-schema.nix`
### Phase 4: Decide final fate
Choose one:
- keep `nix-nos` as a small network-primitives library
- or move those network primitives under `nix/modules/network/*` and delete `nix-nos`
The first option is lower risk. The second is cleaner.
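
If the second option is chosen, `lib.mkRenamedOptionModule` can soften the move by forwarding old option paths to new ones with a deprecation warning. Both option paths in this sketch are hypothetical examples; the plan names only the file location `nix/modules/network/*`, not the option namespace.

```nix
# Sketch only: both option paths are hypothetical. The target option must
# already be declared by the rehomed network module.
{ lib, ... }:
{
  imports = [
    (lib.mkRenamedOptionModule
      [ "nix-nos" "bgp" ]
      [ "ultracloud" "network" "bgp" ])
  ];
}
```

Existing configurations that still set the old path would keep evaluating and see a warning that names the new path, which allows `nix-nos` to be deleted in a later step rather than in the same change.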
## Recommended Decision
Recommended decision:
- short term: keep `nix-nos`, but only as a compatibility/network-primitives layer
- medium term: remove `nix-nos` as a cluster authoring concept
- long term: either rename/rehome the remaining network modules, or delete `nix-nos` if nothing substantial remains
## Immediate Next Steps
1. Mark `nix-nos.clusters` and `services.first-boot-automation.useNixNOS` as legacy in evaluation warnings.
2. Reduce test usage so only one compatibility smoke test still exercises direct `nix-nos` authoring.
3. Change docs/examples to author clusters through `ultracloud.cluster` only.
4. After that, remove the standalone `cluster-config-generator.nix` path.