# Component Matrix
UltraCloud now pins the public support surface to three canonical profiles. This page defines the required and optional component bundles for each profile and keeps everything else explicitly outside the core contract.
## Canonical Profiles
### `single-node dev`
- Required components: `chainfire`, `flaredb`, `iam`, `plasmavmc`, `prismnet`
- Optional components: `lightningstor`, `coronafs`, `flashdns`, `fiberlb`, `apigateway`, `nightlight`, `creditservice`, `k8shost`
- Canonical entrypoints: `nix run .#single-node-quickstart`, `nix run .#single-node-trial`, `nix build .#single-node-trial-vm`, `nixosConfigurations.single-node-quickstart`, and companion install image `nixosConfigurations.netboot-all-in-one`
- Optional component toggles: `ultracloud.quickstart.enableLightningStor`, `enableCoronafs`, `enableFlashDNS`, `enableFiberLB`, `enableApiGateway`, `enableNightlight`, `enableCreditService`, `enableK8sHost` (see the module sketch after this list)
- Primary use: one-command local bring-up, API development, and one-box VM experimentation without the HA control-plane or rollout-stack overhead
- Trial artifact: `single-node-trial-vm` is the supported buildable VM appliance for local use; the `single-node-quickstart` or `single-node-trial` app is the smoke launcher for that same minimal surface
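A minimal module sketch for the optional toggles above, assuming every toggle lives under the `ultracloud.quickstart` namespace that the first one spells out; the module wrapper itself is illustrative:

```nix
# Sketch: enabling single-node dev add-ons via the quickstart toggles.
# The option names come from this page; whether they all sit under
# `ultracloud.quickstart` is an assumption based on the first toggle's path.
{
  ultracloud.quickstart = {
    enableLightningStor = true;   # block storage add-on
    enableCoronafs = true;        # shared filesystem add-on
    enableApiGateway = true;      # exercise the tenant API path locally
    enableFlashDNS = false;       # DNS, LB, metrics, billing, and K8s-style
    enableFiberLB = false;        # hosting stay off for a minimal one-box loop
    enableNightlight = false;
    enableCreditService = false;
    enableK8sHost = false;
  };
}
```

The same minimal surface comes up without a rebuild through `nix run .#single-node-quickstart`, and `nix build .#single-node-trial-vm` produces the supported VM appliance.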
### `3-node HA control plane`
- Required components: `chainfire`, `flaredb`, `iam`, `nix-agent` on every control-plane node, plus `deployer` on the bootstrap node
- Optional components: `fleet-scheduler`, `node-agent`, `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, `coronafs`, `k8shost`, `apigateway`, `nightlight`, `creditservice`
- Canonical entrypoints: `nixosConfigurations.node01`, `nixosConfigurations.node02`, `nixosConfigurations.node03`, and companion install image `nixosConfigurations.netboot-control-plane`
- Primary use: stable replicated control plane that can later accept worker, storage, and edge bundles without redefining the bootstrap path
### `bare-metal bootstrap`
- Required components: `deployer`, `first-boot-automation`, `install-target`, `nix-agent`
- Optional components: `node-agent`, `fleet-scheduler`, and higher-level storage or edge services after the first successful rollout
- Canonical entrypoints: `nix run ./nix/test-cluster#cluster -- baremetal-iso`, `nixosConfigurations.ultracloud-iso`, `nixosConfigurations.baremetal-qemu-control-plane`, `nixosConfigurations.baremetal-qemu-worker`, `checks.x86_64-linux.baremetal-iso-e2e`, and the built runner `./result/bin/baremetal-iso-e2e` for the exact host-KVM proof
- Primary use: boot the installer ISO, phone home to `deployer`, fetch the flake bundle, run Disko, reboot, and converge QEMU-emulated or real machines into either the single-node or HA profile
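The Disko step in that flow applies a per-node disk layout; the real layouts presumably ship with the fetched flake bundle. A minimal sketch of such a layout, with a hypothetical device path, sizes, and filesystem choices:

```nix
# Sketch: a hypothetical Disko layout for one install target.
# Device, sizes, and filesystems here are illustrative only.
{
  disko.devices.disk.main = {
    device = "/dev/sda";
    type = "disk";
    content = {
      type = "gpt";
      partitions = {
        boot = {
          size = "512M";
          type = "EF00";  # EFI system partition
          content = { type = "filesystem"; format = "vfat"; mountpoint = "/boot"; };
        };
        root = {
          size = "100%";
          content = { type = "filesystem"; format = "ext4"; mountpoint = "/"; };
        };
      };
    };
  };
}
```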
## Companion And Helper Outputs
- `nixosConfigurations.netboot-all-in-one`: canonical companion install image for `single-node dev`
- `nixosConfigurations.netboot-control-plane`: canonical companion install image for `3-node HA control plane`
- `packages.single-node-trial-vm`: low-friction buildable VM appliance for the minimal VM-platform core
- `nixosConfigurations.netboot-worker`: archived/non-product worker helper kept in-tree for manual lab debugging only
## Cluster Authoring Source
`ultracloud.cluster` backed by `nix/lib/cluster-schema.nix` is the only supported cluster authoring source. It is the canonical input for deployer classes and pools, service placement state, rollout objects, and per-node bootstrap metadata.
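A sketch of what authoring against that schema might look like; only `ultracloud.cluster` itself, the component names, and the `node01`..`node03` names come from this page, and every nested attribute name is an illustrative guess at the schema:

```nix
# Sketch: hypothetical ultracloud.cluster declaration.
# All nested attribute names are guesses; the real shape is defined
# by nix/lib/cluster-schema.nix.
{
  ultracloud.cluster = {
    # deployer classes and pools
    classes.control-plane.services = [ "chainfire" "flaredb" "iam" "nix-agent" ];
    pools.core = {
      class = "control-plane";
      nodes = [ "node01" "node02" "node03" ];
    };
    # per-node bootstrap metadata
    nodes.node01 = {
      bootstrap = true;             # the bootstrap node also runs `deployer`
      address = "10.0.0.11/24";     # illustrative field
    };
  };
}
```

Deployer classes and pools, service placement state, rollout objects, and the `nixosConfigurations.node01`..`node03` entrypoints are all expected to derive from this single declaration.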
`nix-nos` is limited to legacy compatibility and low-level network primitives such as interfaces, VLANs, BGP, and static routing. It is not the canonical source for cluster topology, rollout intent, or scheduler state.
## Optional Composition Bundles
The optional bundles below remain important, but they are layered on top of the canonical profiles rather than treated as separate top-level products:
- control-plane core: `chainfire + flaredb + iam`
- network provider bundle: `prismnet + flashdns + fiberlb`
- VM hosting bundle: `plasmavmc + prismnet + coronafs + lightningstor`
- Kubernetes-style hosting bundle: `k8shost + prismnet + flashdns + fiberlb`
- edge and tenant bundle: `apigateway + iam + nightlight + creditservice`
- native rollout bundle: `deployer + chainfire + nix-agent + fleet-scheduler + node-agent`
`fresh-matrix` is the publishable composition proof because it rebuilds the host-side VM images before validating these bundles on the VM cluster.
For the edge and tenant bundle, the published contract now means three things. APIGateway is supported as stateless replicated instances behind an external L4 or VIP layer; config rollout is restart-based, and live in-process reload is not promised. NightLight is supported as a single-node WAL/snapshot service with instance-wide retention, not as replicated HA metrics storage. CreditService stays scoped to quota, wallet, reservation, and admission control; export or backend migration is handled as offline export/import or backend-native snapshot workflows, not as live mixed-writer migration.
For the network provider bundle specifically, the published contract now means: PrismNet can create tenant VPC/subnet/port state and can add and then delete security-group ACLs deterministically; FlashDNS can publish records for those workloads; and FiberLB can front them with TCP plus TLS-terminated `Https` / `TerminatedHttps` listeners (a listener sketch follows below). `provider-vm-reality-proof` is the artifact-producing companion lane for that surface; it records authoritative DNS answers plus FiberLB backend drain and re-convergence under `./work/provider-vm-reality-proof/latest`. The shipped FiberLB L4 algorithms stay under targeted server tests in-tree.
PrismNet real OVS/OVN dataplane validation remains outside the supported local KVM surface.
FiberLB native BGP or BFD peer interop and hardware VIP ownership remain outside the supported local KVM surface.
FiberLB HTTPS health checks currently do not verify backend TLS certificates. Supported scope is limited to TCP reachability plus HTTP status for the backend endpoint until CA-aware verification is wired through config, server code, and the canonical harness.
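A sketch of what a `TerminatedHttps` listener with the currently supported health-check scope might look like; the listener kinds and the health-check limits come from this page, while the config shape and every field name are illustrative guesses:

```nix
# Sketch: hypothetical FiberLB listener definition as a Nix attrset.
# `TerminatedHttps` is a published listener kind; the field names are not
# the shipped schema.
{
  listeners.tenant-web = {
    kind = "TerminatedHttps";       # TLS terminates at FiberLB
    bind = "0.0.0.0:443";
    certificate = "/var/lib/fiberlb/tenant-web.pem";  # illustrative path
    backends = [ "10.0.1.21:8080" "10.0.1.22:8080" ];
    healthCheck = {
      # Supported scope today: TCP reachability plus HTTP status.
      # Backend TLS certificates are NOT verified (no CA option yet).
      path = "/healthz";
      expectStatus = 200;
    };
  };
}
```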
For the VM hosting bundle, the published PlasmaVMC contract is the KVM-backed VM lifecycle path plus PrismNet-attached guest networking. `provider-vm-reality-proof` records KVM shared-storage migration and post-migration restart artifacts on the worker pair. Real-hardware migration or storage handoff remains a later hardware proof. Firecracker and mvisor code stays in-tree only as archived non-product backend scaffolding until it has end-to-end tenant-network coverage and publishable suite proof.
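On the same assumption-heavy basis, a sketch of a KVM-backed guest with a PrismNet-attached port; only the KVM backend and the PrismNet attachment are contractual, and every attribute name is hypothetical:

```nix
# Sketch: hypothetical PlasmaVMC guest spec.
# `kvm` as the only supported public backend is contractual;
# the attribute names are illustrative.
{
  plasmavmc.guests.demo = {
    backend = "kvm";
    vcpus = 2;
    memoryMiB = 2048;
    network.port = {
      vpc = "tenant-a";             # PrismNet tenant VPC/subnet/port state
      subnet = "10.20.0.0/24";
    };
  };
}
```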
## Responsibility Boundaries
- `k8shost`: tenant workload API surface. It manages pod, deployment, and service semantics, then delegates network publication to `prismnet`, `flashdns`, and `fiberlb`.
- `k8shost` is fixed as an API/control-plane product surface. Supported binaries stop at `k8shost-server`; `k8shost-cni`, `lightningstor-csi`, and `k8shost-controllers` stay archived non-product until they have their own published coverage and a real network or storage dataplane contract.
- `plasmavmc`: tenant VM API surface. The supported public backend is KVM; it can run against explicit remote IAM, PrismNet, and FlareDB endpoints, and other backend implementations stay outside the canonical contract until they have end-to-end runtime and tenant-network coverage.
- `creditservice`: tenant quota, wallet, reservation, and admission-control surface. It stays in the supported bundle because `fresh-matrix` exercises both its direct APIs and the API-gateway path.
- `fleet-scheduler`: bare-metal service placement surface. It schedules host-native service instances from declarative cluster state generated from `ultracloud.cluster` plus `node-agent` heartbeats, without exposing Kubernetes APIs.
- `deployer`: enrollment and rollout authority. It serves `/api/v1/phone-home`, stores install plans and desired-system references, and seeds cluster metadata from the generated `ultracloud.cluster` state.
- `nix-agent`: host OS reconciler. It turns `deployer` desired-system references into `switch-to-configuration` actions plus rollback and health-check handling.
- `node-agent`: host runtime reconciler. It applies scheduled service-instance state, keeps runtime heartbeats fresh, and reports host-local execution status back to the scheduler.
The intended layering is `deployer -> nix-agent` for machine image or NixOS generation changes, and `deployer -> fleet-scheduler -> node-agent` for host-native service placement changes. `k8shost` stays separate because it is the tenant workload control plane, not the native service scheduler. The `single-node dev` profile intentionally stops before that rollout stack and keeps only the VM-platform core plus explicit add-ons.
## Standalone Stories
- `single-node-trial-vm` and `single-node-quickstart` are the standalone VM-platform story for the minimal KVM-backed surface.
- `deployer-vm-smoke`, `portable-control-plane-regressions`, and `baremetal-iso` are the standalone rollout-stack story for `deployer -> nix-agent` and `deployer -> fleet-scheduler -> node-agent`.
- An OCI/Docker artifact is intentionally not the public trial surface, because the supported VM-platform contract depends on a guest kernel plus host KVM, `/dev/net/tun`, and OVS/libvirt semantics.
## Archived Scaffolds
- `k8shost-cni`: internal scaffold for old tenant-network experiments; excluded from default workspace members and canonical docs
- `k8shost-controllers`: controller prototype scaffold; excluded from default workspace members and canonical docs
- `lightningstor-csi`: storage helper prototype; excluded from default workspace members and canonical docs
- Firecracker and mvisor: archived PlasmaVMC backend scaffolds outside the supported KVM-only contract and excluded from the default PlasmaVMC workspace members
- `nixosConfigurations.netboot-worker`: archived worker helper image outside canonical profile guards
- `baremetal/vm-cluster`: `legacy/manual` debugging path outside the main product surface
## Non-Canonical Paths
- `baremetal/vm-cluster` remains `legacy/manual`
- standalone use of `netboot-control-plane` or `netboot-all-in-one` outside the documented profiles is a debugging path, not a fourth supported profile
- `netboot-worker`, Firecracker, mvisor, `k8shost-cni`, `lightningstor-csi`, and `k8shost-controllers` are archived non-product scaffolds rather than canonical entrypoints
- `netboot-base`, `pxe-server`, and `vm-smoke-target` are internal or legacy helpers, not supported profiles by themselves
- ad hoc shell-driven cluster bring-up is for debugging only and should not be presented as the canonical public path