# Testing

UltraCloud treats VM-first validation as the canonical local proof path and keeps the public support contract limited to three profiles.

## Canonical Profiles

| Profile | Canonical entrypoints | Required components | Optional components |
| --- | --- | --- | --- |
| `single-node dev` | `nix run .#single-node-quickstart`, `nix run .#single-node-trial`, `nix build .#single-node-trial-vm`, `nixosConfigurations.single-node-quickstart`, companion install image `nixosConfigurations.netboot-all-in-one` | `chainfire`, `flaredb`, `iam`, `plasmavmc`, `prismnet` | `lightningstor`, `coronafs`, `flashdns`, `fiberlb`, `apigateway`, `nightlight`, `creditservice`, `k8shost` |
| `3-node HA control plane` | `nixosConfigurations.node01`, `nixosConfigurations.node02`, `nixosConfigurations.node03`, companion install image `nixosConfigurations.netboot-control-plane` | `chainfire`, `flaredb`, `iam`, `nix-agent` on every control-plane node, plus `deployer` on the bootstrap node | `fleet-scheduler`, `node-agent`, `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, `coronafs`, `k8shost`, `apigateway`, `nightlight`, `creditservice` |
| `bare-metal bootstrap` | `nix run ./nix/test-cluster#cluster -- baremetal-iso`, `nixosConfigurations.ultracloud-iso`, `nixosConfigurations.baremetal-qemu-control-plane`, `nixosConfigurations.baremetal-qemu-worker`, `checks.x86_64-linux.baremetal-iso-e2e` | `deployer`, `first-boot-automation`, `install-target`, `nix-agent` | `node-agent`, `fleet-scheduler`, and higher-level storage or edge services after bootstrap |

`nixosConfigurations.netboot-all-in-one` and `nixosConfigurations.netboot-control-plane` are canonical companion images for the single-node and HA profiles. `nixosConfigurations.netboot-worker` is an archived worker helper outside the canonical profiles and their guard set, and `baremetal/vm-cluster` remains a `legacy/manual` debugging path rather than a publishable entrypoint.

## Cluster Authoring Source

`ultracloud.cluster`, backed by `nix/lib/cluster-schema.nix`, is the only supported cluster authoring source. The supported rollout and scheduling tests consume cluster state generated from that module rather than treating `nix-nos` or ad hoc shell state as a primary source. `nix-nos` is limited to legacy compatibility and low-level network primitives such as interfaces, VLANs, BGP, and static routing.

## Quickstart Smoke

```bash
nix flake show . --all-systems | rg -n "quickstart|single-node|trial|container|oci"
nix build .#single-node-trial-vm
nix eval --no-eval-cache .#nixosConfigurations.single-node-quickstart.config.system.build.toplevel.drvPath --raw
nix run .#single-node-quickstart
```

`single-node-trial-vm` is the buildable trial artifact for the minimal VM-platform core, and `single-node-quickstart` is the automated smoke launcher for that same surface. The launcher boots the minimal VM stack under QEMU, waits for `chainfire`, `flaredb`, `iam`, `prismnet`, and `plasmavmc`, verifies their health from inside the guest, and checks the machine-readable product-surface manifest shipped in the VM. The launcher uses the generated NixOS VM runner, so it can fall back to TCG when `/dev/kvm` is absent. `single-node-trial` is a public alias for the same smoke launcher.

An OCI/Docker artifact is intentionally not the public trial surface: the supported scope needs a guest kernel plus host KVM, `/dev/net/tun`, and OVS/libvirt semantics, and a privileged container would not represent the same contract.
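To make the in-guest health gate concrete, here is a minimal sketch of the kind of verification the launcher performs. This is not the harness's real implementation: the `.service` unit names and the timeout are assumptions; only the component list comes from the profile table above.

```bash
# Sketch only: poll the five required services from inside the guest.
# Unit names are taken from the required-components list; the .service
# suffix and the 300-second timeout are assumptions.
for unit in chainfire flaredb iam prismnet plasmavmc; do
  timeout 300 bash -c \
    "until systemctl is-active --quiet ${unit}.service; do sleep 5; done" \
    || { echo "FAIL: ${unit} did not become active" >&2; exit 1; }
done
```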
For debugging, keep the VM alive after the smoke passes:

```bash
ULTRACLOUD_QUICKSTART_KEEP_VM=1 nix run .#single-node-quickstart
```

## 3-Node HA Control Plane

```bash
nix eval --no-eval-cache .#nixosConfigurations.node01.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.node02.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.node03.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.netboot-control-plane.config.system.build.toplevel.drvPath --raw
```

These are the canonical HA control-plane entrypoints. The publishable six-node VM-cluster suite under `./nix/test-cluster` extends this baseline with worker and optional service nodes, but it does not redefine the supported profile names.

## Canonical Bare-Metal Proof

```bash
nix eval --no-eval-cache .#nixosConfigurations.baremetal-qemu-control-plane.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.baremetal-qemu-worker.config.system.build.toplevel.drvPath --raw
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix build .#checks.x86_64-linux.baremetal-iso-e2e
./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/latest
nix run ./nix/test-cluster#hardware-smoke -- preflight
```

`baremetal-iso` is the canonical install path for QEMU-as-bare-metal validation. It boots `nixosConfigurations.ultracloud-iso`, waits for `/api/v1/phone-home`, downloads the flake bundle from `deployer`, runs Disko, reboots, confirms the first post-install boot markers, and waits for `nix-agent` to report the desired system as `active` for both `baremetal-qemu-control-plane` and `baremetal-qemu-worker`.

`baremetal-iso-e2e` now keeps the exact flake attr but changes the execution model: `nix build .#checks.x86_64-linux.baremetal-iso-e2e` materializes `./result/bin/baremetal-iso-e2e`, and that built runner executes the same `nix/test-cluster/verify-baremetal-iso.sh` harness with host KVM and logs under `./work` by default. This avoids the old daemon-sandbox path, where a `nixbld` build fell back to `TCG` instead of using the host's `/dev/kvm`.

The local proof intentionally mirrors the real hardware route. Build `nixosConfigurations.ultracloud-iso`, then either boot that ISO in QEMU with KVM or put the same image on USB or BMC virtual media for the target machine. The live installer consumes the same bootstrap parameters in every environment:

- `ultracloud.deployer_url=` for the reachable `deployer` endpoint
- `ultracloud.bootstrap_token=` for authenticated phone-home, or a lab-only `deployer` with `allow_unauthenticated=true`
- `ultracloud.ca_cert_url=` when `deployer` is TLS-enabled with a private CA
- `ultracloud.binary_cache_url=` when you want the installer to fetch host-built closures instead of compiling locally
- `ultracloud.node_id=` and `ultracloud.hostname=` only when you need to override the DMI-serial or hostname-derived identity

The networking assumptions are also the same. The ISO needs DHCP or equivalent IP configuration that can reach `deployer` before Disko starts, and it must also reach the optional binary cache when that URL is set. The QEMU harness uses user-mode NAT and the built-in `10.0.2.2` fallback endpoints for the local host; physical installs should set the deployer and cache URLs explicitly to routable control-plane addresses.
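Putting those parameters together, the appended kernel command line for a physical install might look like the following. Every value here is an illustrative placeholder; the hostnames, port, and token format are assumptions, not shipped defaults.

```bash
# Illustrative placeholders only: one space-separated line appended to the
# live ISO's kernel command line. Hostnames, port, and token are assumptions.
ultracloud.deployer_url=https://deployer.mgmt.example:8443 ultracloud.bootstrap_token=0123-example-token ultracloud.ca_cert_url=https://deployer.mgmt.example:8443/ca.pem ultracloud.binary_cache_url=https://cache.mgmt.example ultracloud.node_id=rack3-node07
```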
The proven marker sequence from `nix/test-cluster/verify-baremetal-iso.sh` is the same sequence you should expect on hardware: `pre-install.boot`, `pre-install.phone-home.complete`, `install.bundle-downloaded`, `install.disko.complete`, `install.nixos-install.complete`, `reboot`, `post-install.boot`, and finally `nix-agent` reporting the desired system as `active`. USB and BMC virtual media change only how the ISO is presented to the machine; they do not change the bootstrap contract.

## Hardware Bring-Up Pack

```bash
nix run ./nix/test-cluster#hardware-smoke -- preflight
nix run ./nix/test-cluster#hardware-smoke -- run
nix run ./nix/test-cluster#hardware-smoke -- capture
```

`hardware-smoke` is the canonical USB/BMC/Redfish bridge for the physical-node proof. It always writes artifacts under `./work/hardware-smoke/` and refreshes `./work/hardware-smoke/latest`.

- `preflight` emits `kernel-params.txt`, `expected-markers.txt`, `failure-markers.txt`, `operator-handoff.md`, and `status.env`.
- With no USB device or BMC/Redfish credentials, `preflight` records `status=blocked` and the exact missing transport inputs in `missing-requirements.txt`.
- With transport present, the same wrapper can write USB media or call Redfish virtual media and then capture the real `desired-system active` evidence through SSH or a supplied serial log.
- The expected hardware markers are the same `ULTRACLOUD_MARKER pre-install.boot.*`, `pre-install.phone-home.complete.*`, `install.disko.complete.*`, `reboot.*`, `post-install.boot.*`, and `desired-system-active.*` lines used by `verify-baremetal-iso.sh`.

Hardware runbook for the same canonical path:

1. Build `nixosConfigurations.ultracloud-iso` and the target install profiles you want the installer to materialize.
2. Publish cluster state where each reusable node class owns `install_plan.nixos_configuration`, `install_plan.disko_config_path`, and a stable disk selector. Prefer `install_plan.target_disk_by_id` on hardware; the QEMU proof now uses `/dev/disk/by-id/virtio-uc-control-root` and `/dev/disk/by-id/virtio-uc-worker-root` to exercise the same contract. When the live ISO can reach a binary cache, also publish `desired_system.target_system` with the prebuilt closure for that class so `nix-agent` converges to the exact shipped system instead of rebuilding a dirty local copy.
3. Make `deployer` and the optional binary cache reachable from the live ISO, then boot the ISO through USB or BMC virtual media with `ultracloud.deployer_url=...`, `ultracloud.bootstrap_token=...`, and optional `ultracloud.binary_cache_url=...`.
4. Confirm the live installer resolves the install profile, downloads the flake bundle, runs Disko against the selected disk, reboots, and lands on the post-install marker (a serial-log grep sketch follows this list).
5. Confirm `nix-agent` on the installed node converges the desired system to `active`.
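The serial-log check referenced in step 4 might look like the following sketch. The marker prefixes are the ones listed for the bring-up pack above; the log path is an assumption, and the sketch checks presence only, not ordering.

```bash
# Sketch: assert the expected bring-up markers in a captured serial log.
# The log path is an assumption; marker prefixes match expected-markers.txt.
log=./work/hardware-smoke/latest/serial.log
for marker in pre-install.boot pre-install.phone-home.complete \
              install.disko.complete reboot post-install.boot \
              desired-system-active; do
  grep -q "ULTRACLOUD_MARKER ${marker}" "$log" \
    || { echo "missing marker: ${marker}" >&2; exit 1; }
done
echo "all expected markers present in ${log}"
```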
QEMU-to-hardware mapping for the proof:

| QEMU harness proof | Hardware proof |
| --- | --- |
| `nix run ./nix/test-cluster#cluster -- baremetal-iso` | boot the same `nixosConfigurations.ultracloud-iso` through USB or BMC virtual media |
| user-mode NAT fallback to `10.0.2.2` | routable `ultracloud.deployer_url` and optional `ultracloud.binary_cache_url` |
| virtio disk by-id selectors seeded by explicit QEMU serials | server, NVMe, or RAID-controller `/dev/disk/by-id/...` selectors in the node class |
| host-local QEMU logs and SSH on `127.0.0.1:22231/22232` | serial-over-LAN, BMC console, or physical console plus SSH on the installed host |
| same marker sequence and `nix-agent` active gate | same marker sequence and `nix-agent` active gate |

Host prerequisites for the KVM-backed proof are a Linux host with readable and writable `/dev/kvm`, nested virtualization enabled, and enough free space under `./work` or `ULTRACLOUD_WORK_ROOT` for VM disks, logs, and temporary build state. The checked-in wrappers force local Nix builders and derive `max-jobs` and per-build cores from the host CPU count unless `ULTRACLOUD_LOCAL_NIX_MAX_JOBS`, `ULTRACLOUD_LOCAL_NIX_BUILD_CORES`, `PHOTON_CLUSTER_NIX_MAX_JOBS`, or `PHOTON_CLUSTER_NIX_BUILD_CORES` override them.

## Regression Guards

```bash
nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.canonical-profile-build-guards
```

These checks are the fast fail-first drift gates for the supported surface:

- `canonical-profile-eval-guards`: forces evaluation of every canonical profile entrypoint, so broken attrs fail before any long-running harness work starts.
- `canonical-profile-build-guards`: realizes the single-node VM, the HA control-plane configs and companion image, and the ISO and bare-metal outputs, so build-time drift is caught even when a cluster harness is not running.
- `supported-surface-guard`: rejects unfinished public-surface wording across the published docs, add-on workspaces, and VM-cluster harness files; fails on shipped public server code that still contains `Status::unimplemented`, `unimplemented!()`, `todo!()`, or other intentional stub responses; blocks high-signal completeness markers such as `TODO:`, `FIXME`, or `best-effort` in the supported FiberLB, PrismNet, PlasmaVMC, and K8sHost server code paths; and fails if archived helpers such as `netboot-worker`, `plasmavmc-firecracker`, `k8shost-cni`, `k8shost-csi`, or `k8shost-controllers` re-enter the default product surface.

## Portable Local Proof

```bash
nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.portable-control-plane-regressions
```

Use this lane on Linux hosts that do not expose `/dev/kvm` (a lane-selection sketch follows this list):

- `portable-control-plane-regressions`: TCG-safe aggregate check that keeps the canonical profile eval guard, `deployer-bootstrap-e2e`, `host-lifecycle-e2e`, `deployer-vm-smoke`, and `fleet-scheduler-e2e` green together.
- It also links in `supported-surface-guard`, so unsupported product-surface wording, code-level public API stubs, and high-signal completeness markers in the supported provider/backend servers fail in the same low-cost lane before a publishable rerun.
- It intentionally does not boot the six-node nested-KVM VM suite, so it is a developer regression path, not the publishable multi-node proof.
- CI runs `canonical-profile-eval-guards` and `portable-control-plane-regressions` on every relevant change from `.github/workflows/nix.yml`.
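A simple lane-selection sketch that mirrors the stated `/dev/kvm` requirement; the readability and writability probe is an assumption about how the wrappers detect usable KVM, and the commands themselves are the ones documented above.

```bash
# Sketch: pick the KVM suite when /dev/kvm is usable, else the portable lane.
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
  ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite
else
  nix build .#checks.x86_64-linux.canonical-profile-eval-guards
  nix build .#checks.x86_64-linux.portable-control-plane-regressions
fi
```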
## Publishable Checks

```bash
nix run .#single-node-quickstart
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
nix run ./nix/test-cluster#cluster -- rollout-soak
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite
./nix/test-cluster/run-supported-surface-final-proof.sh ./work/final-proofs/latest
nix build .#checks.x86_64-linux.baremetal-iso-e2e
nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/latest
nix build .#checks.x86_64-linux.deployer-vm-smoke
```

Use these commands as the release-facing local proof set:

- `single-node-quickstart`: productized one-command quickstart gate for the minimal VM-platform profile.
- `single-node-trial-vm`: buildable VM appliance for the same minimal VM-platform profile.
- `baremetal-iso`: canonical bare-metal bootstrap gate covering pre-install boot, phone-home, flake bundle fetch, Disko install, reboot, post-install boot, and desired-system activation on one control-plane node plus one worker-equivalent node.
- `fresh-smoke`: base VM-cluster gate for the six-node harness that extends the canonical `3-node HA control plane`, including readiness, core behavior, and fault injection. It also proves the supported PlasmaVMC backend contract by requiring both worker registrations to advertise `HYPERVISOR_TYPE_KVM` and nothing broader on the public surface.
- `fresh-demo-vm-webapp`: optional VM-hosting bundle proof for `plasmavmc + prismnet` with state persisted through `lightningstor`.
- `fresh-matrix`: optional composition proof for provider bundles such as `prismnet + flashdns + fiberlb` and `plasmavmc + coronafs + lightningstor`, including PrismNet security-group ACL add/remove, FiberLB TCP plus TLS-terminated `Https` / `TerminatedHttps` listeners, LightningStor bucket metadata plus object-version APIs, the published `k8shost` pod-watch surface, and the KVM-only PlasmaVMC worker contract.
- `chainfire-live-membership-proof`: focused local-KVM ChainFire lane that starts from the canonical 3-node control plane, adds a temporary learner on `node04`, promotes it to voter, transfers leadership to another live voter, restarts the temporary voter, removes the current leader, re-adds the removed leader, and scales back into the canonical 3-node shape while proving local serializable reads through each transition.
- `provider-vm-reality-proof`: focused local-KVM provider and VM-hosting lane that writes dated artifacts under `./work/provider-vm-reality-proof/latest` and captures authoritative FlashDNS answers, FiberLB backend drain and re-convergence, and PlasmaVMC KVM shared-storage migration plus post-migration restart state.
- `rollout-soak`: focused longer-run control-plane and rollout lane that rebuilds from clean local runtime state, writes dated artifacts under `./work/rollout-soak/latest`, repeats `draining` maintenance and worker power-loss, then restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb` while recording explicit `nix-agent` scope markers for the steady-state KVM nodes.
- `durability-proof`: canonical `chainfire`, `flaredb`, and `deployer` backup/restore lane. It stores artifacts under `./work/durability-proof/latest`, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a `deployer.service` restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures on the live KVM cluster.
- `run-publishable-kvm-suite.sh`: reproducible wrapper that captures the KVM environment, requires real `/dev/kvm` access, keeps runtime state under `./work` by default, and runs the publishable nested-KVM application lanes plus the focused ChainFire live-membership proof in a single command.
- `run-supported-surface-final-proof.sh`: one-shot local wrapper that keeps builders local, records environment metadata, builds `single-node-trial-vm`, runs `supported-surface-guard` and `single-node-quickstart`, and then runs the publishable nested-KVM suite into one dated log root.
- `baremetal-iso-e2e`: materialized exact proof runner for the same canonical ISO harness; the build output keeps the attr stable, and `./result/bin/baremetal-iso-e2e` runs the real host-KVM proof with persisted logs and metadata.
- `deployer-vm-smoke`: lightweight regression proving that `nix-agent` can activate a host-built target closure without guest-side compilation.
- `deployer-vm-rollback`: smallest reproducible `nix-agent` rollback proof. It publishes a desired system with a failing `health_check_command`, expects observed status `rolled-back`, and confirms the node does not stay on the rejected target generation.

`single-node-trial-vm` and `single-node-quickstart` are the standalone VM-platform story. They keep the minimal KVM-backed surface separate from the rollout stack. The checked-in entrypoint for the publishable KVM proof is the local wrapper `./nix/test-cluster/run-publishable-kvm-suite.sh`. Runner-specific workflow wiring from `task/f5c70db0-baseline-profiles` is intentionally excluded from this baseline branch.

The 2026-04-10 local AMD/KVM proof snapshot is recorded under `./work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final` for `supported-surface-guard`, `single-node-trial-vm`, and `single-node-quickstart`; under `./work/publishable-kvm-suite` for the passing `fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`, and wrapper environment capture; and under `./work/rollout-soak/20260410T164549+0900` for the longer-running rollout/control-plane soak. The 2026-04-10 exact bare-metal check-runner proof is recorded under `./work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c`; its outer `environment.txt` records `execution_model=materialized-check-runner`, while `state/environment.txt` records `vm_accelerator_mode=kvm`.

## Responsibility Coverage

- `baremetal-iso` and `baremetal-iso-e2e` are the canonical proof for `deployer -> installer -> nix-agent`. They cover phone-home, install-plan materialization, Disko, reboot, and desired-system activation, and they now share the same `verify-baremetal-iso.sh` runtime harness.
- `deployer-vm-smoke` is the smallest regression for the same `deployer -> nix-agent` boundary. It proves that a node can receive a prebuilt target closure and activate it without guest-side compilation.
- `deployer-vm-rollback` is the canonical operator proof for `nix-agent` health-check, rollback, and partial-failure recovery. Use it with [rollout-bundle.md](rollout-bundle.md) when documenting or changing the host-local rollback contract.
- `portable-control-plane-regressions` keeps the main non-KVM-safe boundaries under continuous coverage by composing `deployer-bootstrap-e2e`, `host-lifecycle-e2e`, `deployer-vm-smoke`, and `fleet-scheduler-e2e` behind the canonical profile eval guard.
- `fresh-smoke` and `fresh-matrix` are the canonical proof for `deployer -> fleet-scheduler -> node-agent`. They cover native service placement, heartbeats, failover, and runtime reconciliation.
- `fresh-smoke` proves the supported `fleet-scheduler` maintenance semantics: short-lived `active -> draining -> active` transitions, fail-stop worker loss, and replica restoration after the node returns.
- `chainfire-live-membership-proof` is the canonical KVM proof for ChainFire live reconfiguration on the supported surface. It covers learner add, local replica catch-up, voter promotion, live leader transfer, temporary-voter restart and rejoin, current-leader removal, removed-leader re-add, and final scale-in on the canonical control-plane shape.
- `rollout-soak` is the longer-running companion lane for the same bundle. It validates exactly one planned drain cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb`, and then revalidates the live cluster. It also writes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the supported release boundary is captured in the proof root. The steady-state KVM nodes do not ship `nix-agent.service`, so the lane records scope markers there and leaves executable `nix-agent` proof to `deployer-vm-rollback`, `baremetal-iso`, and `baremetal-iso-e2e`.
- Multi-hour maintenance windows, arbitrary multi-voter ChainFire swaps that still need joint consensus, larger-cluster or hardware ChainFire live membership reconfiguration beyond the canonical KVM proof lane, destructive FlareDB schema rewrites, fully automated online migration, and large-cluster drain storms remain outside the release-proven scope and are called out explicitly in [rollout-bundle.md](rollout-bundle.md) and [control-plane-ops.md](control-plane-ops.md).
- `fresh-smoke` also covers `k8shost` separately from `fleet-scheduler`: `k8shost` exposes tenant pod and service semantics, while `fleet-scheduler` handles bare-metal host services. `k8shost` is fixed as an API/control-plane product surface; runtime dataplane helpers stay archived non-product.
- `fresh-matrix` keeps the shipped add-on surface honest: it exercises the supported `creditservice` quota, wallet, reservation, and API-gateway flows, the published `k8shost-server` API contract, the supported LightningStor bucket metadata plus object-version APIs, and the network-provider bundle contract for PrismNet ACL lifecycle plus FiberLB TCP and TLS-terminated listeners.
- `provider-vm-reality-proof` is the artifact-producing companion lane for that same provider or VM-hosting bundle. It records PrismNet port and ACL state, authoritative FlashDNS answers, FiberLB listener drain or restore artifacts, and PlasmaVMC migration or storage-handoff state in one dated proof root.
- PrismNet real OVS/OVN dataplane validation remains outside the supported local KVM surface. The current provider proof keeps tenant API lifecycle and attached-VM networking honest, but not a release-grade `ovn-nbctl` or hardware-switch dataplane path.
- FiberLB native BGP or BFD peer interop and hardware VIP ownership remain outside the supported local KVM surface. The current provider proof fixes the shipped contract to listener publication plus backend drain and re-convergence inside the lab.
- PlasmaVMC real-hardware migration or storage handoff remains a later hardware proof. The current provider proof fixes the release surface to KVM shared-storage migration on the local worker pair.
- Within that edge bundle, APIGateway is supported as stateless replicated instances behind an external L4 or VIP layer, but the release-facing proof remains the shipped single gateway-node layout on `node06`; live in-process reload is not promised, and config rollout stays restart-based.
- NightLight is supported as a single-node WAL/snapshot service; replicated HA metrics storage and per-tenant retention enforcement are not part of the current product contract.
- CreditService export and backend migration are supported as offline export/import or backend-native snapshot workflows, not live mixed-writer migration.
- FiberLB HTTPS health checks currently do not verify backend TLS certificates. Supported scope is limited to TCP reachability plus HTTP status for the backend endpoint until CA-aware verification is wired through config, server code, and the canonical harness.
- `durability-proof` is the canonical backup, restore, and failure-injection companion lane for the publishable KVM suite. Use it after `fresh-matrix` when you need persisted artifacts for `chainfire`, `flaredb`, `deployer`, `coronafs`, and `lightningstor`.
- `rollout-soak` is the longer-running maintenance and DR companion lane for the same control-plane and rollout bundle. Use it when a change is supposed to survive the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and service-restart churn on the live KVM lab instead of only the short `fresh-smoke` window.
- `run-core-control-plane-ops-proof.sh` is the focused operator lifecycle proof for the core control plane. It records the published ChainFire API boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under `./work/core-control-plane-ops-proof`.
- The supported `deployer` HA and DR boundary is scope-fixed to one active writer plus optional cold-standby restore, not automatic multi-instance failover. The canonical runbook is to recover one writer, re-apply `ultracloud.cluster`-generated state with `deployer-ctl apply`, replay preserved admin pre-register requests, and then verify state through the admin API or `deployer-ctl node inspect`; the unsupported multi-instance boundary is fixed in [rollout-bundle.md](rollout-bundle.md).
- The supported `node-agent` product contract is also fixed in [rollout-bundle.md](rollout-bundle.md): per-instance logs and pid metadata live under `${stateDir}/pids`, secrets must already exist in the rendered spec or mounted host files, host-path volumes are passed through but not provisioned, and upgrades are replace-and-reconcile operations rather than in-place patching.
- The dated 2026-04-10 proof root for the `durability-proof` lane is `./work/durability-proof/20260410T120618+0900`; `result.json` records `success=true`, and the artifact set includes `deployer-post-restart-list.json`, `coronafs-node04-local-state.json`, and `lightningstor-head-during-node05-outage.json`.
- `single-node-quickstart` intentionally excludes `deployer`, `nix-agent`, `node-agent`, and `fleet-scheduler`, so the smallest trial surface stays focused on the VM-platform core instead of mixing rollout and scheduling responsibilities.

The three `fresh-*` VM-cluster commands plus `chainfire-live-membership-proof` make up the publishable nested-KVM suite. They require a Linux host with `/dev/kvm` and nested virtualization, and the harness stops at preflight by design when that device is absent. `single-node-quickstart` and `baremetal-iso` can still fall back to `TCG` for debugging, but the release-facing `baremetal-iso-e2e` runner now requires host KVM so the exact proof lane matches the shipped hardware-proxy route. `deployer-vm-smoke` and `portable-control-plane-regressions` remain the supported non-KVM developer lanes.

Release-facing completion now requires both of these to be green on the same branch (a composed end-to-end sketch follows the Extended Measurements section below):

- the canonical bare-metal proof: `nix run ./nix/test-cluster#cluster -- baremetal-iso` plus `nix build .#checks.x86_64-linux.baremetal-iso-e2e` and `./result/bin/baremetal-iso-e2e`
- the publishable nested-KVM suite: `fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`, and `chainfire-live-membership-proof`, preferably through `./nix/test-cluster/run-publishable-kvm-suite.sh`

Focused operator lifecycle proof for the core control plane:

```bash
./nix/test-cluster/run-core-control-plane-ops-proof.sh ./work/core-control-plane-ops-proof/latest
```

This proof is lighter than the full KVM suite. It keeps `supported-surface-guard` honest for the control-plane contract, runs the standalone IAM signing-key rotation, credential rotation, and mTLS overlap-rotation tests, and records the explicit ChainFire membership, FlareDB schema-migration and destructive-DDL boundary, and IAM bootstrap-hardening markers that the public docs now promise. The dated 2026-04-10 artifact root for that lane is `./work/core-control-plane-ops-proof/20260410T172148+09:00`; it includes `iam-key-rotation-tests.log`, `iam-credential-rotation-tests.log`, `iam-mtls-rotation-tests.log`, `scope-fixed-contract.json`, and `result.json`.

## Work Root Budget

```bash
./nix/test-cluster/work-root-budget.sh status
./nix/test-cluster/work-root-budget.sh enforce
./nix/test-cluster/work-root-budget.sh cleanup-advice
./nix/test-cluster/work-root-budget.sh prune-proof-logs 2
```

Use `status` for reporting, `enforce` when a local proof run should fail on budget overrun, and `prune-proof-logs 2` for a safer dated-proof cleanup dry run. The helper keeps the local proof path practical by reporting the current size of `./work`, `./work/test-cluster/state`, disposable runtime directories such as `./work/tmp` and `./work/publishable-kvm-runtime`, and the dated proof roots, including `./work/provider-vm-reality-proof` and `./work/hardware-smoke`. The `enforce` mode turns those soft budgets into a non-zero local gate, and `prune-proof-logs` gives a safer dated-proof cleanup workflow before the final `nix store gc`.

## Extended Measurements

```bash
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
```

`fresh-bench-storage` remains useful for storage regression tracking, but it is a benchmark path, not part of the minimal canonical publish gate.
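For reference, the two release-facing completion requirements described above can be composed end to end roughly like this. The ordering and the log-root arguments are assumptions; the individual commands are the documented ones, and both halves must be green on the same branch.

```bash
# Sketch: the release-facing completion gate, run back to back.
set -euo pipefail
# canonical bare-metal proof
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix build .#checks.x86_64-linux.baremetal-iso-e2e
./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/latest
# publishable nested-KVM suite, preferably through the checked-in wrapper
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite
```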
## Operational Commands

```bash
nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- logs node01
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
```

## Validation Philosophy

- Package unit tests are useful but not sufficient.
- Host-built VM clusters are the main integration signal.
- Bootstrap and rollout paths must stay evaluable independently of the larger VM-hosting feature set.
- Distributed storage and virtualization paths must be checked under failure, not only at steady state.

## Legacy And Experimental Paths

- `baremetal/vm-cluster` manual launch scripts are `legacy/manual`, not canonical validation.
- Direct `nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh ...` usage is a debugging path, not the publishable entrypoint.
- Standalone use of `netboot-control-plane` or `netboot-all-in-one` outside the documented profiles is a debugging path, not a fourth supported profile.
- `netboot-worker`, Firecracker, mvisor, `k8shost-cni`, `k8shost-controllers`, and `lightningstor-csi` are archived non-product helpers and should not be presented as canonical entrypoints.
- `netboot-base`, `pxe-server`, `vm-smoke-target`, and other helper images are internal or legacy building blocks, not supported profiles by themselves.