Testing

UltraCloud treats VM-first validation as the canonical local proof path and keeps the public support contract limited to three profiles.

Canonical Profiles

single-node dev
  • Canonical entrypoints: nix run .#single-node-quickstart, nix run .#single-node-trial, nix build .#single-node-trial-vm, nixosConfigurations.single-node-quickstart, and the companion install image nixosConfigurations.netboot-all-in-one
  • Required components: chainfire, flaredb, iam, plasmavmc, prismnet
  • Optional components: lightningstor, coronafs, flashdns, fiberlb, apigateway, nightlight, creditservice, k8shost

3-node HA control plane
  • Canonical entrypoints: nixosConfigurations.node01, nixosConfigurations.node02, nixosConfigurations.node03, and the companion install image nixosConfigurations.netboot-control-plane
  • Required components: chainfire, flaredb, iam, nix-agent on every control-plane node, plus deployer on the bootstrap node
  • Optional components: fleet-scheduler, node-agent, prismnet, flashdns, fiberlb, plasmavmc, lightningstor, coronafs, k8shost, apigateway, nightlight, creditservice

bare-metal bootstrap
  • Canonical entrypoints: nix run ./nix/test-cluster#cluster -- baremetal-iso, nixosConfigurations.ultracloud-iso, nixosConfigurations.baremetal-qemu-control-plane, nixosConfigurations.baremetal-qemu-worker, checks.x86_64-linux.baremetal-iso-e2e
  • Required components: deployer, first-boot-automation, install-target, nix-agent
  • Optional components: node-agent, fleet-scheduler, and higher-level storage or edge services after bootstrap

nixosConfigurations.netboot-all-in-one and nixosConfigurations.netboot-control-plane are canonical companion images for the single-node and HA profiles. nixosConfigurations.netboot-worker is an archived worker helper outside the canonical profiles and their guard set, and baremetal/vm-cluster remains a legacy/manual debugging path rather than a publishable entrypoint.

Cluster Authoring Source

ultracloud.cluster, backed by nix/lib/cluster-schema.nix, is the only supported cluster authoring source. The supported rollout and scheduling tests consume cluster state generated from that module rather than treating nix-nos or ad hoc shell state as a primary source.

nix-nos is limited to legacy compatibility and low-level network primitives such as interfaces, VLANs, BGP, and static routing.

Quickstart Smoke

nix flake show . --all-systems | rg -n "quickstart|single-node|trial|container|oci"
nix build .#single-node-trial-vm
nix eval --no-eval-cache .#nixosConfigurations.single-node-quickstart.config.system.build.toplevel.drvPath --raw
nix run .#single-node-quickstart

single-node-trial-vm is the buildable trial artifact for the minimal VM-platform core, and single-node-quickstart is the automated smoke launcher for that same surface. The launcher boots the minimal VM stack under QEMU, waits for chainfire, flaredb, iam, prismnet, and plasmavmc, verifies their health from inside the guest, and checks the machine-readable product-surface manifest shipped in the VM. The launcher uses the generated NixOS VM runner, so it can fall back to TCG when /dev/kvm is absent.
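
For a rough idea of what that in-guest health gate amounts to, the following is a minimal sketch; it assumes the five required components run as systemd units named after themselves inside the guest, and the real launcher performs these checks itself (plus the manifest check) automatically.

# Sketch only: unit names are assumptions; the quickstart launcher runs the real checks itself.
for svc in chainfire flaredb iam prismnet plasmavmc; do
  systemctl is-active --quiet "$svc" || { echo "not ready: $svc"; exit 1; }
done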

single-node-trial is a public alias for the same smoke launcher. An OCI/Docker artifact is intentionally not the public trial surface because the supported scope needs a guest kernel plus host KVM, /dev/net/tun, and OVS/libvirt semantics; a privileged container would not represent the same contract.

For debugging, keep the VM alive after the smoke passes:

ULTRACLOUD_QUICKSTART_KEEP_VM=1 nix run .#single-node-quickstart

3-Node HA Control Plane

nix eval --no-eval-cache .#nixosConfigurations.node01.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.node02.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.node03.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.netboot-control-plane.config.system.build.toplevel.drvPath --raw

These are the canonical HA control-plane entrypoints. The publishable six-node VM-cluster suite under ./nix/test-cluster extends this baseline with worker and optional service nodes, but it does not redefine the supported profile names.
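
When checking the whole HA surface at once, the four eval commands above can also be driven as one loop; this is only a convenience wrapper around the same canonical entrypoints.

# Same canonical entrypoints as above, iterated in one shell loop.
for cfg in node01 node02 node03 netboot-control-plane; do
  nix eval --no-eval-cache ".#nixosConfigurations.${cfg}.config.system.build.toplevel.drvPath" --raw
  echo
done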

Canonical Bare-Metal Proof

nix eval --no-eval-cache .#nixosConfigurations.baremetal-qemu-control-plane.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.baremetal-qemu-worker.config.system.build.toplevel.drvPath --raw
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix build .#checks.x86_64-linux.baremetal-iso-e2e
./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/latest
nix run ./nix/test-cluster#hardware-smoke -- preflight

baremetal-iso is the canonical install path for QEMU-as-bare-metal validation. It boots nixosConfigurations.ultracloud-iso, waits for /api/v1/phone-home, downloads the flake bundle from deployer, runs Disko, reboots, confirms the first post-install boot markers, and waits for nix-agent to report the desired system as active for both baremetal-qemu-control-plane and baremetal-qemu-worker.

baremetal-iso-e2e now keeps the exact flake attr but changes the execution model: nix build .#checks.x86_64-linux.baremetal-iso-e2e materializes ./result/bin/baremetal-iso-e2e, and that built runner executes the same nix/test-cluster/verify-baremetal-iso.sh harness with host KVM and logs under ./work by default. This avoids the old daemon-sandbox path where a nixbld build fell back to TCG instead of the host's /dev/kvm.

The local proof intentionally mirrors the real hardware route. Build nixosConfigurations.ultracloud-iso, then either boot that ISO in QEMU with KVM or put the same image on USB or BMC virtual media for the target machine. The live installer consumes the same bootstrap parameters in every environment (a combined example follows the list):

  • ultracloud.deployer_url=<scheme://host:port> for the reachable deployer endpoint
  • ultracloud.bootstrap_token=<token> for authenticated phone-home, or a lab-only deployer with allow_unauthenticated=true
  • ultracloud.ca_cert_url=<https://.../ca.crt> when deployer is TLS-enabled with a private CA
  • ultracloud.binary_cache_url=<http://cache:8090> when you want the installer to fetch host-built closures instead of compiling locally
  • ultracloud.node_id= and ultracloud.hostname= only when you need to override the DMI-serial or hostname-derived identity
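
As a combined illustration of the parameters above, a lab-style kernel command line might look like the following; the host, port, and token values are placeholders, not defaults shipped by the installer.

# Hypothetical lab values only; substitute your own deployer, token, and cache endpoints.
ultracloud.deployer_url=http://10.0.2.2:8080 ultracloud.bootstrap_token=example-token ultracloud.binary_cache_url=http://10.0.2.2:8090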

The networking assumptions are also the same. The ISO needs DHCP or equivalent IP configuration that can reach deployer before Disko starts, and it must also reach the optional binary cache when that URL is set. The QEMU harness uses user-mode NAT and the built-in 10.0.2.2 fallback endpoints for the local host; physical installs should set the deployer and cache URLs explicitly to routable control-plane addresses.
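
Before starting an install on hardware, a quick reachability probe against the configured deployer URL can save a failed Disko run; this is a sketch only, and the URL, port, and environment variable are placeholders rather than values the installer defines.

# Placeholder URL and variable name; any HTTP response (even an error status) proves the endpoint is reachable.
curl -sS -o /dev/null -w '%{http_code}\n' "${DEPLOYER_URL:-http://10.0.2.2:8080}"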

The proven marker sequence from nix/test-cluster/verify-baremetal-iso.sh is the same sequence you should expect on hardware: pre-install.boot, pre-install.phone-home.complete, install.bundle-downloaded, install.disko.complete, install.nixos-install.complete, reboot, post-install.boot, and finally nix-agent reporting the desired system as active. USB and BMC virtual media change only how the ISO is presented to the machine; they do not change the bootstrap contract.
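
When a run finishes (or stalls), the marker lines can be pulled straight out of the persisted logs; the run root shown here matches the baremetal-iso-e2e example elsewhere in this document, and the exact file names inside it are an assumption.

# Search a persisted run root for the ULTRACLOUD_MARKER sequence; adjust the path to your run.
rg -n "ULTRACLOUD_MARKER" ./work/baremetal-iso-e2e/latest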

Hardware Bring-Up Pack

nix run ./nix/test-cluster#hardware-smoke -- preflight
nix run ./nix/test-cluster#hardware-smoke -- run
nix run ./nix/test-cluster#hardware-smoke -- capture

hardware-smoke is the canonical USB/BMC/Redfish bridge for the physical-node proof. It always writes artifacts under ./work/hardware-smoke/<run-id> and refreshes ./work/hardware-smoke/latest.

  • preflight emits kernel-params.txt, expected-markers.txt, failure-markers.txt, operator-handoff.md, and status.env.
  • With no USB device or BMC/Redfish credentials, preflight records status=blocked and the exact missing transport inputs in missing-requirements.txt.
  • With transport present, the same wrapper can write USB media or call Redfish virtual media and then capture the real desired-system active evidence through SSH or a supplied serial log.
  • The expected hardware markers are the same ULTRACLOUD_MARKER pre-install.boot.*, pre-install.phone-home.complete.*, install.disko.complete.*, reboot.*, post-install.boot.*, and desired-system-active.* lines used by verify-baremetal-iso.sh.
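
After a preflight, the quickest way to see whether the run is blocked or ready is to read the status artifacts listed above; treat this as a sketch, since the exact keys inside status.env are not fixed by this document.

# Inspect the latest preflight outcome; missing-requirements.txt may be absent when transport inputs are present.
cat ./work/hardware-smoke/latest/status.env
cat ./work/hardware-smoke/latest/missing-requirements.txt 2>/dev/null || true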

Hardware runbook for the same canonical path:

  1. Build nixosConfigurations.ultracloud-iso and the target install profiles you want the installer to materialize.
  2. Publish cluster state where each reusable node class owns install_plan.nixos_configuration, install_plan.disko_config_path, and a stable disk selector. Prefer install_plan.target_disk_by_id on hardware; the QEMU proof now uses /dev/disk/by-id/virtio-uc-control-root and /dev/disk/by-id/virtio-uc-worker-root to exercise the same contract. When the live ISO can reach a binary cache, also publish desired_system.target_system with the prebuilt closure for that class so nix-agent converges to the exact shipped system instead of rebuilding a dirty local copy.
  3. Make deployer and the optional binary cache reachable from the live ISO, then boot the ISO through USB or BMC virtual media with ultracloud.deployer_url=..., ultracloud.bootstrap_token=..., and optional ultracloud.binary_cache_url=....
  4. Confirm the live installer resolves the install profile, downloads the flake bundle, runs Disko against the selected disk, reboots, and lands on the post-install marker.
  5. Confirm nix-agent on the installed node converges the desired system to active.

QEMU-to-hardware mapping for the proof:

QEMU harness proof -> Hardware proof
  • nix run ./nix/test-cluster#cluster -- baremetal-iso -> boot the same nixosConfigurations.ultracloud-iso through USB or BMC virtual media
  • user-mode NAT fallback to 10.0.2.2 -> routable ultracloud.deployer_url and optional ultracloud.binary_cache_url
  • virtio disk by-id selectors seeded by explicit QEMU serials -> server, NVMe, or RAID-controller /dev/disk/by-id/... selectors in the node class
  • host-local QEMU logs and SSH on 127.0.0.1:22231/22232 -> serial-over-LAN, BMC console, or physical console plus SSH on the installed host
  • same marker sequence and nix-agent active gate -> same marker sequence and nix-agent active gate

Host prerequisites for the KVM-backed proof are a Linux host with readable and writable /dev/kvm, nested virtualization enabled, and enough free space under ./work or ULTRACLOUD_WORK_ROOT for VM disks, logs, and temporary build state. The checked-in wrappers force local Nix builders and derive max-jobs and per-build cores from the host CPU count unless ULTRACLOUD_LOCAL_NIX_MAX_JOBS, ULTRACLOUD_LOCAL_NIX_BUILD_CORES, PHOTON_CLUSTER_NIX_MAX_JOBS, or PHOTON_CLUSTER_NIX_BUILD_CORES override them.
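
A short host preflight along these lines can catch the common blockers before any long proof starts; the commands are generic Linux checks, and the work-root default mirrors the ./work and ULTRACLOUD_WORK_ROOT convention described above.

# Host sanity sketch: KVM device access, nested virt, free space where the work root lives, and CPU count.
[ -r /dev/kvm ] && [ -w /dev/kvm ] && echo "kvm: ok" || echo "kvm: missing or not accessible"
cat /sys/module/kvm_amd/parameters/nested 2>/dev/null || cat /sys/module/kvm_intel/parameters/nested 2>/dev/null
df -h "${ULTRACLOUD_WORK_ROOT:-.}"
nproc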

Regression Guards

nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.canonical-profile-build-guards

The two checks above, together with the supported-surface-guard check, are the fast fail-first drift gates for the supported surface:

  • canonical-profile-eval-guards: forces evaluation of every canonical profile entrypoint, so broken attrs fail before any long-running harness work starts.
  • canonical-profile-build-guards: realizes the single-node VM, the HA control-plane configs and companion image, and the ISO or bare-metal outputs so build-time drift is caught even when a cluster harness is not running.
  • supported-surface-guard: rejects unfinished public-surface wording across the published docs, add-on workspaces, and VM-cluster harness files. It fails on shipped public server code that still contains Status::unimplemented, unimplemented!(), todo!(), or other intentional stub responses, and it blocks high-signal completeness markers such as TODO:, FIXME, or best-effort in the supported FiberLB, PrismNet, PlasmaVMC, and K8sHost server code paths. It also fails if archived helpers such as netboot-worker, plasmavmc-firecracker, k8shost-cni, k8shost-csi, or k8shost-controllers re-enter the default product surface.

Portable Local Proof

nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.portable-control-plane-regressions

Use this lane on Linux hosts that do not expose /dev/kvm:

  • portable-control-plane-regressions: TCG-safe aggregate check that keeps the canonical profile eval guard, deployer-bootstrap-e2e, host-lifecycle-e2e, deployer-vm-smoke, and fleet-scheduler-e2e green together.
  • It also links in supported-surface-guard, so unsupported product-surface wording, code-level public API stubs, or high-signal completeness markers in the supported provider/backend servers fail in the same low-cost lane before a publishable rerun.
  • It intentionally does not boot the six-node nested-KVM VM suite, so it is a developer regression path, not the publishable multi-node proof.
  • CI runs canonical-profile-eval-guards and portable-control-plane-regressions on every relevant change from .github/workflows/nix.yml.
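
Both portable checks can be requested in a single nix build invocation, which is convenient when mirroring the CI behaviour locally; this simply combines the two attributes from the commands above.

# Same two checks as above, built in one invocation.
nix build .#checks.x86_64-linux.canonical-profile-eval-guards .#checks.x86_64-linux.portable-control-plane-regressions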

Publishable Checks

nix run .#single-node-quickstart
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
nix run ./nix/test-cluster#cluster -- rollout-soak
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite
./nix/test-cluster/run-supported-surface-final-proof.sh ./work/final-proofs/latest
nix build .#checks.x86_64-linux.baremetal-iso-e2e
nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/latest
nix build .#checks.x86_64-linux.deployer-vm-smoke

Use these commands as the release-facing local proof set:

  • single-node-quickstart: productized one-command quickstart gate for the minimal VM platform profile
  • single-node-trial-vm: buildable VM appliance for the same minimal VM-platform profile
  • baremetal-iso: canonical bare-metal bootstrap gate covering pre-install boot, phone-home, flake bundle fetch, Disko install, reboot, post-install boot, and desired-system activation on one control-plane node plus one worker-equivalent node
  • fresh-smoke: base VM-cluster gate for the six-node harness that extends the canonical 3-node HA control plane, including readiness, core behavior, and fault injection
  • fresh-smoke also proves the supported PlasmaVMC backend contract by requiring both worker registrations to advertise HYPERVISOR_TYPE_KVM and nothing broader on the public surface
  • fresh-demo-vm-webapp: optional VM-hosting bundle proof for plasmavmc + prismnet with state persisted through lightningstor
  • fresh-matrix: optional composition proof for provider bundles such as prismnet + flashdns + fiberlb and plasmavmc + coronafs + lightningstor, including PrismNet security-group ACL add/remove, FiberLB TCP plus TLS-terminated Https / TerminatedHttps listeners, LightningStor bucket metadata plus object-version APIs, the published k8shost pod-watch surface, and the KVM-only PlasmaVMC worker contract
  • chainfire-live-membership-proof: focused local-KVM ChainFire lane that starts from the canonical 3-node control plane, adds a temporary learner on node04, promotes it to voter, transfers leadership to another live voter, restarts the temporary voter, removes the current leader, re-adds the removed leader, and scales back into the canonical 3-node shape while proving local serializable reads through each transition
  • provider-vm-reality-proof: focused local-KVM provider and VM-hosting lane that writes dated artifacts under ./work/provider-vm-reality-proof/latest, captures authoritative FlashDNS answers, FiberLB backend drain and re-convergence, and PlasmaVMC KVM shared-storage migration plus post-migration restart state
  • rollout-soak: focused longer-run control-plane and rollout lane that rebuilds from clean local runtime state, writes dated artifacts under ./work/rollout-soak/latest, repeats draining maintenance and worker power-loss, then restarts deployer, fleet-scheduler, node-agent, chainfire, and flaredb while recording explicit nix-agent scope markers for the steady-state KVM nodes
  • durability-proof: canonical chainfire, flaredb, and deployer backup/restore lane. It stores artifacts under ./work/durability-proof/latest, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a deployer.service restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures on the live KVM cluster
  • run-publishable-kvm-suite.sh: reproducible wrapper that captures the KVM environment, requires real /dev/kvm access, keeps runtime state under ./work by default, and runs the publishable nested-KVM application lanes plus the focused ChainFire live-membership proof in a single command
  • run-supported-surface-final-proof.sh: one-shot local wrapper that keeps builders local, records environment metadata, builds single-node-trial-vm, runs supported-surface-guard, single-node-quickstart, and then the publishable nested-KVM suite into one dated log root
  • baremetal-iso-e2e: materialized exact proof runner for the same canonical ISO harness; the build output keeps the flake attr stable, and ./result/bin/baremetal-iso-e2e runs the real host-KVM proof with persisted logs and metadata
  • deployer-vm-smoke: lightweight regression proving that nix-agent can activate a host-built target closure without guest-side compilation
  • deployer-vm-rollback: smallest reproducible nix-agent rollback proof. It publishes a desired system with a failing health_check_command, expects observed status rolled-back, and confirms the node does not stay on the rejected target generation

single-node-trial-vm and single-node-quickstart are the standalone VM-platform story. They keep the minimal KVM-backed surface separate from the rollout stack.

The checked-in entrypoint for the publishable KVM proof is the local wrapper ./nix/test-cluster/run-publishable-kvm-suite.sh. Runner-specific workflow wiring from task/f5c70db0-baseline-profiles is intentionally excluded from this baseline branch. The 2026-04-10 local AMD/KVM proof snapshot is recorded under ./work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final for supported-surface-guard, single-node-trial-vm, and single-node-quickstart, under ./work/publishable-kvm-suite for the passing fresh-smoke, fresh-demo-vm-webapp, fresh-matrix, and wrapper environment capture, and under ./work/rollout-soak/20260410T164549+0900 for the longer-running rollout/control-plane soak. The 2026-04-10 exact bare-metal check-runner proof is recorded under ./work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c; its outer environment.txt records execution_model=materialized-check-runner, while state/environment.txt records vm_accelerator_mode=kvm.

Responsibility Coverage

  • baremetal-iso and baremetal-iso-e2e are the canonical proof for deployer -> installer -> nix-agent. They cover phone-home, install-plan materialization, Disko, reboot, and desired-system activation, and they now share the same verify-baremetal-iso.sh runtime harness.
  • deployer-vm-smoke is the smallest regression for the same deployer -> nix-agent boundary. It proves that a node can receive a prebuilt target closure and activate it without guest-side compilation.
  • deployer-vm-rollback is the canonical operator proof for nix-agent health-check, rollback, and partial failure recovery. Use it with rollout-bundle.md when documenting or changing the host-local rollback contract.
  • portable-control-plane-regressions keeps the main non-KVM-safe boundaries under continuous coverage by composing deployer-bootstrap-e2e, host-lifecycle-e2e, deployer-vm-smoke, and fleet-scheduler-e2e behind the canonical profile eval guard.
  • fresh-smoke and fresh-matrix are the canonical proof for deployer -> fleet-scheduler -> node-agent. They cover native service placement, heartbeats, failover, and runtime reconciliation.
  • fresh-smoke proves the supported fleet-scheduler maintenance semantics: short-lived active -> draining -> active transitions, fail-stop worker loss, and replica restoration after the node returns.
  • chainfire-live-membership-proof is the canonical KVM proof for ChainFire live reconfiguration on the supported surface. It covers learner add, local replica catch-up, voter promotion, live leader transfer, temporary-voter restart and rejoin, current-leader removal, removed-leader re-add, and final scale-in on the canonical control-plane shape.
  • rollout-soak is the longer-running companion lane for the same bundle. It validates exactly one planned drain cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, restarts deployer, fleet-scheduler, node-agent, chainfire, and flaredb, and then revalidates the live cluster. It also writes scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt so the supported release boundary is captured in the proof root. The steady-state KVM nodes do not ship nix-agent.service, so the lane records scope markers there and leaves executable nix-agent proof to deployer-vm-rollback, baremetal-iso, and baremetal-iso-e2e.
  • Multi-hour maintenance windows, arbitrary multi-voter ChainFire swaps that still need joint consensus, larger-cluster or hardware ChainFire live membership reconfiguration beyond the canonical KVM proof lane, destructive FlareDB schema rewrites, fully automated online migration, and large-cluster drain storms remain outside the release-proven scope and are called out explicitly in rollout-bundle.md and control-plane-ops.md.
  • fresh-smoke also covers k8shost separately from fleet-scheduler: k8shost exposes tenant pod and service semantics, while fleet-scheduler handles bare-metal host services. k8shost is fixed as an API/control-plane product surface; runtime dataplane helpers stay archived non-product.
  • fresh-matrix keeps the shipped add-on surface honest: it exercises the supported creditservice quota, wallet, reservation, and API-gateway flows, the published k8shost-server API contract, the supported LightningStor bucket metadata plus object-version APIs, and the network-provider bundle contract for PrismNet ACL lifecycle plus FiberLB TCP and TLS-terminated listeners.
  • provider-vm-reality-proof is the artifact-producing companion lane for that same provider or VM-hosting bundle. It records PrismNet port and ACL state, authoritative FlashDNS answers, FiberLB listener drain or restore artifacts, and PlasmaVMC migration or storage-handoff state in one dated proof root.
  • PrismNet real OVS/OVN dataplane validation remains outside the supported local KVM surface. The current provider proof keeps tenant API lifecycle and attached-VM networking honest, but not a release-grade ovn-nbctl or hardware-switch dataplane path.
  • FiberLB native BGP or BFD peer interop and hardware VIP ownership remain outside the supported local KVM surface. The current provider proof fixes the shipped contract to listener publication plus backend drain and re-convergence inside the lab.
  • PlasmaVMC real-hardware migration or storage handoff remains a later hardware proof. The current provider proof fixes the release surface to KVM shared-storage migration on the local worker pair.
  • Within that edge bundle, APIGateway is supported as stateless replicated instances behind an external L4 or VIP layer, but the release-facing proof remains the shipped single gateway-node layout on node06; live in-process reload is not promised, and config rollout stays restart-based.
  • NightLight is supported as a single-node WAL/snapshot service; replicated HA metrics storage and per-tenant retention enforcement are not part of the current product contract.
  • CreditService export and backend migration are supported as offline export/import or backend-native snapshot workflows, not live mixed-writer migration.
  • FiberLB HTTPS health checks currently do not verify backend TLS certificates. Supported scope is limited to TCP reachability plus HTTP status for the backend endpoint until CA-aware verification is wired through config, server code, and the canonical harness.
  • durability-proof is the canonical backup, restore, and failure-injection companion lane for the publishable KVM suite. Use it after fresh-matrix when you need persisted artifacts for chainfire, flaredb, deployer, coronafs, and lightningstor.
  • rollout-soak is the longer-running maintenance and DR companion lane for the same control-plane and rollout bundle. Use it when a change is supposed to survive the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and service-restart churn on the live KVM lab instead of only the short fresh-smoke window.
  • run-core-control-plane-ops-proof.sh is the focused operator lifecycle proof for the core control plane. It records the published ChainFire API boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under ./work/core-control-plane-ops-proof.
  • The supported deployer HA and DR boundary is scope-fixed to one active writer plus optional cold-standby restore, not automatic multi-instance failover. The canonical runbook is to recover one writer, re-apply ultracloud.cluster generated state with deployer-ctl apply, replay preserved admin pre-register requests, and then verify state through the admin API or deployer-ctl node inspect; the unsupported multi-instance boundary is fixed in rollout-bundle.md.
  • The supported node-agent product contract is also fixed in rollout-bundle.md: per-instance logs and pid metadata live under ${stateDir}/pids, secrets must already exist in the rendered spec or mounted host files, host-path volumes are passed through but not provisioned, and upgrades are replace-and-reconcile operations rather than in-place patching.
  • The dated 2026-04-10 proof root for the durability-proof lane is ./work/durability-proof/20260410T120618+0900; result.json records success=true, and the artifact set includes deployer-post-restart-list.json, coronafs-node04-local-state.json, and lightningstor-head-during-node05-outage.json.
  • single-node-quickstart intentionally excludes deployer, nix-agent, node-agent, and fleet-scheduler, so the smallest trial surface stays focused on the VM-platform core instead of mixing rollout and scheduling responsibilities.
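
As a spot check after a rollout-soak run, the scope-fixed artifacts named above can be listed directly from the proof root; the only assumption here is that they sit at the top level of the dated directory.

# Confirm the scope-fixed contract artifacts exist in the latest rollout-soak proof root (flat layout assumed).
ls ./work/rollout-soak/latest/scope-fixed-contract.json \
   ./work/rollout-soak/latest/deployer-scope-fixed.txt \
   ./work/rollout-soak/latest/fleet-scheduler-scope-fixed.txt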

The three fresh-* VM-cluster commands plus chainfire-live-membership-proof make up the publishable nested-KVM suite. They require a Linux host with /dev/kvm and nested virtualization, and the harness stops at preflight by design when that device is absent. single-node-quickstart and baremetal-iso can still fall back to TCG for debugging, but the release-facing baremetal-iso-e2e runner now requires host KVM so the exact proof lane matches the shipped hardware proxy route. deployer-vm-smoke and portable-control-plane-regressions remain the supported non-KVM developer lanes.

Release-facing completion now requires both of these to be green on the same branch:

  • the canonical bare-metal proof: nix run ./nix/test-cluster#cluster -- baremetal-iso plus nix build .#checks.x86_64-linux.baremetal-iso-e2e and ./result/bin/baremetal-iso-e2e
  • the publishable nested-KVM suite: fresh-smoke, fresh-demo-vm-webapp, fresh-matrix, and chainfire-live-membership-proof, preferably through ./nix/test-cluster/run-publishable-kvm-suite.sh
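
Run together on one branch, these gates amount to the following local sequence; the ordering is a suggestion, and every command is taken verbatim from the lists earlier in this section.

# Suggested ordering only; all commands appear earlier in this document.
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/latest
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite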

Focused operator lifecycle proof for the core control plane:

./nix/test-cluster/run-core-control-plane-ops-proof.sh ./work/core-control-plane-ops-proof/latest

This proof is lighter than the full KVM suite. It keeps supported-surface-guard honest for the control-plane contract, runs the standalone IAM signing-key rotation, credential rotation, and mTLS overlap rotation tests, and records the explicit ChainFire membership, FlareDB schema migration or destructive-DDL boundary, and IAM bootstrap hardening markers that the public docs now promise. The dated 2026-04-10 artifact root for that lane is ./work/core-control-plane-ops-proof/20260410T172148+09:00; it includes iam-key-rotation-tests.log, iam-credential-rotation-tests.log, iam-mtls-rotation-tests.log, scope-fixed-contract.json, and result.json.
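
A dated ops-proof run can be spot-checked by listing the proof root and reading result.json; the artifact names come from the paragraph above, and the flat layout inside the run root is an assumption.

# Spot-check the ops-proof artifacts; the path points at a run root like the commands above produce.
ls ./work/core-control-plane-ops-proof/latest
cat ./work/core-control-plane-ops-proof/latest/result.json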

Work Root Budget

./nix/test-cluster/work-root-budget.sh status
./nix/test-cluster/work-root-budget.sh enforce
./nix/test-cluster/work-root-budget.sh cleanup-advice
./nix/test-cluster/work-root-budget.sh prune-proof-logs 2

Use ./nix/test-cluster/work-root-budget.sh status for reporting, ./nix/test-cluster/work-root-budget.sh enforce when a local proof run should fail on budget overrun, and ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2 for a safer dated-proof cleanup dry-run.

The helper keeps the local proof path practical by reporting the current size of ./work, ./work/test-cluster/state, disposable runtime directories such as ./work/tmp and ./work/publishable-kvm-runtime, and the dated proof roots including ./work/provider-vm-reality-proof and ./work/hardware-smoke. The enforce mode turns those soft budgets into a non-zero local gate, and prune-proof-logs gives a safer dated-proof cleanup workflow before the final nix store gc.

Extended Measurements

nix run ./nix/test-cluster#cluster -- fresh-bench-storage

fresh-bench-storage remains useful for storage regression tracking, but it is a benchmark path, not part of the minimal canonical publish gate.

Operational Commands

nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- logs node01
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean

Validation Philosophy

  • package unit tests are useful but not sufficient
  • host-built VM clusters are the main integration signal
  • bootstrap and rollout paths must stay evaluable independently of the larger VM-hosting feature set
  • distributed storage and virtualization paths must be checked under failure, not only at steady state

Legacy And Experimental Paths

  • baremetal/vm-cluster manual launch scripts are legacy/manual, not canonical validation
  • direct nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh ... usage is a debugging path, not the publishable entrypoint
  • standalone use of netboot-control-plane or netboot-all-in-one outside the documented profiles is a debugging path, not a fourth supported profile
  • netboot-worker, Firecracker, mvisor, k8shost-cni, k8shost-controllers, and lightningstor-csi are archived non-product helpers and should not be presented as canonical entrypoints
  • netboot-base, pxe-server, vm-smoke-target, and other helper images are internal or legacy building blocks, not supported profiles by themselves