Testing

UltraCloud treats VM-first validation as the canonical local proof path and keeps the public support contract limited to three profiles.

Canonical Profiles

| Profile | Primary outputs | Required components | Optional components |
| --- | --- | --- | --- |
| single-node dev | nix run .#single-node-quickstart, nixosConfigurations.single-node-quickstart, companion install image nixosConfigurations.netboot-all-in-one | chainfire, flaredb, iam, plasmavmc, prismnet | lightningstor, coronafs, flashdns, fiberlb, apigateway, nightlight, creditservice, k8shost, deployer |
| 3-node HA control plane | nixosConfigurations.node01, node02, node03, netboot-control-plane | chainfire, flaredb, iam, nix-agent on every control-plane node, plus deployer on the bootstrap node | fleet-scheduler, node-agent, prismnet, flashdns, fiberlb, plasmavmc, lightningstor, coronafs, k8shost, apigateway, nightlight, creditservice |
| bare-metal bootstrap | nixosConfigurations.ultracloud-iso, nixosConfigurations.baremetal-qemu-control-plane, nixosConfigurations.baremetal-qemu-worker, checks.x86_64-linux.baremetal-iso-e2e | deployer, first-boot-automation, install-target, nix-agent | netboot-control-plane, netboot-worker, and netboot-all-in-one as experimental helper images, plus node-agent, fleet-scheduler, and higher-level storage or edge services after bootstrap |

Quickstart Smoke

nix flake show . --all-systems | rg -n "single|all-in-one|quickstart"
nix eval --no-eval-cache .#nixosConfigurations.single-node-quickstart.config.system.build.toplevel.drvPath --raw
nix run .#single-node-quickstart

single-node-quickstart is the supported one-box entrypoint. It boots the minimal VM stack under QEMU, waits for chainfire, flaredb, iam, prismnet, and plasmavmc, and verifies their health from inside the guest. The launcher uses the generated NixOS VM runner, so it can fall back to TCG when /dev/kvm is absent.
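
The KVM-versus-TCG fallback described above can be anticipated before launching the quickstart. The following is a minimal sketch, assuming that a readable and writable character device at /dev/kvm is a good-enough proxy for KVM being usable. The function name expected_accel is hypothetical and not part of the repository; the generated VM runner makes its own decision.

```sh
# Hypothetical helper: predict which QEMU accelerator the quickstart VM is
# likely to use. Assumption: a readable and writable character device at the
# given path (default /dev/kvm) means KVM is usable; otherwise expect TCG.
expected_accel() {
    dev="${1:-/dev/kvm}"
    if [ -c "$dev" ] && [ -r "$dev" ] && [ -w "$dev" ]; then
        echo kvm
    else
        echo tcg
    fi
}

echo "expected accelerator: $(expected_accel)"
```

If the result is tcg, the smoke should still pass, but expect a noticeably longer boot and health-check wait.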

For debugging, keep the VM alive after the smoke passes:

ULTRACLOUD_QUICKSTART_KEEP_VM=1 nix run .#single-node-quickstart

Canonical Bare-Metal Proof

nix eval --no-eval-cache .#nixosConfigurations.baremetal-qemu-control-plane.config.system.build.toplevel.drvPath --raw
nix eval --no-eval-cache .#nixosConfigurations.baremetal-qemu-worker.config.system.build.toplevel.drvPath --raw
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix build .#checks.x86_64-linux.baremetal-iso-e2e

baremetal-iso is the canonical install path for QEMU-as-bare-metal validation. It boots nixosConfigurations.ultracloud-iso, waits for /api/v1/phone-home, downloads the flake bundle from deployer, runs Disko, reboots, confirms the first post-install boot markers, and waits for nix-agent to report the desired system as active for both baremetal-qemu-control-plane and baremetal-qemu-worker. baremetal-iso-e2e runs the same flow under flake check.

Regression Guards

nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.canonical-profile-build-guards

These two checks are the fast fail-first drift gates for the supported surface:

  • canonical-profile-eval-guards: forces evaluation of every canonical profile output, including netboot-worker and netboot-all-in-one, so broken attrs fail before any long-running harness work starts.
  • canonical-profile-build-guards: realizes the canonical VM, ISO, control-plane, and helper-image outputs so build-time drift is caught even when a cluster harness is not running.
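
The fail-first ordering can be sketched as a small wrapper that runs the eval guard before the build guard and stops at the first failure. The check names are taken from the commands above; run_guards itself is a hypothetical helper, not a script shipped in the repository.

```sh
# Run the two guard checks in order and stop at the first failure, so a
# broken attribute is reported before the more expensive build guard starts.
# run_guards is a hypothetical wrapper for illustration only.
run_guards() {
    for guard in canonical-profile-eval-guards canonical-profile-build-guards; do
        echo "==> $guard"
        if ! nix build ".#checks.x86_64-linux.$guard"; then
            echo "guard failed: $guard" >&2
            return 1
        fi
    done
    echo "all guards passed"
}

# Usage: run_guards && nix run ./nix/test-cluster#cluster -- fresh-smoke
```

Chaining a long-running harness behind run_guards keeps cheap evaluation errors from surfacing only after a cluster has already started booting.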

Portable Local Proof

nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.portable-control-plane-regressions

Use this lane on Linux hosts that do not expose /dev/kvm:

  • portable-control-plane-regressions: TCG-safe aggregate check that keeps the canonical profile eval guard, deployer-bootstrap-e2e, host-lifecycle-e2e, deployer-vm-smoke, and fleet-scheduler-e2e green together.
  • It intentionally does not boot the six-node nested-KVM VM suite, so it is a developer regression path, not the publishable multi-node proof.
  • CI runs canonical-profile-eval-guards and portable-control-plane-regressions on every relevant change from .github/workflows/nix.yml.

Publishable Checks

nix run .#single-node-quickstart
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-matrix
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite
nix build .#checks.x86_64-linux.baremetal-iso-e2e
nix build .#checks.x86_64-linux.deployer-vm-smoke

Use these commands as the release-facing local proof set:

  • single-node-quickstart: productized one-command quickstart gate for the minimal VM platform profile
  • baremetal-iso: canonical bare-metal bootstrap gate covering pre-install boot, phone-home, flake bundle fetch, Disko install, reboot, post-install boot, and desired-system activation on one control-plane node plus one worker-equivalent node
  • fresh-smoke: base VM-cluster gate for the canonical multi-node topology, including readiness, core behavior, and fault injection
  • fresh-demo-vm-webapp: optional VM-hosting bundle proof for plasmavmc + prismnet with state persisted through lightningstor
  • fresh-matrix: optional composition proof for provider bundles such as prismnet + flashdns + fiberlb and plasmavmc + coronafs + lightningstor
  • run-publishable-kvm-suite.sh: reproducible wrapper that captures the KVM environment and runs the full publishable nested-KVM trio in a single command
  • baremetal-iso-e2e: flake-check wrapper around the same canonical ISO harness
  • deployer-vm-smoke: lightweight regression proving that nix-agent can activate a host-built target closure without guest-side compilation

Responsibility Coverage

  • baremetal-iso and baremetal-iso-e2e are the canonical proof for deployer -> installer -> nix-agent. They cover phone-home, install-plan materialization, Disko, reboot, and desired-system activation.
  • deployer-vm-smoke is the smallest regression for the same deployer -> nix-agent boundary. It proves that a node can receive a prebuilt target closure and activate it without guest-side compilation.
  • portable-control-plane-regressions keeps the main boundaries that are safe to run without KVM under continuous coverage by composing deployer-bootstrap-e2e, host-lifecycle-e2e, deployer-vm-smoke, and fleet-scheduler-e2e behind the canonical profile eval guard.
  • fresh-smoke and fresh-matrix are the canonical proof for deployer -> fleet-scheduler -> node-agent. They cover native service placement, heartbeats, failover, and runtime reconciliation.
  • fresh-smoke also covers k8shost separately from fleet-scheduler: k8shost exposes tenant pod and service semantics, while fleet-scheduler handles bare-metal host services.

The publishable nested-KVM suite consists of three fresh-* VM-cluster commands: fresh-smoke, fresh-demo-vm-webapp, and fresh-matrix. They require a Linux host with /dev/kvm and nested virtualization, and the harness stops at preflight by design when that device is absent. single-node-quickstart, baremetal-iso, baremetal-iso-e2e, deployer-vm-smoke, and portable-control-plane-regressions can run on TCG-only hosts, but they are slower without host KVM.
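
A host can be classified before choosing a lane. The sketch below is an assumption about how such a check could look, not a copy of the harness preflight: it treats a character device at /dev/kvm plus an enabled nested parameter on the kvm_intel or kvm_amd kernel module as sufficient for the nested-KVM suite. The function name pick_lane and its two optional path arguments are hypothetical.

```sh
# Hypothetical preflight: decide whether this host can run the nested-KVM
# suite or should stay on the portable lane. Assumptions: the KVM device is
# a character device at $1 (default /dev/kvm), and nested virtualization is
# reported by the kvm_intel or kvm_amd "nested" module parameter under $2
# (default /sys/module). The real harness may check different things.
pick_lane() {
    kvm_dev="${1:-/dev/kvm}"
    sys_mod="${2:-/sys/module}"
    if [ ! -c "$kvm_dev" ]; then
        echo portable
        return 0
    fi
    for param in "$sys_mod/kvm_intel/parameters/nested" \
                 "$sys_mod/kvm_amd/parameters/nested"; do
        if [ -r "$param" ]; then
            case "$(cat "$param")" in
                Y|y|1) echo publishable; return 0 ;;
            esac
        fi
    done
    echo portable
}

case "$(pick_lane)" in
    publishable) echo "host can run the fresh-* nested-KVM suite" ;;
    portable)    echo "no nested KVM: use portable-control-plane-regressions" ;;
esac
```

The path arguments exist only so the function can be exercised against a scratch directory; on a real host the defaults are what matter.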

Release-facing completion now requires both of these to be green on the same branch:

  • the canonical bare-metal proof: nix run ./nix/test-cluster#cluster -- baremetal-iso plus nix build .#checks.x86_64-linux.baremetal-iso-e2e
  • the publishable nested-KVM suite: fresh-smoke, fresh-demo-vm-webapp, and fresh-matrix, preferably through ./nix/test-cluster/run-publishable-kvm-suite.sh
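
The "both green on the same branch" rule can be expressed as a checklist over recorded results. This is a sketch only: it assumes each proof run appends a line of the form "<check-name> <pass|fail>" to a results file, which is a format invented here for illustration and not something the harness is known to produce. The names REQUIRED_CHECKS and release_ready are likewise hypothetical.

```sh
# Hypothetical release checklist. Assumption: each proof run appends a line
# "<check-name> <pass|fail>" to a results file; this format is invented for
# illustration and is not produced by the repository's harness.
REQUIRED_CHECKS="baremetal-iso baremetal-iso-e2e fresh-smoke fresh-demo-vm-webapp fresh-matrix"

release_ready() {
    results_file="$1"
    not_green=0
    for check in $REQUIRED_CHECKS; do
        # Use the last recorded status so a rerun overrides an earlier result.
        st=$(awk -v c="$check" '$1 == c { s = $2 } END { print s }' "$results_file")
        if [ "$st" != "pass" ]; then
            echo "not green: $check (${st:-missing})" >&2
            not_green=1
        fi
    done
    return "$not_green"
}

# Usage: release_ready ./work/results.txt && echo "release proof set is green"
```

A missing entry is treated the same as a failure, so skipping one of the five checks cannot be mistaken for a pass.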

Extended Measurements

nix run ./nix/test-cluster#cluster -- fresh-bench-storage

fresh-bench-storage remains useful for storage regression tracking, but it is a benchmark path, not part of the minimal canonical publish gate.

Operational Commands

nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- logs node01
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
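
Because every operational command shares the same entrypoint, a shell function can shorten day-to-day use. The wrappers below are hypothetical conveniences, not part of the repository, and the stop-before-clean ordering in cluster_reset is an assumption about what is safest rather than documented behavior.

```sh
# Hypothetical convenience wrapper around the harness entrypoint shown above.
# All arguments are passed through unchanged as the harness subcommand.
cluster() {
    nix run ./nix/test-cluster#cluster -- "$@"
}

# Stop the running cluster, then remove its state. Running stop before
# clean is an assumption, not documented behavior; clean is skipped if
# stop fails.
cluster_reset() {
    cluster stop && cluster clean
}

# Usage:
#   cluster status
#   cluster logs node01
#   cluster_reset
```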

Validation Philosophy

  • package unit tests are useful but not sufficient
  • host-built VM clusters are the main integration signal
  • bootstrap and rollout paths must stay evaluable independently of the larger VM-hosting feature set
  • distributed storage and virtualization paths must be checked under failure, not only at steady state

Legacy And Experimental Paths

  • the manual launch scripts under baremetal/vm-cluster are legacy, not canonical validation
  • direct nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh ... usage is a debugging path, not the publishable entrypoint
  • netboot-control-plane, netboot-worker, netboot-all-in-one, netboot-base, pxe-server, and other helper images are internal or experimental building blocks, not supported profiles by themselves