# UltraCloud VM Test Cluster
nix/test-cluster is the canonical local validation path for UltraCloud.
It boots six QEMU VMs, treats them as hardware-like nodes, and validates representative control-plane, worker, and gateway behavior over SSH and service endpoints.
All VM images are built on the host in a single Nix invocation and then booted as prebuilt artifacts. The guests do not compile the stack locally.
The same harness also owns the canonical bare-metal bootstrap proof: a raw-QEMU ISO flow that phones home to deployer, runs Disko, reboots, and waits for nix-agent desired-system convergence on one control-plane node and one worker-equivalent node.
That local QEMU proof is intentionally the same operator route planned for hardware. The same nixosConfigurations.ultracloud-iso image can be written to USB or attached through BMC virtual media on a physical host; QEMU with KVM is only standing in for the chassis while the install flow, phone-home, Disko, reboot, and desired-system handoff stay the same.
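As a rough sketch of that operator route, the same image can be built locally and written to removable media. The `config.system.build.isoImage` attribute path and the `/dev/sdX` target below are assumptions for illustration, not checked-in wrapper behavior; verify both before writing anything.

```bash
# Hedged sketch: build the installer ISO and write it to a USB stick.
# The isoImage attribute path and /dev/sdX are assumptions; check the flake
# outputs and your actual block device before running dd.
nix build .#nixosConfigurations.ultracloud-iso.config.system.build.isoImage
sudo dd if=result/iso/*.iso of=/dev/sdX bs=4M status=progress conv=fsync
```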
The hardware bridge now has its own canonical wrapper: nix run ./nix/test-cluster#hardware-smoke -- preflight. It writes the exact kernel parameters, expected ULTRACLOUD_MARKER lines, failure markers, and operator handoff under ./work/hardware-smoke/latest, and the same wrapper can later be rerun as run or capture when USB or BMC/Redfish transport is actually available.
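A minimal way to run the preflight and see what it hands to the operator; the file names under the proof root are whatever hardware-smoke.sh writes, not fixed here.

```bash
# Run the transport-free preflight, then look at the handoff artifacts.
nix run ./nix/test-cluster#hardware-smoke -- preflight
ls -l ./work/hardware-smoke/latest
```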
The harness keeps the install contract reusable by pushing install details into classes and pools. verify-baremetal-iso.sh now publishes node classes whose install_plan owns the install profile and stable disk selector, while node records carry only identity plus any desired-system override that is genuinely host-specific. In the canonical QEMU proof that means the node record carries the prebuilt desired_system.target_system plus the health check, and the class carries the install plan. The chassis emulates the preferred hardware-style disk selection by attaching explicit virtio serials and installing against /dev/disk/by-id/virtio-uc-control-root and /dev/disk/by-id/virtio-uc-worker-root.
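The stable-selector mechanism is ordinary QEMU/udev behavior: a virtio-blk disk attached with a `serial=` value appears inside the guest as `/dev/disk/by-id/virtio-<serial>`, which is the path the class-level install plan targets. A hedged illustration, not the harness's exact invocation:

```bash
# Illustrative flags only (not the harness's command line): a virtio-blk disk
# given serial=uc-control-root shows up in the guest as
# /dev/disk/by-id/virtio-uc-control-root, the install plan's stable selector.
qemu-system-x86_64 -accel kvm -m 2048 -nographic \
  -drive file=control-root.img,if=none,format=qcow2,id=ucroot \
  -device virtio-blk-pci,drive=ucroot,serial=uc-control-root

# Inside the guest, the selector then resolves to the attached disk:
ls -l /dev/disk/by-id/virtio-uc-control-root
```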
When /dev/kvm is absent, the portable fallback is not another harness subcommand. Use the root-flake non-KVM lane instead: nix build .#checks.x86_64-linux.portable-control-plane-regressions.
When /dev/kvm and nested virtualization are available, the reproducible publishable lane is ./nix/test-cluster/run-publishable-kvm-suite.sh, which records environment metadata and then runs fresh-smoke, fresh-demo-vm-webapp, and fresh-matrix in order.
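A small sketch of how an operator or wrapper might pick between those two lanes based on `/dev/kvm`; the log directory name is only an example.

```bash
# Choose the lane from host KVM availability; ./work/kvm-suite-logs is an
# example log directory, not a path the harness requires.
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
  ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/kvm-suite-logs
else
  nix build .#checks.x86_64-linux.portable-control-plane-regressions
fi
```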
nix run ./nix/test-cluster#cluster -- durability-proof is the canonical chainfire flaredb deployer backup/restore lane. It persists artifacts under ./work/durability-proof/latest, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a deployer.service restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures against the live KVM cluster.
nix run ./nix/test-cluster#cluster -- rollout-soak is the longer-running KVM companion lane for the rollout bundle and fixed-membership control plane. It rebuilds from clean local runtime state, writes dated artifacts under ./work/rollout-soak/latest, validates exactly one planned draining maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, then restarts deployer, fleet-scheduler, node-agent, chainfire, and flaredb before revalidating the live cluster. The same proof root includes scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt so the supported release boundary is recorded with the runtime evidence. The steady-state KVM nodes do not run nix-agent.service, so the lane records nix-agent scope markers instead of pretending a live-cluster nix-agent restart happened.
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof is the focused local-KVM reality lane for prismnet, flashdns, fiberlb, and plasmavmc. It writes authoritative DNS answers, FiberLB backend drain or restore artifacts, and PlasmaVMC migration or storage-handoff state under ./work/provider-vm-reality-proof/latest.
./nix/test-cluster/run-core-control-plane-ops-proof.sh is the focused operator lifecycle proof for chainfire, flaredb, and iam. It records the ChainFire fixed-membership boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under ./work/core-control-plane-ops-proof.
./nix/test-cluster/work-root-budget.sh is the checked helper for local disk budget reporting, stronger local enforcement, and safer cleanup guidance under ./work.
The dated 2026-04-10 artifact root for the focused control-plane proof is ./work/core-control-plane-ops-proof/20260410T172148+09:00.
Runner-specific workflow wiring from task/f5c70db0-baseline-profiles is intentionally excluded from this re-aggregated baseline; the checked-in artifact here is the local wrapper.
## What it validates
- 3-node control-plane formation for `chainfire`, `flaredb`, and `iam`
- control-plane service health for `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, and `k8shost`
- worker-node `plasmavmc` and `lightningstor` startup, including KVM-only PlasmaVMC worker registration on the supported public surface
- LightningStor bucket metadata and explicit object-version APIs on the optional storage surface
- PrismNet port binding for PlasmaVMC guests, including lifecycle cleanup on VM deletion
- nested KVM inside worker VMs by booting an inner guest with `qemu-system-x86_64 -accel kvm` (a minimal probe sketch follows this list)
- gateway-node `apigateway`, `nightlight`, and `creditservice` quota, wallet, reservation, and admission flows
- host-forwarded access to the API gateway and NightLight HTTP surfaces
- cross-node data replication smoke tests for `chainfire` and `flaredb`
- deployer-seeded native runtime scheduling from declarative Nix service definitions, including drain/failover recovery
- ISO-based bare-metal bootstrap from `nixosConfigurations.ultracloud-iso` through phone-home, flake bundle fetch, Disko install, reboot, and desired-system activation
- durability and restore coverage for `chainfire`, `flaredb`, `deployer`, `coronafs`, and `lightningstor`
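For the nested-KVM item above, a minimal probe of the kind the harness automates, run from inside a worker VM; it only checks that the device and accelerator are present rather than booting the real inner guest.

```bash
# Inside node04 or node05: confirm nested KVM looks usable before the harness
# boots its inner guest. This is a probe, not the harness's actual test.
test -w /dev/kvm && echo "/dev/kvm is writable"
qemu-system-x86_64 -accel help   # the printed list should include "kvm"
```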
The supported k8shost coverage here is the k8shost-server API surface. k8shost is fixed as an API/control-plane product surface; runtime dataplane helpers stay archived non-product. Archived k8shost-cni, k8shost-controllers, and lightningstor-csi scaffolds stay outside the canonical profiles and are not part of the publishable proof.
## Validation layers
- image build: build all six VM derivations on the host in one `nix build`
- boot and unit readiness: boot the nodes in dependency order and wait for SSH plus the expected `systemd` units
- protocol surfaces: probe the expected HTTP, TCP, UDP, and metrics endpoints for each role
- replicated state: write and read convergence checks across the 3-node `chainfire` and `flaredb` clusters
- worker virtualization: launch a nested KVM guest inside both worker VMs
- external entrypoints: verify host-forwarded API gateway and NightLight access from outside the guest
- auth-integrated add-ons: confirm `creditservice` stays up, connects to IAM, and serves the published quota and wallet flows
- workload API contract: confirm `k8shost` pod watches return bounded snapshot streams and that LightningStor bucket metadata or version-listing RPCs round-trip against the live cluster
## Requirements
- minimal host requirements:
  - Linux host with readable and writable `/dev/kvm`
  - nested virtualization enabled on the host hypervisor
  - `nix`
  - enough free space under `./work` or `ULTRACLOUD_WORK_ROOT`
- if you do not use `nix run` or `nix develop`, install: `qemu-system-x86_64`, `ssh`, `sshpass`, `curl`
The checked-in wrappers force local Nix builders and derive parallelism from host CPU count by default. Override with ULTRACLOUD_LOCAL_NIX_MAX_JOBS, ULTRACLOUD_LOCAL_NIX_BUILD_CORES, PHOTON_CLUSTER_NIX_MAX_JOBS, or PHOTON_CLUSTER_NIX_BUILD_CORES when a host needs different scheduling.
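For example, a constrained host might cap Nix scheduling like this; the values are illustrative, and the wrappers otherwise derive them from the CPU count.

```bash
# Illustrative scheduling caps for a constrained host.
ULTRACLOUD_LOCAL_NIX_MAX_JOBS=2 \
ULTRACLOUD_LOCAL_NIX_BUILD_CORES=4 \
  nix run ./nix/test-cluster#cluster -- fresh-smoke
```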
## Main commands
nix run ./nix/test-cluster#cluster -- build
nix run ./nix/test-cluster#cluster -- start
nix run ./nix/test-cluster#cluster -- smoke
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
nix run ./nix/test-cluster#cluster -- rollout-soak
nix run ./nix/test-cluster#cluster -- durability-proof
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- validate
nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
make cluster-smoke
Preferred entrypoint for publishable verification: nix run ./nix/test-cluster#cluster -- fresh-smoke
Preferred entrypoint for publishable bare-metal bootstrap verification: nix run ./nix/test-cluster#cluster -- baremetal-iso
Preferred entrypoint for the exact host-KVM bare-metal proof lane: nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e <log-dir>
Preferred entrypoint for physical-node preflight and handoff: nix run ./nix/test-cluster#hardware-smoke -- preflight
Preferred entrypoint for portable local verification on TCG-only hosts: nix build .#checks.x86_64-linux.portable-control-plane-regressions
Preferred entrypoint for reproducible KVM-suite reruns: ./nix/test-cluster/run-publishable-kvm-suite.sh <log-dir>
Preferred entrypoint for the full supported-surface proof on a local AMD/KVM host: ./nix/test-cluster/run-supported-surface-final-proof.sh <log-dir>
Preferred entrypoint for focused ChainFire, FlareDB, and IAM operator lifecycle verification: ./nix/test-cluster/run-core-control-plane-ops-proof.sh <log-dir>
Preferred entrypoint for local disk budget reporting: ./nix/test-cluster/work-root-budget.sh status
Preferred entrypoint for local budget enforcement: ./nix/test-cluster/work-root-budget.sh enforce
Preferred entrypoint for safer dated-proof cleanup dry-runs: ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2
make cluster-smoke is a convenience wrapper for the same clean host-build VM validation flow.
nix run ./nix/test-cluster#cluster -- demo-vm-webapp creates a PrismNet-attached VM, boots a tiny web app inside the guest, stores its counter in FlareDB, writes JSON snapshots to LightningStor object storage, and then proves that the state survives guest restart plus cross-worker migration. The attached data volume is still used by the guest for its local bootstrap config.
nix run ./nix/test-cluster#cluster -- serve-vm-webapp runs the same VM web app flow but leaves the guest running and prints a http://127.0.0.1:<port>/ URL that is forwarded from the host into the tenant network so you can inspect /state or send POST /visit yourself.
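Once serve-vm-webapp has printed its forwarded URL, inspecting the app from the host is plain curl; substitute the port the command printed.

```bash
# Substitute the port printed by serve-vm-webapp.
curl http://127.0.0.1:<port>/state          # read the current counter state
curl -X POST http://127.0.0.1:<port>/visit  # record one visit
```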
nix run ./nix/test-cluster#cluster -- matrix reuses the current running cluster to exercise composed service scenarios such as prismnet + flashdns + fiberlb, PrismNet-backed VM hosting with plasmavmc + prismnet + coronafs + lightningstor, the Kubernetes-style hosting bundle, and API-gateway-mediated nightlight / creditservice flows.
Preferred entrypoint for publishable matrix verification: nix run ./nix/test-cluster#cluster -- fresh-matrix
Preferred entrypoint for focused provider and VM-hosting reality verification: nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
Preferred entrypoint for longer-running rollout maintenance and DR verification: nix run ./nix/test-cluster#cluster -- rollout-soak
Preferred entrypoint for durability and restore verification: nix run ./nix/test-cluster#cluster -- durability-proof
The dated 2026-04-10 proof root for that lane is ./work/durability-proof/20260410T120618+0900; result.json records success=true, and the artifact set includes deployer-post-restart-list.json, coronafs-node04-local-state.json, and lightningstor-head-during-node05-outage.json.
The dated 2026-04-10 proof root for the provider and VM-hosting lane is ./work/provider-vm-reality-proof/20260410T135827+0900; result.json records success=true, and the artifact set includes network-provider/fiberlb-drain-summary.txt, network-provider/flashdns-service-authoritative-answer.txt, and vm-hosting/migration-summary.json.
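Both dated proof roots record their verdict in result.json, so a quick post-run check from the repository root is a jq query over those artifacts; the `success` field is the one the result files described above carry.

```bash
# Confirm the recorded verdicts of the two dated 2026-04-10 proof runs.
jq '.success' './work/durability-proof/20260410T120618+0900/result.json'
jq '.success' './work/provider-vm-reality-proof/20260410T135827+0900/result.json'
```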
## Rollout Bundle Operator Contract
The supported operator contract for deployer, fleet-scheduler, nix-agent, and node-agent is fixed in ../../docs/rollout-bundle.md.
- `deployer` is supported as one active writer with restart or cold-standby restore. Automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release.
- `nix-agent` health-check and rollback behavior is proven by `nix build .#checks.x86_64-linux.deployer-vm-rollback`, while `baremetal-iso` and `baremetal-iso-e2e` prove the same desired-system handoff with the installer in front.
- `fresh-smoke` is the canonical KVM proof for `fleet-scheduler` drain, maintenance, and failover semantics. It drains `node04`, checks relocation to `node05`, restores `node04`, then stops `node05` and verifies failover plus replica restoration when the worker returns.
- `rollout-soak` is the longer-running companion for that same contract. It proves the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states on the two native-runtime workers, then restarts the rollout services and the fixed-membership control-plane services before rechecking the live runtime state. The dated 2026-04-10 release-grade artifact root is `./work/rollout-soak/20260410T164549+0900`.
- `node-agent` product scope is host-local runtime reconcile only. Logs and pid metadata live under `${stateDir}/pids`, secrets must already exist in the rendered spec or mounted files, host-path volumes are pass-through only, and upgrades are replace-and-reconcile operations.
nix run ./nix/test-cluster#cluster -- bench-storage benchmarks CoronaFS controller-export vs node-local-export I/O, worker-side materialization latency, and LightningStor large/small-object S3 throughput, then writes a report to docs/storage-benchmarks.md.
Preferred entrypoint for publishable storage numbers: nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- bench-coronafs-local-matrix runs the local single-process CoronaFS export benchmark across the supported cache/aio combinations so software-path regressions can be separated from VM-lab network limits.
On the current lab hosts, cache=none with aio=io_uring is the strongest local-export profile and should be treated as the reference point when CoronaFS remote numbers are being distorted by the nested-QEMU/VDE network path.
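As a rough host-side way to compare the same cache/aio combinations outside the harness, `qemu-img bench` takes a cache mode and an aio mode directly; this is only an illustration, and the CoronaFS local-matrix lane remains the authoritative benchmark.

```bash
# Illustrative host-side comparison of the two profiles discussed above.
qemu-img create -f raw bench.img 1G
qemu-img bench -f raw -t none -i io_uring -c 4096 -s 64K bench.img
qemu-img bench -f raw -t writeback -i threads -c 4096 -s 64K bench.img
```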
## Advanced usage
Use the script entrypoint only for local debugging inside a prepared Nix shell:
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh smoke
For the strongest local check, use:
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh fresh-smoke
## Runtime state
The harness stores build links and VM runtime state under ${PHOTON_CLUSTER_WORK_ROOT:-$REPO_ROOT/work/test-cluster} by default, with VM disks under ${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state} and VDE switch state under ${PHOTON_CLUSTER_VDE_SWITCH_DIR:-$PHOTON_CLUSTER_WORK_ROOT/vde-switch}. Alternate build profiles use profile-suffixed siblings such as ${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state}-storage.
The publishable KVM wrapper keeps its logs under the path you pass in, defaults runtime/cache state to ./work/publishable-kvm-runtime, and defaults temporary files to ./work/tmp.
Logs for each VM are written to <state-dir>/<node>/vm.log.
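For example, following one node's console output under the default work root, assuming the command runs from the repository root and no override is set (the node name is just an example):

```bash
# Follow one VM's console log under the default state directory.
tail -f "${PHOTON_VM_DIR:-./work/test-cluster/state}/node04/vm.log"
```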
Use ./nix/test-cluster/work-root-budget.sh status for disk budget reporting, ./nix/test-cluster/work-root-budget.sh enforce when a local proof run should fail once tracked paths exceed soft budgets, and ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2 for a safer dated-proof cleanup dry-run. The helper reports the size of ./work, ./work/test-cluster/state, disposable runtime roots, and dated proof directories including ./work/rollout-soak, ./work/provider-vm-reality-proof, and ./work/hardware-smoke, then prints a safe cleanup sequence that stops the cluster, removes transient VM state, trims old proof logs, and finally runs a Nix store GC once old result symlinks are no longer needed.
./work/hardware-smoke is the proof root for physical-node bring-up attempts. hardware-smoke.sh keeps latest pointed at the newest preflight or capture run so transport-free blocked state and real hardware evidence land in the same place.
## Scope note
This harness is intentionally VM-first, but the canonical bare-metal install proof also lives here so the docs, harness, and flake check all exercise the same ISO route. Older ad hoc launch scripts under baremetal/vm-cluster are legacy/manual paths, nixosConfigurations.netboot-worker is an archived worker helper outside the canonical guard set, and only netboot-all-in-one plus netboot-control-plane remain companion images for the supported profiles.