
UltraCloud VM Test Cluster

nix/test-cluster is the canonical local validation path for UltraCloud. It boots six QEMU VMs, treats them as hardware-like nodes, and validates representative control-plane, worker, and gateway behavior over SSH and service endpoints. All VM images are built on the host in a single Nix invocation and then booted as prebuilt artifacts. The guests do not compile the stack locally. The same harness also owns the canonical bare-metal bootstrap proof: a raw-QEMU ISO flow that phones home to deployer, runs Disko, reboots, and waits for nix-agent desired-system convergence on one control-plane node and one worker-equivalent node.

That local QEMU proof is intentionally the same operator route planned for hardware. The same nixosConfigurations.ultracloud-iso image can be written to USB or attached through BMC virtual media on a physical host; QEMU with KVM is only standing in for the chassis while the install flow, phone-home, Disko, reboot, and desired-system handoff stay the same.
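
As a sketch of that route, and assuming the standard NixOS isoImage build attribute plus an illustrative USB device path, writing the installer image to a stick could look like:

# Sketch only: the isoImage attribute path and the /dev/sdX target are assumptions, not checked commands.
nix build .#nixosConfigurations.ultracloud-iso.config.system.build.isoImage
sudo dd if=result/iso/*.iso of=/dev/sdX bs=4M status=progress conv=fsync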

The hardware bridge now has its own canonical wrapper: nix run ./nix/test-cluster#hardware-smoke -- preflight. It writes the exact kernel parameters, expected ULTRACLOUD_MARKER lines, failure markers, and operator handoff under ./work/hardware-smoke/latest, and the same wrapper can later be rerun with the run or capture subcommands once USB or BMC/Redfish transport is actually available.
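
For example, a preflight run followed by a look at the handoff it records:

nix run ./nix/test-cluster#hardware-smoke -- preflight
ls ./work/hardware-smoke/latest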

The harness keeps the install contract reusable by pushing install details into classes and pools. verify-baremetal-iso.sh now publishes node classes whose install_plan owns the install profile and stable disk selector, while node records carry only identity plus any desired-system override that is genuinely host-specific. In the canonical QEMU proof that means the node record carries the prebuilt desired_system.target_system plus the health check, and the class carries the install plan. The chassis emulates the preferred hardware-style disk selection by attaching explicit virtio serials and installing against /dev/disk/by-id/virtio-uc-control-root and /dev/disk/by-id/virtio-uc-worker-root.
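
A quick way to see those stable selectors from inside a booted guest (for example after nix run ./nix/test-cluster#cluster -- ssh node04) is a plain by-id listing; the grep pattern simply matches the serials named above:

ls -l /dev/disk/by-id/ | grep virtio-uc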

When /dev/kvm is absent, the portable fallback is not another harness subcommand. Use the root-flake non-KVM lane instead: nix build .#checks.x86_64-linux.portable-control-plane-regressions. When /dev/kvm and nested virtualization are available, the reproducible publishable lane is ./nix/test-cluster/run-publishable-kvm-suite.sh, which records environment metadata and then runs fresh-smoke, fresh-demo-vm-webapp, fresh-matrix, and chainfire-live-membership-proof in order.

nix run ./nix/test-cluster#cluster -- durability-proof is the canonical chainfire, flaredb, and deployer backup/restore lane. It persists artifacts under ./work/durability-proof/latest, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a deployer.service restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures against the live KVM cluster.

nix run ./nix/test-cluster#cluster -- rollout-soak is the longer-running KVM companion lane for the rollout bundle and the canonical control plane. It rebuilds from clean local runtime state, writes dated artifacts under ./work/rollout-soak/latest, validates exactly one planned draining maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, then restarts deployer, fleet-scheduler, node-agent, chainfire, and flaredb before revalidating the live cluster. The same proof root includes scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt so the supported release boundary is recorded alongside the runtime evidence. The steady-state KVM nodes do not run nix-agent.service, so the lane records nix-agent scope markers instead of pretending a live-cluster nix-agent restart happened.

nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof is the focused local-KVM live-reconfiguration lane for the ChainFire control plane. It rebuilds from clean local runtime state, launches a temporary ChainFire replica on node04, proves learner add plus local replication, voter promotion, live leader transfer to another voting member, temporary-voter restart and rejoin, current-leader removal followed by re-election, removed-leader re-add, and final scale-in back to the canonical 3-node control-plane shape, and stores artifacts under ./work/chainfire-live-membership-proof/latest.

nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof is the focused local-KVM reality lane for prismnet, flashdns, fiberlb, and plasmavmc. It writes authoritative DNS answers, FiberLB backend drain or restore artifacts, and PlasmaVMC migration or storage-handoff state under ./work/provider-vm-reality-proof/latest.

./nix/test-cluster/run-core-control-plane-ops-proof.sh is the focused operator lifecycle proof for chainfire, flaredb, and iam. It records the published ChainFire live-membership API boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under ./work/core-control-plane-ops-proof. The dated 2026-04-10 artifact root for this focused control-plane proof is ./work/core-control-plane-ops-proof/20260410T172148+09:00.

./nix/test-cluster/work-root-budget.sh is the checked-in helper for local disk budget reporting, stronger local enforcement, and safer cleanup guidance under ./work.

The repository-owned remote entrypoint for the same publishable KVM proof is .github/workflows/kvm-publishable-selfhosted.yml. It runs the local wrapper on Forgejo runners labeled nix-host and cn-nixos-mouse-runner.
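
A minimal lane-selection sketch for a local run, assuming nested virtualization is already enabled and that ./work/publishable-kvm-logs is just an illustrative choice of log directory:

# Pick the validation lane based on /dev/kvm availability, as described above.
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
  ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-logs
else
  nix build .#checks.x86_64-linux.portable-control-plane-regressions
fi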

What it validates

  • 3-node control-plane formation for chainfire, flaredb, and iam
  • control-plane service health for prismnet, flashdns, fiberlb, plasmavmc, lightningstor, and k8shost
  • worker-node plasmavmc and lightningstor startup, including KVM-only PlasmaVMC worker registration on the supported public surface
  • LightningStor bucket metadata and explicit object-version APIs on the optional storage surface
  • PrismNet port binding for PlasmaVMC guests, including lifecycle cleanup on VM deletion
  • nested KVM inside worker VMs by booting an inner guest with qemu-system-x86_64 -accel kvm
  • gateway-node apigateway, nightlight, and creditservice quota, wallet, reservation, and admission flows
  • host-forwarded access to the API gateway and NightLight HTTP surfaces
  • cross-node data replication smoke tests for chainfire and flaredb
  • live ChainFire scale-out, learner promotion, leader transfer, temporary-voter restart, current-leader removal, re-add, and scale-in on the canonical control-plane shape
  • deployer-seeded native runtime scheduling from declarative Nix service definitions, including drain/failover recovery
  • ISO-based bare-metal bootstrap from nixosConfigurations.ultracloud-iso through phone-home, flake bundle fetch, Disko install, reboot, and desired-system activation
  • durability and restore coverage for chainfire, flaredb, deployer, coronafs, and lightningstor

The supported k8shost coverage here is the k8shost-server API surface. k8shost is fixed as an API/control-plane product surface; runtime dataplane helpers remain archived and non-product. The archived k8shost-cni, k8shost-controllers, and lightningstor-csi scaffolds stay outside the canonical profiles and are not part of the publishable proof.

Validation layers

  • image build: build all six VM derivations on the host in one nix build
  • boot and unit readiness: boot the nodes in dependency order and wait for SSH plus the expected systemd units
  • protocol surfaces: probe the expected HTTP, TCP, UDP, and metrics endpoints for each role
  • replicated state: write and read convergence checks across the 3-node chainfire and flaredb clusters
  • worker virtualization: launch a nested KVM guest inside both worker VMs
  • external entrypoints: verify host-forwarded API gateway and NightLight access from outside the guest
  • auth-integrated add-ons: confirm creditservice stays up, connects to IAM, and serves the published quota and wallet flows
  • workload API contract: confirm k8shost pod watches return bounded snapshot streams and that LightningStor bucket metadata or version-listing RPCs round-trip against the live cluster

Requirements

  • minimal host requirements:
    • Linux host with readable and writable /dev/kvm
    • nested virtualization enabled on the host hypervisor
    • nix
    • enough free space under ./work or ULTRACLOUD_WORK_ROOT
  • if you do not use nix run or nix develop, install:
    • qemu-system-x86_64
    • ssh
    • sshpass
    • curl
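
A minimal host preflight for the KVM requirements above, assuming an Intel or AMD host with the in-tree KVM modules, is:

# Expect Y or 1 from the nested parameter when nested virtualization is enabled.
[ -r /dev/kvm ] && [ -w /dev/kvm ] && echo "/dev/kvm is accessible"
cat /sys/module/kvm_amd/parameters/nested 2>/dev/null || cat /sys/module/kvm_intel/parameters/nested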

The checked-in wrappers force local Nix builders and derive parallelism from host CPU count by default. Override with ULTRACLOUD_LOCAL_NIX_MAX_JOBS, ULTRACLOUD_LOCAL_NIX_BUILD_CORES, PHOTON_CLUSTER_NIX_MAX_JOBS, or PHOTON_CLUSTER_NIX_BUILD_CORES when a host needs different scheduling.
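
For example, a host that should build with reduced parallelism can pass the overrides at invocation time (the values here are illustrative):

ULTRACLOUD_LOCAL_NIX_MAX_JOBS=2 ULTRACLOUD_LOCAL_NIX_BUILD_CORES=8 nix run ./nix/test-cluster#cluster -- build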

Main commands

nix run ./nix/test-cluster#cluster -- build
nix run ./nix/test-cluster#cluster -- start
nix run ./nix/test-cluster#cluster -- smoke
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
nix run ./nix/test-cluster#cluster -- rollout-soak
nix run ./nix/test-cluster#cluster -- durability-proof
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- validate
nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
make cluster-smoke

Preferred entrypoint for publishable verification: nix run ./nix/test-cluster#cluster -- fresh-smoke

Preferred entrypoint for publishable bare-metal bootstrap verification: nix run ./nix/test-cluster#cluster -- baremetal-iso

Preferred entrypoint for the exact host-KVM bare-metal proof lane: nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e <log-dir>

Preferred entrypoint for physical-node preflight and handoff: nix run ./nix/test-cluster#hardware-smoke -- preflight

Preferred entrypoint for portable local verification on TCG-only hosts: nix build .#checks.x86_64-linux.portable-control-plane-regressions

Preferred entrypoint for reproducible KVM-suite reruns: ./nix/test-cluster/run-publishable-kvm-suite.sh <log-dir>

Preferred entrypoint for the full supported-surface proof on a local AMD/KVM host: ./nix/test-cluster/run-supported-surface-final-proof.sh <log-dir>

Preferred entrypoint for focused ChainFire, FlareDB, and IAM operator lifecycle verification: ./nix/test-cluster/run-core-control-plane-ops-proof.sh <log-dir>

Preferred entrypoint for local disk budget reporting: ./nix/test-cluster/work-root-budget.sh status

Preferred entrypoint for local budget enforcement: ./nix/test-cluster/work-root-budget.sh enforce

Preferred entrypoint for safer dated-proof cleanup dry-runs: ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2

make cluster-smoke is a convenience wrapper for the same clean host-build VM validation flow.

nix run ./nix/test-cluster#cluster -- demo-vm-webapp creates a PrismNet-attached VM, boots a tiny web app inside the guest, stores its counter in FlareDB, writes JSON snapshots to LightningStor object storage, and then proves that the state survives guest restart plus cross-worker migration. The attached data volume is still used by the guest for its local bootstrap config.

nix run ./nix/test-cluster#cluster -- serve-vm-webapp runs the same VM web app flow but leaves the guest running and prints an http://127.0.0.1:<port>/ URL that is forwarded from the host into the tenant network so you can inspect /state or send POST /visit yourself.
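
With the printed URL, the two documented endpoints can be exercised straight from the host; substitute the forwarded port that the command prints:

curl http://127.0.0.1:<port>/state
curl -X POST http://127.0.0.1:<port>/visit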

nix run ./nix/test-cluster#cluster -- matrix reuses the current running cluster to exercise composed service scenarios such as prismnet + flashdns + fiberlb, PrismNet-backed VM hosting with plasmavmc + prismnet + coronafs + lightningstor, the Kubernetes-style hosting bundle, and API-gateway-mediated nightlight / creditservice flows.

Preferred entrypoint for publishable matrix verification: nix run ./nix/test-cluster#cluster -- fresh-matrix

Preferred entrypoint for focused ChainFire live membership verification: nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof

Preferred entrypoint for focused provider and VM-hosting reality verification: nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof

Preferred entrypoint for longer-running rollout maintenance and DR verification: nix run ./nix/test-cluster#cluster -- rollout-soak

Preferred entrypoint for durability and restore verification: nix run ./nix/test-cluster#cluster -- durability-proof

The dated 2026-04-10 proof root for that lane is ./work/durability-proof/20260410T120618+0900; result.json records success=true, and the artifact set includes deployer-post-restart-list.json, coronafs-node04-local-state.json, and lightningstor-head-during-node05-outage.json. The dated 2026-04-10 proof root for the provider and VM-hosting lane is ./work/provider-vm-reality-proof/20260410T135827+0900; result.json records success=true, and the artifact set includes network-provider/fiberlb-drain-summary.txt, network-provider/flashdns-service-authoritative-answer.txt, and vm-hosting/migration-summary.json.
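
Each dated proof root carries a result.json, so a quick success check against the most recent durability run is possible with jq (assuming jq is available on the host):

jq '.success' ./work/durability-proof/latest/result.json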

Rollout Bundle Operator Contract

The supported operator contract for deployer, fleet-scheduler, nix-agent, and node-agent is fixed in ../../docs/rollout-bundle.md.

  • deployer is supported as one active writer with restart or cold-standby restore. Automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release.
  • nix-agent health-check and rollback behavior is proven by nix build .#checks.x86_64-linux.deployer-vm-rollback, while baremetal-iso and baremetal-iso-e2e prove the same desired-system handoff with the installer in front.
  • fresh-smoke is the canonical KVM proof for fleet-scheduler drain, maintenance, and failover semantics. It drains node04, checks relocation to node05, restores node04, then stops node05 and verifies failover plus replica restoration when the worker returns.
  • rollout-soak is the longer-running companion for that same contract. It proves the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states on the two native-runtime workers, then restarts the rollout services and the canonical control-plane services before rechecking the live runtime state. The dated 2026-04-10 release-grade artifact root is ./work/rollout-soak/20260410T164549+0900.
  • node-agent product scope is host-local runtime reconcile only. Logs and pid metadata live under ${stateDir}/pids, secrets must already exist in the rendered spec or mounted files, host-path volumes are pass-through only, and upgrades are replace-and-reconcile operations.

nix run ./nix/test-cluster#cluster -- bench-storage benchmarks CoronaFS controller-export vs node-local-export I/O, worker-side materialization latency, and LightningStor large/small-object S3 throughput, then writes a report to docs/storage-benchmarks.md.

Preferred entrypoint for publishable storage numbers: nix run ./nix/test-cluster#cluster -- fresh-bench-storage

nix run ./nix/test-cluster#cluster -- bench-coronafs-local-matrix runs the local single-process CoronaFS export benchmark across the supported cache/aio combinations so software-path regressions can be separated from VM-lab network limits. On the current lab hosts, cache=none with aio=io_uring is the strongest local-export profile and should be treated as the reference point when CoronaFS remote numbers are being distorted by the nested-QEMU/VDE network path.

Advanced usage

Use the script entrypoint only for local debugging inside a prepared Nix shell:

nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh smoke

For the strongest local check, use:

nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh fresh-smoke

Runtime state

The harness stores build links and VM runtime state under ${PHOTON_CLUSTER_WORK_ROOT:-$REPO_ROOT/work/test-cluster} by default, with VM disks under ${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state} and VDE switch state under ${PHOTON_CLUSTER_VDE_SWITCH_DIR:-$PHOTON_CLUSTER_WORK_ROOT/vde-switch}. Alternate build profiles use profile-suffixed siblings such as ${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state}-storage. The publishable KVM wrapper keeps its logs under the path you pass in, defaults runtime/cache state to ./work/publishable-kvm-runtime, and defaults temporary files to ./work/tmp. Logs for each VM are written to <state-dir>/<node>/vm.log.
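
For example, to relocate all harness state and follow one node's console log (the work-root path and node name are illustrative):

export PHOTON_CLUSTER_WORK_ROOT=/var/tmp/ultracloud-work
nix run ./nix/test-cluster#cluster -- start
tail -f /var/tmp/ultracloud-work/state/node04/vm.log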

Use ./nix/test-cluster/work-root-budget.sh status for disk budget reporting, ./nix/test-cluster/work-root-budget.sh enforce when a local proof run should fail once tracked paths exceed soft budgets, and ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2 for a safer dated-proof cleanup dry-run. The helper reports the size of ./work, ./work/test-cluster/state, disposable runtime roots, and dated proof directories including ./work/rollout-soak, ./work/provider-vm-reality-proof, and ./work/hardware-smoke, then prints a safe cleanup sequence that stops the cluster, removes transient VM state, trims old proof logs, and finally runs a Nix store GC once old result symlinks are no longer needed.
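
The cleanup order the helper prints corresponds roughly to the following sketch; take the exact commands from the helper's own output:

nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
./nix/test-cluster/work-root-budget.sh prune-proof-logs 2
# Run the store GC only once old result symlinks are no longer needed.
nix-collect-garbage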

./work/hardware-smoke is the proof root for physical-node bring-up attempts. hardware-smoke.sh keeps latest pointed at the newest preflight or capture run so transport-free blocked state and real hardware evidence land in the same place.

Scope note

This harness is intentionally VM-first, but the canonical bare-metal install proof also lives here so the docs, harness, and flake check all exercise the same ISO route. Older ad hoc launch scripts under baremetal/vm-cluster are legacy/manual paths, nixosConfigurations.netboot-worker is an archived worker helper outside the canonical guard set, and only netboot-all-in-one plus netboot-control-plane remain companion images for the supported profiles.