# UltraCloud VM Test Cluster
nix/test-cluster is the canonical local validation path for UltraCloud.
It boots six QEMU VMs, treats them as hardware-like nodes, and validates representative control-plane, worker, and gateway behavior over SSH and service endpoints.
All VM images are built on the host in a single Nix invocation and then booted as prebuilt artifacts. The guests do not compile the stack locally.
The same harness also owns the canonical bare-metal bootstrap proof: a raw-QEMU ISO flow that phones home to deployer, runs Disko, reboots, and waits for nix-agent desired-system convergence on one control-plane node and one worker-equivalent node.
That local QEMU proof is intentionally the same operator route planned for hardware. The same nixosConfigurations.ultracloud-iso image can be written to USB or attached through BMC virtual media on a physical host; QEMU with KVM is only standing in for the chassis while the install flow, phone-home, Disko, reboot, and desired-system handoff stay the same.
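As a rough sketch of that operator route, the same image can be built locally and written to removable media. The `config.system.build.isoImage` attribute path and the `/dev/sdX` target below are assumptions for illustration, not checked-in wrapper behavior; verify both before writing anything.

```bash
# Hedged sketch: build the installer ISO and write it to a USB stick.
# The isoImage attribute path and /dev/sdX are assumptions; check the flake
# outputs and your actual block device before running dd.
nix build .#nixosConfigurations.ultracloud-iso.config.system.build.isoImage
sudo dd if=result/iso/*.iso of=/dev/sdX bs=4M status=progress conv=fsync
```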
The hardware bridge now has its own canonical wrapper: nix run ./nix/test-cluster#hardware-smoke -- preflight. It writes the exact kernel parameters, expected ULTRACLOUD_MARKER lines, failure markers, and operator handoff under ./work/hardware-smoke/latest, and the same wrapper can later be rerun as run or capture when USB or BMC/Redfish transport is actually available.
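A minimal way to run the preflight and see what it hands to the operator; the file names under the proof root are whatever hardware-smoke.sh writes, not fixed here.

```bash
# Run the transport-free preflight, then look at the handoff artifacts.
nix run ./nix/test-cluster#hardware-smoke -- preflight
ls -l ./work/hardware-smoke/latest
```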
The harness keeps the install contract reusable by pushing install details into classes and pools. verify-baremetal-iso.sh now publishes node classes whose install_plan owns the install profile and stable disk selector, while node records carry only identity plus any desired-system override that is genuinely host-specific. In the canonical QEMU proof that means the node record carries the prebuilt desired_system.target_system plus the health check, and the class carries the install plan. The chassis emulates the preferred hardware-style disk selection by attaching explicit virtio serials and installing against /dev/disk/by-id/virtio-uc-control-root and /dev/disk/by-id/virtio-uc-worker-root.
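The stable-selector mechanism is ordinary QEMU/udev behavior: a virtio-blk disk attached with a `serial=` value appears inside the guest as `/dev/disk/by-id/virtio-<serial>`, which is the path the class-level install plan targets. A hedged illustration, not the harness's exact invocation:

```bash
# Illustrative flags only (not the harness's command line): a virtio-blk disk
# given serial=uc-control-root shows up in the guest as
# /dev/disk/by-id/virtio-uc-control-root, the install plan's stable selector.
qemu-system-x86_64 -accel kvm -m 2048 -nographic \
  -drive file=control-root.img,if=none,format=qcow2,id=ucroot \
  -device virtio-blk-pci,drive=ucroot,serial=uc-control-root

# Inside the guest, the selector then resolves to the attached disk:
ls -l /dev/disk/by-id/virtio-uc-control-root
```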
When /dev/kvm is absent, the portable fallback is not another harness subcommand. Use the root-flake non-KVM lane instead: nix build .#checks.x86_64-linux.portable-control-plane-regressions.
When /dev/kvm and nested virtualization are available, the reproducible publishable lane is ./nix/test-cluster/run-publishable-kvm-suite.sh, which records environment metadata and then runs fresh-smoke, fresh-demo-vm-webapp, and fresh-matrix in order.
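A small sketch of how an operator or wrapper might pick between those two lanes based on `/dev/kvm`; the log directory name is only an example.

```bash
# Choose the lane from host KVM availability; ./work/kvm-suite-logs is an
# example log directory, not a path the harness requires.
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
  ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/kvm-suite-logs
else
  nix build .#checks.x86_64-linux.portable-control-plane-regressions
fi
```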
nix run ./nix/test-cluster#cluster -- durability-proof is the canonical chainfire flaredb deployer backup/restore lane. It persists artifacts under ./work/durability-proof/latest, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a deployer.service restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures against the live KVM cluster.
nix run ./nix/test-cluster#cluster -- rollout-soak is the longer-running KVM companion lane for the rollout bundle and fixed-membership control plane. It rebuilds from clean local runtime state, writes dated artifacts under ./work/rollout-soak/latest, validates exactly one planned draining maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, then restarts deployer, fleet-scheduler, node-agent, chainfire, and flaredb before revalidating the live cluster. The same proof root includes scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt so the supported release boundary is recorded with the runtime evidence. The steady-state KVM nodes do not run nix-agent.service, so the lane records nix-agent scope markers instead of pretending a live-cluster nix-agent restart happened.
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof is the focused local-KVM reality lane for prismnet, flashdns, fiberlb, and plasmavmc. It writes authoritative DNS answers, FiberLB backend drain or restore artifacts, and PlasmaVMC migration or storage-handoff state under ./work/provider-vm-reality-proof/latest.
./nix/test-cluster/run-core-control-plane-ops-proof.sh is the focused operator lifecycle proof for chainfire, flaredb, and iam. It records the ChainFire fixed-membership boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under ./work/core-control-plane-ops-proof.
./nix/test-cluster/work-root-budget.sh is the checked helper for local disk budget reporting, stronger local enforcement, and safer cleanup guidance under ./work.
The dated 2026-04-10 artifact root for the focused control-plane proof is ./work/core-control-plane-ops-proof/20260410T172148+09:00.
Runner-specific workflow wiring from task/f5c70db0-baseline-profiles is intentionally excluded from this re-aggregated baseline; the checked-in artifact here is the local wrapper.
## What it validates
- 3-node control-plane formation for `chainfire`, `flaredb`, and `iam`
- control-plane service health for `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, and `k8shost`
- worker-node `plasmavmc` and `lightningstor` startup, including KVM-only PlasmaVMC worker registration on the supported public surface
- LightningStor bucket metadata and explicit object-version APIs on the optional storage surface
- PrismNet port binding for PlasmaVMC guests, including lifecycle cleanup on VM deletion
- nested KVM inside worker VMs by booting an inner guest with `qemu-system-x86_64 -accel kvm` (a minimal probe sketch follows this list)
- gateway-node `apigateway`, `nightlight`, and `creditservice` quota, wallet, reservation, and admission flows
- host-forwarded access to the API gateway and NightLight HTTP surfaces
- cross-node data replication smoke tests for `chainfire` and `flaredb`
- deployer-seeded native runtime scheduling from declarative Nix service definitions, including drain/failover recovery
- ISO-based bare-metal bootstrap from `nixosConfigurations.ultracloud-iso` through phone-home, flake bundle fetch, Disko install, reboot, and desired-system activation
- durability and restore coverage for `chainfire`, `flaredb`, `deployer`, `coronafs`, and `lightningstor`
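For the nested-KVM item above, a minimal probe of the kind the harness automates, run from inside a worker VM; it only checks that the device and accelerator are present rather than booting the real inner guest.

```bash
# Inside node04 or node05: confirm nested KVM looks usable before the harness
# boots its inner guest. This is a probe, not the harness's actual test.
test -w /dev/kvm && echo "/dev/kvm is writable"
qemu-system-x86_64 -accel help   # the printed list should include "kvm"
```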
The supported k8shost coverage here is the k8shost-server API surface. k8shost is fixed as an API/control-plane product surface; runtime dataplane helpers stay archived non-product. Archived k8shost-cni, k8shost-controllers, and lightningstor-csi scaffolds stay outside the canonical profiles and are not part of the publishable proof.
## Validation layers
- image build: build all six VM derivations on the host in one `nix build`
- boot and unit readiness: boot the nodes in dependency order and wait for SSH plus the expected `systemd` units
- protocol surfaces: probe the expected HTTP, TCP, UDP, and metrics endpoints for each role
- replicated state: write and read convergence checks across the 3-node `chainfire` and `flaredb` clusters
- worker virtualization: launch a nested KVM guest inside both worker VMs
- external entrypoints: verify host-forwarded API gateway and NightLight access from outside the guest
- auth-integrated add-ons: confirm `creditservice` stays up, connects to IAM, and serves the published quota and wallet flows
- workload API contract: confirm `k8shost` pod watches return bounded snapshot streams and that LightningStor bucket metadata or version-listing RPCs round-trip against the live cluster
## Requirements
- minimal host requirements:
  - Linux host with readable and writable `/dev/kvm`
  - nested virtualization enabled on the host hypervisor
  - `nix`
  - enough free space under `./work` or `ULTRACLOUD_WORK_ROOT`
- if you do not use `nix run` or `nix develop`, install: `qemu-system-x86_64`, `ssh`, `sshpass`, `curl`
The checked-in wrappers force local Nix builders and derive parallelism from host CPU count by default. Override with ULTRACLOUD_LOCAL_NIX_MAX_JOBS, ULTRACLOUD_LOCAL_NIX_BUILD_CORES, PHOTON_CLUSTER_NIX_MAX_JOBS, or PHOTON_CLUSTER_NIX_BUILD_CORES when a host needs different scheduling.
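For example, a constrained host might cap Nix scheduling like this; the values are illustrative, and the wrappers otherwise derive them from the CPU count.

```bash
# Illustrative scheduling caps for a constrained host.
ULTRACLOUD_LOCAL_NIX_MAX_JOBS=2 \
ULTRACLOUD_LOCAL_NIX_BUILD_CORES=4 \
  nix run ./nix/test-cluster#cluster -- fresh-smoke
```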
## Main commands
nix run ./nix/test-cluster#cluster -- build
nix run ./nix/test-cluster#cluster -- start
nix run ./nix/test-cluster#cluster -- smoke
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
nix run ./nix/test-cluster#cluster -- rollout-soak
nix run ./nix/test-cluster#cluster -- durability-proof
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- validate
nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
make cluster-smoke
Preferred entrypoint for publishable verification: nix run ./nix/test-cluster#cluster -- fresh-smoke
Preferred entrypoint for publishable bare-metal bootstrap verification: nix run ./nix/test-cluster#cluster -- baremetal-iso
Preferred entrypoint for the exact host-KVM bare-metal proof lane: nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e <log-dir>
Preferred entrypoint for physical-node preflight and handoff: nix run ./nix/test-cluster#hardware-smoke -- preflight
Preferred entrypoint for portable local verification on TCG-only hosts: nix build .#checks.x86_64-linux.portable-control-plane-regressions
Preferred entrypoint for reproducible KVM-suite reruns: ./nix/test-cluster/run-publishable-kvm-suite.sh <log-dir>
Preferred entrypoint for the full supported-surface proof on a local AMD/KVM host: ./nix/test-cluster/run-supported-surface-final-proof.sh <log-dir>
Preferred entrypoint for focused ChainFire, FlareDB, and IAM operator lifecycle verification: ./nix/test-cluster/run-core-control-plane-ops-proof.sh <log-dir>
Preferred entrypoint for local disk budget reporting: ./nix/test-cluster/work-root-budget.sh status
Preferred entrypoint for local budget enforcement: ./nix/test-cluster/work-root-budget.sh enforce
Preferred entrypoint for safer dated-proof cleanup dry-runs: ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2
make cluster-smoke is a convenience wrapper for the same clean host-build VM validation flow.
nix run ./nix/test-cluster#cluster -- demo-vm-webapp creates a PrismNet-attached VM, boots a tiny web app inside the guest, stores its counter in FlareDB, writes JSON snapshots to LightningStor object storage, and then proves that the state survives guest restart plus cross-worker migration. The attached data volume is still used by the guest for its local bootstrap config.
nix run ./nix/test-cluster#cluster -- serve-vm-webapp runs the same VM web app flow but leaves the guest running and prints a http://127.0.0.1:<port>/ URL that is forwarded from the host into the tenant network so you can inspect /state or send POST /visit yourself.
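Once serve-vm-webapp has printed its forwarded URL, inspecting the app from the host is plain curl; substitute the port the command printed.

```bash
# Substitute the port printed by serve-vm-webapp.
curl http://127.0.0.1:<port>/state          # read the current counter state
curl -X POST http://127.0.0.1:<port>/visit  # record one visit
```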
nix run ./nix/test-cluster#cluster -- matrix reuses the current running cluster to exercise composed service scenarios such as prismnet + flashdns + fiberlb, PrismNet-backed VM hosting with plasmavmc + prismnet + coronafs + lightningstor, the Kubernetes-style hosting bundle, and API-gateway-mediated nightlight / creditservice flows.
Preferred entrypoint for publishable matrix verification: nix run ./nix/test-cluster#cluster -- fresh-matrix
Preferred entrypoint for focused provider and VM-hosting reality verification: nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
Preferred entrypoint for longer-running rollout maintenance and DR verification: nix run ./nix/test-cluster#cluster -- rollout-soak
Preferred entrypoint for durability and restore verification: nix run ./nix/test-cluster#cluster -- durability-proof
The dated 2026-04-10 proof root for that lane is ./work/durability-proof/20260410T120618+0900; result.json records success=true, and the artifact set includes deployer-post-restart-list.json, coronafs-node04-local-state.json, and lightningstor-head-during-node05-outage.json.
The dated 2026-04-10 proof root for the provider and VM-hosting lane is ./work/provider-vm-reality-proof/20260410T135827+0900; result.json records success=true, and the artifact set includes network-provider/fiberlb-drain-summary.txt, network-provider/flashdns-service-authoritative-answer.txt, and vm-hosting/migration-summary.json.
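Both dated proof roots record their verdict in result.json, so a quick post-run check from the repository root is a jq query over those artifacts; the `success` field is the one the result files described above carry.

```bash
# Confirm the recorded verdicts of the two dated 2026-04-10 proof runs.
jq '.success' './work/durability-proof/20260410T120618+0900/result.json'
jq '.success' './work/provider-vm-reality-proof/20260410T135827+0900/result.json'
```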
## Rollout Bundle Operator Contract
The supported operator contract for deployer, fleet-scheduler, nix-agent, and node-agent is fixed in ../../docs/rollout-bundle.md.
- `deployer` is supported as one active writer with restart or cold-standby restore. Automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release.
- `nix-agent` health-check and rollback behavior is proven by `nix build .#checks.x86_64-linux.deployer-vm-rollback`, while `baremetal-iso` and `baremetal-iso-e2e` prove the same desired-system handoff with the installer in front.
- `fresh-smoke` is the canonical KVM proof for `fleet-scheduler` drain, maintenance, and failover semantics. It drains `node04`, checks relocation to `node05`, restores `node04`, then stops `node05` and verifies failover plus replica restoration when the worker returns.
- `rollout-soak` is the longer-running companion for that same contract. It proves the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states on the two native-runtime workers, then restarts the rollout services and the fixed-membership control-plane services before rechecking the live runtime state. The dated 2026-04-10 release-grade artifact root is `./work/rollout-soak/20260410T164549+0900`.
- `node-agent` product scope is host-local runtime reconcile only. Logs and pid metadata live under `${stateDir}/pids`, secrets must already exist in the rendered spec or mounted files, host-path volumes are pass-through only, and upgrades are replace-and-reconcile operations.
nix run ./nix/test-cluster#cluster -- bench-storage benchmarks CoronaFS controller-export vs node-local-export I/O, worker-side materialization latency, and LightningStor large/small-object S3 throughput, then writes a report to docs/storage-benchmarks.md.
Preferred entrypoint for publishable storage numbers: nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- bench-coronafs-local-matrix runs the local single-process CoronaFS export benchmark across the supported cache/aio combinations so software-path regressions can be separated from VM-lab network limits.
On the current lab hosts, cache=none with aio=io_uring is the strongest local-export profile and should be treated as the reference point when CoronaFS remote numbers are being distorted by the nested-QEMU/VDE network path.
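As a rough host-side way to compare the same cache/aio combinations outside the harness, `qemu-img bench` takes a cache mode and an aio mode directly; this is only an illustration, and the CoronaFS local-matrix lane remains the authoritative benchmark.

```bash
# Illustrative host-side comparison of the two profiles discussed above.
qemu-img create -f raw bench.img 1G
qemu-img bench -f raw -t none -i io_uring -c 4096 -s 64K bench.img
qemu-img bench -f raw -t writeback -i threads -c 4096 -s 64K bench.img
```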
## Advanced usage
Use the script entrypoint only for local debugging inside a prepared Nix shell:
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh smoke
For the strongest local check, use:
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh fresh-smoke
## Runtime state
The harness stores build links and VM runtime state under ${PHOTON_CLUSTER_WORK_ROOT:-$REPO_ROOT/work/test-cluster} by default, with VM disks under ${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state} and VDE switch state under ${PHOTON_CLUSTER_VDE_SWITCH_DIR:-$PHOTON_CLUSTER_WORK_ROOT/vde-switch}. Alternate build profiles use profile-suffixed siblings such as ${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state}-storage.
The publishable KVM wrapper keeps its logs under the path you pass in, defaults runtime/cache state to ./work/publishable-kvm-runtime, and defaults temporary files to ./work/tmp.
Logs for each VM are written to <state-dir>/<node>/vm.log.
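For example, following one node's console output under the default work root, assuming the command runs from the repository root and no override is set (the node name is just an example):

```bash
# Follow one VM's console log under the default state directory.
tail -f "${PHOTON_VM_DIR:-./work/test-cluster/state}/node04/vm.log"
```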
Use ./nix/test-cluster/work-root-budget.sh status for disk budget reporting, ./nix/test-cluster/work-root-budget.sh enforce when a local proof run should fail once tracked paths exceed soft budgets, and ./nix/test-cluster/work-root-budget.sh prune-proof-logs 2 for a safer dated-proof cleanup dry-run. The helper reports the size of ./work, ./work/test-cluster/state, disposable runtime roots, and dated proof directories including ./work/rollout-soak, ./work/provider-vm-reality-proof, and ./work/hardware-smoke, then prints a safe cleanup sequence that stops the cluster, removes transient VM state, trims old proof logs, and finally runs a Nix store GC once old result symlinks are no longer needed.
./work/hardware-smoke is the proof root for physical-node bring-up attempts. hardware-smoke.sh keeps latest pointed at the newest preflight or capture run so transport-free blocked state and real hardware evidence land in the same place.
## Scope note
This harness is intentionally VM-first, but the canonical bare-metal install proof also lives here so the docs, harness, and flake check all exercise the same ISO route. Older ad hoc launch scripts under baremetal/vm-cluster are legacy/manual paths, nixosConfigurations.netboot-worker is an archived worker helper outside the canonical guard set, and only netboot-all-in-one plus netboot-control-plane remain companion images for the supported profiles.