# UltraCloud VM Test Cluster
`nix/test-cluster` is the canonical local validation path for UltraCloud.
It boots six QEMU VMs, treats them as hardware-like nodes, and validates representative control-plane, worker, and gateway behavior over SSH and service endpoints.
All VM images are built on the host in a single Nix invocation and then booted as prebuilt artifacts. The guests do not compile the stack locally.
The same harness also owns the canonical bare-metal bootstrap proof: a raw-QEMU ISO flow that phones home to `deployer`, runs Disko, reboots, and waits for `nix-agent` desired-system convergence on one control-plane node and one worker-equivalent node.
That local QEMU proof is intentionally the same operator route planned for hardware. The same `nixosConfigurations.ultracloud-iso` image can be written to USB or attached through BMC virtual media on a physical host; QEMU with KVM is only standing in for the chassis while the install flow, phone-home, Disko, reboot, and desired-system handoff stay the same.
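A hedged sketch of that operator route follows; the flake attribute path and the target device are assumptions, so confirm the image output name on your checkout before writing media:

```bash
# Build the installer image from the same configuration used by the QEMU proof.
# The attribute path below is an assumption; confirm the flake's actual output.
nix build .#nixosConfigurations.ultracloud-iso.config.system.build.isoImage

# Write it to removable media (replace /dev/sdX with the real USB device).
sudo dd if=result/iso/*.iso of=/dev/sdX bs=4M status=progress conv=fsync
```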
The hardware bridge now has its own canonical wrapper: `nix run ./nix/test-cluster#hardware-smoke -- preflight`. It writes the exact kernel parameters, expected `ULTRACLOUD_MARKER` lines, failure markers, and operator handoff under `./work/hardware-smoke/latest`, and the same wrapper can later be rerun as `run` or `capture` when USB or BMC/Redfish transport is actually available.
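For example, a transport-free preflight run and a quick look at what it recorded; artifact file names beyond the documented markers are illustrative:

```bash
# Run the transport-free preflight and inspect the operator handoff it writes.
nix run ./nix/test-cluster#hardware-smoke -- preflight
ls -l ./work/hardware-smoke/latest

# The expected ULTRACLOUD_MARKER lines and failure markers live in this root.
grep -r 'ULTRACLOUD_MARKER' ./work/hardware-smoke/latest
```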
The harness keeps the install contract reusable by pushing install details into classes and pools. `verify-baremetal-iso.sh` now publishes node classes whose `install_plan` owns the install profile and stable disk selector, while node records carry only identity plus any desired-system override that is genuinely host-specific. In the canonical QEMU proof that means the node record carries the prebuilt `desired_system.target_system` plus the health check, and the class carries the install plan. The chassis emulates the preferred hardware-style disk selection by attaching explicit virtio serials and installing against `/dev/disk/by-id/virtio-uc-control-root` and `/dev/disk/by-id/virtio-uc-worker-root`.
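A minimal sketch for spot-checking the stable selectors inside a booted worker; whether the `ssh` subcommand forwards a trailing remote command is an assumption, so if not, run the `ls` after logging in interactively:

```bash
# The worker install plan targets this by-id path (per the QEMU proof above);
# the control-plane equivalent is /dev/disk/by-id/virtio-uc-control-root.
nix run ./nix/test-cluster#cluster -- ssh node04 \
  ls -l /dev/disk/by-id/virtio-uc-worker-root
```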
When `/dev/kvm` is absent, the portable fallback is not another harness subcommand. Use the root-flake non-KVM lane instead: `nix build .#checks.x86_64-linux.portable-control-plane-regressions`.
When `/dev/kvm` and nested virtualization are available, the reproducible publishable lane is `./nix/test-cluster/run-publishable-kvm-suite.sh`, which records environment metadata and then runs `fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`, and `chainfire-live-membership-proof` in order.
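A minimal lane-selection sketch, with an illustrative log directory name:

```bash
# Choose the local lane based on KVM availability.
if [ -r /dev/kvm ] && [ -w /dev/kvm ]; then
  ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-logs
else
  nix build .#checks.x86_64-linux.portable-control-plane-regressions
fi
```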
`nix run ./nix/test-cluster#cluster -- durability-proof` is the canonical ChainFire, FlareDB, and Deployer backup/restore lane. It persists artifacts under `./work/durability-proof/latest`, proves logical backup and restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a `deployer.service` restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures against the live KVM cluster.
`nix run ./nix/test-cluster#cluster -- rollout-soak` is the longer-running KVM companion lane for the rollout bundle and canonical control plane. It rebuilds from clean local runtime state, writes dated artifacts under `./work/rollout-soak/latest`, validates exactly one planned `draining` maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, then restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb` before revalidating the live cluster. The same proof root includes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the supported release boundary is recorded with the runtime evidence. The steady-state KVM nodes do not run `nix-agent.service`, so the lane records `nix-agent` scope markers instead of pretending a live-cluster `nix-agent` restart happened.
`nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof` is the focused local-KVM live-reconfiguration lane for the ChainFire control plane. It rebuilds from clean local runtime state, launches a temporary ChainFire replica on `node04`, proves learner add plus local replication, voter promotion, live leader transfer to another voting member, temporary-voter restart and rejoin, current-leader removal followed by re-election, removed-leader re-add, and final scale-in back to the canonical 3-node control-plane shape, and stores artifacts under `./work/chainfire-live-membership-proof/latest`.
`nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof` is the focused local-KVM reality lane for `prismnet`, `flashdns`, `fiberlb`, and `plasmavmc`. It writes authoritative DNS answers, FiberLB backend drain or restore artifacts, and PlasmaVMC migration or storage-handoff state under `./work/provider-vm-reality-proof/latest`.
`./nix/test-cluster/run-core-control-plane-ops-proof.sh` is the focused operator lifecycle proof for `chainfire`, `flaredb`, and `iam`. It records the published ChainFire live-membership API boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under `./work/core-control-plane-ops-proof`.
`./nix/test-cluster/work-root-budget.sh` is the checked helper for local disk budget reporting, stronger local enforcement, and safer cleanup guidance under `./work`.
The dated 2026-04-10 artifact root for the focused control-plane proof is `./work/core-control-plane-ops-proof/20260410T172148+09:00`.
Runner-specific workflow wiring from `task/f5c70db0-baseline-profiles` is intentionally excluded from this re-aggregated baseline; the checked-in artifact here is the local wrapper.
## What it validates
- 3-node control-plane formation for `chainfire`, `flaredb`, and `iam`
- control-plane service health for `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, and `k8shost`
- worker-node `plasmavmc` and `lightningstor` startup, including KVM-only PlasmaVMC worker registration on the supported public surface
- LightningStor bucket metadata and explicit object-version APIs on the optional storage surface
- PrismNet port binding for PlasmaVMC guests, including lifecycle cleanup on VM deletion
- nested KVM inside worker VMs by booting an inner guest with `qemu-system-x86_64 -accel kvm`
- gateway-node `apigateway`, `nightlight`, and `creditservice` quota, wallet, reservation, and admission flows
- host-forwarded access to the API gateway and NightLight HTTP surfaces
- cross-node data replication smoke tests for `chainfire` and `flaredb`
- live ChainFire scale-out, learner promotion, leader transfer, temporary-voter restart, current-leader removal, re-add, and scale-in on the canonical control-plane shape
- deployer-seeded native runtime scheduling from declarative Nix service definitions, including drain/failover recovery
- ISO-based bare-metal bootstrap from `nixosConfigurations.ultracloud-iso` through phone-home, flake bundle fetch, Disko install, reboot, and desired-system activation
- durability and restore coverage for `chainfire`, `flaredb`, `deployer`, `coronafs`, and `lightningstor`
The supported `k8shost` coverage here is the `k8shost-server` API surface. `k8shost` is fixed as an API/control-plane product surface; runtime dataplane helpers stay archived non-product. Archived `k8shost-cni`, `k8shost-controllers`, and `lightningstor-csi` scaffolds stay outside the canonical profiles and are not part of the publishable proof.
## Validation layers
- image build: build all six VM derivations on the host in one `nix build`
- boot and unit readiness: boot the nodes in dependency order and wait for SSH plus the expected `systemd` units
- protocol surfaces: probe the expected HTTP, TCP, UDP, and metrics endpoints for each role
- replicated state: write and read convergence checks across the 3-node `chainfire` and `flaredb` clusters
- worker virtualization: launch a nested KVM guest inside both worker VMs
- external entrypoints: verify host-forwarded API gateway and NightLight access from outside the guest
- auth-integrated add-ons: confirm `creditservice` stays up, connects to IAM, and serves the published quota and wallet flows
- workload API contract: confirm `k8shost` pod watches return bounded snapshot streams and that LightningStor bucket metadata or version-listing RPCs round-trip against the live cluster
## Requirements
- minimal host requirements:
  - Linux host with readable and writable `/dev/kvm`
  - nested virtualization enabled on the host hypervisor
  - `nix`
  - enough free space under `./work` or `ULTRACLOUD_WORK_ROOT`
- if you do not use `nix run` or `nix develop`, install:
  - `qemu-system-x86_64`
  - `ssh`
  - `sshpass`
  - `curl`
The checked-in wrappers force local Nix builders and derive parallelism from host CPU count by default. Override with `ULTRACLOUD_LOCAL_NIX_MAX_JOBS`, `ULTRACLOUD_LOCAL_NIX_BUILD_CORES`, `PHOTON_CLUSTER_NIX_MAX_JOBS`, or `PHOTON_CLUSTER_NIX_BUILD_CORES` when a host needs different scheduling.
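For example, to pin the scheduling knobs for a single build on a shared host (the values are illustrative):

```bash
# Pin local Nix scheduling for one invocation.
ULTRACLOUD_LOCAL_NIX_MAX_JOBS=2 \
ULTRACLOUD_LOCAL_NIX_BUILD_CORES=8 \
  nix run ./nix/test-cluster#cluster -- build
```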
## Main commands
```bash
nix run ./nix/test-cluster#cluster -- build
nix run ./nix/test-cluster#cluster -- start
nix run ./nix/test-cluster#cluster -- smoke
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof
nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
nix run ./nix/test-cluster#cluster -- rollout-soak
nix run ./nix/test-cluster#cluster -- durability-proof
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- validate
nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
make cluster-smoke
```
- Preferred entrypoint for publishable verification: `nix run ./nix/test-cluster#cluster -- fresh-smoke`
- Preferred entrypoint for publishable bare-metal bootstrap verification: `nix run ./nix/test-cluster#cluster -- baremetal-iso`
- Preferred entrypoint for the exact host-KVM bare-metal proof lane: `nix build .#checks.x86_64-linux.baremetal-iso-e2e && ./result/bin/baremetal-iso-e2e <log-dir>`
- Preferred entrypoint for physical-node preflight and handoff: `nix run ./nix/test-cluster#hardware-smoke -- preflight`
- Preferred entrypoint for portable local verification on TCG-only hosts: `nix build .#checks.x86_64-linux.portable-control-plane-regressions`
- Preferred entrypoint for reproducible KVM-suite reruns: `./nix/test-cluster/run-publishable-kvm-suite.sh <log-dir>`
- Preferred entrypoint for the full supported-surface proof on a local AMD/KVM host: `./nix/test-cluster/run-supported-surface-final-proof.sh <log-dir>`
- Preferred entrypoint for focused ChainFire, FlareDB, and IAM operator lifecycle verification: `./nix/test-cluster/run-core-control-plane-ops-proof.sh <log-dir>`
- Preferred entrypoint for local disk budget reporting: `./nix/test-cluster/work-root-budget.sh status`
- Preferred entrypoint for local budget enforcement: `./nix/test-cluster/work-root-budget.sh enforce`
- Preferred entrypoint for safer dated-proof cleanup dry-runs: `./nix/test-cluster/work-root-budget.sh prune-proof-logs 2`
`make cluster-smoke` is a convenience wrapper for the same clean host-build VM validation flow.
`nix run ./nix/test-cluster#cluster -- demo-vm-webapp` creates a PrismNet-attached VM, boots a tiny web app inside the guest, stores its counter in FlareDB, writes JSON snapshots to LightningStor object storage, and then proves that the state survives guest restart plus cross-worker migration. The attached data volume is still used by the guest for its local bootstrap config.
`nix run ./nix/test-cluster#cluster -- serve-vm-webapp` runs the same VM web app flow but leaves the guest running and prints a `http://127.0.0.1:<port>/` URL that is forwarded from the host into the tenant network so you can inspect `/state` or send `POST /visit` yourself.
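For example, with the printed port substituted for `<port>`:

```bash
# Read the app state, record a visit, then confirm the counter advanced.
curl "http://127.0.0.1:<port>/state"
curl -X POST "http://127.0.0.1:<port>/visit"
curl "http://127.0.0.1:<port>/state"
```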
`nix run ./nix/test-cluster#cluster -- matrix` reuses the current running cluster to exercise composed service scenarios such as `prismnet + flashdns + fiberlb`, PrismNet-backed VM hosting with `plasmavmc + prismnet + coronafs + lightningstor`, the Kubernetes-style hosting bundle, and API-gateway-mediated `nightlight` / `creditservice` flows.
- Preferred entrypoint for publishable matrix verification: `nix run ./nix/test-cluster#cluster -- fresh-matrix`
- Preferred entrypoint for focused ChainFire live membership verification: `nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof`
- Preferred entrypoint for focused provider and VM-hosting reality verification: `nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof`
- Preferred entrypoint for longer-running rollout maintenance and DR verification: `nix run ./nix/test-cluster#cluster -- rollout-soak`
- Preferred entrypoint for durability and restore verification: `nix run ./nix/test-cluster#cluster -- durability-proof`
The dated 2026-04-10 proof root for that lane is `./work/durability-proof/20260410T120618+0900`; `result.json` records `success=true`, and the artifact set includes `deployer-post-restart-list.json`, `coronafs-node04-local-state.json`, and `lightningstor-head-during-node05-outage.json`.
The dated 2026-04-10 proof root for the provider and VM-hosting lane is `./work/provider-vm-reality-proof/20260410T135827+0900`; `result.json` records `success=true`, and the artifact set includes `network-provider/fiberlb-drain-summary.txt`, `network-provider/flashdns-service-authoritative-answer.txt`, and `vm-hosting/migration-summary.json`.
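Both lanes keep a `latest` pointer, so a quick pass/fail check of the most recent runs looks like this (assumes `jq` is on PATH):

```bash
# result.json records success=true on a passing run in each proof root.
jq '.success' ./work/durability-proof/latest/result.json
jq '.success' ./work/provider-vm-reality-proof/latest/result.json
```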
## Rollout Bundle Operator Contract
The supported operator contract for `deployer`, `fleet-scheduler`, `nix-agent`, and `node-agent` is fixed in [../../docs/rollout-bundle.md](../../docs/rollout-bundle.md).
- `deployer` is supported as one active writer with restart or cold-standby restore. Automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release.
- `nix-agent` health-check and rollback behavior is proven by `nix build .#checks.x86_64-linux.deployer-vm-rollback`, while `baremetal-iso` and `baremetal-iso-e2e` prove the same desired-system handoff with the installer in front.
- `fresh-smoke` is the canonical KVM proof for `fleet-scheduler` drain, maintenance, and failover semantics. It drains `node04`, checks relocation to `node05`, restores `node04`, then stops `node05` and verifies failover plus replica restoration when the worker returns.
- `rollout-soak` is the longer-running companion for that same contract. It proves the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states on the two native-runtime workers, then restarts the rollout services and the canonical control-plane services before rechecking the live runtime state. The dated 2026-04-10 release-grade artifact root is `./work/rollout-soak/20260410T164549+0900`.
- `node-agent` product scope is host-local runtime reconcile only. Logs and pid metadata live under `${stateDir}/pids`, secrets must already exist in the rendered spec or mounted files, host-path volumes are pass-through only, and upgrades are replace-and-reconcile operations.
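The proofs named in this contract can be rerun locally with the invocations they reference, gathered here for convenience:

```bash
# nix-agent health-check and rollback proof (non-KVM check).
nix build .#checks.x86_64-linux.deployer-vm-rollback

# fleet-scheduler drain, maintenance, and failover semantics (KVM).
nix run ./nix/test-cluster#cluster -- fresh-smoke

# Longer-running companion: one planned drain cycle, one fail-stop loss cycle,
# with 30-second held degraded states on the two native-runtime workers.
nix run ./nix/test-cluster#cluster -- rollout-soak
```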
`nix run ./nix/test-cluster#cluster -- bench-storage` benchmarks CoronaFS controller-export vs node-local-export I/O, worker-side materialization latency, and LightningStor large/small-object S3 throughput, then writes a report to `docs/storage-benchmarks.md`.
Preferred entrypoint for publishable storage numbers: `nix run ./nix/test-cluster#cluster -- fresh-bench-storage`
`nix run ./nix/test-cluster#cluster -- bench-coronafs-local-matrix` runs the local single-process CoronaFS export benchmark across the supported `cache`/`aio` combinations so software-path regressions can be separated from VM-lab network limits.
On the current lab hosts, `cache=none` with `aio=io_uring` is the strongest local-export profile and should be treated as the reference point when CoronaFS remote numbers are being distorted by the nested-QEMU/VDE network path.
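Taken together, a local storage-regression pass might look like this; the command order is a suggestion, not a harness requirement:

```bash
# Separate software-path regressions from VM-lab network limits first.
nix run ./nix/test-cluster#cluster -- bench-coronafs-local-matrix

# Then produce the publishable numbers (report: docs/storage-benchmarks.md).
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
```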
## Advanced usage
Use the script entrypoint only for local debugging inside a prepared Nix shell:
```bash
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh smoke
```
For the strongest local check, use:
```bash
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh fresh-smoke
```
## Runtime state
The harness stores build links and VM runtime state under `${PHOTON_CLUSTER_WORK_ROOT:-$REPO_ROOT/work/test-cluster}` by default, with VM disks under `${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state}` and VDE switch state under `${PHOTON_CLUSTER_VDE_SWITCH_DIR:-$PHOTON_CLUSTER_WORK_ROOT/vde-switch}`. Alternate build profiles use profile-suffixed siblings such as `${PHOTON_VM_DIR:-$PHOTON_CLUSTER_WORK_ROOT/state}-storage`.
The publishable KVM wrapper keeps its logs under the path you pass in, defaults runtime/cache state to `./work/publishable-kvm-runtime`, and defaults temporary files to `./work/tmp`.
Logs for each VM are written to `<state-dir>/<node>/vm.log`.
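For example, to relocate all harness state to a scratch disk and tail a node's boot log (the scratch path is illustrative):

```bash
# Relocate harness state, start the cluster, then follow one node's VM log.
export PHOTON_CLUSTER_WORK_ROOT=/scratch/ultracloud-work
nix run ./nix/test-cluster#cluster -- start
tail -f "$PHOTON_CLUSTER_WORK_ROOT/state/node04/vm.log"
```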
Use `./nix/test-cluster/work-root-budget.sh status` for disk budget reporting, `./nix/test-cluster/work-root-budget.sh enforce` when a local proof run should fail once tracked paths exceed soft budgets, and `./nix/test-cluster/work-root-budget.sh prune-proof-logs 2` for a safer dated-proof cleanup dry-run. The helper reports the size of `./work`, `./work/test-cluster/state`, disposable runtime roots, and dated proof directories including `./work/rollout-soak`, `./work/provider-vm-reality-proof`, and `./work/hardware-smoke`, then prints a safe cleanup sequence that stops the cluster, removes transient VM state, trims old proof logs, and finally runs a Nix store GC once old result symlinks are no longer needed.
`./work/hardware-smoke` is the proof root for physical-node bring-up attempts. `hardware-smoke.sh` keeps `latest` pointed at the newest preflight or capture run so transport-free blocked state and real hardware evidence land in the same place.
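A typical budget pass chains the three documented invocations:

```bash
# Report sizes, enforce soft budgets, then dry-run dated-proof pruning.
./nix/test-cluster/work-root-budget.sh status
./nix/test-cluster/work-root-budget.sh enforce
./nix/test-cluster/work-root-budget.sh prune-proof-logs 2
```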
## Scope note
This harness is intentionally VM-first, but the canonical bare-metal install proof also lives here so the docs, harness, and `flake check` all exercise the same ISO route. Older ad hoc launch scripts under `baremetal/vm-cluster` are `legacy/manual` paths, `nixosConfigurations.netboot-worker` is an archived worker helper outside the canonical guard set, and only `netboot-all-in-one` plus `netboot-control-plane` remain companion images for the supported profiles.