# UltraCloud VM Test Cluster

`nix/test-cluster` is the canonical local validation path for UltraCloud.
It boots six QEMU VMs, treats them as hardware-like nodes, and validates representative control-plane, worker, and gateway behavior over SSH and service endpoints.
All VM images are built on the host in a single Nix invocation and then booted as prebuilt artifacts. The guests do not compile the stack locally.
The same harness also owns the canonical bare-metal bootstrap proof: a raw-QEMU ISO flow that phones home to `deployer`, runs Disko, reboots, and waits for `nix-agent` desired-system convergence on one control-plane node and one worker-equivalent node.

When `/dev/kvm` is absent, the portable fallback is not another harness subcommand. Use the root-flake non-KVM lane instead: `nix build .#checks.x86_64-linux.portable-control-plane-regressions`.
When `/dev/kvm` and nested virtualization are available, the reproducible publishable lane is `./nix/test-cluster/run-publishable-kvm-suite.sh`, which records environment metadata and then runs `fresh-smoke`, `fresh-demo-vm-webapp`, and `fresh-matrix` in order.

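The lane choice described above can be sketched as a small shell helper. This is only an illustration, not part of the harness: `nested_enabled`, `choose_lane`, and `lane_command` are hypothetical names, and reading the `kvm_intel` / `kvm_amd` `nested` module parameters is an assumed heuristic for detecting nested virtualization.

```bash
# Hypothetical helpers (not shipped with the harness).

# Succeeds if any of the given files reports nested virtualization as on.
nested_enabled() (
  for f in "$@"; do
    if [ -r "$f" ]; then
      case "$(cat "$f")" in
        Y|y|1) exit 0 ;;
      esac
    fi
  done
  exit 1
)

# choose_lane [KVM_DEVICE [NESTED_PARAM_FILE...]]
# Prints "kvm" when the device exists and nesting looks enabled,
# otherwise "portable". Defaults are assumptions about a typical Linux host.
choose_lane() (
  kvm_dev="${1:-/dev/kvm}"
  if [ "$#" -gt 1 ]; then
    shift
  else
    set -- /sys/module/kvm_intel/parameters/nested \
           /sys/module/kvm_amd/parameters/nested
  fi
  if [ -e "$kvm_dev" ] && nested_enabled "$@"; then
    echo kvm
  else
    echo portable
  fi
)

# Prints the documented entrypoint for a lane without running it.
lane_command() {
  case "$1" in
    kvm) echo "./nix/test-cluster/run-publishable-kvm-suite.sh <log-dir>" ;;
    *)   echo "nix build .#checks.x86_64-linux.portable-control-plane-regressions" ;;
  esac
}
```

Running `lane_command "$(choose_lane)"` prints the entrypoint that matches the current host. The heuristic may misreport on hosts that expose nesting differently, so treat the result as a hint.
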
## What it validates

- 3-node control-plane formation for `chainfire`, `flaredb`, and `iam`
- control-plane service health for `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, and `k8shost`
- worker-node `plasmavmc` and `lightningstor` startup
- PrismNet port binding for PlasmaVMC guests, including lifecycle cleanup on VM deletion
- nested KVM inside worker VMs by booting an inner guest with `qemu-system-x86_64 -accel kvm`
- gateway-node `apigateway`, `nightlight`, and minimal `creditservice` startup
- host-forwarded access to the API gateway and NightLight HTTP surfaces
- cross-node data replication smoke tests for `chainfire` and `flaredb`
- deployer-seeded native runtime scheduling from declarative Nix service definitions, including drain/failover recovery
- ISO-based bare-metal bootstrap from `nixosConfigurations.ultracloud-iso` through phone-home, flake bundle fetch, Disko install, reboot, and desired-system activation

## Validation layers

- image build: build all six VM derivations on the host in one `nix build`
- boot and unit readiness: boot the nodes in dependency order and wait for SSH plus the expected `systemd` units
- protocol surfaces: probe the expected HTTP, TCP, UDP, and metrics endpoints for each role
- replicated state: write and read convergence checks across the 3-node `chainfire` and `flaredb` clusters
- worker virtualization: launch a nested KVM guest inside both worker VMs
- external entrypoints: verify host-forwarded API gateway and NightLight access from outside the guest
- auth-integrated minimal services: confirm `creditservice` stays up and actually connects to IAM

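The "boot and unit readiness" layer amounts to polling until a probe succeeds. A minimal sketch of such a retry loop follows. `wait_for` is a hypothetical helper, not the harness's own implementation, and the SSH target and unit name in the commented usage lines are placeholders.

```bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or ATTEMPTS is
# exhausted, sleeping DELAY_SECONDS between tries.
wait_for() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      exit 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      exit 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
)

# Example usage (placeholders: <port>, <node>, <unit>):
#   wait_for 60 2 ssh -o BatchMode=yes -o ConnectTimeout=2 -p <port> root@<node> true
#   wait_for 60 2 ssh -p <port> root@<node> systemctl is-active --quiet <unit>
```

Bounding the number of attempts keeps a node that never comes up from hanging the whole run.
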
## Requirements

- minimal host requirements:
  - Linux host with `/dev/kvm`
  - nested virtualization enabled on the host hypervisor
  - `nix`
- if you do not use `nix run` or `nix develop`, install:
  - `qemu-system-x86_64`
  - `ssh`
  - `sshpass`
  - `curl`

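If you run the harness outside `nix run` or `nix develop`, a quick preflight check along these lines can confirm the tools listed above are on `PATH`. `missing_tools` and `preflight` are hypothetical helpers written for this sketch, not part of the harness.

```bash
# Hypothetical helper: prints each named tool that is not on PATH.
missing_tools() (
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
)

# Hypothetical helper: checks the requirements listed in this section.
preflight() (
  missing="$(missing_tools nix qemu-system-x86_64 ssh sshpass curl)"
  if [ -n "$missing" ]; then
    echo "missing host tools:" $missing >&2
    exit 1
  fi
  if [ ! -e /dev/kvm ]; then
    echo "/dev/kvm not found; use the portable non-KVM lane instead" >&2
    exit 1
  fi
)
```

Call `preflight` before `start` or `smoke`. A non-zero exit status means at least one requirement is unmet.
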
## Main commands

```bash
nix run ./nix/test-cluster#cluster -- build
nix run ./nix/test-cluster#cluster -- start
nix run ./nix/test-cluster#cluster -- smoke
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix run ./nix/test-cluster#cluster -- demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- serve-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-serve-vm-webapp
nix run ./nix/test-cluster#cluster -- matrix
nix run ./nix/test-cluster#cluster -- fresh-matrix
nix run ./nix/test-cluster#cluster -- bench-storage
nix run ./nix/test-cluster#cluster -- fresh-bench-storage
nix run ./nix/test-cluster#cluster -- validate
nix run ./nix/test-cluster#cluster -- status
nix run ./nix/test-cluster#cluster -- ssh node04
nix run ./nix/test-cluster#cluster -- stop
nix run ./nix/test-cluster#cluster -- clean
make cluster-smoke
```

Preferred entrypoint for publishable verification: `nix run ./nix/test-cluster#cluster -- fresh-smoke`

Preferred entrypoint for publishable bare-metal bootstrap verification: `nix run ./nix/test-cluster#cluster -- baremetal-iso`

Preferred entrypoint for portable local verification on TCG-only hosts: `nix build .#checks.x86_64-linux.portable-control-plane-regressions`

Preferred entrypoint for reproducible KVM-suite reruns: `./nix/test-cluster/run-publishable-kvm-suite.sh <log-dir>`

`make cluster-smoke` is a convenience wrapper for the same clean host-build VM validation flow.

`nix run ./nix/test-cluster#cluster -- demo-vm-webapp` creates a PrismNet-attached VM, boots a tiny web app inside the guest, stores its counter in FlareDB, writes JSON snapshots to LightningStor object storage, and then proves that the state survives guest restart plus cross-worker migration. The attached data volume is still used by the guest for its local bootstrap config.

`nix run ./nix/test-cluster#cluster -- serve-vm-webapp` runs the same VM web app flow but leaves the guest running and prints a `http://127.0.0.1:<port>/` URL that is forwarded from the host into the tenant network so you can inspect `/state` or send `POST /visit` yourself.

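Once `serve-vm-webapp` has printed its URL, the two endpoints can be exercised from the host with `curl`. The helpers below are a hypothetical sketch: the function names are invented for illustration, the port must be taken from the printed URL, and the response format is not specified here.

```bash
# Hypothetical helper: builds a URL on the host-forwarded port.
# $1 = port printed by serve-vm-webapp, $2 = path such as /state
webapp_url() {
  printf 'http://127.0.0.1:%s%s\n' "$1" "$2"
}

# Fetch the current state from the guest web app.
show_state() {
  curl -fsS "$(webapp_url "$1" /state)"
}

# Send a POST to /visit.
record_visit() {
  curl -fsS -X POST "$(webapp_url "$1" /visit)"
}
```

For example, `record_visit <port>` followed by `show_state <port>` should let you compare the state before and after a visit.
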
`nix run ./nix/test-cluster#cluster -- matrix` reuses the current running cluster to exercise composed service scenarios such as `prismnet + flashdns + fiberlb`, PrismNet-backed VM hosting with `plasmavmc + prismnet + coronafs + lightningstor`, the Kubernetes-style hosting bundle, and API-gateway-mediated `nightlight` / `creditservice` flows.

Preferred entrypoint for publishable matrix verification: `nix run ./nix/test-cluster#cluster -- fresh-matrix`

`nix run ./nix/test-cluster#cluster -- bench-storage` benchmarks CoronaFS controller-export vs node-local-export I/O, worker-side materialization latency, and LightningStor large/small-object S3 throughput, then writes a report to `docs/storage-benchmarks.md`.

Preferred entrypoint for publishable storage numbers: `nix run ./nix/test-cluster#cluster -- fresh-bench-storage`

`nix run ./nix/test-cluster#cluster -- bench-coronafs-local-matrix` runs the local single-process CoronaFS export benchmark across the supported `cache`/`aio` combinations so software-path regressions can be separated from VM-lab network limits.
On the current lab hosts, `cache=none` with `aio=io_uring` is the strongest local-export profile and should be treated as the reference point when CoronaFS remote numbers are being distorted by the nested-QEMU/VDE network path.

## Advanced usage

Use the script entrypoint only for local debugging inside a prepared Nix shell:

```bash
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh smoke
```

For the strongest local check, use:

```bash
nix develop ./nix/test-cluster -c ./nix/test-cluster/run-cluster.sh fresh-smoke
```

## Runtime state

The harness stores build links and VM runtime state under `${PHOTON_VM_DIR:-$HOME/.ultracloud-test-cluster}` for the default profile and uses profile-suffixed siblings such as `${PHOTON_VM_DIR:-$HOME/.ultracloud-test-cluster}-storage` for alternate build profiles.
Logs for each VM are written to `<state-dir>/<node>/vm.log`.

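The path rules above can be expressed as two small helpers for locating a node's log. `state_dir` and `vm_log_path` are hypothetical names used only for this sketch. The fallback and suffix logic mirrors the expressions quoted in this section.

```bash
# Hypothetical helper: prints the state directory for a build profile.
# $1 = optional profile name (empty or omitted means the default profile)
state_dir() (
  base="${PHOTON_VM_DIR:-$HOME/.ultracloud-test-cluster}"
  if [ -n "${1:-}" ]; then
    printf '%s-%s\n' "$base" "$1"
  else
    printf '%s\n' "$base"
  fi
)

# Hypothetical helper: prints the log path for a node.
# $1 = node name, $2 = optional profile name
vm_log_path() {
  printf '%s/%s/vm.log\n' "$(state_dir "${2:-}")" "$1"
}
```

For example, `tail -f "$(vm_log_path node04)"` follows the log of `node04` in the default profile.
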
## Scope note

This harness is intentionally VM-first, but the canonical bare-metal install proof also lives here so the docs, harness, and `flake check` all exercise the same ISO route. Older ad hoc launch scripts under `baremetal/vm-cluster` are legacy/manual paths, and the `netboot-*` images remain experimental helpers rather than the supported bootstrap entrypoint.