# UltraCloud
UltraCloud is a Nix-first cloud platform workspace that assembles a small control plane, network services, VM hosting, shared storage, object storage, and gateway services into one reproducible repository.
The fastest public entrypoint is the one-command single-node quickstart. The canonical multi-node integration proof remains the six-node VM cluster under [`nix/test-cluster`](nix/test-cluster/README.md), which builds all guest images on the host, boots them as hardware-like QEMU nodes, and validates real multi-node behavior.
The canonical bare-metal bootstrap proof is the ISO-on-QEMU path under [`nix/test-cluster`](nix/test-cluster/README.md), which drives phone-home, Disko install, reboot, and desired-system convergence for one control-plane node and one worker-equivalent node.
## Components
- `chainfire`: replicated coordination store
- `flaredb`: replicated KV and metadata store
- `iam`: identity, token issuance, and authorization
- `prismnet`: tenant networking control plane
- `flashdns`: authoritative DNS service
- `fiberlb`: load balancer control plane and dataplane
- `plasmavmc`: VM control plane and worker agents
- `coronafs`: shared filesystem for mutable VM volumes
- `lightningstor`: object storage and VM image backing
- `k8shost`: Kubernetes-style hosting control plane for tenant pods and services
- `apigateway`: external API and proxy surface
- `nightlight`: metrics ingestion and query service
- `creditservice`: minimal reference quota/credit service
- `deployer`: bootstrap and phone-home deployment service that owns install plans and desired-system intent
- `fleet-scheduler`: non-Kubernetes service scheduler for bare-metal cluster services
## Quick Start
Single-node quickstart:
```bash
nix run .#single-node-quickstart
```
This app builds the minimal VM stack, boots a QEMU VM, waits for `chainfire`, `flaredb`, `iam`, `prismnet`, and `plasmavmc`, checks their health endpoints, and verifies the in-guest VM runtime prerequisites. For an interactive session, keep the VM running:
```bash
ULTRACLOUD_QUICKSTART_KEEP_VM=1 nix run .#single-node-quickstart
```
The legacy name `.#all-in-one-quickstart` is kept as an alias.
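The wait-then-verify loop the quickstart performs can be sketched as a small retry helper. This is an illustrative stand-in, not the harness's actual code; the health-endpoint URL in the comment is a placeholder:

```bash
#!/usr/bin/env bash
# Illustrative sketch of a wait-for-service loop.
# wait_for NAME TIMEOUT_SECS CMD... polls CMD once per second until it
# succeeds or the timeout elapses.
set -euo pipefail

wait_for() {
  local name="$1" timeout="$2"; shift 2
  local elapsed=0
  until "$@"; do
    elapsed=$((elapsed + 1))
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timeout waiting for $name" >&2
      return 1
    fi
    sleep 1
  done
  echo "$name is healthy"
}

# The real harness would poll each service's health endpoint, e.g.:
#   wait_for chainfire 300 curl -fsS http://localhost:PORT/healthz
# Here we poll a trivially true command so the sketch is self-contained.
wait_for chainfire 5 true   # prints "chainfire is healthy"
```

The same helper pattern covers each of `chainfire`, `flaredb`, `iam`, `prismnet`, and `plasmavmc` in turn; the quickstart only proceeds once all of them report healthy.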
Portable local proof on hosts without `/dev/kvm`:
```bash
nix build .#checks.x86_64-linux.canonical-profile-eval-guards
nix build .#checks.x86_64-linux.portable-control-plane-regressions
```
This TCG-safe lane guards against canonical profile drift and keeps the core `chainfire` / `deployer` control-plane path, the `deployer -> nix-agent` boundary, and the `fleet-scheduler -> node-agent` boundary under regression coverage without requiring nested virtualization.
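Choosing between this lane and the nested-KVM suite comes down to whether the host exposes a usable `/dev/kvm`. A minimal check might look like this (an illustrative sketch, not a script shipped in the repository):

```bash
#!/usr/bin/env bash
# Pick a validation lane based on KVM availability (illustrative sketch).
set -euo pipefail

if [ -w /dev/kvm ]; then
  lane="nested-kvm"   # the full publishable suite is possible
else
  lane="tcg"          # fall back to the portable, TCG-safe checks
fi
echo "selected lane: $lane"
```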
Publishable nested-KVM suite:
```bash
nix develop
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp
nix run ./nix/test-cluster#cluster -- fresh-matrix
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite
```
Project-done release proof now requires both halves of the public validation surface to be green:
- `baremetal-iso` and `baremetal-iso-e2e` for the canonical `deployer -> installer -> nix-agent` bare-metal bootstrap path
- the KVM publishable suite (`fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`) for the nested-KVM multi-node VM-hosting path
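A local pre-release wrapper over these two halves can be sketched as a fail-fast runner. The helper below is hypothetical (the demonstration call uses `true` so the sketch is self-contained); real usage would substitute the `nix` commands listed in this README:

```bash
#!/usr/bin/env bash
# Hypothetical fail-fast release gate: run each proof in order and stop
# at the first failure.
set -euo pipefail

run_proof() {
  local label="$1"; shift
  echo ">>> $label"
  "$@" || { echo "release gate FAILED at: $label" >&2; exit 1; }
}

# Real usage would chain the proofs from this README, e.g.:
#   run_proof baremetal-iso     nix run ./nix/test-cluster#cluster -- baremetal-iso
#   run_proof baremetal-iso-e2e nix build .#checks.x86_64-linux.baremetal-iso-e2e
#   run_proof fresh-smoke       nix run ./nix/test-cluster#cluster -- fresh-smoke
run_proof demo-proof true
echo "release gate green"
```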
Canonical bare-metal bootstrap proof:
```bash
nix run ./nix/test-cluster#cluster -- baremetal-iso
nix build .#checks.x86_64-linux.baremetal-iso-e2e
```
## Canonical Profiles
UltraCloud now fixes the public support surface to three canonical profiles:
| Profile | Primary Nix outputs | Required components | Optional components |
| --- | --- | --- | --- |
| `single-node dev` | `nix run .#single-node-quickstart`, `nixosConfigurations.single-node-quickstart`, companion install image `nixosConfigurations.netboot-all-in-one` | `chainfire`, `flaredb`, `iam`, `plasmavmc`, `prismnet` | `lightningstor`, `coronafs`, `flashdns`, `fiberlb`, `apigateway`, `nightlight`, `creditservice`, `k8shost`, `deployer` |
| `3-node HA control plane` | `nixosConfigurations.node01`, `node02`, `node03`, `netboot-control-plane` | `chainfire`, `flaredb`, `iam`, `nix-agent` on every control-plane node, plus `deployer` on the bootstrap node | `fleet-scheduler`, `node-agent`, `prismnet`, `flashdns`, `fiberlb`, `plasmavmc`, `lightningstor`, `coronafs`, `k8shost`, `apigateway`, `nightlight`, `creditservice` |
| `bare-metal bootstrap` | `nixosConfigurations.ultracloud-iso`, `nixosConfigurations.baremetal-qemu-control-plane`, `nixosConfigurations.baremetal-qemu-worker`, `checks.x86_64-linux.baremetal-iso-e2e` | `deployer`, `first-boot-automation`, `install-target`, `nix-agent` | `netboot-control-plane`, `netboot-worker`, and `netboot-all-in-one` as experimental helper images, plus `node-agent`, `fleet-scheduler`, and higher-level storage or edge services after bootstrap |
`netboot-base` is an internal helper image, not a public profile. `netboot-control-plane`, `netboot-worker`, and `netboot-all-in-one` remain experimental helper images until they implement the same phone-home and install semantics as the ISO path. Older launch flows under `baremetal/vm-cluster` are `legacy/manual`, not canonical.
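Conceptually, the eval guards pin the supported output set and fail on any drift. The sketch below illustrates that comparison with abbreviated static lists; the real guards evaluate the flake (e.g. via `nix eval`) rather than reading fixed strings:

```bash
#!/usr/bin/env bash
# Conceptual sketch of what an eval guard checks: the set of supported
# outputs must exactly match a pinned expected list, so any drift fails
# fast. The lists here are abbreviated stand-ins.
set -euo pipefail

expected=$(printf '%s\n' \
  nixosConfigurations.single-node-quickstart \
  nixosConfigurations.node01 \
  nixosConfigurations.ultracloud-iso | sort)

# In the real guard this would come from something like:
#   nix eval .#nixosConfigurations --apply builtins.attrNames
actual=$(printf '%s\n' \
  nixosConfigurations.single-node-quickstart \
  nixosConfigurations.node01 \
  nixosConfigurations.ultracloud-iso | sort)

if ! diff <(echo "$expected") <(echo "$actual") >/dev/null; then
  echo "canonical profile drift detected" >&2
  exit 1
fi
echo "canonical outputs match"
```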
## Responsibility Boundaries
- `k8shost` owns Kubernetes-style pod and service APIs for tenant workloads, then translates them into `prismnet`, `flashdns`, and `fiberlb` objects. It does not place host-native cluster daemons.
- `fleet-scheduler` owns placement and failover of host-native service instances from declarative cluster state. It consumes `node-agent` heartbeats and writes instance placement, but it does not expose tenant-facing Kubernetes semantics.
- `deployer` owns machine enrollment, `/api/v1/phone-home`, install plans, cluster metadata, and desired-system references. It decides what a node should become, but it does not execute the host-local switch.
- `nix-agent` owns host-local NixOS convergence only. It reads desired-system state from `deployer` or `chainfire`, activates the target closure, and rolls back on failed health checks.
- `node-agent` owns host-local runtime execution only. It reports heartbeats and applies scheduled service-instance state, but it does not install the base OS or rewrite desired-system targets.
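The `deployer` / `nix-agent` split above implies a simple first-boot exchange: the node reports itself to `/api/v1/phone-home` and receives its install plan. The sketch below is hypothetical; only the endpoint path comes from this README, while the deployer address and the payload fields (`machine_id`, `boot_mode`) are invented for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical phone-home sketch. Only the endpoint path
# /api/v1/phone-home comes from this README; the payload fields and the
# deployer address are invented for illustration.
set -euo pipefail

DEPLOYER_URL="${DEPLOYER_URL:-http://deployer.internal:8080}"  # assumed address
MACHINE_ID="$(cat /etc/machine-id 2>/dev/null || echo demo-machine)"

payload=$(cat <<EOF
{"machine_id": "$MACHINE_ID", "boot_mode": "iso"}
EOF
)

# A first-boot unit might POST this and act on the returned install plan:
#   curl -fsS -X POST -H 'Content-Type: application/json' \
#     -d "$payload" "$DEPLOYER_URL/api/v1/phone-home"
echo "$payload"
```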
## Main Entrypoints
- workspace flake: [flake.nix](flake.nix)
- single-node quickstart smoke: [`nix run .#single-node-quickstart`](docs/testing.md)
- portable local proof: [`nix build .#checks.x86_64-linux.portable-control-plane-regressions`](docs/testing.md)
- canonical bare-metal bootstrap smoke: [`nix run ./nix/test-cluster#cluster -- baremetal-iso`](docs/testing.md)
- canonical profile guards: [`nix build .#checks.x86_64-linux.canonical-profile-eval-guards`](docs/testing.md), [`nix build .#checks.x86_64-linux.canonical-profile-build-guards`](docs/testing.md)
- VM validation harness: [nix/test-cluster/README.md](nix/test-cluster/README.md)
- shared volume notes: [coronafs/README.md](coronafs/README.md)
- minimal quota-service rationale: [creditservice/README.md](creditservice/README.md)
- legacy/manual VM launch scripts: [baremetal/vm-cluster/README.md](baremetal/vm-cluster/README.md)
## Repository Guide
- [docs/README.md](docs/README.md): documentation entrypoint
- [docs/testing.md](docs/testing.md): validation path summary
- [docs/component-matrix.md](docs/component-matrix.md): canonical profiles and optional bundles
- [docs/storage-benchmarks.md](docs/storage-benchmarks.md): latest CoronaFS and LightningStor lab numbers
- `plans/`: design notes and exploration documents
## Scope
UltraCloud is centered on reproducible infrastructure behavior rather than polished end-user product surfaces. Some services, such as `creditservice`, are intentionally minimal reference implementations that prove integration points rather than full products.
Host-level NixOS rollout validation is also expected to stay reproducible. `baremetal-iso-e2e` is the full install-path proof, and `canonical-profile-eval-guards` plus `canonical-profile-build-guards` fail fast when supported outputs drift. `portable-control-plane-regressions` is the non-KVM developer lane that keeps the main control-plane and rollout boundaries green on TCG-only hosts before the publishable nested-KVM suite is rerun.