# Edge And Trial Surface
This document defines the supported product boundary for the edge bundle and for the lightest trial surface.
## APIGateway
APIGateway is supported as stateless replicated instances behind an external L4 or VIP layer; live in-process reload is not part of the product contract.
Supported operator contract:
1. Render gateway config from Nix or `ultracloud.cluster` generated inputs and restart or replace the process when routes, auth providers, or credit providers change.
2. Scale out by running multiple identical gateway instances behind FiberLB, an external load balancer, or another L4 or VIP distribution layer.
3. Treat route distribution as configuration rollout, not as a dynamic control-plane API.
Explicit non-supported behavior:
1. Hot route reload through an admin API or `SIGHUP`.
2. Stateful leader election or in-process config distribution between gateway replicas.
3. A release promise that every HA topology is directly exercised by `fresh-matrix`.
Current proof scope:
1. `fresh-matrix` proves the shipped single gateway-node composition on `node06`.
2. Running multiple gateway replicas is a supported operator shape, but the release-facing proof still covers only one stateless gateway instance plus restart-based rollout.
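
The restart-based rollout described above can be sketched as a small shell helper. This is an illustration only, not part of the shipped tooling: the function name `rollout_gateway_config`, its arguments, and the restart hook are all hypothetical. It installs a newly rendered config and triggers a restart only when the content actually changed, which is the whole "reload" story under this contract.

```shell
#!/usr/bin/env sh
# Sketch only: function name, paths, and restart hook are hypothetical.

# Install a freshly rendered config and restart the gateway only when the
# content changed. Prints "restarted" or "unchanged".
rollout_gateway_config() {
    rendered="$1"      # config rendered from Nix or generated cluster inputs
    active="$2"        # config file the running gateway was started with
    restart_hook="$3"  # command that restarts or replaces the process

    if [ -f "$active" ] && cmp -s "$rendered" "$active"; then
        echo "unchanged"
        return 0
    fi

    # Copy next to the target, then rename, so a restarting gateway never
    # reads a half-written file.
    tmp="$active.tmp.$$"
    cp "$rendered" "$tmp" && mv "$tmp" "$active" || return 1

    # No SIGHUP or admin API here: the process is restarted or replaced.
    $restart_hook || return 1
    echo "restarted"
}
```

On a systemd host the hook could be something like `"systemctl restart apigateway.service"`, where the unit name is an assumption. Running the same helper on each replica in turn gives a rolling configuration rollout without any coordination between gateway instances.
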
## NightLight
NightLight is supported as a single-node WAL/snapshot service; replicated HA metrics storage is not part of the product contract.
Supported operator contract:
1. Use one NightLight instance per edge bundle, per lab, or per tenant environment when you need a hard operational boundary.
2. Use `retention_days`, the WAL, and periodic snapshots as the retention contract for that instance.
3. Put shared access control in front of NightLight with APIGateway or another authenticated front door when multiple writers or readers share the same endpoint.
Explicit non-supported behavior:
1. Multi-node or quorum-backed NightLight replication.
2. Per-tenant retention enforcement inside NightLight itself.
3. Treating NightLight labels as a hard security boundary.
The supported tenant contract is therefore deployment-scoped: one NightLight instance can serve one environment or a carefully trusted shared bundle, but tenant isolation is not enforced inside the process.
## CreditService
CreditService export and backend migration are supported as offline export/import or backend-native snapshot workflows, not live mixed-writer migration.
Supported operator contract:
1. Keep CreditService scoped to quota, wallet, reservation, and admission-control behavior.
2. Use backend-native snapshots or logical API replay as the export baseline.
3. Drain or quiesce writes before moving between FlareDB, PostgreSQL, or SQLite backends.
4. Rehydrate the target backend, then cut APIGateway or callers over to the new endpoint.
Explicit non-supported behavior:
1. Finance-grade ledger ownership.
2. Live mixed-writer backend migration.
3. Turning the service into a pricing, invoicing, or settlement platform.
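
The offline migration order from the operator contract can be sketched as a fail-fast step runner. Every name below is hypothetical: the four step functions are placeholders that an operator would replace with backend-specific commands, and nothing here reflects an existing CreditService CLI. The point is only the ordering: writes are quiesced first, and callers are cut over last, and only if every earlier step succeeded.

```shell
#!/usr/bin/env sh
# Sketch only: all step functions are hypothetical placeholders.

quiesce_writes()  { :; }  # drain or block writers on the source backend
export_source()   { :; }  # backend-native snapshot or logical API replay
import_target()   { :; }  # rehydrate the target backend from the export
cutover_callers() { :; }  # point APIGateway or callers at the new endpoint

# Run the steps strictly in order and stop at the first failure, so callers
# are never cut over to a partially rehydrated backend.
migrate_credit_backend() {
    for step in quiesce_writes export_source import_target cutover_callers; do
        echo "step: $step"
        "$step" || { echo "failed: $step" >&2; return 1; }
    done
    echo "migration complete"
}
```

Because the source backend stays quiesced until cutover, a failed import leaves the original endpoint as the only writer, which keeps the workflow inside the "no live mixed-writer migration" boundary.
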
## Trial Surface
An OCI/Docker artifact is intentionally not the public trial surface.
The supported lightweight trial remains:
1. `nix build .#single-node-trial-vm`
2. `nix run .#single-node-trial`
3. `nix run .#single-node-quickstart`
That boundary exists because the supported VM-platform contract needs a guest kernel plus host KVM, `/dev/net/tun`, and OVS or libvirt semantics. A Docker or OCI image would either be host-coupled and privileged or prove a different, weaker contract.
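
A minimal host preflight for the device requirements named above might look like the following. The function `check_trial_host` is hypothetical and not part of the repository. It assumes the standard Linux device node `/dev/kvm` for host KVM, checks only that the two device nodes exist, and does not verify permissions or the OVS or libvirt side of the contract.

```shell
#!/usr/bin/env sh
# Sketch only: check_trial_host is a hypothetical helper.

# Report which device nodes needed by the VM-based trial are missing.
# The optional root prefix lets the check run against a fake tree.
check_trial_host() {
    root="${1:-}"
    missing=""
    for dev in /dev/kvm /dev/net/tun; do
        # Existence check only; access rights are not tested here.
        [ -e "$root$dev" ] || missing="$missing $dev"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

Running `check_trial_host` with no arguments before `nix run .#single-node-trial` gives an early, readable failure on hosts without KVM or TUN support.
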
## Work Root Budget
Use `./nix/test-cluster/work-root-budget.sh status` for reporting, `./nix/test-cluster/work-root-budget.sh enforce` for a stronger local budget gate, and `./nix/test-cluster/work-root-budget.sh prune-proof-logs 2` for safer dated-proof cleanup.
Recommended soft budgets on a local AMD/KVM proof host:
1. Keep `./work/test-cluster/state` under roughly 35 GiB.
2. Keep disposable runtime state such as `./work/tmp` and `./work/publishable-kvm-runtime` under roughly 10 GiB combined.
3. Keep dated proof roots trimmed so combined proof logs stay under roughly 20 GiB unless you are intentionally archiving a release snapshot.
The helper prints current sizes, highlights budget overruns, and prints safe cleanup steps such as stopping the cluster, cleaning runtime state, deleting disposable log roots, and then running a Nix store GC after old result symlinks are no longer needed. The `enforce` mode lets local proof lanes fail fast when the operator has let `./work` drift beyond the documented soft budget, and `prune-proof-logs` gives a dry-run-first workflow for trimming dated proof roots.
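
The arithmetic behind the soft budgets can be illustrated with a short sketch. This is not how `work-root-budget.sh` is implemented; the functions `sum_kib` and `budget_status` are hypothetical and only show how a combined `du` total compares against a GiB limit.

```shell
#!/usr/bin/env sh
# Sketch only: sum_kib and budget_status are hypothetical helpers.

# Combined disk usage of the given paths in KiB; missing paths count as zero.
sum_kib() {
    total=0
    for path in "$@"; do
        [ -e "$path" ] || continue
        size=$(du -sk "$path" | cut -f1)
        total=$((total + size))
    done
    echo "$total"
}

# Compare a KiB total (first argument) against a budget in GiB (second).
budget_status() {
    used_kib="$1"
    limit_kib=$(( $2 * 1024 * 1024 ))
    if [ "$used_kib" -gt "$limit_kib" ]; then
        echo "over"
    else
        echo "within"
    fi
}

# Usage with the soft budgets listed in this section:
#   budget_status "$(sum_kib ./work/test-cluster/state)" 35
#   budget_status "$(sum_kib ./work/tmp ./work/publishable-kvm-runtime)" 10
```

The budgets are treated as inclusive upper bounds here, so a total exactly at the limit still reports `within`; the real helper may draw that line differently.
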