# Rollout Bundle Operations

This document fixes the supported operator contract for the native rollout bundle:

- `deployer`
- `fleet-scheduler`
- `nix-agent`
- `node-agent`

The supported layering is still `deployer -> nix-agent` for host OS rollout and `deployer -> fleet-scheduler -> node-agent` for host-native service placement.

## Supported Scope

- `deployer` is supported as a single logical rollout authority. The supported recovery model is restart-in-place or cold-standby replacement that reuses the same `chainfire` namespace, admin and bootstrap credentials, bootstrap flake bundle, and local state backup.
- `deployer` is scope-fixed to one active writer plus optional cold-standby restore; automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release. Do not run multiple writers against the same `deployer` namespace on the assumption that automatic leader failover is safe.
- `nix-agent` is supported for host-local desired-system apply, post-activation health-check execution, and rollback to the previous known system.
- `fleet-scheduler` is scope-fixed to the two-worker native-runtime lab with one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states in `rollout-soak`; multi-hour maintenance windows, pinned singleton policies, and large-cluster drain storms are outside the supported product contract for this release.
- `node-agent` is supported for host-local runtime reconcile, process and container execution, per-instance logs, and declared host-path volume mounts. It is not a secret manager, a storage provisioner, or an in-place binary patch system.
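The one-active-writer rule can be enforced mechanically on the host. The following is a minimal sketch only, assuming a hypothetical lock path; the lock file, environment variable, and messages are illustrative and not part of the product contract:

```bash
# Hypothetical single-writer guard illustrating the "one active writer" rule.
# DEPLOYER_WRITER_LOCK and the default path are assumptions for this sketch.
LOCK="${DEPLOYER_WRITER_LOCK:-/tmp/deployer.writer.lock}"

# Hold the lock on fd 9 for the lifetime of the wrapper process.
exec 9>"$LOCK"
if flock -n 9; then
  echo "writer lock acquired; safe to start deployer"
else
  echo "another writer holds the lock; refusing to start" >&2
  exit 1
fi
```

A guard like this only protects a single host; it does not replace the contract's prohibition on running multiple writers against the same `deployer` namespace across machines.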
## Proof Commands

- `nix build .#checks.x86_64-linux.deployer-vm-smoke`
- `nix build .#checks.x86_64-linux.deployer-vm-rollback`
- `nix build .#checks.x86_64-linux.portable-control-plane-regressions`
- `nix build .#checks.x86_64-linux.fleet-scheduler-e2e`
- `nix run ./nix/test-cluster#cluster -- fresh-smoke`
- `nix run ./nix/test-cluster#cluster -- rollout-soak`
- `nix run ./nix/test-cluster#cluster -- durability-proof`

`deployer-vm-rollback` is the smallest reproducible proof for the `nix-agent` health-check and rollback path. `fresh-smoke` and `fleet-scheduler-e2e` keep the short regression semantics green.

`rollout-soak` is the longer-running KVM operator lane for one planned drain cycle, one fail-stop worker-loss cycle, and service-restart behavior across `deployer`, `fleet-scheduler`, `node-agent`, and the canonical 3-node control plane. It writes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so that the release boundary is captured in the proof root instead of being implied only by docs. The steady-state `nix/test-cluster` nodes record explicit `nix-agent` scope markers instead of pretending they run `nix-agent.service`.

`durability-proof` remains the canonical persisted-artifact lane for `deployer` backup, restart, replay, and storage-side failure injection.

## Deployer HA And DR

Supported deployer recovery is a single-writer restore runbook. `DEPLOYER-P1-01` is closed as a scope-fixed release boundary rather than an implied future HA promise:

1. Preserve the generated cluster state from `ultracloud.cluster`, the deployer bootstrap and admin credentials, and `services.deployer.localStatePath`.
2. Start exactly one `deployer` instance with the same `chainfireEndpoints`, `clusterNamespace`, `chainfireNamespace`, tokens, and optional TLS CA inputs.
3. Re-apply the canonical cluster state:

   ```bash
   deployer-ctl \
     --chainfire-endpoint http://127.0.0.1:2379 \
     --cluster-id <cluster-id> \
     --cluster-namespace ultracloud \
     --deployer-namespace deployer \
     apply --config cluster-state.json --prune
   ```

4. Replay any preserved admin pre-register requests in the same shape as `./work/durability-proof/latest/deployer-pre-register-request.json`.
5. Verify the recovered state with `curl -fsS -H 'x-deployer-token: <token>' http://<deployer-host>:8088/api/v1/admin/nodes | jq` and, for node rollout intent, `deployer-ctl node inspect --node-id <node-id> --include-desired-system --include-observed-system`.

The 2026-04-10 canonical backup-and-replay proof for this contract is `nix run ./nix/test-cluster#cluster -- durability-proof`, which recorded `deployer-pre-register-request.json`, `deployer-backup-list.json`, `deployer-post-restart-list.json`, and `deployer-replayed-list.json` under `./work/durability-proof/20260410T120618+0900`. The longer-run live-operations companion is `nix run ./nix/test-cluster#cluster -- rollout-soak`, which on 2026-04-10 recorded `deployer-post-restart-nodes.json`, `maintenance-held.json`, `power-loss-held.json`, `post-control-plane-restarts.json`, `scope-fixed-contract.json`, and `deployer-scope-fixed.txt` under `./work/rollout-soak/20260410T164549+0900` while holding degraded states and re-checking the admin inventory.

## Nix-Agent Operator Contract

- `services.nix-agent.healthCheckCommand` is an argv vector, not a shell fragment. Every entry is passed to the process directly.
- The command runs after `switch-to-configuration`.
- Exit status `0` means the desired system stays active.
- Non-zero exit with `rollbackOnFailure = true` causes rollback to the previous known system and reports observed status `rolled-back`.
- Non-zero exit with `rollbackOnFailure = false` leaves the failed generation in place and requires operator intervention.

The supported recovery flow is:
1. Inspect the desired and observed rollout state:

   ```bash
   deployer-ctl \
     --chainfire-endpoint http://127.0.0.1:2379 \
     --cluster-id <cluster-id> \
     --cluster-namespace ultracloud \
     --deployer-namespace deployer \
     node inspect \
       --node-id <node-id> \
       --include-desired-system \
       --include-observed-system
   ```

2. If the node reports `rolled-back`, fix the failed target or health-check input, then re-publish the desired system.
3. Re-run the smallest proof lane with `nix build .#checks.x86_64-linux.deployer-vm-rollback` when the issue is in the `deployer -> nix-agent` boundary, or the installer-backed `baremetal-iso` and `baremetal-iso-e2e` lanes when the issue includes first boot.

`deployer-vm-rollback` is the canonical reproducible proof for this contract. It publishes a desired system whose `health_check_command = ["false"]`, expects observed status `rolled-back`, and proves that the current system does not remain on the rejected target generation. The longer-running 2026-04-10 `rollout-soak` lane does not pretend the steady-state `nix/test-cluster` nodes are deployer-managed `nix-agent` nodes; instead it records `node01-nix-agent-scope.txt` and `node04-nix-agent-scope.txt` under `./work/rollout-soak/20260410T154744+0900`, while the executable `nix-agent` proof surface remains `deployer-vm-rollback`, `baremetal-iso`, and `baremetal-iso-e2e`.

## Fleet-Scheduler Drain And Maintenance Contract

- Use `deployer-ctl node set-state --node-id <node-id> --state draining` for planned short-lived maintenance.
- `draining` removes the node from new placement and causes the scheduler to relocate replicated work when capacity exists.
- `active` re-admits the node and allows the replica count to grow back, but healthy singleton work is not required to churn back automatically.
- Fail-stop worker loss is treated like implicit maintenance exhaustion: the scheduler restores healthy placement on the remaining eligible nodes when placement policy allows it.
- The supported release proof is limited to the two-worker native-runtime lab with one planned drain cycle and one fail-stop worker-loss cycle, each held for 30 seconds in `rollout-soak`.
- Multi-hour maintenance windows, operator approval workflows, pinned singleton drain choreography, and large-cluster drain storms remain outside the supported contract for this release.

`fresh-smoke` is the canonical KVM proof for the baseline behavior. It drains `node04`, verifies that `native-web`, `native-container`, and `native-daemon` relocate to `node05`, restores `node04`, then simulates `node05` loss and verifies failover back to `node04` plus replica restoration when `node05` returns. `rollout-soak` reruns that choreography as exactly one planned drain cycle and one fail-stop worker-loss cycle, holds each degraded state for 30 seconds, restarts the rollout services, and then rechecks the live runtime state; the 2026-04-10 run under `./work/rollout-soak/20260410T164549+0900` is the current release-grade artifact for that scope-fixed boundary. `fleet-scheduler-e2e` remains the cheap regression lane for the same scheduling semantics.

## Node-Agent Logs, Secrets, Volumes, And Upgrade Contract

- Runtime state lives under `services.node-agent.stateDir`, with pid files, metadata, and per-instance logs under `${stateDir}/pids`.
- Each managed instance writes combined stdout and stderr to `${stateDir}/pids/<service>-<instance>.pid.log`.
- Metadata is persisted beside the pid file as `${stateDir}/pids/<service>-<instance>.pid.meta.json`, including the argv and boot-id data used to reject stale pid reuse across reboot.
- Secrets are not fetched, rotated, or encrypted by `node-agent`. Supported secret delivery is limited to values already present in the rendered service spec, environment, or mounted host files.
- Volumes are declared host-path mounts from `ContainerVolumeSpec`. `node-agent` passes them through to the runtime and honors `read_only`, but it does not provision or garbage-collect those paths.
- Upgrades are replace-and-reconcile operations driven by `fleet-scheduler` state changes. `node-agent` does not patch binaries in place; it stops stale processes or containers and starts new ones from the updated spec.

`fresh-smoke`, `fresh-matrix`, `fleet-scheduler-e2e`, and `rollout-soak` are the operator proofs for the live runtime path, while the persisted process metadata in `deployer/crates/node-agent/src/process.rs` is the source of truth for the log and stale-pid contract. `rollout-soak` restarts `node-agent.service` on live worker nodes and records the longer-running restart-survival artifacts under `./work/rollout-soak/20260410T164549+0900`; `nix-agent` stays scope-fixed to its dedicated deployer and installer proofs because the steady-state KVM cluster nodes do not run `nix-agent.service`.
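The boot-id check that rejects stale pid reuse can be illustrated with a small sketch. This is not the node-agent's actual code or on-disk schema (the `boot_id` field name, file layout, and messages are assumptions for illustration); it only shows why a pid recorded before a reboot is never trusted again:

```bash
# Sketch: pid metadata is trusted only if the boot id recorded beside the
# pid file matches the current boot. Paths and field names are illustrative.
STATE_DIR=$(mktemp -d)
META="$STATE_DIR/web-0.pid.meta.json"

# Simulate metadata written during a previous boot.
printf '{"pid": 12345, "boot_id": "%s"}\n' "aaaa-bbbb-from-last-boot" > "$META"

CURRENT_BOOT=$(cat /proc/sys/kernel/random/boot_id)
RECORDED_BOOT=$(sed -n 's/.*"boot_id": "\([^"]*\)".*/\1/p' "$META")

if [ "$RECORDED_BOOT" = "$CURRENT_BOOT" ]; then
  VERDICT="pid metadata trusted"
else
  VERDICT="stale pid metadata from a previous boot; ignoring"
fi
echo "$VERDICT"
```

Because the kernel assigns a fresh `/proc/sys/kernel/random/boot_id` on every boot, any pid number recorded under an old boot id is rejected even if a new, unrelated process happens to reuse that pid.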