Rollout Bundle Operations
This document fixes the supported operator contract for the native rollout bundle:
deployer, fleet-scheduler, nix-agent, and node-agent.
The supported layering is still deployer -> nix-agent for host OS rollout and deployer -> fleet-scheduler -> node-agent for host-native service placement.
Supported Scope
- deployer is supported as a single logical rollout authority. The supported recovery model is restart-in-place or cold-standby replacement that reuses the same chainfire namespace, admin and bootstrap credentials, bootstrap flake bundle, and local state backup. deployer is scope-fixed to one active writer plus optional cold-standby restore; automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release. Do not run multiple writers against the same deployer namespace and assume automatic leader failover is safe.
- nix-agent is supported for host-local desired-system apply, post-activation health-check execution, and rollback to the previous known system.
- fleet-scheduler is scope-fixed to the two native-runtime worker lab with one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states in rollout-soak; multi-hour maintenance windows, pinned singleton policies, and large-cluster drain storms are outside the supported product contract for this release.
- node-agent is supported for host-local runtime reconcile, process and container execution, per-instance logs, and declared host-path volume mounts. It is not a secret manager, a storage provisioner, or an in-place binary patch system.
Proof Commands
nix build .#checks.x86_64-linux.deployer-vm-smoke
nix build .#checks.x86_64-linux.deployer-vm-rollback
nix build .#checks.x86_64-linux.portable-control-plane-regressions
nix build .#checks.x86_64-linux.fleet-scheduler-e2e
nix run ./nix/test-cluster#cluster -- fresh-smoke
nix run ./nix/test-cluster#cluster -- rollout-soak
nix run ./nix/test-cluster#cluster -- durability-proof
deployer-vm-rollback is the smallest reproducible proof for the nix-agent health-check and rollback path. fresh-smoke and fleet-scheduler-e2e keep the short regression semantics green. rollout-soak is the longer-running KVM operator lane for one planned drain cycle, one fail-stop worker-loss cycle, and service-restart behavior across deployer, fleet-scheduler, node-agent, and the fixed-membership control plane. It writes scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt so the release boundary is captured in the proof root instead of being implied only by docs. The steady-state nix/test-cluster nodes record explicit nix-agent scope markers instead of pretending they run nix-agent.service. durability-proof remains the canonical persisted artifact lane for deployer backup, restart, replay, and storage-side failure injection.
Deployer HA And DR
Supported deployer recovery is a single-writer restore runbook. DEPLOYER-P1-01 is closed as a scope-fixed release boundary rather than an implied future HA promise:
- Preserve the generated cluster state from ultracloud.cluster, the deployer bootstrap and admin credentials, and services.deployer.localStatePath.
- Start exactly one deployer instance with the same chainfireEndpoints, clusterNamespace, chainfireNamespace, tokens, and optional TLS CA inputs.
- Re-apply the canonical cluster state:
deployer-ctl \
--chainfire-endpoint http://127.0.0.1:2379 \
--cluster-id <cluster-id> \
--cluster-namespace ultracloud \
--deployer-namespace deployer \
apply --config cluster-state.json --prune
- Replay any preserved admin pre-register requests in the same shape as ./work/durability-proof/latest/deployer-pre-register-request.json.
- Verify the recovered state with curl -fsS -H 'x-deployer-token: <token>' http://<deployer>:8088/api/v1/admin/nodes | jq and, for node rollout intent, deployer-ctl node inspect --node-id <node> --include-desired-system --include-observed-system.
The 2026-04-10 canonical backup-and-replay proof for this contract is nix run ./nix/test-cluster#cluster -- durability-proof, which recorded deployer-pre-register-request.json, deployer-backup-list.json, deployer-post-restart-list.json, and deployer-replayed-list.json under ./work/durability-proof/20260410T120618+0900. The longer-run live-operations companion is nix run ./nix/test-cluster#cluster -- rollout-soak, which on 2026-04-10 recorded deployer-post-restart-nodes.json, maintenance-held.json, power-loss-held.json, post-control-plane-restarts.json, scope-fixed-contract.json, and deployer-scope-fixed.txt under ./work/rollout-soak/20260410T164549+0900 while holding degraded states and re-checking the admin inventory.
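A quick local consistency check after a restore is to diff the backup inventory against the post-restart inventory recorded by durability-proof. This is a minimal sketch, assuming only that both artifacts are plain JSON documents whose key ordering is irrelevant:

# Hedged sketch: compare the backup inventory with the post-restart inventory
# from the latest durability-proof run; jq -S sorts keys so ordering noise
# does not show up as drift.
PROOF=./work/durability-proof/latest
diff \
  <(jq -S . "$PROOF/deployer-backup-list.json") \
  <(jq -S . "$PROOF/deployer-post-restart-list.json") \
  && echo "post-restart inventory matches the backup"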
Nix-Agent Operator Contract
- services.nix-agent.healthCheckCommand is an argv vector, not a shell fragment. Every entry is passed to the process directly; a manual dry run of the exit-status contract is sketched after this list.
- The command runs after switch-to-configuration.
- Exit status 0 means the desired system stays active.
- Non-zero exit with rollbackOnFailure = true causes rollback to the previous known system and reports observed status rolled-back.
- Non-zero exit with rollbackOnFailure = false leaves the failed generation in place and requires operator intervention.
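As a sanity check on a target host, the same argv can be exercised by hand before publishing a desired system. This is a minimal sketch, assuming a hypothetical HTTP readiness endpoint on 127.0.0.1:8080; it only illustrates the exit-status contract above, not the agent's own execution path:

# Hedged sketch: run the argv that would be configured as
# services.nix-agent.healthCheckCommand (here a hypothetical
# ["curl" "-fsS" "http://127.0.0.1:8080/healthz"]) and map its exit status
# onto the contract above.
if curl -fsS http://127.0.0.1:8080/healthz > /dev/null; then
  echo "exit 0: the desired system stays active"
else
  echo "non-zero: rollback when rollbackOnFailure = true, otherwise the failed generation stays in place"
fi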
The supported recovery flow is:
- Inspect the desired and observed rollout state:
deployer-ctl \
--chainfire-endpoint http://127.0.0.1:2379 \
--cluster-id <cluster-id> \
--cluster-namespace ultracloud \
--deployer-namespace deployer \
node inspect \
--node-id <node-id> \
--include-desired-system \
--include-observed-system
- If the node reports rolled-back, fix the failed target or health-check input, then re-publish the desired system.
- Re-run the smallest proof lane with nix build .#checks.x86_64-linux.deployer-vm-rollback when the issue is in the deployer -> nix-agent boundary, or the installer-backed baremetal-iso and baremetal-iso-e2e lanes when the issue includes first boot.
deployer-vm-rollback is the canonical reproducible proof for this contract. It publishes a desired system whose health_check_command = ["false"], expects observed status rolled-back, and proves that the current system does not remain on the rejected target generation. The longer-running 2026-04-10 rollout-soak lane does not pretend the steady-state nix/test-cluster nodes are deployer-managed nix-agent nodes; instead it records node01-nix-agent-scope.txt and node04-nix-agent-scope.txt under ./work/rollout-soak/20260410T154744+0900, while the executable nix-agent proof surface remains deployer-vm-rollback, baremetal-iso, and baremetal-iso-e2e.
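A minimal operator-side sketch of that loop, assuming the node inspect output contains the observed status string rolled-back verbatim (the exact output shape is not specified here):

# Hedged sketch: flag a rolled-back node, then rebuild the smallest
# reproducible proof for the deployer -> nix-agent boundary once the failed
# target or health-check input has been fixed and re-published.
if deployer-ctl \
     --chainfire-endpoint http://127.0.0.1:2379 \
     --cluster-id <cluster-id> \
     --cluster-namespace ultracloud \
     --deployer-namespace deployer \
     node inspect --node-id <node-id> --include-observed-system \
   | grep -q 'rolled-back'; then
  nix build .#checks.x86_64-linux.deployer-vm-rollback
fi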
Fleet-Scheduler Drain And Maintenance Contract
- Use deployer-ctl node set-state --node-id <node> --state draining for planned short-lived maintenance; a minimal drain-and-restore sketch follows this list.
- draining removes the node from new placement and causes the scheduler to relocate replicated work when capacity exists.
- active re-admits the node and allows replica count to grow back, but healthy singleton work is not required to churn back automatically.
- Fail-stop worker loss is treated like implicit maintenance exhaustion: the scheduler restores healthy placement on the remaining eligible nodes when placement policy allows it.
- The supported release proof is limited to the two native-runtime worker lab with one planned drain cycle and one fail-stop worker-loss cycle, each held for 30 seconds in rollout-soak.
- Multi-hour maintenance windows, operator approval workflows, pinned singleton drain choreography, and large-cluster drain storms remain outside the supported contract for this release.
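This is a minimal planned-maintenance sketch, assuming node04 as the drained worker (mirroring the fresh-smoke choreography) and that the node is re-admitted with the same set-state subcommand:

# Hedged sketch of one planned drain cycle on node04: remove it from new
# placement, hold the degraded state, then re-admit it.
deployer-ctl node set-state --node-id node04 --state draining
sleep 30   # matches the 30-second held degraded state used by rollout-soak
deployer-ctl node set-state --node-id node04 --state active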
fresh-smoke is the canonical KVM proof for the baseline behavior. It drains node04, verifies that native-web, native-container, and native-daemon relocate to node05, restores node04, then simulates node05 loss and verifies failover back to node04 plus replica restoration when node05 returns. rollout-soak reruns that choreography as exactly one planned drain cycle and one fail-stop worker-loss cycle, holds each degraded state for 30 seconds, restarts the rollout services, and then rechecks the live runtime state; the 2026-04-10 run under ./work/rollout-soak/20260410T164549+0900 is the current release-grade artifact for that scope-fixed boundary. fleet-scheduler-e2e remains the cheap regression lane for the same scheduling semantics.
Node-Agent Logs, Secrets, Volumes, And Upgrade Contract
- Runtime state lives under services.node-agent.stateDir, with pid files, metadata, and per-instance logs under ${stateDir}/pids.
- Each managed instance writes combined stdout and stderr to ${stateDir}/pids/<service>-<instance>.pid.log.
- Metadata is persisted beside the pid file as ${stateDir}/pids/<service>-<instance>.pid.meta.json, including argv and boot-id data used to reject stale pid reuse across reboot; both files can be read directly, as sketched after this list.
- Secrets are not fetched, rotated, or encrypted by node-agent. Supported secret delivery is limited to values already present in the rendered service spec, environment, or mounted host files.
- Volumes are declared host-path mounts from ContainerVolumeSpec. node-agent passes them through to the runtime and honors read_only, but it does not provision or garbage-collect those paths.
- Upgrades are replace-and-reconcile operations driven by fleet-scheduler state changes. node-agent does not patch binaries in place; it stops stale processes or containers and starts new ones from the updated spec.
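This is a minimal inspection sketch for one managed instance, assuming /var/lib/node-agent as an example value for services.node-agent.stateDir and keeping the <service>-<instance> placeholders from above:

# Hedged sketch: read an instance's combined stdout/stderr log and its
# persisted metadata. STATE_DIR is whatever services.node-agent.stateDir is
# set to on this host; the value below is only an example.
STATE_DIR=/var/lib/node-agent
tail -n 100 "$STATE_DIR/pids/<service>-<instance>.pid.log"
jq . "$STATE_DIR/pids/<service>-<instance>.pid.meta.json"   # argv and boot-id data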
fresh-smoke, fresh-matrix, fleet-scheduler-e2e, and rollout-soak are the operator proofs for the live runtime path, while the persisted process metadata in deployer/crates/node-agent/src/process.rs is the source of truth for the log and stale-pid contract. rollout-soak restarts node-agent.service on live worker nodes and records the longer-running restart survival artifacts under ./work/rollout-soak/20260410T164549+0900; nix-agent stays scope-fixed to its dedicated deployer and installer proofs because the steady-state KVM cluster nodes do not run nix-agent.service.