
Rollout Bundle Operations

This document fixes the supported operator contract for the native rollout bundle:

  • deployer
  • fleet-scheduler
  • nix-agent
  • node-agent

The supported layering is still deployer -> nix-agent for host OS rollout and deployer -> fleet-scheduler -> node-agent for host-native service placement.

Supported Scope

  • deployer is supported as a single logical rollout authority. The supported recovery model is restart-in-place or cold-standby replacement that reuses the same chainfire namespace, admin and bootstrap credentials, bootstrap flake bundle, and local state backup.
  • deployer is scope-fixed to one active writer plus an optional cold-standby restore; automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release. Do not run multiple writers against the same deployer namespace, and do not assume automatic leader failover is safe.
  • nix-agent is supported for host-local desired-system apply, post-activation health-check execution, and rollback to the previous known system.
  • fleet-scheduler is scope-fixed to the two-worker native-runtime lab with one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states in rollout-soak; multi-hour maintenance windows, pinned singleton policies, and large-cluster drain storms are outside the supported product contract for this release.
  • node-agent is supported for host-local runtime reconcile, process and container execution, per-instance logs, and declared host-path volume mounts. It is not a secret manager, a storage provisioner, or an in-place binary patch system.

Proof Commands

  • nix build .#checks.x86_64-linux.deployer-vm-smoke
  • nix build .#checks.x86_64-linux.deployer-vm-rollback
  • nix build .#checks.x86_64-linux.portable-control-plane-regressions
  • nix build .#checks.x86_64-linux.fleet-scheduler-e2e
  • nix run ./nix/test-cluster#cluster -- fresh-smoke
  • nix run ./nix/test-cluster#cluster -- rollout-soak
  • nix run ./nix/test-cluster#cluster -- durability-proof

deployer-vm-rollback is the smallest reproducible proof for the nix-agent health-check and rollback path. fresh-smoke and fleet-scheduler-e2e keep the short regression semantics green. rollout-soak is the longer-running KVM operator lane for one planned drain cycle, one fail-stop worker-loss cycle, and service-restart behavior across deployer, fleet-scheduler, node-agent, and the canonical 3-node control plane. It writes scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt so the release boundary is captured in the proof root instead of being implied only by docs. The steady-state nix/test-cluster nodes record explicit nix-agent scope markers instead of pretending they run nix-agent.service. durability-proof remains the canonical persisted artifact lane for deployer backup, restart, replay, and storage-side failure injection.
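
A quick way to read that captured boundary after a run is to open the scope-fixed artifacts directly from the proof root. This is a minimal sketch using the 2026-04-10 rollout-soak proof root cited later in this document; jq and cat are ordinary host tools, not part of the bundle, and the exact file set per run is listed with each proof root below.

jq . ./work/rollout-soak/20260410T164549+0900/scope-fixed-contract.json
cat ./work/rollout-soak/20260410T164549+0900/deployer-scope-fixed.txt
cat ./work/rollout-soak/20260410T164549+0900/fleet-scheduler-scope-fixed.txt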

Deployer HA And DR

Supported deployer recovery is a single-writer restore runbook. DEPLOYER-P1-01 is closed as a scope-fixed release boundary rather than an implied future HA promise:

  1. Preserve the generated cluster state from ultracloud.cluster, the deployer bootstrap and admin credentials, and services.deployer.localStatePath.
  2. Start exactly one deployer instance with the same chainfireEndpoints, clusterNamespace, chainfireNamespace, tokens, and optional TLS CA inputs.
  3. Re-apply the canonical cluster state:
deployer-ctl \
  --chainfire-endpoint http://127.0.0.1:2379 \
  --cluster-id <cluster-id> \
  --cluster-namespace ultracloud \
  --deployer-namespace deployer \
  apply --config cluster-state.json --prune
  4. Replay any preserved admin pre-register requests in the same shape as ./work/durability-proof/latest/deployer-pre-register-request.json.
  5. Verify the recovered state with curl -fsS -H 'x-deployer-token: <token>' http://<deployer>:8088/api/v1/admin/nodes | jq and, for node rollout intent, deployer-ctl node inspect --node-id <node> --include-desired-system --include-observed-system, shown as a full command block below.
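
Spelled out with the same global flags as the apply step, the verification commands look like the following; <deployer>, <token>, <cluster-id>, and <node> are placeholders for the preserved credentials, cluster identity, and target node.

curl -fsS -H 'x-deployer-token: <token>' http://<deployer>:8088/api/v1/admin/nodes | jq

deployer-ctl \
  --chainfire-endpoint http://127.0.0.1:2379 \
  --cluster-id <cluster-id> \
  --cluster-namespace ultracloud \
  --deployer-namespace deployer \
  node inspect \
  --node-id <node> \
  --include-desired-system \
  --include-observed-system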

The 2026-04-10 canonical backup-and-replay proof for this contract is nix run ./nix/test-cluster#cluster -- durability-proof, which recorded deployer-pre-register-request.json, deployer-backup-list.json, deployer-post-restart-list.json, and deployer-replayed-list.json under ./work/durability-proof/20260410T120618+0900. The longer-run live-operations companion is nix run ./nix/test-cluster#cluster -- rollout-soak, which on 2026-04-10 recorded deployer-post-restart-nodes.json, maintenance-held.json, power-loss-held.json, post-control-plane-restarts.json, scope-fixed-contract.json, and deployer-scope-fixed.txt under ./work/rollout-soak/20260410T164549+0900 while holding degraded states and re-checking the admin inventory.

Nix-Agent Operator Contract

  • services.nix-agent.healthCheckCommand is an argv vector, not a shell fragment. Every entry is passed to the process directly.
  • The command runs after switch-to-configuration.
  • Exit status 0 means the desired system stays active.
  • Non-zero exit with rollbackOnFailure = true causes rollback to the previous known system and reports observed status rolled-back (see the sketch after this list).
  • Non-zero exit with rollbackOnFailure = false leaves the failed generation in place and requires operator intervention.
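
A minimal way to sanity-check the contract is to run a candidate health-check argv by hand on the node and look only at the exit status, since that is all nix-agent consumes; the failing argv below is the same ["false"] that the deployer-vm-rollback proof publishes, and the passing case is illustrative.

false; echo "exit=$?"   # non-zero: rollback when rollbackOnFailure = true, otherwise the failed generation stays and needs operator intervention
true; echo "exit=$?"    # zero: the desired system stays active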

The supported recovery flow is:

  1. Inspect the desired and observed rollout state:
deployer-ctl \
  --chainfire-endpoint http://127.0.0.1:2379 \
  --cluster-id <cluster-id> \
  --cluster-namespace ultracloud \
  --deployer-namespace deployer \
  node inspect \
  --node-id <node-id> \
  --include-desired-system \
  --include-observed-system
  2. If the node reports rolled-back, fix the failed target or health-check input, then re-publish the desired system.
  3. Re-run the smallest proof lane with nix build .#checks.x86_64-linux.deployer-vm-rollback when the issue is in the deployer -> nix-agent boundary, or the installer-backed baremetal-iso and baremetal-iso-e2e lanes when the issue includes first boot.

deployer-vm-rollback is the canonical reproducible proof for this contract. It publishes a desired system whose health_check_command = ["false"], expects observed status rolled-back, and proves that the current system does not remain on the rejected target generation. The longer-running 2026-04-10 rollout-soak lane does not pretend the steady-state nix/test-cluster nodes are deployer-managed nix-agent nodes; instead it records node01-nix-agent-scope.txt and node04-nix-agent-scope.txt under ./work/rollout-soak/20260410T154744+0900, while the executable nix-agent proof surface remains deployer-vm-rollback, baremetal-iso, and baremetal-iso-e2e.

Fleet-Scheduler Drain And Maintenance Contract

  • Use deployer-ctl node set-state --node-id <node> --state draining for planned short-lived maintenance (a full command sketch follows this list).
  • draining removes the node from new placement and causes the scheduler to relocate replicated work when capacity exists.
  • active re-admits the node and allows replica count to grow back, but healthy singleton work is not required to churn back automatically.
  • Fail-stop worker loss is treated like implicit maintenance exhaustion: the scheduler restores healthy placement on the remaining eligible nodes when placement policy allows it.
  • The supported release proof is limited to the two-worker native-runtime lab with one planned drain cycle and one fail-stop worker-loss cycle, each held for 30 seconds in rollout-soak.
  • Multi-hour maintenance windows, operator approval workflows, pinned singleton drain choreography, and large-cluster drain storms remain outside the supported contract for this release.
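
A minimal sketch of the one planned drain cycle on the two-worker lab, assuming the same global deployer-ctl flags as the runbooks above; node04 stands in for the node under maintenance.

deployer-ctl \
  --chainfire-endpoint http://127.0.0.1:2379 \
  --cluster-id <cluster-id> \
  --cluster-namespace ultracloud \
  --deployer-namespace deployer \
  node set-state --node-id node04 --state draining

# perform the short-lived maintenance, then re-admit the node:

deployer-ctl \
  --chainfire-endpoint http://127.0.0.1:2379 \
  --cluster-id <cluster-id> \
  --cluster-namespace ultracloud \
  --deployer-namespace deployer \
  node set-state --node-id node04 --state active

After re-admission, replicated work is allowed to grow back, but healthy singleton work is not forced to churn back automatically.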

fresh-smoke is the canonical KVM proof for the baseline behavior. It drains node04, verifies that native-web, native-container, and native-daemon relocate to node05, restores node04, then simulates node05 loss and verifies failover back to node04 plus replica restoration when node05 returns. rollout-soak reruns that choreography as exactly one planned drain cycle and one fail-stop worker-loss cycle, holds each degraded state for 30 seconds, restarts the rollout services, and then rechecks the live runtime state; the 2026-04-10 run under ./work/rollout-soak/20260410T164549+0900 is the current release-grade artifact for that scope-fixed boundary. fleet-scheduler-e2e remains the cheap regression lane for the same scheduling semantics.

Node-Agent Logs, Secrets, Volumes, And Upgrade Contract

  • Runtime state lives under services.node-agent.stateDir, with pid files, metadata, and per-instance logs under ${stateDir}/pids (see the inspection sketch after this list).
  • Each managed instance writes combined stdout and stderr to ${stateDir}/pids/<service>-<instance>.pid.log.
  • Metadata is persisted beside the pid file as ${stateDir}/pids/<service>-<instance>.pid.meta.json, including argv and boot-id data used to reject stale pid reuse across reboot.
  • Secrets are not fetched, rotated, or encrypted by node-agent. Supported secret delivery is limited to values already present in the rendered service spec, environment, or mounted host files.
  • Volumes are declared host-path mounts from ContainerVolumeSpec. node-agent passes them through to the runtime and honors read_only, but it does not provision or garbage-collect those paths.
  • Upgrades are replace-and-reconcile operations driven by fleet-scheduler state changes. node-agent does not patch binaries in place; it stops stale processes or containers and starts new ones from the updated spec.
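
A hedged inspection sketch for the log and metadata layout above; /var/lib/node-agent is an assumed value for services.node-agent.stateDir and native-web-0 is an illustrative <service>-<instance> name, so substitute the values from your configuration.

STATE_DIR=/var/lib/node-agent   # assumed; use your configured services.node-agent.stateDir
tail -n 50 "$STATE_DIR"/pids/native-web-0.pid.log     # combined stdout and stderr for one managed instance
jq . "$STATE_DIR"/pids/native-web-0.pid.meta.json     # persisted metadata, including the argv and boot-id data used to reject stale pid reuse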

fresh-smoke, fresh-matrix, fleet-scheduler-e2e, and rollout-soak are the operator proofs for the live runtime path, while the persisted process metadata in deployer/crates/node-agent/src/process.rs is the source of truth for the log and stale-pid contract. rollout-soak restarts node-agent.service on live worker nodes and records the longer-running restart survival artifacts under ./work/rollout-soak/20260410T164549+0900; nix-agent stays scope-fixed to its dedicated deployer and installer proofs because the steady-state KVM cluster nodes do not run nix-agent.service.