UltraCloud Baseline TODO (2026-04-10)
- Task: 0fe10731-bdbc-4f8f-8bcc-5f5a16903200
- Working branch: task/0fe10731-baseline-todo
- Base: origin/main at b8ebd24d4e9b2dbe71e34ba09b77092dfa7dd43c
- Handoff policy: the dirty worktree from task/343c8c57-main-reaggregate was not reset or reverted; it was carried onto the new branch as-is.
- Purpose of this ticket: pin each component's responsibilities, canonical entrypoints, current evidence, unproven items, prioritized issue tickets, and dependencies on a single sheet, so it can serve as the baseline ticket for subsequent autonomous implementation.
- Survey inputs: README.md, docs/component-matrix.md, docs/testing.md, nix/test-cluster/README.md, plans/cluster-investigation-2026-03-02/*, the current nix/modules/*, nix/single-node/*, nix/nodes/baremetal-qemu/*, nix/test-cluster/*, and each component's src/main.rs / API definitions.
Canonical Boundary Snapshot
- 正本 profile は 3 つ:
single-node dev,3-node HA control plane,bare-metal bootstrap。 - 最小コアは
chainfire + flaredb + iam + prismnet + plasmavmc。 - ネットワーク provider bundle は
prismnet + flashdns + fiberlb。 - VM hosting bundle は
plasmavmc + prismnet + coronafs + lightningstor。 - edge/tenant bundle は
apigateway + nightlight + creditservice。 - rollout bundle は
deployer + nix-agent + fleet-scheduler + node-agent。 - 2026-04-10 の current branch では、QEMU/KVM を正本の local proof とし、bare-metal proof も
QEMU as hardwareとして同一 ISO 契約で扱う構造が入っている。
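For orientation, the three canonical profiles map to local entrypoints that recur throughout this ticket; a minimal sketch of exercising each one locally, using only commands cited in the proof sections below:

```bash
# single-node dev: quickstart smoke on the local machine
nix run .#single-node-quickstart

# 3-node HA control plane: fresh KVM cluster smoke from the test harness
nix run ./nix/test-cluster#cluster -- fresh-smoke

# bare-metal bootstrap, proven as QEMU-as-hardware under the same ISO contract
nix run ./nix/test-cluster#cluster -- baremetal-iso
```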
2026-03-02 Failure Split
Failures from 2026-03-02 that are already resolved at the file level on the 2026-04-10 current branch
- ARCH-001: flake.nix referencing a missing docs/.../configuration.nix is resolved. The canonical sources are now nix/nodes/vm-cluster/node01,node02,node03 and canonical-profile-eval-guards.
- ARCH-002: the ISO install's dangling disko.nix reference is resolved. verify-baremetal-iso.sh now uses nix/nodes/baremetal-qemu/control-plane/disko.nix and .../worker/disko.nix directly.
- ARCH-003: the missing Nix wiring for deployer is resolved. nix/modules/deployer.nix, the package/app/check definitions in flake.nix, and deployer-server's /api/v1/phone-home all exist.
- TC-001: the joinAddr inconsistency is resolved. The current chainfire/flaredb modules conform to the initialPeers contract.
- TC-002: node06's creditservice evaluation failure is resolved. The current nix/test-cluster/node06.nix imports creditservice.nix and supplies flaredbAddr.
- COMP-001 through COMP-004: the IAM endpoint injection mismatches are resolved. prismnet, plasmavmc, fiberlb, lightningstor, flashdns, and creditservice now translate module options into the config keys their binaries actually read.
- ARCH-004: the first-boot leader_url contract inconsistency is resolved. nix/modules/first-boot-automation.nix now assumes http://localhost:8081/8082 and /admin/member/add.
- ARCH-005: FlareDB's missing first-boot join API is resolved. flaredb/crates/flaredb-server/src/rest.rs has POST /admin/member/add.
- 3.1 NightLight grpcPort mismatch: resolved. nightlight-server now binds both HTTP and gRPC.
- ARCH-006 / the cluster-config dual implementation: the nix-nos/topology.nix-rooted duplication seen on 2026-03-02 is no longer present in the current tree; the canonical sources have converged on nix/lib/cluster-schema.nix and nix/modules/ultracloud-cluster.nix.
- QLT-001: the large block of doCheck = false entries in flake.nix no longer remains, at least at the current file level.
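The ARCH-004/ARCH-005 first-boot join contract can also be exercised by hand. The endpoint path and the localhost:8081/8082 leader_url assumption come from the items above, but the request body shape here is a hypothetical sketch, not a verified schema:

```bash
# Hypothetical sketch of the member-add call that
# nix/modules/first-boot-automation.nix automates. The /admin/member/add path
# and the localhost port assumption are from ARCH-004/ARCH-005; the JSON
# field names and values are illustrative only.
curl -sS -X POST http://localhost:8082/admin/member/add \
  -H 'Content-Type: application/json' \
  -d '{"name": "node02", "peer_addr": "10.62.10.2:2380"}'
```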
Failures split out from 2026-03-02 where 2026-04-10 has the structural fix but runtime re-proof is still incomplete
- VERIFY-001: on the 2026-04-10 local AMD/KVM host, supported-surface-guard, single-node-trial-vm, single-node-quickstart, fresh-smoke, fresh-demo-vm-webapp, fresh-matrix, ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite, canonical-profile-eval-guards, portable-control-plane-regressions, deployer-bootstrap-e2e, host-lifecycle-e2e, fleet-scheduler-e2e, baremetal-iso, nix build .#checks.x86_64-linux.baremetal-iso-e2e, and the built ./result/bin/baremetal-iso-e2e exact runner have all been rerun and pass. The only thing not re-proven is the physical bare-metal smoke.
- VERIFY-002: bare-metal bootstrap is closed up to the QEMU ISO proof, but the same contract has not yet been re-proven against USB/BMC/physical machines. However, nix run ./nix/test-cluster#hardware-smoke -- preflight was added on 2026-04-10, so when transports are absent the blocked state is now recorded mechanically in ./work/hardware-smoke/latest/status.env and missing-requirements.txt.
- VERIFY-003: the config-contract fixes have been reconfirmed by run-publishable-kvm-suite.sh across all add-on-enabled profiles. baremetal-iso-e2e has also migrated to the materialized host-KVM runner, so the remaining work narrows to hardware bring-up.
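A compressed rerun sketch for the VERIFY-001 lane, using the exact commands recorded above (the ordering is a judgment call, not a harness requirement; the runner's work-root argument name is illustrative):

```bash
set -euo pipefail

# Eval/build guards and e2e checks
nix build .#checks.x86_64-linux.canonical-profile-eval-guards --no-link
nix build .#checks.x86_64-linux.supported-surface-guard --no-link
nix build .#checks.x86_64-linux.portable-control-plane-regressions
nix build .#checks.x86_64-linux.deployer-bootstrap-e2e
nix build .#checks.x86_64-linux.host-lifecycle-e2e
nix build .#checks.x86_64-linux.fleet-scheduler-e2e

# Runtime lanes and the publishable KVM suite
nix run .#single-node-quickstart
./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite

# Exact bare-metal lane: build the materialized runner, then run it with host KVM
nix build .#checks.x86_64-linux.baremetal-iso-e2e
./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/verify-rerun
```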
First Tranche Backlog
- TRANCHE-01: done. The optional-bundle health gating for single-node dev was fixed on 2026-04-10, resolving the coronafs port mismatch and the unmonitored health of flashdns/fiberlb/lightningstor.
- TRANCHE-02: the baremetal-iso and baremetal-iso-e2e exact runners were rerun on the 2026-04-10 local AMD/KVM host. The next stage adds a smoke on one USB/BMC/physical machine.
- TRANCHE-03: done. nix run ./nix/test-cluster#cluster -- durability-proof was added on 2026-04-10, fixing chainfire/flaredb logical backup/restore and the deployer admin pre-register request replay + restart persistence proof into the product doc and harness.
- TRANCHE-04: done. The local chainfire default endpoints of fleet-scheduler, nix-agent, node-agent, and deployer-ctl were normalized to the canonical http://127.0.0.1:2379 on 2026-04-10.
- TRANCHE-05: done. The supported scope of fiberlb's HTTPS health check was made explicit on 2026-04-10; the product contract is currently TCP reachability + HTTP status only, without backend TLS certificate verification, fixed in docs/guard/source comments.
- TRANCHE-06: done. k8shost was fixed as an API/control-plane product on 2026-04-10, the runtime dataplane helpers are archived non-product, and docs/guard/TODO were brought into agreement.
- TRANCHE-07: done. The 2026-04-10 durability-proof preserves lightningstor distributed-backend node-loss/repair and a coronafs controller/node split outage as canonical failure-injection proofs.
- TRANCHE-08: done. A hardware-smoke preflight/handoff wrapper was added on 2026-04-10 so the deployer -> ISO -> first-boot -> nix-agent physical bring-up can be prepared through a common USB/BMC/Redfish entrypoint. The blocked artifact for absent transports is also fixed under ./work/hardware-smoke.
- TRANCHE-09: done. docs/rollout-bundle.md was added on 2026-04-10, fixing the product contracts and proof commands for deployer single-writer DR, nix-agent health-check/rollback, node-agent logs/secrets/volume/upgrade, and fleet-scheduler drain/maintenance/failover.
- TRANCHE-10: done. nix run ./nix/test-cluster#cluster -- rollout-soak was fixed on 2026-04-10 as the longer-run KVM operator lane, preserving draining maintenance, worker power-loss, deployer/fleet-scheduler/node-agent restarts, and fixed-membership chainfire/flaredb restarts under one artifact root. That the steady-state test-cluster does not carry nix-agent.service is also made explicit with a scope-marker artifact.
- TRANCHE-11: done. DEPLOYER-P1-01 and FLEET-P1-01 were updated to their scope-fixed final state on 2026-04-10, and rollout-soak now saves scope-fixed-contract.json, deployer-scope-fixed.txt, and fleet-scheduler-scope-fixed.txt under /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900. deployer is fixed at one active writer plus optional cold-standby restore, and fleet-scheduler at one drain + one fail-stop cycle with a 30-second hold on two native-runtime workers, as the release boundary.
- TRANCHE-12: done. FDB-P1-01, IAM-P1-01, and HARNESS-P2-01 were processed on 2026-04-10. run-core-control-plane-ops-proof.sh saves scope-fixed-contract.json, iam-credential-rotation-tests.log, iam-mtls-rotation-tests.log, and result.json under /mnt/d2/centra/photoncloud-monorepo/work/core-control-plane-ops-proof/20260410T172148+09:00; FlareDB destructive DDL / fully automated online migration is scope-fixed unsupported, and IAM's supported lifecycle runs up to signing-key + credential + mTLS overlap rotation, with multi-node failover fixed as unsupported. work-root-budget.sh gained enforce and prune-proof-logs, advancing it from a disk-budget advisory to a stronger local gate and a safer cleanup workflow.
2026-04-10 Physical Hardware Bring-Up Pack
- Task: 3dba03d3-525b-4079-8c93-90af6a89d32b
- Canonical entrypoint: nix run ./nix/test-cluster#hardware-smoke -- preflight, then run or capture
- Current preflight artifact root: ./work/hardware-smoke/latest
- Artifact contract: status.env, missing-requirements.txt, kernel-params.txt, expected-markers.txt, failure-markers.txt, operator-handoff.md, environment.txt
- Bridge to QEMU proof: the hardware wrapper reuses nixosConfigurations.ultracloud-iso and the same ULTRACLOUD_MARKER pre-install.boot.*, pre-install.phone-home.complete.*, install.disko.complete.*, reboot.*, post-install.boot.*, and desired-system-active.* markers that verify-baremetal-iso.sh enforces in the QEMU harness.
- Blocked-state recording: when the USB device or the BMC/Redfish transport is missing, preflight records status=blocked and the missing transport, kernel-parameter, and capture inputs in missing-requirements.txt, without pretending the hardware proof ran.
- Still open: an actual physical-node execution remains pending until a removable USB target or a BMC/Redfish endpoint plus credentials are supplied.
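A minimal operator sketch of the blocked-state contract, assuming only the artifact filenames and the status=blocked marker fixed above (any further status.env key names are not assumed here):

```bash
# Run preflight; it must not pretend the hardware proof ran.
nix run ./nix/test-cluster#hardware-smoke -- preflight

# Inspect the recorded state. status.env carries status=blocked when the
# USB or BMC/Redfish transport is missing.
source ./work/hardware-smoke/latest/status.env
if [ "${status:-}" = "blocked" ]; then
  cat ./work/hardware-smoke/latest/missing-requirements.txt
fi
```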
2026-04-10 Long-Run Control Plane And Rollout Soak
- Task: 07d6137e-6e4c-4158-9142-8920f4f70a76
- Canonical entrypoint: nix run ./nix/test-cluster#cluster -- rollout-soak
- Artifact root: /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900
- Scenario proof: one planned node04 -> draining -> active cycle, one node05 power-loss and recovery cycle, restarts of deployer.service, fleet-scheduler.service, and node-agent.service on both worker nodes, and a fixed-membership restart of chainfire.service plus flaredb.service on node02.
- Saved evidence: maintenance-during.json, maintenance-held.json, maintenance-restored.json, power-loss-during.json, power-loss-held.json, power-loss-restored.json, deployer-post-restart-nodes.json, fleet-scheduler-post-restart.json, node04-node-agent-post-restart.json, node05-node-agent-post-restart.json, chainfire-post-restart-put.json, flaredb-post-restart.json, post-control-plane-restarts.json, scope-fixed-contract.json, deployer-scope-fixed.txt, fleet-scheduler-scope-fixed.txt, result.json.
- Long-run nix-agent boundary: steady-state nix/test-cluster nodes do not ship nix-agent.service, so this soak records node01-nix-agent-scope.txt and node04-nix-agent-scope.txt instead of pretending a live-cluster nix-agent restart happened. The executable nix-agent proofs remain deployer-vm-rollback, baremetal-iso, and baremetal-iso-e2e.
- Result: PASS on the local AMD/KVM host. result.json records success=true, fleet_supported_native_runtime_nodes=2, validated_maintenance_cycles=1, validated_power_loss_cycles=1, soak_hold_secs=30, and the summary "validated one planned drain cycle and one fail-stop worker-loss cycle on the two-node native-runtime lab, held each degraded state for the configured soak window, restarted deployer or scheduler or agent services, and revalidated fixed-membership control-plane restarts while keeping deployer HA scope-fixed to single-writer recovery".
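The recorded result.json fields can be re-checked mechanically; a small sketch using jq (field names exactly as recorded above; jq availability on the host is assumed):

```bash
root=/mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900
# Exit non-zero unless every recorded soak field matches the PASS baseline.
jq -e '
  .success == true
  and .fleet_supported_native_runtime_nodes == 2
  and .validated_maintenance_cycles == 1
  and .validated_power_loss_cycles == 1
  and .soak_hold_secs == 30
' "$root/result.json"
```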
2026-04-10 Local Executable Baseline
- Task: b1e811fb-158f-415c-a011-64c724e84c5c
- Runner: nix/test-cluster/run-local-baseline.sh
- Log root: /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c
- Local execution policy: ULTRACLOUD_WORK_ROOT=/mnt/d2/centra/photoncloud-monorepo/work, TMPDIR=/mnt/d2/centra/photoncloud-monorepo/work/tmp, XDG_CACHE_HOME=/mnt/d2/centra/photoncloud-monorepo/work/xdg-cache, PHOTON_CLUSTER_WORK_ROOT=/mnt/d2/centra/photoncloud-monorepo/work/test-cluster, PHOTON_VM_DIR=/mnt/d2/centra/photoncloud-monorepo/work/test-cluster/state, PHOTON_CLUSTER_VDE_SWITCH_DIR=/mnt/d2/centra/photoncloud-monorepo/work/test-cluster/vde-switch, and NIX_CONFIG builders = to forbid remote builders.
- Host evidence: environment.txt records host_cpu_count=12, ultracloud_local_nix_max_jobs=6, ultracloud_local_nix_build_cores=2, photon_cluster_nix_max_jobs=6, photon_cluster_nix_build_cores=2, nix_builders=(empty), kvm_access=rw, nested_param_value=1.
- Guard/build checks:
  - canonical-profile-eval-guards: PASS. command nix build .#checks.x86_64-linux.canonical-profile-eval-guards --no-link; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/canonical-profile-eval-guards.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/canonical-profile-eval-guards.log.
  - supported-surface-guard: PASS. command nix build .#checks.x86_64-linux.supported-surface-guard --no-link; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/supported-surface-guard.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/supported-surface-guard.log.
  - portable-control-plane-regressions: PASS. command nix build .#checks.x86_64-linux.portable-control-plane-regressions; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/portable-control-plane-regressions.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/portable-control-plane-regressions.log.
  - deployer-bootstrap-e2e: PASS. command nix build .#checks.x86_64-linux.deployer-bootstrap-e2e; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/deployer-bootstrap-e2e.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/deployer-bootstrap-e2e.log.
  - host-lifecycle-e2e: PASS. command nix build .#checks.x86_64-linux.host-lifecycle-e2e; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/host-lifecycle-e2e.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/host-lifecycle-e2e.log.
  - fleet-scheduler-e2e: PASS. command nix build .#checks.x86_64-linux.fleet-scheduler-e2e; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/fleet-scheduler-e2e.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/fleet-scheduler-e2e.log.
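The local execution policy above pins every writable path under ./work and forbids remote builders. A sketch of reproducing that environment by hand, with values copied from the policy (run-local-baseline.sh is the canonical way to set them):

```bash
export ULTRACLOUD_WORK_ROOT=/mnt/d2/centra/photoncloud-monorepo/work
export TMPDIR=$ULTRACLOUD_WORK_ROOT/tmp
export XDG_CACHE_HOME=$ULTRACLOUD_WORK_ROOT/xdg-cache
export PHOTON_CLUSTER_WORK_ROOT=$ULTRACLOUD_WORK_ROOT/test-cluster
export PHOTON_VM_DIR=$PHOTON_CLUSTER_WORK_ROOT/state
export PHOTON_CLUSTER_VDE_SWITCH_DIR=$PHOTON_CLUSTER_WORK_ROOT/vde-switch
# Empty builders list = remote builders forbidden.
export NIX_CONFIG='builders ='
```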
- Runtime path checks:
  - single-node-quickstart: PASS. command nix run .#single-node-quickstart; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/single-node-quickstart.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/single-node-quickstart.log; success marker "single-node quickstart smoke passed".
  - baremetal-iso: PASS. command nix run ./nix/test-cluster#cluster -- baremetal-iso; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/baremetal-iso.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/baremetal-iso.log; success markers "ULTRACLOUD_MARKER desired-system-active.iso-control-plane-01", "ULTRACLOUD_MARKER desired-system-active.iso-worker-01", "Canonical ISO bare-metal QEMU verification succeeded".
  - fresh-smoke: PASS. command nix run ./nix/test-cluster#cluster -- fresh-smoke; meta /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/fresh-smoke.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baselines/b1e811fb-158f-415c-a011-64c724e84c5c/fresh-smoke.log; success marker "Cluster validation succeeded".
- 2026-04-10 execution failures: none. The 2026-03-02 historical failure split stands as written in the section above; this local AMD/KVM baseline did not reproduce any required command as a failure.
- 2026-04-10 observed non-failure risk: HARNESS-OBS-20260410-01: resolved on 2026-04-10. The stale VM cleanup in nix/test-cluster/run-cluster.sh now collects only PIDs whose cmdline confirms the current vm_dir/vde_switch_dir, and the path-independent hostfwd=tcp::${port}-:22 fallback was removed.
2026-04-10 Bare-Metal Canonical Path
- Task: 6d9f45e4-1954-4a0b-b886-c61482db6c3c
- QEMU-as-hardware runtime proof: PASS. command nix run ./nix/test-cluster#cluster -- baremetal-iso; log root /mnt/d2/centra/photoncloud-monorepo/work/baremetal-iso; evidence files environment.txt, deployer.log, chainfire.log, control-plane.serial.log, worker.serial.log.
- Runtime PASS markers: ULTRACLOUD_MARKER desired-system-active.iso-control-plane-01, ULTRACLOUD_MARKER desired-system-active.iso-worker-01, Canonical ISO bare-metal QEMU verification succeeded.
- Runtime contract now proven:
  - reusable node classes own install_plan.nixos_configuration, install_plan.disko_config_path, and a stable install_plan.target_disk_by_id
  - nodes carry identity plus desired-system overrides only; when a cache-backed prebuilt closure is available, they now publish desired_system.target_system to converge to the exact shipped system instead of a dirty local rebuild
  - installed nodes now keep nix-agent alive across their own switch-to-configuration transaction long enough for activation to finish, which restored post-install chainfire and nix-agent convergence
- Historical blocker (resolved on 2026-04-10): direct build-time execution of nix build .#checks.x86_64-linux.baremetal-iso-e2e ran under the sandboxed nixbld1 user and fell back to TCG. The exact lane is now a materialized runner: the check build succeeds quickly and emits ./result/bin/baremetal-iso-e2e, and that runner executes the same verify-baremetal-iso.sh harness with host KVM and logs under ./work.
2026-04-10 Responsibility And Minimal-Surface Alignment
- Task: 65a13e46-1376-4f37-a5c1-e520b5b376ec
- Authoring source decision: ultracloud.cluster backed by nix/lib/cluster-schema.nix is now documented in README.md, docs/README.md, and docs/testing.md as the only supported cluster authoring source. nix-nos is explicitly reduced to legacy compatibility plus low-level network primitives.
- Module boundary alignment: the services.deployer, services.fleet-scheduler, services.nix-agent, and services.node-agent descriptions now agree on the canonical layering ultracloud.cluster -> deployer -> (nix-agent | fleet-scheduler -> node-agent).
- Minimal-surface friction reduction: services.plasmavmc and services.k8shost now wait only for the local backing services they actually use. When explicit remote endpoints are configured, they no longer hard-wire unrelated local control-plane units into startup ordering, which preserves a lighter standalone story for the VM-platform core and remote-provider deployments.
- Validation alignment: supported-surface-guard now requires contract markers for the supported authoring source, the constrained nix-nos role, and the standalone VM-platform story, so docs drift becomes a failing regression.
- Still open: the rollout-stack default port mismatch is resolved; what remains is hardware bring-up and a longer-duration durability proof.
2026-04-10 Supported Surface Final Proof
- Task: 32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0
- Guard + minimal-trial proof root: /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final
- supported-surface-guard: PASS. command nix build .#checks.x86_64-linux.supported-surface-guard --no-link; meta /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final/supported-surface-guard.meta; log /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final/supported-surface-guard.log.
- single-node-trial-vm: PASS. command nix build .#single-node-trial-vm --no-link --print-out-paths; meta /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final/single-node-trial-vm.meta; log /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final/single-node-trial-vm.log; output path /nix/store/1nq4pkadm3lbxmhkr54iz7lgjd6vm7z3-nixos-vm.
- single-node-quickstart: PASS. command nix run .#single-node-quickstart; meta /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final/single-node-quickstart.meta; log /mnt/d2/centra/photoncloud-monorepo/work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final/single-node-quickstart.log; success marker "single-node quickstart smoke passed".
- Publishable KVM suite root: /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite
- environment.txt captures host_cpu_count=12, local_nix_max_jobs=6, local_nix_build_cores=2, photon_cluster_nix_max_jobs=6, photon_cluster_nix_build_cores=2, kvm_present=yes, kvm_access=rw, kvm_amd_nested=1, nix_builders=, finished_at=2026-04-10T09:36:09+09:00, exit_status=0.
- fresh-smoke: PASS. command nix run ./nix/test-cluster#cluster -- fresh-smoke; meta /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/fresh-smoke.meta; log /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/fresh-smoke.log; success marker "Cluster validation succeeded".
- fresh-demo-vm-webapp: PASS. command nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp; meta /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/fresh-demo-vm-webapp.meta; log /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/fresh-demo-vm-webapp.log; success markers include PHOTON_VM_DEMO_WEB_READY and the guest web health check on http://10.62.10.10:8080/health.
- fresh-matrix: PASS. command nix run ./nix/test-cluster#cluster -- fresh-matrix; meta /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/fresh-matrix.meta; log /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/fresh-matrix.log; success marker "Component matrix validation succeeded".
- run-publishable-kvm-suite: PASS. command ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite; environment /mnt/d2/centra/photoncloud-monorepo/work/publishable-kvm-suite/environment.txt; final stdout marker "publishable KVM suite passed; logs in ./work/publishable-kvm-suite".
- Fixed while proving the surface:
  - NODEAGENT-FIX-20260410-01: reboot-time PID reuse could make node-agent treat native-daemon as the resurrected native-web instance after a worker reboot, stalling fresh-smoke at native-runtime recovery. deployer/crates/node-agent/src/process.rs now persists argv + boot-id metadata, validates the live /proc/<pid>/cmdline, and refuses to signal or reuse mismatched processes from stale pidfiles.
  - HARNESS-FIX-20260410-01: run-publishable-kvm-suite exposed a control-plane LightningStor bootstrap race that ad-hoc reruns did not consistently hit. nix/test-cluster/node01.nix now holds lightningstor.service behind explicit local control-plane and worker-replica TCP readiness with a longer start timeout, and nix/test-cluster/run-cluster.sh now waits for the worker storage agents before gating the control-plane LightningStor unit.
- Still open after the final supported-surface proof: a real-hardware baremetal-iso smoke.
2026-04-10 baremetal-iso-e2e Local-KVM Exact Lane
- Task: 0de75570-dabd-471b-95fe-5898c54e2e8c
- Check build output: nix build .#checks.x86_64-linux.baremetal-iso-e2e now materializes ./result/bin/baremetal-iso-e2e instead of trying to execute QEMU inside the daemon sandbox.
- Exact proof root: /mnt/d2/centra/photoncloud-monorepo/work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c
- Outer runner evidence: environment.txt records execution_model=materialized-check-runner, nix_builders=(empty), kvm_present=yes, kvm_access=rw, and the local CPU-derived Nix parallelism.
- Exact check build: PASS. command nix build .#checks.x86_64-linux.baremetal-iso-e2e; the output path is a runner package that ships bin/baremetal-iso-e2e plus share/ultracloud/README.txt documenting the sandbox/TCG reason for the materialized execution model.
- Exact runner: PASS. command ./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c; meta /mnt/d2/centra/photoncloud-monorepo/work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c/baremetal-iso-e2e.meta; log /mnt/d2/centra/photoncloud-monorepo/work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c/baremetal-iso-e2e.log.
- Inner runtime evidence: state dir /mnt/d2/centra/photoncloud-monorepo/work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c/state; state/environment.txt records vm_accelerator_mode=kvm; success markers in baremetal-iso-e2e.log include ULTRACLOUD_MARKER desired-system-active.iso-control-plane-01, ULTRACLOUD_MARKER desired-system-active.iso-worker-01, and Canonical ISO bare-metal QEMU verification succeeded.
- Remaining delta vs direct runtime proof: the harness is now identical because both nix run ./nix/test-cluster#cluster -- baremetal-iso and ./result/bin/baremetal-iso-e2e call nix/test-cluster/verify-baremetal-iso.sh. The only intentional difference is the execution entrypoint: nix build materializes the runner because daemon-sandboxed nixbld builds would otherwise lose host KVM and degrade to TCG.
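The exact lane is two steps: a fast sandbox build that only materializes the runner, then a host-KVM execution outside the daemon sandbox. Sketch (the work-root directory name is the operator's choice):

```bash
# Build succeeds quickly inside the sandbox; no QEMU runs here.
nix build .#checks.x86_64-linux.baremetal-iso-e2e

# The materialized runner executes verify-baremetal-iso.sh with host KVM
# and logs under the given work root (directory name illustrative).
./result/bin/baremetal-iso-e2e ./work/baremetal-iso-e2e/local-rerun
```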
2026-04-10 Durability And Product-Boundary Hardening
- Task: 541356be-b289-4583-ba40-cbf46b0f9680
- Guard rerun: PASS. command nix build .#checks.x86_64-linux.supported-surface-guard --no-link.
- Runtime rerun: PASS. command nix run ./nix/test-cluster#cluster -- fresh-matrix; success marker "Component matrix validation succeeded".
- Durability proof: PASS. command nix run ./nix/test-cluster#cluster -- durability-proof; artifact root /mnt/d2/centra/photoncloud-monorepo/work/durability-proof/20260410T120618+0900; convenience symlink /mnt/d2/centra/photoncloud-monorepo/work/durability-proof/latest.
- ChainFire proof: chainfire-backup-response.json and chainfire-restored-response.json return the same logical payload, and after the DELETE, chainfire-after-delete.out returns 404.
- FlareDB proof: flaredb-backup.json and flaredb-restored.json return the same SQL row, and flaredb-after-delete.json returns the empty set.
- Deployer proof: deployer-pre-register-request.json serves as the backup artifact; deployer-backup-list.json observes the pre-registered node; it still appears in deployer-post-restart-list.json after a deployer.service restart; and replaying the same request leaves the summary in deployer-replayed-list.json unchanged. result.json's deployer_restore_mode is "admin pre-register request replay with pre/post-restart list verification".
- CoronaFS failure injection: coronafs-node04-local-state.json keeps node_local=true and the materialized path even while the controller is down, and coronafs-node04-capabilities.json keeps the node-only capability split (supports_controller_api=false, supports_node_api=true).
- LightningStor failure injection: lightningstor-put-during-node05-outage.json, lightningstor-head-during-node05-outage.json, lightningstor-object-during-node05-outage.txt, and lightningstor-object-after-repair.txt preserve the write during the node05 outage and the read-back after repair.
- FiberLB supported limitation: fiberlb/crates/fiberlb-server/src/healthcheck.rs, README.md, docs/testing.md, docs/component-matrix.md, and flake.nix fix that HTTPS backend health is a limited contract without TLS certificate verification.
- k8shost boundary: README.md, docs/testing.md, docs/component-matrix.md, k8shost/README.md, nix/test-cluster/README.md, and flake.nix fix k8shost to the API/control-plane product surface only and align k8shost-cni, k8shost-controllers, and lightningstor-csi as archived non-product.
- Proof-lane hardening done during this tranche: the first durability-proof run failed on an unsupported DROP TABLE in the FlareDB cleanup tail, so the lane was reworked around unique namespaces; it then failed on an unbound local in the cleanup trap, so the trap cleanup was fixed with ${var:-} and a guarded tunnel shutdown. The current lane exits zero and leaves its artifacts.
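A sketch of re-checking the durability artifacts after a run, assuming jq and the artifact names recorded above (the exact JSON shapes are not restated in this ticket, so these are shallow equality/presence checks only):

```bash
root=/mnt/d2/centra/photoncloud-monorepo/work/durability-proof/latest

# ChainFire: backup and restored responses should carry the same logical
# payload; compare after key-sorting both documents.
diff <(jq -S . "$root/chainfire-backup-response.json") \
     <(jq -S . "$root/chainfire-restored-response.json")

# Deployer: the restore mode is fixed by result.json; fail if absent.
jq -er .deployer_restore_mode "$root/result.json"
```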
2026-04-10 Rollout Bundle HA And DR Hardening
- Task: a41343c5-116e-4313-8751-b333472f931c
- Operator doc: docs/rollout-bundle.md
- Verification reruns: nix build .#checks.x86_64-linux.portable-control-plane-regressions, nix build .#checks.x86_64-linux.fleet-scheduler-e2e, and nix build .#checks.x86_64-linux.deployer-vm-rollback all passed on 2026-04-10 with local-only Nix settings.
- Durability rerun: nix run ./nix/test-cluster#cluster -- durability-proof passed again from a clean KVM cluster and wrote artifacts under /mnt/d2/centra/photoncloud-monorepo/work/durability-proof/20260410T123535+0900.
- Supported deployer boundary: a single-writer deployer with restart-in-place or cold-standby restore. ChainFire-backed multi-instance failover is explicitly unsupported for now, and the restore runbook is fixed to cluster-state apply + preserved pre-register request replay + admin verification.
- Nix-agent proof: nix build .#checks.x86_64-linux.deployer-vm-rollback passed on 2026-04-10 and is now the canonical reproducible proof for health_check_command, rollback, and rolled-back partial-failure recovery semantics.
- Fleet-scheduler semantics: fresh-smoke and fleet-scheduler-e2e remain the release proofs for short-lived draining maintenance, fail-stop worker loss, and replica restoration. Long-duration maintenance and large-cluster drain choreography stay scope-limited rather than silently implied.
- Node-agent contract: product docs now fix ${stateDir}/pids/*.log as the per-instance log location, ${stateDir}/pids/*.meta.json as stale-pid metadata, secret delivery as caller-provided env or mounted files only, host-path volumes as pass-through only, and upgrades as replace-and-reconcile rather than in-place patching.
2026-04-10 Core Control Plane Operator Lifecycle Proofs
- Task: dcdc961a-0aa6-47c3-aeba-a1c67bca27b7
- Operator doc: docs/control-plane-ops.md
- Focused proof: ./nix/test-cluster/run-core-control-plane-ops-proof.sh /mnt/d2/centra/photoncloud-monorepo/work/core-control-plane-ops-proof/20260410T172148+09:00
- Focused proof result: passed on 2026-04-10 and wrote result.json, scope-fixed-contract.json, iam-key-rotation-tests.log, iam-credential-rotation-tests.log, iam-mtls-rotation-tests.log, and the contract-marker logs under /mnt/d2/centra/photoncloud-monorepo/work/core-control-plane-ops-proof/20260410T172148+09:00.
- Supported-surface guard: rerun after the doc and proof updates, so the public lifecycle contract is now guarded alongside the existing supported-surface wording.
- ChainFire boundary: dynamic membership, replace-node, and scale-out are now explicit non-supported actions on the product surface. The supported path is fixed-membership restore or whole-cluster replacement anchored by the existing durability-proof backup/restore lane.
- FlareDB boundary: online migration and schema evolution are now fixed to an additive-first, backup/restore-gated operator contract. Destructive DDL and fully automated online migration are explicit non-supported boundaries for this release rather than implied future promises.
- IAM boundary: bootstrap hardening now requires explicit admin token, signing key, and 32-byte IAM_CRED_MASTER_KEY inputs in docs. The standalone proof reruns the signing-key rotation, credential overlap-and-revoke rotation, and mTLS overlap-and-cutover rotation tests while checking the hardening markers in iam-server; multi-node IAM failover remains unsupported.
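The focused proof is rerunnable against a fresh dated root; a sketch (the timestamped directory name is the operator's choice, mirroring the recorded layout):

```bash
# Pick a fresh dated artifact root and rerun the focused lifecycle proof.
out=/mnt/d2/centra/photoncloud-monorepo/work/core-control-plane-ops-proof/$(date +%Y%m%dT%H%M%S%z)
./nix/test-cluster/run-core-control-plane-ops-proof.sh "$out"

# Guard rerun so the lifecycle contract stays wired into the supported surface.
nix build .#checks.x86_64-linux.supported-surface-guard --no-link
```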
2026-04-10 Edge And Trial-Surface Productization
- Task: cc24ac5a-b940-4a32-9136-d706ecadf875
- Operator doc: docs/edge-trial-surface.md
- Component docs: apigateway/README.md, nightlight/README.md, and creditservice/README.md
- Helper: ./nix/test-cluster/work-root-budget.sh status now reports ./work disk usage, soft budgets, and cleanup plus nix store gc guidance without mutating state by default.
- Edge bundle boundary: APIGateway is now documented as stateless and replicated behind external L4 or VIP distribution, but restart-based rollout remains the only supported config distribution or reload model proven on this branch. NightLight is fixed to a single-node WAL/snapshot product shape with process-wide retention, and CreditService export plus migration is fixed to offline export/import or backend-native snapshots instead of live mixed-writer migration.
- Trial boundary: single-node-trial-vm and single-node-quickstart remain the only supported lightweight trial surface. OCI/Docker remains intentionally unsupported because it would not prove the same guest-kernel, KVM, /dev/net/tun, and OVS/libvirt contract.
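The helper's three subcommands named in this ticket compose into a small local hygiene loop; a sketch (subcommand names from TRANCHE-12 and the harness section; no additional flags are assumed):

```bash
# Report ./work usage and soft budgets without mutating state.
./nix/test-cluster/work-root-budget.sh status

# Fail loudly when the local disk budget is exceeded (stronger local gate).
./nix/test-cluster/work-root-budget.sh enforce

# Safer cleanup of dated proof logs, then reclaim store space.
./nix/test-cluster/work-root-budget.sh prune-proof-logs
nix store gc
```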
2026-04-10 Provider And VM-Hosting Reality Proof
- Task: 41a074a3-dc5c-42fc-979e-c8ebf9919d55
- Focused proof lane: nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof
- Focused proof result: passed on 2026-04-10 and wrote result.json, meta.json, journals, and the provider and VM-hosting artifacts under /mnt/d2/centra/photoncloud-monorepo/work/provider-vm-reality-proof/20260410T135827+0900.
- Provider artifacts: network-provider/prismnet-port-create.json, network-provider/prismnet-security-group-after-add.json, network-provider/flashdns-workload-authoritative-answer.txt, network-provider/flashdns-service-authoritative-answer.txt, network-provider/fiberlb-drain-summary.txt, network-provider/fiberlb-tcp-health-before-drain.txt, and network-provider/fiberlb-tcp-health-after-restore.txt fix the current local-KVM proof to tenant network lifecycle, authoritative DNS answers, and listener drain and re-convergence.
- VM-hosting artifacts: vm-hosting/vm-create-response.json, vm-hosting/root-volume-before-migration.json, vm-hosting/root-volume-after-migration.json, vm-hosting/data-volume-after-migration.json, vm-hosting/migration-summary.json, vm-hosting/prismnet-port-after-migration.json, and vm-hosting/demo-state-after-post-migration-restart.json fix the current release proof to KVM shared-storage migration, CoronaFS handoff, and post-migration restart on the worker pair.
- Scope-fixed gaps: real OVS/OVN dataplane validation, native BGP or BFD peer interop with hardware VIP ownership, and real-hardware VM migration or storage handoff remain outside the supported local-KVM surface and are now explicit docs or guard limits rather than implied release claims.
chainfire
- Responsibilities: the replicated coordination store for all of UltraCloud. Holds KV, leases, watches, the cluster membership view, and the state anchor for the rollout stack.
- Canonical entrypoint: nix/modules/chainfire.nix; chainfire/crates/chainfire-server/src/main.rs; the supported API is chainfire/proto/chainfire.proto.
- Current evidence: README.md names MemberList/Status as the supported surface; chainfire/crates/chainfire-server/src/rest.rs has health and member add; docs/testing.md defines the quickstart and HA proofs; nix/single-node/base.nix and nix/nodes/vm-cluster/* are the canonical wiring; the 2026-04-10 durability-proof preserves logical KV backup/restore in chainfire-backup-response.json / chainfire-restored-response.json, and rollout-soak preserves the live proof after a fixed-membership restart in /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900/chainfire-post-restart-put.json and post-control-plane-restarts.json.
- Unproven: the rolling-upgrade procedure; membership changes on three physical nodes; the recovery runbook after power loss.
- P0: the current static survey found no fatal file-level breakage.
- P1: CF-P1-01 advanced on 2026-04-10 from a scope freeze to carrying a live restart proof. Dynamic membership / scale-out / replace-node remain explicitly unsupported on the supported surface, but fixed-membership restart itself was promoted to a live KVM proof by rollout-soak. All that remains for the next stage is adding a dedicated KVM proof if live membership mutation itself is ever productized.
- P2: CF-P2-01: internal pruning of chainfire-core is in progress on the current branch, so the final separation between the public boundary and workspace-internal boundaries is still needed.
- Dependencies: local disk; host networking; referenced by flaredb, iam, deployer, fleet-scheduler, nix-agent, node-agent, and coronafs.
flaredb
- Responsibilities: the replicated KV/SQL metadata store. The receiving side for each service's metadata, quota state, object metadata, and tenant network state.
- Canonical entrypoint: nix/modules/flaredb.nix; flaredb/crates/flaredb-server/src/main.rs; REST is flaredb/crates/flaredb-server/src/rest.rs.
- Current evidence: README.md documents POST /api/v1/sql and GET /api/v1/tables as supported; flaredb/crates/flaredb-server/src/rest.rs has SQL/KV/scan/member add; docs/testing.md explains the control-plane proof and the fresh-matrix dependency; nix/modules/flaredb.nix generates pdAddr and the namespace mode; the 2026-04-10 rollout-soak preserves additive SQL after a member restart in /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900/flaredb-post-restart-create.json, flaredb-post-restart-insert.json, and flaredb-post-restart.json, and run-core-control-plane-ops-proof.sh fixes in /mnt/d2/centra/photoncloud-monorepo/work/core-control-plane-ops-proof/20260410T172148+09:00/scope-fixed-contract.json and flaredb-migration-contract.log that destructive DDL / fully automated online migration sit outside the supported surface.
- Unproven: storage pressure and multi-node repair on real hardware. Fully automated online migration and destructive online DDL cutover are intentionally unsupported in this release.
- P0: the current static survey found no fatal file-level breakage.
- P1: FDB-P1-01 became scope-fixed final on 2026-04-10. Logical backup/restore of the supported SQL/KV surface is fixed by durability-proof and the docs, and online migration / schema evolution is settled as additive-first on a backup/restore baseline. rollout-soak preserves additive SQL after a member restart as a live KVM artifact, and run-core-control-plane-ops-proof.sh fixes in scope-fixed-contract.json that destructive DDL and fully automated online migration are outside the supported surface. Any future work is a scope extension: handle a destructive online-migration proof in a separate tranche.
- P2: FDB-P2-01: the per-namespace strong/eventual policy is buried in module defaults and is still weak as an operator-facing contract.
- Dependencies: uses chainfire for placement/coordination; local disk; referenced by iam, prismnet, flashdns, fiberlb, plasmavmc, lightningstor, creditservice, and k8shost.
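For orientation, the supported REST surface above can be probed directly. The paths come from README.md, but the port and the SQL request body shape here are a hypothetical sketch rather than a verified schema:

```bash
# Hypothetical probe of the supported FlareDB REST surface. The endpoint
# paths are documented; the port and the JSON field name "sql" are
# illustrative assumptions only.
curl -sS -X POST http://127.0.0.1:8082/api/v1/sql \
  -H 'Content-Type: application/json' \
  -d '{"sql": "SELECT 1"}'

curl -sS http://127.0.0.1:8082/api/v1/tables
```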
iam
- Responsibilities: identity, token issuance, authn, authz, and tenant principal management.
- Canonical entrypoint: nix/modules/iam.nix; iam/crates/iam-server/src/main.rs; the API package is iam/crates/iam-api/src/lib.rs.
- Current evidence: README.md and docs/component-matrix.md treat it as a core component; nix/modules/iam.nix canonically generates the chainfire/flaredb connections; the iam-authn, iam-authz, and iam-store crates are separated; fresh-matrix and the gateway path presuppose IAM via credit/k8shost/plasmavmc; run-core-control-plane-ops-proof.sh saves iam-key-rotation-tests.log, iam-credential-rotation-tests.log, iam-mtls-rotation-tests.log, scope-fixed-contract.json, and result.json under /mnt/d2/centra/photoncloud-monorepo/work/core-control-plane-ops-proof/20260410T172148+09:00, fixing bootstrap hardening, signing-key rotation, credential overlap rotation, and mTLS overlap rotation as standalone proofs.
- Unproven: multi-node IAM failover; a same-lane lifecycle proof across the whole backend matrix.
- P0: the current static survey found no fatal file-level breakage.
- P1: IAM-P1-01 became scope-fixed final on 2026-04-10. Bootstrap hardening and token/signing-key rotation are fixed standalone by docs/control-plane-ops.md and run-core-control-plane-ops-proof.sh, and the same proof root now also preserves credential overlap-and-revoke rotation and mTLS overlap-and-cutover rotation. Multi-node IAM failover was moved explicitly outside the supported surface. Any future work is a scope extension: handle a clustered IAM failover proof in a separate tranche.
- P2: IAM-P2-01: the harness does not yet cover the full flaredb/postgres/sqlite/memory backend matrix.
- Dependencies: flaredb is the primary storage; optional chainfire; prismnet, flashdns, fiberlb, plasmavmc, lightningstor, creditservice, k8shost, and apigateway are consumers.
prismnet
- Responsibilities: the tenant network control plane. Handles VPCs, subnets, ports, routers, security groups, and service IP pools.
- Canonical entrypoint: nix/modules/prismnet.nix; prismnet/crates/prismnet-server/src/main.rs; the API is prismnet/crates/prismnet-api/proto/prismnet.proto.
- Current evidence: docs/testing.md and README.md name fresh-matrix's VPC/subnet/port and security-group ACL add/remove as the canonical proof; prismnet/crates/prismnet-server/src/services/* has the service implementations; prismnet/crates/prismnet-server/src/ovn/client.rs has the OVN client; nix/modules/prismnet.nix generates binary-consumed config.
- Unproven: a real OVS/OVN dataplane; a real-hardware proof of the DHCP/metadata service; multi-rack network integration.
- P0: the current static survey found no fatal file-level breakage.
- P1: PRISMNET-P1-01 was narrowed on 2026-04-10. provider-vm-reality-proof now preserves VPC/subnet/port lifecycle, security-group ACL add/remove, and attached-VM networking artifacts in a dated root on the local KVM lab. The unresolved next stage is promoting the real OVS/OVN dataplane and hardware-switch integration to release proofs.
- P2: PRISMNET-P2-01: ovn/mock.rs sits close by, so the boundary between the supported path and archived/test paths needs continued watching.
- Dependencies: iam, flaredb, optional chainfire; consumers are flashdns, fiberlb, plasmavmc, and k8shost.
flashdns
- Responsibilities: authoritative DNS publication. Holds tenant records, reverse zones, and the DNS handler.
- Canonical entrypoint: nix/modules/flashdns.nix; flashdns/crates/flashdns-server/src/main.rs; flashdns/crates/flashdns-server/src/dns/*.
- Current evidence: docs/testing.md and README.md treat record publication in fresh-matrix as the canonical proof; the flashdns server has record/zone/reverse-zone services; nix/modules/flashdns.nix generates binary-consumed config.
- Unproven: real port 53 exposure; upstream/secondary integration; failover with real network gear.
- P0: the current static survey found no fatal file-level breakage.
- P1: FLASHDNS-P1-01 was narrowed on 2026-04-10. provider-vm-reality-proof now preserves authoritative workload/service answers in a dated root, so local-KVM publication evidence has entered the release lane. The unresolved next stage is extending real port 53 exposure and upstream/secondary interop to a hardware or external-network proof.
- P2: FLASHDNS-P2-01 was resolved on 2026-04-10. The single-node dev optional bundle now has TCP health gating in nix/single-node/surface.nix.
- Dependencies: iam, flaredb, optional chainfire; publication sources are k8shost and fleet-scheduler.
fiberlb
- Responsibilities: service publication / VIPs / L4-L7 load balancing / native BGP advertisement.
- Canonical entrypoint: nix/modules/fiberlb.nix; fiberlb/crates/fiberlb-server/src/main.rs; the dataplane is dataplane.rs, l7_dataplane.rs, vip_manager.rs, bgp_client.rs.
- Current evidence: README.md and docs/testing.md treat fresh-matrix's TCP and TLS-terminated Https/TerminatedHttps listeners as the canonical proof; the server code has native BGP/BFD, VIP ownership, a TLS store, and an L7 dataplane implementation; the L4 algorithms have in-tree tests.
- Unproven: interop with real BGP peers; a hardware proof of L2/VIP ownership; IPv6 and mixed peer topologies.
- P0: the current static survey found no fatal file-level breakage.
- P1: FIBERLB-P1-01 became scope-fixed on 2026-04-10. The HTTPS health check in fiberlb/crates/fiberlb-server/src/healthcheck.rs still does not verify backend TLS certificates, but the reason and the supported scope (TCP reachability + HTTP status) are fixed in docs/guard/source comments. Future CA-aware verification is a separate tranche.
- P1: FIBERLB-P1-02 was narrowed on 2026-04-10. provider-vm-reality-proof now preserves listener publication, backend disable, drain, restore, and re-convergence artifacts in a dated root. The unresolved next stage is extending native BGP/BFD peer interop and hardware VIP ownership to a real network proof.
- P2: FIBERLB-P2-01 was resolved on 2026-04-10. The single-node dev optional bundle now has TCP health gating in nix/single-node/surface.nix.
- Dependencies: iam, flaredb, optional chainfire; publication consumers are k8shost and fleet-scheduler; real network peers are required.
plasmavmc
- Responsibilities: the tenant VM control plane and worker agent. Holds VM lifecycle, image materialization, worker registration, and hypervisor integration.
- Canonical entrypoint: nix/modules/plasmavmc.nix; plasmavmc/crates/plasmavmc-server/src/main.rs; the supported public backend is plasmavmc-kvm.
- Current evidence: README.md documents the KVM-only public contract; docs/testing.md treats HYPERVISOR_TYPE_KVM in single-node-quickstart, fresh-smoke, and fresh-matrix as the canonical proof; vm_service.rs places everything other than HYPERVISOR_TYPE_KVM outside the public surface; volume_manager.rs has the coronafs/lightningstor integration.
- Unproven: migration / storage handoff on real hardware; long-running guest upgrades; recovery under combined network + storage faults.
- P0: the current static survey found no fatal file-level breakage.
- P1: PLASMAVMC-P1-01 was narrowed on 2026-04-10. provider-vm-reality-proof now preserves shared-storage migration, PrismNet-attached post-migration networking, CoronaFS handoff, and post-migration restart state in a dated root. The unresolved next stage is adding a release proof for real-hardware migration and storage handoff.
- P2: PLASMAVMC-P2-01: archived Firecracker / mvisor code remains in-tree, so backflow into the supported surface must stay guarded.
- Dependencies: iam, flaredb, prismnet, optional chainfire, lightningstor, coronafs, host KVM/QEMU.
coronafs
- Responsibilities: the mutable VM volume layer. Manages raw volumes and exports them to workers with qemu-nbd.
- Canonical entrypoint: nix/modules/coronafs.nix; coronafs/crates/coronafs-server/src/main.rs; the product description is coronafs/README.md.
- Current evidence: coronafs/README.md states the split as the mutable VM-volume layer; coronafs-server has /healthz and the volume/export APIs; docs/testing.md proves plasmavmc + coronafs + lightningstor in fresh-matrix; plasmavmc/volume_manager.rs has deep integration.
- Unproven: long-duration endurance of recovery after export interruption; the latency budget on real disks and a real network.
- P0: the current static survey found no fatal file-level breakage.
- P1: CORONAFS-P1-01 was resolved on 2026-04-10. The quickstart health URL in nix/single-node/surface.nix was corrected to http://127.0.0.1:50088/healthz.
- P1: CORONAFS-P1-02 was resolved on 2026-04-10. durability-proof has a canonical failure-injection lane that verifies node-local materialized-volume reads and the node-only capability split during a controller outage.
- P2: CORONAFS-P2-01: storage benchmarks exist, but the recovery path still carries too little weight in the canonical publish gate.
- Dependencies: qemu-nbd, qemu-img, local disk; optional chainfire metadata backend; the primary consumer is plasmavmc.
lightningstor
- Responsibilities: object storage and VM image backing. Has a metadata plane and a data-node plane.
- Canonical entrypoint: nix/modules/lightningstor.nix; lightningstor/crates/lightningstor-server/src/main.rs; lightningstor/crates/lightningstor-node/src/main.rs; the S3 path is src/s3/*.
- Current evidence: README.md documents bucket versioning / policy / tagging / object version listing as the supported surface; docs/testing.md proves bucket metadata and the object-version APIs in fresh-matrix; the server has S3 auth, a distributed backend, and a repair queue; the module has metadata/data/all-in-one modes.
- Unproven: real-hardware failover of the distributed backend; S3 compatibility breadth; cold-start image distribution on hardware.
- P0: the current static survey found no fatal file-level breakage.
- P1: LIGHTNINGSTOR-P1-01 was resolved on 2026-04-10. durability-proof preserves write/head/read during a node05 outage and repair/read-back after service restore as canonical failure-injection artifacts.
- P2: LIGHTNINGSTOR-P2-01 was resolved on 2026-04-10. The single-node dev optional bundle now has TCP health gating in nix/single-node/surface.nix.
- Dependencies: iam, flaredb, optional chainfire; optional lightningstor-node; consumers are plasmavmc and tenant object clients.
k8shost
- Responsibilities: the tenant workload API surface. Handles pods/deployments/services and projects them onto prismnet, flashdns, fiberlb, and optional creditservice.
- Canonical entrypoint: nix/modules/k8shost.nix; k8shost/crates/k8shost-server/src/main.rs; the API protobuf is k8shost/crates/k8shost-proto/proto/k8s.proto.
- Current evidence: k8shost/README.md defines the supported scope; README.md documents WatchPods as a bounded snapshot stream; k8shost-server/src/services/pod.rs implements a ReceiverStream-based WatchPods; docs/testing.md proves the API contract in fresh-smoke/fresh-matrix; on 2026-04-10 the docs/guard/TODO fixed it to the API/control-plane product surface only.
- Unproven: a real workload runtime; a tenant networking dataplane with real CNI/CSI; node-level execution semantics.
- P0: K8SHOST-P0-01 was resolved on 2026-04-10. The real workload dataplane (k8shost-cni, k8shost-controllers, lightningstor-csi) was fixed as archived non-product, aligning the product narrative to API/control-plane scope only.
- P1: K8SHOST-P1-01 was scope-resolved on 2026-04-10. That the canonical proof is API-contract centered is itself now documented as the product boundary, and a real pod runtime was removed from the product claims.
- P2: K8SHOST-P2-01 was resolved on 2026-04-10. The non-canonical status of the archived scaffolds stays monitored via supported-surface-guard contract markers.
- Dependencies: iam, flaredb, chainfire, prismnet, flashdns, fiberlb, optional creditservice.
apigateway
- Responsibilities: the external API/proxy surface. Holds routes, auth providers, the credit provider, and request mediation.
- Canonical entrypoint: nix/modules/apigateway.nix; apigateway/crates/apigateway-server/src/main.rs.
- Current evidence: node06 starts apigateway as the canonical gateway node; docs/testing.md and nix/test-cluster/README.md include API-gateway-mediated flows in fresh-matrix; the server code has routes, auth, the credit provider, upstream timeouts, and request IDs.
- Unproven: multi-node HA; config distribution / reload; the TLS termination strategy; the gateway as product docs.
- P0: the current static survey found no fatal file-level breakage.
- P1: APIGW-P1-01 became scope-fixed on 2026-04-10. APIGateway is supported as stateless replicated behind external L4/VIP, config distribution is rendered config + restart-based rollout, and live in-process reload is fixed as unsupported in the docs. What remains for the next stage is adding a dedicated multi-gateway HA proof.
- P2: APIGW-P2-01: the release proof leans on node06 and indirect dependence through fresh-matrix; there is no dedicated smoke gate.
- Dependencies: upstream services; optional iam/creditservice providers; external clients.
nightlight
- Responsibilities: metrics ingestion and query. Has the Prometheus remote_write / query APIs and gRPC query/admin.
- Canonical entrypoint: nix/modules/nightlight.nix; nightlight/crates/nightlight-server/src/main.rs; the API proto is nightlight/crates/nightlight-api/proto/*.
- Current evidence: nightlight-server binds both HTTP and gRPC; node06 starts it on the gateway node; docs/testing.md and nix/test-cluster/README.md describe the host-forward proof of the NightLight HTTP surface; there is a local WAL/snapshot/retention loop.
- Unproven: a replicated metrics topology; large retention; sustained remote_write load; tenant isolation.
- P0: the current static survey found no fatal file-level breakage.
- P1: NIGHTLIGHT-P1-01 became scope-fixed on 2026-04-10. NightLight's product shape is fixed as a single-node WAL/snapshot service, and the replicated / HA metrics path is reflected as unsupported in docs and guard.
- P2: NIGHTLIGHT-P2-01 was narrowed on 2026-04-10. The docs fix that the tenant boundary is deployment-scoped or upstream-auth-scoped and that hard in-process multi-tenant auth and per-tenant retention are not part of the current product contract. The next stage is adding an auth- or quota-aware multi-tenant proof.
- Dependencies: local disk; optional apigateway; external metric writers/readers.
creditservice
- Responsibilities: quota, wallet, reservation, and admission control.
- Canonical entrypoint: nix/modules/creditservice.nix; creditservice/crates/creditservice-server/src/main.rs; the product scope is creditservice/README.md.
- Current evidence: creditservice/README.md documents the supported scope and non-goals; docs/testing.md proves the quota/wallet/reservation/API-gateway path in fresh-matrix; the module has iamAddr, flaredbAddr, and an optional SQL backend; node06 starts it on the canonical gateway node.
- Unproven: backend migration; operation separated from a finance system; the export/reporting path.
- P0: the current static survey found no fatal file-level breakage.
- P1: CREDIT-P1-01: boundary upkeep is needed so the product narrative does not swell past the README's non-goals into a finance ledger.
- P2: CREDIT-P2-01 was narrowed on 2026-04-10. Export and backend migration are fixed in the README as offline export/import or backend-native snapshot workflows, and live mixed-writer migration is explicitly unsupported. The next stage is adding a dedicated export proof.
- Dependencies: iam, flaredb, optional chainfire; apigateway, k8shost, and the tenant admission flow are consumers.
deployer
- Responsibilities: the bootstrap and rollout-intent authority. Holds /api/v1/phone-home, install plans, the desired-system reference, and cluster inventory.
- Canonical entrypoint: nix/modules/deployer.nix; deployer/crates/deployer-server/src/main.rs; route wiring is deployer/crates/deployer-server/src/lib.rs.
- Current evidence: /api/v1/phone-home exists in the server routes; nix/modules/deployer.nix has the package/service/cluster-state seed; docs/testing.md, docs/rollout-bundle.md, and nix/test-cluster/README.md name baremetal-iso, baremetal-iso-e2e, deployer-vm-smoke, deployer-bootstrap-e2e, durability-proof, and rollout-soak as the canonical proofs; verify-baremetal-iso.sh traces the install path end to end; the 2026-04-10 durability-proof saves deployer-pre-register-request.json, deployer-backup-list.json, deployer-post-restart-list.json, and deployer-replayed-list.json, and rollout-soak preserves the longer-run live restart and the release boundary markers in /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900/deployer-post-restart-nodes.json, scope-fixed-contract.json, deployer-scope-fixed.txt, and deployer-journal.log.
- Unproven: physical USB/BMC install; true HA for deployer itself; an implementation of ChainFire-backed multi-instance active failover; physical verification of operator disaster recovery.
- P0: DEPLOYER-P0-01: the current canonical bare-metal proof stops at QEMU-as-hardware; there is still no physical regression lane.
- P1: DEPLOYER-P1-01 became scope-fixed final on 2026-04-10. The release contract is fixed at one active writer plus optional cold-standby restore with ultracloud.cluster state re-apply and preserved admin request replay, and automatic ChainFire-backed multi-instance failover was moved explicitly outside the supported surface. rollout-soak preserves the live restart proof and boundary markers in /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900/deployer-post-restart-nodes.json, scope-fixed-contract.json, and deployer-scope-fixed.txt. Any future work treats a true HA implementation as a scope extension in a separate ticket.
- P2: DEPLOYER-P2-01: the standard production shapes for supplying bootstrapFlakeBundle and the optional binary cache are still under-documented.
- Dependencies: chainfire; nix-agent; the install target; the ISO/first-boot path; an optional binary cache.
fleet-scheduler
- Responsibilities: the non-Kubernetes native service scheduler. Holds cluster-native service placement, failover, and publication reconciliation.
- Canonical entrypoint: nix/modules/fleet-scheduler.nix; deployer/crates/fleet-scheduler/src/main.rs; the publication code is publish.rs.
- Current evidence: docs/testing.md, docs/rollout-bundle.md, and nix/test-cluster/README.md name fresh-smoke, fresh-matrix, fleet-scheduler-e2e, and rollout-soak as the proofs of this boundary; the module has iamEndpoint, fiberlbEndpoint, flashdnsEndpoint, and heartbeatTimeoutSecs; the scheduler code has the chainfire watch, dependency summaries, and publication reconciliation; fresh-smoke drives node04 -> draining, a node05 fail-stop, and replica restore after the worker returns, and rollout-soak preserves the scope-fixed longer-run proof in /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T164549+0900/maintenance-held.json, power-loss-held.json, fleet-scheduler-post-restart.json, scope-fixed-contract.json, and fleet-scheduler-scope-fixed.txt.
- Unproven: large clusters; multi-hour maintenance windows; drain choreography with an operator approval workflow.
- P0: the current static survey found no fatal file-level breakage.
- P1: FLEET-P1-01 became scope-fixed final on 2026-04-10. The release contract is fixed at one planned drain cycle + one fail-stop worker-loss cycle + 30-second held degraded states on two native-runtime workers, and rollout-soak preserves that upper bound as live KVM artifacts in the files listed above. Multi-hour maintenance windows, pinned singleton policies, operator approval workflows, and larger-cluster drain storms were moved explicitly outside the supported surface.
- P2: FLEET-P2-01 was resolved on 2026-04-10. The module/binary default chainfireEndpoint was aligned to the canonical http://127.0.0.1:2379.
- Dependencies: chainfire; node-agent; optional iam, fiberlb, flashdns.
nix-agent
- Responsibilities: host-local NixOS convergence only. Builds and applies the desired system and handles health checks and rollback.
- Canonical entrypoint: nix/modules/nix-agent.nix; deployer/crates/nix-agent/src/main.rs.
- Current evidence: docs/testing.md, docs/rollout-bundle.md, and nix/test-cluster/README.md name baremetal-iso, baremetal-iso-e2e, deployer-vm-smoke, deployer-vm-rollback, and portable-control-plane-regressions as the proofs; the code has desired-system, observed-system, rollback-on-failure, and health-check-command; nix/modules/nix-agent.nix canonically generates that CLI contract; the 2026-04-10 rollout-soak saves /mnt/d2/centra/photoncloud-monorepo/work/rollout-soak/20260410T154744+0900/node01-nix-agent-scope.txt and node04-nix-agent-scope.txt, fixing via artifacts and docs the boundary that the steady-state test-cluster does not pretend a live nix-agent.service restart.
- Unproven: rollback under kernel/network failure; multi-node wave rollout; real-hardware recovery after a partial switch.
- P0: the current static survey found no fatal file-level breakage.
- P1: NIXAGENT-P1-01 was resolved on 2026-04-10. The healthCheckCommand argv contract, the rolled-back semantics of rollbackOnFailure, the deployer-vm-rollback proof, and the partial-failure recovery procedure are fixed in docs/rollout-bundle.md and docs/testing.md.
- P2: NIXAGENT-P2-01 was resolved on 2026-04-10. The module/binary default chainfireEndpoint was aligned to the canonical http://127.0.0.1:2379.
- Dependencies: chainfire; the desired system published by deployer; the local NixOS flake / switch-to-configuration.
node-agent
- Responsibilities: host-local runtime reconcile only. Handles native service instance heartbeats, process/container execution, and local observed state.
- Canonical entrypoint: nix/modules/node-agent.nix; deployer/crates/node-agent/src/main.rs.
- Current evidence: docs/testing.md, docs/rollout-bundle.md, and nix/test-cluster/README.md name fresh-smoke, fresh-matrix, fleet-scheduler-e2e, and portable-control-plane-regressions as the proofs; the code has watcher, agent, and process; the module has Podman enable, stateDir, pidDir, and allowLocalInstanceUpsert; process.rs implements the ${stateDir}/pids/*.log and ${stateDir}/pids/*.meta.json contracts.
- Unproven: heterogeneous runtime support; fine-grained SLOs for crash-looping host services; the secret-rotation workflow itself.
- P0: the current static survey found no fatal file-level breakage.
- P1: NODEAGENT-P1-01 was resolved on 2026-04-10. The logs / secrets / volume / upgrade contracts are fixed in docs/rollout-bundle.md and the module descriptions.
- P2: NODEAGENT-P2-01 was resolved on 2026-04-10. The module/binary default chainfireEndpoint was aligned to the canonical http://127.0.0.1:2379.
- Dependencies: chainfire; fleet-scheduler; optional Podman; the host systemd/process model.
Nix/bootstrap/harness
- Responsibilities: define the product surface and canonicalize the NixOS outputs and the VM/QEMU harness for single-node dev, the 3-node HA control plane, and bare-metal bootstrap.
- Canonical entrypoint: flake.nix; nix/modules/default.nix; nix/single-node/base.nix; nix/test-cluster/run-publishable-kvm-suite.sh; nix/test-cluster/run-local-baseline.sh; nix/test-cluster/verify-baremetal-iso.sh; nix/nodes/baremetal-qemu/*.
- Current evidence: flake.nix has single-node-quickstart, single-node-trial-vm, canonical-profile-eval-guards, portable-control-plane-regressions, and baremetal-iso-e2e; nix/modules/default.nix imports the current module surface in one place; nix/single-node/base.nix assembles the minimal VM platform core and the optional bundles; run-publishable-kvm-suite.sh and run-local-baseline.sh pin local CPU parallelism and the local builder; verify-baremetal-iso.sh traces ISO -> phone-home -> bundle fetch -> Disko -> reboot -> nix-agent active; run-cluster.sh gained durability-proof and rollout-soak, saving the chainfire, flaredb, deployer, coronafs, and lightningstor backup/restore and failure-injection artifacts under /work/durability-proof and the longer-run rollout/control-plane maintenance artifacts under /work/rollout-soak; on the 2026-04-10 local AMD/KVM baseline, the six required checks plus single-node-quickstart, baremetal-iso, and fresh-smoke all passed.
- Unproven: physical USB/BMC install; an automatic guard for /nix/store capacity control; a release proof of the all-optional-bundles quickstart; a non-Nix easy-trial artifact.
- P0: HARNESS-P0-01: there is still no real-hardware regression lane; the canonical bare-metal proof remains a QEMU stand-in.
- P1: HARNESS-P1-01 was resolved on 2026-04-10. The quickstart optional-bundle health gating was aligned to TCP probes for lightningstor, flashdns, and fiberlb and to 50088/healthz for coronafs.
- P1: HARNESS-P1-02 became scope-fixed on 2026-04-10. The easy trial is satisfied by the single-node-trial-vm Nix VM appliance, and the reason a lighter Docker/OCI-style trial path is not supported is aligned across docs/edge-trial-surface.md, README.md, docs/testing.md, docs/component-matrix.md, nix/single-node/surface.nix, and supported-surface-guard.
- P1: HARNESS-P1-03 was resolved on 2026-04-10. fresh-smoke's stale VM cleanup is limited to PIDs contained in the current profile's vm_dir/vde_switch_dir, so it no longer sweeps up same-named cluster VMs from other checkouts.
- P2: HARNESS-P2-01 was resolved on 2026-04-10. In addition to ./work and local builder parallelism, ./nix/test-cluster/work-root-budget.sh now has enforce and prune-proof-logs alongside status, providing not just a disk-budget advisory but a stronger local budget gate and a safer dated-proof cleanup workflow.
- Dependencies: nix, nixpkgs, QEMU/KVM, host disk under ./work, local CPU parallelism, and all component modules.
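To reproduce this baseline end to end under the same local-only constraints, the runner is the single entrypoint; a minimal sketch (its argument conventions are not restated in this ticket, so a bare invocation is assumed):

```bash
# Re-executes the guard/build checks and runtime lanes with local-only
# builders and all logs pinned under ./work.
./nix/test-cluster/run-local-baseline.sh
```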
Notes For The Next Implementation Agent
- Handling DEPLOYER-P0-01 / HARNESS-P0-01 first reduces the remaining hardware-proof and physical operator-path items at low cost.
- For baseline reproduction, nix/test-cluster/run-local-baseline.sh re-executes the same path while keeping the local-only builder and the ./work-rooted logs fixed.
- After that, advancing DEPLOYER-P0-01 / HARNESS-P0-01 to a physical-machine smoke moves the proof from QEMU-only to the hardware path.
- DEPLOYER-P1-01 and FLEET-P1-01 are now scope-fixed final. If they are ever reopened, treat true deployer HA or a larger-cluster scheduler maintenance proof as a separate tranche that extends the current release boundary.
- FIBERLB-P1-01 is scope-fixed, but if backend certificate verification is ever productized, the limited contract in docs/guard will need rewriting.