
Core Control Plane Operations

This document fixes the supported operator lifecycle for the core control-plane services: chainfire, flaredb, and iam.

ChainFire Membership And Node Replacement

ChainFire supports live membership add and remove, in-place endpoint replacement, and live leader transfer on its supported surface. The supported reconfiguration boundary is sequential single-voter transitions; arbitrary multi-voter swaps still require future joint-consensus work.

The supported public surface is the replicated cluster API documented in chainfire-api: MemberAdd, MemberRemove, MemberList, Status, and LeaderTransfer operate on the current committed membership rather than only the bootstrap shape.

Supported operator actions today (see the sketch after this list):

  1. Scale out by adding a learner or voter with MemberAdd.
  2. Promote a learner to voter by re-adding the same member ID with is_learner=false.
  3. Replace a learner, follower, voter, or current-leader endpoint in place by re-adding the same member ID with updated peer or client URLs.
  4. Hand leadership to another live voting member with LeaderTransfer before maintenance, so the cluster does not take an unplanned election hit when the current leader goes down.
  5. Scale in or retire a learner, follower, voter, or current leader with MemberRemove; when the current leader is removed, the remaining voters elect the replacement leader.
  6. Use the canonical durability-proof backup/restore lane before disruptive maintenance or before a membership change you cannot quickly roll back.
  7. Use nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof when you need the dedicated KVM proof for scale-out, learner promotion, leader transfer, temporary-voter restart, current-leader removal, re-add, and scale-in on the canonical control-plane shape.
  8. Use nix run ./nix/test-cluster#cluster -- rollout-soak when you need the longer-running restart and degraded-service proof for the canonical control-plane shape after maintenance or rollout work.
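
A minimal sketch of that lifecycle, assuming a hypothetical chainfirectl wrapper over the operations above; MemberAdd, MemberRemove, MemberList, Status, and LeaderTransfer come from the documented chainfire-api surface, but the CLI name, flags, and endpoints are illustrative, not the published invocation shape:

```bash
# Hypothetical chainfirectl CLI over the chainfire-api surface; names and
# flags are illustrative assumptions, not the published invocation shape.

# 1. Scale out: add node04 as a learner first so it can catch up safely.
chainfirectl member-add --id node04 \
  --peer-url https://node04:2380 --client-url https://node04:2379 \
  --learner

# 2. Promote the learner to a voter by re-adding the same member ID.
chainfirectl member-add --id node04 \
  --peer-url https://node04:2380 --client-url https://node04:2379 \
  --learner=false

# 3. Replace an endpoint in place by re-adding the same ID with new URLs.
chainfirectl member-add --id node02 \
  --peer-url https://node02-new:2380 --client-url https://node02-new:2379

# 4. Move leadership off a node before taking it down for maintenance.
chainfirectl leader-transfer --target node03

# 5. Scale in: remove the retired member and confirm the committed shape.
chainfirectl member-remove --id node04
chainfirectl member-list
chainfirectl status
```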

Unsupported operator actions today:

  1. Treating internal Raft helpers outside chainfire-api and chainfire-server as the supported operator contract.
  2. Treating larger-cluster, hardware, or arbitrary-topology live reconfiguration beyond the canonical KVM proof lane as release-proven. The current proof is fixed to the canonical 3-node control plane plus one temporary node04 replica.

The focused boundary proof is ./nix/test-cluster/run-core-control-plane-ops-proof.sh, which records the published ChainFire API surface and the public docs markers under ./work/core-control-plane-ops-proof. The dedicated live-membership KVM proof is nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof, which records learner add, voter promotion, live leader transfer, temporary-voter restart, current-leader removal, removed-leader re-add, and final scale-in artifacts under ./work/chainfire-live-membership-proof. The live-operations restart companion remains nix run ./nix/test-cluster#cluster -- rollout-soak, which on 2026-04-10 recorded chainfire-post-restart-put.json, chainfire-post-restart.json, and post-control-plane-restarts.json under ./work/rollout-soak/20260410T164549+0900 after repeated maintenance and worker power-loss for the canonical 3-node control-plane shape.
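
The three proof lanes above, collected for convenience; the invocations and artifact roots are exactly the ones named in this section:

```bash
# Boundary proof: records the published API surface and docs markers.
./nix/test-cluster/run-core-control-plane-ops-proof.sh
ls ./work/core-control-plane-ops-proof

# Dedicated KVM live-membership proof for the canonical control-plane shape.
nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof
ls ./work/chainfire-live-membership-proof

# Longer-running restart and degraded-service proof.
nix run ./nix/test-cluster#cluster -- rollout-soak
ls ./work/rollout-soak
```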

FlareDB Online Migration And Schema Evolution

FlareDB online migration and schema evolution must start from the durability-proof backup/restore baseline.

The supported operator contract is additive-first schema evolution (example below):

  1. Run nix run ./nix/test-cluster#cluster -- durability-proof or keep an equivalent logical backup artifact before changing schema.
  2. Apply additive changes first: new tables, new nullable columns, new indexes, and code paths that tolerate both old and new shapes.
  3. Backfill data and cut read traffic to the new schema before deleting or rewriting old state.
  4. Treat destructive cleanup, DROP TABLE, and incompatible column rewrites as a later maintenance step after a fresh backup.
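
A minimal sketch of one additive-first cycle, assuming a hypothetical flaredb-cli SQL entry point and an example orders table; the SQL illustrates the additive boundary, not a published migration tool:

```bash
# Hypothetical flaredb-cli SQL entry point; the client name and the orders
# schema are illustrative assumptions.

# 0. Fresh logical backup before any schema change.
nix run ./nix/test-cluster#cluster -- durability-proof

# 1. Additive changes only: new table, new nullable column, new index.
flaredb-cli <<'SQL'
CREATE TABLE order_events (id BIGINT PRIMARY KEY, order_id BIGINT, kind TEXT);
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP NULL;
CREATE INDEX order_events_order_id_idx ON order_events (order_id);
SQL

# 2. Backfill (illustrative source column), then cut reads over to the new
#    shape before deleting or rewriting old state.
flaredb-cli <<'SQL'
UPDATE orders SET shipped_at = legacy_shipped_at WHERE shipped_at IS NULL;
SQL

# 3. Destructive cleanup (DROP TABLE, incompatible column rewrites) waits
#    for a later maintenance step, after another fresh backup.
```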

This keeps the migration runbook consistent with the current product proof: the durability lane proves logical SQL backup/restore, and the 2026-04-10 rollout-soak artifact root ./work/rollout-soak/20260410T164549+0900 rechecks additive SQL operations through flaredb-post-restart-create.json, flaredb-post-restart-insert.json, and flaredb-post-restart.json after a FlareDB member restart. The operator contract for live changes stays additive schema evolution rather than destructive in-place rewrites.

FlareDB destructive DDL and fully automated online migration remain outside the supported product contract for this release. When you need DROP TABLE, incompatible column rewrites, or automated destructive cutover, stop at the additive-first boundary above, take a fresh logical backup, and treat the destructive step as an explicit offline maintenance action rather than a release-proven online behavior.
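
When a destructive step is unavoidable, a sketch of the offline maintenance shape, under the same flaredb-cli assumption as above:

```bash
# Take a fresh logical backup immediately before the destructive step.
nix run ./nix/test-cluster#cluster -- durability-proof

# Then run the destructive DDL as an explicit offline maintenance action,
# outside the release-proven online contract.
flaredb-cli <<'SQL'
DROP TABLE legacy_orders;
SQL
```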

Internal Raft membership helpers in flaredb-raft exist for implementation work, but they are not the published operator API for schema migration.

IAM Bootstrap Hardening And Rotation

IAM bootstrap hardening requires an explicit admin token, an explicit signing key, and a 32-byte IAM_CRED_MASTER_KEY; signing-key rotation, credential rotation, and mTLS overlap-and-cutover rotation are the supported recovery paths.

Production bootstrap contract (environment sketch below):

  1. Set IAM_ADMIN_TOKEN or PHOTON_IAM_ADMIN_TOKEN.
  2. Set authn.internal_token.signing_key in config or provide the equivalent environment-backed configuration.
  3. Set IAM_CRED_MASTER_KEY to a 32-byte value before enabling credential issuance.
  4. Keep admin.allow_unauthenticated=true, IAM_ALLOW_UNAUTHENTICATED_ADMIN=true, and random signing keys limited to local development or lab proof environments.
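
A minimal bootstrap sketch; the variable and setting names come from the contract above, but the key encoding IAM expects and the config file layout are assumptions to confirm against iam-server:

```bash
# Explicit admin token; IAM_ADMIN_TOKEN or PHOTON_IAM_ADMIN_TOKEN per the
# contract above. The generation command is just one way to get a value.
export IAM_ADMIN_TOKEN="$(openssl rand -hex 32)"

# 32 random bytes for credential issuance. Whether IAM expects the raw
# bytes, hex, or base64 is an assumption to confirm against iam-server.
export IAM_CRED_MASTER_KEY="$(openssl rand -base64 32)"

# Explicit signing key in config; the TOML layout is illustrative, the
# setting name comes from the contract above.
cat >> iam.toml <<'EOF'
[authn.internal_token]
signing_key = "replace-with-an-explicitly-managed-key"
EOF

# Local development or lab proofs only, never production:
#   admin.allow_unauthenticated=true
#   IAM_ALLOW_UNAUTHENTICATED_ADMIN=true
```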

Supported token and key rotation flow (configuration sketch below):

  1. Add the new signing key and keep the old key available for verification during the overlap window.
  2. Issue new tokens from the new active key.
  3. Wait for the maximum supported token TTL or explicitly revoke the old population before retiring the old key.
  4. Purge retired keys only after the overlap and retirement windows are complete.
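
A sketch of the overlap window as configuration, assuming the signing-key setting can carry an active key plus a verification-only predecessor; only authn.internal_token.signing_key and the four-step ordering are documented, the multi-key shape is an assumption:

```bash
# Step 1: make the new key active while the old key stays available for
# verification during the overlap window.
cat > iam.toml <<'EOF'
[authn.internal_token]
signing_key = "NEW-KEY"             # active key: signs newly issued tokens
# verification_keys = ["OLD-KEY"]   # assumed shape: verify-only predecessor
EOF

# Step 2: tokens issued from here on come from NEW-KEY.
# Step 3: wait out the maximum supported token TTL, or explicitly revoke
#         the old token population, before retiring OLD-KEY.
# Step 4: purge OLD-KEY only once the overlap and retirement windows close.
```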

Supported credential rotation flow (sequence sketch below):

  1. Keep IAM_CRED_MASTER_KEY explicit and stable across the overlap window.
  2. Mint a new credential for the same principal before revoking the old one.
  3. Move clients to the new access key and verify the new credential can still read back its secret material.
  4. Revoke the old credential only after cutover is complete.
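
A sketch of the overlap-and-revoke sequence, assuming a hypothetical iamctl client and a svc-billing principal; only the ordering comes from the contract above:

```bash
# IAM_CRED_MASTER_KEY stays explicit and stable across the whole window.

# 1. Mint the replacement credential for the same principal first.
iamctl credential create --principal svc-billing   # hypothetical CLI

# 2. Point clients at the new access key, then verify the new credential
#    can still read back its secret material.
iamctl credential read --access-key "$NEW_ACCESS_KEY"

# 3. Revoke the old credential only after cutover is complete.
iamctl credential revoke --access-key "$OLD_ACCESS_KEY"
```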

Supported mTLS overlap-and-cutover rotation flow (overlap sketch below):

  1. Configure IAM to trust both the old and new service identity mapping or trust roots during the overlap window.
  2. Issue or install the new client certificate and cut traffic over to it.
  3. Remove the old mapping or trust root only after the new certificate is serving traffic successfully.
  4. Verify the old certificate is rejected once the overlap window closes.
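
A sketch of the mTLS overlap window, assuming file-based trust bundles and openssl for the final rejection check; the bundle and certificate paths are illustrative:

```bash
# 1. Trust both the old and new roots during the overlap window.
cat old-root.pem new-root.pem > iam-trust-bundle.pem

# 2. Install the new client certificate and cut traffic over to it.

# 3. After the new certificate serves traffic successfully, shrink the
#    bundle to the new root only.
cp new-root.pem iam-trust-bundle.pem

# 4. Confirm the old certificate is now rejected.
openssl verify -CAfile iam-trust-bundle.pem old-client.pem  # expect failure
```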

Multi-node IAM failover remains outside the supported product contract for this release. The current release proof is lifecycle-oriented rather than HA-oriented: bootstrap hardening, signing-key rotation, credential overlap-and-revoke rotation, and mTLS overlap-and-cutover rotation are supported; clustered IAM failover is future scope expansion.

The standalone proof is ./nix/test-cluster/run-core-control-plane-ops-proof.sh. It runs the iam-authn signing-key and mTLS rotation tests plus the iam-api credential rotation test, records the bootstrap hardening source markers from iam-server, and persists logs plus result.json and scope-fixed-contract.json under ./work/core-control-plane-ops-proof. The dated 2026-04-10 artifact root is ./work/core-control-plane-ops-proof/20260410T172148+09:00.