# Core Control Plane Operations

This document defines the supported operator lifecycle for the core control-plane services: `chainfire`, `flaredb`, and `iam`.


## ChainFire Membership And Node Replacement

ChainFire dynamic membership, `replace-node`, and `scale-out` are not part of the supported surface.


The supported public surface is the fixed-membership cluster API already documented in `chainfire-api`: `MemberList` and `Status` report the membership that the node booted with, and operators should treat that membership as immutable for a release branch.


Supported operator actions today:


1. Keep the canonical control plane at the documented fixed membership for the branch.
2. Use the canonical `durability-proof` backup/restore lane before disruptive maintenance.
3. Use `nix run ./nix/test-cluster#cluster -- rollout-soak` when you need a longer-running fixed-membership restart proof after maintenance or rollout work.
4. Recover failed nodes by restoring the same fixed-membership cluster shape or by rebuilding the whole cluster with a freshly published static membership and then restoring data.


Unsupported operator actions today:


1. Live `replace-node` through a public ChainFire API.
2. Live `scale-out` by adding new voters on the supported surface.
3. Relying on internal membership helpers as a published product contract.


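One way to check the fixed-membership assumption during maintenance is to compare the membership a node reports against the membership published for the branch. The sketch below is illustrative only: the `Member` record, its field names, and the `membership_drift` helper are hypothetical and are not taken from `chainfire-api`. It assumes the `MemberList` result has already been fetched and converted into plain records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    # Hypothetical shape; the real MemberList response may differ.
    node_id: str
    address: str


def membership_drift(published, reported):
    """Compare reported membership against the published static membership.

    Returns (missing, unexpected), each sorted by node_id. Both lists are
    empty when the cluster still has the shape it was published with.
    """
    published_set = set(published)
    reported_set = set(reported)
    missing = sorted(published_set - reported_set, key=lambda m: m.node_id)
    unexpected = sorted(reported_set - published_set, key=lambda m: m.node_id)
    return missing, unexpected
```

A non-empty result means the cluster no longer matches the published shape. On the supported surface that calls for restoring the same shape or rebuilding the whole cluster, not for a live membership change.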


The focused boundary proof is `./nix/test-cluster/run-core-control-plane-ops-proof.sh`, which records the fixed-membership source marker from `chainfire-api` and the public docs markers under `./work/core-control-plane-ops-proof`. The live-operations companion is `nix run ./nix/test-cluster#cluster -- rollout-soak`, which on 2026-04-10 recorded `chainfire-post-restart-put.json`, `chainfire-post-restart.json`, and `post-control-plane-restarts.json` under `./work/rollout-soak/20260410T164549+0900` after repeated maintenance and worker power-loss, without promoting dynamic membership to supported scope.


## FlareDB Online Migration And Schema Evolution


FlareDB online migration and schema evolution must start from the durability-proof backup/restore baseline.


The supported operator contract is additive-first schema evolution:


1. Run `nix run ./nix/test-cluster#cluster -- durability-proof` or keep an equivalent logical backup artifact before changing schema.
2. Apply additive changes first: new tables, new nullable columns, new indexes, and code paths that tolerate both old and new shapes.
3. Backfill data and cut read traffic to the new schema before deleting or rewriting old state.
4. Treat destructive cleanup, `DROP TABLE`, and incompatible column rewrites as a later maintenance step after a fresh backup.


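The additive-first boundary above can be enforced mechanically before a migration is applied. The following is a rough sketch, not a FlareDB feature: the `classify_ddl` helper is hypothetical, the statement forms are generic SQL rather than a confirmed FlareDB dialect, and a regex pass is only a first filter that a real SQL parser would replace.

```python
import re

# Expects exactly one statement per call; multi-statement strings are not split.
DESTRUCTIVE = [
    re.compile(r"^\s*DROP\s+", re.I),
    re.compile(r"^\s*TRUNCATE\b", re.I),
    re.compile(r"^\s*ALTER\s+TABLE\s+\S+\s+(DROP|ALTER|RENAME)\b", re.I),
]
ADDITIVE = [
    re.compile(r"^\s*CREATE\s+TABLE\b", re.I),
    re.compile(r"^\s*CREATE\s+INDEX\b", re.I),
]
ADD_COLUMN = re.compile(r"^\s*ALTER\s+TABLE\s+\S+\s+ADD\b", re.I)
NOT_NULL = re.compile(r"\bNOT\s+NULL\b", re.I)


def classify_ddl(statement: str) -> str:
    """Return 'additive', 'destructive', or 'review' for one DDL statement."""
    if any(p.search(statement) for p in DESTRUCTIVE):
        return "destructive"
    if ADD_COLUMN.search(statement):
        # Only nullable columns count as additive; anything else needs a human.
        return "review" if NOT_NULL.search(statement) else "additive"
    if any(p.search(statement) for p in ADDITIVE):
        return "additive"
    # Unknown statements are never assumed to be safe.
    return "review"
```

Statements classified as `destructive` belong in the later offline maintenance step after a fresh backup. Statements classified as `review` are deliberately not auto-approved.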


This keeps the migration runbook consistent with the current product proof: the durability lane proves logical SQL backup/restore, and the 2026-04-10 `rollout-soak` artifact root `./work/rollout-soak/20260410T164549+0900` rechecks additive SQL operations through `flaredb-post-restart-create.json`, `flaredb-post-restart-insert.json`, and `flaredb-post-restart.json` after a FlareDB member restart. The operator contract for live changes stays additive schema evolution rather than destructive in-place rewrites.


FlareDB destructive DDL and fully automated online migration remain outside the supported product contract for this release. When you need `DROP TABLE`, incompatible column rewrites, or automated destructive cutover, stop at the additive-first boundary above, take a fresh logical backup, and treat the destructive step as an explicit offline maintenance action rather than a release-proven online behavior.


Internal raft membership helpers in `flaredb-raft` exist for implementation work, but they are not the published operator API for schema migration.


## IAM Bootstrap Hardening And Rotation

IAM bootstrap hardening requires an explicit admin token, an explicit signing key, and a 32-byte `IAM_CRED_MASTER_KEY`. Signing-key rotation, credential rotation, and mTLS overlap-and-cutover rotation are the supported recovery paths.


Production bootstrap contract:


1. Set `IAM_ADMIN_TOKEN` or `PHOTON_IAM_ADMIN_TOKEN`.
2. Set `authn.internal_token.signing_key` in config or provide the equivalent environment-backed configuration.
3. Set `IAM_CRED_MASTER_KEY` to a 32-byte value before enabling credential issuance.
4. Keep `admin.allow_unauthenticated=true`, `IAM_ALLOW_UNAUTHENTICATED_ADMIN=true`, and random signing keys limited to local development or lab proof environments.


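A preflight check along the lines of the contract above might look like the sketch below. It is not the validation that `iam-server` performs. The `bootstrap_problems` helper is hypothetical, the nested-dict config shape is an assumption, and "32-byte" is interpreted here as the UTF-8 length of the environment value, which may not match the encoding IAM actually expects. The environment-backed alternative for the signing key is not modeled.

```python
def bootstrap_problems(env, config):
    """Return a list of human-readable problems; an empty list means the checks pass."""
    problems = []

    # Step 1: either admin token variable satisfies the contract.
    if not (env.get("IAM_ADMIN_TOKEN") or env.get("PHOTON_IAM_ADMIN_TOKEN")):
        problems.append("no admin token: set IAM_ADMIN_TOKEN or PHOTON_IAM_ADMIN_TOKEN")

    # Step 2: explicit signing key (assumed nested config layout).
    signing_key = config.get("authn", {}).get("internal_token", {}).get("signing_key")
    if not signing_key:
        problems.append("authn.internal_token.signing_key is not set")

    # Step 3: master key length (assumption: raw UTF-8 bytes of the value).
    master_key = env.get("IAM_CRED_MASTER_KEY", "")
    if len(master_key.encode("utf-8")) != 32:
        problems.append("IAM_CRED_MASTER_KEY must be a 32-byte value")

    # Step 4: development-only switches must stay off in production.
    if config.get("admin", {}).get("allow_unauthenticated") is True:
        problems.append("admin.allow_unauthenticated=true is for development only")
    if env.get("IAM_ALLOW_UNAUTHENTICATED_ADMIN", "").lower() == "true":
        problems.append("IAM_ALLOW_UNAUTHENTICATED_ADMIN=true is for development only")

    return problems
```

Passing `dict(os.environ)` and the parsed config would let a deployment script refuse to start IAM while the list is non-empty.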


Supported token and key rotation flow:


1. Add the new signing key and keep the old key available for verification during the overlap window.
2. Issue new tokens from the new active key.
3. Wait for the maximum supported token TTL or explicitly revoke the old population before retiring the old key.
4. Purge retired keys only after the overlap and retirement windows are complete.


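The overlap window in the flow above can be illustrated with a toy keyring. This is a sketch of the general technique, not IAM's implementation: the `KeyRing` class, the `key_id.payload.signature` token layout, and the use of HMAC-SHA256 are all assumptions chosen to keep the example self-contained.

```python
import hashlib
import hmac


class KeyRing:
    """Toy signing keyring: one active key, older keys kept for verification."""

    def __init__(self, key_id: str, secret: bytes) -> None:
        self._keys = {key_id: secret}
        self._active = key_id

    def rotate(self, new_key_id: str, new_secret: bytes) -> None:
        # Step 1: add the new key; the old key stays available for verification.
        self._keys[new_key_id] = new_secret
        self._active = new_key_id

    def issue(self, payload: str) -> str:
        # Step 2: new tokens are always signed with the active key.
        return f"{self._active}.{payload}.{self._sign(self._active, payload)}"

    def verify(self, token: str) -> bool:
        try:
            key_id, rest = token.split(".", 1)
            payload, signature = rest.rsplit(".", 1)
        except ValueError:
            return False  # malformed token
        if key_id not in self._keys:
            return False  # key was retired or never existed
        expected = self._sign(key_id, payload)
        return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))

    def retire(self, key_id: str) -> None:
        # Steps 3-4: call only after the overlap window (maximum token TTL) has passed.
        if key_id == self._active:
            raise ValueError("cannot retire the active signing key")
        self._keys.pop(key_id, None)

    def _sign(self, key_id: str, payload: str) -> str:
        digest = hmac.new(self._keys[key_id], payload.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()
```

Tokens issued before `rotate` keep verifying until `retire` removes their key, which is why retirement has to wait for the maximum token TTL or an explicit revocation of the old population.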


Supported credential rotation flow:


1. Keep `IAM_CRED_MASTER_KEY` explicit and stable across the overlap window.
2. Mint a new credential for the same principal before revoking the old one.
3. Move clients to the new access key and verify it can still read back its secret material.
4. Revoke the old credential only after cutover is complete.


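The ordering in the flow above (mint, cut over, verify, then revoke) is the important part, and it can be sketched with an in-memory store. Everything here is hypothetical: `CredentialStore`, `rotate_credential`, and the `cut_over` callback are illustration names, not `iam-api` calls, and the encryption that `IAM_CRED_MASTER_KEY` provides is not modeled.

```python
import secrets


class CredentialStore:
    """In-memory stand-in for a credential backend."""

    def __init__(self) -> None:
        self._entries = {}  # access_key -> (principal, secret)

    def mint(self, principal: str):
        access_key = secrets.token_hex(8)
        secret = secrets.token_hex(16)
        self._entries[access_key] = (principal, secret)
        return access_key, secret

    def read_secret(self, access_key: str):
        entry = self._entries.get(access_key)
        return None if entry is None else entry[1]

    def revoke(self, access_key: str) -> None:
        self._entries.pop(access_key, None)


def rotate_credential(store, principal, old_access_key, cut_over):
    """Rotate a credential; the old one is revoked only if every earlier step succeeds."""
    # Mint the replacement while the old credential is still valid.
    new_access_key, new_secret = store.mint(principal)
    # Move clients; an exception here leaves the old credential untouched.
    cut_over(new_access_key, new_secret)
    # Verify the new access key can still read back its secret material.
    if store.read_secret(new_access_key) != new_secret:
        raise RuntimeError("new credential failed read-back verification")
    # Revoke the old credential last.
    store.revoke(old_access_key)
    return new_access_key
```

Because revocation is the final statement, a failed cutover or failed read-back leaves clients able to fall back to the old credential.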


Supported mTLS overlap-and-cutover rotation flow:


1. Configure IAM to trust both the old and new service identity mapping or trust roots during the overlap window.
2. Issue or install the new client certificate and cut traffic over to it.
3. Remove the old mapping or trust root only after the new certificate is serving traffic successfully.
4. Verify the old certificate is rejected once the overlap window closes.


Multi-node IAM failover remains outside the supported product contract for this release. The current release proof is lifecycle-oriented rather than HA-oriented: bootstrap hardening, signing-key rotation, credential overlap-and-revoke rotation, and mTLS overlap-and-cutover rotation are supported; clustered IAM failover is future scope expansion.


The standalone proof is `./nix/test-cluster/run-core-control-plane-ops-proof.sh`. It runs the `iam-authn` signing-key and mTLS rotation tests plus the `iam-api` credential rotation test, records the bootstrap hardening source markers from `iam-server`, and persists logs plus `result.json` and `scope-fixed-contract.json` under `./work/core-control-plane-ops-proof`. The dated 2026-04-10 artifact root is `./work/core-control-plane-ops-proof/20260410T172148+09:00`.