diff --git a/README.md b/README.md index 646aeb6..2f6de10 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@ The canonical bare-metal bootstrap proof is the ISO-on-QEMU path under [`nix/tes ## Core API Notes -- `chainfire` ships a fixed-membership cluster API on the supported surface. Public cluster management is `MemberList` plus `Status`, and the internal Raft transport surface is `Vote` plus `AppendEntries`. `chainfire-core` is workspace-internal only; the old embeddable builder and distributed-KV scaffold are not part of the supported product contract. +- `chainfire` ships a live cluster-management API on the supported surface. Public cluster management is `MemberAdd`, `MemberRemove`, `MemberList`, `Status`, and `LeaderTransfer`, and the internal Raft transport surface is `Vote`, `AppendEntries`, plus `TimeoutNow`. `chainfire-core` is workspace-internal only; the old embeddable builder and distributed-KV scaffold are not part of the supported product contract. - `flaredb` ships SQL on both gRPC and REST. The supported REST SQL surface is `POST /api/v1/sql` for statement execution and `GET /api/v1/tables` for table discovery, alongside the existing KV and scan endpoints. - `plasmavmc` ships a KVM-only public VM backend contract. The supported create and recovery surface is the KVM path exercised in `single-node-quickstart`, `fresh-smoke`, and `fresh-matrix`; Firecracker and mvisor remain archived non-product backends outside the supported surface until they have real tenant-network coverage. - `lightningstor` keeps its optional gRPC surface live: bucket versioning, bucket policy, bucket tagging, and explicit object version listing are part of the supported contract for the canonical optional bundle. @@ -38,7 +38,7 @@ The canonical bare-metal bootstrap proof is the ISO-on-QEMU path under [`nix/tes The control-plane operator contract is fixed in [docs/control-plane-ops.md](docs/control-plane-ops.md). -- ChainFire dynamic membership, replace-node, and scale-out are unsupported on the supported surface; the supported operator path is fixed-membership restore or whole-cluster replacement backed by the `durability-proof` backup/restore baseline. +- ChainFire supports live membership add, remove, promotion, endpoint replacement, and leader transfer for voters and learners on the public surface, including current-leader removal followed by election on the remaining voters. The supported reconfiguration boundary is sequential one-voter transitions until joint consensus lands. The fallback operator path remains backup plus restore through `durability-proof`, and the dedicated KVM proof lane is `nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof`. - FlareDB online migration and schema evolution must start from the durability-proof backup/restore baseline and stay additive-first until a later destructive cleanup window. FlareDB destructive DDL and fully automated online migration remain outside the supported product contract for this release. - IAM bootstrap hardening requires an explicit admin token, an explicit signing key, and a 32-byte IAM_CRED_MASTER_KEY. Signing-key rotation, credential overlap-and-revoke rotation, and mTLS overlap-and-cutover rotation are part of the supported operator contract; multi-node IAM failover remains outside the supported product contract. The standalone proof is `./nix/test-cluster/run-core-control-plane-ops-proof.sh`. 
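For orientation, the sketch below drives that live cluster-management surface from `chainfire-client` using the `member_add`, `leader_transfer`, `member_remove`, and `member_list` calls added in this change. Connection setup is assumed (the client constructor is not part of this diff), the node IDs and URLs are placeholders, and error handling is simplified; each step changes at most one voter, matching the sequential one-voter reconfiguration boundary above.

```rust
use chainfire_client::Client; // assumed re-export path for the client type in this diff

async fn reconfigure(client: &mut Client) -> Result<(), Box<dyn std::error::Error>> {
    // Join node 4 as a learner first so it can catch up without affecting quorum.
    client
        .member_add(
            4,
            "node04",
            vec!["http://10.0.0.14:2380".to_string()], // placeholder peer URL
            vec!["http://10.0.0.14:2379".to_string()], // placeholder client URL
            true,                                      // is_learner
        )
        .await?;

    // Promote the caught-up learner to a voter: same ID, is_learner = false.
    client
        .member_add(
            4,
            "node04",
            vec!["http://10.0.0.14:2380".to_string()],
            vec!["http://10.0.0.14:2379".to_string()],
            false,
        )
        .await?;

    // Move leadership onto the new voter, then retire an old voter in a separate step.
    let new_leader = client.leader_transfer(4).await?;
    assert_eq!(new_leader, 4);
    client.member_remove(1).await?;

    // Confirm the resulting membership.
    for member in client.member_list().await? {
        println!("{} ({}) learner={}", member.id, member.name, member.is_learner);
    }
    Ok(())
}
```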
@@ -93,6 +93,7 @@ nix develop nix run ./nix/test-cluster#cluster -- fresh-smoke nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp nix run ./nix/test-cluster#cluster -- fresh-matrix +nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite ``` @@ -100,6 +101,7 @@ The checked-in entrypoint for the publishable nested-KVM suite is the local wrap For the full supported-surface proof on a local AMD/KVM host, use `./nix/test-cluster/run-supported-surface-final-proof.sh ./work/final-proofs/latest`; it keeps builders local, builds `single-node-trial-vm`, runs `single-node-quickstart`, and captures the publishable KVM suite logs in one place. `nix run ./nix/test-cluster#cluster -- durability-proof` is the canonical chainfire flaredb deployer backup/restore lane. It persists artifacts under `./work/durability-proof/latest`, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a `deployer.service` restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures against the same live KVM cluster. `nix run ./nix/test-cluster#cluster -- rollout-soak` is the longer-running control-plane and rollout companion lane. It rebuilds from clean local KVM runtime state, persists artifacts under `./work/rollout-soak/latest`, validates exactly one planned `draining` maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for the configured soak window, then restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb` before revalidating the cluster. The soak root also carries explicit scope markers so the supported boundary is encoded in the proof artifacts rather than only in docs. The steady-state KVM nodes do not run `nix-agent.service`, so the soak lane records explicit `nix-agent` scope markers instead of pretending a live-cluster `nix-agent` restart happened. +`nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof` is the focused local-KVM live-reconfiguration lane for ChainFire. It rebuilds from clean local runtime state, starts a temporary ChainFire replica on `node04`, proves learner add plus local replication, voter promotion, live leader transfer, temporary-voter restart and rejoin, current-leader removal followed by re-election, removed-leader re-add, and final scale-in back to the canonical 3-node control-plane shape, and stores the resulting membership or local-read artifacts under `./work/chainfire-live-membership-proof/latest`. `nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof` is the focused local-KVM reality lane for the provider and VM-hosting bundles. It stores artifacts under `./work/provider-vm-reality-proof/latest`, captures authoritative FlashDNS answers, FiberLB backend drain and restore evidence, and PlasmaVMC KVM shared-storage migration plus post-migration restart state. The 2026-04-10 local AMD/KVM proof logs are in `./work/final-proofs/32f64c10-1b74-4d8a-8d7d-b2cc6bf6b4f0-final` for `supported-surface-guard`, `single-node-trial-vm`, and `single-node-quickstart`, and in `./work/publishable-kvm-suite` for the final passing `fresh-smoke`, `fresh-demo-vm-webapp`, and `fresh-matrix` run through `./nix/test-cluster/run-publishable-kvm-suite.sh`. 
The exact bare-metal check-runner proof from `2026-04-10` is in `./work/baremetal-iso-e2e/0de75570-dabd-471b-95fe-5898c54e2e8c`; its outer `environment.txt` records `execution_model=materialized-check-runner`, and `state/environment.txt` records `vm_accelerator_mode=kvm`. @@ -108,13 +110,13 @@ The 2026-04-10 longer-running rollout and control-plane soak is in `./work/rollo The 2026-04-10 provider and VM-hosting reality proof logs are in `./work/provider-vm-reality-proof/20260410T135827+0900`; `result.json` records `success=true`, and the artifact set includes `network-provider/fiberlb-drain-summary.txt`, `network-provider/flashdns-service-authoritative-answer.txt`, `vm-hosting/migration-summary.json`, and `vm-hosting/root-volume-after-post-migration-restart.json`. Physical-node bring-up now has a canonical preflight wrapper as well: `nix run ./nix/test-cluster#hardware-smoke -- preflight`. It writes `kernel-params.txt`, expected markers, failure markers, and a machine-readable blocked or ready state under `./work/hardware-smoke/latest`, and the same entrypoint can later be rerun as `run` or `capture` when USB or BMC/Redfish transport is actually present. -Within that suite, `fresh-matrix` is the public provider-bundle proof: it exercises PrismNet VPC/subnet/port flows plus security-group ACL add/remove, FlashDNS record publication, and FiberLB TCP plus TLS-terminated `Https` / `TerminatedHttps` listeners in one tenant-scoped composition run. The published FiberLB L4 algorithms are kept honest with targeted server unit tests in-tree. `provider-vm-reality-proof` is the artifact-producing companion lane for the same bundle and for the VM-hosting path. +Within that suite, `fresh-matrix` is the public provider-bundle proof: it exercises PrismNet VPC/subnet/port flows plus security-group ACL add/remove, FlashDNS record publication, and FiberLB TCP plus TLS-terminated `Https` / `TerminatedHttps` listeners in one tenant-scoped composition run. The published FiberLB L4 algorithms are kept honest with targeted server unit tests in-tree. `provider-vm-reality-proof` is the artifact-producing companion lane for the same bundle and for the VM-hosting path, and `chainfire-live-membership-proof` is the dedicated control-plane live-reconfiguration companion for ChainFire. PrismNet real OVS/OVN dataplane validation remains outside the supported local KVM surface. FiberLB native BGP or BFD peer interop plus hardware VIP ownership also remain outside the supported local KVM surface. PlasmaVMC real-hardware migration or storage handoff remains a later hardware proof; the current local-KVM proof fixes the release surface to KVM shared-storage migration on the worker pair. 
Project-done release proof now requires both halves of the public validation surface to be green: - `baremetal-iso` and `baremetal-iso-e2e` for the canonical `deployer -> installer -> nix-agent` bare-metal bootstrap path -- the KVM publishable suite (`fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`) for the nested-KVM multi-node VM-hosting path +- the KVM publishable suite (`fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`, `chainfire-live-membership-proof`) for the nested-KVM multi-node VM-hosting and live-control-plane path Canonical bare-metal bootstrap proof: diff --git a/chainfire/chainfire-client/src/client.rs b/chainfire/chainfire-client/src/client.rs index edc2d83..f2a4fc6 100644 --- a/chainfire/chainfire-client/src/client.rs +++ b/chainfire/chainfire-client/src/client.rs @@ -4,7 +4,8 @@ use crate::error::{ClientError, Result}; use crate::watch::WatchHandle; use chainfire_proto::proto::{ cluster_client::ClusterClient, compare, kv_client::KvClient, request_op, response_op, - watch_client::WatchClient, Compare, DeleteRangeRequest, PutRequest, RangeRequest, RequestOp, + watch_client::WatchClient, Compare, DeleteRangeRequest, LeaderTransferRequest, Member, + MemberAddRequest, MemberListRequest, MemberRemoveRequest, PutRequest, RangeRequest, RequestOp, StatusRequest, TxnRequest, }; use std::time::Duration; @@ -616,6 +617,89 @@ impl Client { raft_term: resp.raft_term, }) } + + /// List current cluster members. + pub async fn member_list(&mut self) -> Result<Vec<ClusterMemberInfo>> { + let resp = self + .with_cluster_retry(|mut cluster| async move { + cluster + .member_list(MemberListRequest {}) + .await + .map(|resp| resp.into_inner()) + }) + .await?; + + Ok(resp + .members + .into_iter() + .map(ClusterMemberInfo::from) + .collect()) + } + + /// Add or update a cluster member. + pub async fn member_add( + &mut self, + id: u64, + name: impl Into<String>, + peer_urls: Vec<String>, + client_urls: Vec<String>, + is_learner: bool, + ) -> Result<ClusterMemberInfo> { + let name = name.into(); + let resp = self + .with_cluster_retry(|mut cluster| { + let request = MemberAddRequest { + id, + name: name.clone(), + peer_urls: peer_urls.clone(), + client_urls: client_urls.clone(), + is_learner, + }; + async move { + cluster + .member_add(request) + .await + .map(|resp| resp.into_inner()) + } + }) + .await?; + + resp.member + .map(ClusterMemberInfo::from) + .ok_or_else(|| ClientError::Internal("member_add response missing member".to_string())) + } + + /// Remove a cluster member. + pub async fn member_remove(&mut self, id: u64) -> Result<Vec<ClusterMemberInfo>> { + let resp = self + .with_cluster_retry(|mut cluster| async move { + cluster + .member_remove(MemberRemoveRequest { id }) + .await + .map(|resp| resp.into_inner()) + }) + .await?; + + Ok(resp + .members + .into_iter() + .map(ClusterMemberInfo::from) + .collect()) + } + + /// Transfer leadership to a specific voting member. + pub async fn leader_transfer(&mut self, target_id: u64) -> Result<u64> { + let resp = self + .with_cluster_retry(|mut cluster| async move { + cluster + .leader_transfer(LeaderTransferRequest { target_id }) + .await + .map(|resp| resp.into_inner()) + }) + .await?; + + Ok(resp.leader) + } } /// Cluster status @@ -629,6 +713,33 @@ pub struct ClusterStatus { pub raft_term: u64, } +/// Cluster member returned by cluster-management RPCs. +#[derive(Debug, Clone)] +pub struct ClusterMemberInfo { + /// Unique member ID. + pub id: u64, + /// Human-readable node name. + pub name: String, + /// Peer URLs used for Raft replication. + pub peer_urls: Vec<String>, + /// Client URLs exposed by the node.
+ pub client_urls: Vec<String>, + /// Whether this member is configured as a learner. + pub is_learner: bool, +} + +impl From<Member> for ClusterMemberInfo { + fn from(member: Member) -> Self { + Self { + id: member.id, + name: member.name, + peer_urls: member.peer_urls, + client_urls: member.client_urls, + is_learner: member.is_learner, + } + } +} + /// CAS outcome returned by compare_and_swap #[derive(Debug, Clone)] pub struct CasOutcome { diff --git a/chainfire/crates/chainfire-api/src/cluster_service.rs b/chainfire/crates/chainfire-api/src/cluster_service.rs index 504294c..3ff1475 100644 --- a/chainfire/crates/chainfire-api/src/cluster_service.rs +++ b/chainfire/crates/chainfire-api/src/cluster_service.rs @@ -1,76 +1,153 @@ -//! Cluster management service implementation +//! Cluster management service implementation. //! -//! This service handles cluster operations and status queries. -//! The supported surface reports the fixed membership that the node booted with. +//! This service exposes live member add/remove/list/status operations backed by +//! the replicated membership state in `RaftCore`. use crate::conversions::make_header; use crate::proto::{ - cluster_server::Cluster, Member, MemberListRequest, MemberListResponse, StatusRequest, - StatusResponse, + cluster_server::Cluster, LeaderTransferRequest, LeaderTransferResponse, Member, + MemberAddRequest, MemberAddResponse, MemberListRequest, MemberListResponse, + MemberRemoveRequest, MemberRemoveResponse, StatusRequest, StatusResponse, }; -use chainfire_raft::core::RaftCore; +use chainfire_raft::core::{ClusterMember as CoreClusterMember, ClusterMembership, RaftCore}; use std::sync::Arc; use tonic::{Request, Response, Status}; use tracing::debug; -/// Cluster service implementation +/// Cluster service implementation. pub struct ClusterServiceImpl { - /// Raft core + /// Raft core. raft: Arc<RaftCore>, - /// Cluster ID + /// Cluster ID. cluster_id: u64, - /// Configured members with client and peer URLs - members: Vec<Member>, - /// Server version + /// Server version. version: String, } impl ClusterServiceImpl { - /// Create a new cluster service - pub fn new( - raft: Arc<RaftCore>, - cluster_id: u64, - members: Vec<Member>, - ) -> Self { + /// Create a new cluster service. + pub fn new(raft: Arc<RaftCore>, cluster_id: u64) -> Self { Self { raft, cluster_id, - members, version: env!("CARGO_PKG_VERSION").to_string(), } } - fn make_header(&self, revision: u64) -> crate::proto::ResponseHeader { - make_header(self.cluster_id, self.raft.node_id(), revision, 0) + async fn make_header(&self, revision: u64) -> crate::proto::ResponseHeader { + let term = self.raft.current_term().await; + make_header(self.cluster_id, self.raft.node_id(), revision, term) } - /// Get current members as proto Member list - /// Return the configured static membership that the server was booted with.
- async fn get_member_list(&self) -> Vec<Member> { - if self.members.is_empty() { - return vec![Member { - id: self.raft.node_id(), - name: format!("node-{}", self.raft.node_id()), - peer_urls: vec![], - client_urls: vec![], - is_learner: false, - }]; + fn proto_member(member: &CoreClusterMember) -> Member { + Member { + id: member.id, + name: member.name.clone(), + peer_urls: member.peer_urls.clone(), + client_urls: member.client_urls.clone(), + is_learner: member.is_learner, + } + } + + fn proto_members(membership: &ClusterMembership) -> Vec<Member> { + membership.members.iter().map(Self::proto_member).collect() + } +} + +fn map_raft_error(error: chainfire_raft::core::RaftError) -> Status { + match error { + chainfire_raft::core::RaftError::NotLeader { leader_id } => { + Status::failed_precondition(format!("not leader; current leader is {:?}", leader_id)) + } + chainfire_raft::core::RaftError::Rejected(message) => Status::failed_precondition(message), + chainfire_raft::core::RaftError::StorageError(message) + | chainfire_raft::core::RaftError::NetworkError(message) => Status::internal(message), + chainfire_raft::core::RaftError::Timeout => { + Status::deadline_exceeded("cluster operation timed out") } - self.members.clone() } } #[tonic::async_trait] impl Cluster for ClusterServiceImpl { + async fn member_add( + &self, + request: Request<MemberAddRequest>, + ) -> Result<Response<MemberAddResponse>, Status> { + let req = request.into_inner(); + debug!(member_id = req.id, "Member add request"); + + if req.id == 0 { + return Err(Status::invalid_argument("member id must be non-zero")); + } + if req.peer_urls.is_empty() { + return Err(Status::invalid_argument( + "member add requires at least one peer URL", + )); + } + + let member = CoreClusterMember { + id: req.id, + name: if req.name.trim().is_empty() { + format!("node-{}", req.id) + } else { + req.name + }, + peer_urls: req.peer_urls, + client_urls: req.client_urls, + is_learner: req.is_learner, + }; + + let membership = self + .raft + .add_member(member.clone()) + .await + .map_err(map_raft_error)?; + let revision = self.raft.last_applied().await; + let applied_member = membership.member(member.id).cloned().unwrap_or(member); + + Ok(Response::new(MemberAddResponse { + header: Some(self.make_header(revision).await), + member: Some(Self::proto_member(&applied_member)), + members: Self::proto_members(&membership), + })) + } + + async fn member_remove( + &self, + request: Request<MemberRemoveRequest>, + ) -> Result<Response<MemberRemoveResponse>, Status> { + let req = request.into_inner(); + debug!(member_id = req.id, "Member remove request"); + + if req.id == 0 { + return Err(Status::invalid_argument("member id must be non-zero")); + } + + let membership = self + .raft + .remove_member(req.id) + .await + .map_err(map_raft_error)?; + let revision = self.raft.last_applied().await; + + Ok(Response::new(MemberRemoveResponse { + header: Some(self.make_header(revision).await), + members: Self::proto_members(&membership), + })) + } + async fn member_list( &self, _request: Request<MemberListRequest>, ) -> Result<Response<MemberListResponse>, Status> { debug!("Member list request"); + let revision = self.raft.last_applied().await; + let membership = self.raft.cluster_membership().await; Ok(Response::new(MemberListResponse { - header: Some(self.make_header(0)), - members: self.get_member_list().await, + header: Some(self.make_header(revision).await), + members: Self::proto_members(&membership), })) } @@ -86,7 +163,7 @@ impl Cluster for ClusterServiceImpl { let last_applied = self.raft.last_applied().await; Ok(Response::new(StatusResponse { - header: Some(self.make_header(last_applied)), + header:
Some(self.make_header(last_applied).await), version: self.version.clone(), db_size: 0, leader: leader.unwrap_or(0), @@ -95,4 +172,30 @@ impl Cluster for ClusterServiceImpl { raft_applied_index: last_applied, })) } + + async fn leader_transfer( + &self, + request: Request, + ) -> Result, Status> { + let req = request.into_inner(); + debug!(target_id = req.target_id, "Leader transfer request"); + + if req.target_id == 0 { + return Err(Status::invalid_argument( + "leader transfer target must be non-zero", + )); + } + + let leader = self + .raft + .transfer_leader(req.target_id) + .await + .map_err(map_raft_error)?; + let revision = self.raft.last_applied().await; + + Ok(Response::new(LeaderTransferResponse { + header: Some(self.make_header(revision).await), + leader, + })) + } } diff --git a/chainfire/crates/chainfire-api/src/internal_service.rs b/chainfire/crates/chainfire-api/src/internal_service.rs index ab77877..0ea29d0 100644 --- a/chainfire/crates/chainfire-api/src/internal_service.rs +++ b/chainfire/crates/chainfire-api/src/internal_service.rs @@ -5,8 +5,9 @@ use crate::internal_proto::{ raft_service_server::RaftService, AppendEntriesRequest as ProtoAppendEntriesRequest, - AppendEntriesResponse as ProtoAppendEntriesResponse, VoteRequest as ProtoVoteRequest, - VoteResponse as ProtoVoteResponse, + AppendEntriesResponse as ProtoAppendEntriesResponse, EntryType as ProtoEntryType, + TimeoutNowRequest as ProtoTimeoutNowRequest, TimeoutNowResponse as ProtoTimeoutNowResponse, + VoteRequest as ProtoVoteRequest, VoteResponse as ProtoVoteResponse, }; use chainfire_raft::core::{AppendEntriesRequest, RaftCore, VoteRequest}; use chainfire_storage::{EntryPayload, LogEntry as RaftLogEntry, LogId}; @@ -31,6 +32,32 @@ impl RaftServiceImpl { } } +fn decode_log_entry( + entry: crate::internal_proto::LogEntry, +) -> Result, Status> { + let payload = match ProtoEntryType::try_from(entry.entry_type).unwrap_or(ProtoEntryType::Blank) + { + ProtoEntryType::Blank => EntryPayload::Blank, + ProtoEntryType::Normal => { + let command = bincode::deserialize::(&entry.data).map_err(|err| { + Status::invalid_argument(format!( + "failed to decode normal raft entry payload: {err}" + )) + })?; + EntryPayload::Normal(command) + } + ProtoEntryType::Membership => EntryPayload::Membership(entry.data), + }; + + Ok(RaftLogEntry { + log_id: LogId { + term: entry.term, + index: entry.index, + }, + payload, + }) +} + #[tonic::async_trait] impl RaftService for RaftServiceImpl { async fn vote( @@ -91,26 +118,8 @@ impl RaftService for RaftServiceImpl { let entries: Vec> = req .entries .into_iter() - .map(|e| { - let payload = if e.data.is_empty() { - EntryPayload::Blank - } else { - // Deserialize the command from the entry data - match bincode::deserialize::(&e.data) { - Ok(cmd) => EntryPayload::Normal(cmd), - Err(_) => EntryPayload::Blank, - } - }; - - RaftLogEntry { - log_id: LogId { - term: e.term, - index: e.index, - }, - payload, - } - }) - .collect(); + .map(decode_log_entry) + .collect::, _>>()?; let append_req = AppendEntriesRequest { term: req.term, @@ -140,4 +149,48 @@ impl RaftService for RaftServiceImpl { })) } + async fn timeout_now( + &self, + _request: Request, + ) -> Result, Status> { + let (resp_tx, resp_rx) = oneshot::channel(); + self.raft.timeout_now_rpc(resp_tx).await; + + let result = resp_rx.await.map_err(|e| { + warn!(error = %e, "TimeoutNow request channel closed"); + Status::internal("TimeoutNow request failed: channel closed") + })?; + + let term = self.raft.current_term().await; + match result { + Ok(()) 
=> Ok(Response::new(ProtoTimeoutNowResponse { + accepted: true, + term, + })), + Err(err) => Err(Status::failed_precondition(err.to_string())), + } + } +} + +#[cfg(test)] +mod tests { + use super::*; + use chainfire_storage::EntryPayload; + + #[test] + fn decode_log_entry_preserves_membership_payloads() { + let expected = vec![1, 2, 3, 4]; + let decoded = decode_log_entry(crate::internal_proto::LogEntry { + index: 7, + term: 3, + data: expected.clone(), + entry_type: ProtoEntryType::Membership as i32, + }) + .expect("decode membership entry"); + + match decoded.payload { + EntryPayload::Membership(bytes) => assert_eq!(bytes, expected), + other => panic!("expected membership payload, got {other:?}"), + } + } } diff --git a/chainfire/crates/chainfire-api/src/raft_client.rs b/chainfire/crates/chainfire-api/src/raft_client.rs index 36a5f3a..acb1f22 100644 --- a/chainfire/crates/chainfire-api/src/raft_client.rs +++ b/chainfire/crates/chainfire-api/src/raft_client.rs @@ -5,7 +5,8 @@ use crate::internal_proto::{ raft_service_client::RaftServiceClient, AppendEntriesRequest as ProtoAppendEntriesRequest, - LogEntry as ProtoLogEntry, VoteRequest as ProtoVoteRequest, + EntryType as ProtoEntryType, LogEntry as ProtoLogEntry, + TimeoutNowRequest as ProtoTimeoutNowRequest, VoteRequest as ProtoVoteRequest, }; use chainfire_raft::network::{RaftNetworkError, RaftRpcClient}; use chainfire_types::NodeId; @@ -241,6 +242,30 @@ impl Default for GrpcRaftClient { #[async_trait::async_trait] impl RaftRpcClient for GrpcRaftClient { + async fn add_node(&self, target: NodeId, addr: String) -> Result<(), RaftNetworkError> { + GrpcRaftClient::add_node(self, target, addr).await; + Ok(()) + } + + async fn remove_node(&self, target: NodeId) -> Result<(), RaftNetworkError> { + GrpcRaftClient::remove_node(self, target).await; + Ok(()) + } + + async fn timeout_now(&self, target: NodeId) -> Result<(), RaftNetworkError> { + trace!(target = target, "Sending timeout-now request"); + + self.with_retry(target, "timeout_now", || async { + let mut client = self.get_client(target).await?; + client + .timeout_now(ProtoTimeoutNowRequest {}) + .await + .map_err(|e| RaftNetworkError::RpcFailed(e.to_string()))?; + Ok(()) + }) + .await + } + async fn vote( &self, target: NodeId, @@ -286,17 +311,22 @@ impl RaftRpcClient for GrpcRaftClient { ); // Clone entries once for potential retries - let entries_data: Vec<(u64, u64, Vec)> = req + let entries_data: Vec<(u64, u64, Vec, i32)> = req .entries .iter() .map(|e| { use chainfire_storage::EntryPayload; - let data = match &e.payload { - EntryPayload::Blank => vec![], - EntryPayload::Normal(cmd) => bincode::serialize(cmd).unwrap_or_default(), - EntryPayload::Membership(_) => vec![], + let (data, entry_type) = match &e.payload { + EntryPayload::Blank => (vec![], ProtoEntryType::Blank as i32), + EntryPayload::Normal(cmd) => ( + bincode::serialize(cmd).unwrap_or_default(), + ProtoEntryType::Normal as i32, + ), + EntryPayload::Membership(bytes) => { + (bytes.clone(), ProtoEntryType::Membership as i32) + } }; - (e.log_id.index, e.log_id.term, data) + (e.log_id.index, e.log_id.term, data, entry_type) }) .collect(); @@ -313,7 +343,12 @@ impl RaftRpcClient for GrpcRaftClient { let entries: Vec = entries_data .into_iter() - .map(|(index, term, data)| ProtoLogEntry { index, term, data }) + .map(|(index, term, data, entry_type)| ProtoLogEntry { + index, + term, + data, + entry_type, + }) .collect(); let proto_req = ProtoAppendEntriesRequest { diff --git a/chainfire/crates/chainfire-core/src/lib.rs 
b/chainfire/crates/chainfire-core/src/lib.rs index 24a2f79..2a82dbd 100644 --- a/chainfire/crates/chainfire-core/src/lib.rs +++ b/chainfire/crates/chainfire-core/src/lib.rs @@ -1,6 +1,6 @@ //! Internal compatibility crate for workspace-local ChainFire types. //! -//! The supported ChainFire product surface is the fixed-membership +//! The supported ChainFire product surface is the live-membership //! `chainfire-server` / `chainfire-api` contract documented in the repository //! root. This crate intentionally does not export an embeddable cluster, //! membership-mutation, or distributed-KV API. diff --git a/chainfire/crates/chainfire-raft/src/core.rs b/chainfire/crates/chainfire-raft/src/core.rs index ed31295..74dfd06 100644 --- a/chainfire/crates/chainfire-raft/src/core.rs +++ b/chainfire/crates/chainfire-raft/src/core.rs @@ -9,7 +9,8 @@ //! - RaftTimer: Election and heartbeat timeout management //! - Integration with existing chainfire-storage and network layers -use std::collections::HashMap; +use serde::{Deserialize, Serialize}; +use std::collections::{BTreeSet, HashMap, HashSet}; use std::sync::Arc; use std::time::Duration; use tokio::sync::{mpsc, oneshot, Mutex, RwLock}; @@ -24,6 +25,67 @@ pub type NodeId = u64; pub type Term = u64; pub type LogIndex = u64; +/// Public member description replicated through membership-change log entries. +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize, Default)] +pub struct ClusterMember { + /// Unique member ID. + pub id: NodeId, + /// Human-readable member name. + pub name: String, + /// Peer URLs used for Raft replication. + pub peer_urls: Vec, + /// Client URLs exposed for public APIs. + pub client_urls: Vec, + /// Whether this member is a learner. + pub is_learner: bool, +} + +/// Replicated cluster membership payload. +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize, Default)] +pub struct ClusterMembership { + /// Ordered member set. + pub members: Vec, +} + +impl ClusterMembership { + /// Return a normalized copy sorted by member ID with duplicate IDs removed. + pub fn normalized(&self) -> Self { + let mut members = self.members.clone(); + members.sort_by_key(|member| member.id); + members.dedup_by_key(|member| member.id); + Self { members } + } + + /// Return all voting member IDs. + pub fn voter_ids(&self) -> Vec { + self.members + .iter() + .filter(|member| !member.is_learner) + .map(|member| member.id) + .collect() + } + + /// Find a member by ID. + pub fn member(&self, id: NodeId) -> Option<&ClusterMember> { + self.members.iter().find(|member| member.id == id) + } + + /// Insert or replace a member and return the normalized result. + pub fn with_member(&self, member: ClusterMember) -> Self { + let mut members = self.members.clone(); + members.retain(|existing| existing.id != member.id); + members.push(member); + Self { members }.normalized() + } + + /// Remove a member by ID and return the normalized result. 
+ pub fn without_member(&self, id: NodeId) -> Self { + let mut members = self.members.clone(); + members.retain(|member| member.id != id); + Self { members }.normalized() + } +} + // ============================================================================ // Core Raft Types // ============================================================================ @@ -60,7 +122,7 @@ pub struct VolatileState { #[derive(Debug, Clone)] pub struct CandidateState { /// Nodes that have granted votes (includes self) - pub votes_received: std::collections::HashSet, + pub votes_received: HashSet, } /// Volatile state on leaders (reinitialized after election) @@ -144,6 +206,11 @@ pub enum RaftEvent { command: RaftCommand, response_tx: oneshot::Sender>, }, + /// Cluster membership change request. + MembershipChange { + membership: ClusterMembership, + response_tx: oneshot::Sender>, + }, /// RequestVote RPC received VoteRequest { req: VoteRequest, @@ -154,6 +221,10 @@ pub enum RaftEvent { req: AppendEntriesRequest, response_tx: oneshot::Sender, }, + /// Immediate-election request received from the current leader. + TimeoutNow { + response_tx: oneshot::Sender>, + }, /// RequestVote RPC response received VoteResponse { from: NodeId, resp: VoteResponse }, /// AppendEntries RPC response received @@ -170,6 +241,7 @@ pub enum RaftEvent { #[derive(Debug, Clone)] pub enum RaftError { NotLeader { leader_id: Option }, + Rejected(String), StorageError(String), NetworkError(String), Timeout, @@ -181,6 +253,7 @@ impl std::fmt::Display for RaftError { RaftError::NotLeader { leader_id } => { write!(f, "Not leader, leader is: {:?}", leader_id) } + RaftError::Rejected(msg) => write!(f, "Rejected: {}", msg), RaftError::StorageError(msg) => write!(f, "Storage error: {}", msg), RaftError::NetworkError(msg) => write!(f, "Network error: {}", msg), RaftError::Timeout => write!(f, "Operation timed out"), @@ -197,8 +270,12 @@ impl std::error::Error for RaftError {} pub struct RaftCore { /// This node's ID node_id: NodeId, - /// Cluster members (excluding self) - peers: Vec, + /// Voting peers (excluding self). + peers: Arc>>, + /// Replication targets (voters and learners, excluding self). + replication_peers: Arc>>, + /// Current replicated membership information including endpoint metadata. 
+ membership: Arc>, /// Persistent state persistent: Arc>, @@ -251,17 +328,31 @@ impl Default for RaftConfig { impl RaftCore { pub fn new( node_id: NodeId, - peers: Vec, + membership: ClusterMembership, storage: Arc, state_machine: Arc, network: Arc, config: RaftConfig, ) -> Self { let (event_tx, event_rx) = mpsc::unbounded_channel(); + let membership = membership.normalized(); + let peers = membership + .voter_ids() + .into_iter() + .filter(|id| *id != node_id) + .collect(); + let replication_peers = membership + .members + .iter() + .map(|member| member.id) + .filter(|id| *id != node_id) + .collect(); Self { node_id, - peers, + peers: Arc::new(RwLock::new(peers)), + replication_peers: Arc::new(RwLock::new(replication_peers)), + membership: Arc::new(RwLock::new(membership)), persistent: Arc::new(RwLock::new(PersistentState { current_term: 0, voted_for: None, @@ -308,6 +399,41 @@ impl RaftCore { ))); } } + + match self.storage.read_membership() { + Ok(Some(bytes)) => { + let membership: ClusterMembership = bincode::deserialize(&bytes).map_err(|e| { + RaftError::StorageError(format!( + "Failed to deserialize membership payload: {}", + e + )) + })?; + self.apply_runtime_membership(membership, false).await?; + tracing::info!("Loaded membership from storage"); + } + Ok(None) => { + let membership = self.membership.read().await.clone(); + let bytes = bincode::serialize(&membership).map_err(|e| { + RaftError::StorageError(format!( + "Failed to serialize initial membership payload: {}", + e + )) + })?; + self.storage.save_membership(&bytes).map_err(|e| { + RaftError::StorageError(format!( + "Failed to persist initial membership payload: {}", + e + )) + })?; + } + Err(e) => { + return Err(RaftError::StorageError(format!( + "Failed to load membership: {}", + e + ))); + } + } + Ok(()) } @@ -327,6 +453,138 @@ impl RaftCore { Ok(()) } + async fn peers_snapshot(&self) -> Vec { + self.peers.read().await.clone() + } + + async fn replication_targets_snapshot(&self) -> Vec { + self.replication_peers.read().await.clone() + } + + async fn is_voting_member(&self, node_id: NodeId) -> bool { + self.membership + .read() + .await + .member(node_id) + .map(|member| !member.is_learner) + .unwrap_or(false) + } + + async fn self_is_voting_member(&self) -> bool { + self.is_voting_member(self.node_id).await + } + + fn serialize_membership(membership: &ClusterMembership) -> Result, RaftError> { + bincode::serialize(membership).map_err(|e| { + RaftError::StorageError(format!("Failed to serialize membership payload: {}", e)) + }) + } + + async fn apply_runtime_membership( + &self, + membership: ClusterMembership, + persist: bool, + ) -> Result<(), RaftError> { + let membership = membership.normalized(); + let old_membership = self.membership.read().await.clone(); + + for member in &membership.members { + if let Some(peer_url) = member.peer_urls.first() { + let addr = peer_url + .strip_prefix("http://") + .or_else(|| peer_url.strip_prefix("https://")) + .unwrap_or(peer_url) + .to_string(); + self.network + .add_node(member.id, addr) + .await + .map_err(|e| RaftError::NetworkError(e.to_string()))?; + } + } + + for removed in old_membership + .members + .iter() + .filter(|member| membership.member(member.id).is_none()) + { + self.network + .remove_node(removed.id) + .await + .map_err(|e| RaftError::NetworkError(e.to_string()))?; + } + + let peers = membership + .voter_ids() + .into_iter() + .filter(|id| *id != self.node_id) + .collect::>(); + let replication_peers = membership + .members + .iter() + .map(|member| member.id) 
+ .filter(|id| *id != self.node_id) + .collect::>(); + + { + let mut membership_guard = self.membership.write().await; + *membership_guard = membership.clone(); + } + { + let mut peers_guard = self.peers.write().await; + *peers_guard = peers.clone(); + } + { + let mut replication_guard = self.replication_peers.write().await; + *replication_guard = replication_peers.clone(); + } + + if !membership + .member(self.node_id) + .map(|member| !member.is_learner) + .unwrap_or(false) + { + *self.role.write().await = RaftRole::Follower; + *self.candidate_state.write().await = None; + *self.leader_state.write().await = None; + let mut volatile = self.volatile.write().await; + if volatile.current_leader == Some(self.node_id) { + volatile.current_leader = None; + } + } + + if persist { + let bytes = Self::serialize_membership(&membership)?; + self.storage.save_membership(&bytes).map_err(|e| { + RaftError::StorageError(format!("Failed to save membership: {}", e)) + })?; + } + + let current_role = *self.role.read().await; + if current_role == RaftRole::Leader { + let (last_log_index, _) = self.get_last_log_info().await?; + let mut leader_state_guard = self.leader_state.write().await; + if let Some(leader_state) = leader_state_guard.as_mut() { + let old_next = leader_state.next_index.clone(); + let old_match = leader_state.match_index.clone(); + leader_state.next_index.clear(); + leader_state.match_index.clear(); + for peer_id in &replication_peers { + let next_index = old_next + .get(peer_id) + .copied() + .unwrap_or(1) + .min(last_log_index + 1); + leader_state.next_index.insert(*peer_id, next_index); + leader_state + .match_index + .insert(*peer_id, old_match.get(peer_id).copied().unwrap_or(0)); + } + } + } + + Ok(()) + } + /// Start the Raft event loop pub async fn run(&self) -> Result<(), RaftError> { eprintln!("[Node {}] EVENT LOOP STARTING", self.node_id); @@ -353,8 +611,10 @@ impl RaftCore { RaftEvent::VoteRequest { .. } => "VoteRequest", RaftEvent::VoteResponse { .. } => "VoteResponse", RaftEvent::AppendEntries { .. } => "AppendEntries", + RaftEvent::TimeoutNow { .. } => "TimeoutNow", RaftEvent::AppendEntriesResponse { .. } => "AppendEntriesResponse", RaftEvent::ClientWrite { .. } => "ClientWrite", + RaftEvent::MembershipChange { .. 
} => "MembershipChange", }; eprintln!("[Node {}] EVENT LOOP received: {}", self.node_id, event_type); if let Err(e) = self.handle_event(event).await { @@ -389,6 +649,13 @@ impl RaftCore { let result = self.handle_client_write(command).await; let _ = response_tx.send(result); } + RaftEvent::MembershipChange { + membership, + response_tx, + } => { + let result = self.handle_membership_change(membership).await; + let _ = response_tx.send(result); + } RaftEvent::VoteRequest { req, response_tx } => { let resp = self.handle_vote_request(req).await?; let _ = response_tx.send(resp); @@ -401,6 +668,10 @@ impl RaftCore { let resp = self.handle_append_entries(req).await?; let _ = response_tx.send(resp); } + RaftEvent::TimeoutNow { response_tx } => { + let result = self.handle_timeout_now().await; + let _ = response_tx.send(result); + } RaftEvent::VoteResponse { from, resp } => { self.handle_vote_response(from, resp).await?; } @@ -417,6 +688,10 @@ impl RaftCore { /// Handle election timeout - transition to candidate and start election async fn handle_election_timeout(&self) -> Result<(), RaftError> { + if !self.self_is_voting_member().await { + return Ok(()); + } + let role = *self.role.read().await; eprintln!( @@ -463,11 +738,12 @@ impl RaftCore { }); // Check if already have majority (single-node case) - let cluster_size = self.peers.len() + 1; + let peers = self.peers_snapshot().await; + let cluster_size = peers.len() + 1; let majority = cluster_size / 2 + 1; eprintln!( "[Node {}] Cluster size={}, majority={}, peers={:?}", - self.node_id, cluster_size, majority, self.peers + self.node_id, cluster_size, majority, peers ); if 1 >= majority { // For single-node cluster, immediately become leader @@ -491,7 +767,7 @@ impl RaftCore { }; // Send vote requests in parallel - for peer_id in &self.peers { + for peer_id in &peers { let peer_id = *peer_id; let network = self.network.clone(); let req = vote_request.clone(); @@ -515,6 +791,17 @@ impl RaftCore { Ok(()) } + /// Handle TimeoutNow RPC by immediately starting an election on this voter. 
+ async fn handle_timeout_now(&self) -> Result<(), RaftError> { + if !self.self_is_voting_member().await { + return Err(RaftError::NetworkError( + "timeout-now requires a voting member target".to_string(), + )); + } + + self.handle_election_timeout().await + } + /// Handle RequestVote RPC async fn handle_vote_request(&self, req: VoteRequest) -> Result { let mut persistent = self.persistent.write().await; @@ -538,6 +825,13 @@ impl RaftCore { persistent = self.persistent.write().await; } + if !self.self_is_voting_member().await { + return Ok(VoteResponse { + term: persistent.current_term, + vote_granted: false, + }); + } + // Check if we can grant vote let can_vote = persistent.voted_for.is_none() || persistent.voted_for == Some(req.candidate_id); @@ -605,13 +899,11 @@ impl RaftCore { // Count votes if resp.vote_granted { + let cluster_size = self.peers_snapshot().await.len() + 1; + let majority = cluster_size / 2 + 1; let mut candidate_state_guard = self.candidate_state.write().await; if let Some(candidate_state) = candidate_state_guard.as_mut() { candidate_state.votes_received.insert(from); - - // Calculate majority (cluster size = peers + 1 for self) - let cluster_size = self.peers.len() + 1; - let majority = cluster_size / 2 + 1; let votes_count = candidate_state.votes_received.len(); // If received majority, become leader @@ -645,19 +937,15 @@ impl RaftCore { match_index: HashMap::new(), }; - for peer_id in &self.peers { + let replication_targets = self.replication_targets_snapshot().await; + for peer_id in &replication_targets { leader_state.next_index.insert(*peer_id, next_index); leader_state.match_index.insert(*peer_id, 0); } *self.leader_state.write().await = Some(leader_state); - // Start sending heartbeats immediately - self.event_tx - .send(RaftEvent::HeartbeatTimeout) - .map_err(|e| RaftError::NetworkError(format!("Failed to send heartbeat: {}", e)))?; - - Ok(()) + self.append_blank_leader_entry().await } /// Step down to follower @@ -694,13 +982,14 @@ impl RaftCore { let term = self.persistent.read().await.current_term; let (last_log_index, _) = self.get_last_log_info().await?; + let peers = self.replication_targets_snapshot().await; eprintln!( "[Node {}] Sending heartbeat to peers: {:?} (term={})", - self.node_id, self.peers, term + self.node_id, peers, term ); // Send AppendEntries (with entries if available) to all peers - for peer_id in &self.peers { + for peer_id in &peers { let peer_id = *peer_id; // Read commit_index fresh for each peer to ensure it's up-to-date @@ -1122,16 +1411,22 @@ impl RaftCore { /// Advance commit index based on majority replication async fn advance_commit_index(&self) -> Result<(), RaftError> { - let leader_state = self.leader_state.read().await; - if leader_state.is_none() { + let voter_followers = self.peers_snapshot().await; + let Some(match_indices_from_followers) = ({ + let leader_state = self.leader_state.read().await; + leader_state.as_ref().map(|state| { + voter_followers + .iter() + .map(|peer_id| state.match_index.get(peer_id).copied().unwrap_or(0)) + .collect::>() + }) + }) else { return Ok(()); // Not leader - } - - let leader_state = leader_state.as_ref().unwrap(); + }; // Collect all match_index values plus leader's own log let (last_log_index, _) = self.get_last_log_info().await?; - let mut match_indices: Vec = leader_state.match_index.values().copied().collect(); + let mut match_indices = match_indices_from_followers; // Add leader's own index match_indices.push(last_log_index); @@ -1183,9 +1478,10 @@ impl RaftCore { /// Apply 
committed entries to state machine async fn apply_committed_entries(&self) -> Result<(), RaftError> { - let mut volatile = self.volatile.write().await; - let commit_index = volatile.commit_index; - let last_applied = volatile.last_applied; + let (commit_index, last_applied) = { + let volatile = self.volatile.read().await; + (volatile.commit_index, volatile.last_applied) + }; if commit_index <= last_applied { return Ok(()); // Nothing to apply @@ -1201,26 +1497,51 @@ impl RaftCore { // Apply each entry to state machine for entry in &stored_entries { - if let EntryPayload::Normal(data) = &entry.payload { - // Deserialize the command - let command: RaftCommand = bincode::deserialize(data).map_err(|e| { - RaftError::StorageError(format!("Failed to deserialize for apply: {}", e)) - })?; + match &entry.payload { + EntryPayload::Normal(data) => { + // Deserialize the command + let command: RaftCommand = bincode::deserialize(data).map_err(|e| { + RaftError::StorageError(format!("Failed to deserialize for apply: {}", e)) + })?; - self.state_machine.apply(command).map_err(|e| { - RaftError::StorageError(format!("Failed to apply to state machine: {}", e)) - })?; + self.state_machine.apply(command).map_err(|e| { + RaftError::StorageError(format!("Failed to apply to state machine: {}", e)) + })?; - debug!( - index = entry.log_id.index, - term = entry.log_id.term, - "Applied entry to state machine" - ); + debug!( + index = entry.log_id.index, + term = entry.log_id.term, + "Applied entry to state machine" + ); + } + EntryPayload::Membership(bytes) => { + let membership: ClusterMembership = + bincode::deserialize(bytes).map_err(|e| { + RaftError::StorageError(format!( + "Failed to deserialize membership for apply: {}", + e + )) + })?; + let removing_or_demoting_self = membership + .member(self.node_id) + .map(|member| member.is_learner) + .unwrap_or(true); + if removing_or_demoting_self && self.role().await == RaftRole::Leader { + self.handle_heartbeat_timeout().await?; + } + self.apply_runtime_membership(membership, true).await?; + debug!( + index = entry.log_id.index, + term = entry.log_id.term, + "Applied membership change" + ); + } + EntryPayload::Blank => {} } } // Update last_applied - volatile.last_applied = commit_index; + self.volatile.write().await.last_applied = commit_index; debug!( last_applied = commit_index, @@ -1235,11 +1556,66 @@ impl RaftCore { // P3: Client Requests // ======================================================================== + async fn handle_membership_change( + &self, + membership: ClusterMembership, + ) -> Result { + let role = *self.role.read().await; + if role != RaftRole::Leader { + return Err(RaftError::NotLeader { + leader_id: self.volatile.read().await.current_leader, + }); + } + + let current_membership = self.membership.read().await.clone(); + let membership = membership.normalized(); + Self::validate_membership_transition(¤t_membership, &membership)?; + if current_membership == membership { + let (last_log_index, _) = self.get_last_log_info().await?; + return Ok(last_log_index); + } + if self.has_pending_membership_change().await? 
{ + return Err(RaftError::Rejected( + "another membership change is still in flight".to_string(), + )); + } + + let term = self.persistent.read().await.current_term; + let (last_log_index, _) = self.get_last_log_info().await?; + let new_index = last_log_index + 1; + let payload = Self::serialize_membership(&membership)?; + let entry: LogEntry> = LogEntry { + log_id: LogId { + term, + index: new_index, + }, + payload: EntryPayload::Membership(payload), + }; + + self.storage.append(&[entry]).map_err(|e| { + RaftError::StorageError(format!("Failed to append membership entry: {}", e)) + })?; + + self.event_tx + .send(RaftEvent::HeartbeatTimeout) + .map_err(|e| { + RaftError::NetworkError(format!("Failed to trigger membership replication: {}", e)) + })?; + + if self.peers_snapshot().await.is_empty() { + self.advance_commit_index().await?; + } + + Ok(new_index) + } + async fn handle_client_write(&self, command: RaftCommand) -> Result<(), RaftError> { let role = *self.role.read().await; if role != RaftRole::Leader { - return Err(RaftError::NotLeader { leader_id: None }); + return Err(RaftError::NotLeader { + leader_id: self.volatile.read().await.current_leader, + }); } // Get current term and last log index @@ -1320,7 +1696,7 @@ impl RaftCore { })?; // Single-node cluster: immediately commit since we're the only voter - if self.peers.is_empty() { + if self.peers_snapshot().await.is_empty() { self.advance_commit_index().await?; } @@ -1331,6 +1707,171 @@ impl RaftCore { Ok(()) } + async fn append_blank_leader_entry(&self) -> Result<(), RaftError> { + let term = self.persistent.read().await.current_term; + let (last_log_index, _) = self.get_last_log_info().await?; + let new_index = last_log_index + 1; + let entry: LogEntry> = LogEntry { + log_id: LogId { + term, + index: new_index, + }, + payload: EntryPayload::Blank, + }; + + self.storage + .append(&[entry]) + .map_err(|e| RaftError::StorageError(format!("Failed to append blank entry: {}", e)))?; + + self.event_tx + .send(RaftEvent::HeartbeatTimeout) + .map_err(|e| { + RaftError::NetworkError(format!("Failed to trigger blank-entry replication: {}", e)) + })?; + + if self.peers_snapshot().await.is_empty() { + self.advance_commit_index().await?; + } + + Ok(()) + } + + async fn has_pending_membership_change(&self) -> Result { + let last_applied = self.last_applied().await; + let (last_log_index, _) = self.get_last_log_info().await?; + if last_log_index <= last_applied { + return Ok(false); + } + + let entries: Vec>> = self + .storage + .get_log_entries((last_applied + 1)..=last_log_index) + .map_err(|e| { + RaftError::StorageError(format!( + "Failed to inspect pending membership changes: {}", + e + )) + })?; + + Ok(entries + .iter() + .any(|entry| matches!(entry.payload, EntryPayload::Membership(_)))) + } + + fn validate_membership_change(membership: &ClusterMembership) -> Result<(), RaftError> { + if membership.members.is_empty() { + return Err(RaftError::Rejected( + "membership change must keep at least one member".to_string(), + )); + } + if membership.voter_ids().is_empty() { + return Err(RaftError::Rejected( + "membership change must keep at least one voting member".to_string(), + )); + } + Ok(()) + } + + fn validate_membership_transition( + current: &ClusterMembership, + target: &ClusterMembership, + ) -> Result<(), RaftError> { + Self::validate_membership_change(target)?; + + let current_voters = current.voter_ids().into_iter().collect::>(); + let target_voters = target.voter_ids().into_iter().collect::>(); + let added_voters = target_voters + 
.difference(¤t_voters) + .copied() + .collect::>(); + let removed_voters = current_voters + .difference(&target_voters) + .copied() + .collect::>(); + + if added_voters.is_empty() && removed_voters.is_empty() { + return Ok(()); + } + + if !added_voters.is_empty() && !removed_voters.is_empty() { + return Err(RaftError::Rejected(format!( + "membership change cannot add voters {:?} and remove voters {:?} in the same step; use sequential one-voter transitions until joint consensus is implemented", + added_voters, removed_voters + ))); + } + + if added_voters.len() > 1 || removed_voters.len() > 1 { + return Err(RaftError::Rejected(format!( + "membership change modifies multiple voting members in one step (added {:?}, removed {:?}); use sequential one-voter transitions until joint consensus is implemented", + added_voters, removed_voters + ))); + } + + Ok(()) + } + + async fn wait_for_transfer_target_caught_up( + &self, + target: NodeId, + timeout: Duration, + ) -> Result<(), RaftError> { + let start = time::Instant::now(); + loop { + if *self.role.read().await != RaftRole::Leader { + return Err(RaftError::NotLeader { + leader_id: self.volatile.read().await.current_leader, + }); + } + + let (last_log_index, _) = self.get_last_log_info().await?; + let match_index = { + let leader_state = self.leader_state.read().await; + leader_state + .as_ref() + .and_then(|state| state.match_index.get(&target).copied()) + .unwrap_or(0) + }; + + if match_index >= last_log_index { + return Ok(()); + } + + self.event_tx + .send(RaftEvent::HeartbeatTimeout) + .map_err(|e| { + RaftError::NetworkError(format!( + "Failed to trigger leader-transfer catch-up heartbeat: {}", + e + )) + })?; + + if start.elapsed() > timeout { + return Err(RaftError::Timeout); + } + + time::sleep(Duration::from_millis(20)).await; + } + } + + async fn wait_for_observed_leader( + &self, + target: NodeId, + timeout: Duration, + ) -> Result<(), RaftError> { + let start = time::Instant::now(); + loop { + if self.leader().await == Some(target) { + return Ok(()); + } + + if start.elapsed() > timeout { + return Err(RaftError::Timeout); + } + + time::sleep(Duration::from_millis(20)).await; + } + } + // ======================================================================== // Helper Methods // ======================================================================== @@ -1455,7 +1996,7 @@ impl RaftCore { req, response_tx: resp_tx, }); - if let Err(e) = result { + if let Err(_e) = result { eprintln!( "[Node {}] ERROR: Failed to send AppendEntries event: channel closed", self.node_id @@ -1463,6 +2004,13 @@ impl RaftCore { } } + /// Inject TimeoutNow RPC (for testing or transport bridges). + pub async fn timeout_now_rpc(&self, resp_tx: oneshot::Sender>) { + let _ = self.event_tx.send(RaftEvent::TimeoutNow { + response_tx: resp_tx, + }); + } + /// Get current leader pub async fn leader(&self) -> Option { self.volatile.read().await.current_leader @@ -1526,6 +2074,104 @@ impl RaftCore { } } + /// Submit a membership change and wait until it is committed and applied. 
+ pub async fn change_membership( + &self, + membership: ClusterMembership, + ) -> Result { + let target = membership.normalized(); + let (tx, rx) = oneshot::channel(); + self.event_tx + .send(RaftEvent::MembershipChange { + membership: target.clone(), + response_tx: tx, + }) + .map_err(|e| { + RaftError::NetworkError(format!("Failed to send membership change: {}", e)) + })?; + + let target_index = rx.await.map_err(|e| { + RaftError::NetworkError(format!("Membership change response lost: {}", e)) + })??; + + let timeout = tokio::time::Duration::from_secs(10); + let start = tokio::time::Instant::now(); + loop { + let applied = self.last_applied().await; + let current = self.membership.read().await.clone(); + if applied >= target_index && current == target { + return Ok(current); + } + + if start.elapsed() > timeout { + return Err(RaftError::Timeout); + } + + tokio::time::sleep(tokio::time::Duration::from_millis(10)).await; + } + } + + /// Add or update a member and wait for the change to be applied. + pub async fn add_member(&self, member: ClusterMember) -> Result { + let membership = self.membership.read().await.clone().with_member(member); + self.change_membership(membership).await + } + + /// Remove a member and wait for the change to be applied. + pub async fn remove_member(&self, id: NodeId) -> Result { + let membership = self.membership.read().await.clone().without_member(id); + self.change_membership(membership).await + } + + /// Transfer leadership to a specific voting member. + pub async fn transfer_leader(&self, target: NodeId) -> Result { + if target == 0 { + return Err(RaftError::Rejected( + "leader transfer target must be non-zero".to_string(), + )); + } + + if *self.role.read().await != RaftRole::Leader { + return Err(RaftError::NotLeader { + leader_id: self.volatile.read().await.current_leader, + }); + } + + if target == self.node_id { + return Ok(target); + } + + let membership = self.membership.read().await.clone(); + let Some(member) = membership.member(target) else { + return Err(RaftError::Rejected(format!( + "leader transfer target {target} is not a cluster member" + ))); + }; + if member.is_learner { + return Err(RaftError::Rejected(format!( + "leader transfer target {target} must be a voting member" + ))); + } + if self.has_pending_membership_change().await? { + return Err(RaftError::Rejected( + "cannot transfer leader while a membership change is still in flight".to_string(), + )); + } + + self.wait_for_transfer_target_caught_up(target, Duration::from_secs(5)) + .await?; + + self.network + .timeout_now(target) + .await + .map_err(|e| RaftError::NetworkError(e.to_string()))?; + + self.wait_for_observed_leader(target, Duration::from_secs(5)) + .await?; + + Ok(target) + } + /// Get current commit index pub async fn commit_index(&self) -> LogIndex { self.volatile.read().await.commit_index @@ -1546,13 +2192,14 @@ impl RaftCore { Arc::clone(&self.storage) } - /// Get current cluster membership as list of node IDs - /// NOTE: Custom RaftCore uses static membership configured at startup + /// Get the current cluster membership details. + pub async fn cluster_membership(&self) -> ClusterMembership { + self.membership.read().await.clone() + } + + /// Get current cluster membership as a sorted list of node IDs. 
pub async fn membership(&self) -> Vec { - let mut members = vec![self.node_id]; - members.extend(self.peers.iter().cloned()); - members.sort(); - members + self.cluster_membership().await.voter_ids() } } @@ -1563,6 +2210,14 @@ impl RaftCore { #[cfg(test)] mod tests { use super::*; + use crate::network::test_client::{InMemoryRpcClient, RpcMessage}; + use chainfire_storage::RocksStore; + use std::future::Future; + use tempfile::{tempdir, TempDir}; + use tokio::{ + sync::mpsc, + time::{sleep, Duration, Instant}, + }; #[test] fn test_vote_request_creation() { @@ -1577,8 +2232,635 @@ mod tests { assert_eq!(req.candidate_id, 1); } + fn member(id: NodeId, raft_addr: &str, client_url: &str) -> ClusterMember { + ClusterMember { + id, + name: format!("node-{id}"), + peer_urls: vec![raft_addr.to_string()], + client_urls: vec![client_url.to_string()], + is_learner: false, + } + } + + fn learner_member(id: NodeId, raft_addr: &str, client_url: &str) -> ClusterMember { + ClusterMember { + is_learner: true, + ..member(id, raft_addr, client_url) + } + } + + fn quiet_test_config() -> RaftConfig { + RaftConfig { + election_timeout_min: 10_000, + election_timeout_max: 20_000, + heartbeat_interval: 25, + } + } + + struct TestClusterNode { + raft: Arc, + _dir: TempDir, + } + + async fn spawn_single_node_leader( + store: RocksStore, + membership: ClusterMembership, + ) -> Arc { + let raft = Arc::new(RaftCore::new( + 1, + membership, + Arc::new(LogStorage::new(store.clone())), + Arc::new(StateMachine::new(store).expect("state machine")), + Arc::new(InMemoryRpcClient::new()) as Arc, + quiet_test_config(), + )); + raft.initialize().await.expect("initialize raft"); + + let raft_task = Arc::clone(&raft); + tokio::spawn(async move { + raft_task.run().await.expect("raft event loop"); + }); + sleep(Duration::from_millis(25)).await; + + raft.become_leader().await.expect("become leader"); + sleep(Duration::from_millis(25)).await; + raft + } + + async fn spawn_cluster_node( + node_id: NodeId, + membership: ClusterMembership, + network: Arc, + ) -> TestClusterNode { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = Arc::new(RaftCore::new( + node_id, + membership, + Arc::new(LogStorage::new(store.clone())), + Arc::new(StateMachine::new(store).expect("state machine")), + Arc::clone(&network) as Arc, + quiet_test_config(), + )); + raft.initialize().await.expect("initialize raft"); + + let (tx, mut rx) = mpsc::unbounded_channel(); + network.register(node_id, tx).await; + + let rpc_target = Arc::clone(&raft); + tokio::spawn(async move { + while let Some(message) = rx.recv().await { + match message { + RpcMessage::Vote(req, resp_tx) => { + rpc_target.request_vote_rpc(req, resp_tx).await; + } + RpcMessage::AppendEntries(req, resp_tx) => { + rpc_target.append_entries_rpc(req, resp_tx).await; + } + RpcMessage::TimeoutNow(resp_tx) => { + let (raft_resp_tx, raft_resp_rx) = tokio::sync::oneshot::channel(); + rpc_target.timeout_now_rpc(raft_resp_tx).await; + let result = raft_resp_rx + .await + .map_err(|err| format!("TimeoutNow response lost: {err}")) + .and_then(|result| result.map_err(|err| err.to_string())); + let _ = resp_tx.send(result); + } + } + } + }); + + let raft_task = Arc::clone(&raft); + tokio::spawn(async move { + raft_task.run().await.expect("raft event loop"); + }); + sleep(Duration::from_millis(25)).await; + + TestClusterNode { raft, _dir: dir } + } + + async fn wait_until(label: &str, timeout: Duration, mut predicate: F) + where + F: FnMut() -> 
Fut, + Fut: Future, + { + let start = Instant::now(); + loop { + if predicate().await { + return; + } + assert!( + start.elapsed() <= timeout, + "{label}: condition not met within {:?}", + timeout + ); + sleep(Duration::from_millis(10)).await; + } + } + #[tokio::test] - async fn test_raft_core_creation() { - // TODO: Add proper unit tests with mock storage/network + async fn test_single_node_scale_out_persists_membership() { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = spawn_single_node_leader( + store.clone(), + ClusterMembership { + members: vec![member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001")], + }, + ) + .await; + + let membership = raft + .add_member(member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002")) + .await + .expect("scale-out membership change"); + + assert_eq!(membership.voter_ids(), vec![1, 2]); + assert_eq!(raft.cluster_membership().await, membership); + + let persisted = raft + .storage() + .read_membership() + .expect("read membership") + .expect("membership payload"); + let persisted: ClusterMembership = + bincode::deserialize(&persisted).expect("deserialize membership"); + assert_eq!(persisted, membership); + } + + #[tokio::test] + async fn test_replace_member_updates_existing_record() { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = spawn_single_node_leader( + store, + ClusterMembership { + members: vec![member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001")], + }, + ) + .await; + + let membership = raft + .add_member(member(1, "http://10.0.0.11:9001", "http://10.0.0.11:7001")) + .await + .expect("replace member"); + let local = membership.member(1).expect("member 1"); + + assert_eq!(membership.voter_ids(), vec![1]); + assert_eq!(local.peer_urls, vec!["http://10.0.0.11:9001".to_string()]); + assert_eq!(local.client_urls, vec!["http://10.0.0.11:7001".to_string()]); + } + + #[tokio::test] + async fn test_membership_change_rejects_multi_voter_swap_without_joint_consensus() { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = spawn_single_node_leader( + store, + ClusterMembership { + members: vec![ + member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001"), + member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002"), + member(3, "http://127.0.0.1:9003", "http://127.0.0.1:7003"), + ], + }, + ) + .await; + + let err = raft + .change_membership( + ClusterMembership { + members: vec![ + member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001"), + member(3, "http://127.0.0.1:9003", "http://127.0.0.1:7003"), + member(4, "http://127.0.0.1:9004", "http://127.0.0.1:7004"), + ], + } + .normalized(), + ) + .await + .expect_err("multi-voter swap should be rejected"); + + assert!( + err.to_string().contains("joint consensus"), + "unexpected error: {err}" + ); + } + + #[tokio::test] + async fn test_initialize_loads_persisted_membership() { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = spawn_single_node_leader( + store.clone(), + ClusterMembership { + members: vec![member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001")], + }, + ) + .await; + + raft.add_member(member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002")) + .await + .expect("persist membership"); + + let restored = Arc::new(RaftCore::new( + 1, + ClusterMembership { + members: vec![member(1, 
"http://127.0.0.1:9001", "http://127.0.0.1:7001")], + }, + Arc::new(LogStorage::new(store.clone())), + Arc::new(StateMachine::new(store).expect("state machine")), + Arc::new(InMemoryRpcClient::new()) as Arc, + RaftConfig::default(), + )); + restored.initialize().await.expect("restore raft"); + + assert_eq!(restored.cluster_membership().await.voter_ids(), vec![1, 2]); + } + + #[tokio::test] + async fn test_learner_membership_updates_replication_targets() { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = spawn_single_node_leader( + store, + ClusterMembership { + members: vec![member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001")], + }, + ) + .await; + + let membership = raft + .add_member(learner_member( + 2, + "http://127.0.0.1:9002", + "http://127.0.0.1:7002", + )) + .await + .expect("add learner"); + + assert_eq!(membership.voter_ids(), vec![1]); + assert!(membership.member(2).expect("learner member").is_learner); + assert_eq!(raft.replication_targets_snapshot().await, vec![2]); + let leader_state = raft.leader_state.read().await; + assert!(leader_state + .as_ref() + .expect("leader state") + .next_index + .contains_key(&2)); + } + + #[tokio::test] + async fn test_removed_node_does_not_start_elections_or_vote() { + let dir = tempdir().expect("tempdir"); + let store = RocksStore::new(dir.path()).expect("rocksdb store"); + let raft = Arc::new(RaftCore::new( + 1, + ClusterMembership { + members: vec![member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001")], + }, + Arc::new(LogStorage::new(store.clone())), + Arc::new(StateMachine::new(store).expect("state machine")), + Arc::new(InMemoryRpcClient::new()) as Arc, + RaftConfig::default(), + )); + raft.initialize().await.expect("initialize raft"); + + raft.apply_runtime_membership( + ClusterMembership { + members: vec![member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002")], + }, + false, + ) + .await + .expect("apply removal"); + + raft.handle_election_timeout() + .await + .expect("removed node should ignore election timeout"); + assert_eq!(raft.current_term().await, 0); + assert_eq!(raft.role().await, RaftRole::Follower); + + let vote = raft + .handle_vote_request(VoteRequest { + term: 1, + candidate_id: 2, + last_log_index: 0, + last_log_term: 0, + }) + .await + .expect("vote response"); + assert!(!vote.vote_granted); + } + + #[tokio::test] + async fn test_leader_removal_reconfigures_remaining_cluster() { + let network = Arc::new(InMemoryRpcClient::new()); + let membership = ClusterMembership { + members: vec![ + member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001"), + member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002"), + member(3, "http://127.0.0.1:9003", "http://127.0.0.1:7003"), + ], + } + .normalized(); + + let node1 = spawn_cluster_node(1, membership.clone(), Arc::clone(&network)).await; + let node2 = spawn_cluster_node(2, membership.clone(), Arc::clone(&network)).await; + let node3 = spawn_cluster_node(3, membership, Arc::clone(&network)).await; + + node1 + .raft + .handle_election_timeout() + .await + .expect("start initial election"); + wait_until("node1 leader elected", Duration::from_secs(2), || { + let raft = Arc::clone(&node1.raft); + async move { raft.role().await == RaftRole::Leader } + }) + .await; + + let updated = node1 + .raft + .remove_member(1) + .await + .expect("remove current leader"); + assert_eq!(updated.voter_ids(), vec![2, 3]); + + wait_until("node1 stepped down", Duration::from_secs(2), || { + let raft = 
Arc::clone(&node1.raft); + async move { + raft.role().await == RaftRole::Follower + && raft.cluster_membership().await.voter_ids() == vec![2, 3] + } + }) + .await; + + node2 + .raft + .handle_election_timeout() + .await + .expect("start replacement-leader election"); + wait_until( + "node2 replacement leader elected", + Duration::from_secs(2), + || { + let raft = Arc::clone(&node2.raft); + async move { raft.role().await == RaftRole::Leader } + }, + ) + .await; + wait_until( + "remaining nodes reconfigured", + Duration::from_secs(2), + || { + let raft2 = Arc::clone(&node2.raft); + let raft3 = Arc::clone(&node3.raft); + async move { + raft2.cluster_membership().await.voter_ids() == vec![2, 3] + && raft3.cluster_membership().await.voter_ids() == vec![2, 3] + } + }, + ) + .await; + + node2 + .raft + .write(RaftCommand::Put { + key: b"leader-replaced".to_vec(), + value: b"ok".to_vec(), + lease_id: None, + prev_kv: false, + }) + .await + .expect("write after leader replacement"); + + wait_until( + "node3 received post-replacement write", + Duration::from_secs(2), + || { + let sm = node3.raft.state_machine(); + async move { + sm.kv() + .get(b"leader-replaced") + .expect("read follower state") + .is_some() + } + }, + ) + .await; + } + + #[tokio::test] + async fn test_leader_transfer_moves_leadership_to_requested_voter() { + let network = Arc::new(InMemoryRpcClient::new()); + let membership = ClusterMembership { + members: vec![ + member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001"), + member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002"), + member(3, "http://127.0.0.1:9003", "http://127.0.0.1:7003"), + ], + } + .normalized(); + + let node1 = spawn_cluster_node(1, membership.clone(), Arc::clone(&network)).await; + let node2 = spawn_cluster_node(2, membership.clone(), Arc::clone(&network)).await; + let node3 = spawn_cluster_node(3, membership, Arc::clone(&network)).await; + + node1 + .raft + .handle_election_timeout() + .await + .expect("start initial election"); + wait_until("node1 leader elected", Duration::from_secs(2), || { + let raft = Arc::clone(&node1.raft); + async move { raft.role().await == RaftRole::Leader } + }) + .await; + + node1 + .raft + .write(RaftCommand::Put { + key: b"leader-transfer".to_vec(), + value: b"warmup".to_vec(), + lease_id: None, + prev_kv: false, + }) + .await + .expect("warm up replication before transfer"); + + node1 + .raft + .transfer_leader(2) + .await + .expect("transfer leadership to node2"); + + wait_until("node2 became leader", Duration::from_secs(2), || { + let raft = Arc::clone(&node2.raft); + async move { raft.role().await == RaftRole::Leader && raft.leader().await == Some(2) } + }) + .await; + wait_until("node1 observes node2 leader", Duration::from_secs(2), || { + let raft = Arc::clone(&node1.raft); + async move { raft.role().await == RaftRole::Follower && raft.leader().await == Some(2) } + }) + .await; + wait_until( + "node3 observes node2 leader", + Duration::from_secs(2), + || { + let raft = Arc::clone(&node3.raft); + async move { raft.leader().await == Some(2) } + }, + ) + .await; + + node2 + .raft + .write(RaftCommand::Put { + key: b"leader-transfer-post".to_vec(), + value: b"ok".to_vec(), + lease_id: None, + prev_kv: false, + }) + .await + .expect("write after transfer"); + + wait_until( + "other followers apply post-transfer write", + Duration::from_secs(2), + || { + let sm1 = node1.raft.state_machine(); + let sm3 = node3.raft.state_machine(); + async move { + sm1.kv() + .get(b"leader-transfer-post") + .expect("read node1 state") + 
.is_some() + && sm3 + .kv() + .get(b"leader-transfer-post") + .expect("read node3 state") + .is_some() + } + }, + ) + .await; + } + + #[tokio::test] + async fn test_leader_transfer_rejects_learner_target() { + let network = Arc::new(InMemoryRpcClient::new()); + let membership = ClusterMembership { + members: vec![ + member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001"), + learner_member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002"), + ], + } + .normalized(); + + let node1 = spawn_cluster_node(1, membership.clone(), Arc::clone(&network)).await; + let _node2 = spawn_cluster_node(2, membership, Arc::clone(&network)).await; + + node1 + .raft + .handle_election_timeout() + .await + .expect("start initial election"); + wait_until("node1 leader elected", Duration::from_secs(2), || { + let raft = Arc::clone(&node1.raft); + async move { raft.role().await == RaftRole::Leader } + }) + .await; + + let err = node1 + .raft + .transfer_leader(2) + .await + .expect_err("learner target should be rejected"); + assert!( + err.to_string().contains("voting member"), + "unexpected error: {err}" + ); + } + + #[tokio::test] + async fn test_joining_node_applies_membership_changes_locally() { + let network = Arc::new(InMemoryRpcClient::new()); + let canonical_membership = ClusterMembership { + members: vec![ + member(1, "http://127.0.0.1:9001", "http://127.0.0.1:7001"), + member(2, "http://127.0.0.1:9002", "http://127.0.0.1:7002"), + member(3, "http://127.0.0.1:9003", "http://127.0.0.1:7003"), + ], + } + .normalized(); + + let node1 = spawn_cluster_node(1, canonical_membership.clone(), Arc::clone(&network)).await; + let _node2 = + spawn_cluster_node(2, canonical_membership.clone(), Arc::clone(&network)).await; + let _node3 = + spawn_cluster_node(3, canonical_membership.clone(), Arc::clone(&network)).await; + + node1 + .raft + .handle_election_timeout() + .await + .expect("start initial election"); + wait_until("node1 leader elected", Duration::from_secs(2), || { + let raft = Arc::clone(&node1.raft); + async move { raft.role().await == RaftRole::Leader } + }) + .await; + + let node4 = spawn_cluster_node(4, canonical_membership, Arc::clone(&network)).await; + + node1 + .raft + .add_member(learner_member( + 4, + "http://127.0.0.1:9004", + "http://127.0.0.1:7004", + )) + .await + .expect("add node4 as learner"); + wait_until( + "node4 observes learner membership locally", + Duration::from_secs(2), + || { + let raft = Arc::clone(&node4.raft); + async move { + let membership = raft.cluster_membership().await; + membership.members.len() == 4 + && membership + .member(4) + .map(|member| member.is_learner) + .unwrap_or(false) + } + }, + ) + .await; + + node1 + .raft + .add_member(member(4, "http://127.0.0.1:9004", "http://127.0.0.1:7004")) + .await + .expect("promote node4 to voter"); + wait_until( + "node4 observes voter membership locally", + Duration::from_secs(2), + || { + let raft = Arc::clone(&node4.raft); + async move { + let membership = raft.cluster_membership().await; + membership.members.len() == 4 + && membership + .member(4) + .map(|member| !member.is_learner) + .unwrap_or(false) + } + }, + ) + .await; } } diff --git a/chainfire/crates/chainfire-raft/src/network.rs b/chainfire/crates/chainfire-raft/src/network.rs index 83b96cb..34f8aff 100644 --- a/chainfire/crates/chainfire-raft/src/network.rs +++ b/chainfire/crates/chainfire-raft/src/network.rs @@ -27,6 +27,12 @@ pub enum RaftNetworkError { /// Trait for sending Raft RPCs #[async_trait::async_trait] pub trait RaftRpcClient: Send + Sync + 'static 
{ + async fn add_node(&self, target: NodeId, addr: String) -> Result<(), RaftNetworkError>; + + async fn remove_node(&self, target: NodeId) -> Result<(), RaftNetworkError>; + + async fn timeout_now(&self, target: NodeId) -> Result<(), RaftNetworkError>; + async fn vote( &self, target: NodeId, @@ -59,6 +65,7 @@ pub mod test_client { AppendEntriesRequest, tokio::sync::oneshot::Sender, ), + TimeoutNow(tokio::sync::oneshot::Sender>), } impl Default for InMemoryRpcClient { @@ -81,15 +88,46 @@ pub mod test_client { #[async_trait::async_trait] impl RaftRpcClient for InMemoryRpcClient { + async fn add_node(&self, _target: NodeId, _addr: String) -> Result<(), RaftNetworkError> { + Ok(()) + } + + async fn remove_node(&self, target: NodeId) -> Result<(), RaftNetworkError> { + self.channels.write().await.remove(&target); + Ok(()) + } + + async fn timeout_now(&self, target: NodeId) -> Result<(), RaftNetworkError> { + let tx = { + let channels = self.channels.read().await; + channels + .get(&target) + .cloned() + .ok_or(RaftNetworkError::NodeNotFound(target))? + }; + + let (resp_tx, resp_rx) = tokio::sync::oneshot::channel(); + tx.send(RpcMessage::TimeoutNow(resp_tx)) + .map_err(|_| RaftNetworkError::RpcFailed("Channel closed".into()))?; + + resp_rx + .await + .map_err(|_| RaftNetworkError::RpcFailed("Response channel closed".into()))? + .map_err(RaftNetworkError::RpcFailed) + } + async fn vote( &self, target: NodeId, req: VoteRequest, ) -> Result { - let channels = self.channels.read().await; - let tx = channels - .get(&target) - .ok_or(RaftNetworkError::NodeNotFound(target))?; + let tx = { + let channels = self.channels.read().await; + channels + .get(&target) + .cloned() + .ok_or(RaftNetworkError::NodeNotFound(target))? + }; let (resp_tx, resp_rx) = tokio::sync::oneshot::channel(); tx.send(RpcMessage::Vote(req, resp_tx)) @@ -105,20 +143,22 @@ pub mod test_client { target: NodeId, req: AppendEntriesRequest, ) -> Result { - let channels = self.channels.read().await; - let tx = channels.get(&target).ok_or_else(|| { - eprintln!( - "[RPC] NodeNotFound: target={}, registered={:?}", - target, - channels.keys().collect::>() - ); - RaftNetworkError::NodeNotFound(target) - })?; + let tx = { + let channels = self.channels.read().await; + channels.get(&target).cloned().ok_or_else(|| { + eprintln!( + "[RPC] NodeNotFound: target={}, registered={:?}", + target, + channels.keys().collect::>() + ); + RaftNetworkError::NodeNotFound(target) + })? 
+ }; let (resp_tx, resp_rx) = tokio::sync::oneshot::channel(); let send_result = tx.send(RpcMessage::AppendEntries(req.clone(), resp_tx)); - if let Err(e) = send_result { + if let Err(_e) = send_result { eprintln!("[RPC] Send failed to node {}: channel closed", target); return Err(RaftNetworkError::RpcFailed("Channel closed".into())); } diff --git a/chainfire/crates/chainfire-server/src/node.rs b/chainfire/crates/chainfire-server/src/node.rs index 8298dba..37149cc 100644 --- a/chainfire/crates/chainfire-server/src/node.rs +++ b/chainfire/crates/chainfire-server/src/node.rs @@ -6,7 +6,7 @@ use crate::config::ServerConfig; use anyhow::Result; use chainfire_api::GrpcRaftClient; use chainfire_gossip::{GossipAgent, GossipId}; -use chainfire_raft::core::{RaftConfig, RaftCore}; +use chainfire_raft::core::{ClusterMember, ClusterMembership, RaftConfig, RaftCore}; use chainfire_raft::network::RaftRpcClient; use chainfire_storage::{LogStorage, RocksStore, StateMachine}; use chainfire_types::node::NodeRole; @@ -43,6 +43,7 @@ impl Node { // Create Raft core only if role participates in Raft let (raft, rpc_client) = if config.raft.role.participates_in_raft() { + let membership = initial_membership(&config); // Create RocksDB store let store = RocksStore::new(&config.storage.data_dir)?; info!(data_dir = ?config.storage.data_dir, "Opened storage"); @@ -57,26 +58,22 @@ impl Node { // Create gRPC Raft client and register peer addresses let rpc_client = Arc::new(GrpcRaftClient::new()); - for member in &config.cluster.initial_members { - rpc_client - .add_node(member.id, member.raft_addr.clone()) - .await; - info!(node_id = member.id, addr = %member.raft_addr, "Registered peer"); + for member in &membership.members { + if let Some(peer_url) = member.peer_urls.first() { + let addr = peer_url + .strip_prefix("http://") + .or_else(|| peer_url.strip_prefix("https://")) + .unwrap_or(peer_url) + .to_string(); + rpc_client.add_node(member.id, addr.clone()).await; + info!(node_id = member.id, addr = %addr, "Registered peer"); + } } - // Extract peer node IDs (excluding self) - let peers: Vec = config - .cluster - .initial_members - .iter() - .map(|m| m.id) - .filter(|&id| id != config.node.id) - .collect(); - // Create RaftCore with default config let raft_core = Arc::new(RaftCore::new( config.node.id, - peers, + membership, log_storage, state_machine, Arc::clone(&rpc_client) as Arc, @@ -179,7 +176,7 @@ impl Node { /// NOTE: Custom RaftCore handles multi-node initialization via the peers parameter /// in the constructor. All nodes start with the same peer list and will elect a leader. 
pub async fn maybe_bootstrap(&self) -> Result<()> { - let Some(raft) = &self.raft else { + let Some(_raft) = &self.raft else { info!("No Raft core to bootstrap (role=none)"); return Ok(()); }; @@ -231,3 +228,53 @@ impl Node { let _ = self.shutdown_tx.send(()); } } + +fn initial_membership(config: &ServerConfig) -> ClusterMembership { + let api_port = config.network.api_addr.port(); + let mut members: Vec = config + .cluster + .initial_members + .iter() + .map(|member| ClusterMember { + id: member.id, + name: format!("node-{}", member.id), + peer_urls: vec![normalize_peer_url(&member.raft_addr)], + client_urls: grpc_endpoint_from_raft_addr(&member.raft_addr, api_port) + .into_iter() + .collect(), + is_learner: member.id == config.node.id && config.raft.role == RaftRole::Learner, + }) + .collect(); + + if members.is_empty() { + members.push(ClusterMember { + id: config.node.id, + name: config.node.name.clone(), + peer_urls: vec![normalize_peer_url(&config.network.raft_addr.to_string())], + client_urls: vec![format!( + "http://{}:{}", + config.network.api_addr.ip(), + config.network.api_addr.port() + )], + is_learner: config.raft.role == RaftRole::Learner, + }); + } + + ClusterMembership { members }.normalized() +} + +fn grpc_endpoint_from_raft_addr(raft_addr: &str, api_port: u16) -> Option { + if let Ok(addr) = raft_addr.parse::() { + return Some(format!("http://{}:{}", addr.ip(), api_port)); + } + let (host, _) = raft_addr.rsplit_once(':')?; + Some(format!("http://{}:{}", host, api_port)) +} + +fn normalize_peer_url(raft_addr: &str) -> String { + if raft_addr.contains("://") { + raft_addr.to_string() + } else { + format!("http://{raft_addr}") + } +} diff --git a/chainfire/crates/chainfire-server/src/rest.rs b/chainfire/crates/chainfire-server/src/rest.rs index 442d44f..8c0fab7 100644 --- a/chainfire/crates/chainfire-server/src/rest.rs +++ b/chainfire/crates/chainfire-server/src/rest.rs @@ -7,18 +7,20 @@ //! - GET /api/v1/kv?prefix={prefix} - Range scan //! - GET /api/v1/cluster/status - Cluster health //! - POST /api/v1/cluster/members - Add member +//! 
- POST /api/v1/cluster/leader/transfer - Transfer cluster leadership use axum::{ extract::{Path, Query, State}, http::StatusCode, - routing::{get, post}, + routing::{delete, get, post}, Json, Router, }; -use chainfire_api::GrpcRaftClient; -use chainfire_raft::{core::RaftError, RaftCore}; +use chainfire_raft::{ + core::{ClusterMember, RaftError}, + RaftCore, +}; use chainfire_types::command::RaftCommand; use serde::{Deserialize, Serialize}; -use std::collections::HashMap; use std::sync::Arc; /// REST API state @@ -26,9 +28,8 @@ use std::sync::Arc; pub struct RestApiState { pub raft: Arc, pub cluster_id: u64, - pub rpc_client: Option>, pub http_client: reqwest::Client, - pub peer_http_addrs: Arc>, + pub http_port: u16, } /// Standard REST error response @@ -113,21 +114,39 @@ pub struct ClusterStatusResponse { } /// Add member request -#[derive(Debug, Deserialize)] +#[derive(Debug, Deserialize, Serialize)] pub struct AddMemberRequest { pub node_id: u64, pub raft_addr: String, + #[serde(default)] + pub client_url: Option, + #[serde(default)] + pub name: Option, + #[serde(default)] + pub is_learner: bool, } /// Add member request (legacy format from first-boot-automation) /// Accepts string id and converts to numeric node_id -#[derive(Debug, Deserialize)] +#[derive(Debug, Deserialize, Serialize)] pub struct AddMemberRequestLegacy { /// Node ID as string (e.g., "node01", "node02") pub id: String, pub raft_addr: String, } +/// Remove member request body. +#[derive(Debug, Deserialize)] +pub struct RemoveMemberRequest { + pub node_id: u64, +} + +/// Leader-transfer request body. +#[derive(Debug, Deserialize, Serialize)] +pub struct LeaderTransferRequest { + pub target_id: u64, +} + /// Query parameters for prefix scan #[derive(Debug, Deserialize)] pub struct PrefixQuery { @@ -154,6 +173,8 @@ pub fn build_router(state: RestApiState) -> Router { .route("/api/v1/kv", get(list_kv)) .route("/api/v1/cluster/status", get(cluster_status)) .route("/api/v1/cluster/members", post(add_member)) + .route("/api/v1/cluster/leader/transfer", post(transfer_leader)) + .route("/api/v1/cluster/members/:node_id", delete(remove_member)) // Legacy endpoint for first-boot-automation compatibility .route("/admin/member/add", post(add_member_legacy)) .route("/health", get(health_check)) @@ -342,38 +363,77 @@ fn string_to_node_id(s: &str) -> u64 { hasher.finish() } +fn cluster_operation_error(err: &RaftError) -> (StatusCode, &'static str, String) { + match err { + RaftError::Rejected(message) => ( + StatusCode::PRECONDITION_FAILED, + "PRECONDITION_FAILED", + message.clone(), + ), + RaftError::Timeout => ( + StatusCode::REQUEST_TIMEOUT, + "TIMEOUT", + "cluster operation timed out".to_string(), + ), + _ => ( + StatusCode::INTERNAL_SERVER_ERROR, + "INTERNAL_ERROR", + err.to_string(), + ), + } +} + /// POST /api/v1/cluster/members - Add member async fn add_member( State(state): State, Json(req): Json, ) -> Result<(StatusCode, Json>), (StatusCode, Json)> { - let rpc_client = state.rpc_client.as_ref().ok_or_else(|| { - error_response( - StatusCode::SERVICE_UNAVAILABLE, - "SERVICE_UNAVAILABLE", - "RPC client not available", - ) - })?; + let member = ClusterMember { + id: req.node_id, + name: req + .name + .clone() + .filter(|value| !value.trim().is_empty()) + .unwrap_or_else(|| format!("node-{}", req.node_id)), + peer_urls: vec![normalize_peer_url(&req.raft_addr)], + client_urls: req.client_url.clone().into_iter().collect(), + is_learner: req.is_learner, + }; - // Add node to RPC client's routing table - rpc_client - 
.add_node(req.node_id, req.raft_addr.clone()) - .await; - - // Note: RaftCore doesn't have add_peer() - members are managed via configuration - // For now, we just register the node in the RPC client - // In a full implementation, this would trigger a Raft configuration change - - Ok(( - StatusCode::CREATED, - Json(SuccessResponse::new(serde_json::json!({ - "node_id": req.node_id, - "raft_addr": req.raft_addr, - "success": true, - "note": "Node registered in RPC client routing table" - }))), - )) + match state.raft.add_member(member).await { + Ok(membership) => { + return Ok(( + StatusCode::CREATED, + Json(SuccessResponse::new(serde_json::json!({ + "node_id": req.node_id, + "raft_addr": req.raft_addr, + "members": membership.members.len(), + "success": true + }))), + )); + } + Err(RaftError::NotLeader { leader_id }) => { + return proxy_cluster_write_to_leader( + &state, + leader_id, + "/api/v1/cluster/members", + reqwest::Method::POST, + Some(serde_json::to_value(&req).map_err(|err| { + error_response( + StatusCode::INTERNAL_SERVER_ERROR, + "INTERNAL_ERROR", + &format!("failed to encode add-member request: {err}"), + ) + })?), + ) + .await; + } + Err(err) => { + let (status, code, message) = cluster_operation_error(&err); + return Err(error_response(status, code, &message)); + } + } } /// POST /admin/member/add - Add member (legacy format for first-boot-automation) @@ -383,28 +443,94 @@ async fn add_member_legacy( ) -> Result<(StatusCode, Json>), (StatusCode, Json)> { let node_id = string_to_node_id(&req.id); + add_member( + State(state), + Json(AddMemberRequest { + node_id, + raft_addr: req.raft_addr, + client_url: None, + name: Some(req.id), + is_learner: false, + }), + ) + .await +} - let rpc_client = state.rpc_client.as_ref().ok_or_else(|| { - error_response( - StatusCode::SERVICE_UNAVAILABLE, - "SERVICE_UNAVAILABLE", - "RPC client not available", - ) - })?; +/// DELETE /api/v1/cluster/members/:node_id - Remove member. +async fn remove_member( + State(state): State, + Path(node_id): Path, +) -> Result<(StatusCode, Json>), (StatusCode, Json)> +{ + match state.raft.remove_member(node_id).await { + Ok(membership) => Ok(( + StatusCode::OK, + Json(SuccessResponse::new(serde_json::json!({ + "node_id": node_id, + "members": membership.members.len(), + "success": true + }))), + )), + Err(RaftError::NotLeader { leader_id }) => { + proxy_cluster_write_to_leader( + &state, + leader_id, + &format!("/api/v1/cluster/members/{node_id}"), + reqwest::Method::DELETE, + None, + ) + .await + } + Err(err) => { + let (status, code, message) = cluster_operation_error(&err); + Err(error_response(status, code, &message)) + } + } +} - // Add node to RPC client's routing table - rpc_client.add_node(node_id, req.raft_addr.clone()).await; +/// POST /api/v1/cluster/leader/transfer - Transfer cluster leadership. 
+async fn transfer_leader( + State(state): State, + Json(req): Json, +) -> Result<(StatusCode, Json>), (StatusCode, Json)> +{ + if req.target_id == 0 { + return Err(error_response( + StatusCode::BAD_REQUEST, + "INVALID_ARGUMENT", + "leader transfer target must be non-zero", + )); + } - Ok(( - StatusCode::CREATED, - Json(SuccessResponse::new(serde_json::json!({ - "id": req.id, - "node_id": node_id, - "raft_addr": req.raft_addr, - "success": true, - "note": "Node registered in RPC client routing table (legacy API)" - }))), - )) + match state.raft.transfer_leader(req.target_id).await { + Ok(leader) => Ok(( + StatusCode::OK, + Json(SuccessResponse::new(serde_json::json!({ + "leader": leader, + "success": true + }))), + )), + Err(RaftError::NotLeader { leader_id }) => { + proxy_cluster_write_to_leader( + &state, + leader_id, + "/api/v1/cluster/leader/transfer", + reqwest::Method::POST, + Some(serde_json::to_value(&req).map_err(|err| { + error_response( + StatusCode::INTERNAL_SERVER_ERROR, + "INTERNAL_ERROR", + &format!("failed to encode leader-transfer request: {err}"), + ) + })?), + ) + .await + } + Err(err) => { + let (status, code, message) = cluster_operation_error(&err); + Err(error_response(status, code, &message)) + } + } } /// Helper to create error response @@ -426,6 +552,54 @@ fn error_response( ) } +fn normalize_peer_url(raft_addr: &str) -> String { + if raft_addr.contains("://") { + raft_addr.to_string() + } else { + format!("http://{raft_addr}") + } +} + +fn http_endpoint_from_peer_url(peer_url: &str, http_port: u16) -> Option { + let trimmed = peer_url + .strip_prefix("http://") + .or_else(|| peer_url.strip_prefix("https://")) + .unwrap_or(peer_url); + if let Ok(addr) = trimmed.parse::() { + return Some(format!("http://{}:{}", addr.ip(), http_port)); + } + let (host, _) = trimmed.rsplit_once(':')?; + Some(format!("http://{}:{}", host, http_port)) +} + +async fn leader_http_addr( + state: &RestApiState, + leader_id: u64, +) -> Result)> { + let membership = state.raft.cluster_membership().await; + let leader = membership.member(leader_id).ok_or_else(|| { + error_response( + StatusCode::SERVICE_UNAVAILABLE, + "NOT_LEADER", + &format!("leader {leader_id} is known but has no membership record"), + ) + })?; + let peer_url = leader.peer_urls.first().ok_or_else(|| { + error_response( + StatusCode::SERVICE_UNAVAILABLE, + "NOT_LEADER", + &format!("leader {leader_id} is known but has no peer URL"), + ) + })?; + http_endpoint_from_peer_url(peer_url, state.http_port).ok_or_else(|| { + error_response( + StatusCode::SERVICE_UNAVAILABLE, + "NOT_LEADER", + &format!("leader {leader_id} peer URL {peer_url} cannot be mapped to HTTP"), + ) + }) +} + async fn submit_rest_write( state: &RestApiState, command: RaftCommand, @@ -464,13 +638,7 @@ async fn proxy_write_to_leader( "current node is not the leader and no leader is known yet", ) })?; - let leader_http_addr = state.peer_http_addrs.get(&leader_id).ok_or_else(|| { - error_response( - StatusCode::SERVICE_UNAVAILABLE, - "NOT_LEADER", - &format!("leader {leader_id} is known but has no HTTP endpoint mapping"), - ) - })?; + let leader_http_addr = leader_http_addr(state, leader_id).await?; let url = format!( "{}/api/v1/kv/{}", leader_http_addr.trim_end_matches('/'), @@ -506,6 +674,64 @@ async fn proxy_write_to_leader( Err((status, Json(payload))) } +async fn proxy_cluster_write_to_leader( + state: &RestApiState, + leader_id: Option, + path: &str, + method: reqwest::Method, + body: Option, +) -> Result<(StatusCode, Json>), (StatusCode, Json)> +{ + let 
leader_id = leader_id.ok_or_else(|| { + error_response( + StatusCode::SERVICE_UNAVAILABLE, + "NOT_LEADER", + "current node is not the leader and no leader is known yet", + ) + })?; + let leader_http_addr = leader_http_addr(state, leader_id).await?; + let url = format!("{}{}", leader_http_addr.trim_end_matches('/'), path); + let mut request = state.http_client.request(method, &url); + if let Some(body) = body { + request = request.json(&body); + } + let response = request.send().await.map_err(|err| { + error_response( + StatusCode::BAD_GATEWAY, + "LEADER_PROXY_FAILED", + &format!("failed to forward cluster write to leader {leader_id}: {err}"), + ) + })?; + if response.status().is_success() { + let status = StatusCode::from_u16(response.status().as_u16()).unwrap_or(StatusCode::OK); + let payload = response + .json::>() + .await + .map_err(|err| { + error_response( + StatusCode::BAD_GATEWAY, + "LEADER_PROXY_FAILED", + &format!("failed to decode leader {leader_id} response: {err}"), + ) + })?; + return Ok((status, Json(payload))); + } + let status = + StatusCode::from_u16(response.status().as_u16()).unwrap_or(StatusCode::BAD_GATEWAY); + let payload = response + .json::() + .await + .unwrap_or_else(|err| ErrorResponse { + error: ErrorDetail { + code: "LEADER_PROXY_FAILED".to_string(), + message: format!("leader {leader_id} returned {status}: {err}"), + details: None, + }, + meta: ResponseMeta::new(), + }); + Err((status, Json(payload))) +} + async fn should_proxy_read(consistency: Option<&str>, state: &RestApiState) -> bool { let node_id = state.raft.node_id(); let leader_id = state.raft.leader().await; @@ -517,7 +743,12 @@ fn read_requires_leader_proxy( node_id: u64, leader_id: Option, ) -> bool { - if matches!(consistency, Some(mode) if mode.eq_ignore_ascii_case("local")) { + if matches!( + consistency, + Some(mode) + if mode.eq_ignore_ascii_case("local") + || mode.eq_ignore_ascii_case("serializable") + ) { return false; } matches!(leader_id, Some(leader_id) if leader_id != node_id) @@ -538,13 +769,7 @@ where "current node is not the leader and no leader is known yet", ) })?; - let leader_http_addr = state.peer_http_addrs.get(&leader_id).ok_or_else(|| { - error_response( - StatusCode::SERVICE_UNAVAILABLE, - "NOT_LEADER", - &format!("leader {leader_id} is known but has no HTTP endpoint mapping"), - ) - })?; + let leader_http_addr = leader_http_addr(state, leader_id).await?; let url = format!("{}{}", leader_http_addr.trim_end_matches('/'), path); let mut request = state.http_client.get(&url); if let Some(query) = query { @@ -591,7 +816,26 @@ mod tests { fn read_requires_leader_proxy_defaults_to_leader_consistency() { assert!(read_requires_leader_proxy(None, 2, Some(1))); assert!(!read_requires_leader_proxy(Some("local"), 2, Some(1))); + assert!(!read_requires_leader_proxy( + Some("serializable"), + 2, + Some(1) + )); + assert!(!read_requires_leader_proxy( + Some("SERIALIZABLE"), + 2, + Some(1) + )); assert!(!read_requires_leader_proxy(None, 2, Some(2))); assert!(!read_requires_leader_proxy(None, 2, None)); } + + #[test] + fn cluster_operation_error_maps_rejected_to_precondition_failed() { + let (status, code, message) = + cluster_operation_error(&RaftError::Rejected("needs sequential reconfigure".into())); + assert_eq!(status, StatusCode::PRECONDITION_FAILED); + assert_eq!(code, "PRECONDITION_FAILED"); + assert_eq!(message, "needs sequential reconfigure"); + } } diff --git a/chainfire/crates/chainfire-server/src/server.rs b/chainfire/crates/chainfire-server/src/server.rs index e217b09..a07cab7 
100644 --- a/chainfire/crates/chainfire-server/src/server.rs +++ b/chainfire/crates/chainfire-server/src/server.rs @@ -11,11 +11,10 @@ use crate::rest::{build_router, RestApiState}; use anyhow::Result; use chainfire_api::internal_proto::raft_service_server::RaftServiceServer; use chainfire_api::proto::{ - cluster_server::ClusterServer, kv_server::KvServer, watch_server::WatchServer, Member, + cluster_server::ClusterServer, kv_server::KvServer, watch_server::WatchServer, }; use chainfire_api::{ClusterServiceImpl, KvServiceImpl, RaftServiceImpl, WatchServiceImpl}; use chainfire_types::RaftRole; -use std::collections::HashMap; use std::sync::Arc; use tokio::signal; use tonic::transport::{Certificate, Identity, Server as TonicServer, ServerTlsConfig}; @@ -94,11 +93,7 @@ impl Server { raft.node_id(), ); - let cluster_service = ClusterServiceImpl::new( - Arc::clone(&raft), - self.node.cluster_id(), - configured_members(&self.config), - ); + let cluster_service = ClusterServiceImpl::new(Arc::clone(&raft), self.node.cluster_id()); // Internal Raft service for inter-node communication let raft_service = RaftServiceImpl::new(Arc::clone(&raft)); @@ -155,24 +150,11 @@ impl Server { // HTTP REST API server let http_addr = self.config.network.http_addr; - let http_port = self.config.network.http_addr.port(); - let peer_http_addrs = Arc::new( - self.config - .cluster - .initial_members - .iter() - .filter_map(|member| { - http_endpoint_from_raft_addr(&member.raft_addr, http_port) - .map(|http_addr| (member.id, http_addr)) - }) - .collect::>(), - ); let rest_state = RestApiState { raft: Arc::clone(&raft), cluster_id: self.node.cluster_id(), - rpc_client: self.node.rpc_client().cloned(), http_client: reqwest::Client::new(), - peer_http_addrs, + http_port: self.config.network.http_addr.port(), }; let rest_app = build_router(rest_state); let http_listener = tokio::net::TcpListener::bind(&http_addr).await?; @@ -289,45 +271,3 @@ impl Server { Ok(()) } } - -fn http_endpoint_from_raft_addr(raft_addr: &str, http_port: u16) -> Option { - if let Ok(addr) = raft_addr.parse::() { - return Some(format!("http://{}:{}", addr.ip(), http_port)); - } - let (host, _) = raft_addr.rsplit_once(':')?; - Some(format!("http://{}:{}", host, http_port)) -} - -fn grpc_endpoint_from_raft_addr(raft_addr: &str, api_port: u16) -> Option { - if let Ok(addr) = raft_addr.parse::() { - return Some(format!("http://{}:{}", addr.ip(), api_port)); - } - let (host, _) = raft_addr.rsplit_once(':')?; - Some(format!("http://{}:{}", host, api_port)) -} - -fn normalize_peer_url(raft_addr: &str) -> String { - if raft_addr.contains("://") { - raft_addr.to_string() - } else { - format!("http://{raft_addr}") - } -} - -fn configured_members(config: &ServerConfig) -> Vec { - let api_port = config.network.api_addr.port(); - config - .cluster - .initial_members - .iter() - .map(|member| Member { - id: member.id, - name: format!("node-{}", member.id), - peer_urls: vec![normalize_peer_url(&member.raft_addr)], - client_urls: grpc_endpoint_from_raft_addr(&member.raft_addr, api_port) - .into_iter() - .collect(), - is_learner: false, - }) - .collect() -} diff --git a/chainfire/crates/chainfire-storage/src/log_storage.rs b/chainfire/crates/chainfire-storage/src/log_storage.rs index 3773bfa..d02f419 100644 --- a/chainfire/crates/chainfire-storage/src/log_storage.rs +++ b/chainfire/crates/chainfire-storage/src/log_storage.rs @@ -44,8 +44,8 @@ pub enum EntryPayload { Blank, /// A normal data entry Normal(D), - /// Membership change entry - Membership(Vec), // Just 
node IDs for simplicity + /// Membership change entry encoded as a serialized membership payload. + Membership(Vec), } impl LogEntry { @@ -189,6 +189,35 @@ impl LogStorage { } } + /// Save the current serialized membership payload. + pub fn save_membership(&self, membership: &[u8]) -> Result<(), StorageError> { + let cf = self + .store + .cf_handle(cf::META) + .ok_or_else(|| StorageError::RocksDb("META cf not found".into()))?; + + self.store + .db() + .put_cf(&cf, crate::meta_keys::MEMBERSHIP, membership) + .map_err(|e| StorageError::RocksDb(e.to_string()))?; + + debug!(bytes = membership.len(), "Saved membership payload"); + Ok(()) + } + + /// Read the current serialized membership payload from storage. + pub fn read_membership(&self) -> Result>, StorageError> { + let cf = self + .store + .cf_handle(cf::META) + .ok_or_else(|| StorageError::RocksDb("META cf not found".into()))?; + + self.store + .db() + .get_cf(&cf, crate::meta_keys::MEMBERSHIP) + .map_err(|e| StorageError::RocksDb(e.to_string())) + } + /// Append log entries pub fn append(&self, entries: &[LogEntry]) -> Result<(), StorageError> { if entries.is_empty() { diff --git a/chainfire/proto/chainfire.proto b/chainfire/proto/chainfire.proto index 03b17f4..e7161e2 100644 --- a/chainfire/proto/chainfire.proto +++ b/chainfire/proto/chainfire.proto @@ -23,14 +23,23 @@ service Watch { rpc Watch(stream WatchRequest) returns (stream WatchResponse); } -// Cluster management service for fixed-membership clusters. +// Cluster management service for live membership changes. service Cluster { - // MemberList lists the members configured at cluster bootstrap time + // MemberAdd adds a member into the cluster. + rpc MemberAdd(MemberAddRequest) returns (MemberAddResponse); + + // MemberRemove removes an existing member from the cluster. + rpc MemberRemove(MemberRemoveRequest) returns (MemberRemoveResponse); + + // MemberList lists the current committed cluster membership rpc MemberList(MemberListRequest) returns (MemberListResponse); // Status gets the status of the cluster rpc Status(StatusRequest) returns (StatusResponse); + // LeaderTransfer requests a live leadership handoff to a specific voting member. + rpc LeaderTransfer(LeaderTransferRequest) returns (LeaderTransferResponse); + } // Lease service for TTL-based key expiration @@ -283,6 +292,38 @@ message Member { bool is_learner = 5; } +message MemberAddRequest { + // ID is the member ID to add or update + uint64 id = 1; + // name is the human-readable name + string name = 2; + // peer_urls are URLs for Raft communication + repeated string peer_urls = 3; + // client_urls are URLs for client communication + repeated string client_urls = 4; + // is_learner indicates if the member is a learner + bool is_learner = 5; +} + +message MemberAddResponse { + ResponseHeader header = 1; + // member is the member information for the added member + Member member = 2; + // members is the list of all members after adding + repeated Member members = 3; +} + +message MemberRemoveRequest { + // ID is the member ID to remove + uint64 id = 1; +} + +message MemberRemoveResponse { + ResponseHeader header = 1; + // members is the list of all members after removing + repeated Member members = 2; +} + message MemberListRequest {} message MemberListResponse { @@ -309,6 +350,17 @@ message StatusResponse { uint64 raft_applied_index = 7; } +message LeaderTransferRequest { + // target_id is the voting member that should become leader. 
+ uint64 target_id = 1; +} + +message LeaderTransferResponse { + ResponseHeader header = 1; + // leader is the member ID of the observed leader after transfer. + uint64 leader = 2; +} + // ========== Lease ========== message LeaseGrantRequest { diff --git a/chainfire/proto/internal.proto b/chainfire/proto/internal.proto index 6656ab0..0cd958f 100644 --- a/chainfire/proto/internal.proto +++ b/chainfire/proto/internal.proto @@ -9,6 +9,9 @@ service RaftService { // AppendEntries sends log entries to followers rpc AppendEntries(AppendEntriesRequest) returns (AppendEntriesResponse); + + // TimeoutNow requests an immediate election on the target voting peer. + rpc TimeoutNow(TimeoutNowRequest) returns (TimeoutNowResponse); } message VoteRequest { @@ -47,6 +50,12 @@ message AppendEntriesRequest { uint64 leader_commit = 6; } +enum EntryType { + ENTRY_TYPE_BLANK = 0; + ENTRY_TYPE_NORMAL = 1; + ENTRY_TYPE_MEMBERSHIP = 2; +} + message LogEntry { // index is the log entry index uint64 index = 1; @@ -54,6 +63,8 @@ message LogEntry { uint64 term = 2; // data is the command data bytes data = 3; + // entry_type identifies how data should be decoded + EntryType entry_type = 4; } message AppendEntriesResponse { @@ -66,3 +77,12 @@ message AppendEntriesResponse { // conflict_term is the term of the conflicting entry uint64 conflict_term = 4; } + +message TimeoutNowRequest {} + +message TimeoutNowResponse { + // accepted is true if the target accepted the immediate-election request. + bool accepted = 1; + // term is the target node's current term after processing the request. + uint64 term = 2; +} diff --git a/deployer/Cargo.lock b/deployer/Cargo.lock index 181a381..83d6051 100644 --- a/deployer/Cargo.lock +++ b/deployer/Cargo.lock @@ -364,6 +364,17 @@ dependencies = [ "generic-array", ] +[[package]] +name = "bootstrap-agent" +version = "0.1.0" +dependencies = [ + "anyhow", + "clap", + "deployer-types", + "serde", + "serde_json", +] + [[package]] name = "bumpalo" version = "3.20.2" diff --git a/deployer/Cargo.toml b/deployer/Cargo.toml index 27069bf..d88010e 100644 --- a/deployer/Cargo.toml +++ b/deployer/Cargo.toml @@ -1,6 +1,7 @@ [workspace] resolver = "2" members = [ + "crates/bootstrap-agent", "crates/deployer-types", "crates/deployer-server", "crates/node-agent", diff --git a/deployer/crates/bootstrap-agent/Cargo.toml b/deployer/crates/bootstrap-agent/Cargo.toml new file mode 100644 index 0000000..4d1a9aa --- /dev/null +++ b/deployer/crates/bootstrap-agent/Cargo.toml @@ -0,0 +1,16 @@ +[package] +name = "bootstrap-agent" +version.workspace = true +edition.workspace = true +rust-version.workspace = true +authors.workspace = true +license.workspace = true +repository.workspace = true + +[dependencies] +anyhow.workspace = true +clap.workspace = true +serde.workspace = true +serde_json.workspace = true + +deployer-types.workspace = true diff --git a/deployer/crates/bootstrap-agent/src/main.rs b/deployer/crates/bootstrap-agent/src/main.rs new file mode 100644 index 0000000..23d1dc5 --- /dev/null +++ b/deployer/crates/bootstrap-agent/src/main.rs @@ -0,0 +1,203 @@ +use std::collections::HashMap; +use std::fmt::Write as _; +use std::fs; +use std::path::{Path, PathBuf}; + +use anyhow::{Context, Result}; +use clap::{Parser, Subcommand, ValueEnum}; +use deployer_types::{DiskSelectorSource, NodeConfig, ResolvedInstallPlan}; + +#[derive(Parser, Debug)] +#[command(author, version, about)] +struct Cli { + #[command(subcommand)] + command: Command, +} + +#[derive(Subcommand, Debug)] +enum Command { + 
ResolveInstallContext(ResolveInstallContextArgs), +} + +#[derive(Parser, Debug)] +struct ResolveInstallContextArgs { + #[arg(long, default_value = "/etc/ultracloud/node-config.json")] + node_config: PathBuf, + + #[arg(long, default_value = "/etc/ultracloud/disko-script-paths.json")] + disko_script_paths: PathBuf, + + #[arg(long, default_value = "/etc/ultracloud/system-paths.json")] + system_paths: PathBuf, + + #[arg(long, value_enum, default_value_t = OutputFormat::Json)] + format: OutputFormat, + + #[arg(long)] + write: Option, +} + +#[derive(Clone, Copy, Debug, Eq, PartialEq, ValueEnum)] +enum OutputFormat { + Json, + Env, +} + +fn main() -> Result<()> { + let cli = Cli::parse(); + match cli.command { + Command::ResolveInstallContext(args) => resolve_install_context(args), + } +} + +fn resolve_install_context(args: ResolveInstallContextArgs) -> Result<()> { + let node_config = read_json::(&args.node_config) + .with_context(|| format!("failed to read node config from {}", args.node_config.display()))?; + let disko_script_paths = read_optional_path_map(&args.disko_script_paths)?; + let system_paths = read_optional_path_map(&args.system_paths)?; + let resolved = node_config.resolve_install_plan( + disko_script_paths.as_ref(), + system_paths.as_ref(), + )?; + + let rendered = match args.format { + OutputFormat::Json => serde_json::to_string_pretty(&resolved)?, + OutputFormat::Env => render_env_file(&resolved), + }; + + if let Some(path) = args.write { + if let Some(parent) = path.parent() { + fs::create_dir_all(parent).with_context(|| { + format!("failed to create parent directory for {}", path.display()) + })?; + } + fs::write(&path, rendered) + .with_context(|| format!("failed to write {}", path.display()))?; + } else { + print!("{rendered}"); + if !rendered.ends_with('\n') { + println!(); + } + } + + Ok(()) +} + +fn read_json(path: &Path) -> Result +where + T: serde::de::DeserializeOwned, +{ + let raw = fs::read_to_string(path) + .with_context(|| format!("failed to read {}", path.display()))?; + serde_json::from_str(&raw) + .with_context(|| format!("failed to parse {}", path.display())) +} + +fn read_optional_path_map(path: &Path) -> Result>> { + if !path.exists() { + return Ok(None); + } + + let map = read_json::>(path)?; + let sanitized = map + .into_iter() + .filter_map(|(key, value)| { + let trimmed_key = key.trim(); + let trimmed_value = value.trim(); + if trimmed_key.is_empty() || trimmed_value.is_empty() { + None + } else { + Some((trimmed_key.to_string(), trimmed_value.to_string())) + } + }) + .collect::>(); + Ok(Some(sanitized)) +} + +fn render_env_file(resolved: &ResolvedInstallPlan) -> String { + let mut rendered = String::new(); + let node_marker_id = resolved.installer_node_name(); + let display_target_disk = resolved.display_target_disk().unwrap_or_default(); + let disk_selector_source = match resolved.disk_selector_source { + DiskSelectorSource::AutoDiscovery => "auto-discovery", + DiskSelectorSource::InstallPlanTargetDisk => "install_plan.target_disk", + DiskSelectorSource::InstallPlanTargetDiskById => "install_plan.target_disk_by_id", + }; + + for (key, value) in [ + ("NODE_ID", node_marker_id), + ("NODE_IP", resolved.ip.as_str()), + ("NIXOS_CONFIGURATION", resolved.nixos_configuration.as_str()), + ( + "INSTALL_PLAN_DISKO_CONFIG_PATH", + resolved.disko_config_path.as_deref().unwrap_or(""), + ), + ( + "DISKO_SCRIPT_PATH", + resolved.disko_script_path.as_deref().unwrap_or(""), + ), + ( + "TARGET_SYSTEM_PATH", + resolved.target_system_path.as_deref().unwrap_or(""), + ), + 
("TARGET_DISK", resolved.target_disk.as_deref().unwrap_or("")), + ( + "TARGET_DISK_BY_ID", + resolved.target_disk_by_id.as_deref().unwrap_or(""), + ), + ("DISPLAY_TARGET_DISK", display_target_disk), + ("DISK_SELECTOR_SOURCE", disk_selector_source), + ] { + writeln!( + rendered, + "{key}={}", + systemd_environment_quote(value) + ) + .expect("writing to String should never fail"); + } + + rendered +} + +fn systemd_environment_quote(value: &str) -> String { + let mut quoted = String::with_capacity(value.len() + 2); + quoted.push('"'); + for ch in value.chars() { + match ch { + '\\' => quoted.push_str("\\\\"), + '"' => quoted.push_str("\\\""), + '\n' => quoted.push_str("\\n"), + '\t' => quoted.push_str("\\t"), + _ => quoted.push(ch), + } + } + quoted.push('"'); + quoted +} + +#[cfg(test)] +mod tests { + use super::render_env_file; + use deployer_types::{DiskSelectorSource, ResolvedInstallPlan}; + + #[test] + fn env_rendering_quotes_values_for_environment_file_consumers() { + let rendered = render_env_file(&ResolvedInstallPlan { + node_id: "node-01".to_string(), + hostname: "node 01".to_string(), + ip: "10.0.0.10".to_string(), + nixos_configuration: "profile with spaces".to_string(), + disko_config_path: Some("profiles/worker/disko.nix".to_string()), + disko_script_path: Some("/nix/store/example script".to_string()), + target_system_path: Some("/nix/store/example-system".to_string()), + target_disk: Some("/dev/vda".to_string()), + target_disk_by_id: None, + disk_selector_source: DiskSelectorSource::InstallPlanTargetDisk, + }); + + assert!(rendered.contains("NODE_ID=\"node 01\"")); + assert!(rendered.contains("NIXOS_CONFIGURATION=\"profile with spaces\"")); + assert!(rendered.contains("DISK_SELECTOR_SOURCE=\"install_plan.target_disk\"")); + assert!(rendered.contains("DISPLAY_TARGET_DISK=\"/dev/vda\"")); + } +} diff --git a/deployer/crates/deployer-ctl/src/chainfire.rs b/deployer/crates/deployer-ctl/src/chainfire.rs index c3850a1..5327335 100644 --- a/deployer/crates/deployer-ctl/src/chainfire.rs +++ b/deployer/crates/deployer-ctl/src/chainfire.rs @@ -1550,6 +1550,8 @@ mod tests { install_plan: Some(InstallPlan { nixos_configuration: Some("worker-golden".to_string()), disko_config_path: Some("profiles/worker-linux/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, target_disk: Some("/dev/disk/by-id/worker-golden".to_string()), target_disk_by_id: None, }), diff --git a/deployer/crates/deployer-server/src/cloud_init.rs b/deployer/crates/deployer-server/src/cloud_init.rs index 04587a9..7921923 100644 --- a/deployer/crates/deployer-server/src/cloud_init.rs +++ b/deployer/crates/deployer-server/src/cloud_init.rs @@ -139,6 +139,8 @@ mod tests { install_plan: Some(InstallPlan { nixos_configuration: Some("worker-golden".to_string()), disko_config_path: Some("profiles/worker/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, target_disk: Some("/dev/vda".to_string()), target_disk_by_id: None, }), diff --git a/deployer/crates/deployer-server/src/phone_home.rs b/deployer/crates/deployer-server/src/phone_home.rs index 9a83e78..5122f4a 100644 --- a/deployer/crates/deployer-server/src/phone_home.rs +++ b/deployer/crates/deployer-server/src/phone_home.rs @@ -1064,6 +1064,8 @@ mod tests { install_plan: Some(InstallPlan { nixos_configuration: Some("gpu-worker".to_string()), disko_config_path: Some("profiles/gpu-worker/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, target_disk: 
Some("/dev/disk/by-id/nvme-gpu-worker".to_string()), target_disk_by_id: None, }), diff --git a/deployer/crates/deployer-types/src/lib.rs b/deployer/crates/deployer-types/src/lib.rs index 5404c37..c0c2ad5 100644 --- a/deployer/crates/deployer-types/src/lib.rs +++ b/deployer/crates/deployer-types/src/lib.rs @@ -111,6 +111,12 @@ pub struct InstallPlan { /// Repository-relative Disko file used during installation. #[serde(default, skip_serializing_if = "Option::is_none")] pub disko_config_path: Option, + /// Pre-built Disko formatter closure used instead of evaluating a flake on the ISO. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub disko_script_path: Option, + /// Pre-built NixOS system closure installed directly by `nixos-install --system`. + #[serde(default, skip_serializing_if = "Option::is_none")] + pub target_system_path: Option, /// Explicit disk device path used by bootstrap installers. #[serde(default, skip_serializing_if = "Option::is_none")] pub target_disk: Option, @@ -128,6 +134,12 @@ impl InstallPlan { if self.disko_config_path.is_some() { merged.disko_config_path = self.disko_config_path.clone(); } + if self.disko_script_path.is_some() { + merged.disko_script_path = self.disko_script_path.clone(); + } + if self.target_system_path.is_some() { + merged.target_system_path = self.target_system_path.clone(); + } if self.target_disk.is_some() { merged.target_disk = self.target_disk.clone(); } @@ -149,6 +161,66 @@ impl InstallPlan { } } +#[derive(Debug, Clone, Copy, PartialEq, Eq, Serialize, Deserialize)] +#[serde(rename_all = "snake_case")] +pub enum DiskSelectorSource { + AutoDiscovery, + InstallPlanTargetDisk, + InstallPlanTargetDiskById, +} + +#[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)] +pub struct ResolvedInstallPlan { + pub node_id: String, + pub hostname: String, + pub ip: String, + pub nixos_configuration: String, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub disko_config_path: Option, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub disko_script_path: Option, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub target_system_path: Option, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub target_disk: Option, + #[serde(default, skip_serializing_if = "Option::is_none")] + pub target_disk_by_id: Option, + pub disk_selector_source: DiskSelectorSource, +} + +impl ResolvedInstallPlan { + pub fn installer_node_name(&self) -> &str { + &self.hostname + } + + pub fn display_target_disk(&self) -> Option<&str> { + self.target_disk_by_id + .as_deref() + .or(self.target_disk.as_deref()) + } +} + +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum InstallPlanResolveError { + MissingNodeId, + MissingNodeIp, +} + +impl std::fmt::Display for InstallPlanResolveError { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + match self { + InstallPlanResolveError::MissingNodeId => { + write!(f, "node_config assignment is missing node_id/hostname") + } + InstallPlanResolveError::MissingNodeIp => { + write!(f, "node_config assignment is missing ip") + } + } + } +} + +impl std::error::Error for InstallPlanResolveError {} + /// Stable node assignment returned by bootstrap enrollment. 
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Default)] pub struct NodeAssignment { @@ -212,6 +284,91 @@ impl NodeConfig { bootstrap_secrets, } } + + pub fn resolve_install_plan( + &self, + disko_script_paths: Option<&HashMap>, + system_paths: Option<&HashMap>, + ) -> Result { + let node_id = non_empty(self.assignment.node_id.as_str()) + .map(str::to_string) + .or_else(|| non_empty(self.assignment.hostname.as_str()).map(str::to_string)) + .ok_or(InstallPlanResolveError::MissingNodeId)?; + let hostname = non_empty(self.assignment.hostname.as_str()) + .unwrap_or(node_id.as_str()) + .to_string(); + let ip = non_empty(self.assignment.ip.as_str()) + .map(str::to_string) + .ok_or(InstallPlanResolveError::MissingNodeIp)?; + let install_plan = self.bootstrap_plan.install_plan.as_ref(); + let nixos_configuration = install_plan + .and_then(|plan| plan.nixos_configuration.as_deref()) + .and_then(non_empty) + .unwrap_or(hostname.as_str()) + .to_string(); + let disko_config_path = install_plan + .and_then(|plan| plan.disko_config_path.as_deref()) + .and_then(non_empty) + .map(str::to_string); + let disko_script_path = install_plan + .and_then(|plan| plan.disko_script_path.as_deref()) + .and_then(non_empty) + .map(str::to_string) + .or_else(|| lookup_path_map(disko_script_paths, &nixos_configuration)); + let target_system_path = install_plan + .and_then(|plan| plan.target_system_path.as_deref()) + .and_then(non_empty) + .map(str::to_string) + .or_else(|| lookup_path_map(system_paths, &nixos_configuration)); + let target_disk = install_plan + .and_then(|plan| plan.target_disk.as_deref()) + .and_then(non_empty) + .map(str::to_string); + let target_disk_by_id = install_plan + .and_then(|plan| plan.target_disk_by_id.as_deref()) + .and_then(non_empty) + .map(str::to_string); + let disk_selector_source = if target_disk_by_id.is_some() { + DiskSelectorSource::InstallPlanTargetDiskById + } else if target_disk.is_some() { + DiskSelectorSource::InstallPlanTargetDisk + } else { + DiskSelectorSource::AutoDiscovery + }; + + Ok(ResolvedInstallPlan { + node_id, + hostname, + ip, + nixos_configuration, + disko_config_path, + disko_script_path, + target_system_path, + target_disk, + target_disk_by_id, + disk_selector_source, + }) + } +} + +fn non_empty(value: &str) -> Option<&str> { + let trimmed = value.trim(); + if trimmed.is_empty() { + None + } else { + Some(trimmed) + } +} + +fn lookup_path_map( + paths: Option<&HashMap>, + nixos_configuration: &str, +) -> Option { + paths + .and_then(|entries| entries.get(nixos_configuration)) + .map(String::as_str) + .and_then(non_empty) + .map(str::to_string) } /// Basic inventory record for a physical disk observed during commissioning. 
@@ -1512,6 +1669,8 @@ mod tests { install_plan: Some(InstallPlan { nixos_configuration: Some("node01".to_string()), disko_config_path: Some("nix/nodes/vm-cluster/node01/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, target_disk: Some("/dev/vda".to_string()), target_disk_by_id: None, }), @@ -1572,6 +1731,8 @@ mod tests { install_plan: Some(InstallPlan { nixos_configuration: Some("worker-linux".to_string()), disko_config_path: Some("profiles/worker-linux/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, target_disk: None, target_disk_by_id: Some("/dev/disk/by-id/worker-default".to_string()), }), @@ -2064,22 +2225,176 @@ mod tests { let fallback = InstallPlan { nixos_configuration: Some("fallback".to_string()), disko_config_path: Some("fallback/disko.nix".to_string()), + disko_script_path: Some("/nix/store/fallback-disko".to_string()), + target_system_path: Some("/nix/store/fallback-system".to_string()), target_disk: Some("/dev/sda".to_string()), target_disk_by_id: None, }; let preferred = InstallPlan { nixos_configuration: None, disko_config_path: None, + disko_script_path: None, + target_system_path: None, target_disk: None, target_disk_by_id: Some("/dev/disk/by-id/nvme-example".to_string()), }; let merged = preferred.merged_with(Some(&fallback)); assert_eq!(merged.nixos_configuration.as_deref(), Some("fallback")); + assert_eq!( + merged.disko_script_path.as_deref(), + Some("/nix/store/fallback-disko") + ); + assert_eq!( + merged.target_system_path.as_deref(), + Some("/nix/store/fallback-system") + ); assert_eq!(merged.target_disk.as_deref(), Some("/dev/sda")); assert_eq!( merged.target_disk_by_id.as_deref(), Some("/dev/disk/by-id/nvme-example") ); } + + #[test] + fn test_node_config_resolves_install_plan_from_profile_maps() { + let config = NodeConfig::from_parts( + NodeAssignment { + node_id: "node01".to_string(), + hostname: "node01.example".to_string(), + role: "worker".to_string(), + ip: "10.0.0.10".to_string(), + labels: HashMap::new(), + pool: None, + node_class: None, + failure_domain: None, + }, + BootstrapPlan { + services: vec![], + nix_profile: None, + install_plan: Some(InstallPlan { + nixos_configuration: Some("worker-profile".to_string()), + disko_config_path: Some("profiles/worker/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, + target_disk: None, + target_disk_by_id: Some("/dev/disk/by-id/worker-root".to_string()), + }), + }, + BootstrapSecrets::default(), + ); + + let resolved = config + .resolve_install_plan( + Some(&HashMap::from([( + "worker-profile".to_string(), + "/nix/store/worker-disko".to_string(), + )])), + Some(&HashMap::from([( + "worker-profile".to_string(), + "/nix/store/worker-system".to_string(), + )])), + ) + .expect("install contract should resolve"); + + assert_eq!(resolved.node_id, "node01"); + assert_eq!(resolved.hostname, "node01.example"); + assert_eq!(resolved.nixos_configuration, "worker-profile"); + assert_eq!( + resolved.disko_script_path.as_deref(), + Some("/nix/store/worker-disko") + ); + assert_eq!( + resolved.target_system_path.as_deref(), + Some("/nix/store/worker-system") + ); + assert_eq!( + resolved.target_disk_by_id.as_deref(), + Some("/dev/disk/by-id/worker-root") + ); + assert_eq!( + resolved.disk_selector_source, + DiskSelectorSource::InstallPlanTargetDiskById + ); + } + + #[test] + fn test_node_config_prefers_direct_install_artifacts() { + let config = NodeConfig::from_parts( + NodeAssignment { + node_id: "node02".to_string(), + hostname: 
"node02".to_string(), + role: "control-plane".to_string(), + ip: "10.0.0.11".to_string(), + labels: HashMap::new(), + pool: None, + node_class: None, + failure_domain: None, + }, + BootstrapPlan { + services: vec![], + nix_profile: None, + install_plan: Some(InstallPlan { + nixos_configuration: None, + disko_config_path: None, + disko_script_path: Some("/nix/store/direct-disko".to_string()), + target_system_path: Some("/nix/store/direct-system".to_string()), + target_disk: Some("/dev/vda".to_string()), + target_disk_by_id: None, + }), + }, + BootstrapSecrets::default(), + ); + + let resolved = config + .resolve_install_plan( + Some(&HashMap::from([( + "node02".to_string(), + "/nix/store/fallback-disko".to_string(), + )])), + Some(&HashMap::from([( + "node02".to_string(), + "/nix/store/fallback-system".to_string(), + )])), + ) + .expect("install contract should resolve"); + + assert_eq!(resolved.nixos_configuration, "node02"); + assert_eq!( + resolved.disko_script_path.as_deref(), + Some("/nix/store/direct-disko") + ); + assert_eq!( + resolved.target_system_path.as_deref(), + Some("/nix/store/direct-system") + ); + assert_eq!(resolved.target_disk.as_deref(), Some("/dev/vda")); + assert_eq!( + resolved.disk_selector_source, + DiskSelectorSource::InstallPlanTargetDisk + ); + } + + #[test] + fn test_node_config_resolve_install_plan_requires_ip() { + let config = NodeConfig::from_parts( + NodeAssignment { + node_id: "node03".to_string(), + hostname: "node03".to_string(), + role: "worker".to_string(), + ip: "".to_string(), + labels: HashMap::new(), + pool: None, + node_class: None, + failure_domain: None, + }, + BootstrapPlan::default(), + BootstrapSecrets::default(), + ); + + let error = config + .resolve_install_plan(None, None) + .expect_err("missing ip should fail resolution"); + assert_eq!(error, InstallPlanResolveError::MissingNodeIp); + } } diff --git a/deployer/crates/nix-agent/src/main.rs b/deployer/crates/nix-agent/src/main.rs index 5b122f8..2b94dc7 100644 --- a/deployer/crates/nix-agent/src/main.rs +++ b/deployer/crates/nix-agent/src/main.rs @@ -776,6 +776,8 @@ mod tests { install_plan: Some(InstallPlan { nixos_configuration: Some("node01".to_string()), disko_config_path: Some("nix/nodes/vm-cluster/node01/disko.nix".to_string()), + disko_script_path: None, + target_system_path: None, target_disk: Some("/dev/vda".to_string()), target_disk_by_id: None, }), diff --git a/docs/README.md b/docs/README.md index 22967f3..235cc3c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -44,7 +44,7 @@ This directory is the public documentation entrypoint for UltraCloud. ## Core API Notes -- `chainfire` supports fixed-membership cluster introspection on the public surface: `MemberList`, `Status`, and the internal `Vote` plus `AppendEntries` Raft transport. `chainfire-core` remains a workspace-internal compatibility crate rather than a supported embeddable API. +- `chainfire` supports live cluster membership management on the public surface: `MemberAdd`, `MemberRemove`, `MemberList`, `Status`, `LeaderTransfer`, and the internal `Vote`, `AppendEntries`, plus `TimeoutNow` Raft transport. The supported operator flow now includes learner add or promote, live leader transfer, temporary-voter restart and rejoin, and current-leader removal followed by election on the remaining voters. The supported reconfiguration boundary is sequential one-voter transitions until joint consensus exists. `chainfire-core` remains a workspace-internal compatibility crate rather than a supported embeddable API. 
- `flaredb` supports SQL over both gRPC and REST. The public REST endpoints are `POST /api/v1/sql` and `GET /api/v1/tables`. - `lightningstor` keeps bucket versioning, bucket policy, bucket tagging, and explicit object version listing on the supported optional surface. - `k8shost` keeps `WatchPods` on the supported surface as a bounded snapshot stream of the current matching pods. diff --git a/docs/control-plane-ops.md b/docs/control-plane-ops.md index 85cd673..6e1c805 100644 --- a/docs/control-plane-ops.md +++ b/docs/control-plane-ops.md @@ -4,24 +4,28 @@ This document fixes the supported operator lifecycle for the core control-plane ## ChainFire Membership And Node Replacement -ChainFire dynamic membership, replace-node, and scale-out are unsupported on the supported surface. +ChainFire supports live membership add, remove, endpoint replacement, and live leader transfer on the supported surface. +The supported reconfiguration boundary is sequential one-voter transitions; arbitrary multi-voter swaps still require future joint-consensus work. -The supported public surface is the fixed-membership cluster API already documented in `chainfire-api`: `MemberList` and `Status` report the membership that the node booted with, and operators should treat that membership as immutable for a release branch. +The supported public surface is the replicated cluster API documented in `chainfire-api`: `MemberAdd`, `MemberRemove`, `MemberList`, `Status`, and `LeaderTransfer` operate on the current committed membership rather than only the bootstrap shape. Supported operator actions today: -1. Keep the canonical control plane at the documented fixed membership for the branch. -2. Use the canonical `durability-proof` backup/restore lane before disruptive maintenance. -3. Use `nix run ./nix/test-cluster#cluster -- rollout-soak` when you need a longer-running fixed-membership restart proof after maintenance or rollout work. -4. Recover failed nodes by restoring the same fixed-membership cluster shape or by rebuilding the whole cluster with a freshly published static membership and then restoring data. +1. Scale out by adding a learner or voter with `MemberAdd`. +2. Promote a learner to voter by re-adding the same member ID with `is_learner=false`. +3. Replace a learner, follower, voter, or current-leader endpoint in place by re-adding the same member ID with updated peer or client URLs. +4. Hand leadership to another live voting member with `LeaderTransfer` before planned maintenance on the current leader, so the cluster hands leadership off cleanly instead of absorbing an unplanned election. +5. Scale in or retire a learner, follower, voter, or current leader with `MemberRemove`; when the current leader is removed, the remaining voters elect the replacement leader. +6. Use the canonical `durability-proof` backup/restore lane before disruptive maintenance or before a membership change you cannot quickly roll back. +7. Use `nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof` when you need the dedicated KVM proof for scale-out, learner promotion, leader transfer, temporary-voter restart, current-leader removal, re-add, and scale-in on the canonical control-plane shape. +8. Use `nix run ./nix/test-cluster#cluster -- rollout-soak` when you need the longer-running restart and degraded-service proof for the canonical control-plane shape after maintenance or rollout work. Unsupported operator actions today: -1. Live `replace-node` through a public ChainFire API. -2. Live `scale-out` by adding new voters on the supported surface. -3.
Relying on internal membership helpers as a published product contract. +1. Treating internal Raft helpers outside `chainfire-api` and `chainfire-server` as the supported operator contract. +2. Treating larger-cluster, hardware, or arbitrary-topology live reconfiguration beyond the canonical KVM proof lane as release-proven. The current proof is fixed to the canonical 3-node control plane plus one temporary `node04` replica. -The focused boundary proof is `./nix/test-cluster/run-core-control-plane-ops-proof.sh`, which records the fixed-membership source marker from `chainfire-api` and the public docs markers under `./work/core-control-plane-ops-proof`. The live-operations companion is `nix run ./nix/test-cluster#cluster -- rollout-soak`, which on 2026-04-10 recorded `chainfire-post-restart-put.json`, `chainfire-post-restart.json`, and `post-control-plane-restarts.json` under `./work/rollout-soak/20260410T164549+0900` after repeated maintenance and worker power-loss, without promoting dynamic membership to supported scope. +The focused boundary proof is `./nix/test-cluster/run-core-control-plane-ops-proof.sh`, which records the published ChainFire API surface and the public docs markers under `./work/core-control-plane-ops-proof`. The dedicated live-membership KVM proof is `nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof`, which records learner add, voter promotion, live leader transfer, temporary-voter restart, current-leader removal, removed-leader re-add, and final scale-in artifacts under `./work/chainfire-live-membership-proof`. The live-operations restart companion remains `nix run ./nix/test-cluster#cluster -- rollout-soak`, which on 2026-04-10 recorded `chainfire-post-restart-put.json`, `chainfire-post-restart.json`, and `post-control-plane-restarts.json` under `./work/rollout-soak/20260410T164549+0900` after repeated maintenance and worker power-loss for the canonical 3-node control-plane shape. ## FlareDB Online Migration And Schema Evolution diff --git a/docs/rollout-bundle.md b/docs/rollout-bundle.md index 078a1b5..3065393 100644 --- a/docs/rollout-bundle.md +++ b/docs/rollout-bundle.md @@ -27,7 +27,7 @@ The supported layering is still `deployer -> nix-agent` for host OS rollout and - `nix run ./nix/test-cluster#cluster -- rollout-soak` - `nix run ./nix/test-cluster#cluster -- durability-proof` -`deployer-vm-rollback` is the smallest reproducible proof for the `nix-agent` health-check and rollback path. `fresh-smoke` and `fleet-scheduler-e2e` keep the short regression semantics green. `rollout-soak` is the longer-running KVM operator lane for one planned drain cycle, one fail-stop worker-loss cycle, and service-restart behavior across `deployer`, `fleet-scheduler`, `node-agent`, and the fixed-membership control plane. It writes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the release boundary is captured in the proof root instead of being implied only by docs. The steady-state `nix/test-cluster` nodes record explicit `nix-agent` scope markers instead of pretending they run `nix-agent.service`. `durability-proof` remains the canonical persisted artifact lane for `deployer` backup, restart, replay, and storage-side failure injection. +`deployer-vm-rollback` is the smallest reproducible proof for the `nix-agent` health-check and rollback path. `fresh-smoke` and `fleet-scheduler-e2e` keep the short regression semantics green. 
`rollout-soak` is the longer-running KVM operator lane for one planned drain cycle, one fail-stop worker-loss cycle, and service-restart behavior across `deployer`, `fleet-scheduler`, `node-agent`, and the canonical 3-node control plane. It writes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the release boundary is captured in the proof root instead of being implied only by docs. The steady-state `nix/test-cluster` nodes record explicit `nix-agent` scope markers instead of pretending they run `nix-agent.service`. `durability-proof` remains the canonical persisted artifact lane for `deployer` backup, restart, replay, and storage-side failure injection. ## Deployer HA And DR diff --git a/docs/testing.md b/docs/testing.md index a6a1c03..da43c20 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -145,6 +145,7 @@ nix run ./nix/test-cluster#cluster -- baremetal-iso nix run ./nix/test-cluster#cluster -- fresh-smoke nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp nix run ./nix/test-cluster#cluster -- fresh-matrix +nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof nix run ./nix/test-cluster#cluster -- rollout-soak ./nix/test-cluster/run-publishable-kvm-suite.sh ./work/publishable-kvm-suite @@ -163,10 +164,11 @@ Use these commands as the release-facing local proof set: - `fresh-smoke` also proves the supported PlasmaVMC backend contract by requiring both worker registrations to advertise `HYPERVISOR_TYPE_KVM` and nothing broader on the public surface - `fresh-demo-vm-webapp`: optional VM-hosting bundle proof for `plasmavmc + prismnet` with state persisted through `lightningstor` - `fresh-matrix`: optional composition proof for provider bundles such as `prismnet + flashdns + fiberlb` and `plasmavmc + coronafs + lightningstor`, including PrismNet security-group ACL add/remove, FiberLB TCP plus TLS-terminated `Https` / `TerminatedHttps` listeners, LightningStor bucket metadata plus object-version APIs, the published `k8shost` pod-watch surface, and the KVM-only PlasmaVMC worker contract +- `chainfire-live-membership-proof`: focused local-KVM ChainFire lane that starts from the canonical 3-node control plane, adds a temporary learner on `node04`, promotes it to voter, transfers leadership to another live voter, restarts the temporary voter, removes the current leader, re-adds the removed leader, and scales back into the canonical 3-node shape while proving local serializable reads through each transition - `provider-vm-reality-proof`: focused local-KVM provider and VM-hosting lane that writes dated artifacts under `./work/provider-vm-reality-proof/latest`, captures authoritative FlashDNS answers, FiberLB backend drain and re-convergence, and PlasmaVMC KVM shared-storage migration plus post-migration restart state - `rollout-soak`: focused longer-run control-plane and rollout lane that rebuilds from clean local runtime state, writes dated artifacts under `./work/rollout-soak/latest`, repeats `draining` maintenance and worker power-loss, then restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb` while recording explicit `nix-agent` scope markers for the steady-state KVM nodes - `durability-proof`: canonical chainfire flaredb deployer backup/restore lane. 
It stores artifacts under `./work/durability-proof/latest`, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a `deployer.service` restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures on the live KVM cluster -- `run-publishable-kvm-suite.sh`: reproducible wrapper that captures the KVM environment, requires real `/dev/kvm` access, keeps runtime state under `./work` by default, and runs the full publishable nested-KVM trio in a single command +- `run-publishable-kvm-suite.sh`: reproducible wrapper that captures the KVM environment, requires real `/dev/kvm` access, keeps runtime state under `./work` by default, and runs the publishable nested-KVM application lanes plus the focused ChainFire live-membership proof in a single command - `run-supported-surface-final-proof.sh`: one-shot local wrapper that keeps builders local, records environment metadata, builds `single-node-trial-vm`, runs `supported-surface-guard`, `single-node-quickstart`, and then the publishable nested-KVM suite into one dated log root - `baremetal-iso-e2e`: materialized exact proof runner for the same canonical ISO harness; the build output keeps the attr stable, and `./result/bin/baremetal-iso-e2e` runs the real host-KVM proof with persisted log/meta - `deployer-vm-smoke`: lightweight regression proving that `nix-agent` can activate a host-built target closure without guest-side compilation @@ -186,8 +188,9 @@ The 2026-04-10 exact bare-metal check-runner proof is recorded under `./work/bar - `portable-control-plane-regressions` keeps the main non-KVM-safe boundaries under continuous coverage by composing `deployer-bootstrap-e2e`, `host-lifecycle-e2e`, `deployer-vm-smoke`, and `fleet-scheduler-e2e` behind the canonical profile eval guard. - `fresh-smoke` and `fresh-matrix` are the canonical proof for `deployer -> fleet-scheduler -> node-agent`. They cover native service placement, heartbeats, failover, and runtime reconciliation. - `fresh-smoke` proves the supported `fleet-scheduler` maintenance semantics: short-lived `active -> draining -> active` transitions, fail-stop worker loss, and replica restoration after the node returns. +- `chainfire-live-membership-proof` is the canonical KVM proof for ChainFire live reconfiguration on the supported surface. It covers learner add, local replica catch-up, voter promotion, live leader transfer, temporary-voter restart and rejoin, current-leader removal, removed-leader re-add, and final scale-in on the canonical control-plane shape. - `rollout-soak` is the longer-running companion lane for the same bundle. It validates exactly one planned drain cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb`, and then revalidates the live cluster. It also writes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the supported release boundary is captured in the proof root. The steady-state KVM nodes do not ship `nix-agent.service`, so the lane records scope markers there and leaves executable `nix-agent` proof to `deployer-vm-rollback`, `baremetal-iso`, and `baremetal-iso-e2e`. 
-- Multi-hour maintenance windows, pinned singleton relocation rules, dynamic ChainFire membership changes, destructive FlareDB schema rewrites, fully automated online migration, and large-cluster drain storms remain outside the release-proven scope and are called out explicitly in [rollout-bundle.md](rollout-bundle.md) and [control-plane-ops.md](control-plane-ops.md). +- Multi-hour maintenance windows, arbitrary multi-voter ChainFire swaps that still need joint consensus, larger-cluster or hardware ChainFire live membership reconfiguration beyond the canonical KVM proof lane, destructive FlareDB schema rewrites, fully automated online migration, and large-cluster drain storms remain outside the release-proven scope and are called out explicitly in [rollout-bundle.md](rollout-bundle.md) and [control-plane-ops.md](control-plane-ops.md). - `fresh-smoke` also covers `k8shost` separately from `fleet-scheduler`: `k8shost` exposes tenant pod and service semantics, while `fleet-scheduler` handles bare-metal host services. `k8shost` is fixed as an API/control-plane product surface; runtime dataplane helpers stay archived non-product. - `fresh-matrix` keeps the shipped add-on surface honest: it exercises the supported `creditservice` quota, wallet, reservation, and API-gateway flows, the published `k8shost-server` API contract, the supported LightningStor bucket metadata plus object-version APIs, and the network-provider bundle contract for PrismNet ACL lifecycle plus FiberLB TCP and TLS-terminated listeners. - `provider-vm-reality-proof` is the artifact-producing companion lane for that same provider or VM-hosting bundle. It records PrismNet port and ACL state, authoritative FlashDNS answers, FiberLB listener drain or restore artifacts, and PlasmaVMC migration or storage-handoff state in one dated proof root. @@ -200,18 +203,18 @@ The 2026-04-10 exact bare-metal check-runner proof is recorded under `./work/bar - FiberLB HTTPS health checks currently do not verify backend TLS certificates. Supported scope is limited to TCP reachability plus HTTP status for the backend endpoint until CA-aware verification is wired through config, server code, and the canonical harness. - `durability-proof` is the canonical backup, restore, and failure-injection companion lane for the publishable KVM suite. Use it after `fresh-matrix` when you need persisted artifacts for `chainfire`, `flaredb`, `deployer`, `coronafs`, and `lightningstor`. - `rollout-soak` is the longer-running maintenance and DR companion lane for the same control-plane and rollout bundle. Use it when a change is supposed to survive the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and service-restart churn on the live KVM lab instead of only the short `fresh-smoke` window. -- `run-core-control-plane-ops-proof.sh` is the focused operator lifecycle proof for the core control plane. It records the fixed-membership ChainFire boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under `./work/core-control-plane-ops-proof`. +- `run-core-control-plane-ops-proof.sh` is the focused operator lifecycle proof for the core control plane. It records the published ChainFire API boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under `./work/core-control-plane-ops-proof`. 
- The supported `deployer` HA and DR boundary is scope-fixed to one active writer plus optional cold-standby restore, not automatic multi-instance failover. The canonical runbook is to recover one writer, re-apply `ultracloud.cluster` generated state with `deployer-ctl apply`, replay preserved admin pre-register requests, and then verify state through the admin API or `deployer-ctl node inspect`; the unsupported multi-instance boundary is fixed in [rollout-bundle.md](rollout-bundle.md). - The supported `node-agent` product contract is also fixed in [rollout-bundle.md](rollout-bundle.md): per-instance logs and pid metadata live under `${stateDir}/pids`, secrets must already exist in the rendered spec or mounted host files, host-path volumes are passed through but not provisioned, and upgrades are replace-and-reconcile operations rather than in-place patching. - The dated 2026-04-10 proof root for that lane is `./work/durability-proof/20260410T120618+0900`; `result.json` records `success=true`, and the artifact set includes `deployer-post-restart-list.json`, `coronafs-node04-local-state.json`, and `lightningstor-head-during-node05-outage.json`. - `single-node-quickstart` intentionally excludes `deployer`, `nix-agent`, `node-agent`, and `fleet-scheduler`, so the smallest trial surface stays focused on the VM-platform core instead of mixing rollout and scheduling responsibilities. -The three `fresh-*` VM-cluster commands are the publishable nested-KVM suite. They require a Linux host with `/dev/kvm` and nested virtualization, and the harness stops at preflight by design when that device is absent. `single-node-quickstart` and `baremetal-iso` can still fall back to `TCG` for debugging, but the release-facing `baremetal-iso-e2e` runner now requires host KVM so the exact proof lane matches the shipped hardware proxy route. `deployer-vm-smoke` and `portable-control-plane-regressions` remain the supported non-KVM developer lanes. +The three `fresh-*` VM-cluster commands plus `chainfire-live-membership-proof` make up the publishable nested-KVM suite. They require a Linux host with `/dev/kvm` and nested virtualization, and the harness stops at preflight by design when that device is absent. `single-node-quickstart` and `baremetal-iso` can still fall back to `TCG` for debugging, but the release-facing `baremetal-iso-e2e` runner now requires host KVM so the exact proof lane matches the shipped hardware proxy route. `deployer-vm-smoke` and `portable-control-plane-regressions` remain the supported non-KVM developer lanes. 
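For a quick manual check of the surface the live-membership lane exercises, the committed membership and current leader can be read straight off any control-plane node. A minimal sketch, assuming a shell with reach into the KVM lab and the client ports the harness tunnels (gRPC on 2379, REST on 8081); the node address is the lab's `node01`:

```bash
# Read the committed membership and leader from one control-plane node, using the
# repo-relative ChainFire proto and the in-lab client ports the harness tunnels.
PROTO_DIR=chainfire/proto
NODE=10.100.0.11

grpcurl -plaintext -import-path "$PROTO_DIR" -proto "$PROTO_DIR/chainfire.proto" \
  "$NODE:2379" chainfire.v1.Cluster/MemberList
grpcurl -plaintext -import-path "$PROTO_DIR" -proto "$PROTO_DIR/chainfire.proto" \
  "$NODE:2379" chainfire.v1.Cluster/Status

# The REST view of the same cluster state, as captured in the lane's baseline artifacts.
curl -fsS "http://$NODE:8081/api/v1/cluster/status" | jq .
```

The proof lane records the same `MemberList`, `Status`, and REST cluster-status output as its baseline artifacts before any reconfiguration starts.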
Release-facing completion now requires both of these to be green on the same branch: - the canonical bare-metal proof: `nix run ./nix/test-cluster#cluster -- baremetal-iso` plus `nix build .#checks.x86_64-linux.baremetal-iso-e2e` and `./result/bin/baremetal-iso-e2e` -- the publishable nested-KVM suite: `fresh-smoke`, `fresh-demo-vm-webapp`, and `fresh-matrix`, preferably through `./nix/test-cluster/run-publishable-kvm-suite.sh` +- the publishable nested-KVM suite: `fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`, and `chainfire-live-membership-proof`, preferably through `./nix/test-cluster/run-publishable-kvm-suite.sh` Focused operator lifecycle proof for the core control plane: diff --git a/flake.nix b/flake.nix index acce039..e103add 100644 --- a/flake.nix +++ b/flake.nix @@ -861,6 +861,13 @@ description = "Node bootstrap and phone-home orchestration service"; }; + bootstrap-agent = buildRustWorkspace { + name = "bootstrap-agent"; + workspaceSubdir = "deployer"; + mainCrate = "bootstrap-agent"; + description = "Typed bootstrap helper for installer contract resolution"; + }; + deployer-ctl = buildRustWorkspace { name = "deployer-ctl"; workspaceSubdir = "deployer"; @@ -926,6 +933,7 @@ name = "deployer-workspace"; workspaceSubdir = "deployer"; crates = [ + "bootstrap-agent" "deployer-server" "deployer-ctl" "node-agent" @@ -2415,6 +2423,7 @@ EOF k8shost-server = self.packages.${final.system}.k8shost-server; deployer-workspace = self.packages.${final.system}.deployer-workspace; deployer-server = self.packages.${final.system}.deployer-server; + bootstrap-agent = self.packages.${final.system}.bootstrap-agent; deployer-ctl = self.packages.${final.system}.deployer-ctl; ultracloud-reconciler = self.packages.${final.system}.ultracloud-reconciler; ultracloudFlakeBundle = self.packages.${final.system}.ultracloudFlakeBundle; diff --git a/nix/ci/workspaces.json b/nix/ci/workspaces.json index ebf140e..8adc1a7 100644 --- a/nix/ci/workspaces.json +++ b/nix/ci/workspaces.json @@ -140,6 +140,7 @@ "deployer/**" ], "build_packages": [ + "bootstrap-agent", "deployer-server", "deployer-ctl", "node-agent", diff --git a/nix/iso/ultracloud-iso.nix b/nix/iso/ultracloud-iso.nix index 7d33d9e..6905c8a 100644 --- a/nix/iso/ultracloud-iso.nix +++ b/nix/iso/ultracloud-iso.nix @@ -310,9 +310,8 @@ ''; }; - # Auto-install service - partitions disk and runs nixos-install - systemd.services.ultracloud-install = { - description = "UltraCloud Auto-Install to Disk"; + systemd.services.ultracloud-install-contract = { + description = "UltraCloud Install Contract Resolution"; wantedBy = [ "multi-user.target" ]; after = [ "ultracloud-bootstrap.service" ]; requires = [ "ultracloud-bootstrap.service" ]; @@ -324,6 +323,33 @@ StandardError = "journal+console"; }; + script = '' + set -euo pipefail + + ${pkgs.bootstrap-agent}/bin/bootstrap-agent resolve-install-context \ + --node-config /etc/ultracloud/node-config.json \ + --disko-script-paths /etc/ultracloud/disko-script-paths.json \ + --system-paths /etc/ultracloud/system-paths.json \ + --format env \ + --write /run/ultracloud/install-contract.env + ''; + }; + + # Auto-install service - partitions disk and runs nixos-install + systemd.services.ultracloud-install = { + description = "UltraCloud Auto-Install to Disk"; + wantedBy = [ "multi-user.target" ]; + after = [ "ultracloud-install-contract.service" ]; + requires = [ "ultracloud-install-contract.service" ]; + + serviceConfig = { + Type = "oneshot"; + RemainAfterExit = true; + StandardOutput = "journal+console"; + StandardError 
= "journal+console"; + EnvironmentFile = "/run/ultracloud/install-contract.env"; + }; + script = '' set -euo pipefail export PATH="${pkgs.nix}/bin:${config.system.build.nixos-install}/bin:$PATH" @@ -376,41 +402,16 @@ return 1 } - if [ ! -s /etc/ultracloud/node-config.json ]; then - echo "ERROR: node-config.json missing (bootstrap not complete?)" - exit 1 - fi - - NODE_ID=$(${pkgs.jq}/bin/jq -r '.assignment.hostname // .assignment.node_id // empty' /etc/ultracloud/node-config.json) - NODE_IP=$(${pkgs.jq}/bin/jq -r '.assignment.ip // empty' /etc/ultracloud/node-config.json) - NIXOS_CONFIGURATION=$(${pkgs.jq}/bin/jq -r '.bootstrap_plan.install_plan.nixos_configuration // .assignment.hostname // empty' /etc/ultracloud/node-config.json) - INSTALL_PLAN_DISKO_CONFIG_PATH=$(${pkgs.jq}/bin/jq -r '.bootstrap_plan.install_plan.disko_config_path // empty' /etc/ultracloud/node-config.json) - DISKO_SCRIPT_PATH=$(${pkgs.jq}/bin/jq -r '.bootstrap_plan.install_plan.disko_script_path // empty' /etc/ultracloud/node-config.json) - if [ -z "$DISKO_SCRIPT_PATH" ] && [ -r /etc/ultracloud/disko-script-paths.json ]; then - DISKO_SCRIPT_PATH=$(${pkgs.jq}/bin/jq -r --arg cfg "$NIXOS_CONFIGURATION" '.[$cfg] // empty' /etc/ultracloud/disko-script-paths.json) - if [ -n "$DISKO_SCRIPT_PATH" ]; then - echo "Resolved pre-built Disko script for install profile $NIXOS_CONFIGURATION from the ISO profile map" - fi - fi - TARGET_SYSTEM_PATH=$(${pkgs.jq}/bin/jq -r '.bootstrap_plan.install_plan.target_system_path // empty' /etc/ultracloud/node-config.json) - if [ -z "$TARGET_SYSTEM_PATH" ] && [ -r /etc/ultracloud/system-paths.json ]; then - TARGET_SYSTEM_PATH=$(${pkgs.jq}/bin/jq -r --arg cfg "$NIXOS_CONFIGURATION" '.[$cfg] // empty' /etc/ultracloud/system-paths.json) - if [ -n "$TARGET_SYSTEM_PATH" ]; then - echo "Resolved pre-built target system for install profile $NIXOS_CONFIGURATION from the ISO profile map" - fi - fi - TARGET_DISK=$(${pkgs.jq}/bin/jq -r '.bootstrap_plan.install_plan.target_disk // empty' /etc/ultracloud/node-config.json) - TARGET_DISK_BY_ID=$(${pkgs.jq}/bin/jq -r '.bootstrap_plan.install_plan.target_disk_by_id // empty' /etc/ultracloud/node-config.json) DEPLOYER_URL="$(resolve_deployer_url)" SRC_ROOT="/opt/ultracloud-src" if [ -z "$NODE_ID" ] || [ -z "$NODE_IP" ]; then - echo "ERROR: node-config.json missing hostname/ip" + echo "ERROR: install contract missing hostname/ip" exit 1 fi if [ -z "$NIXOS_CONFIGURATION" ]; then - echo "ERROR: node-config.json missing install_plan.nixos_configuration" + echo "ERROR: install contract missing nixos configuration" exit 1 fi @@ -594,6 +595,7 @@ vim htop nix + bootstrap-agent gawk gnugrep util-linux diff --git a/nix/test-cluster/README.md b/nix/test-cluster/README.md index 8f244e9..902bf28 100644 --- a/nix/test-cluster/README.md +++ b/nix/test-cluster/README.md @@ -12,11 +12,12 @@ The hardware bridge now has its own canonical wrapper: `nix run ./nix/test-clust The harness keeps the install contract reusable by pushing install details into classes and pools. `verify-baremetal-iso.sh` now publishes node classes whose `install_plan` owns the install profile and stable disk selector, while node records carry only identity plus any desired-system override that is genuinely host-specific. In the canonical QEMU proof that means the node record carries the prebuilt `desired_system.target_system` plus the health check, and the class carries the install plan. 
The chassis emulates the preferred hardware-style disk selection by attaching explicit virtio serials and installing against `/dev/disk/by-id/virtio-uc-control-root` and `/dev/disk/by-id/virtio-uc-worker-root`. When `/dev/kvm` is absent, the portable fallback is not another harness subcommand. Use the root-flake non-KVM lane instead: `nix build .#checks.x86_64-linux.portable-control-plane-regressions`. -When `/dev/kvm` and nested virtualization are available, the reproducible publishable lane is `./nix/test-cluster/run-publishable-kvm-suite.sh`, which records environment metadata and then runs `fresh-smoke`, `fresh-demo-vm-webapp`, and `fresh-matrix` in order. +When `/dev/kvm` and nested virtualization are available, the reproducible publishable lane is `./nix/test-cluster/run-publishable-kvm-suite.sh`, which records environment metadata and then runs `fresh-smoke`, `fresh-demo-vm-webapp`, `fresh-matrix`, and `chainfire-live-membership-proof` in order. `nix run ./nix/test-cluster#cluster -- durability-proof` is the canonical chainfire flaredb deployer backup/restore lane. It persists artifacts under `./work/durability-proof/latest`, proves logical backup/restore for ChainFire keys and FlareDB SQL rows, uses the canonical Deployer admin pre-register request itself as the backup artifact, verifies that the pre-registered node survives a `deployer.service` restart, replays the same request idempotently, and injects CoronaFS plus LightningStor failures against the live KVM cluster. -`nix run ./nix/test-cluster#cluster -- rollout-soak` is the longer-running KVM companion lane for the rollout bundle and fixed-membership control plane. It rebuilds from clean local runtime state, writes dated artifacts under `./work/rollout-soak/latest`, validates exactly one planned `draining` maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, then restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb` before revalidating the live cluster. The same proof root includes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the supported release boundary is recorded with the runtime evidence. The steady-state KVM nodes do not run `nix-agent.service`, so the lane records `nix-agent` scope markers instead of pretending a live-cluster `nix-agent` restart happened. +`nix run ./nix/test-cluster#cluster -- rollout-soak` is the longer-running KVM companion lane for the rollout bundle and canonical control plane. It rebuilds from clean local runtime state, writes dated artifacts under `./work/rollout-soak/latest`, validates exactly one planned `draining` maintenance cycle and one fail-stop worker-loss cycle on the two native-runtime workers, holds each degraded state for 30 seconds, then restarts `deployer`, `fleet-scheduler`, `node-agent`, `chainfire`, and `flaredb` before revalidating the live cluster. The same proof root includes `scope-fixed-contract.json`, `deployer-scope-fixed.txt`, and `fleet-scheduler-scope-fixed.txt` so the supported release boundary is recorded with the runtime evidence. The steady-state KVM nodes do not run `nix-agent.service`, so the lane records `nix-agent` scope markers instead of pretending a live-cluster `nix-agent` restart happened. +`nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof` is the focused local-KVM live-reconfiguration lane for the ChainFire control plane. 
It rebuilds from clean local runtime state, launches a temporary ChainFire replica on `node04`, proves learner add plus local replication, voter promotion, live leader transfer to another voting member, temporary-voter restart and rejoin, current-leader removal followed by re-election, removed-leader re-add, and final scale-in back to the canonical 3-node control-plane shape, and stores artifacts under `./work/chainfire-live-membership-proof/latest`. `nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof` is the focused local-KVM reality lane for `prismnet`, `flashdns`, `fiberlb`, and `plasmavmc`. It writes authoritative DNS answers, FiberLB backend drain or restore artifacts, and PlasmaVMC migration or storage-handoff state under `./work/provider-vm-reality-proof/latest`. -`./nix/test-cluster/run-core-control-plane-ops-proof.sh` is the focused operator lifecycle proof for `chainfire`, `flaredb`, and `iam`. It records the ChainFire fixed-membership boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under `./work/core-control-plane-ops-proof`. +`./nix/test-cluster/run-core-control-plane-ops-proof.sh` is the focused operator lifecycle proof for `chainfire`, `flaredb`, and `iam`. It records the published ChainFire live-membership API boundary, the FlareDB additive-first migration and destructive-DDL boundary, and the standalone IAM bootstrap hardening plus signing-key, credential, and mTLS rotation proof under `./work/core-control-plane-ops-proof`. `./nix/test-cluster/work-root-budget.sh` is the checked helper for local disk budget reporting, stronger local enforcement, and safer cleanup guidance under `./work`. The dated 2026-04-10 artifact root for the focused control-plane proof is `./work/core-control-plane-ops-proof/20260410T172148+09:00`. Runner-specific workflow wiring from `task/f5c70db0-baseline-profiles` is intentionally excluded from this re-aggregated baseline; the checked-in artifact here is the local wrapper. 
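For operators who want to drive one reconfiguration step by hand rather than through the proof lane, the REST surface the lane uses is enough for a single sequential one-voter transition. A minimal sketch, assuming the lab addresses for `node04` and that member requests go to the current leader; the JSON field names in the member-add body are assumptions, and the `chainfire-live-membership-proof` helpers in `run-cluster.sh` remain the authoritative, tested sequence:

```bash
# One sequential one-voter transition against the current leader's REST port (8081).
# Field names in the member-add body are assumptions; peer/client URLs use the lab
# addresses the proof lane assigns to node04.
LEADER=http://10.100.0.11:8081   # point this at whichever node Status reports as leader

# 1. Add node04 as a learner so it catches up without affecting quorum.
curl -fsS -X POST -H 'content-type: application/json' \
  -d '{"id":4,"peer_urls":["http://10.100.0.21:2380"],"client_urls":["http://10.100.0.21:2379"],"is_learner":true}' \
  "$LEADER/api/v1/cluster/members"

# 2. Promote the learner by re-adding the same member ID with is_learner=false.
curl -fsS -X POST -H 'content-type: application/json' \
  -d '{"id":4,"peer_urls":["http://10.100.0.21:2380"],"client_urls":["http://10.100.0.21:2379"],"is_learner":false}' \
  "$LEADER/api/v1/cluster/members"

# 3. Optionally hand leadership to the new voter before maintenance on the old leader.
curl -fsS -X POST -H 'content-type: application/json' \
  -d '{"target_id":4}' "$LEADER/api/v1/cluster/leader/transfer"

# After a transfer, re-check Status and re-point LEADER before further membership calls.

# 4. Scale back in; removing the current leader makes the remaining voters elect a new one.
curl -fsS -X DELETE "$LEADER/api/v1/cluster/members/4"
```

Each call is a single-voter change; per the documented boundary, let one transition commit and show up in `MemberList` before starting the next.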
@@ -32,6 +33,7 @@ Runner-specific workflow wiring from `task/f5c70db0-baseline-profiles` is intent - gateway-node `apigateway`, `nightlight`, and `creditservice` quota, wallet, reservation, and admission flows - host-forwarded access to the API gateway and NightLight HTTP surfaces - cross-node data replication smoke tests for `chainfire` and `flaredb` +- live ChainFire scale-out, learner promotion, leader transfer, temporary-voter restart, current-leader removal, re-add, and scale-in on the canonical control-plane shape - deployer-seeded native runtime scheduling from declarative Nix service definitions, including drain/failover recovery - ISO-based bare-metal bootstrap from `nixosConfigurations.ultracloud-iso` through phone-home, flake bundle fetch, Disko install, reboot, and desired-system activation - durability and restore coverage for `chainfire`, `flaredb`, `deployer`, `coronafs`, and `lightningstor` @@ -78,6 +80,7 @@ nix run ./nix/test-cluster#cluster -- serve-vm-webapp nix run ./nix/test-cluster#cluster -- fresh-serve-vm-webapp nix run ./nix/test-cluster#cluster -- matrix nix run ./nix/test-cluster#cluster -- fresh-matrix +nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof nix run ./nix/test-cluster#cluster -- rollout-soak nix run ./nix/test-cluster#cluster -- durability-proof @@ -121,6 +124,8 @@ Preferred entrypoint for safer dated-proof cleanup dry-runs: `./nix/test-cluster Preferred entrypoint for publishable matrix verification: `nix run ./nix/test-cluster#cluster -- fresh-matrix` +Preferred entrypoint for focused ChainFire live membership verification: `nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof` + Preferred entrypoint for focused provider and VM-hosting reality verification: `nix run ./nix/test-cluster#cluster -- provider-vm-reality-proof` Preferred entrypoint for longer-running rollout maintenance and DR verification: `nix run ./nix/test-cluster#cluster -- rollout-soak` @@ -137,7 +142,7 @@ The supported operator contract for `deployer`, `fleet-scheduler`, `nix-agent`, - `deployer` is supported as one active writer with restart or cold-standby restore. Automatic ChainFire-backed multi-instance failover is outside the supported product contract for this release. - `nix-agent` health-check and rollback behavior is proven by `nix build .#checks.x86_64-linux.deployer-vm-rollback`, while `baremetal-iso` and `baremetal-iso-e2e` prove the same desired-system handoff with the installer in front. - `fresh-smoke` is the canonical KVM proof for `fleet-scheduler` drain, maintenance, and failover semantics. It drains `node04`, checks relocation to `node05`, restores `node04`, then stops `node05` and verifies failover plus replica restoration when the worker returns. -- `rollout-soak` is the longer-running companion for that same contract. It proves the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states on the two native-runtime workers, then restarts the rollout services and the fixed-membership control-plane services before rechecking the live runtime state. The dated 2026-04-10 release-grade artifact root is `./work/rollout-soak/20260410T164549+0900`. +- `rollout-soak` is the longer-running companion for that same contract. 
It proves the current release boundary of one planned drain cycle, one fail-stop worker-loss cycle, and 30-second held degraded states on the two native-runtime workers, then restarts the rollout services and the canonical control-plane services before rechecking the live runtime state. The dated 2026-04-10 release-grade artifact root is `./work/rollout-soak/20260410T164549+0900`. - `node-agent` product scope is host-local runtime reconcile only. Logs and pid metadata live under `${stateDir}/pids`, secrets must already exist in the rendered spec or mounted files, host-path volumes are pass-through only, and upgrades are replace-and-reconcile operations. `nix run ./nix/test-cluster#cluster -- bench-storage` benchmarks CoronaFS controller-export vs node-local-export I/O, worker-side materialization latency, and LightningStor large/small-object S3 throughput, then writes a report to `docs/storage-benchmarks.md`. diff --git a/nix/test-cluster/common.nix b/nix/test-cluster/common.nix index 384b556..12886bc 100644 --- a/nix/test-cluster/common.nix +++ b/nix/test-cluster/common.nix @@ -130,6 +130,7 @@ in environment.systemPackages = with pkgs; [ awscli2 + chainfire-server curl dnsutils ethtool diff --git a/nix/test-cluster/run-cluster.sh b/nix/test-cluster/run-cluster.sh index bd3eb3c..241247d 100755 --- a/nix/test-cluster/run-cluster.sh +++ b/nix/test-cluster/run-cluster.sh @@ -141,6 +141,8 @@ NIGHTLIGHT_QUERY_PROTO="${NIGHTLIGHT_PROTO_DIR}/query.proto" NIGHTLIGHT_ADMIN_PROTO="${NIGHTLIGHT_PROTO_DIR}/admin.proto" PLASMAVMC_PROTO_DIR="${REPO_ROOT}/plasmavmc/proto" PLASMAVMC_PROTO="${PLASMAVMC_PROTO_DIR}/plasmavmc.proto" +CHAINFIRE_PROTO_DIR="${REPO_ROOT}/chainfire/proto" +CHAINFIRE_PROTO="${CHAINFIRE_PROTO_DIR}/chainfire.proto" FLAREDB_PROTO_DIR="${REPO_ROOT}/flaredb/crates/flaredb-proto/src" FLAREDB_PROTO="${FLAREDB_PROTO_DIR}/kvrpc.proto" FLAREDB_SQL_PROTO="${FLAREDB_PROTO_DIR}/sqlrpc.proto" @@ -554,6 +556,20 @@ prepare_rollout_soak_dir() { printf '%s\n' "${proof_dir}" } +chainfire_live_membership_proof_root() { + printf '%s/%s\n' "${ULTRACLOUD_WORK_ROOT}" "chainfire-live-membership-proof" +} + +prepare_chainfire_live_membership_proof_dir() { + local proof_root proof_dir timestamp + proof_root="$(chainfire_live_membership_proof_root)" + timestamp="$(date '+%Y%m%dT%H%M%S%z')" + proof_dir="${proof_root}/${timestamp}" + mkdir -p "${proof_dir}" + ln -sfn "${proof_dir}" "${proof_root}/latest" + printf '%s\n' "${proof_dir}" +} + provider_vm_reality_proof_root() { printf '%s/%s\n' "${ULTRACLOUD_WORK_ROOT}" "provider-vm-reality-proof" } @@ -9943,7 +9959,7 @@ run_rollout_soak() { ssh_node node05 "journalctl -u node-agent -b --since '${started_at}' --no-pager" \ >"${proof_dir}/node05-node-agent-journal.log" - log "Rollout soak: restarting fixed-membership ChainFire and FlareDB members" + log "Rollout soak: restarting canonical ChainFire and FlareDB members" ssh_node node02 "systemctl restart chainfire.service" wait_for_unit node02 chainfire wait_for_http node02 "http://127.0.0.1:8081/health" @@ -9960,7 +9976,7 @@ run_rollout_soak() { >"${proof_dir}/chainfire-post-restart.json" jq -e --arg expected "${chainfire_value}" '.data.value == $expected' \ "${proof_dir}/chainfire-post-restart.json" >/dev/null \ - || die "ChainFire fixed-membership restart proof did not reproduce the expected value" + || die "ChainFire restart proof did not reproduce the expected value" ssh_node node02 "journalctl -u chainfire -b --since '${started_at}' --no-pager" \ >"${proof_dir}/chainfire-node02-journal.log" @@ -10010,13 
+10026,705 @@ run_rollout_soak() { --argjson validated_maintenance_cycles "${validated_maintenance_cycles}" \ --argjson validated_power_loss_cycles "${validated_power_loss_cycles}" \ --argjson soak_hold_secs "${soak_hold_secs}" \ - --arg summary "validated one planned drain cycle and one fail-stop worker-loss cycle on the two-node native-runtime lab, held each degraded state for the configured soak window, restarted deployer or scheduler or agent services, and revalidated fixed-membership control-plane restarts while keeping deployer HA scope-fixed to single-writer recovery" \ + --arg summary "validated one planned drain cycle and one fail-stop worker-loss cycle on the two-node native-runtime lab, held each degraded state for the configured soak window, restarted deployer or scheduler or agent services, and revalidated canonical control-plane restarts while keeping deployer HA scope-fixed to single-writer recovery" \ '{started_at:$started_at, finished_at:$finished_at, artifact_root:$artifact_root, deployer_supported_writer_mode:$deployer_supported_writer_mode, fleet_supported_native_runtime_nodes:$fleet_supported_native_runtime_nodes, validated_maintenance_cycles:$validated_maintenance_cycles, validated_power_loss_cycles:$validated_power_loss_cycles, soak_hold_secs:$soak_hold_secs, summary:$summary, success:true}' \ >"${proof_dir}/result.json" log "Long-run rollout soak succeeded; artifacts are in ${proof_dir}" } +run_chainfire_live_membership_proof() { + local proof_dir started_at finished_at + local chainfire_tunnel_node01="" chainfire_tunnel_node02="" chainfire_tunnel_node03="" chainfire_tunnel_node04="" + local chainfire_rest_tunnel_node01="" chainfire_rest_tunnel_node02="" chainfire_rest_tunnel_node03="" chainfire_rest_tunnel_node04="" + local learner_key learner_value promoted_key promoted_value transfer_key transfer_value restart_key restart_value removed_key removed_value final_key final_value + local leader_before_transfer_id="" transfer_target_id="" removed_leader_id="" new_leader_id="" + + proof_dir="$(prepare_chainfire_live_membership_proof_dir)" + started_at="$(date -Iseconds)" + + cleanup_chainfire_live_membership_proof() { + set +e + set +u + ssh_node_script node04 <<'EOS' >/dev/null 2>&1 || true +set +e +runtime_dir="/run/chainfire-live-membership-proof" +pid_path="${runtime_dir}/chainfire.pid" +if [[ -f "${pid_path}" ]]; then + kill "$(cat "${pid_path}")" >/dev/null 2>&1 || true + rm -f "${pid_path}" +fi +pkill -f '/run/current-system/sw/bin/chainfire --config /run/chainfire-live-membership-proof/config.toml' >/dev/null 2>&1 || true +EOS + stop_ssh_tunnel node04 "${chainfire_rest_tunnel_node04}" >/dev/null 2>&1 || true + stop_ssh_tunnel node03 "${chainfire_rest_tunnel_node03}" >/dev/null 2>&1 || true + stop_ssh_tunnel node02 "${chainfire_rest_tunnel_node02}" >/dev/null 2>&1 || true + stop_ssh_tunnel node01 "${chainfire_rest_tunnel_node01}" >/dev/null 2>&1 || true + stop_ssh_tunnel node04 "${chainfire_tunnel_node04}" >/dev/null 2>&1 || true + stop_ssh_tunnel node03 "${chainfire_tunnel_node03}" >/dev/null 2>&1 || true + stop_ssh_tunnel node02 "${chainfire_tunnel_node02}" >/dev/null 2>&1 || true + stop_ssh_tunnel node01 "${chainfire_tunnel_node01}" >/dev/null 2>&1 || true + } + + trap cleanup_chainfire_live_membership_proof RETURN + + jq -n \ + --arg command "nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof" \ + --arg proof_dir "${proof_dir}" \ + --arg started_at "${started_at}" \ + --arg ultracloud_work_root "${ULTRACLOUD_WORK_ROOT}" \ + --arg 
photon_cluster_work_root "${WORK_ROOT}" \ + --arg build_profile "${BUILD_PROFILE}" \ + '{command:$command, proof_dir:$proof_dir, started_at:$started_at, ultracloud_work_root:$ultracloud_work_root, photon_cluster_work_root:$photon_cluster_work_root, build_profile:$build_profile}' \ + >"${proof_dir}/meta.json" + + chainfire_cluster_rpc() { + local endpoint="$1" + local method="$2" + local payload="${3-}" + + if [[ -n "${payload}" ]]; then + grpcurl -plaintext \ + -import-path "${CHAINFIRE_PROTO_DIR}" \ + -proto "${CHAINFIRE_PROTO}" \ + -d "${payload}" \ + "${endpoint}" \ + "${method}" + return + fi + + grpcurl -plaintext \ + -import-path "${CHAINFIRE_PROTO_DIR}" \ + -proto "${CHAINFIRE_PROTO}" \ + "${endpoint}" \ + "${method}" + } + + chainfire_member_list_json() { + local endpoint="${1:-127.0.0.1:12379}" + chainfire_cluster_rpc "${endpoint}" "chainfire.v1.Cluster/MemberList" + } + + chainfire_status_json() { + local endpoint="${1:-127.0.0.1:12379}" + chainfire_cluster_rpc "${endpoint}" "chainfire.v1.Cluster/Status" + } + + chainfire_rest_url_for_id() { + case "$1" in + 1) printf '%s\n' "http://127.0.0.1:18081" ;; + 2) printf '%s\n' "http://127.0.0.1:18082" ;; + 3) printf '%s\n' "http://127.0.0.1:18083" ;; + 4) printf '%s\n' "http://127.0.0.1:18084" ;; + *) return 1 ;; + esac + } + + chainfire_grpc_endpoint_for_id() { + case "$1" in + 1) printf '%s\n' "127.0.0.1:12379" ;; + 2) printf '%s\n' "127.0.0.1:12380" ;; + 3) printf '%s\n' "127.0.0.1:12381" ;; + 4) printf '%s\n' "127.0.0.1:12382" ;; + *) return 1 ;; + esac + } + + chainfire_node_name_for_id() { + case "$1" in + 1) printf '%s\n' "node01" ;; + 2) printf '%s\n' "node02" ;; + 3) printf '%s\n' "node03" ;; + 4) printf '%s\n' "node04" ;; + *) return 1 ;; + esac + } + + chainfire_raft_addr_for_id() { + case "$1" in + 1) printf '%s\n' "10.100.0.11:2380" ;; + 2) printf '%s\n' "10.100.0.12:2380" ;; + 3) printf '%s\n' "10.100.0.13:2380" ;; + 4) printf '%s\n' "10.100.0.21:2380" ;; + *) return 1 ;; + esac + } + + chainfire_client_url_for_id() { + case "$1" in + 1) printf '%s\n' "http://10.100.0.11:2379" ;; + 2) printf '%s\n' "http://10.100.0.12:2379" ;; + 3) printf '%s\n' "http://10.100.0.13:2379" ;; + 4) printf '%s\n' "http://10.100.0.21:2379" ;; + *) return 1 ;; + esac + } + + chainfire_status_from_any_endpoint() { + local endpoint output leader + for endpoint in 127.0.0.1:12379 127.0.0.1:12380 127.0.0.1:12381 127.0.0.1:12382; do + output="$(chainfire_status_json "${endpoint}" 2>/dev/null || true)" + leader="$(printf '%s' "${output}" | jq -r '.leader // 0' 2>/dev/null || printf '0')" + if [[ -n "${output}" ]] && [[ "${leader}" =~ ^[0-9]+$ ]] && (( leader > 0 )); then + printf '%s' "${output}" + return 0 + fi + done + return 1 + } + + chainfire_current_leader_id() { + local output leader + output="$(chainfire_status_from_any_endpoint 2>/dev/null || true)" + leader="$(printf '%s' "${output}" | jq -r '.leader // 0' 2>/dev/null || printf '0')" + if [[ "${leader}" =~ ^[0-9]+$ ]] && (( leader > 0 )); then + printf '%s\n' "${leader}" + return 0 + fi + return 1 + } + + chainfire_wait_membership() { + local jq_expr="$1" + local timeout="${2:-180}" + local endpoint="${3:-127.0.0.1:12379}" + local deadline=$((SECONDS + timeout)) + local output="" + + while true; do + output="$(chainfire_member_list_json "${endpoint}" 2>/dev/null || true)" + if [[ -n "${output}" ]] && printf '%s' "${output}" | jq -e "${jq_expr}" >/dev/null 2>&1; then + printf '%s' "${output}" + return 0 + fi + if (( SECONDS >= deadline )); then + die "timed out waiting for ChainFire 
membership to satisfy ${jq_expr}" + fi + sleep 2 + done + } + + chainfire_wait_for_new_leader() { + local old_leader="$1" + local timeout="${2:-180}" + local deadline=$((SECONDS + timeout)) + local leader="" + + while true; do + leader="$(chainfire_current_leader_id 2>/dev/null || true)" + if [[ "${leader}" =~ ^[0-9]+$ ]] && (( leader > 0 )) && [[ "${leader}" != "${old_leader}" ]]; then + printf '%s' "${leader}" + return 0 + fi + if (( SECONDS >= deadline )); then + die "timed out waiting for ChainFire leader change away from ${old_leader}" + fi + sleep 2 + done + } + + chainfire_wait_for_specific_leader() { + local expected_leader="$1" + local timeout="${2:-180}" + local deadline=$((SECONDS + timeout)) + local leader="" + + while true; do + leader="$(chainfire_current_leader_id 2>/dev/null || true)" + if [[ "${leader}" =~ ^[0-9]+$ ]] && (( leader > 0 )) && [[ "${leader}" == "${expected_leader}" ]]; then + printf '%s' "${leader}" + return 0 + fi + if (( SECONDS >= deadline )); then + die "timed out waiting for ChainFire leader ${expected_leader}" + fi + sleep 2 + done + } + + chainfire_put_key() { + local key="$1" + local value="$2" + local output_path="$3" + local timeout="${4:-180}" + local deadline=$((SECONDS + timeout)) + local leader_id="" rest_url="" + + while true; do + leader_id="$(chainfire_current_leader_id 2>/dev/null || true)" + if [[ "${leader_id}" =~ ^[0-9]+$ ]] && (( leader_id > 0 )); then + rest_url="$(chainfire_rest_url_for_id "${leader_id}" 2>/dev/null || true)" + if [[ -n "${rest_url}" ]] && curl -fsS \ + -X PUT \ + -H 'content-type: application/json' \ + -d "$(jq -cn --arg value "${value}" '{value:$value}')" \ + "${rest_url}/api/v1/kv/${key}" \ + >"${output_path}" 2>/dev/null; then + return 0 + fi + fi + if (( SECONDS >= deadline )); then + die "timed out writing ChainFire key ${key} through the current leader" + fi + sleep 2 + done + } + + chainfire_serializable_get_status() { + local node_id="$1" + local key="$2" + local output_path="$3" + local rest_url + rest_url="$(chainfire_rest_url_for_id "${node_id}")" + curl -sS -o "${output_path}" -w '%{http_code}' \ + "${rest_url}/api/v1/kv/${key}?consistency=serializable" || true + } + + chainfire_wait_local_value() { + local node_id="$1" + local key="$2" + local expected_value="$3" + local output_path="$4" + local timeout="${5:-180}" + local deadline=$((SECONDS + timeout)) + local status="" + + while true; do + status="$(chainfire_serializable_get_status "${node_id}" "${key}" "${output_path}")" + if [[ "${status}" == "200" ]] && jq -e --arg expected "${expected_value}" '.data.value == $expected' "${output_path}" >/dev/null 2>&1; then + return 0 + fi + if (( SECONDS >= deadline )); then + die "timed out waiting for ChainFire node${node_id} to serve ${key} locally" + fi + sleep 2 + done + } + + chainfire_wait_local_absent() { + local node_id="$1" + local key="$2" + local output_path="$3" + local timeout="${4:-180}" + local deadline=$((SECONDS + timeout)) + local status="" + + while true; do + status="$(chainfire_serializable_get_status "${node_id}" "${key}" "${output_path}")" + if [[ "${status}" == "404" ]]; then + return 0 + fi + if (( SECONDS >= deadline )); then + die "timed out waiting for ChainFire node${node_id} to stop serving ${key} locally" + fi + sleep 2 + done + } + + chainfire_wait_member_visible_locally() { + local endpoint="$1" + local member_id="$2" + local expected_is_learner="$3" + local timeout="${4:-180}" + chainfire_wait_membership "any(.members[]; (.id | tostring) == \"${member_id}\" and ((.isLearner // 
.is_learner // false) == ${expected_is_learner}))" "${timeout}" "${endpoint}" >/dev/null + } + + chainfire_post_member_request() { + local request_json="$1" + local output_path="$2" + local timeout="${3:-180}" + local deadline=$((SECONDS + timeout)) + local status="" leader_id="" rest_url="" + + while true; do + leader_id="$(chainfire_current_leader_id 2>/dev/null || true)" + rest_url="$(chainfire_rest_url_for_id "${leader_id}" 2>/dev/null || true)" + status="$(curl -sS -o "${output_path}" -w '%{http_code}' \ + -X POST \ + -H 'content-type: application/json' \ + -d "${request_json}" \ + "${rest_url}/api/v1/cluster/members" || true)" + if [[ "${status}" == "201" ]]; then + return 0 + fi + if (( SECONDS >= deadline )); then + cat "${output_path}" >&2 || true + die "ChainFire member add request did not succeed (status ${status})" + fi + sleep 2 + done + } + + chainfire_delete_member_request() { + local member_id="$1" + local output_path="$2" + local timeout="${3:-180}" + local deadline=$((SECONDS + timeout)) + local status="" leader_id="" rest_url="" + + while true; do + leader_id="$(chainfire_current_leader_id 2>/dev/null || true)" + rest_url="$(chainfire_rest_url_for_id "${leader_id}" 2>/dev/null || true)" + status="$(curl -sS -o "${output_path}" -w '%{http_code}' \ + -X DELETE \ + "${rest_url}/api/v1/cluster/members/${member_id}" || true)" + if [[ "${status}" == "200" ]]; then + return 0 + fi + if (( SECONDS >= deadline )); then + cat "${output_path}" >&2 || true + die "ChainFire member remove request for ${member_id} did not succeed (status ${status})" + fi + sleep 2 + done + } + + chainfire_transfer_leader_request() { + local target_id="$1" + local output_path="$2" + local timeout="${3:-180}" + local deadline=$((SECONDS + timeout)) + local status="" leader_id="" rest_url="" + + while true; do + leader_id="$(chainfire_current_leader_id 2>/dev/null || true)" + rest_url="$(chainfire_rest_url_for_id "${leader_id}" 2>/dev/null || true)" + status="$(curl -sS -o "${output_path}" -w '%{http_code}' \ + -X POST \ + -H 'content-type: application/json' \ + -d "$(jq -cn --argjson target_id "${target_id}" '{target_id:$target_id}')" \ + "${rest_url}/api/v1/cluster/leader/transfer" || true)" + if [[ "${status}" == "200" ]]; then + return 0 + fi + if (( SECONDS >= deadline )); then + cat "${output_path}" >&2 || true + die "ChainFire leader transfer request to ${target_id} did not succeed (status ${status})" + fi + sleep 2 + done + } + + chainfire_wait_internal_http_from_node01() { + local timeout="${1:-120}" + local deadline=$((SECONDS + timeout)) + + while true; do + if ssh_node_script node01 <<'EOS' >/dev/null 2>&1 +set -euo pipefail +for ip in 10.100.0.11 10.100.0.12 10.100.0.13; do + curl -fsS "http://${ip}:8081/health" >/dev/null +done +EOS + then + return 0 + fi + if (( SECONDS >= deadline )); then + die "timed out waiting for internal ChainFire HTTP reachability from node01" + fi + sleep 2 + done + } + + chainfire_wait_internal_replication_from_node01() { + local timeout="${1:-120}" + local deadline=$((SECONDS + timeout)) + + while true; do + if ssh_node_script node01 <<'EOS' >/tmp/chainfire-internal-ready.out 2>/tmp/chainfire-internal-ready.err +set -euo pipefail +key="validation-chainfire-final-$(date +%s)-$RANDOM" +value="ok-$RANDOM" +nodes=(10.100.0.11 10.100.0.12 10.100.0.13) +leader="" +for ip in "${nodes[@]}"; do + code="$(curl -sS -o /tmp/chainfire-final-put.out -w '%{http_code}' \ + -X PUT "http://${ip}:8081/api/v1/kv/${key}" \ + -H 'Content-Type: application/json' \ + -d 
"{\"value\":\"${value}\"}" || true)" + if [[ "${code}" == "200" ]]; then + leader="${ip}" + break + fi +done +[[ -n "${leader}" ]] +for ip in "${nodes[@]}"; do + actual="$(curl -fsS "http://${ip}:8081/api/v1/kv/${key}" | jq -r '.data.value')" + [[ "${actual}" == "${value}" ]] +done +printf '{"key":"%s","value":"%s","leader":"%s"}\n' "${key}" "${value}" "${leader}" +EOS + then + cat /tmp/chainfire-internal-ready.out + return 0 + fi + if (( SECONDS >= deadline )); then + cat /tmp/chainfire-internal-ready.err >&2 || true + die "timed out waiting for internal ChainFire replication to stabilize from node01" + fi + sleep 2 + done + } + + restart_temporary_chainfire_node04() { + ssh_node_script node04 <<'EOS' +set -euo pipefail +runtime_dir="/run/chainfire-live-membership-proof" +pid_path="${runtime_dir}/chainfire.pid" +log_path="${runtime_dir}/chainfire.log" +config_path="${runtime_dir}/config.toml" +mkdir -p "${runtime_dir}" +if [[ ! -f "${config_path}" ]]; then + echo "temporary ChainFire config missing at ${config_path}" >&2 + exit 1 +fi +if [[ -f "${pid_path}" ]]; then + kill "$(cat "${pid_path}")" >/dev/null 2>&1 || true + rm -f "${pid_path}" +fi +pkill -f '/run/current-system/sw/bin/chainfire --config /run/chainfire-live-membership-proof/config.toml' >/dev/null 2>&1 || true +printf '\n[chainfire-live-membership-proof] restarting temporary voter at %s\n' "$(date -Is)" >>"${log_path}" +nohup /run/current-system/sw/bin/chainfire --config "${config_path}" --metrics-port 9194 >>"${log_path}" 2>&1 & +echo $! >"${pid_path}" +for _ in $(seq 1 60); do + if curl -fsS http://10.100.0.21:8081/health >/dev/null 2>&1; then + exit 0 + fi + sleep 1 +done +echo "restarted temporary ChainFire on node04 did not become healthy" >&2 +exit 1 +EOS + } + + log "Running ChainFire live membership proof; artifacts will be written to ${proof_dir}" + + validate_control_plane + validate_workers + + chainfire_tunnel_node01="$(start_ssh_tunnel node01 12379 2379 "${NODE_IPS[node01]}")" + chainfire_tunnel_node02="$(start_ssh_tunnel node02 12380 2379 "${NODE_IPS[node02]}")" + chainfire_tunnel_node03="$(start_ssh_tunnel node03 12381 2379 "${NODE_IPS[node03]}")" + chainfire_rest_tunnel_node01="$(start_ssh_tunnel node01 18081 8081 "${NODE_IPS[node01]}")" + chainfire_rest_tunnel_node02="$(start_ssh_tunnel node02 18082 8081 "${NODE_IPS[node02]}")" + chainfire_rest_tunnel_node03="$(start_ssh_tunnel node03 18083 8081 "${NODE_IPS[node03]}")" + + chainfire_member_list_json "127.0.0.1:12379" >"${proof_dir}/baseline-membership.json" + chainfire_status_json "127.0.0.1:12379" >"${proof_dir}/baseline-status.json" + curl -fsS http://127.0.0.1:18081/api/v1/cluster/status >"${proof_dir}/baseline-node01-rest-status.json" + curl -fsS http://127.0.0.1:18082/api/v1/cluster/status >"${proof_dir}/baseline-node02-rest-status.json" + curl -fsS http://127.0.0.1:18083/api/v1/cluster/status >"${proof_dir}/baseline-node03-rest-status.json" + jq -e '(.members | length) == 3 and all(.members[]; (.isLearner // .is_learner // false) == false)' "${proof_dir}/baseline-membership.json" >/dev/null \ + || die "expected baseline ChainFire membership to contain exactly three voters" + + log "ChainFire live membership proof: starting temporary learner on node04" + ssh_node_script node04 <<'EOS' +set -euo pipefail +runtime_dir="/run/chainfire-live-membership-proof" +data_dir="/var/lib/chainfire-live-membership-proof" +pid_path="${runtime_dir}/chainfire.pid" +log_path="${runtime_dir}/chainfire.log" +config_path="${runtime_dir}/config.toml" +mkdir -p "${runtime_dir}" 
+if [[ -f "${pid_path}" ]]; then + kill "$(cat "${pid_path}")" >/dev/null 2>&1 || true + rm -f "${pid_path}" +fi +pkill -f '/run/current-system/sw/bin/chainfire --config /run/chainfire-live-membership-proof/config.toml' >/dev/null 2>&1 || true +rm -rf "${data_dir}" +mkdir -p "${data_dir}" +cat >"${config_path}" <<'EOF' +[node] +id = 4 +name = "node04" +role = "control_plane" + +[storage] +data_dir = "/var/lib/chainfire-live-membership-proof" + +[network] +api_addr = "10.100.0.21:2379" +http_addr = "10.100.0.21:8081" +raft_addr = "10.100.0.21:2380" +gossip_addr = "10.100.0.21:2381" + +[cluster] +id = 1 +initial_members = [ + { id = 1, raft_addr = "10.100.0.11:2380" }, + { id = 2, raft_addr = "10.100.0.12:2380" }, + { id = 3, raft_addr = "10.100.0.13:2380" }, +] +bootstrap = false + +[raft] +role = "learner" +EOF +nohup /run/current-system/sw/bin/chainfire --config "${config_path}" --metrics-port 9194 >"${log_path}" 2>&1 & +echo $! >"${pid_path}" +for _ in $(seq 1 60); do + if curl -fsS http://10.100.0.21:8081/health >/dev/null 2>&1; then + exit 0 + fi + sleep 1 +done +echo "temporary ChainFire on node04 did not become healthy" >&2 +exit 1 +EOS + + chainfire_tunnel_node04="$(start_ssh_tunnel node04 12382 2379 "${NODE_IPS[node04]}")" + chainfire_rest_tunnel_node04="$(start_ssh_tunnel node04 18084 8081 "${NODE_IPS[node04]}")" + chainfire_status_json "127.0.0.1:12382" >"${proof_dir}/node04-temporary-status.json" + curl -fsS http://127.0.0.1:18084/api/v1/cluster/status >"${proof_dir}/node04-temporary-rest-status.json" + + log "ChainFire live membership proof: adding node04 as learner" + chainfire_post_member_request \ + "$(jq -cn --argjson node_id 4 --arg raft_addr "10.100.0.21:2380" --arg client_url "http://10.100.0.21:2379" --arg name "node04" '{node_id:$node_id, raft_addr:$raft_addr, client_url:$client_url, name:$name, is_learner:true}')" \ + "${proof_dir}/member-add-node04-learner.json" + chainfire_wait_membership '(.members | length) == 4 and any(.members[]; (.id | tostring) == "4" and ((.isLearner // .is_learner // false) == true))' 180 >"${proof_dir}/membership-after-node04-learner.json" + chainfire_wait_member_visible_locally "127.0.0.1:12382" "4" "true" 180 + + learner_key="chainfire-live-proof-learner-$(date +%s)-$RANDOM" + learner_value="learner-${RANDOM}" + chainfire_put_key "${learner_key}" "${learner_value}" "${proof_dir}/learner-put.json" + chainfire_wait_local_value 4 "${learner_key}" "${learner_value}" "${proof_dir}/learner-node04-local-read.json" 180 + + log "ChainFire live membership proof: promoting node04 to voter" + chainfire_post_member_request \ + "$(jq -cn --argjson node_id 4 --arg raft_addr "10.100.0.21:2380" --arg client_url "http://10.100.0.21:2379" --arg name "node04" '{node_id:$node_id, raft_addr:$raft_addr, client_url:$client_url, name:$name, is_learner:false}')" \ + "${proof_dir}/member-promote-node04.json" + chainfire_wait_membership '(.members | length) == 4 and any(.members[]; (.id | tostring) == "4" and ((.isLearner // .is_learner // false) == false))' 180 >"${proof_dir}/membership-after-node04-promotion.json" + chainfire_wait_member_visible_locally "127.0.0.1:12382" "4" "false" 180 + + promoted_key="chainfire-live-proof-voter-$(date +%s)-$RANDOM" + promoted_value="voter-${RANDOM}" + chainfire_put_key "${promoted_key}" "${promoted_value}" "${proof_dir}/promoted-put.json" + chainfire_wait_local_value 1 "${promoted_key}" "${promoted_value}" "${proof_dir}/promoted-node01-local-read.json" 180 + chainfire_wait_local_value 2 "${promoted_key}" "${promoted_value}" 
"${proof_dir}/promoted-node02-local-read.json" 180 + chainfire_wait_local_value 3 "${promoted_key}" "${promoted_value}" "${proof_dir}/promoted-node03-local-read.json" 180 + chainfire_wait_local_value 4 "${promoted_key}" "${promoted_value}" "${proof_dir}/promoted-node04-local-read.json" 180 + + chainfire_status_from_any_endpoint >"${proof_dir}/status-after-node04-promotion.json" + leader_before_transfer_id="$(jq -r '.leader' "${proof_dir}/status-after-node04-promotion.json")" + [[ "${leader_before_transfer_id}" =~ ^[0-9]+$ ]] && (( leader_before_transfer_id > 0 )) \ + || die "could not determine current ChainFire leader before live transfer" + printf '%s\n' "${leader_before_transfer_id}" >"${proof_dir}/leader-before-transfer.txt" + + if [[ "${leader_before_transfer_id}" != "2" ]]; then + transfer_target_id="2" + else + transfer_target_id="3" + fi + printf '%s\n' "${transfer_target_id}" >"${proof_dir}/leader-transfer-target.txt" + + log "ChainFire live membership proof: transferring leader ${leader_before_transfer_id} to ${transfer_target_id}" + chainfire_transfer_leader_request "${transfer_target_id}" "${proof_dir}/leader-transfer.json" + chainfire_wait_for_specific_leader "${transfer_target_id}" 180 >"${proof_dir}/leader-after-transfer.txt" + chainfire_status_from_any_endpoint >"${proof_dir}/status-after-leader-transfer.json" + jq -e --arg target "${transfer_target_id}" '(.leader | tostring) == $target' "${proof_dir}/status-after-leader-transfer.json" >/dev/null \ + || die "expected ChainFire leader transfer to settle on ${transfer_target_id}" + + transfer_key="chainfire-live-proof-transfer-$(date +%s)-$RANDOM" + transfer_value="transfer-${RANDOM}" + chainfire_put_key "${transfer_key}" "${transfer_value}" "${proof_dir}/transfer-put.json" + chainfire_wait_local_value 1 "${transfer_key}" "${transfer_value}" "${proof_dir}/transfer-node01-local-read.json" 180 + chainfire_wait_local_value 2 "${transfer_key}" "${transfer_value}" "${proof_dir}/transfer-node02-local-read.json" 180 + chainfire_wait_local_value 3 "${transfer_key}" "${transfer_value}" "${proof_dir}/transfer-node03-local-read.json" 180 + chainfire_wait_local_value 4 "${transfer_key}" "${transfer_value}" "${proof_dir}/transfer-node04-local-read.json" 180 + + log "ChainFire live membership proof: restarting temporary voter on node04" + restart_temporary_chainfire_node04 + chainfire_wait_membership '(.members | length) == 4 and any(.members[]; (.id | tostring) == "4" and ((.isLearner // .is_learner // false) == false))' 180 >"${proof_dir}/membership-after-node04-restart.json" + chainfire_wait_member_visible_locally "127.0.0.1:12382" "4" "false" 180 + chainfire_status_json "127.0.0.1:12382" >"${proof_dir}/node04-status-after-restart.json" + chainfire_status_from_any_endpoint >"${proof_dir}/status-after-node04-restart.json" + + restart_key="chainfire-live-proof-restart-$(date +%s)-$RANDOM" + restart_value="restart-${RANDOM}" + chainfire_put_key "${restart_key}" "${restart_value}" "${proof_dir}/restart-put.json" + chainfire_wait_local_value 1 "${restart_key}" "${restart_value}" "${proof_dir}/restart-node01-local-read.json" 180 + chainfire_wait_local_value 2 "${restart_key}" "${restart_value}" "${proof_dir}/restart-node02-local-read.json" 180 + chainfire_wait_local_value 3 "${restart_key}" "${restart_value}" "${proof_dir}/restart-node03-local-read.json" 180 + chainfire_wait_local_value 4 "${restart_key}" "${restart_value}" "${proof_dir}/restart-node04-local-read.json" 180 + + removed_leader_id="$(jq -r '.leader' 
"${proof_dir}/status-after-node04-restart.json")" + [[ "${removed_leader_id}" =~ ^[0-9]+$ ]] && (( removed_leader_id > 0 )) \ + || die "could not determine current ChainFire leader before live removal" + printf '%s\n' "${removed_leader_id}" >"${proof_dir}/leader-before-removal.txt" + + log "ChainFire live membership proof: removing current leader ${removed_leader_id}" + chainfire_delete_member_request "${removed_leader_id}" "${proof_dir}/member-remove-leader.json" + chainfire_wait_membership "(.members | length) == 3 and (all(.members[]; (.id | tostring) != \"${removed_leader_id}\"))" 180 >"${proof_dir}/membership-after-leader-removal.json" + new_leader_id="$(chainfire_wait_for_new_leader "${removed_leader_id}" 180)" + printf '%s\n' "${new_leader_id}" >"${proof_dir}/leader-after-removal.txt" + chainfire_status_json "127.0.0.1:12379" >"${proof_dir}/status-after-leader-removal.json" + + removed_key="chainfire-live-proof-removed-$(date +%s)-$RANDOM" + removed_value="removed-${RANDOM}" + chainfire_put_key "${removed_key}" "${removed_value}" "${proof_dir}/post-removal-put.json" + + if [[ "${removed_leader_id}" != "1" ]]; then + chainfire_wait_local_value 1 "${removed_key}" "${removed_value}" "${proof_dir}/post-removal-node01-local-read.json" 180 + fi + if [[ "${removed_leader_id}" != "2" ]]; then + chainfire_wait_local_value 2 "${removed_key}" "${removed_value}" "${proof_dir}/post-removal-node02-local-read.json" 180 + fi + if [[ "${removed_leader_id}" != "3" ]]; then + chainfire_wait_local_value 3 "${removed_key}" "${removed_value}" "${proof_dir}/post-removal-node03-local-read.json" 180 + fi + if [[ "${removed_leader_id}" != "4" ]]; then + chainfire_wait_local_value 4 "${removed_key}" "${removed_value}" "${proof_dir}/post-removal-node04-local-read.json" 180 + fi + chainfire_wait_local_absent "${removed_leader_id}" "${removed_key}" "${proof_dir}/post-removal-removed-leader-local-read.out" 180 + + log "ChainFire live membership proof: re-adding removed leader ${removed_leader_id}" + chainfire_post_member_request \ + "$(jq -cn \ + --argjson node_id "${removed_leader_id}" \ + --arg raft_addr "$(chainfire_raft_addr_for_id "${removed_leader_id}")" \ + --arg name "$(chainfire_node_name_for_id "${removed_leader_id}")" \ + --arg client_url "$(chainfire_client_url_for_id "${removed_leader_id}")" \ + '{node_id:$node_id, raft_addr:$raft_addr, client_url:$client_url, name:$name, is_learner:false}')" \ + "${proof_dir}/member-readd-leader.json" + chainfire_wait_membership "(.members | length) == 4 and any(.members[]; (.id | tostring) == \"${removed_leader_id}\" and ((.isLearner // .is_learner // false) == false))" 180 >"${proof_dir}/membership-after-leader-readd.json" + chainfire_wait_local_value "${removed_leader_id}" "${removed_key}" "${removed_value}" "${proof_dir}/post-readd-restored-leader-local-read.json" 180 + + log "ChainFire live membership proof: removing temporary node04 and restoring canonical shape" + chainfire_delete_member_request "4" "${proof_dir}/member-remove-node04.json" + chainfire_wait_membership '(.members | length) == 3 and (all(.members[]; (.id | tostring) != "4"))' 180 >"${proof_dir}/final-membership.json" + + final_key="chainfire-live-proof-final-$(date +%s)-$RANDOM" + final_value="final-${RANDOM}" + chainfire_put_key "${final_key}" "${final_value}" "${proof_dir}/final-put.json" + chainfire_wait_local_value 1 "${final_key}" "${final_value}" "${proof_dir}/final-node01-local-read.json" 180 + chainfire_wait_local_value 2 "${final_key}" "${final_value}" 
"${proof_dir}/final-node02-local-read.json" 180 + chainfire_wait_local_value 3 "${final_key}" "${final_value}" "${proof_dir}/final-node03-local-read.json" 180 + chainfire_wait_local_absent 4 "${final_key}" "${proof_dir}/final-node04-local-read.out" 180 + chainfire_wait_internal_http_from_node01 120 + chainfire_wait_internal_replication_from_node01 120 >"${proof_dir}/final-internal-replication.json" + + ssh_node node04 "cat /run/chainfire-live-membership-proof/chainfire.log" >"${proof_dir}/node04-temporary-chainfire.log" || true + curl -fsS http://127.0.0.1:18081/api/v1/cluster/status >"${proof_dir}/final-node01-rest-status.json" + curl -fsS http://127.0.0.1:18082/api/v1/cluster/status >"${proof_dir}/final-node02-rest-status.json" + curl -fsS http://127.0.0.1:18083/api/v1/cluster/status >"${proof_dir}/final-node03-rest-status.json" + validate_control_plane + + finished_at="$(date -Iseconds)" + jq -n \ + --arg started_at "${started_at}" \ + --arg finished_at "${finished_at}" \ + --arg artifact_root "${proof_dir}" \ + --arg leader_before_transfer_id "${leader_before_transfer_id}" \ + --arg transfer_target_id "${transfer_target_id}" \ + --arg removed_leader_id "${removed_leader_id}" \ + --arg new_leader_id "${new_leader_id}" \ + --arg summary "started from the canonical 3-node ChainFire control plane, scaled out by adding node04 as a learner then voter, transferred leadership to another live voter, restarted the temporary voter and revalidated local reads, removed the live leader and waited for re-election, re-added the removed leader, then scaled back in to the canonical 3-node shape while proving local serializable reads on every membership transition" \ + '{started_at:$started_at, finished_at:$finished_at, artifact_root:$artifact_root, leader_before_transfer_id:$leader_before_transfer_id, transfer_target_id:$transfer_target_id, removed_leader_id:$removed_leader_id, new_leader_id:$new_leader_id, summary:$summary, success:true}' \ + >"${proof_dir}/result.json" + + log "ChainFire live membership proof succeeded; artifacts are in ${proof_dir}" +} + validate_cluster() { preflight wait_requested @@ -10157,6 +10865,12 @@ rollout_soak_requested() { run_rollout_soak } +chainfire_live_membership_proof_requested() { + clean_requested "$@" + start_requested "$@" + run_chainfire_live_membership_proof +} + durability_proof_requested() { start_requested "$@" run_durability_proof @@ -10448,6 +11162,7 @@ Commands: fresh-matrix clean local runtime state, rebuild on the host, start, and validate composed service configurations provider-vm-reality-proof start the cluster if needed, then persist provider and VM-hosting interop artifacts under ./work/provider-vm-reality-proof rollout-soak clean local runtime state, rebuild on the host, start, and persist a longer-run control-plane and rollout soak under ./work/rollout-soak + chainfire-live-membership-proof clean local runtime state, rebuild on the host, start, and persist live ChainFire scale-out/replace artifacts under ./work/chainfire-live-membership-proof durability-proof start the cluster if needed, then persist durability and restore artifacts under ./work/durability-proof bench-storage start the cluster and benchmark CoronaFS plus LightningStor against the current running VMs fresh-bench-storage clean local runtime state, rebuild on the host, start, and benchmark CoronaFS plus LightningStor @@ -10482,6 +11197,7 @@ Examples: $0 fresh-matrix $0 provider-vm-reality-proof $0 rollout-soak + $0 chainfire-live-membership-proof $0 durability-proof $0 bench-storage 
$0 fresh-bench-storage @@ -10526,6 +11242,7 @@ main() { fresh-matrix) fresh_matrix_requested "$@" ;; provider-vm-reality-proof) provider_vm_reality_proof_requested "$@" ;; rollout-soak) rollout_soak_requested "$@" ;; + chainfire-live-membership-proof) chainfire_live_membership_proof_requested "$@" ;; durability-proof) durability_proof_requested "$@" ;; bench-storage) bench_storage_requested "$@" ;; fresh-bench-storage) fresh_bench_storage_requested "$@" ;; diff --git a/nix/test-cluster/run-core-control-plane-ops-proof.sh b/nix/test-cluster/run-core-control-plane-ops-proof.sh index b397a10..d7ebd23 100755 --- a/nix/test-cluster/run-core-control-plane-ops-proof.sh +++ b/nix/test-cluster/run-core-control-plane-ops-proof.sh @@ -95,8 +95,8 @@ main() { fi if (( rc == 0 )); then run_case chainfire-membership-contract \ - rg -n 'fixed-membership|replace-node|scale-out|unsupported on the supported surface' \ - README.md docs TODO.md chainfire/crates/chainfire-api/src/cluster_service.rs || rc=$? + rg -n 'MemberAdd|MemberRemove|MemberList|LeaderTransfer|TimeoutNow|chainfire-live-membership-proof|current-leader removal|leader transfer|temporary-voter restart|one-voter transitions|joint consensus|live membership' \ + README.md docs/control-plane-ops.md docs/testing.md nix/test-cluster/README.md chainfire/proto/chainfire.proto chainfire/crates/chainfire-api/src/cluster_service.rs || rc=$? fi if (( rc == 0 )); then run_case flaredb-migration-contract \ diff --git a/nix/test-cluster/run-publishable-kvm-suite.sh b/nix/test-cluster/run-publishable-kvm-suite.sh index af4a947..ededdbc 100755 --- a/nix/test-cluster/run-publishable-kvm-suite.sh +++ b/nix/test-cluster/run-publishable-kvm-suite.sh @@ -224,6 +224,7 @@ main() { run_case fresh-smoke nix run ./nix/test-cluster#cluster -- fresh-smoke run_case fresh-demo-vm-webapp nix run ./nix/test-cluster#cluster -- fresh-demo-vm-webapp run_case fresh-matrix nix run ./nix/test-cluster#cluster -- fresh-matrix + run_case chainfire-live-membership-proof nix run ./nix/test-cluster#cluster -- chainfire-live-membership-proof log "publishable KVM suite passed; logs in ${LOG_DIR}" }