# Chainfire Specification

> Version: 1.0 | Status: Draft | Last Updated: 2025-12-08

## 1. Overview

### 1.1 Purpose

Chainfire is a distributed key-value store designed for cluster management with etcd-compatible semantics. It provides strongly consistent storage with MVCC (Multi-Version Concurrency Control), watch notifications, and transaction support.

### 1.2 Scope

- **In scope**: Distributed KV storage, consensus (Raft), watch/subscribe, transactions, cluster membership
- **Out of scope**: SQL queries, secondary indexes, full-text search

### 1.3 Design Goals

- etcd API compatibility for ecosystem tooling
- High availability via Raft consensus
- Low latency for configuration management workloads
- Simple deployment (single binary)

## 2. Architecture

### 2.1 Crate Structure

```
chainfire/
├── crates/
│   ├── chainfire-api/      # gRPC service implementations
│   ├── chainfire-core/     # Embeddable cluster library, config, callbacks
│   ├── chainfire-gossip/   # SWIM gossip protocol (foca)
│   ├── chainfire-raft/     # OpenRaft integration
│   ├── chainfire-server/   # Server binary, config
│   ├── chainfire-storage/  # RocksDB state machine
│   ├── chainfire-types/    # Shared types (KV, Watch, Command)
│   └── chainfire-watch/    # Watch registry
├── chainfire-client/       # Rust client library
└── proto/
    ├── chainfire.proto     # Public API (KV, Watch, Cluster)
    └── internal.proto      # Raft internal RPCs (Vote, AppendEntries)
```

### 2.2 Data Flow

```
[Client gRPC] → [API Layer] → [Raft Node] → [State Machine] → [RocksDB]
                                                  ↓               ↓
                                        [Watch Registry] ← [Events]
```

### 2.3 Dependencies

| Crate | Version | Purpose |
|-------|---------|---------|
| tokio | 1.40 | Async runtime |
| tonic | 0.12 | gRPC framework |
| openraft | 0.9 | Raft consensus |
| rocksdb | 0.24 | Storage engine |
| foca | 1.0 | SWIM gossip protocol |
| prost | 0.13 | Protocol buffers |
| dashmap | 6 | Concurrent hash maps |

## 3. API

### 3.1 gRPC Services

#### KV Service (`chainfire.v1.KV`)

```protobuf
service KV {
  rpc Range(RangeRequest) returns (RangeResponse);
  rpc Put(PutRequest) returns (PutResponse);
  rpc Delete(DeleteRangeRequest) returns (DeleteRangeResponse);
  rpc Txn(TxnRequest) returns (TxnResponse);
}
```

**Range (Get/Scan)**

- Single key lookup: `key` set, `range_end` empty
- Range scan: `key` = start, `range_end` = end (exclusive)
- Prefix scan: `range_end` = prefix with its last byte incremented
- Options: `limit`, `revision` (point-in-time), `keys_only`, `count_only`

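The prefix-scan bound can be computed by incrementing the last byte of the prefix, carrying past any trailing `0xff` bytes (the same convention etcd uses); a minimal sketch, where `prefix_range_end` is an illustrative helper name:

```rust
/// Compute the exclusive `range_end` for a prefix scan: the smallest key
/// greater than every key that starts with `prefix`. Trailing 0xff bytes
/// cannot be incremented, so they are dropped before the carry; an all-0xff
/// prefix has no finite upper bound.
fn prefix_range_end(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(&last) = end.last() {
        if last < 0xff {
            *end.last_mut().unwrap() = last + 1;
            return Some(end);
        }
        end.pop(); // carry past a 0xff byte
    }
    None // every byte was 0xff: scan to the end of the keyspace
}

fn main() {
    assert_eq!(prefix_range_end(b"prefix/"), Some(b"prefix0".to_vec()));
    assert_eq!(prefix_range_end(&[0x01, 0xff]), Some(vec![0x02]));
    assert_eq!(prefix_range_end(&[0xff, 0xff]), None);
    println!("ok");
}
```
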
**Put**

- Writes a key-value pair
- Optional: `lease` (TTL), `prev_kv` (return previous value)

**Delete**

- Single key or range delete
- Optional: `prev_kv` (return deleted values)

**Transaction (Txn)**

- Atomic compare-and-swap operations
- `compare`: conditions to check
- `success`: operations applied if all conditions pass
- `failure`: operations applied if any condition fails

#### Watch Service (`chainfire.v1.Watch`)

```protobuf
service Watch {
  rpc Watch(stream WatchRequest) returns (stream WatchResponse);
}
```

- Bidirectional streaming
- Supports single-key, prefix, and range watches
- Historical replay via `start_revision`
- Progress notifications

#### Cluster Service (`chainfire.v1.Cluster`)

```protobuf
service Cluster {
  rpc MemberAdd(MemberAddRequest) returns (MemberAddResponse);
  rpc MemberRemove(MemberRemoveRequest) returns (MemberRemoveResponse);
  rpc MemberList(MemberListRequest) returns (MemberListResponse);
  rpc Status(StatusRequest) returns (StatusResponse);
}
```

### 3.2 Client Library

```rust
use chainfire_client::Client;

let mut client = Client::connect("http://127.0.0.1:2379").await?;

// Put
let revision = client.put("key", "value").await?;

// Get
let value = client.get("key").await?; // Option<Vec<u8>>

// Get with string convenience
let value = client.get_str("key").await?; // Option<String>

// Prefix scan
let kvs = client.get_prefix("prefix/").await?; // Vec<(key, value, revision)>

// Delete
let deleted = client.delete("key").await?; // bool

// Status
let status = client.status().await?;
println!("Leader: {}, Term: {}", status.leader, status.raft_term);
```

### 3.3 Public Traits (chainfire-core)
#### ClusterEventHandler

```rust
#[async_trait]
pub trait ClusterEventHandler: Send + Sync {
    async fn on_node_joined(&self, node: &NodeInfo) {}
    async fn on_node_left(&self, node_id: u64, reason: LeaveReason) {}
    async fn on_leader_changed(&self, old: Option<u64>, new: u64) {}
    async fn on_became_leader(&self) {}
    async fn on_lost_leadership(&self) {}
    async fn on_membership_changed(&self, members: &[NodeInfo]) {}
    async fn on_partition_detected(&self, reachable: &[u64], unreachable: &[u64]) {}
    async fn on_cluster_ready(&self) {}
}
```

#### KvEventHandler

```rust
#[async_trait]
pub trait KvEventHandler: Send + Sync {
    async fn on_key_changed(&self, namespace: &str, key: &[u8], value: &[u8], revision: u64) {}
    async fn on_key_deleted(&self, namespace: &str, key: &[u8], revision: u64) {}
    async fn on_prefix_changed(&self, namespace: &str, prefix: &[u8], entries: &[KvEntry]) {}
}
```

#### StorageBackend

```rust
#[async_trait]
pub trait StorageBackend: Send + Sync {
    async fn get(&self, key: &[u8]) -> io::Result<Option<Vec<u8>>>;
    async fn put(&self, key: &[u8], value: &[u8]) -> io::Result<()>;
    async fn delete(&self, key: &[u8]) -> io::Result<bool>;
}
```

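For tests, this contract can be satisfied by an in-memory map. A minimal synchronous sketch of the same semantics (the `MemoryBackend` name is illustrative; the real trait is async, and the `#[async_trait]` plumbing is omitted here):

```rust
use std::collections::HashMap;
use std::io;
use std::sync::Mutex;

/// In-memory stand-in for the StorageBackend contract (synchronous sketch;
/// the real trait returns futures via #[async_trait]).
struct MemoryBackend {
    map: Mutex<HashMap<Vec<u8>, Vec<u8>>>,
}

impl MemoryBackend {
    fn new() -> Self {
        Self { map: Mutex::new(HashMap::new()) }
    }

    fn get(&self, key: &[u8]) -> io::Result<Option<Vec<u8>>> {
        Ok(self.map.lock().unwrap().get(key).cloned())
    }

    fn put(&self, key: &[u8], value: &[u8]) -> io::Result<()> {
        self.map.lock().unwrap().insert(key.to_vec(), value.to_vec());
        Ok(())
    }

    /// Returns true if the key existed and was removed.
    fn delete(&self, key: &[u8]) -> io::Result<bool> {
        Ok(self.map.lock().unwrap().remove(key).is_some())
    }
}

fn main() -> io::Result<()> {
    let backend = MemoryBackend::new();
    backend.put(b"k", b"v")?;
    assert_eq!(backend.get(b"k")?, Some(b"v".to_vec()));
    assert!(backend.delete(b"k")?);
    assert!(!backend.delete(b"k")?); // second delete: key already gone
    println!("ok");
    Ok(())
}
```
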
### 3.4 Embeddable Library (chainfire-core)

```rust
use chainfire_core::{ClusterBuilder, ClusterEventHandler};

let cluster = ClusterBuilder::new(node_id)
    .name("node-1")
    .gossip_addr("0.0.0.0:7946".parse()?)
    .raft_addr("0.0.0.0:2380".parse()?)
    .on_cluster_event(MyHandler)
    .build()
    .await?;

// Use the KVS
cluster.kv().put("key", b"value").await?;
```

## 4. Data Models

### 4.1 Core Types

#### KeyValue Entry

```rust
pub struct KvEntry {
    pub key: Vec<u8>,
    pub value: Vec<u8>,
    pub create_revision: u64,  // Revision when created (immutable)
    pub mod_revision: u64,     // Last modification revision
    pub version: u64,          // Update count (1, 2, 3, ...)
    pub lease_id: Option<i64>, // Lease ID for TTL expiration
}
```

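How these fields evolve under MVCC can be illustrated with a toy apply loop: the global revision advances on every write, `create_revision` is fixed at first write, and `version` counts updates to that key. A simplified model (the `Store` and `apply_put` names are illustrative, not the actual state machine):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct KvEntry {
    value: Vec<u8>,
    create_revision: u64, // set once, on first put
    mod_revision: u64,    // bumped on every put
    version: u64,         // per-key update count
}

/// Toy MVCC store: one global revision counter, per-key metadata.
struct Store {
    revision: u64,
    data: HashMap<Vec<u8>, KvEntry>,
}

impl Store {
    fn new() -> Self {
        Self { revision: 0, data: HashMap::new() }
    }

    /// Each put advances the global revision and updates the entry's metadata.
    fn apply_put(&mut self, key: &[u8], value: &[u8]) -> u64 {
        self.revision += 1;
        let rev = self.revision;
        self.data
            .entry(key.to_vec())
            .and_modify(|e| {
                e.value = value.to_vec();
                e.mod_revision = rev;
                e.version += 1;
            })
            .or_insert(KvEntry {
                value: value.to_vec(),
                create_revision: rev,
                mod_revision: rev,
                version: 1,
            });
        rev
    }
}

fn main() {
    let mut s = Store::new();
    s.apply_put(b"a", b"1"); // revision 1: create "a"
    s.apply_put(b"b", b"x"); // revision 2: create "b"
    s.apply_put(b"a", b"2"); // revision 3: update "a"
    let a = &s.data[&b"a".to_vec()];
    assert_eq!((a.create_revision, a.mod_revision, a.version), (1, 3, 2));
    println!("ok");
}
```
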
#### Read Consistency Levels

```rust
pub enum ReadConsistency {
    Local,        // Read from local storage (may be stale)
    Serializable, // Verify with leader's committed index
    Linearizable, // Read only from leader (default, strongest)
}
```

#### Watch Event

```rust
pub enum WatchEventType {
    Put,
    Delete,
}

pub struct WatchEvent {
    pub event_type: WatchEventType,
    pub kv: KvEntry,
    pub prev_kv: Option<KvEntry>,
}
```

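Consumers typically fold a stream of these events into a local cache: a `Put` upserts the key, a `Delete` removes it. A minimal sketch using trimmed-down versions of the types above (the `apply_event` helper is illustrative, not part of the client API):

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct KvEntry {
    key: Vec<u8>,
    value: Vec<u8>, // other KvEntry fields omitted for brevity
}

enum WatchEventType { Put, Delete }

struct WatchEvent {
    event_type: WatchEventType,
    kv: KvEntry,
}

/// Fold one watch event into a local key-value cache.
fn apply_event(cache: &mut HashMap<Vec<u8>, Vec<u8>>, ev: &WatchEvent) {
    match ev.event_type {
        WatchEventType::Put => {
            cache.insert(ev.kv.key.clone(), ev.kv.value.clone());
        }
        WatchEventType::Delete => {
            cache.remove(&ev.kv.key);
        }
    }
}

fn main() {
    let mut cache = HashMap::new();
    let put = WatchEvent {
        event_type: WatchEventType::Put,
        kv: KvEntry { key: b"k".to_vec(), value: b"v".to_vec() },
    };
    apply_event(&mut cache, &put);
    assert_eq!(cache.get(&b"k".to_vec()), Some(&b"v".to_vec()));

    let del = WatchEvent {
        event_type: WatchEventType::Delete,
        kv: KvEntry { key: b"k".to_vec(), value: Vec::new() },
    };
    apply_event(&mut cache, &del);
    assert!(cache.is_empty());
    println!("ok");
}
```
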
#### Response Header

```rust
pub struct ResponseHeader {
    pub cluster_id: u64,
    pub member_id: u64,
    pub revision: u64, // Current store revision
    pub raft_term: u64,
}
```

### 4.2 Transaction Types

```rust
pub struct Compare {
    pub key: Vec<u8>,
    pub target: CompareTarget,
    pub result: CompareResult,
}

pub enum CompareTarget {
    Version(u64),
    CreateRevision(u64),
    ModRevision(u64),
    Value(Vec<u8>),
}

pub enum CompareResult {
    Equal,
    NotEqual,
    Greater,
    Less,
}
```

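Evaluating one condition reduces to an ordered comparison between the targeted field of the stored entry and the supplied operand. A minimal sketch (the `eval_compare` helper and the trimmed-down `KvEntry` are illustrative, not the shipped implementation):

```rust
use std::cmp::Ordering;

struct KvEntry {
    value: Vec<u8>,
    create_revision: u64,
    mod_revision: u64,
    version: u64,
}

enum CompareTarget {
    Version(u64),
    CreateRevision(u64),
    ModRevision(u64),
    Value(Vec<u8>),
}

enum CompareResult {
    Equal,
    NotEqual,
    Greater,
    Less,
}

/// True if the entry's targeted field stands in the requested relation
/// to the operand, i.e. "entry <result> operand".
fn eval_compare(entry: &KvEntry, target: &CompareTarget, result: &CompareResult) -> bool {
    let ord = match target {
        CompareTarget::Version(v) => entry.version.cmp(v),
        CompareTarget::CreateRevision(r) => entry.create_revision.cmp(r),
        CompareTarget::ModRevision(r) => entry.mod_revision.cmp(r),
        CompareTarget::Value(v) => entry.value.cmp(v),
    };
    match result {
        CompareResult::Equal => ord == Ordering::Equal,
        CompareResult::NotEqual => ord != Ordering::Equal,
        CompareResult::Greater => ord == Ordering::Greater,
        CompareResult::Less => ord == Ordering::Less,
    }
}

fn main() {
    let e = KvEntry { value: b"v1".to_vec(), create_revision: 5, mod_revision: 9, version: 3 };
    // Classic CAS guard: proceed only if the key is unchanged since mod_revision 9.
    assert!(eval_compare(&e, &CompareTarget::ModRevision(9), &CompareResult::Equal));
    assert!(!eval_compare(&e, &CompareTarget::Version(3), &CompareResult::Greater));
    println!("ok");
}
```
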
### 4.3 Storage Format

- **Engine**: RocksDB
- **Column Families**:
  - `raft_logs`: Raft log entries
  - `raft_meta`: Raft metadata (vote, term, membership)
  - `key_value`: KV data (key bytes → serialized KvEntry)
  - `snapshot`: Snapshot metadata
- **Metadata Keys**: `vote`, `last_applied`, `membership`, `revision`, `last_snapshot`
- **Serialization**: bincode for Raft, Protocol Buffers for gRPC
- **MVCC**: Global revision counter, per-key create/mod revisions

## 5. Configuration

### 5.1 Config File Format (TOML)

```toml
[node]
id = 1
name = "chainfire-1"
role = "control_plane" # or "worker"

[storage]
data_dir = "/var/lib/chainfire"

[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"

[cluster]
id = 1
bootstrap = true
initial_members = []

[raft]
role = "voter" # "voter", "learner", or "none"
```

### 5.2 Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| CHAINFIRE_DATA_DIR | ./data | Data directory |
| CHAINFIRE_API_ADDR | 127.0.0.1:2379 | Client API address |
| CHAINFIRE_RAFT_ADDR | 127.0.0.1:2380 | Raft peer address |

### 5.3 Raft Tuning

```
heartbeat_interval:      150ms  // Leader heartbeat
election_timeout_min:    300ms  // Min election timeout
election_timeout_max:    600ms  // Max election timeout
snapshot_policy:         LogsSinceLast(5000)
snapshot_max_chunk_size: 3MB
max_payload_entries:     300
```

## 6. Security

### 6.1 Authentication

- **Current**: None (development mode)
- **Planned**: mTLS for peer communication, token-based client auth

### 6.2 Authorization

- **Current**: All operations permitted
- **Planned**: RBAC integration with IAM (aegis)

### 6.3 Multi-tenancy

- **Namespace isolation**: Key prefix per tenant
- **Planned**: Per-namespace quotas, ACLs via IAM

## 7. Operations

### 7.1 Deployment

**Single Node (Bootstrap)**

```bash
# With bootstrap = true in config
chainfire-server --config config.toml
```

**Cluster (3-node)**

```bash
# Node 1 (bootstrap)
chainfire-server --config node1.toml

# Nodes 2 and 3 (join): set bootstrap = false, add node1 to initial_members
chainfire-server --config node2.toml
```

### 7.2 Monitoring

- **Health**: gRPC health check service
- **Metrics**: Prometheus endpoint (planned)
  - `chainfire_kv_operations_total`
  - `chainfire_raft_term`
  - `chainfire_storage_bytes`
  - `chainfire_watch_active`

### 7.3 Backup & Recovery

- **Snapshot**: Automatic via Raft (every 5000 log entries)
- **Manual backup**: Copy `data_dir` while the server is stopped
- **Point-in-time reads**: Use the `revision` parameter in Range requests

## 8. Compatibility

### 8.1 API Versioning

- gRPC package: `chainfire.v1`
- Breaking changes: new major version (v2, v3)
- Backward compatible: add fields, new RPCs

### 8.2 Wire Protocol

- Protocol Buffers 3
- tonic/prost for Rust
- Compatible with any gRPC client

### 8.3 etcd Compatibility

- **Compatible**: KV operations, Watch, basic transactions
- **Different**: gRPC package names, some field names
- **Not implemented**: Lease service, Auth service (planned)

## Appendix

### A. Error Codes

| Error | Meaning |
|-------|---------|
| NOT_LEADER | Node is not the Raft leader |
| KEY_NOT_FOUND | Key does not exist |
| REVISION_COMPACTED | Requested revision no longer available |
| TXN_FAILED | Transaction condition not met |

### B. Raft Commands

```rust
// Field types elided; see Section 4 for the KV and transaction types.
pub enum RaftCommand {
    Put { key, value, lease_id, prev_kv },
    Delete { key, prev_kv },
    DeleteRange { start, end, prev_kv },
    Txn { compare, success, failure },
    Noop, // Leadership establishment
}
```

### C. Port Assignments

| Port | Protocol | Purpose |
|------|----------|---------|
| 2379 | gRPC | Client API |
| 2380 | gRPC | Raft peer |
| 2381 | UDP | SWIM gossip |

### D. Node Roles

```rust
/// Role in cluster gossip
pub enum NodeRole {
    ControlPlane, // Participates in Raft consensus
    Worker,       // Gossip only, watches the control plane
}

/// Role in Raft consensus
pub enum RaftRole {
    Voter,   // Full voting member
    Learner, // Non-voting replica (receives log replication)
    None,    // No Raft participation (agent/proxy only)
}
```

### E. Internal Raft RPCs (internal.proto)

```protobuf
service RaftService {
  rpc Vote(VoteRequest) returns (VoteResponse);
  rpc AppendEntries(AppendEntriesRequest) returns (AppendEntriesResponse);
  rpc InstallSnapshot(stream InstallSnapshotRequest) returns (InstallSnapshotResponse);
}
```