# Chainfire Specification
> Version: 1.0 | Status: Draft | Last Updated: 2025-12-08
## 1. Overview
### 1.1 Purpose
Chainfire is a distributed key-value store designed for cluster management with etcd-compatible semantics. It provides strongly consistent storage with MVCC (Multi-Version Concurrency Control), watch notifications, and transaction support.
### 1.2 Scope
- **In scope**: Distributed KV storage, consensus (Raft), watch/subscribe, transactions, cluster membership
- **Out of scope**: SQL queries, secondary indexes, full-text search
### 1.3 Design Goals
- etcd API compatibility for ecosystem tooling
- High availability via Raft consensus
- Low latency for configuration management workloads
- Simple deployment (single binary)
## 2. Architecture
### 2.1 Crate Structure
```
chainfire/
├── crates/
│   ├── chainfire-api/      # gRPC service implementations
│   ├── chainfire-core/     # Embeddable cluster library, config, callbacks
│   ├── chainfire-gossip/   # SWIM gossip protocol (foca)
│   ├── chainfire-raft/     # OpenRaft integration
│   ├── chainfire-server/   # Server binary, config
│   ├── chainfire-storage/  # RocksDB state machine
│   ├── chainfire-types/    # Shared types (KV, Watch, Command)
│   └── chainfire-watch/    # Watch registry
├── chainfire-client/       # Rust client library
└── proto/
    ├── chainfire.proto     # Public API (KV, Watch, Cluster)
    └── internal.proto      # Raft internal RPCs (Vote, AppendEntries)
```
### 2.2 Data Flow
```
[Client gRPC] → [API Layer] → [Raft Node] → [State Machine] → [RocksDB]
                     ↓                            ↓
              [Watch Registry] ←───────────── [Events]
```
### 2.3 Dependencies
| Crate | Version | Purpose |
|-------|---------|---------|
| tokio | 1.40 | Async runtime |
| tonic | 0.12 | gRPC framework |
| openraft | 0.9 | Raft consensus |
| rocksdb | 0.24 | Storage engine |
| foca | 1.0 | SWIM gossip protocol |
| prost | 0.13 | Protocol buffers |
| dashmap | 6 | Concurrent hash maps |
## 3. API
### 3.1 gRPC Services
#### KV Service (`chainfire.v1.KV`)
```protobuf
service KV {
  rpc Range(RangeRequest) returns (RangeResponse);
  rpc Put(PutRequest) returns (PutResponse);
  rpc Delete(DeleteRangeRequest) returns (DeleteRangeResponse);
  rpc Txn(TxnRequest) returns (TxnResponse);
}
```
**Range (Get/Scan)**
- Single key lookup: `key` set, `range_end` empty
- Range scan: `key` = start, `range_end` = end (exclusive)
- Prefix scan: `key` = prefix, `range_end` = prefix with its last byte incremented
- Options: `limit`, `revision` (point-in-time), `keys_only`, `count_only`
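The `range_end` for a prefix scan can be computed by incrementing the prefix's last byte. A minimal sketch (the function name is hypothetical; the handling of an all-`0xFF` prefix follows the etcd convention of a single zero byte meaning "to the end of the keyspace", which this spec does not pin down):

```rust
/// Compute the exclusive `range_end` for a prefix scan: the prefix with its
/// last byte incremented. Trailing 0xFF bytes cannot be incremented, so they
/// are dropped first (a carry into the previous byte); an all-0xFF prefix
/// scans to the end of the keyspace, signalled here by a single zero byte.
fn prefix_range_end(prefix: &[u8]) -> Vec<u8> {
    let mut end = prefix.to_vec();
    while let Some(&last) = end.last() {
        if last < 0xFF {
            *end.last_mut().unwrap() = last + 1;
            return end;
        }
        end.pop(); // 0xFF: carry into the previous byte
    }
    vec![0] // all bytes were 0xFF: scan to end of keyspace
}
```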
**Put**
- Writes key-value pair
- Optional: `lease` (TTL), `prev_kv` (return previous)
**Delete**
- Single key or range delete
- Optional: `prev_kv` (return deleted values)
**Transaction (Txn)**
- Atomic compare-and-swap operations
- `compare`: Conditions to check
- `success`: Operations if all conditions pass
- `failure`: Operations if any condition fails
#### Watch Service (`chainfire.v1.Watch`)
```protobuf
service Watch {
  rpc Watch(stream WatchRequest) returns (stream WatchResponse);
}
```
- Bidirectional streaming
- Supports: single key, prefix, range watches
- Historical replay via `start_revision`
- Progress notifications
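Single-key, prefix, and range watches can all be modelled as one half-open interval check when events are dispatched. A sketch under the etcd-style convention assumed above (an empty `range_end` means a single-key watch; prefix watches use the incremented-prefix `range_end`):

```rust
/// Does `event_key` fall inside a watch registration?
/// Empty `range_end` = single-key watch; otherwise the watch covers the
/// half-open byte-wise interval [watch_key, range_end).
fn watch_matches(watch_key: &[u8], range_end: &[u8], event_key: &[u8]) -> bool {
    if range_end.is_empty() {
        event_key == watch_key // single-key watch
    } else {
        event_key >= watch_key && event_key < range_end
    }
}
```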
#### Cluster Service (`chainfire.v1.Cluster`)
```protobuf
service Cluster {
  rpc MemberAdd(MemberAddRequest) returns (MemberAddResponse);
  rpc MemberRemove(MemberRemoveRequest) returns (MemberRemoveResponse);
  rpc MemberList(MemberListRequest) returns (MemberListResponse);
  rpc Status(StatusRequest) returns (StatusResponse);
}
```
### 3.2 Client Library
```rust
use chainfire_client::Client;

let mut client = Client::connect("http://127.0.0.1:2379").await?;

// Put
let revision = client.put("key", "value").await?;

// Get
let value = client.get("key").await?; // Option<Vec<u8>>

// Get with string convenience
let value = client.get_str("key").await?; // Option<String>

// Prefix scan
let kvs = client.get_prefix("prefix/").await?; // Vec<(key, value, revision)>

// Delete
let deleted = client.delete("key").await?; // bool

// Status
let status = client.status().await?;
println!("Leader: {}, Term: {}", status.leader, status.raft_term);
```
### 3.3 Public Traits (chainfire-core)
#### ClusterEventHandler
```rust
#[async_trait]
pub trait ClusterEventHandler: Send + Sync {
    async fn on_node_joined(&self, node: &NodeInfo) {}
    async fn on_node_left(&self, node_id: u64, reason: LeaveReason) {}
    async fn on_leader_changed(&self, old: Option<u64>, new: u64) {}
    async fn on_became_leader(&self) {}
    async fn on_lost_leadership(&self) {}
    async fn on_membership_changed(&self, members: &[NodeInfo]) {}
    async fn on_partition_detected(&self, reachable: &[u64], unreachable: &[u64]) {}
    async fn on_cluster_ready(&self) {}
}
```
#### KvEventHandler
```rust
#[async_trait]
pub trait KvEventHandler: Send + Sync {
    async fn on_key_changed(&self, namespace: &str, key: &[u8], value: &[u8], revision: u64) {}
    async fn on_key_deleted(&self, namespace: &str, key: &[u8], revision: u64) {}
    async fn on_prefix_changed(&self, namespace: &str, prefix: &[u8], entries: &[KvEntry]) {}
}
```
#### StorageBackend
```rust
#[async_trait]
pub trait StorageBackend: Send + Sync {
    async fn get(&self, key: &[u8]) -> io::Result<Option<Vec<u8>>>;
    async fn put(&self, key: &[u8], value: &[u8]) -> io::Result<()>;
    async fn delete(&self, key: &[u8]) -> io::Result<bool>;
}
```
### 3.4 Embeddable Library (chainfire-core)
```rust
use chainfire_core::{ClusterBuilder, ClusterEventHandler};

let cluster = ClusterBuilder::new(node_id)
    .name("node-1")
    .gossip_addr("0.0.0.0:7946".parse()?)
    .raft_addr("0.0.0.0:2380".parse()?)
    .on_cluster_event(MyHandler)
    .build()
    .await?;

// Use the KVS
cluster.kv().put("key", b"value").await?;
```
## 4. Data Models
### 4.1 Core Types
#### KeyValue Entry
```rust
pub struct KvEntry {
    pub key: Vec<u8>,
    pub value: Vec<u8>,
    pub create_revision: u64,  // Revision when created (immutable)
    pub mod_revision: u64,     // Last modification revision
    pub version: u64,          // Update count (1, 2, 3, ...)
    pub lease_id: Option<i64>, // Lease ID for TTL expiration
}
```
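The interplay of the global revision counter with `create_revision`, `mod_revision`, and `version` can be illustrated with a minimal in-memory sketch (the `MvccStore` type is hypothetical, not chainfire's state machine; it only demonstrates the bookkeeping described above):

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct KvEntry {
    value: Vec<u8>,
    create_revision: u64,
    mod_revision: u64,
    version: u64,
}

/// One global revision counter, bumped on every write; per-key
/// create/mod revisions and a version (update count).
struct MvccStore {
    revision: u64,
    keys: HashMap<Vec<u8>, KvEntry>,
}

impl MvccStore {
    fn new() -> Self {
        Self { revision: 0, keys: HashMap::new() }
    }

    /// Write a key and return the store revision of this write.
    fn put(&mut self, key: &[u8], value: &[u8]) -> u64 {
        self.revision += 1;
        let rev = self.revision;
        self.keys
            .entry(key.to_vec())
            .and_modify(|e| {
                e.value = value.to_vec();
                e.mod_revision = rev; // create_revision stays fixed
                e.version += 1;
            })
            .or_insert(KvEntry {
                value: value.to_vec(),
                create_revision: rev,
                mod_revision: rev,
                version: 1,
            });
        rev
    }
}
```

After `put("a", …)`, `put("b", …)`, `put("a", …)`, key `a` ends up with `create_revision = 1`, `mod_revision = 3`, `version = 2`.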
#### Read Consistency Levels
```rust
pub enum ReadConsistency {
    Local,        // Read from local storage (may be stale)
    Serializable, // Verify with leader's committed index
    Linearizable, // Read only from leader (default, strongest)
}
```
#### Watch Event
```rust
pub enum WatchEventType {
    Put,
    Delete,
}

pub struct WatchEvent {
    pub event_type: WatchEventType,
    pub kv: KvEntry,
    pub prev_kv: Option<KvEntry>,
}
```
#### Response Header
```rust
pub struct ResponseHeader {
    pub cluster_id: u64,
    pub member_id: u64,
    pub revision: u64, // Current store revision
    pub raft_term: u64,
}
```
### 4.2 Transaction Types
```rust
pub struct Compare {
    pub key: Vec<u8>,
    pub target: CompareTarget,
    pub result: CompareResult,
}

pub enum CompareTarget {
    Version(u64),
    CreateRevision(u64),
    ModRevision(u64),
    Value(Vec<u8>),
}

pub enum CompareResult {
    Equal,
    NotEqual,
    Greater,
    Less,
}
```
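Evaluating one guard against the current state of its key is a single comparison. A sketch (the function is illustrative; treating an absent key as zeroed revisions/version and an empty value follows the etcd convention, which this spec does not state explicitly):

```rust
use std::cmp::Ordering;

pub enum CompareTarget {
    Version(u64),
    CreateRevision(u64),
    ModRevision(u64),
    Value(Vec<u8>),
}

pub enum CompareResult {
    Equal,
    NotEqual,
    Greater,
    Less,
}

pub struct KvEntry {
    pub value: Vec<u8>,
    pub create_revision: u64,
    pub mod_revision: u64,
    pub version: u64,
}

/// True iff the entry's actual value/revision stands in the `result`
/// relation to the target. `entry = None` (key absent) compares as
/// zero / empty value.
fn compare_holds(entry: Option<&KvEntry>, target: &CompareTarget, result: &CompareResult) -> bool {
    let ord = match target {
        CompareTarget::Version(v) => entry.map_or(0, |e| e.version).cmp(v),
        CompareTarget::CreateRevision(r) => entry.map_or(0, |e| e.create_revision).cmp(r),
        CompareTarget::ModRevision(r) => entry.map_or(0, |e| e.mod_revision).cmp(r),
        CompareTarget::Value(v) => entry.map_or(&[][..], |e| e.value.as_slice()).cmp(v.as_slice()),
    };
    match result {
        CompareResult::Equal => ord == Ordering::Equal,
        CompareResult::NotEqual => ord != Ordering::Equal,
        CompareResult::Greater => ord == Ordering::Greater,
        CompareResult::Less => ord == Ordering::Less,
    }
}
```

A transaction then runs `success` if `compare_holds` is true for every guard, else `failure`.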
### 4.3 Storage Format
- **Engine**: RocksDB
- **Column Families**:
- `raft_logs`: Raft log entries
- `raft_meta`: Raft metadata (vote, term, membership)
- `key_value`: KV data (key bytes → serialized KvEntry)
- `snapshot`: Snapshot metadata
- **Metadata Keys**: `vote`, `last_applied`, `membership`, `revision`, `last_snapshot`
- **Serialization**: bincode for Raft, Protocol Buffers for gRPC
- **MVCC**: Global revision counter, per-key create/mod revisions
## 5. Configuration
### 5.1 Config File Format (TOML)
```toml
[node]
id = 1
name = "chainfire-1"
role = "control_plane" # or "worker"

[storage]
data_dir = "/var/lib/chainfire"

[network]
api_addr = "0.0.0.0:2379"
raft_addr = "0.0.0.0:2380"
gossip_addr = "0.0.0.0:2381"

[cluster]
id = 1
bootstrap = true
initial_members = []

[raft]
role = "voter" # "voter", "learner", or "none"
```
### 5.2 Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| CHAINFIRE_DATA_DIR | ./data | Data directory |
| CHAINFIRE_API_ADDR | 127.0.0.1:2379 | Client API address |
| CHAINFIRE_RAFT_ADDR | 127.0.0.1:2380 | Raft peer address |
### 5.3 Raft Tuning
```rust
Config {
    heartbeat_interval: 150,   // ms: leader heartbeat
    election_timeout_min: 300, // ms: min election timeout
    election_timeout_max: 600, // ms: max election timeout
    snapshot_policy: SnapshotPolicy::LogsSinceLast(5000),
    snapshot_max_chunk_size: 3 * 1024 * 1024, // 3 MiB
    max_payload_entries: 300,
    ..Default::default()
}
```
## 6. Security
### 6.1 Authentication
- **Current**: None (development mode)
- **Planned**: mTLS for peer communication, token-based client auth
### 6.2 Authorization
- **Current**: All operations permitted
- **Planned**: RBAC integration with IAM (aegis)
### 6.3 Multi-tenancy
- **Namespace isolation**: Key prefix per tenant
- **Planned**: Per-namespace quotas, ACLs via IAM
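Key-prefix isolation amounts to rewriting every key under a tenant prefix before it reaches the store. A sketch (the separator and function name are assumptions for illustration; the spec only says "key prefix per tenant"):

```rust
/// Prepend the tenant namespace to a data key. A NUL separator keeps
/// tenant names out of the data key space (assumes tenant names
/// contain no NUL byte).
fn namespaced_key(namespace: &str, key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(namespace.len() + 1 + key.len());
    out.extend_from_slice(namespace.as_bytes());
    out.push(0); // separator
    out.extend_from_slice(key);
    out
}
```

Because all of a tenant's keys share one prefix, quota accounting and ACL checks reduce to prefix scans over that tenant's range.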
## 7. Operations
### 7.1 Deployment
**Single Node (Bootstrap)**
```bash
chainfire-server --config config.toml
# With bootstrap = true in config
```
**Cluster (3-node)**
```bash
# Node 1 (bootstrap)
chainfire-server --config node1.toml
# Node 2, 3 (join)
# Set bootstrap = false, add node1 to initial_members
chainfire-server --config node2.toml
```
### 7.2 Monitoring
- **Health**: gRPC health check service
- **Metrics**: Prometheus endpoint (planned)
- `chainfire_kv_operations_total`
- `chainfire_raft_term`
- `chainfire_storage_bytes`
- `chainfire_watch_active`
### 7.3 Backup & Recovery
- **Snapshot**: Automatic via Raft (every 5000 log entries)
- **Manual backup**: Copy data_dir while stopped
- **Point-in-time**: Use revision parameter in Range requests
## 8. Compatibility
### 8.1 API Versioning
- gRPC package: `chainfire.v1`
- Breaking changes: New major version (v2, v3)
- Backward compatible: Add fields, new RPCs
### 8.2 Wire Protocol
- Protocol Buffers 3
- tonic/prost for Rust
- Compatible with any gRPC client
### 8.3 etcd Compatibility
- **Compatible**: KV operations, Watch, basic transactions
- **Different**: gRPC package names, some field names
- **Not implemented**: Lease service, Auth service (planned)
## Appendix
### A. Error Codes
| Error | Meaning |
|-------|---------|
| NOT_LEADER | Node is not the Raft leader |
| KEY_NOT_FOUND | Key does not exist |
| REVISION_COMPACTED | Requested revision no longer available |
| TXN_FAILED | Transaction condition not met |
### B. Raft Commands
```rust
pub enum RaftCommand {
    Put { key: Vec<u8>, value: Vec<u8>, lease_id: Option<i64>, prev_kv: bool },
    Delete { key: Vec<u8>, prev_kv: bool },
    DeleteRange { start: Vec<u8>, end: Vec<u8>, prev_kv: bool },
    Txn { compare: Vec<Compare>, success: Vec<RaftCommand>, failure: Vec<RaftCommand> },
    Noop, // Leadership establishment
}
```
### C. Port Assignments
| Port | Protocol | Purpose |
|------|----------|---------|
| 2379 | gRPC | Client API |
| 2380 | gRPC | Raft peer |
| 2381 | UDP | SWIM gossip |
### D. Node Roles
```rust
/// Role in cluster gossip
pub enum NodeRole {
    ControlPlane, // Participates in Raft consensus
    Worker,       // Gossip only, watches Control Plane
}

/// Role in Raft consensus
pub enum RaftRole {
    Voter,   // Full voting member
    Learner, // Non-voting replica (receives log replication)
    None,    // No Raft participation (agent/proxy only)
}
```
### E. Internal Raft RPCs (internal.proto)
```protobuf
service RaftService {
  rpc Vote(VoteRequest) returns (VoteResponse);
  rpc AppendEntries(AppendEntriesRequest) returns (AppendEntriesResponse);
  rpc InstallSnapshot(stream InstallSnapshotRequest) returns (InstallSnapshotResponse);
}
```