FlareDB Specification
Version: 1.0 | Status: Draft | Last Updated: 2025-12-08
1. Overview
1.1 Purpose
FlareDB is a distributed key-value store designed for DBaaS (Database as a Service) workloads. It provides dual consistency modes: eventual consistency with LWW (Last-Write-Wins) for high throughput, and strong consistency via Raft for transactional operations.
1.2 Scope
- In scope: Multi-region KV storage, dual consistency modes, CAS operations, TSO (Timestamp Oracle), namespace isolation
- Out of scope: SQL queries (layer above), secondary indexes, full-text search
1.3 Design Goals
- TiKV-inspired multi-Raft architecture
- Tsurugi-like high performance
- Flexible per-namespace consistency modes
- Horizontal scalability via region splitting
2. Architecture
2.1 Crate Structure
flaredb/
├── crates/
│ ├── flaredb-cli/ # CLI tool (flaredb-cli)
│ ├── flaredb-client/ # Rust client library
│ ├── flaredb-pd/ # Placement Driver server
│ ├── flaredb-proto/ # gRPC definitions (proto files)
│ ├── flaredb-raft/ # OpenRaft integration, Multi-Raft
│ ├── flaredb-server/ # KV server binary, services
│ ├── flaredb-storage/ # RocksDB engine
│ └── flaredb-types/ # Shared types (RegionMeta, commands)
└── proto/ (symlink to flaredb-proto/src/)
2.2 Data Flow
[Client] → [KvRaw/KvCas Service] → [Namespace Router]
                                   ↓             ↓
                          [Eventual Mode]   [Strong Mode]
                                   ↓             ↓
                          [Local RocksDB]   [Raft Consensus]
                                   ↓             ↓
                      [Async Replication]   [State Machine → RocksDB]
2.3 Multi-Raft Architecture
┌─────────────────────────────────────────────────────┐
│ PD Cluster │
│ (Region metadata, TSO, Store registration) │
└─────────────────────────────────────────────────────┘
↓ ↓ ↓
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Store 1 │ │ Store 2 │ │ Store 3 │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Region 1 │ │ │ │ Region 1 │ │ │ │ Region 1 │ │
│ │ (Leader) │ │ │ │ (Follower)│ │ │ │ (Follower)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Region 2 │ │ │ │ Region 2 │ │ │ │ Region 2 │ │
│ │ (Follower)│ │ │ │ (Leader) │ │ │ │ (Follower)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘ └───────────────┘
2.4 Dependencies
| Crate | Version | Purpose |
|---|---|---|
| tokio | 1.40 | Async runtime |
| tonic | 0.12 | gRPC framework |
| openraft | 0.9 | Raft consensus |
| rocksdb | 0.24 | Storage engine |
| prost | 0.13 | Protocol buffers |
| clap | 4.5 | CLI argument parsing |
| sha2 | 0.10 | Merkle tree hashing |
3. API
3.1 gRPC Services
KvRaw Service (Eventual Consistency)
service KvRaw {
rpc RawPut(RawPutRequest) returns (RawPutResponse);
rpc RawGet(RawGetRequest) returns (RawGetResponse);
rpc RawScan(RawScanRequest) returns (RawScanResponse);
}
message RawPutRequest {
bytes key = 1;
bytes value = 2;
string namespace = 3; // Empty = default namespace
}
message RawScanRequest {
bytes start_key = 1; // Inclusive
bytes end_key = 2; // Exclusive (empty = no upper bound)
uint32 limit = 3; // Max entries (0 = default 100)
string namespace = 4;
}
message RawScanResponse {
repeated bytes keys = 1;
repeated bytes values = 2;
bool has_more = 3;
bytes next_key = 4; // For pagination
}
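The has_more/next_key fields drive cursor-style pagination: a client resumes the next scan at next_key. A minimal server-side sketch over an in-memory ordered map (scan_page and its shape are illustrative, not the actual implementation):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Return up to `limit` entries starting at `start` (inclusive), plus the
// key to resume from when more entries remain (maps to next_key/has_more).
fn scan_page(
    db: &BTreeMap<Vec<u8>, Vec<u8>>,
    start: &[u8],
    limit: usize,
) -> (Vec<(Vec<u8>, Vec<u8>)>, Option<Vec<u8>>) {
    let mut iter = db.range::<[u8], _>((Bound::Included(start), Bound::Unbounded));
    let entries: Vec<_> = iter
        .by_ref()
        .take(limit)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    let next_key = iter.next().map(|(k, _)| k.clone());
    (entries, next_key)
}
```

A returned `None` for next_key corresponds to `has_more = false` in RawScanResponse.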
KvCas Service (Strong Consistency)
service KvCas {
rpc CompareAndSwap(CasRequest) returns (CasResponse);
rpc Get(GetRequest) returns (GetResponse);
rpc Scan(ScanRequest) returns (ScanResponse);
}
message CasRequest {
bytes key = 1;
bytes value = 2;
uint64 expected_version = 3; // 0 = create if not exists
string namespace = 4;
}
message CasResponse {
bool success = 1;
uint64 current_version = 2;
uint64 new_version = 3;
}
message GetResponse {
bool found = 1;
bytes value = 2;
uint64 version = 3;
}
message ScanResponse {
repeated VersionedKv entries = 1;
bool has_more = 2;
bytes next_key = 3;
}
message VersionedKv {
bytes key = 1;
bytes value = 2;
uint64 version = 3;
}
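The CAS contract above (expected_version, success, current/new version) can be sketched against an in-memory map; MemCas and its field names are illustrative and not part of FlareDB:

```rust
use std::collections::HashMap;

// In-memory sketch of CompareAndSwap semantics: expected_version 0 means
// "create if not exists"; versions start at 1 and increment on each write.
struct MemCas {
    map: HashMap<Vec<u8>, (u64, Vec<u8>)>, // key -> (version, value)
}

impl MemCas {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Returns (success, current_version, new_version), mirroring CasResponse.
    fn cas(&mut self, key: &[u8], value: &[u8], expected: u64) -> (bool, u64, u64) {
        let current = self.map.get(key).map(|(v, _)| *v).unwrap_or(0);
        if current == expected {
            let new = current + 1;
            self.map.insert(key.to_vec(), (new, value.to_vec()));
            (true, current, new)
        } else {
            (false, current, current)
        }
    }
}
```

On failure the caller re-reads the current version from the response and retries, which is the usual optimistic-concurrency loop.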
PD Service (Placement Driver)
service Pd {
rpc RegisterStore(RegisterStoreRequest) returns (RegisterStoreResponse);
rpc GetRegion(GetRegionRequest) returns (GetRegionResponse);
rpc ListRegions(ListRegionsRequest) returns (ListRegionsResponse);
}
service Tso {
rpc GetTimestamp(TsoRequest) returns (TsoResponse);
}
message Region {
uint64 id = 1;
bytes start_key = 2; // Inclusive (empty = start of keyspace)
bytes end_key = 3; // Exclusive (empty = infinity)
repeated uint64 peers = 4;
uint64 leader_id = 5;
}
message Store {
uint64 id = 1;
string addr = 2;
}
3.2 Client Library
use flaredb_client::RdbClient;
// Connect with PD for region routing
let mut client = RdbClient::connect_with_pd(
"127.0.0.1:50051", // KV server (unused, routing via PD)
"127.0.0.1:2379", // PD server
).await?;
// Or with namespace isolation
let mut client = RdbClient::connect_with_pd_namespace(
"127.0.0.1:50051",
"127.0.0.1:2379",
"my_namespace",
).await?;
// TSO (Timestamp Oracle)
let ts = client.get_tso().await?;
// Raw API (Eventual Consistency)
client.raw_put(b"key".to_vec(), b"value".to_vec()).await?;
let value = client.raw_get(b"key".to_vec()).await?; // Option<Vec<u8>>
// CAS API (Strong Consistency)
let (success, current, new_ver) = client.cas(
b"key".to_vec(),
b"value".to_vec(),
0, // expected_version: 0 = create if not exists
).await?;
let entry = client.cas_get(b"key".to_vec()).await?; // Option<(version, value)>
// Scan with pagination
let (entries, next_key) = client.cas_scan(
b"start".to_vec(),
b"end".to_vec(),
100, // limit
).await?;
4. Data Models
4.1 Core Types
Region Metadata
pub struct RegionMeta {
pub id: u64,
pub start_key: Vec<u8>, // Inclusive, empty = start of keyspace
pub end_key: Vec<u8>, // Exclusive, empty = infinity
}
Namespace Configuration
pub struct NamespaceConfig {
pub id: u32,
pub name: String,
pub mode: ConsistencyMode,
pub explicit: bool, // User-defined vs auto-created
}
pub enum ConsistencyMode {
Strong, // CAS API, Raft consensus
Eventual, // Raw API, LWW replication
}
Raft Log Entry
pub enum FlareRequest {
KvWrite {
namespace_id: u32,
key: Vec<u8>,
value: Vec<u8>,
ts: u64,
},
Split {
region_id: u64,
split_key: Vec<u8>,
new_region_id: u64,
},
Noop,
}
4.2 Key Encoding
Raw/CAS Key Format:
┌──────────────────┬────────────────────────┐
│ namespace_id (4B)│ user_key (var) │
│ big-endian │ │
└──────────────────┴────────────────────────┘
Raw Value Format:
┌──────────────────┬────────────────────────┐
│ timestamp (8B) │ user_value (var) │
│ big-endian │ │
└──────────────────┴────────────────────────┘
CAS Value Format:
┌──────────────────┬──────────────────┬────────────────────────┐
│ version (8B) │ timestamp (8B) │ user_value (var) │
│ big-endian │ big-endian │ │
└──────────────────┴──────────────────┴────────────────────────┘
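The layouts above can be sketched as Rust helpers (function names are illustrative, not the actual flaredb-storage API):

```rust
// Key: 4-byte big-endian namespace_id prefix followed by the user key.
fn encode_key(namespace_id: u32, user_key: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + user_key.len());
    buf.extend_from_slice(&namespace_id.to_be_bytes());
    buf.extend_from_slice(user_key);
    buf
}

// Raw value: 8-byte big-endian TSO timestamp followed by the user value.
fn encode_raw_value(ts: u64, user_value: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + user_value.len());
    buf.extend_from_slice(&ts.to_be_bytes());
    buf.extend_from_slice(user_value);
    buf
}

// Split a raw value back into (timestamp, user_value).
fn decode_raw_value(buf: &[u8]) -> (u64, &[u8]) {
    let ts = u64::from_be_bytes(buf[..8].try_into().unwrap());
    (ts, &buf[8..])
}
```

Big-endian prefixes keep keys within a namespace contiguous under RocksDB's lexicographic ordering, which is what makes range scans per namespace cheap.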
4.3 Reserved Namespaces
| Namespace | Mode | Purpose |
|---|---|---|
| iam | Strong | IAM data (principals, roles) |
| metrics | Strong | System metrics |
| _system | Strong | Internal metadata |
4.4 Storage Format
- Engine: RocksDB
- Column Families:
  - default: Raw KV data
  - cas: Versioned CAS data
  - raft_log: Raft log entries
  - raft_state: Raft metadata (hard_state, vote)
- Serialization: Protocol Buffers
5. Configuration
5.1 Namespace Configuration
ServerConfig {
namespaces: HashMap<String, NamespaceConfig>,
default_mode: ConsistencyMode, // For auto-created namespaces
reserved_namespaces: ["iam", "metrics", "_system"],
}
5.2 Raft Configuration
Config {
heartbeat_interval: 100, // ms
election_timeout_min: 300, // ms
election_timeout_max: 600, // ms
snapshot_policy: LogsSinceLast(1000),
max_in_snapshot_log_to_keep: 100,
}
5.3 CLI Arguments
flaredb-server [OPTIONS]
--store-id <ID> Store ID (default: 1)
--addr <ADDR> KV server address (default: 127.0.0.1:50051)
--data-dir <PATH> Data directory (default: data)
--pd-addr <ADDR> PD server address (default: 127.0.0.1:2379)
--peer <ID=ADDR> Peer addresses (repeatable)
--namespace-mode <NS=MODE> Namespace modes (repeatable, e.g., myns=eventual)
flaredb-pd [OPTIONS]
--addr <ADDR> PD server address (default: 127.0.0.1:2379)
6. Consistency Models
6.1 Eventual Consistency (Raw API)
- Write Path: Local RocksDB → Async Raft replication
- Read Path: Local RocksDB read
- Conflict Resolution: Last-Write-Wins (LWW) using TSO timestamps
- Guarantees: Eventually consistent, high throughput
Write: Client → Local Store → RocksDB
↓ (async)
Raft Replication → Other Stores
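The LWW rule reduces to a small merge function (a sketch; the signature is assumed, not taken from the codebase):

```rust
// Last-Write-Wins merge: keep whichever (timestamp, value) pair carries the
// larger TSO timestamp; ties favor the local copy.
fn lww_merge(
    local: Option<(u64, Vec<u8>)>,
    remote: (u64, Vec<u8>),
) -> (u64, Vec<u8>) {
    match local {
        Some(l) if l.0 >= remote.0 => l,
        _ => remote,
    }
}
```

Because TSO timestamps are globally monotonic, concurrent writers in different stores still converge to the same winner.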
6.2 Strong Consistency (CAS API)
- Write Path: Raft consensus → Apply to state machine
- Read Path: ensure_linearizable() → Leader read
- Guarantees: Linearizable reads and writes
Write: Client → Leader → Raft Consensus → All Stores
Read: Client → Leader → Verify leadership → Return
6.3 TSO (Timestamp Oracle)
// 64-bit timestamp format
┌────────────────────────────────┬─────────────────┐
│ Physical time (48 bits) │ Logical (16 bits)│
│ milliseconds since epoch │ 0-65535 │
└────────────────────────────────┴─────────────────┘
impl TsoOracle {
fn get_timestamp(count: u32) -> u64;
fn physical_time(ts: u64) -> u64; // Upper 48 bits
fn logical_counter(ts: u64) -> u16; // Lower 16 bits
fn compose(physical: u64, logical: u16) -> u64;
}
Properties:
- Monotonically increasing
- Thread-safe (AtomicU64)
- Batch allocation support
- ~65536 timestamps per millisecond
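The bit layout above implies the following pure functions (a sketch of the TsoOracle helpers, omitting the atomic state):

```rust
// Compose: physical milliseconds in the upper 48 bits, logical counter in
// the lower 16 bits.
fn compose(physical_ms: u64, logical: u16) -> u64 {
    (physical_ms << 16) | logical as u64
}

fn physical_time(ts: u64) -> u64 {
    ts >> 16
}

fn logical_counter(ts: u64) -> u16 {
    (ts & 0xFFFF) as u16
}
```

Monotonicity falls out of the layout: once the 16-bit logical counter is exhausted, any timestamp in the next millisecond is strictly larger.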
7. Anti-Entropy & Replication
7.1 Merkle Tree Synchronization
For eventual consistency mode, FlareDB uses Merkle trees for anti-entropy:
rpc GetMerkle(GetMerkleRequest) returns (GetMerkleResponse);
rpc FetchRange(FetchRangeRequest) returns (FetchRangeResponse);
message GetMerkleResponse {
bytes root = 1; // sha256 root hash
repeated MerkleRange leaves = 2; // per-chunk hashes
}
Anti-entropy flow:
- Replica requests the Merkle root from the leader
- Compare leaf hashes to identify divergent ranges
- Fetch divergent ranges via FetchRange
- Apply LWW merge using timestamps
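The hash-comparison step might look like this (a sketch; the per-chunk hash lists would come from GetMerkleResponse.leaves, and divergent_chunks is an illustrative name):

```rust
// Return the indices of chunks whose hashes differ between two replicas;
// only these ranges need to be fetched and LWW-merged.
fn divergent_chunks(local: &[Vec<u8>], remote: &[Vec<u8>]) -> Vec<usize> {
    local
        .iter()
        .zip(remote.iter())
        .enumerate()
        .filter(|(_, (l, r))| l != r)
        .map(|(i, _)| i)
        .collect()
}
```

Comparing the sha256 root first lets replicas skip this entirely when they already agree.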
7.2 Chainfire Integration
FlareDB integrates with Chainfire as its Placement Driver backend:
- Store registration and heartbeat
- Region metadata watch notifications
- Leader reporting for region routing
// Server connects to Chainfire PD
PdClient::connect(pd_addr).await?;
pd_client.register_store(store_id, addr).await?;
pd_client.start_watch().await?; // Watch for metadata changes
7.3 Namespace Mode Updates (Runtime)
rpc UpdateNamespaceMode(UpdateNamespaceModeRequest) returns (UpdateNamespaceModeResponse);
rpc ListNamespaceModes(ListNamespaceModesRequest) returns (ListNamespaceModesResponse);
Namespaces can be switched between strong and eventual modes at runtime (except reserved namespaces).
8. Operations
8.1 Cluster Bootstrap
Single Node
flaredb-server --pd-addr 127.0.0.1:2379 --data-dir ./data
# First node auto-creates region covering entire keyspace
Multi-Node (3+ nodes)
# Node 1 (bootstrap)
flaredb-server --pd-addr 127.0.0.1:2379 --bootstrap
# Nodes 2, 3
flaredb-server --pd-addr 127.0.0.1:2379 --join
# Auto-creates 3-replica Raft group
8.2 Region Operations
Region Split
// When region exceeds size threshold
FlareRequest::Split {
region_id: 1,
split_key: b"middle_key".to_vec(),
new_region_id: 2,
}
Region Discovery
// Client queries PD for routing
let (region, leader) = pd.get_region(key).await?;
let store_addr = leader.addr;
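The routing decision reduces to a range lookup over region metadata; a sketch using the RegionMeta shape from §4.1 (find_region is an illustrative helper, not the PD API):

```rust
struct RegionMeta {
    id: u64,
    start_key: Vec<u8>, // inclusive; empty = start of keyspace
    end_key: Vec<u8>,   // exclusive; empty = infinity
}

// Find the region whose [start_key, end_key) range contains `key`.
fn find_region<'a>(regions: &'a [RegionMeta], key: &[u8]) -> Option<&'a RegionMeta> {
    regions.iter().find(|r| {
        key >= r.start_key.as_slice()
            && (r.end_key.is_empty() || key < r.end_key.as_slice())
    })
}
```

Clients typically cache this mapping and refresh it from PD on a NOT_LEADER or REGION_NOT_FOUND error.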
8.3 Monitoring
- Health: gRPC health check service
- Metrics (planned):
  - flaredb_kv_operations_total{type=raw|cas}
  - flaredb_region_count
  - flaredb_raft_proposals_total
  - flaredb_tso_requests_total
9. Security
9.1 Multi-tenancy
- Namespace isolation: Separate keyspace per namespace
- Reserved namespaces: System namespaces immutable
- Future: Per-namespace ACLs via IAM integration
9.2 Authentication
- Current: None (development mode)
- Planned: mTLS, token-based auth
10. Compatibility
10.1 API Versioning
- gRPC packages: flaredb.kvrpc, flaredb.pdpb
- Wire protocol: Protocol Buffers 3
10.2 TiKV Inspiration
- Multi-Raft per region (similar architecture)
- PD for metadata management
- TSO for timestamps
- Differences from TiKV: dual consistency modes, simpler API
Appendix
A. Error Codes
| Error | Meaning |
|---|---|
| NOT_LEADER | Node is not region leader |
| REGION_NOT_FOUND | Key not in any region |
| VERSION_MISMATCH | CAS expected_version doesn't match |
| NAMESPACE_RESERVED | Cannot modify reserved namespace |
B. Scan Limits
| Constant | Value | Purpose |
|---|---|---|
| DEFAULT_SCAN_LIMIT | 100 | Default entries per scan |
| MAX_SCAN_LIMIT | 10000 | Maximum entries per scan |
C. Port Assignments
| Port | Protocol | Purpose |
|---|---|---|
| 50051 | gRPC | KV API (flaredb-server) |
| 2379 | gRPC | PD API (flaredb-pd) |
D. Raft Service (Internal)
service RaftService {
rpc VoteV2(OpenRaftVoteRequest) returns (OpenRaftVoteResponse);
rpc AppendEntriesV2(OpenRaftAppendEntriesRequest) returns (OpenRaftAppendEntriesResponse);
rpc InstallSnapshotV2(OpenRaftSnapshotRequest) returns (OpenRaftSnapshotResponse);
rpc ForwardEventual(ForwardEventualRequest) returns (RaftResponse);
}