# FlareDB Specification

> Version: 1.0 | Status: Draft | Last Updated: 2025-12-08

## 1. Overview

### 1.1 Purpose

FlareDB is a distributed key-value store designed for DBaaS (Database as a Service) workloads. It provides dual consistency modes: eventual consistency with LWW (Last-Write-Wins) for high throughput, and strong consistency via Raft for transactional operations.

### 1.2 Scope

- **In scope**: Multi-region KV storage, dual consistency modes, CAS operations, TSO (Timestamp Oracle), namespace isolation
- **Out of scope**: SQL queries (layer above), secondary indexes, full-text search

### 1.3 Design Goals

- TiKV-inspired multi-Raft architecture
- Tsurugi-like high performance
- Flexible per-namespace consistency modes
- Horizontal scalability via region splitting

## 2. Architecture

### 2.1 Crate Structure

```
flaredb/
├── crates/
│   ├── flaredb-cli/      # CLI tool (flaredb-cli)
│   ├── flaredb-client/   # Rust client library
│   ├── flaredb-pd/       # Placement Driver server
│   ├── flaredb-proto/    # gRPC definitions (proto files)
│   ├── flaredb-raft/     # OpenRaft integration, Multi-Raft
│   ├── flaredb-server/   # KV server binary, services
│   ├── flaredb-storage/  # RocksDB engine
│   └── flaredb-types/    # Shared types (RegionMeta, commands)
└── proto/                # Symlink to flaredb-proto/src/
```

### 2.2 Data Flow

```
[Client] → [KvRaw/KvCas Service] → [Namespace Router]
                                          ↓
                  [Eventual Mode]                  [Strong Mode]
                        ↓                                ↓
                  [Local RocksDB]                 [Raft Consensus]
               [Async Replication]                       ↓
                                           [State Machine → RocksDB]
```

### 2.3 Multi-Raft Architecture

```
┌─────────────────────────────────────────────────────┐
│                     PD Cluster                      │
│     (Region metadata, TSO, Store registration)      │
└─────────────────────────────────────────────────────┘
        ↓                  ↓                  ↓
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│    Store 1    │  │    Store 2    │  │    Store 3    │
│ ┌───────────┐ │  │ ┌───────────┐ │  │ ┌───────────┐ │
│ │ Region 1  │ │  │ │ Region 1  │ │  │ │ Region 1  │ │
│ │ (Leader)  │ │  │ │ (Follower)│ │  │ │ (Follower)│ │
│ └───────────┘ │  │ └───────────┘ │  │ └───────────┘ │
│ ┌───────────┐ │  │ ┌───────────┐ │  │ ┌───────────┐ │
│ │ Region 2  │ │  │ │ Region 2  │ │  │ │ Region 2  │ │
│ │ (Follower)│ │  │ │ (Leader)  │ │  │ │ (Follower)│ │
│ └───────────┘ │  │ └───────────┘ │  │ └───────────┘ │
└───────────────┘  └───────────────┘  └───────────────┘
```

### 2.4 Dependencies

| Crate | Version | Purpose |
|-------|---------|---------|
| tokio | 1.40 | Async runtime |
| tonic | 0.12 | gRPC framework |
| openraft | 0.9 | Raft consensus |
| rocksdb | 0.24 | Storage engine |
| prost | 0.13 | Protocol buffers |
| clap | 4.5 | CLI argument parsing |
| sha2 | 0.10 | Merkle tree hashing |

## 3. API

### 3.1 gRPC Services

#### KvRaw Service (Eventual Consistency)

```protobuf
service KvRaw {
  rpc RawPut(RawPutRequest) returns (RawPutResponse);
  rpc RawGet(RawGetRequest) returns (RawGetResponse);
  rpc RawScan(RawScanRequest) returns (RawScanResponse);
}

message RawPutRequest {
  bytes key = 1;
  bytes value = 2;
  string namespace = 3;  // Empty = default namespace
}

message RawScanRequest {
  bytes start_key = 1;   // Inclusive
  bytes end_key = 2;     // Exclusive (empty = no upper bound)
  uint32 limit = 3;      // Max entries (0 = default 100)
  string namespace = 4;
}

message RawScanResponse {
  repeated bytes keys = 1;
  repeated bytes values = 2;
  bool has_more = 3;
  bytes next_key = 4;    // For pagination
}
```

#### KvCas Service (Strong Consistency)

```protobuf
service KvCas {
  rpc CompareAndSwap(CasRequest) returns (CasResponse);
  rpc Get(GetRequest) returns (GetResponse);
  rpc Scan(ScanRequest) returns (ScanResponse);
}

message CasRequest {
  bytes key = 1;
  bytes value = 2;
  uint64 expected_version = 3;  // 0 = create if not exists
  string namespace = 4;
}

message CasResponse {
  bool success = 1;
  uint64 current_version = 2;
  uint64 new_version = 3;
}

message GetResponse {
  bool found = 1;
  bytes value = 2;
  uint64 version = 3;
}

message ScanResponse {
  repeated VersionedKv entries = 1;
  bool has_more = 2;
  bytes next_key = 3;
}

message VersionedKv {
  bytes key = 1;
  bytes value = 2;
  uint64 version = 3;
}
```

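The compare-and-swap transition implied by `CasRequest`/`CasResponse` can be sketched against an in-memory map. This is a sketch only: `Versioned` and `compare_and_swap` are illustrative names, not the server's actual types, and the real implementation applies this through the Raft state machine.

```rust
use std::collections::HashMap;

/// Versioned value, as in the `cas` column family (illustrative).
struct Versioned {
    version: u64,
    value: Vec<u8>,
}

/// Mirrors CasResponse: (success, current_version, new_version).
/// A missing key has version 0, so expected_version == 0 means
/// "create if not exists".
fn compare_and_swap(
    store: &mut HashMap<Vec<u8>, Versioned>,
    key: &[u8],
    value: Vec<u8>,
    expected_version: u64,
) -> (bool, u64, u64) {
    let current = store.get(key).map(|v| v.version).unwrap_or(0);
    if current != expected_version {
        // Version mismatch: report the current version, nothing changes.
        return (false, current, current);
    }
    let new_version = current + 1;
    store.insert(key.to_vec(), Versioned { version: new_version, value });
    (true, current, new_version)
}
```

Note how a failed CAS returns the observed `current_version`, which lets the client re-read and retry without an extra `Get` round-trip.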
#### PD Service (Placement Driver)

```protobuf
service Pd {
  rpc RegisterStore(RegisterStoreRequest) returns (RegisterStoreResponse);
  rpc GetRegion(GetRegionRequest) returns (GetRegionResponse);
  rpc ListRegions(ListRegionsRequest) returns (ListRegionsResponse);
}

service Tso {
  rpc GetTimestamp(TsoRequest) returns (TsoResponse);
}

message Region {
  uint64 id = 1;
  bytes start_key = 2;  // Inclusive (empty = start of keyspace)
  bytes end_key = 3;    // Exclusive (empty = infinity)
  repeated uint64 peers = 4;
  uint64 leader_id = 5;
}

message Store {
  uint64 id = 1;
  string addr = 2;
}
```

### 3.2 Client Library

```rust
use flaredb_client::RdbClient;

// Connect with PD for region routing
let mut client = RdbClient::connect_with_pd(
    "127.0.0.1:50051", // KV server (unused, routing via PD)
    "127.0.0.1:2379",  // PD server
).await?;

// Or with namespace isolation
let mut client = RdbClient::connect_with_pd_namespace(
    "127.0.0.1:50051",
    "127.0.0.1:2379",
    "my_namespace",
).await?;

// TSO (Timestamp Oracle)
let ts = client.get_tso().await?;

// Raw API (eventual consistency)
client.raw_put(b"key".to_vec(), b"value".to_vec()).await?;
let value = client.raw_get(b"key".to_vec()).await?; // Option<Vec<u8>>

// CAS API (strong consistency)
let (success, current, new_ver) = client.cas(
    b"key".to_vec(),
    b"value".to_vec(),
    0, // expected_version: 0 = create if not exists
).await?;

let entry = client.cas_get(b"key".to_vec()).await?; // Option<(version, value)>

// Scan with pagination
let (entries, next_key) = client.cas_scan(
    b"start".to_vec(),
    b"end".to_vec(),
    100, // limit
).await?;
```

## 4. Data Models

### 4.1 Core Types

#### Region Metadata

```rust
pub struct RegionMeta {
    pub id: u64,
    pub start_key: Vec<u8>, // Inclusive, empty = start of keyspace
    pub end_key: Vec<u8>,   // Exclusive, empty = infinity
}
```

#### Namespace Configuration

```rust
pub struct NamespaceConfig {
    pub id: u32,
    pub name: String,
    pub mode: ConsistencyMode,
    pub explicit: bool, // User-defined vs auto-created
}

pub enum ConsistencyMode {
    Strong,   // CAS API, Raft consensus
    Eventual, // Raw API, LWW replication
}
```

#### Raft Log Entry

```rust
pub enum FlareRequest {
    KvWrite {
        namespace_id: u32,
        key: Vec<u8>,
        value: Vec<u8>,
        ts: u64,
    },
    Split {
        region_id: u64,
        split_key: Vec<u8>,
        new_region_id: u64,
    },
    Noop,
}
```

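Applying a `Split` entry divides a region's key range at `split_key`: the original region keeps `[start_key, split_key)` and the new region takes `[split_key, end_key)`. A minimal sketch over the `RegionMeta` type from §4.1 (the `apply_split` helper is illustrative, not the server's API):

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct RegionMeta {
    pub id: u64,
    pub start_key: Vec<u8>, // Inclusive, empty = start of keyspace
    pub end_key: Vec<u8>,   // Exclusive, empty = infinity
}

/// Split `region` at `split_key`. The original region shrinks to
/// [start_key, split_key); the returned new region covers
/// [split_key, old end_key).
fn apply_split(region: &mut RegionMeta, split_key: Vec<u8>, new_region_id: u64) -> RegionMeta {
    let new_region = RegionMeta {
        id: new_region_id,
        start_key: split_key.clone(),
        // Take the old end_key (possibly empty = infinity) for the new region.
        end_key: std::mem::take(&mut region.end_key),
    };
    region.end_key = split_key;
    new_region
}
```

Because both halves are derived from the old range, the keyspace stays a partition: every key still falls in exactly one region.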
### 4.2 Key Encoding

```
Raw/CAS Key Format:
┌──────────────────┬────────────────────────┐
│ namespace_id (4B)│ user_key (var)         │
│ big-endian       │                        │
└──────────────────┴────────────────────────┘

Raw Value Format:
┌──────────────────┬────────────────────────┐
│ timestamp (8B)   │ user_value (var)       │
│ big-endian       │                        │
└──────────────────┴────────────────────────┘

CAS Value Format:
┌──────────────────┬──────────────────┬────────────────────────┐
│ version (8B)     │ timestamp (8B)   │ user_value (var)       │
│ big-endian       │ big-endian       │                        │
└──────────────────┴──────────────────┴────────────────────────┘
```

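The layouts above can be sketched directly with `to_be_bytes`; the big-endian namespace prefix keeps each namespace's keys contiguous under RocksDB's lexicographic key order. Function names here are illustrative, not the `flaredb-storage` API:

```rust
/// Data key: 4-byte big-endian namespace_id, then the user key.
fn encode_key(namespace_id: u32, user_key: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + user_key.len());
    out.extend_from_slice(&namespace_id.to_be_bytes());
    out.extend_from_slice(user_key);
    out
}

/// Raw value: 8-byte big-endian TSO timestamp, then the user value.
fn encode_raw_value(ts: u64, user_value: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + user_value.len());
    out.extend_from_slice(&ts.to_be_bytes());
    out.extend_from_slice(user_value);
    out
}

/// Decode a Raw value back into (timestamp, user_value).
fn decode_raw_value(raw: &[u8]) -> (u64, &[u8]) {
    let ts = u64::from_be_bytes(raw[..8].try_into().unwrap());
    (ts, &raw[8..])
}
```

Storing the timestamp inside the value (rather than the key) is what lets the LWW merge in §6.1 compare versions without a second lookup.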
### 4.3 Reserved Namespaces

| Namespace | Mode | Purpose |
|-----------|------|---------|
| iam | Strong | IAM data (principals, roles) |
| metrics | Strong | System metrics |
| _system | Strong | Internal metadata |

### 4.4 Storage Format

- **Engine**: RocksDB
- **Column Families**:
  - `default`: Raw KV data
  - `cas`: Versioned CAS data
  - `raft_log`: Raft log entries
  - `raft_state`: Raft metadata (hard_state, vote)
- **Serialization**: Protocol Buffers

## 5. Configuration

### 5.1 Namespace Configuration

```rust
ServerConfig {
    namespaces: HashMap<String, NamespaceConfig>,
    default_mode: ConsistencyMode, // For auto-created namespaces
    reserved_namespaces: ["iam", "metrics", "_system"],
}
```

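How a request's namespace maps to a consistency mode under this config can be sketched as follows (the `resolve_mode` helper is hypothetical; reserved namespaces are pinned to `Strong` per §4.3, configured namespaces use their own mode, and anything else is auto-created with `default_mode`):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum ConsistencyMode {
    Strong,
    Eventual,
}

const RESERVED_NAMESPACES: [&str; 3] = ["iam", "metrics", "_system"];

/// Resolve the mode for a namespace name: reserved names are always
/// Strong, explicitly configured names use their configured mode,
/// and unknown names fall back to the server's default_mode.
fn resolve_mode(
    namespaces: &HashMap<String, ConsistencyMode>,
    default_mode: ConsistencyMode,
    name: &str,
) -> ConsistencyMode {
    if RESERVED_NAMESPACES.contains(&name) {
        return ConsistencyMode::Strong;
    }
    namespaces.get(name).copied().unwrap_or(default_mode)
}
```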
### 5.2 Raft Configuration

```rust
Config {
    heartbeat_interval: 100,    // ms
    election_timeout_min: 300,  // ms
    election_timeout_max: 600,  // ms
    snapshot_policy: LogsSinceLast(1000),
    max_in_snapshot_log_to_keep: 100,
}
```

### 5.3 CLI Arguments

```bash
flaredb-server [OPTIONS]
  --store-id <ID>             Store ID (default: 1)
  --addr <ADDR>               KV server address (default: 127.0.0.1:50051)
  --data-dir <PATH>           Data directory (default: data)
  --pd-addr <ADDR>            PD server address (default: 127.0.0.1:2379)
  --peer <ID=ADDR>            Peer addresses (repeatable)
  --namespace-mode <NS=MODE>  Namespace modes (repeatable, e.g., myns=eventual)

flaredb-pd [OPTIONS]
  --addr <ADDR>               PD server address (default: 127.0.0.1:2379)
```

## 6. Consistency Models

### 6.1 Eventual Consistency (Raw API)

- **Write Path**: Local RocksDB → Async Raft replication
- **Read Path**: Local RocksDB read
- **Conflict Resolution**: Last-Write-Wins (LWW) using TSO timestamps
- **Guarantees**: Eventually consistent, high throughput

```
Write: Client → Local Store → RocksDB
                     ↓ (async)
           Raft Replication → Other Stores
```

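The LWW conflict resolution can be sketched as a timestamp comparison at apply time: a replicated write is only applied if its TSO timestamp is newer than what the local store holds. A minimal sketch over an in-memory map (`lww_apply` is an illustrative name; ties keep the existing value here, whereas a real implementation needs a deterministic tie-break, e.g. by store ID):

```rust
use std::collections::HashMap;

/// A replicated write carrying its TSO timestamp (cf. the Raw value
/// format in §4.2).
struct TimestampedValue {
    ts: u64,
    value: Vec<u8>,
}

/// Apply an incoming replicated write under Last-Write-Wins.
/// Returns true if the write was applied, false if it was stale.
fn lww_apply(
    store: &mut HashMap<Vec<u8>, TimestampedValue>,
    key: Vec<u8>,
    incoming: TimestampedValue,
) -> bool {
    match store.get(&key) {
        // Older (or equal) timestamp than what we hold: drop it.
        Some(existing) if existing.ts >= incoming.ts => false,
        _ => {
            store.insert(key, incoming);
            true
        }
    }
}
```

Because every store applies the same rule, replicas converge to the write with the highest timestamp regardless of delivery order.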
### 6.2 Strong Consistency (CAS API)

- **Write Path**: Raft consensus → Apply to state machine
- **Read Path**: `ensure_linearizable()` → Leader read
- **Guarantees**: Linearizable reads and writes

```
Write: Client → Leader → Raft Consensus → All Stores
Read:  Client → Leader → Verify leadership → Return
```

### 6.3 TSO (Timestamp Oracle)

```
64-bit timestamp format:
┌────────────────────────────────┬──────────────────┐
│ Physical time (48 bits)        │ Logical (16 bits)│
│ milliseconds since epoch       │ 0-65535          │
└────────────────────────────────┴──────────────────┘
```

```rust
impl TsoOracle {
    fn get_timestamp(count: u32) -> u64;
    fn physical_time(ts: u64) -> u64;    // Upper 48 bits
    fn logical_counter(ts: u64) -> u16;  // Lower 16 bits
    fn compose(physical: u64, logical: u16) -> u64;
}
```

**Properties**:

- Monotonically increasing
- Thread-safe (AtomicU64)
- Batch allocation support
- Up to 65,536 timestamps per millisecond

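The bit layout and the AtomicU64-based batch allocation can be sketched as below. This is a minimal sketch, not the `flaredb-pd` implementation: it omits advancing the physical component from the wall clock, relying instead on the fact that a logical overflow simply carries into the physical bits (i.e. reads as +1 ms):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Pack 48 bits of physical milliseconds and a 16-bit logical counter.
fn compose(physical_ms: u64, logical: u16) -> u64 {
    (physical_ms << 16) | logical as u64
}

fn physical_time(ts: u64) -> u64 {
    ts >> 16 // Upper 48 bits
}

fn logical_counter(ts: u64) -> u16 {
    (ts & 0xFFFF) as u16 // Lower 16 bits
}

/// Minimal monotonic oracle: one fetch_add reserves `count`
/// consecutive timestamps (batch allocation).
struct TsoOracle {
    last: AtomicU64,
}

impl TsoOracle {
    fn new(physical_ms: u64) -> Self {
        Self { last: AtomicU64::new(compose(physical_ms, 0)) }
    }

    /// Returns the first of `count` newly reserved timestamps.
    fn get_timestamp(&self, count: u32) -> u64 {
        self.last.fetch_add(count as u64, Ordering::SeqCst) + 1
    }
}
```

Batch allocation is what makes the oracle cheap under load: a client that needs many timestamps pays one atomic operation, not one per timestamp.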
## 7. Anti-Entropy & Replication

### 7.1 Merkle Tree Synchronization

For eventual consistency mode, FlareDB uses Merkle trees for anti-entropy:

```protobuf
rpc GetMerkle(GetMerkleRequest) returns (GetMerkleResponse);
rpc FetchRange(FetchRangeRequest) returns (FetchRangeResponse);

message GetMerkleResponse {
  bytes root = 1;                  // sha256 root hash
  repeated MerkleRange leaves = 2; // Per-chunk hashes
}
```

**Anti-entropy flow**:

1. Replica requests the Merkle root from the leader
2. Compare leaf hashes to identify divergent ranges
3. Fetch divergent ranges via `FetchRange`
4. Apply LWW merge using timestamps

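Step 2 of the flow above can be sketched as a leaf-by-leaf comparison. This sketch makes two assumptions: `MerkleRange` is modeled as a start key plus a hash (the spec does not define its fields), and std's `DefaultHasher` stands in for the sha256 hashing the spec does with the `sha2` crate:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Per-chunk range hash, loosely modeling GetMerkleResponse.leaves.
#[derive(Clone, Debug)]
struct MerkleRange {
    start_key: Vec<u8>,
    hash: u64,
}

/// Hash one chunk of (key, value) entries. DefaultHasher stands in
/// for sha256 here.
fn hash_chunk(entries: &[(Vec<u8>, Vec<u8>)]) -> u64 {
    let mut h = DefaultHasher::new();
    entries.hash(&mut h);
    h.finish()
}

/// Compare local leaves against the leader's and return the start keys
/// of ranges that must be re-fetched via FetchRange.
fn divergent_ranges(local: &[MerkleRange], leader: &[MerkleRange]) -> Vec<Vec<u8>> {
    leader
        .iter()
        .filter(|l| {
            local
                .iter()
                .find(|m| m.start_key == l.start_key)
                // Range missing locally, or hash mismatch: divergent.
                .map_or(true, |m| m.hash != l.hash)
        })
        .map(|l| l.start_key.clone())
        .collect()
}
```

Only the divergent ranges are transferred, so a mostly in-sync replica pays bandwidth proportional to its drift, not to the keyspace size.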
### 7.2 Chainfire Integration

FlareDB integrates with Chainfire as its Placement Driver backend:

- Store registration and heartbeat
- Region metadata watch notifications
- Leader reporting for region routing

```rust
// Server connects to the Chainfire PD
let pd_client = PdClient::connect(pd_addr).await?;
pd_client.register_store(store_id, addr).await?;
pd_client.start_watch().await?; // Watch for metadata changes
```

### 7.3 Namespace Mode Updates (Runtime)

```protobuf
rpc UpdateNamespaceMode(UpdateNamespaceModeRequest) returns (UpdateNamespaceModeResponse);
rpc ListNamespaceModes(ListNamespaceModesRequest) returns (ListNamespaceModesResponse);
```

Namespaces can be switched between `strong` and `eventual` modes at runtime (except reserved namespaces).

## 8. Operations

### 8.1 Cluster Bootstrap

**Single Node**

```bash
flaredb-server --pd-addr 127.0.0.1:2379 --data-dir ./data
# First node auto-creates a region covering the entire keyspace
```

**Multi-Node (3+ nodes)**

```bash
# Node 1 (bootstrap)
flaredb-server --pd-addr 127.0.0.1:2379 --bootstrap

# Nodes 2, 3
flaredb-server --pd-addr 127.0.0.1:2379 --join
# Auto-creates a 3-replica Raft group
```

### 8.2 Region Operations

**Region Split**

```rust
// Proposed when a region exceeds the size threshold
FlareRequest::Split {
    region_id: 1,
    split_key: b"middle_key".to_vec(),
    new_region_id: 2,
}
```

**Region Discovery**

```rust
// Client queries PD for routing
let (region, leader) = pd.get_region(key).await?;
let store_addr = leader.addr;
```

### 8.3 Monitoring

- **Health**: gRPC health check service
- **Metrics** (planned):
  - `flaredb_kv_operations_total{type=raw|cas}`
  - `flaredb_region_count`
  - `flaredb_raft_proposals_total`
  - `flaredb_tso_requests_total`

## 9. Security

### 9.1 Multi-tenancy

- **Namespace isolation**: Separate keyspace per namespace
- **Reserved namespaces**: System namespaces are immutable
- **Future**: Per-namespace ACLs via IAM integration

### 9.2 Authentication

- **Current**: None (development mode)
- **Planned**: mTLS, token-based auth

## 10. Compatibility

### 10.1 API Versioning

- gRPC packages: `flaredb.kvrpc`, `flaredb.pdpb`
- Wire protocol: Protocol Buffers 3

### 10.2 TiKV Inspiration

- Multi-Raft per region (similar architecture)
- PD for metadata management
- TSO for timestamps
- **Different**: Dual consistency modes, simpler API

## Appendix

### A. Error Codes

| Error | Meaning |
|-------|---------|
| NOT_LEADER | Node is not the region leader |
| REGION_NOT_FOUND | Key not in any region |
| VERSION_MISMATCH | CAS expected_version doesn't match |
| NAMESPACE_RESERVED | Cannot modify a reserved namespace |

### B. Scan Limits

| Constant | Value | Purpose |
|----------|-------|---------|
| DEFAULT_SCAN_LIMIT | 100 | Default entries per scan |
| MAX_SCAN_LIMIT | 10000 | Maximum entries per scan |

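These two constants interact with the `limit` field of scan requests (`0 = default 100`, §3.1). A one-function sketch of the resolution, with an illustrative helper name:

```rust
const DEFAULT_SCAN_LIMIT: u32 = 100;
const MAX_SCAN_LIMIT: u32 = 10_000;

/// Resolve the effective limit for a scan request: 0 selects the
/// default, and anything larger than the maximum is clamped.
fn effective_scan_limit(requested: u32) -> u32 {
    if requested == 0 {
        DEFAULT_SCAN_LIMIT
    } else {
        requested.min(MAX_SCAN_LIMIT)
    }
}
```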
### C. Port Assignments

| Port | Protocol | Purpose |
|------|----------|---------|
| 50051 | gRPC | KV API (flaredb-server) |
| 2379 | gRPC | PD API (flaredb-pd) |

### D. Raft Service (Internal)

```protobuf
service RaftService {
  rpc VoteV2(OpenRaftVoteRequest) returns (OpenRaftVoteResponse);
  rpc AppendEntriesV2(OpenRaftAppendEntriesRequest) returns (OpenRaftAppendEntriesResponse);
  rpc InstallSnapshotV2(OpenRaftSnapshotRequest) returns (OpenRaftSnapshotResponse);
  rpc ForwardEventual(ForwardEventualRequest) returns (RaftResponse);
}
```