# FlareDB Specification

> Version: 1.0 | Status: Draft | Last Updated: 2025-12-08

## 1. Overview

### 1.1 Purpose

FlareDB is a distributed key-value store designed for DBaaS (Database as a Service) workloads. It provides dual consistency modes: eventual consistency with LWW (Last-Write-Wins) semantics for high throughput, and strong consistency via Raft for transactional operations.

### 1.2 Scope

- **In scope**: Multi-region KV storage, dual consistency modes, CAS operations, TSO (Timestamp Oracle), namespace isolation
- **Out of scope**: SQL queries (layer above), secondary indexes, full-text search

### 1.3 Design Goals

- TiKV-inspired multi-Raft architecture
- Tsurugi-like high performance
- Flexible per-namespace consistency modes
- Horizontal scalability via region splitting

## 2. Architecture

### 2.1 Crate Structure

```
flaredb/
├── crates/
│   ├── flaredb-cli/      # CLI tool (flaredb-cli)
│   ├── flaredb-client/   # Rust client library
│   ├── flaredb-pd/       # Placement Driver server
│   ├── flaredb-proto/    # gRPC definitions (proto files)
│   ├── flaredb-raft/     # OpenRaft integration, Multi-Raft
│   ├── flaredb-server/   # KV server binary, services
│   ├── flaredb-storage/  # RocksDB engine
│   └── flaredb-types/    # Shared types (RegionMeta, commands)
└── proto/ (symlink to flaredb-proto/src/)
```

### 2.2 Data Flow

```
[Client] → [KvRaw/KvCas Service] → [Namespace Router]
                                           ↓
                     [Eventual Mode]            [Strong Mode]
                           ↓                         ↓
                    [Local RocksDB]           [Raft Consensus]
                 [Async Replication]                 ↓
                                        [State Machine → RocksDB]
```

### 2.3 Multi-Raft Architecture

```
┌─────────────────────────────────────────────────────┐
│                     PD Cluster                      │
│     (Region metadata, TSO, Store registration)      │
└─────────────────────────────────────────────────────┘
        ↓                 ↓                 ↓
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    Store 1    │ │    Store 2    │ │    Store 3    │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Region 1  │ │ │ │ Region 1  │ │ │ │ Region 1  │ │
│ │ (Leader)  │ │ │ │ (Follower)│ │ │ │ (Follower)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Region 2  │ │ │ │ Region 2  │ │ │ │ Region 2  │ │
│ │ (Follower)│ │ │ │ (Leader)  │ │ │ │ (Follower)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘ └───────────────┘
```

### 2.4 Dependencies

| Crate | Version | Purpose |
|-------|---------|---------|
| tokio | 1.40 | Async runtime |
| tonic | 0.12 | gRPC framework |
| openraft | 0.9 | Raft consensus |
| rocksdb | 0.24 | Storage engine |
| prost | 0.13 | Protocol buffers |
| clap | 4.5 | CLI argument parsing |
| sha2 | 0.10 | Merkle tree hashing |

## 3. API

### 3.1 gRPC Services

#### KvRaw Service (Eventual Consistency)

```protobuf
service KvRaw {
  rpc RawPut(RawPutRequest) returns (RawPutResponse);
  rpc RawGet(RawGetRequest) returns (RawGetResponse);
  rpc RawScan(RawScanRequest) returns (RawScanResponse);
}

message RawPutRequest {
  bytes key = 1;
  bytes value = 2;
  string namespace = 3; // Empty = default namespace
}

message RawScanRequest {
  bytes start_key = 1; // Inclusive
  bytes end_key = 2;   // Exclusive (empty = no upper bound)
  uint32 limit = 3;    // Max entries (0 = default 100)
  string namespace = 4;
}

message RawScanResponse {
  repeated bytes keys = 1;
  repeated bytes values = 2;
  bool has_more = 3;
  bytes next_key = 4; // For pagination
}
```

#### KvCas Service (Strong Consistency)

```protobuf
service KvCas {
  rpc CompareAndSwap(CasRequest) returns (CasResponse);
  rpc Get(GetRequest) returns (GetResponse);
  rpc Scan(ScanRequest) returns (ScanResponse);
}

message CasRequest {
  bytes key = 1;
  bytes value = 2;
  uint64 expected_version = 3; // 0 = create if not exists
  string namespace = 4;
}

message CasResponse {
  bool success = 1;
  uint64 current_version = 2;
  uint64 new_version = 3;
}

message GetResponse {
  bool found = 1;
  bytes value = 2;
  uint64 version = 3;
}

message ScanResponse {
  repeated VersionedKv entries = 1;
  bool has_more = 2;
  bytes next_key = 3;
}

message VersionedKv {
  bytes key = 1;
  bytes value = 2;
  uint64 version = 3;
}
```

#### PD Service (Placement Driver)

```protobuf
service Pd {
  rpc RegisterStore(RegisterStoreRequest) returns (RegisterStoreResponse);
  rpc GetRegion(GetRegionRequest) returns (GetRegionResponse);
  rpc ListRegions(ListRegionsRequest) returns (ListRegionsResponse);
}

service Tso {
  rpc GetTimestamp(TsoRequest) returns (TsoResponse);
}

message Region {
  uint64 id = 1;
  bytes start_key = 2; // Inclusive (empty = start of keyspace)
  bytes end_key = 3;   // Exclusive (empty = infinity)
  repeated uint64 peers = 4;
  uint64 leader_id = 5;
}

message Store {
  uint64 id = 1;
  string addr = 2;
}
```

### 3.2 Client Library

```rust
use flaredb_client::RdbClient;

// Connect with PD for region routing
let mut client = RdbClient::connect_with_pd(
    "127.0.0.1:50051", // KV server (unused, routing via PD)
    "127.0.0.1:2379",  // PD server
).await?;

// Or with namespace isolation
let mut client = RdbClient::connect_with_pd_namespace(
    "127.0.0.1:50051",
    "127.0.0.1:2379",
    "my_namespace",
).await?;

// TSO (Timestamp Oracle)
let ts = client.get_tso().await?;

// Raw API (Eventual Consistency)
client.raw_put(b"key".to_vec(), b"value".to_vec()).await?;
let value = client.raw_get(b"key".to_vec()).await?; // Option<Vec<u8>>

// CAS API (Strong Consistency)
let (success, current, new_ver) = client.cas(
    b"key".to_vec(),
    b"value".to_vec(),
    0, // expected_version: 0 = create if not exists
).await?;
let entry = client.cas_get(b"key".to_vec()).await?; // Option<(version, value)>

// Scan with pagination
let (entries, next_key) = client.cas_scan(
    b"start".to_vec(),
    b"end".to_vec(),
    100, // limit
).await?;
```

## 4. Data Models

### 4.1 Core Types

#### Region Metadata

```rust
pub struct RegionMeta {
    pub id: u64,
    pub start_key: Vec<u8>, // Inclusive, empty = start of keyspace
    pub end_key: Vec<u8>,   // Exclusive, empty = infinity
}
```

#### Namespace Configuration

```rust
pub struct NamespaceConfig {
    pub id: u32,
    pub name: String,
    pub mode: ConsistencyMode,
    pub explicit: bool, // User-defined vs auto-created
}

pub enum ConsistencyMode {
    Strong,   // CAS API, Raft consensus
    Eventual, // Raw API, LWW replication
}
```

#### Raft Log Entry

```rust
pub enum FlareRequest {
    KvWrite {
        namespace_id: u32,
        key: Vec<u8>,
        value: Vec<u8>,
        ts: u64,
    },
    Split {
        region_id: u64,
        split_key: Vec<u8>,
        new_region_id: u64,
    },
    Noop,
}
```

### 4.2 Key Encoding

```
Raw/CAS Key Format:
┌───────────────────┬────────────────┐
│ namespace_id (4B) │ user_key (var) │
│    big-endian     │                │
└───────────────────┴────────────────┘

Raw Value Format:
┌────────────────┬──────────────────┐
│ timestamp (8B) │ user_value (var) │
│   big-endian   │                  │
└────────────────┴──────────────────┘

CAS Value Format:
┌──────────────┬────────────────┬──────────────────┐
│ version (8B) │ timestamp (8B) │ user_value (var) │
│  big-endian  │   big-endian   │                  │
└──────────────┴────────────────┴──────────────────┘
```

### 4.3 Reserved Namespaces

| Namespace | Mode | Purpose |
|-----------|------|---------|
| iam | Strong | IAM data (principals, roles) |
| metrics | Strong | System metrics |
| _system | Strong | Internal metadata |

### 4.4 Storage Format

- **Engine**: RocksDB
- **Column Families**:
  - `default`: Raw KV data
  - `cas`: Versioned CAS data
  - `raft_log`: Raft log entries
  - `raft_state`: Raft metadata (hard_state, vote)
- **Serialization**: Protocol Buffers

## 5. Configuration

### 5.1 Namespace Configuration

```rust
ServerConfig {
    namespaces: HashMap<String, NamespaceConfig>,
    default_mode: ConsistencyMode, // For auto-created namespaces
    reserved_namespaces: ["iam", "metrics", "_system"],
}
```

### 5.2 Raft Configuration

```rust
Config {
    heartbeat_interval: 100,    // ms
    election_timeout_min: 300,  // ms
    election_timeout_max: 600,  // ms
    snapshot_policy: LogsSinceLast(1000),
    max_in_snapshot_log_to_keep: 100,
}
```

### 5.3 CLI Arguments

```bash
flaredb-server [OPTIONS]
  --store-id <ID>          Store ID (default: 1)
  --addr <ADDR>            KV server address (default: 127.0.0.1:50051)
  --data-dir <DIR>         Data directory (default: data)
  --pd-addr <ADDR>         PD server address (default: 127.0.0.1:2379)
  --peer <ADDR>            Peer addresses (repeatable)
  --namespace-mode <SPEC>  Namespace modes (repeatable, e.g., myns=eventual)

flaredb-pd [OPTIONS]
  --addr <ADDR>            PD server address (default: 127.0.0.1:2379)
```

## 6. Consistency Models

### 6.1 Eventual Consistency (Raw API)

- **Write Path**: Local RocksDB → Async Raft replication
- **Read Path**: Local RocksDB read
- **Conflict Resolution**: Last-Write-Wins (LWW) using TSO timestamps
- **Guarantees**: Eventually consistent, high throughput

```
Write: Client → Local Store → RocksDB
                    ↓ (async)
          Raft Replication → Other Stores
```

### 6.2 Strong Consistency (CAS API)

- **Write Path**: Raft consensus → Apply to state machine
- **Read Path**: `ensure_linearizable()` → Leader read
- **Guarantees**: Linearizable reads and writes

```
Write: Client → Leader → Raft Consensus → All Stores
Read:  Client → Leader → Verify leadership → Return
```

### 6.3 TSO (Timestamp Oracle)

```
64-bit timestamp format:
┌──────────────────────────┬───────────────────┐
│  Physical time (48 bits) │ Logical (16 bits) │
│ milliseconds since epoch │      0-65535      │
└──────────────────────────┴───────────────────┘
```

```rust
impl TsoOracle {
    fn get_timestamp(count: u32) -> u64;
    fn physical_time(ts: u64) -> u64;   // Upper 48 bits
    fn logical_counter(ts: u64) -> u16; // Lower 16 bits
    fn compose(physical: u64, logical: u16) -> u64;
}
```
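The 48-bit/16-bit split above can be sketched with plain bit operations. This is an illustrative standalone version of the `compose`/`physical_time`/`logical_counter` helpers, not the actual flaredb-server implementation (which is atomic and stateful):

```rust
// Illustrative sketch of the 48/16 TSO timestamp layout.
// Upper 48 bits: physical milliseconds; lower 16 bits: logical counter.

fn compose(physical: u64, logical: u16) -> u64 {
    (physical << 16) | logical as u64
}

fn physical_time(ts: u64) -> u64 {
    ts >> 16 // upper 48 bits
}

fn logical_counter(ts: u64) -> u16 {
    (ts & 0xFFFF) as u16 // lower 16 bits
}

fn main() {
    let ts = compose(1_733_600_000_000, 42);
    assert_eq!(physical_time(ts), 1_733_600_000_000);
    assert_eq!(logical_counter(ts), 42);
    // Ordering: physical time dominates, logical counter breaks ties,
    // so timestamps compare correctly as plain u64s.
    assert!(compose(1, 65535) < compose(2, 0));
    println!("ok");
}
```

Because the logical counter occupies the low 16 bits, a single millisecond can hand out 65,536 distinct timestamps while preserving total order under ordinary `u64` comparison.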
**Properties**:

- Monotonically increasing
- Thread-safe (AtomicU64)
- Batch allocation support
- ~65,536 timestamps per millisecond

## 7. Anti-Entropy & Replication

### 7.1 Merkle Tree Synchronization

For eventual consistency mode, FlareDB uses Merkle trees for anti-entropy:

```protobuf
rpc GetMerkle(GetMerkleRequest) returns (GetMerkleResponse);
rpc FetchRange(FetchRangeRequest) returns (FetchRangeResponse);

message GetMerkleResponse {
  bytes root = 1;                  // sha256 root hash
  repeated MerkleRange leaves = 2; // per-chunk hashes
}
```

**Anti-entropy flow**:

1. Replica requests Merkle root from leader
2. Compare leaf hashes to identify divergent ranges
3. Fetch divergent ranges via `FetchRange`
4. Apply LWW merge using timestamps

### 7.2 Chainfire Integration

FlareDB integrates with Chainfire as its Placement Driver backend:

- Store registration and heartbeat
- Region metadata watch notifications
- Leader reporting for region routing

```rust
// Server connects to Chainfire PD
let mut pd_client = PdClient::connect(pd_addr).await?;
pd_client.register_store(store_id, addr).await?;
pd_client.start_watch().await?; // Watch for metadata changes
```

### 7.3 Namespace Mode Updates (Runtime)

```protobuf
rpc UpdateNamespaceMode(UpdateNamespaceModeRequest) returns (UpdateNamespaceModeResponse);
rpc ListNamespaceModes(ListNamespaceModesRequest) returns (ListNamespaceModesResponse);
```

Namespaces can be switched between `strong` and `eventual` modes at runtime (except reserved namespaces).

## 8. Operations

### 8.1 Cluster Bootstrap

**Single Node**

```bash
flaredb-server --pd-addr 127.0.0.1:2379 --data-dir ./data
# First node auto-creates region covering entire keyspace
```

**Multi-Node (3+ nodes)**

```bash
# Node 1 (bootstrap)
flaredb-server --pd-addr 127.0.0.1:2379 --bootstrap

# Nodes 2, 3
flaredb-server --pd-addr 127.0.0.1:2379 --join
# Auto-creates 3-replica Raft group
```

### 8.2 Region Operations

**Region Split**

```rust
// When region exceeds size threshold
FlareRequest::Split {
    region_id: 1,
    split_key: b"middle_key".to_vec(),
    new_region_id: 2,
}
```

**Region Discovery**

```rust
// Client queries PD for routing
let (region, leader) = pd.get_region(key).await?;
let store_addr = leader.addr;
```

### 8.3 Monitoring

- **Health**: gRPC health check service
- **Metrics** (planned):
  - `flaredb_kv_operations_total{type=raw|cas}`
  - `flaredb_region_count`
  - `flaredb_raft_proposals_total`
  - `flaredb_tso_requests_total`

## 9. Security

### 9.1 Multi-tenancy

- **Namespace isolation**: Separate keyspace per namespace
- **Reserved namespaces**: System namespaces are immutable
- **Future**: Per-namespace ACLs via IAM integration

### 9.2 Authentication

- **Current**: None (development mode)
- **Planned**: mTLS, token-based auth

## 10. Compatibility

### 10.1 API Versioning

- gRPC packages: `flaredb.kvrpc`, `flaredb.pdpb`
- Wire protocol: Protocol Buffers 3

### 10.2 TiKV Inspiration

- Multi-Raft per region (similar architecture)
- PD for metadata management
- TSO for timestamps
- **Differences**: Dual consistency modes, simpler API

## Appendix

### A. Error Codes

| Error | Meaning |
|-------|---------|
| NOT_LEADER | Node is not region leader |
| REGION_NOT_FOUND | Key not in any region |
| VERSION_MISMATCH | CAS expected_version doesn't match |
| NAMESPACE_RESERVED | Cannot modify reserved namespace |

### B. Scan Limits

| Constant | Value | Purpose |
|----------|-------|---------|
| DEFAULT_SCAN_LIMIT | 100 | Default entries per scan |
| MAX_SCAN_LIMIT | 10000 | Maximum entries per scan |

### C. Port Assignments

| Port | Protocol | Purpose |
|------|----------|---------|
| 50051 | gRPC | KV API (flaredb-server) |
| 2379 | gRPC | PD API (flaredb-pd) |

### D. Raft Service (Internal)

```protobuf
service RaftService {
  rpc VoteV2(OpenRaftVoteRequest) returns (OpenRaftVoteResponse);
  rpc AppendEntriesV2(OpenRaftAppendEntriesRequest) returns (OpenRaftAppendEntriesResponse);
  rpc InstallSnapshotV2(OpenRaftSnapshotRequest) returns (OpenRaftSnapshotResponse);
  rpc ForwardEventual(ForwardEventualRequest) returns (RaftResponse);
}
```
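The scan limits in Appendix B, together with the `RawScanRequest` rule that `limit = 0` means "use the default", imply a simple clamping step on the server. A minimal sketch (the `effective_scan_limit` helper is hypothetical, not part of the flaredb API):

```rust
// Scan-limit resolution per Appendix B. The helper name is hypothetical;
// only the two constants and the "0 = default" rule come from the spec.
const DEFAULT_SCAN_LIMIT: u32 = 100;
const MAX_SCAN_LIMIT: u32 = 10_000;

/// Resolve a client-supplied scan limit: 0 selects the default,
/// and anything above MAX_SCAN_LIMIT is clamped down.
fn effective_scan_limit(requested: u32) -> u32 {
    if requested == 0 {
        DEFAULT_SCAN_LIMIT
    } else {
        requested.min(MAX_SCAN_LIMIT)
    }
}

fn main() {
    assert_eq!(effective_scan_limit(0), 100);         // default
    assert_eq!(effective_scan_limit(500), 500);       // within bounds
    assert_eq!(effective_scan_limit(50_000), 10_000); // clamped to cap
    println!("ok");
}
```

Clamping rather than rejecting oversized limits keeps `RawScan`/`Scan` calls infallible on this axis; clients detect truncation through `has_more` and continue from `next_key`.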