
FlareDB Specification

Version: 1.0 | Status: Draft | Last Updated: 2025-12-08

1. Overview

1.1 Purpose

FlareDB is a distributed key-value store designed for DBaaS (Database as a Service) workloads. It provides dual consistency modes: eventual consistency with LWW (Last-Write-Wins) for high throughput, and strong consistency via Raft for transactional operations.

1.2 Scope

  • In scope: Multi-region KV storage, dual consistency modes, CAS operations, TSO (Timestamp Oracle), namespace isolation
  • Out of scope: SQL queries (layer above), secondary indexes, full-text search

1.3 Design Goals

  • TiKV-inspired multi-Raft architecture
  • Tsurugi-like high performance
  • Flexible per-namespace consistency modes
  • Horizontal scalability via region splitting

2. Architecture

2.1 Crate Structure

flaredb/
├── crates/
│   ├── flaredb-cli/      # CLI tool (flaredb-cli)
│   ├── flaredb-client/   # Rust client library
│   ├── flaredb-pd/       # Placement Driver server
│   ├── flaredb-proto/    # gRPC definitions (proto files)
│   ├── flaredb-raft/     # OpenRaft integration, Multi-Raft
│   ├── flaredb-server/   # KV server binary, services
│   ├── flaredb-storage/  # RocksDB engine
│   └── flaredb-types/    # Shared types (RegionMeta, commands)
└── proto/ (symlink to flaredb-proto/src/)

2.2 Data Flow

[Client] → [KvRaw/KvCas Service] → [Namespace Router]
                                          ↓
           [Eventual Mode]          [Strong Mode]
                 ↓                       ↓
        [Local RocksDB]          [Raft Consensus]
        [Async Replication]            ↓
                              [State Machine → RocksDB]

2.3 Multi-Raft Architecture

┌─────────────────────────────────────────────────────┐
│                     PD Cluster                       │
│  (Region metadata, TSO, Store registration)          │
└─────────────────────────────────────────────────────┘
        ↓                ↓                ↓
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│   Store 1     │ │   Store 2     │ │   Store 3     │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Region 1  │ │ │ │ Region 1  │ │ │ │ Region 1  │ │
│ │ (Leader)  │ │ │ │ (Follower)│ │ │ │ (Follower)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
│ ┌───────────┐ │ │ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Region 2  │ │ │ │ Region 2  │ │ │ │ Region 2  │ │
│ │ (Follower)│ │ │ │ (Leader)  │ │ │ │ (Follower)│ │
│ └───────────┘ │ │ └───────────┘ │ │ └───────────┘ │
└───────────────┘ └───────────────┘ └───────────────┘

2.4 Dependencies

| Crate    | Version | Purpose              |
|----------|---------|----------------------|
| tokio    | 1.40    | Async runtime        |
| tonic    | 0.12    | gRPC framework       |
| openraft | 0.9     | Raft consensus       |
| rocksdb  | 0.24    | Storage engine       |
| prost    | 0.13    | Protocol buffers     |
| clap     | 4.5     | CLI argument parsing |
| sha2     | 0.10    | Merkle tree hashing  |

3. API

3.1 gRPC Services

KvRaw Service (Eventual Consistency)

service KvRaw {
  rpc RawPut(RawPutRequest) returns (RawPutResponse);
  rpc RawGet(RawGetRequest) returns (RawGetResponse);
  rpc RawScan(RawScanRequest) returns (RawScanResponse);
}

message RawPutRequest {
  bytes key = 1;
  bytes value = 2;
  string namespace = 3;  // Empty = default namespace
}

message RawScanRequest {
  bytes start_key = 1;   // Inclusive
  bytes end_key = 2;     // Exclusive (empty = no upper bound)
  uint32 limit = 3;      // Max entries (0 = default 100)
  string namespace = 4;
}

message RawScanResponse {
  repeated bytes keys = 1;
  repeated bytes values = 2;
  bool has_more = 3;
  bytes next_key = 4;    // For pagination
}
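The scan contract above (inclusive `start_key`, exclusive `end_key`, `next_key` for pagination) can be sketched against an in-memory `BTreeMap`. This is an illustration of the pagination loop only, not the server's RocksDB-backed implementation:

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Simulated RawScan: start inclusive, end exclusive (empty = unbounded).
/// Returns up to `limit` entries plus the next key to resume from, if any.
fn raw_scan(
    store: &BTreeMap<Vec<u8>, Vec<u8>>,
    start: &[u8],
    end: &[u8],
    limit: usize,
) -> (Vec<(Vec<u8>, Vec<u8>)>, Option<Vec<u8>>) {
    let upper = if end.is_empty() {
        Bound::Unbounded
    } else {
        Bound::Excluded(end.to_vec())
    };
    let mut iter = store.range((Bound::Included(start.to_vec()), upper));
    let entries: Vec<_> = iter
        .by_ref()
        .take(limit)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    // If another entry remains, its key becomes next_key (has_more = true).
    let next_key = iter.next().map(|(k, _)| k.clone());
    (entries, next_key)
}

fn main() {
    let mut store = BTreeMap::new();
    for i in 0..5u8 {
        store.insert(vec![b'k', b'0' + i], vec![i]);
    }
    // Page through the whole keyspace two entries at a time.
    let mut cursor: Vec<u8> = Vec::new();
    let mut seen = Vec::new();
    loop {
        let (page, next) = raw_scan(&store, &cursor, b"", 2);
        seen.extend(page);
        match next {
            Some(k) => cursor = k,
            None => break,
        }
    }
    assert_eq!(seen.len(), 5);
    println!("scanned {} entries", seen.len());
}
```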

KvCas Service (Strong Consistency)

service KvCas {
  rpc CompareAndSwap(CasRequest) returns (CasResponse);
  rpc Get(GetRequest) returns (GetResponse);
  rpc Scan(ScanRequest) returns (ScanResponse);
}

message CasRequest {
  bytes key = 1;
  bytes value = 2;
  uint64 expected_version = 3;  // 0 = create if not exists
  string namespace = 4;
}

message CasResponse {
  bool success = 1;
  uint64 current_version = 2;
  uint64 new_version = 3;
}

message GetResponse {
  bool found = 1;
  bytes value = 2;
  uint64 version = 3;
}

message ScanResponse {
  repeated VersionedKv entries = 1;
  bool has_more = 2;
  bytes next_key = 3;
}

message VersionedKv {
  bytes key = 1;
  bytes value = 2;
  uint64 version = 3;
}
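The CAS semantics implied by `CasRequest`/`CasResponse` (`expected_version = 0` means create-if-absent; a mismatch returns the current version so the caller can re-read and retry) can be sketched over a plain `HashMap`. This is a simulation of the version protocol, not the Raft-backed server path:

```rust
use std::collections::HashMap;

/// Returns (success, current_version, new_version), mirroring CasResponse.
/// Store maps key -> (version, value); absent keys have version 0.
fn cas(
    store: &mut HashMap<Vec<u8>, (u64, Vec<u8>)>,
    key: &[u8],
    value: &[u8],
    expected_version: u64,
) -> (bool, u64, u64) {
    let current = store.get(key).map(|(v, _)| *v).unwrap_or(0);
    if current != expected_version {
        // Reject: caller sees the current version and may re-read and retry.
        return (false, current, current);
    }
    let new_version = current + 1;
    store.insert(key.to_vec(), (new_version, value.to_vec()));
    (true, current, new_version)
}

fn main() {
    let mut store = HashMap::new();
    // expected_version = 0: create if not exists.
    assert_eq!(cas(&mut store, b"k", b"v1", 0), (true, 0, 1));
    // Stale expectation is rejected; the live version comes back.
    assert_eq!(cas(&mut store, b"k", b"v2", 0), (false, 1, 1));
    // Matching version succeeds and bumps the version.
    assert_eq!(cas(&mut store, b"k", b"v2", 1), (true, 1, 2));
    println!("cas semantics ok");
}
```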

PD Service (Placement Driver)

service Pd {
  rpc RegisterStore(RegisterStoreRequest) returns (RegisterStoreResponse);
  rpc GetRegion(GetRegionRequest) returns (GetRegionResponse);
  rpc ListRegions(ListRegionsRequest) returns (ListRegionsResponse);
}

service Tso {
  rpc GetTimestamp(TsoRequest) returns (TsoResponse);
}

message Region {
  uint64 id = 1;
  bytes start_key = 2;   // Inclusive (empty = start of keyspace)
  bytes end_key = 3;     // Exclusive (empty = infinity)
  repeated uint64 peers = 4;
  uint64 leader_id = 5;
}

message Store {
  uint64 id = 1;
  string addr = 2;
}

3.2 Client Library

use flaredb_client::RdbClient;

// Connect with PD for region routing
let mut client = RdbClient::connect_with_pd(
    "127.0.0.1:50051",  // KV server (unused, routing via PD)
    "127.0.0.1:2379",   // PD server
).await?;

// Or with namespace isolation
let mut client = RdbClient::connect_with_pd_namespace(
    "127.0.0.1:50051",
    "127.0.0.1:2379",
    "my_namespace",
).await?;

// TSO (Timestamp Oracle)
let ts = client.get_tso().await?;

// Raw API (Eventual Consistency)
client.raw_put(b"key".to_vec(), b"value".to_vec()).await?;
let value = client.raw_get(b"key".to_vec()).await?;  // Option<Vec<u8>>

// CAS API (Strong Consistency)
let (success, current, new_ver) = client.cas(
    b"key".to_vec(),
    b"value".to_vec(),
    0,  // expected_version: 0 = create if not exists
).await?;

let entry = client.cas_get(b"key".to_vec()).await?;  // Option<(version, value)>

// Scan with pagination
let (entries, next_key) = client.cas_scan(
    b"start".to_vec(),
    b"end".to_vec(),
    100,  // limit
).await?;

4. Data Models

4.1 Core Types

Region Metadata

pub struct RegionMeta {
    pub id: u64,
    pub start_key: Vec<u8>,  // Inclusive, empty = start of keyspace
    pub end_key: Vec<u8>,    // Exclusive, empty = infinity
}

Namespace Configuration

pub struct NamespaceConfig {
    pub id: u32,
    pub name: String,
    pub mode: ConsistencyMode,
    pub explicit: bool,  // User-defined vs auto-created
}

pub enum ConsistencyMode {
    Strong,    // CAS API, Raft consensus
    Eventual,  // Raw API, LWW replication
}

Raft Log Entry

pub enum FlareRequest {
    KvWrite {
        namespace_id: u32,
        key: Vec<u8>,
        value: Vec<u8>,
        ts: u64,
    },
    Split {
        region_id: u64,
        split_key: Vec<u8>,
        new_region_id: u64,
    },
    Noop,
}

4.2 Key Encoding

Raw/CAS Key Format:
┌──────────────────┬────────────────────────┐
│ namespace_id (4B)│     user_key (var)     │
│   big-endian     │                        │
└──────────────────┴────────────────────────┘

Raw Value Format:
┌──────────────────┬────────────────────────┐
│  timestamp (8B)  │    user_value (var)    │
│   big-endian     │                        │
└──────────────────┴────────────────────────┘

CAS Value Format:
┌──────────────────┬──────────────────┬────────────────────────┐
│  version (8B)    │  timestamp (8B)  │    user_value (var)    │
│   big-endian     │   big-endian     │                        │
└──────────────────┴──────────────────┴────────────────────────┘
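The layouts above can be sketched directly; the big-endian namespace prefix is what keeps each namespace's keys contiguous in RocksDB's sorted keyspace. Function names here are illustrative, not the crate's actual API:

```rust
/// Storage key: 4-byte big-endian namespace_id, then the user key.
/// Big-endian ordering makes all keys of one namespace sort contiguously.
fn encode_key(namespace_id: u32, user_key: &[u8]) -> Vec<u8> {
    let mut out = namespace_id.to_be_bytes().to_vec();
    out.extend_from_slice(user_key);
    out
}

/// Raw value: 8-byte big-endian timestamp, then the user value.
fn encode_raw_value(ts: u64, user_value: &[u8]) -> Vec<u8> {
    let mut out = ts.to_be_bytes().to_vec();
    out.extend_from_slice(user_value);
    out
}

/// Decode a Raw value back into (timestamp, user_value).
fn decode_raw_value(buf: &[u8]) -> (u64, &[u8]) {
    let ts = u64::from_be_bytes(buf[..8].try_into().unwrap());
    (ts, &buf[8..])
}

fn main() {
    // Namespace prefixing keeps namespaces disjoint and ordered.
    assert!(encode_key(1, b"zzz") < encode_key(2, b"aaa"));

    let encoded = encode_raw_value(42, b"hello");
    let (ts, value) = decode_raw_value(&encoded);
    assert_eq!(ts, 42);
    assert_eq!(value, b"hello");
    println!("encoding ok");
}
```

The CAS value format works the same way with an extra 8-byte big-endian version prefix ahead of the timestamp.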

4.3 Reserved Namespaces

| Namespace | Mode   | Purpose                      |
|-----------|--------|------------------------------|
| iam       | Strong | IAM data (principals, roles) |
| metrics   | Strong | System metrics               |
| _system   | Strong | Internal metadata            |

4.4 Storage Format

  • Engine: RocksDB
  • Column Families:
    • default: Raw KV data
    • cas: Versioned CAS data
    • raft_log: Raft log entries
    • raft_state: Raft metadata (hard_state, vote)
  • Serialization: Protocol Buffers

5. Configuration

5.1 Namespace Configuration

ServerConfig {
    namespaces: HashMap<String, NamespaceConfig>,
    default_mode: ConsistencyMode,  // For auto-created namespaces
    reserved_namespaces: ["iam", "metrics", "_system"],
}
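A minimal sketch of how a namespace's mode might be resolved from this config. The precedence (reserved names always Strong per §4.3, configured names use their mode, everything else falls back to `default_mode`) is an assumption drawn from the surrounding sections, not a documented algorithm:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum ConsistencyMode {
    Strong,
    Eventual,
}

/// Resolve the consistency mode for a namespace name.
fn resolve_mode(
    namespaces: &HashMap<String, ConsistencyMode>,
    default_mode: ConsistencyMode,
    name: &str,
) -> ConsistencyMode {
    const RESERVED: [&str; 3] = ["iam", "metrics", "_system"];
    if RESERVED.contains(&name) {
        // Reserved namespaces are always strong and cannot be reconfigured.
        return ConsistencyMode::Strong;
    }
    // Explicitly configured namespaces use their mode; auto-created
    // namespaces fall back to default_mode.
    namespaces.get(name).copied().unwrap_or(default_mode)
}

fn main() {
    let mut namespaces = HashMap::new();
    namespaces.insert("myns".to_string(), ConsistencyMode::Eventual);

    assert_eq!(resolve_mode(&namespaces, ConsistencyMode::Strong, "iam"), ConsistencyMode::Strong);
    assert_eq!(resolve_mode(&namespaces, ConsistencyMode::Strong, "myns"), ConsistencyMode::Eventual);
    assert_eq!(resolve_mode(&namespaces, ConsistencyMode::Strong, "unknown"), ConsistencyMode::Strong);
    println!("mode resolution ok");
}
```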

5.2 Raft Configuration

Config {
    heartbeat_interval: 100,          // ms
    election_timeout_min: 300,        // ms
    election_timeout_max: 600,        // ms
    snapshot_policy: LogsSinceLast(1000),
    max_in_snapshot_log_to_keep: 100,
}

5.3 CLI Arguments

flaredb-server [OPTIONS]
  --store-id <ID>          Store ID (default: 1)
  --addr <ADDR>            KV server address (default: 127.0.0.1:50051)
  --data-dir <PATH>        Data directory (default: data)
  --pd-addr <ADDR>         PD server address (default: 127.0.0.1:2379)
  --peer <ID=ADDR>         Peer addresses (repeatable)
  --namespace-mode <NS=MODE>  Namespace modes (repeatable, e.g., myns=eventual)

flaredb-pd [OPTIONS]
  --addr <ADDR>            PD server address (default: 127.0.0.1:2379)

6. Consistency Models

6.1 Eventual Consistency (Raw API)

  • Write Path: Local RocksDB → Async Raft replication
  • Read Path: Local RocksDB read
  • Conflict Resolution: Last-Write-Wins (LWW) using TSO timestamps
  • Guarantees: Eventually consistent, high throughput

Write: Client → Local Store → RocksDB
                    ↓ (async)
              Raft Replication → Other Stores
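The LWW resolution step can be shown in a few lines: when two replicas hold conflicting versions of a key, the write carrying the higher TSO timestamp wins. How ties are broken (e.g. by store id) is not specified here, so this sketch keeps the local copy:

```rust
/// Last-Write-Wins merge for eventual mode: higher TSO timestamp wins.
/// Each side is (timestamp, value). Tie-breaking policy is an assumption.
fn lww_merge<'a>(local: &'a (u64, Vec<u8>), remote: &'a (u64, Vec<u8>)) -> &'a (u64, Vec<u8>) {
    if remote.0 > local.0 { remote } else { local }
}

fn main() {
    let local = (100u64, b"old".to_vec());
    let remote = (200u64, b"new".to_vec());
    assert_eq!(lww_merge(&local, &remote).1, b"new".to_vec());

    // On equal timestamps this sketch keeps the local copy.
    let same = (100u64, b"other".to_vec());
    assert_eq!(lww_merge(&local, &same).1, b"old".to_vec());
    println!("lww ok");
}
```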

6.2 Strong Consistency (CAS API)

  • Write Path: Raft consensus → Apply to state machine
  • Read Path: ensure_linearizable() → Leader read
  • Guarantees: Linearizable reads and writes

Write: Client → Leader → Raft Consensus → All Stores
Read:  Client → Leader → Verify leadership → Return

6.3 TSO (Timestamp Oracle)

// 64-bit timestamp format
┌────────────────────────────┬───────────────────┐
│  Physical time (48 bits)   │ Logical (16 bits) │
│ milliseconds since epoch   │      0-65535      │
└────────────────────────────┴───────────────────┘

impl TsoOracle {
    fn get_timestamp(count: u32) -> u64;
    fn physical_time(ts: u64) -> u64;   // Upper 48 bits
    fn logical_counter(ts: u64) -> u16; // Lower 16 bits
    fn compose(physical: u64, logical: u16) -> u64;
}

Properties:

  • Monotonically increasing
  • Thread-safe (AtomicU64)
  • Batch allocation support
  • ~65536 timestamps per millisecond
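The bit layout and the properties above can be sketched as follows. The allocator is a deliberately minimal illustration (real TSO advancement against the wall clock is omitted), but the compose/decompose arithmetic matches the 48/16 split:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn compose(physical_ms: u64, logical: u16) -> u64 {
    // Physical milliseconds in the upper 48 bits, logical counter in the low 16.
    (physical_ms << 16) | logical as u64
}

fn physical_time(ts: u64) -> u64 {
    ts >> 16 // upper 48 bits
}

fn logical_counter(ts: u64) -> u16 {
    (ts & 0xFFFF) as u16 // lower 16 bits
}

/// Minimal monotonic allocator: hands out `count` consecutive timestamps.
/// Thread-safe via a single AtomicU64; clock advancement is omitted here.
struct TsoOracle {
    last: AtomicU64,
}

impl TsoOracle {
    fn get_timestamp(&self, count: u32) -> u64 {
        // fetch_add returns the previous value; the first issued ts is prev + 1.
        self.last.fetch_add(count as u64, Ordering::SeqCst) + 1
    }
}

fn main() {
    let ts = compose(1_700_000_000_000, 42);
    assert_eq!(physical_time(ts), 1_700_000_000_000);
    assert_eq!(logical_counter(ts), 42);

    let oracle = TsoOracle { last: AtomicU64::new(compose(1_700_000_000_000, 0)) };
    let first = oracle.get_timestamp(10); // batch-allocates 10 timestamps
    let second = oracle.get_timestamp(1);
    assert!(second > first); // monotonically increasing
    println!("tso ok");
}
```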

7. Anti-Entropy & Replication

7.1 Merkle Tree Synchronization

For eventual consistency mode, FlareDB uses Merkle trees for anti-entropy:

rpc GetMerkle(GetMerkleRequest) returns (GetMerkleResponse);
rpc FetchRange(FetchRangeRequest) returns (FetchRangeResponse);

message GetMerkleResponse {
    bytes root = 1;              // sha256 root hash
    repeated MerkleRange leaves = 2;  // per-chunk hashes
}

Anti-entropy flow:

  1. Replica requests Merkle root from leader
  2. Compare leaf hashes to identify divergent ranges
  3. Fetch divergent ranges via FetchRange
  4. Apply LWW merge using timestamps
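Step 2 of this flow reduces to comparing per-chunk leaf hashes position by position. A dependency-free sketch (the spec uses sha256 via the `sha2` crate; the stdlib hasher stands in here, and all names are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one chunk of sorted (key, value) entries.
/// Stand-in for the sha256 leaf hash in the real tree.
fn chunk_hash(entries: &[(Vec<u8>, Vec<u8>)]) -> u64 {
    let mut h = DefaultHasher::new();
    entries.hash(&mut h);
    h.finish()
}

/// Compare per-chunk leaf hashes and report which ranges diverge,
/// so only those are fetched via FetchRange.
fn divergent_chunks(leader: &[u64], replica: &[u64]) -> Vec<usize> {
    leader
        .iter()
        .zip(replica.iter())
        .enumerate()
        .filter(|(_, (a, b))| a != b)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let chunk_a = vec![(b"k1".to_vec(), b"v1".to_vec())];
    let chunk_b = vec![(b"k2".to_vec(), b"v2".to_vec())];
    let chunk_b_stale = vec![(b"k2".to_vec(), b"stale".to_vec())];

    let leader = vec![chunk_hash(&chunk_a), chunk_hash(&chunk_b)];
    let replica = vec![chunk_hash(&chunk_a), chunk_hash(&chunk_b_stale)];

    // Only the second range differs, so only it needs transferring.
    assert_eq!(divergent_chunks(&leader, &replica), vec![1]);
    println!("divergent chunks: {:?}", divergent_chunks(&leader, &replica));
}
```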

7.2 Chainfire Integration

FlareDB integrates with Chainfire as its Placement Driver backend:

  • Store registration and heartbeat
  • Region metadata watch notifications
  • Leader reporting for region routing

// Server connects to Chainfire PD
PdClient::connect(pd_addr).await?;
pd_client.register_store(store_id, addr).await?;
pd_client.start_watch().await?;  // Watch for metadata changes

7.3 Namespace Mode Updates (Runtime)

rpc UpdateNamespaceMode(UpdateNamespaceModeRequest) returns (UpdateNamespaceModeResponse);
rpc ListNamespaceModes(ListNamespaceModesRequest) returns (ListNamespaceModesResponse);

Namespaces can be switched between strong and eventual modes at runtime (except reserved namespaces).

8. Operations

8.1 Cluster Bootstrap

Single Node

flaredb-server --pd-addr 127.0.0.1:2379 --data-dir ./data
# First node auto-creates region covering entire keyspace

Multi-Node (3+ nodes)

# Node 1 (bootstrap)
flaredb-server --pd-addr 127.0.0.1:2379 --bootstrap

# Nodes 2, 3
flaredb-server --pd-addr 127.0.0.1:2379 --join
# Auto-creates 3-replica Raft group

8.2 Region Operations

Region Split

// When region exceeds size threshold
FlareRequest::Split {
    region_id: 1,
    split_key: b"middle_key",
    new_region_id: 2,
}
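Applying a Split to a `RegionMeta` is a matter of carving the key range at `split_key`: the old region keeps `[start_key, split_key)` and the new region takes `[split_key, end_key)`. A sketch under that assumption (validation and Raft-group creation omitted):

```rust
#[derive(Debug, Clone, PartialEq)]
struct RegionMeta {
    id: u64,
    start_key: Vec<u8>, // inclusive, empty = start of keyspace
    end_key: Vec<u8>,   // exclusive, empty = infinity
}

/// Apply a Split command: shrink `region` to [start_key, split_key)
/// and return the new region covering [split_key, old end_key).
fn apply_split(region: &mut RegionMeta, split_key: &[u8], new_region_id: u64) -> RegionMeta {
    let old_end = std::mem::replace(&mut region.end_key, split_key.to_vec());
    RegionMeta {
        id: new_region_id,
        start_key: split_key.to_vec(),
        end_key: old_end,
    }
}

fn main() {
    // Region 1 initially covers the entire keyspace.
    let mut r1 = RegionMeta { id: 1, start_key: vec![], end_key: vec![] };
    let r2 = apply_split(&mut r1, b"middle_key", 2);

    assert_eq!(r1.end_key, b"middle_key".to_vec());
    assert_eq!(r2.start_key, b"middle_key".to_vec());
    assert_eq!(r2.end_key, Vec::<u8>::new()); // still unbounded above
    println!("split ok");
}
```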

Region Discovery

// Client queries PD for routing
let (region, leader) = pd.get_region(key).await?;
let store_addr = leader.addr;

8.3 Monitoring

  • Health: gRPC health check service
  • Metrics (planned):
    • flaredb_kv_operations_total{type=raw|cas}
    • flaredb_region_count
    • flaredb_raft_proposals_total
    • flaredb_tso_requests_total

9. Security

9.1 Multi-tenancy

  • Namespace isolation: Separate keyspace per namespace
  • Reserved namespaces: System namespaces are immutable
  • Future: Per-namespace ACLs via IAM integration

9.2 Authentication

  • Current: None (development mode)
  • Planned: mTLS, token-based auth

10. Compatibility

10.1 API Versioning

  • gRPC packages: flaredb.kvrpc, flaredb.pdpb
  • Wire protocol: Protocol Buffers 3

10.2 TiKV Inspiration

  • Multi-Raft per region (similar architecture)
  • PD for metadata management
  • TSO for timestamps
  • Differences: dual consistency modes, simpler API

Appendix

A. Error Codes

| Error              | Meaning                            |
|--------------------|------------------------------------|
| NOT_LEADER         | Node is not region leader          |
| REGION_NOT_FOUND   | Key not in any region              |
| VERSION_MISMATCH   | CAS expected_version doesn't match |
| NAMESPACE_RESERVED | Cannot modify reserved namespace   |

B. Scan Limits

| Constant           | Value | Purpose                  |
|--------------------|-------|--------------------------|
| DEFAULT_SCAN_LIMIT | 100   | Default entries per scan |
| MAX_SCAN_LIMIT     | 10000 | Maximum entries per scan |

C. Port Assignments

| Port  | Protocol | Purpose                 |
|-------|----------|-------------------------|
| 50051 | gRPC     | KV API (flaredb-server) |
| 2379  | gRPC     | PD API (flaredb-pd)     |

D. Raft Service (Internal)

service RaftService {
  rpc VoteV2(OpenRaftVoteRequest) returns (OpenRaftVoteResponse);
  rpc AppendEntriesV2(OpenRaftAppendEntriesRequest) returns (OpenRaftAppendEntriesResponse);
  rpc InstallSnapshotV2(OpenRaftSnapshotRequest) returns (OpenRaftSnapshotResponse);
  rpc ForwardEventual(ForwardEventualRequest) returns (RaftResponse);
}