photoncloud-monorepo/docs/por/T033-metricstor/DESIGN.md

Metricstor Design Document

Project: Metricstor - VictoriaMetrics OSS Replacement
Task: T033.S1 Research & Architecture
Version: 1.0
Date: 2025-12-10
Author: PeerB


Table of Contents

  1. Executive Summary
  2. Requirements
  3. Time-Series Storage Model
  4. Push Ingestion API
  5. PromQL Query Engine
  6. Storage Backend Architecture
  7. Integration Points
  8. Implementation Plan
  9. Open Questions
  10. References

1. Executive Summary

1.1 Overview

Metricstor is a fully open-source, distributed time-series database designed as a replacement for VictoriaMetrics, motivated by the fact that VictoriaMetrics gates mTLS support behind its paid enterprise tier. As the final component (Item 12/12) of PROJECT.md, Metricstor completes the observability stack for the Japanese cloud platform.

1.2 High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Service Mesh                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐           │
│  │ FlareDB  │ │ ChainFire│ │ PlasmaVMC│ │  IAM     │  ...      │
│  │ :9092    │ │ :9091    │ │ :9093    │ │ :9094    │           │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘           │
│       │            │            │            │                   │
│       └────────────┴────────────┴────────────┘                   │
│                         │                                        │
│                         │ Push (remote_write)                    │
│                         │ mTLS                                   │
│                         ▼                                        │
│              ┌──────────────────────┐                            │
│              │   Metricstor Server  │                            │
│              │  ┌────────────────┐  │                            │
│              │  │  Ingestion API │  │ ← Prometheus remote_write  │
│              │  │  (gRPC/HTTP)   │  │                            │
│              │  └────────┬───────┘  │                            │
│              │           │          │                            │
│              │  ┌────────▼───────┐  │                            │
│              │  │  Write Buffer  │  │                            │
│              │  │  (In-Memory)   │  │                            │
│              │  └────────┬───────┘  │                            │
│              │           │          │                            │
│              │  ┌────────▼───────┐  │                            │
│              │  │  Storage Engine│  │                            │
│              │  │  ┌──────────┐  │  │                            │
│              │  │  │ Head     │  │  │ ← WAL + In-Memory Index    │
│              │  │  │ (Active) │  │  │                            │
│              │  │  └────┬─────┘  │  │                            │
│              │  │       │        │  │                            │
│              │  │  ┌────▼─────┐  │  │                            │
│              │  │  │ Blocks   │  │  │ ← Immutable, Compressed    │
│              │  │  │ (TSDB)   │  │  │                            │
│              │  │  └──────────┘  │  │                            │
│              │  └────────────────┘  │                            │
│              │           │          │                            │
│              │  ┌────────▼───────┐  │                            │
│              │  │  Query Engine  │  │ ← PromQL Execution         │
│              │  │  (PromQL AST)  │  │                            │
│              │  └────────┬───────┘  │                            │
│              │           │          │                            │
│              └───────────┼──────────┘                            │
│                          │                                        │
│                          │ Query (HTTP)                           │
│                          │ mTLS                                   │
│                          ▼                                        │
│              ┌──────────────────────┐                            │
│              │   Grafana / Clients  │                            │
│              └──────────────────────┘                            │
└─────────────────────────────────────────────────────────────────┘

                  ┌─────────────────────┐
                  │  FlareDB Cluster    │  ← Metadata (optional)
                  │  (Metadata Store)   │
                  └─────────────────────┘

                  ┌─────────────────────┐
                  │  S3-Compatible      │  ← Cold Storage (future)
                  │  Object Storage     │
                  └─────────────────────┘

1.3 Key Design Decisions

  1. Storage Format: Hybrid approach using Prometheus TSDB block design with Gorilla compression

    • Rationale: Battle-tested, excellent compression (1-2 bytes/sample), widely understood
  2. Storage Backend: Dedicated time-series engine with optional FlareDB metadata integration

    • Rationale: Time-series workloads have unique access patterns; KV stores not optimal for sample storage
    • FlareDB reserved for metadata (series labels, index) in distributed scenarios
  3. PromQL Subset: Support 80% of common use cases (instant/range queries, basic aggregations, rate/increase)

    • Rationale: Full PromQL compatibility is complex; focus on practical operator needs
  4. Push Model: Prometheus remote_write v1.0 protocol via HTTP + gRPC APIs

    • Rationale: Standard protocol, Snappy compression built-in, client library availability
  5. mTLS Integration: Consistent with T027/T031 patterns (cert_file, key_file, ca_file, require_client_cert)

    • Rationale: Unified security model across all platform services

1.4 Success Criteria

  • Accept metrics from 8+ services (ports 9091-9099) via remote_write
  • Query latency <100ms for instant queries (p95)
  • Compression ratio ≥10:1 (target: 1.5-2 bytes/sample)
  • Support 100K samples/sec write throughput per instance
  • PromQL queries cover 80% of Grafana dashboard use cases
  • Zero vendor lock-in (100% OSS, no paid features)

2. Requirements

2.1 Functional Requirements

FR-1: Push-Based Metric Ingestion

  • FR-1.1: Accept Prometheus remote_write v1.0 protocol (HTTP POST)
  • FR-1.2: Support Snappy-compressed protobuf payloads
  • FR-1.3: Validate metric names and labels per Prometheus naming conventions
  • FR-1.4: Handle out-of-order samples within a configurable time window (default: 1h)
  • FR-1.5: Deduplicate duplicate samples (same timestamp + labels)
  • FR-1.6: Return backpressure signals (HTTP 429/503) when buffer is full

FR-2: PromQL Query Engine

  • FR-2.1: Support instant queries (/api/v1/query)
  • FR-2.2: Support range queries (/api/v1/query_range)
  • FR-2.3: Support label queries (/api/v1/label/<name>/values, /api/v1/labels)
  • FR-2.4: Support series metadata queries (/api/v1/series)
  • FR-2.5: Implement core PromQL functions (see Section 5.2)
  • FR-2.6: Support Prometheus HTTP API JSON response format

FR-3: Time-Series Storage

  • FR-3.1: Store samples with millisecond timestamp precision
  • FR-3.2: Support configurable retention periods (default: 15 days, configurable 1-365 days)
  • FR-3.3: Automatic background compaction of blocks
  • FR-3.4: Crash recovery via Write-Ahead Log (WAL)
  • FR-3.5: Series cardinality limits to prevent explosion (default: 10M series)

FR-4: Security & Authentication

  • FR-4.1: mTLS support for ingestion and query APIs
  • FR-4.2: Optional basic authentication for HTTP endpoints
  • FR-4.3: Rate limiting per client (based on mTLS certificate CN or IP)

FR-5: Operational Features

  • FR-5.1: Prometheus-compatible /metrics endpoint for self-monitoring
  • FR-5.2: Health check endpoints (/health, /ready)
  • FR-5.3: Admin API for series deletion, compaction trigger
  • FR-5.4: TOML configuration file support
  • FR-5.5: Environment variable overrides

2.2 Non-Functional Requirements

NFR-1: Performance

  • NFR-1.1: Ingestion throughput: ≥100K samples/sec per instance
  • NFR-1.2: Query latency (p95): <100ms for instant queries, <500ms for range queries (1h window)
  • NFR-1.3: Compression ratio: ≥10:1 (target: 1.5-2 bytes/sample)
  • NFR-1.4: Memory usage: <2GB for 1M active series

NFR-2: Scalability

  • NFR-2.1: Vertical scaling: Support 10M active series per instance
  • NFR-2.2: Horizontal scaling: Support sharding across multiple instances (future work)
  • NFR-2.3: Storage: Support local disk + optional S3-compatible backend for cold data

NFR-3: Reliability

  • NFR-3.1: No data loss for committed samples (WAL durability)
  • NFR-3.2: Graceful degradation under load (reject writes with backpressure, not crash)
  • NFR-3.3: Crash recovery time: <30s for 10M series

NFR-4: Maintainability

  • NFR-4.1: Codebase consistency with other platform services (FlareDB, ChainFire patterns)
  • NFR-4.2: 100% Rust, no CGO dependencies
  • NFR-4.3: Comprehensive unit and integration tests
  • NFR-4.4: Operator documentation with runbooks

NFR-5: Compatibility

  • NFR-5.1: Prometheus remote_write v1.0 protocol compatibility
  • NFR-5.2: Prometheus HTTP API compatibility (subset: query, query_range, labels, series)
  • NFR-5.3: Grafana data source compatibility

2.3 Out of Scope (Explicitly Not Supported in v1)

  • Prometheus remote_read protocol (pull-based; platform uses push)
  • Full PromQL compatibility (complex subqueries, advanced functions)
  • Multi-tenancy (single-tenant per instance; use multiple instances for multi-tenant)
  • Distributed query federation (single-instance queries only)
  • Recording rules and alerting (use separate Prometheus/Alertmanager for this)

3. Time-Series Storage Model

3.1 Data Model

3.1.1 Metric Structure

A time-series metric in Metricstor follows the Prometheus data model:

metric_name{label1="value1", label2="value2", ...} value timestamp

Example:

http_requests_total{method="GET", status="200", service="flaredb"} 1543 1733832000000

Components:

  • Metric Name: Identifier for the measurement (e.g., http_requests_total)

    • Must match regex: [a-zA-Z_:][a-zA-Z0-9_:]*
  • Labels: Key-value pairs for dimensionality (e.g., {method="GET", status="200"})

    • Label names: [a-zA-Z_][a-zA-Z0-9_]*
    • Label values: Any UTF-8 string
    • Reserved labels: __name__ (stores metric name), labels starting with __ are internal
  • Value: Float64 sample value

  • Timestamp: Millisecond precision (int64 milliseconds since Unix epoch)

3.1.2 Series Identification

A series is uniquely identified by its metric name + label set:

// Pseudo-code representation
struct SeriesID {
    hash: u64,  // FNV-1a hash of sorted labels
}

struct Series {
    id: SeriesID,
    labels: BTreeMap<String, String>,  // Sorted for consistent hashing
    chunks: Vec<ChunkRef>,
}

Series ID calculation:

  1. Sort labels lexicographically (including __name__ label)
  2. Concatenate as: label1_name + \0 + label1_value + \0 + label2_name + \0 + ...
  3. Compute FNV-1a 64-bit hash
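The three steps above can be sketched directly in Rust. This is an illustrative stand-alone version (function names are ours, not a fixed API); a `BTreeMap` gives the sorted iteration order that makes the byte stream canonical:

```rust
use std::collections::BTreeMap;

/// One FNV-1a round per byte: xor, then multiply by the FNV prime.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// FNV-1a 64-bit hash over the sorted label set, with '\0' separators
/// between each label name and value as described above.
fn series_id(labels: &BTreeMap<String, String>) -> u64 {
    const FNV_OFFSET_BASIS: u64 = 0xcbf29ce484222325;
    let mut hash = FNV_OFFSET_BASIS;
    // BTreeMap iterates in sorted key order, so insertion order is irrelevant.
    for (name, value) in labels {
        hash = fnv1a(hash, name.as_bytes());
        hash = fnv1a(hash, &[0]);
        hash = fnv1a(hash, value.as_bytes());
        hash = fnv1a(hash, &[0]);
    }
    hash
}

fn main() {
    let mut labels = BTreeMap::new();
    labels.insert("__name__".to_string(), "http_requests_total".to_string());
    labels.insert("method".to_string(), "GET".to_string());
    println!("series id: {:016x}", series_id(&labels));
}
```

Because the map is sorted before hashing, the same label set always yields the same series ID regardless of the order labels arrive in.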

3.2 Storage Format

3.2.1 Architecture Overview

Metricstor uses a hybrid storage architecture inspired by Prometheus TSDB and Gorilla:

┌─────────────────────────────────────────────────────────────────┐
│                        Memory Layer (Head)                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Series Map  │  │  WAL Segment │  │ Write Buffer │          │
│  │  (In-Memory  │  │  (Disk)      │  │ (MPSC Queue) │          │
│  │   Index)     │  │              │  │              │          │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │
│         │                 │                 │                    │
│         └─────────────────┴─────────────────┘                    │
│                           │                                      │
│                           ▼                                      │
│              ┌─────────────────────────┐                         │
│              │    Active Chunks        │                         │
│              │  (Gorilla-compressed)   │                         │
│              │   - 2h time windows     │                         │
│              │   - Delta-of-delta TS   │                         │
│              │   - XOR float encoding  │                         │
│              └─────────────────────────┘                         │
└─────────────────────────────────────────────────────────────────┘
                           │
                           │ Compaction Trigger
                           │ (every 2h or on shutdown)
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Disk Layer (Blocks)                       │
│                                                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  Block 1        │  │  Block 2        │  │  Block N        │ │
│  │  [0h - 2h)      │  │  [2h - 4h)      │  │  [Nh - (N+2)h)  │ │
│  │                 │  │                 │  │                 │ │
│  │  ├─ meta.json   │  │  ├─ meta.json   │  │  ├─ meta.json   │ │
│  │  ├─ index       │  │  ├─ index       │  │  ├─ index       │ │
│  │  ├─ chunks/000  │  │  ├─ chunks/000  │  │  ├─ chunks/000  │ │
│  │  └─ tombstones  │  │  └─ tombstones  │  │  └─ tombstones  │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

3.2.2 Write-Ahead Log (WAL)

Purpose: Durability and crash recovery

Format: Append-only log segments (128MB default size)

WAL Structure:
data/
  wal/
    00000001  ← Segment 1 (128MB)
    00000002  ← Segment 2 (active)

WAL Record Format (inspired by LevelDB):

┌───────────────────────────────────────────────────┐
│  CRC32 (4 bytes)                                  │
├───────────────────────────────────────────────────┤
│  Length (4 bytes, little-endian)                  │
├───────────────────────────────────────────────────┤
│  Type (1 byte): FULL | FIRST | MIDDLE | LAST      │
├───────────────────────────────────────────────────┤
│  Payload (variable):                              │
│    - Record Type (1 byte): Series | Samples       │
│    - Series ID (8 bytes)                          │
│    - Labels (length-prefixed strings)             │
│    - Samples (varint timestamp, float64 value)    │
└───────────────────────────────────────────────────┘

WAL Operations:

  • Append: Every write appends to active segment
  • Checkpoint: Snapshot of in-memory state to disk blocks
  • Truncate: Delete segments older than oldest in-memory data
  • Replay: On startup, replay WAL segments to rebuild in-memory state

Rust Implementation Sketch:

struct WAL {
    dir: PathBuf,
    segment_size: usize,  // 128MB default
    active_segment: File,
    active_segment_num: u64,
}

impl WAL {
    fn append(&mut self, record: &WALRecord) -> Result<()> {
        let encoded = record.encode();
        let crc = crc32(&encoded);

        // Rotate segment if needed
        if self.active_segment.metadata()?.len() + encoded.len() as u64 > self.segment_size as u64 {
            self.rotate_segment()?;
        }

        self.active_segment.write_all(&crc.to_le_bytes())?;
        self.active_segment.write_all(&(encoded.len() as u32).to_le_bytes())?;
        self.active_segment.write_all(&encoded)?;
        self.active_segment.sync_all()?;  // fsync for durability
        Ok(())
    }

    fn replay(&self) -> Result<Vec<WALRecord>> {
        // Read all segments in order, verify CRCs, and decode records.
        // Used on startup for crash recovery (omitted in this sketch).
        todo!()
    }
}

3.2.3 In-Memory Head Block

Purpose: Accept recent writes, maintain hot data for fast queries

Structure:

struct Head {
    series: RwLock<HashMap<SeriesID, Arc<Series>>>,
    min_time: AtomicI64,
    max_time: AtomicI64,
    chunk_size: Duration,  // 2h default
    wal: Arc<WAL>,
}

struct Series {
    id: SeriesID,
    labels: BTreeMap<String, String>,
    chunks: RwLock<Vec<Chunk>>,
}

struct Chunk {
    min_time: i64,
    max_time: i64,
    samples: CompressedSamples,  // Gorilla encoding
}

Chunk Lifecycle:

  1. Creation: New chunk created when first sample arrives or previous chunk is full
  2. Active: Chunk accepts samples in time window [min_time, min_time + 2h)
  3. Full: Chunk reaches 2h window, new chunk created for subsequent samples
  4. Compaction: Full chunks compacted to disk blocks

Memory Limits:

  • Max series: 10M (configurable)
  • Max chunks per series: 2 (active + previous, covering 4h)
  • Eviction: LRU eviction of inactive series (no samples in 4h)
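The chunk lifecycle above can be illustrated with a simplified, single-threaded sketch (the `SketchChunk`/`SketchSeries` names are illustrative; the real Head wraps this in the locks shown earlier, stores Gorilla-compressed bytes rather than a `Vec`, and additionally honors FR-1.4's out-of-order window, which is ignored here for brevity):

```rust
const CHUNK_WINDOW_MS: i64 = 2 * 60 * 60 * 1000; // 2h chunk window

struct SketchChunk {
    min_time: i64,
    max_time: i64,
    samples: Vec<(i64, f64)>, // stand-in for Gorilla-compressed data
}

#[derive(Default)]
struct SketchSeries {
    chunks: Vec<SketchChunk>,
}

impl SketchSeries {
    fn append(&mut self, ts: i64, value: f64) -> Result<(), &'static str> {
        match self.chunks.last_mut() {
            // Timestamp not newer than the latest sample: reject
            // (covers both out-of-order samples and exact duplicates).
            Some(c) if ts <= c.max_time => Err("out of order"),
            // Active chunk still has room in its 2h window.
            Some(c) if ts < c.min_time + CHUNK_WINDOW_MS => {
                c.samples.push((ts, value));
                c.max_time = ts;
                Ok(())
            }
            // First sample, or window exhausted: cut a new chunk.
            _ => {
                self.chunks.push(SketchChunk {
                    min_time: ts,
                    max_time: ts,
                    samples: vec![(ts, value)],
                });
                Ok(())
            }
        }
    }
}

fn main() {
    let mut s = SketchSeries::default();
    s.append(0, 1.0).unwrap();
    s.append(15_000, 2.0).unwrap();
    s.append(CHUNK_WINDOW_MS, 3.0).unwrap(); // crosses the window: new chunk
    assert_eq!(s.chunks.len(), 2);
    assert!(s.append(500, 9.9).is_err()); // out-of-order: rejected
}
```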

3.2.4 Disk Blocks (Immutable)

Purpose: Long-term storage of compacted time-series data

Block Structure (inspired by Prometheus TSDB):

data/
  01HQZQZQZQZQZQZQZQZQZQ/  ← Block directory (ULID)
    meta.json              ← Metadata
    index                  ← Inverted index
    chunks/
      000001               ← Chunk file
      000002
      ...
    tombstones             ← Deleted series/samples

meta.json Format:

{
  "ulid": "01HQZQZQZQZQZQZQZQZQZQ",
  "minTime": 1733832000000,
  "maxTime": 1733839200000,
  "stats": {
    "numSamples": 1500000,
    "numSeries": 5000,
    "numChunks": 10000
  },
  "compaction": {
    "level": 1,
    "sources": ["01HQZQZ..."]
  },
  "version": 1
}

Index File Format (simplified):

The index file provides fast lookups of series by labels.

┌────────────────────────────────────────────────┐
│  Magic Number (4 bytes): 0xBADA55A0           │
├────────────────────────────────────────────────┤
│  Version (1 byte): 1                           │
├────────────────────────────────────────────────┤
│  Symbol Table Section                          │
│    - Sorted strings (label names/values)       │
│    - Offset table for binary search            │
├────────────────────────────────────────────────┤
│  Series Section                                │
│    - SeriesID → Chunk Refs mapping             │
│    - (series_id, labels, chunk_offsets)        │
├────────────────────────────────────────────────┤
│  Label Index Section (Inverted Index)          │
│    - label_name → [series_ids]                 │
│    - (label_name, label_value) → [series_ids]  │
├────────────────────────────────────────────────┤
│  Postings Section                              │
│    - Sorted posting lists for label matchers   │
│    - Compressed with varint + bit packing      │
├────────────────────────────────────────────────┤
│  TOC (Table of Contents)                       │
│    - Offsets to each section                   │
└────────────────────────────────────────────────┘
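To make the Label Index and Postings sections concrete: resolving a query with multiple label matchers reduces to intersecting sorted posting lists of series IDs. A minimal sketch (decompression of the varint-packed lists is omitted):

```rust
use std::cmp::Ordering;

/// Intersect two sorted posting lists of series IDs, as when resolving a
/// conjunction of label matchers, e.g. {method="GET", status="200"}.
fn intersect_postings(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    // Classic merge-style walk over both lists: O(len(a) + len(b)).
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    let get = [1, 3, 5, 7, 9]; // series with method="GET"
    let ok = [3, 4, 5, 8, 9];  // series with status="200"
    assert_eq!(intersect_postings(&get, &ok), vec![3, 5, 9]);
}
```

Keeping posting lists sorted is what makes this linear-time walk possible; the same shape extends to n-way intersections for queries with more matchers.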

Chunks File Format:

Chunk File (chunks/000001):
┌────────────────────────────────────────────────┐
│  Chunk 1:                                      │
│    ├─ Length (4 bytes)                         │
│    ├─ Encoding (1 byte): Gorilla = 0x01        │
│    ├─ MinTime (8 bytes)                        │
│    ├─ MaxTime (8 bytes)                        │
│    ├─ NumSamples (4 bytes)                     │
│    └─ Compressed Data (variable)               │
├────────────────────────────────────────────────┤
│  Chunk 2: ...                                  │
└────────────────────────────────────────────────┘
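A minimal writer for one record of this layout might look as follows. Byte order is an assumption here: the format above only fixes little-endian for the WAL, and this sketch assumes the same for chunk files.

```rust
/// Append one chunk record in the layout above:
/// Length (4) | Encoding (1) | MinTime (8) | MaxTime (8) | NumSamples (4) | data.
fn write_chunk_record(
    out: &mut Vec<u8>,
    min_time: i64,
    max_time: i64,
    num_samples: u32,
    compressed: &[u8],
) {
    const ENCODING_GORILLA: u8 = 0x01;
    // The length field covers everything after the length field itself.
    let len = (1 + 8 + 8 + 4 + compressed.len()) as u32;
    out.extend_from_slice(&len.to_le_bytes());
    out.push(ENCODING_GORILLA);
    out.extend_from_slice(&min_time.to_le_bytes());
    out.extend_from_slice(&max_time.to_le_bytes());
    out.extend_from_slice(&num_samples.to_le_bytes());
    out.extend_from_slice(compressed);
}

fn main() {
    let mut buf = Vec::new();
    write_chunk_record(&mut buf, 1733832000000, 1733839200000, 3, &[0xAA, 0xBB]);
    // 4-byte length prefix + 21-byte fixed header + 2 data bytes
    assert_eq!(buf.len(), 4 + 21 + 2);
    assert_eq!(buf[4], 0x01); // encoding byte: Gorilla
}
```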

3.3 Compression Strategy

3.3.1 Gorilla Compression Algorithm

Metricstor uses Gorilla compression from Facebook's paper (VLDB 2015), achieving ~12x compression.

Timestamp Compression (Delta-of-Delta):

Example timestamps (ms):
  t0 = 1733832000000
  t1 = 1733832015000  (Δ1 = 15000)
  t2 = 1733832030000  (Δ2 = 15000)
  t3 = 1733832045000  (Δ3 = 15000)

Delta-of-delta:
  D1 = Δ1 = 15000                      → first delta stored directly in 14 bits (Gorilla block-header convention)
  D2 = Δ2 - Δ1 = 15000 - 15000 = 0     → encode in 1 bit (0)
  D3 = Δ3 - Δ2 = 15000 - 15000 = 0     → encode in 1 bit (0)

Encoding:
  - If D = 0: write 1 bit "0"
  - If D in [-63, 64): write "10" + 7 bits
  - If D in [-255, 256): write "110" + 9 bits
  - If D in [-2047, 2048): write "1110" + 12 bits
  - Otherwise: write "1111" + 32 bits

96% of timestamps compress to 1 bit!

Value Compression (XOR Encoding):

Example values (float64):
  v0 = 1543.0
  v1 = 1543.5
  v2 = 1543.7

XOR compression:
  XOR(v0, v1) = 0x40981C0000000000 XOR 0x40981E0000000000
              = 0x0000020000000000
              → Leading zeros: 22, Trailing zeros: 41, Significant bits: 1
              → Encode: control bits "11" + 5-bit leading + 6-bit length + 1 significant bit

  XOR(v1, v2) → 1543.7 is not exactly representable in float64, so the XOR
                has a much wider significant-bit window; encoded the same
                way with new control bits

Encoding:
  - If v_i == v_(i-1): write 1 bit "0"
  - If XOR has same leading/trailing zeros as previous: write "10" + significant bits
  - Otherwise: write "11" + 5-bit leading + 6-bit length + significant bits

51% of values compress to 1 bit!

Rust Implementation Sketch:

struct GorillaEncoder {
    bit_writer: BitWriter,
    prev_timestamp: i64,
    prev_delta: i64,
    prev_value: f64,
    prev_leading_zeros: u8,
    prev_trailing_zeros: u8,
}

impl GorillaEncoder {
    fn encode_timestamp(&mut self, timestamp: i64) -> Result<()> {
        let delta = timestamp - self.prev_timestamp;
        let delta_of_delta = delta - self.prev_delta;

        if delta_of_delta == 0 {
            self.bit_writer.write_bit(0)?;
        } else if delta_of_delta >= -63 && delta_of_delta < 64 {
            self.bit_writer.write_bits(0b10, 2)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 7)?;
        } else if delta_of_delta >= -255 && delta_of_delta < 256 {
            self.bit_writer.write_bits(0b110, 3)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 9)?;
        } else if delta_of_delta >= -2047 && delta_of_delta < 2048 {
            self.bit_writer.write_bits(0b1110, 4)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 12)?;
        } else {
            self.bit_writer.write_bits(0b1111, 4)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 32)?;
        }

        self.prev_timestamp = timestamp;
        self.prev_delta = delta;
        Ok(())
    }

    fn encode_value(&mut self, value: f64) -> Result<()> {
        let bits = value.to_bits();
        let xor = bits ^ self.prev_value.to_bits();

        if xor == 0 {
            self.bit_writer.write_bit(0)?;
        } else {
            let leading = xor.leading_zeros() as u8;
            let trailing = xor.trailing_zeros() as u8;
            // (A full implementation caps `leading` at 31 so it fits the 5-bit field.)

            if leading >= self.prev_leading_zeros && trailing >= self.prev_trailing_zeros {
                // Control "10": reuse the previous window, so the decoder
                // already knows its width; shift by the *previous* trailing
                // zeros and write the *previous* window's bit count.
                self.bit_writer.write_bits(0b10, 2)?;
                let significant_bits = 64 - self.prev_leading_zeros - self.prev_trailing_zeros;
                self.bit_writer.write_bits(xor >> self.prev_trailing_zeros, significant_bits as usize)?;
            } else {
                // Control "11": emit a new window header, then the payload.
                let significant_bits = 64 - leading - trailing;
                self.bit_writer.write_bits(0b11, 2)?;
                self.bit_writer.write_bits(leading as u64, 5)?;
                self.bit_writer.write_bits(significant_bits as u64, 6)?;
                self.bit_writer.write_bits(xor >> trailing, significant_bits as usize)?;

                self.prev_leading_zeros = leading;
                self.prev_trailing_zeros = trailing;
            }
        }

        self.prev_value = value;
        Ok(())
    }
}

3.3.2 Compression Performance Targets

Based on research and production systems:

Metric                 Target               Reference
Average bytes/sample   1.5-2.0              Prometheus (1-2), Gorilla (1.37), M3DB (1.45)
Compression ratio      10-12x               Gorilla (12x), InfluxDB TSM (45x for specific workloads)
Encode throughput      >500K samples/sec    Gorilla paper: 700K/sec
Decode throughput      >1M samples/sec      Gorilla paper: 1.2M/sec

3.4 Retention and Compaction Policies

3.4.1 Retention Policy

Default Retention: 15 days

Configurable Parameters:

[storage]
retention_days = 15         # Keep data for 15 days
min_block_duration = "2h"   # Minimum block size
max_block_duration = "24h"  # Maximum block size after compaction

Retention Enforcement:

  • Background task runs every 1h
  • Deletes blocks where max_time < now() - retention_duration
  • Deletes old WAL segments

3.4.2 Compaction Strategy

Purpose:

  1. Merge small blocks into larger blocks (reduce file count)
  2. Remove deleted samples (tombstones)
  3. Improve query performance (fewer blocks to scan)

Compaction Levels (inspired by LevelDB):

Level 0: 2h blocks (compacted from Head)
Level 1: 12h blocks (merge 6 L0 blocks)
Level 2: 24h blocks (merge 2 L1 blocks)

Compaction Trigger:

  • Time-based: Every 2h, compact Head → Level 0 block
  • Count-based: When L0 has >4 blocks, compact → L1
  • Manual: Admin API endpoint /api/v1/admin/compact

Compaction Algorithm:

1. Select blocks to compact (same level, adjacent time ranges)
2. Create new block directory (ULID)
3. Iterate all series in selected blocks:
   a. Merge chunks from all blocks
   b. Apply tombstones (skip deleted samples)
   c. Re-compress merged chunks
   d. Write to new block chunks file
4. Build new index (merge posting lists)
5. Write meta.json
6. Atomically rename block directory
7. Delete source blocks

Rust Implementation Sketch:

struct Compactor {
    data_dir: PathBuf,
    retention: Duration,
}

impl Compactor {
    async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID> {
        let block_id = ULID::new();
        let block_dir = self.data_dir.join(block_id.to_string());
        std::fs::create_dir_all(&block_dir)?;

        let mut index_writer = IndexWriter::new(&block_dir.join("index"))?;
        let mut chunk_writer = ChunkWriter::new(&block_dir.join("chunks/000001"))?;

        let series_map = head.series.read().await;
        for (series_id, series) in series_map.iter() {
            let chunks = series.chunks.read().await;
            for chunk in chunks.iter() {
                if chunk.is_full() {
                    let chunk_ref = chunk_writer.write_chunk(&chunk.samples)?;
                    index_writer.add_series(*series_id, &series.labels, chunk_ref)?;
                }
            }
        }

        index_writer.finalize()?;
        chunk_writer.finalize()?;

        let meta = BlockMeta {
            ulid: block_id,
            min_time: head.min_time.load(Ordering::Relaxed),
            max_time: head.max_time.load(Ordering::Relaxed),
            stats: compute_stats(&block_dir)?,
            compaction: CompactionMeta { level: 0, sources: vec![] },
            version: 1,
        };
        write_meta(&block_dir.join("meta.json"), &meta)?;

        Ok(block_id)
    }

    async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID> {
        // Merge multiple blocks into one: same flow as compact_head_to_l0,
        // but reading series and chunks from the existing source blocks
        // (omitted in this sketch).
        todo!()
    }

    async fn enforce_retention(&self) -> Result<()> {
        let cutoff = SystemTime::now() - self.retention;
        let cutoff_ms = cutoff.duration_since(UNIX_EPOCH)?.as_millis() as i64;

        for entry in std::fs::read_dir(&self.data_dir)? {
            let path = entry?.path();
            if !path.is_dir() { continue; }

            let meta_path = path.join("meta.json");
            if !meta_path.exists() { continue; }

            let meta: BlockMeta = serde_json::from_reader(File::open(meta_path)?)?;
            if meta.max_time < cutoff_ms {
                std::fs::remove_dir_all(&path)?;
                info!("Deleted expired block: {}", meta.ulid);
            }
        }
        Ok(())
    }
}

4. Push Ingestion API

4.1 Prometheus Remote Write Protocol

4.1.1 Protocol Overview

Specification: Prometheus Remote Write v1.0
Transport: HTTP/1.1 or HTTP/2
Encoding: Protocol Buffers (protobuf v3)
Compression: Snappy (required)

Reference: Prometheus Remote Write Spec

4.1.2 HTTP Endpoint

POST /api/v1/write
Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Prometheus-Remote-Write-Version: 0.1.0

Request Flow:

┌──────────────┐
│  Client      │
│ (Prometheus, │
│  FlareDB,    │
│  etc.)       │
└──────┬───────┘
       │
       │ 1. Collect samples
       │
       ▼
┌──────────────────────────────────┐
│  Encode to WriteRequest protobuf │
│  message                          │
└──────┬───────────────────────────┘
       │
       │ 2. Compress with Snappy
       │
       ▼
┌──────────────────────────────────┐
│  HTTP POST to /api/v1/write      │
│  with mTLS authentication         │
└──────┬───────────────────────────┘
       │
       │ 3. Send request
       │
       ▼
┌──────────────────────────────────┐
│  Metricstor Server               │
│  ├─ Validate mTLS cert           │
│  ├─ Decompress Snappy            │
│  ├─ Decode protobuf              │
│  ├─ Validate samples             │
│  ├─ Append to WAL                │
│  └─ Insert into Head             │
└──────┬───────────────────────────┘
       │
       │ 4. Response
       │
       ▼
┌──────────────────────────────────┐
│  HTTP Response:                  │
│  204 No Content (success)        │
│  400 Bad Request (invalid)       │
│  429 Too Many Requests           │
│      (backpressure)              │
│  503 Service Unavailable         │
│      (overload)                  │
└──────────────────────────────────┘

4.1.3 Protobuf Schema

File: proto/remote_write.proto

syntax = "proto3";

package metricstor.remote;

// Prometheus remote_write compatible schema

message WriteRequest {
  repeated TimeSeries timeseries = 1;
  // Metadata is optional and not used in v1
  repeated MetricMetadata metadata = 2;
}

message TimeSeries {
  repeated Label labels = 1;
  repeated Sample samples = 2;
  // Exemplars are optional (not supported in v1)
  repeated Exemplar exemplars = 3;
}

message Label {
  string name = 1;
  string value = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2;  // Unix timestamp in milliseconds
}

message Exemplar {
  repeated Label labels = 1;
  double value = 2;
  int64 timestamp = 3;
}

message MetricMetadata {
  enum MetricType {
    UNKNOWN = 0;
    COUNTER = 1;
    GAUGE = 2;
    HISTOGRAM = 3;
    GAUGEHISTOGRAM = 4;
    SUMMARY = 5;
    INFO = 6;
    STATESET = 7;
  }
  MetricType type = 1;
  string metric_family_name = 2;
  string help = 3;
  string unit = 4;
}

Generated Rust Code (using prost):

# Cargo.toml
[dependencies]
prost = "0.12"
prost-types = "0.12"

[build-dependencies]
prost-build = "0.12"

// build.rs
fn main() {
    prost_build::compile_protos(&["proto/remote_write.proto"], &["proto/"]).unwrap();
}
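In practice prost generates the encoder, but a hand-rolled sketch makes the wire format of the schema above concrete: length-delimited fields use wire type 2, `Sample.value` is a fixed 64-bit field, and `Sample.timestamp` a varint. (Before sending, the bytes would additionally be Snappy-compressed; positive timestamps only in this sketch.)

```rust
/// Append a protobuf varint (base-128, little-endian groups).
fn put_varint(buf: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        buf.push((v as u8) | 0x80);
        v >>= 7;
    }
    buf.push(v as u8);
}

/// Append a length-delimited field: tag, length, payload.
fn put_len_field(buf: &mut Vec<u8>, field_num: u32, payload: &[u8]) {
    put_varint(buf, ((field_num as u64) << 3) | 2); // wire type 2
    put_varint(buf, payload.len() as u64);
    buf.extend_from_slice(payload);
}

/// Encode a WriteRequest with one series and one sample.
fn encode_write_request(labels: &[(&str, &str)], value: f64, timestamp_ms: i64) -> Vec<u8> {
    let mut ts_msg = Vec::new();
    // Label { name = 1, value = 2 }, nested in TimeSeries.labels = 1
    for (name, val) in labels {
        let mut label = Vec::new();
        put_len_field(&mut label, 1, name.as_bytes());
        put_len_field(&mut label, 2, val.as_bytes());
        put_len_field(&mut ts_msg, 1, &label);
    }
    // Sample { value = 1 (fixed64), timestamp = 2 (varint) } in TimeSeries.samples = 2
    let mut sample = Vec::new();
    put_varint(&mut sample, (1 << 3) | 1); // wire type 1 = 64-bit
    sample.extend_from_slice(&value.to_bits().to_le_bytes());
    put_varint(&mut sample, (2 << 3) | 0); // wire type 0 = varint
    put_varint(&mut sample, timestamp_ms as u64);
    put_len_field(&mut ts_msg, 2, &sample);

    let mut req = Vec::new();
    put_len_field(&mut req, 1, &ts_msg); // WriteRequest.timeseries = 1
    req
}

fn main() {
    let req = encode_write_request(&[("__name__", "up")], 1.0, 1733832000000);
    assert_eq!(req[0], 0x0A); // field 1, wire type 2: the timeseries entry
    assert!(req.windows(8).any(|w| w == b"__name__"));
}
```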

4.1.4 Ingestion Handler

Rust Implementation:

use axum::{
    Router,
    routing::post,
    extract::State,
    http::StatusCode,
    body::Bytes,
};
use prost::Message;
use snap::raw::Decoder as SnappyDecoder;

mod remote_write_pb {
    include!(concat!(env!("OUT_DIR"), "/metricstor.remote.rs"));
}

struct IngestionService {
    head: Arc<Head>,
    wal: Arc<WAL>,
    rate_limiter: Arc<RateLimiter>,
}

async fn handle_remote_write(
    State(service): State<Arc<IngestionService>>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    // 1. Decompress Snappy
    let mut decoder = SnappyDecoder::new();
    let decompressed = decoder
        .decompress_vec(&body)
        .map_err(|e| (StatusCode::BAD_REQUEST, format!("Snappy decompression failed: {}", e)))?;

    // 2. Decode protobuf
    let write_req = remote_write_pb::WriteRequest::decode(&decompressed[..])
        .map_err(|e| (StatusCode::BAD_REQUEST, format!("Protobuf decode failed: {}", e)))?;

    // 3. Validate and ingest
    let mut samples_ingested = 0;
    let mut samples_rejected = 0;

    for ts in write_req.timeseries.iter() {
        // Validate labels
        let labels = validate_labels(&ts.labels)
            .map_err(|e| (StatusCode::BAD_REQUEST, e))?;

        let series_id = compute_series_id(&labels);

        for sample in ts.samples.iter() {
            // Validate timestamp (not too old, not too far in future)
            if !is_valid_timestamp(sample.timestamp) {
                samples_rejected += 1;
                continue;
            }

            // Check rate limit (per-sample; note that samples already
            // appended earlier in this request are not rolled back on reject)
            if !service.rate_limiter.allow() {
                return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
            }

            // Append to WAL
            let wal_record = WALRecord::Sample {
                series_id,
                timestamp: sample.timestamp,
                value: sample.value,
            };
            service.wal.append(&wal_record)
                .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("WAL append failed: {}", e)))?;

            // Insert into Head; out-of-order samples are skipped,
            // other errors abort the request
            if let Err(e) = service.head
                .append(series_id, labels.clone(), sample.timestamp, sample.value)
                .await
            {
                let msg = e.to_string();
                if msg.contains("out of order") {
                    samples_rejected += 1;
                    continue;
                }
                if msg.contains("buffer full") {
                    return Err((StatusCode::SERVICE_UNAVAILABLE, "Write buffer full".into()));
                }
                return Err((StatusCode::INTERNAL_SERVER_ERROR, format!("Insert failed: {}", e)));
            }

            samples_ingested += 1;
        }
    }

    info!("Ingested {} samples, rejected {}", samples_ingested, samples_rejected);
    Ok(StatusCode::NO_CONTENT)  // 204 No Content on success
}

fn validate_labels(labels: &[remote_write_pb::Label]) -> Result<BTreeMap<String, String>, String> {
    let mut label_map = BTreeMap::new();

    for label in labels {
        // Validate label name
        if !is_valid_label_name(&label.name) {
            return Err(format!("Invalid label name: {}", label.name));
        }

        // An empty label value means the label is absent (Prometheus convention)
        if label.value.is_empty() {
            continue;
        }

        label_map.insert(label.name.clone(), label.value.clone());
    }

    // Must have __name__ label
    if !label_map.contains_key("__name__") {
        return Err("Missing __name__ label".into());
    }

    Ok(label_map)
}

fn is_valid_label_name(name: &str) -> bool {
    // Must match [a-zA-Z_][a-zA-Z0-9_]*
    if name.is_empty() {
        return false;
    }

    let mut chars = name.chars();
    let first = chars.next().unwrap();
    if !first.is_ascii_alphabetic() && first != '_' {
        return false;
    }

    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn is_valid_timestamp(ts: i64) -> bool {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    let min_valid = now - 24 * 3600 * 1000;  // Not older than 24h
    let max_valid = now + 5 * 60 * 1000;      // Not more than 5min in future
    ts >= min_valid && ts <= max_valid
}

4.2 gRPC API (Alternative/Additional)

In addition to HTTP, Metricstor MAY support a gRPC API for ingestion (more efficient for internal services).

Proto Definition:

syntax = "proto3";

package metricstor.ingest;

service IngestionService {
  rpc Write(WriteRequest) returns (WriteResponse);
  rpc WriteBatch(stream WriteRequest) returns (WriteResponse);
}

message WriteRequest {
  repeated TimeSeries timeseries = 1;
}

message WriteResponse {
  uint64 samples_ingested = 1;
  uint64 samples_rejected = 2;
  string error = 3;
}

// (Reuse TimeSeries, Label, Sample from remote_write.proto)

4.3 Label Validation and Normalization

4.3.1 Metric Name Validation

Metric names (stored in __name__ label) must match:

[a-zA-Z_:][a-zA-Z0-9_:]*

Examples:

  • http_requests_total (valid)
  • node_cpu_seconds:rate5m (valid; ':' is conventionally reserved for recording rules)
  • 123_invalid (invalid: starts with digit)
  • invalid-metric (invalid: contains hyphen)
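
The metric-name rule can be sketched as a small check mirroring is_valid_label_name from the ingestion handler, with ':' additionally allowed:

```rust
/// Validate a metric name against [a-zA-Z_:][a-zA-Z0-9_:]*.
/// Sketch only; differs from label-name validation by permitting ':'.
fn is_valid_metric_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':' => {}
        _ => return false, // empty, or invalid first character
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':')
}
```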

4.3.2 Label Name Validation

Label names must match:

[a-zA-Z_][a-zA-Z0-9_]*

Reserved prefixes:

  • __ (double underscore): Internal labels (e.g., __name__, __rollup__)

4.3.3 Label Normalization

Before inserting, labels are normalized:

  1. Sort labels lexicographically by key
  2. Ensure __name__ label is present
  3. Remove duplicate labels (keep last value)
  4. Limit label count (default: 30 labels max per series)
  5. Limit label value length (default: 1024 chars max)
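
The normalization steps above can be sketched in one pass; the limits (30 labels, 1024 bytes) are the defaults quoted here, and byte length is used for step 5:

```rust
use std::collections::BTreeMap;

/// Normalize a raw label list: sort, deduplicate (last value wins),
/// enforce __name__, and apply the default limits from this section.
fn normalize_labels(raw: &[(String, String)]) -> Result<BTreeMap<String, String>, String> {
    const MAX_LABELS: usize = 30;
    const MAX_VALUE_LEN: usize = 1024; // bytes, in this sketch

    // BTreeMap both deduplicates (insert overwrites) and keeps keys sorted.
    let mut labels = BTreeMap::new();
    for (name, value) in raw {
        if value.len() > MAX_VALUE_LEN {
            return Err(format!("label {} value too long", name));
        }
        labels.insert(name.clone(), value.clone());
    }
    if !labels.contains_key("__name__") {
        return Err("missing __name__ label".into());
    }
    if labels.len() > MAX_LABELS {
        return Err(format!("too many labels: {}", labels.len()));
    }
    Ok(labels)
}
```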

4.4 Write Path Architecture

┌──────────────────────────────────────────────────────────────┐
│                   Ingestion Layer                             │
│                                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  HTTP/gRPC  │  │  mTLS Auth  │  │ Rate Limiter│          │
│  │   Handler   │─▶│  Validator  │─▶│             │          │
│  └─────────────┘  └─────────────┘  └──────┬──────┘          │
│                                             │                  │
│                                             ▼                  │
│                                  ┌─────────────────┐          │
│                                  │  Decompressor   │          │
│                                  │  (Snappy)       │          │
│                                  └────────┬────────┘          │
│                                           │                    │
│                                           ▼                    │
│                                  ┌─────────────────┐          │
│                                  │  Protobuf       │          │
│                                  │  Decoder        │          │
│                                  └────────┬────────┘          │
│                                           │                    │
└───────────────────────────────────────────┼──────────────────┘
                                            │
                                            ▼
┌──────────────────────────────────────────────────────────────┐
│                  Validation Layer                             │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Label       │  │  Timestamp   │  │  Cardinality │       │
│  │  Validator   │  │  Validator   │  │  Limiter     │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │                 │
│         └─────────────────┴─────────────────┘                 │
│                           │                                    │
└───────────────────────────┼──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                   Write Buffer                                │
│                                                                │
│  ┌────────────────────────────────────────────────────┐      │
│  │  MPSC Channel (bounded)                             │      │
│  │  Capacity: 100K samples                             │      │
│  │  Backpressure: Block/Reject when full              │      │
│  └────────────────────────────────────────────────────┘      │
│                           │                                    │
└───────────────────────────┼──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                   Storage Layer                               │
│                                                                │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │   WAL       │◀────────│  WAL Writer │                     │
│  │  (Disk)     │         │  (Thread)   │                     │
│  └─────────────┘         └─────────────┘                     │
│                                                                │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │   Head      │◀────────│  Head Writer│                     │
│  │ (In-Memory) │         │  (Thread)   │                     │
│  └─────────────┘         └─────────────┘                     │
└──────────────────────────────────────────────────────────────┘

Concurrency Model:

  1. HTTP/gRPC handlers: Multi-threaded (tokio async)
  2. Write buffer: MPSC channel (bounded capacity)
  3. WAL writer: Single-threaded (sequential writes for consistency)
  4. Head writer: Single-threaded (lock-free inserts via sharding)
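
The sharding in step 4 can be sketched as routing each series ID to a fixed shard, so concurrent inserts contend only on their own shard rather than a global lock (SHARD_COUNT and the shard contents are illustrative assumptions, not the real Head layout):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const SHARD_COUNT: usize = 16; // assumed configuration value

/// Sketch of a sharded Head: series are partitioned by ID.
struct ShardedHead {
    shards: Vec<Mutex<HashMap<u64, Vec<(i64, f64)>>>>,
}

impl ShardedHead {
    fn new() -> Self {
        Self { shards: (0..SHARD_COUNT).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    fn append(&self, series_id: u64, ts: i64, value: f64) {
        // Same series always maps to the same shard
        let shard = &self.shards[(series_id as usize) % SHARD_COUNT];
        shard.lock().unwrap().entry(series_id).or_default().push((ts, value));
    }

    fn sample_count(&self, series_id: u64) -> usize {
        self.shards[(series_id as usize) % SHARD_COUNT]
            .lock().unwrap()
            .get(&series_id).map_or(0, Vec::len)
    }
}
```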

Backpressure Handling:

enum BackpressureStrategy {
    Block,   // Block until buffer has space (default)
    Reject,  // Return 503 immediately
}

impl IngestionService {
    async fn handle_backpressure(&self, samples: Vec<Sample>) -> Result<()> {
        match self.config.backpressure_strategy {
            BackpressureStrategy::Block => {
                // Block until buffer has space, but bound the wait
                tokio::time::timeout(
                    Duration::from_secs(5),
                    self.write_buffer.send(samples),
                )
                .await
                .map_err(|_| Error::Timeout)?          // timed out waiting for space
                .map_err(|_| Error::ChannelClosed)?;   // receiver dropped
            }
            BackpressureStrategy::Reject => {
                // Non-blocking send; fail fast when the buffer is full
                self.write_buffer.try_send(samples)
                    .map_err(|_| Error::BufferFull)?;
            }
        }
        Ok(())
    }
}

4.5 Out-of-Order Sample Handling

Problem: Samples may arrive out of timestamp order due to network delays, batching, etc.

Solution: Accept out-of-order samples within a configurable time window.

Configuration:

[storage]
out_of_order_time_window = "1h"  # Accept samples up to 1h old

Implementation:

impl Head {
    async fn append(
        &self,
        series_id: SeriesID,
        labels: BTreeMap<String, String>,
        timestamp: i64,
        value: f64,
    ) -> Result<()> {
        let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis() as i64;
        let min_valid_ts = now - self.config.out_of_order_time_window.as_millis() as i64;

        if timestamp < min_valid_ts {
            return Err(Error::OutOfOrder(format!(
                "Sample too old: ts={}, min={}",
                timestamp, min_valid_ts
            )));
        }

        // Get or create series
        let mut series_map = self.series.write().await;
        let series = series_map.entry(series_id).or_insert_with(|| {
            Arc::new(Series {
                id: series_id,
                labels: labels.clone(),
                chunks: RwLock::new(vec![]),
            })
        });

        // Append to the chunk covering this timestamp, creating one if
        // needed. Using an index avoids holding two mutable borrows of
        // `chunks` at once (iter_mut().find(..).or_else(push) won't compile).
        let mut chunks = series.chunks.write().await;
        let chunk_ms = self.chunk_size.as_millis() as i64;

        let idx = match chunks.iter().position(|c| timestamp >= c.min_time && timestamp < c.max_time) {
            Some(i) => i,
            None => {
                // Align the new chunk to a fixed-size time grid
                let chunk_start = (timestamp / chunk_ms) * chunk_ms;
                chunks.push(Chunk {
                    min_time: chunk_start,
                    max_time: chunk_start + chunk_ms,
                    samples: CompressedSamples::new(),
                });
                chunks.len() - 1
            }
        };

        chunks[idx].samples.append(timestamp, value)?;

        Ok(())
    }
}

5. PromQL Query Engine

5.1 PromQL Overview

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data.

Query Types:

  1. Instant query: Evaluate expression at a single point in time
  2. Range query: Evaluate expression over a time range

5.2 Supported PromQL Subset

Metricstor v1 supports a pragmatic subset of PromQL covering 80% of common dashboard queries.

5.2.1 Instant Vector Selectors

# Select by metric name
http_requests_total

# Select with label matchers
http_requests_total{method="GET"}
http_requests_total{method="GET", status="200"}

# Label matcher operators
metric{label="value"}        # Exact match
metric{label!="value"}       # Not equal
metric{label=~"regex"}       # Regex match
metric{label!~"regex"}       # Regex not match

# Example
http_requests_total{method=~"GET|POST", status!="500"}

5.2.2 Range Vector Selectors

# Select last 5 minutes of data
http_requests_total[5m]

# With label matchers
http_requests_total{method="GET"}[1h]

# Time durations: ms (milliseconds), s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
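
A minimal duration-literal parser for the single-character units above (compound forms like "1h30m" and the two-character "ms" unit are omitted from this sketch):

```rust
/// Parse a PromQL duration literal ("5m", "1h") into milliseconds.
/// Sketch: single-character units only; "w" and "y" use the fixed
/// 7-day / 365-day conventions.
fn parse_duration_ms(s: &str) -> Option<i64> {
    let split = s.len().checked_sub(1)?; // need at least "<digit><unit>"
    let (num, unit) = s.split_at(split);
    let n: i64 = num.parse().ok()?;
    let factor = match unit {
        "s" => 1_000,
        "m" => 60_000,
        "h" => 3_600_000,
        "d" => 86_400_000,
        "w" => 7 * 86_400_000,
        "y" => 365 * 86_400_000,
        _ => return None,
    };
    Some(n * factor)
}
```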

5.2.3 Aggregation Operators

# sum: Sum over dimensions
sum(http_requests_total)
sum(http_requests_total) by (method)
sum(http_requests_total) without (instance)

# Supported aggregations:
sum         # Sum
avg         # Average
min         # Minimum
max         # Maximum
count       # Count
stddev      # Standard deviation
stdvar      # Standard variance
topk(N, expr)    # Top N series by value
bottomk(N, expr) # Bottom N series by value

5.2.4 Functions

Rate Functions:

# rate: Per-second average rate of increase
rate(http_requests_total[5m])

# irate: Instant rate (last two samples)
irate(http_requests_total[5m])

# increase: Total increase over time range
increase(http_requests_total[1h])
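
The relationship between the three counter functions can be shown over a window of (timestamp_ms, value) samples; this sketch ignores counter resets and extrapolation, which production implementations must handle:

```rust
/// Per-second rate over the whole window (first/last delta).
fn rate(samples: &[(i64, f64)]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    let dt = (last.0 - first.0) as f64 / 1000.0; // ms -> seconds
    if dt <= 0.0 { return None; }
    Some((last.1 - first.1) / dt)
}

/// Instant rate: only the last two samples.
fn irate(samples: &[(i64, f64)]) -> Option<f64> {
    if samples.len() < 2 { return None; }
    rate(&samples[samples.len() - 2..])
}

/// Total increase over the window; equals rate * window seconds.
fn increase(samples: &[(i64, f64)]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    Some(last.1 - first.1)
}
```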

Quantile Functions:

# histogram_quantile: Calculate quantile from histogram
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
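
The quantile is estimated from cumulative bucket counts with linear interpolation inside the target bucket; a sketch over (upper_bound, cumulative_count) pairs sorted by bound and terminated by +Inf:

```rust
/// Estimate the q-quantile from cumulative histogram buckets.
/// Sketch of the Prometheus approach: linear interpolation within the
/// bucket that first reaches rank q * total.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total == 0.0 { return None; }
    let rank = q * total;
    let (mut prev_bound, mut prev_count) = (0.0, 0.0);
    for &(bound, count) in buckets {
        if count >= rank {
            if bound.is_infinite() {
                return Some(prev_bound); // cannot interpolate into +Inf
            }
            let frac = (rank - prev_count) / (count - prev_count);
            return Some(prev_bound + (bound - prev_bound) * frac);
        }
        prev_bound = bound;
        prev_count = count;
    }
    None
}
```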

Time Functions:

# time(): Current Unix timestamp
time()

# timestamp(): Timestamp of sample
timestamp(metric)

Math Functions:

# abs, ceil, floor, round, sqrt, exp, ln, log2, log10
abs(metric)
round(metric, 0.1)

5.2.5 Binary Operators

Arithmetic:

metric1 + metric2
metric1 - metric2
metric1 * metric2
metric1 / metric2
metric1 % metric2
metric1 ^ metric2

Comparison:

metric1 == metric2  # Equal
metric1 != metric2  # Not equal
metric1 > metric2   # Greater than
metric1 < metric2   # Less than
metric1 >= metric2  # Greater or equal
metric1 <= metric2  # Less or equal

Logical:

metric1 and metric2  # Intersection
metric1 or metric2   # Union
metric1 unless metric2  # Complement

Vector Matching:

# One-to-one matching
metric1 + metric2

# Many-to-one matching
metric1 + on(label) group_left metric2

# One-to-many matching
metric1 + on(label) group_right metric2
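
One-to-one matching with on(<label>) is effectively a hash join on the matching label; a simplified sketch for `lhs + on(label) rhs` (label sets are plain maps here, and unmatched samples drop out, as in PromQL):

```rust
use std::collections::HashMap;

/// Join two instant vectors on a single label and add matched values.
/// Returns (matching label value, sum) pairs; one-to-one matching only.
fn binary_add_on(
    on_label: &str,
    lhs: &[(HashMap<String, String>, f64)],
    rhs: &[(HashMap<String, String>, f64)],
) -> Vec<(String, f64)> {
    // Index the right-hand side by its value for the matching label
    let rhs_by_key: HashMap<&str, f64> = rhs.iter()
        .filter_map(|(labels, v)| labels.get(on_label).map(|k| (k.as_str(), *v)))
        .collect();

    lhs.iter()
        .filter_map(|(labels, v)| {
            let key = labels.get(on_label)?;
            let other = rhs_by_key.get(key.as_str())?;
            Some((key.clone(), v + other))
        })
        .collect()
}
```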

5.2.6 Subqueries (NOT SUPPORTED in v1)

Subqueries are complex and not supported in v1:

# NOT SUPPORTED
max_over_time(rate(http_requests_total[5m])[1h:])

5.3 Query Execution Model

5.3.1 Query Parsing

Use promql-parser crate (GreptimeTeam) for parsing:

use promql_parser::{parser, label};

fn parse_query(query: &str) -> Result<parser::Expr, ParseError> {
    parser::parse(query)
}

// Example
let expr = parse_query("http_requests_total{method=\"GET\"}[5m]")?;
match expr {
    parser::Expr::VectorSelector(vs) => {
        println!("Metric: {}", vs.name);
        for matcher in vs.matchers.matchers {
            println!("Label: {} {} {}", matcher.name, matcher.op, matcher.value);
        }
        println!("Range: {:?}", vs.range);
    }
    _ => {}
}

AST Types:

pub enum Expr {
    Aggregate(AggregateExpr),        // sum, avg, etc.
    Unary(UnaryExpr),                // -metric
    Binary(BinaryExpr),              // metric1 + metric2
    Paren(ParenExpr),                // (expr)
    Subquery(SubqueryExpr),          // NOT SUPPORTED
    NumberLiteral(NumberLiteral),    // 1.5
    StringLiteral(StringLiteral),    // "value"
    VectorSelector(VectorSelector),  // metric{labels}
    MatrixSelector(MatrixSelector),  // metric[5m]
    Call(Call),                      // rate(...)
}

5.3.2 Query Planner

Convert AST to execution plan:

enum QueryPlan {
    VectorSelector {
        matchers: Vec<LabelMatcher>,
        timestamp: i64,
    },
    MatrixSelector {
        matchers: Vec<LabelMatcher>,
        range: Duration,
        timestamp: i64,
    },
    Aggregate {
        op: AggregateOp,
        input: Box<QueryPlan>,
        grouping: Vec<String>,
    },
    RateFunc {
        input: Box<QueryPlan>,
    },
    BinaryOp {
        op: BinaryOp,
        lhs: Box<QueryPlan>,
        rhs: Box<QueryPlan>,
        matching: VectorMatching,
    },
}

struct QueryPlanner;

impl QueryPlanner {
    fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan> {
        match expr {
            parser::Expr::VectorSelector(vs) => {
                Ok(QueryPlan::VectorSelector {
                    matchers: vs.matchers.matchers.into_iter()
                        .map(|m| LabelMatcher::from_ast(m))
                        .collect(),
                    timestamp: query_time,
                })
            }
            parser::Expr::MatrixSelector(ms) => {
                Ok(QueryPlan::MatrixSelector {
                    matchers: ms.vector_selector.matchers.matchers.into_iter()
                        .map(|m| LabelMatcher::from_ast(m))
                        .collect(),
                    range: Duration::from_millis(ms.range as u64),
                    timestamp: query_time,
                })
            }
            parser::Expr::Call(call) => {
                match call.func.name.as_str() {
                    "rate" => {
                        let arg_plan = Self::plan(*call.args[0].clone(), query_time)?;
                        Ok(QueryPlan::RateFunc { input: Box::new(arg_plan) })
                    }
                    // ... other functions
                    _ => Err(Error::UnsupportedFunction(call.func.name)),
                }
            }
            parser::Expr::Aggregate(agg) => {
                let input_plan = Self::plan(*agg.expr, query_time)?;
                Ok(QueryPlan::Aggregate {
                    op: AggregateOp::from_str(&agg.op.to_string())?,
                    input: Box::new(input_plan),
                    grouping: agg.grouping.unwrap_or_default(),
                })
            }
            parser::Expr::Binary(bin) => {
                let lhs_plan = Self::plan(*bin.lhs, query_time)?;
                let rhs_plan = Self::plan(*bin.rhs, query_time)?;
                Ok(QueryPlan::BinaryOp {
                    op: BinaryOp::from_str(&bin.op.to_string())?,
                    lhs: Box::new(lhs_plan),
                    rhs: Box::new(rhs_plan),
                    matching: bin.modifier.map(|m| VectorMatching::from_ast(m)).unwrap_or_default(),
                })
            }
            _ => Err(Error::UnsupportedExpr),
        }
    }
}

5.3.3 Query Executor

Execute the plan:

struct QueryExecutor {
    head: Arc<Head>,
    blocks: Arc<BlockManager>,
}

impl QueryExecutor {
    async fn execute(&self, plan: QueryPlan) -> Result<QueryResult> {
        match plan {
            QueryPlan::VectorSelector { matchers, timestamp } => {
                self.execute_vector_selector(matchers, timestamp).await
            }
            QueryPlan::MatrixSelector { matchers, range, timestamp } => {
                self.execute_matrix_selector(matchers, range, timestamp).await
            }
            QueryPlan::RateFunc { input } => {
                let matrix = self.execute(*input).await?;
                self.apply_rate(matrix)
            }
            QueryPlan::Aggregate { op, input, grouping } => {
                let vector = self.execute(*input).await?;
                self.apply_aggregate(op, vector, grouping)
            }
            QueryPlan::BinaryOp { op, lhs, rhs, matching } => {
                let lhs_result = self.execute(*lhs).await?;
                let rhs_result = self.execute(*rhs).await?;
                self.apply_binary_op(op, lhs_result, rhs_result, matching)
            }
        }
    }

    async fn execute_vector_selector(
        &self,
        matchers: Vec<LabelMatcher>,
        timestamp: i64,
    ) -> Result<InstantVector> {
        // 1. Find matching series from index
        let series_ids = self.find_series(&matchers).await?;

        // 2. For each series, get sample at timestamp
        let mut samples = Vec::new();
        for series_id in series_ids {
            if let Some(sample) = self.get_sample_at(series_id, timestamp).await? {
                samples.push(sample);
            }
        }

        Ok(InstantVector { samples })
    }

    async fn execute_matrix_selector(
        &self,
        matchers: Vec<LabelMatcher>,
        range: Duration,
        timestamp: i64,
    ) -> Result<RangeVector> {
        let series_ids = self.find_series(&matchers).await?;

        let start = timestamp - range.as_millis() as i64;
        let end = timestamp;

        let mut ranges = Vec::new();
        for series_id in series_ids {
            let samples = self.get_samples_range(series_id, start, end).await?;
            ranges.push(RangeVectorSeries {
                labels: self.get_labels(series_id).await?,
                samples,
            });
        }

        Ok(RangeVector { ranges })
    }

    fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector> {
        let mut samples = Vec::new();

        for range in matrix.ranges {
            if range.samples.len() < 2 {
                continue;  // Need at least 2 samples for rate
            }

            let first = &range.samples[0];
            let last = &range.samples[range.samples.len() - 1];

            // NOTE: simple first/last delta; production rate() must also
            // detect counter resets (value drops) and extrapolate to the
            // window boundaries, as Prometheus does.
            let delta_value = last.value - first.value;
            let delta_time = (last.timestamp - first.timestamp) as f64 / 1000.0;  // ms -> seconds
            if delta_time == 0.0 {
                continue;  // Identical timestamps: rate undefined
            }

            let rate = delta_value / delta_time;

            samples.push(Sample {
                labels: range.labels,
                timestamp: last.timestamp,
                value: rate,
            });
        }

        Ok(InstantVector { samples })
    }

    fn apply_aggregate(
        &self,
        op: AggregateOp,
        vector: InstantVector,
        grouping: Vec<String>,
    ) -> Result<InstantVector> {
        // Group samples by grouping labels
        let mut groups: HashMap<Vec<(String, String)>, Vec<Sample>> = HashMap::new();

        for sample in vector.samples {
            let group_key = if grouping.is_empty() {
                vec![]
            } else {
                grouping.iter()
                    .filter_map(|label| sample.labels.get(label).map(|v| (label.clone(), v.clone())))
                    .collect()
            };

            groups.entry(group_key).or_insert_with(Vec::new).push(sample);
        }

        // Apply aggregation to each group
        let mut result_samples = Vec::new();
        for (group_labels, samples) in groups {
            let aggregated_value = match op {
                AggregateOp::Sum => samples.iter().map(|s| s.value).sum(),
                AggregateOp::Avg => samples.iter().map(|s| s.value).sum::<f64>() / samples.len() as f64,
                AggregateOp::Min => samples.iter().map(|s| s.value).fold(f64::INFINITY, f64::min),
                AggregateOp::Max => samples.iter().map(|s| s.value).fold(f64::NEG_INFINITY, f64::max),
                AggregateOp::Count => samples.len() as f64,
                // ... other aggregations (stddev, stdvar, topk, bottomk);
                // unsupported ops are rejected
                _ => return Err(Error::UnsupportedAggregation),
            };

            result_samples.push(Sample {
                labels: group_labels.into_iter().collect(),
                timestamp: samples[0].timestamp,
                value: aggregated_value,
            });
        }

        Ok(InstantVector { samples: result_samples })
    }
}

5.4 Read Path Architecture

┌──────────────────────────────────────────────────────────────┐
│                   Query Layer                                 │
│                                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  HTTP API   │  │  PromQL     │  │ Query       │          │
│  │  /api/v1/   │─▶│  Parser     │─▶│ Planner     │          │
│  │  query      │  │             │  │             │          │
│  └─────────────┘  └─────────────┘  └──────┬──────┘          │
│                                             │                  │
│                                             ▼                  │
│                                  ┌─────────────────┐          │
│                                  │  Query          │          │
│                                  │  Executor       │          │
│                                  └────────┬────────┘          │
└───────────────────────────────────────────┼──────────────────┘
                                            │
                                            ▼
┌──────────────────────────────────────────────────────────────┐
│                   Index Layer                                 │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐                          │
│  │  Label Index │  │  Posting     │                          │
│  │  (In-Memory) │  │  Lists       │                          │
│  └──────┬───────┘  └──────┬───────┘                          │
│         │                 │                                    │
│         └─────────────────┘                                    │
│                  │                                             │
│                  │ Series IDs                                  │
│                  ▼                                             │
└──────────────────────────────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────────┐
│                   Storage Layer                               │
│                                                                │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │   Head      │         │   Blocks    │                     │
│  │ (In-Memory) │         │   (Disk)    │                     │
│  └─────┬───────┘         └─────┬───────┘                     │
│        │                       │                               │
│        │ Recent data (<2h)     │ Historical data              │
│        │                       │                               │
│        ▼                       ▼                               │
│  ┌─────────────────────────────────────┐                     │
│  │  Chunk Reader                        │                     │
│  │  - Decompress Gorilla chunks         │                     │
│  │  - Filter by time range              │                     │
│  │  - Return samples                    │                     │
│  └─────────────────────────────────────┘                     │
└──────────────────────────────────────────────────────────────┘

5.5 HTTP Query API

5.5.1 Instant Query

GET /api/v1/query?query=<promql>&time=<timestamp>&timeout=<duration>

Parameters:

  • query: PromQL expression (required)
  • time: Unix timestamp (optional, default: now)
  • timeout: Query timeout (optional, default: 30s)

Response (JSON):

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "http_requests_total",
          "method": "GET",
          "status": "200"
        },
        "value": [1733832000, "1543"]
      }
    ]
  }
}

5.5.2 Range Query

GET /api/v1/query_range?query=<promql>&start=<timestamp>&end=<timestamp>&step=<duration>

Parameters:

  • query: PromQL expression (required)
  • start: Start timestamp (required)
  • end: End timestamp (required)
  • step: Query resolution step (required, e.g., "15s")
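
A range query is simply the instant query evaluated at each step timestamp: start, start+step, ..., up to and including end. A sketch of the step expansion:

```rust
/// Expand a range query's (start, end, step) into the list of
/// evaluation timestamps, all in milliseconds.
fn evaluation_timestamps(start: i64, end: i64, step_ms: i64) -> Vec<i64> {
    assert!(step_ms > 0);
    (0i64..)
        .map(|i| start + i * step_ms)
        .take_while(|ts| *ts <= end)
        .collect()
}
```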

Response (JSON):

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "http_requests_total",
          "method": "GET"
        },
        "values": [
          [1733832000, "1543"],
          [1733832015, "1556"],
          [1733832030, "1570"]
        ]
      }
    ]
  }
}

5.5.3 Label Values Query

GET /api/v1/label/<label_name>/values?match[]=<matcher>

Example:

GET /api/v1/label/method/values?match[]=http_requests_total

Response:

{
  "status": "success",
  "data": ["GET", "POST", "PUT", "DELETE"]
}

5.5.4 Series Metadata Query

GET /api/v1/series?match[]=<matcher>&start=<timestamp>&end=<timestamp>

Example:

GET /api/v1/series?match[]=http_requests_total{method="GET"}

Response:

{
  "status": "success",
  "data": [
    {
      "__name__": "http_requests_total",
      "method": "GET",
      "status": "200",
      "instance": "flaredb-1:9092"
    }
  ]
}

5.6 Performance Optimizations

5.6.1 Query Caching

Cache query results for identical queries:

struct QueryCache {
    cache: Arc<Mutex<LruCache<String, (QueryResult, Instant)>>>,
    ttl: Duration,
}

impl QueryCache {
    fn get(&self, query_hash: &str) -> Option<QueryResult> {
        let cache = self.cache.lock().unwrap();
        if let Some((result, timestamp)) = cache.get(query_hash) {
            if timestamp.elapsed() < self.ttl {
                return Some(result.clone());
            }
        }
        None
    }

    fn put(&self, query_hash: String, result: QueryResult) {
        let mut cache = self.cache.lock().unwrap();
        cache.put(query_hash, (result, Instant::now()));
    }
}

5.6.2 Posting List Intersection

Use efficient algorithms for label matcher intersection:

fn intersect_posting_lists(lists: Vec<&[SeriesID]>) -> Vec<SeriesID> {
    if lists.is_empty() {
        return vec![];
    }

    // Sort lists by length (shortest first for early termination)
    let mut sorted_lists = lists;
    sorted_lists.sort_by_key(|list| list.len());

    // Use shortest list as base, intersect with others
    let mut result: HashSet<SeriesID> = sorted_lists[0].iter().copied().collect();

    for list in &sorted_lists[1..] {
        let list_set: HashSet<SeriesID> = list.iter().copied().collect();
        result.retain(|id| list_set.contains(id));

        if result.is_empty() {
            break;  // Early termination
        }
    }

    result.into_iter().collect()
}
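
Since posting lists are typically stored sorted, a two-pointer merge is an alternative to the HashSet approach above: O(n+m) with no hashing or allocation per element. A sketch for two lists:

```rust
/// Intersect two ascending-sorted posting lists via two-pointer merge.
fn intersect_sorted(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,     // advance the smaller side
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```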

5.6.3 Chunk Pruning

Skip chunks that don't overlap query time range:

fn query_chunks(
    chunks: &[ChunkRef],
    start_time: i64,
    end_time: i64,
) -> Vec<ChunkRef> {
    chunks.iter()
        .filter(|chunk| {
            // Chunk overlaps query range if:
            // chunk.max_time > start AND chunk.min_time < end
            chunk.max_time > start_time && chunk.min_time < end_time
        })
        .copied()
        .collect()
}

6. Storage Backend Architecture

6.1 Architecture Decision: Hybrid Approach

After analyzing trade-offs, Metricstor adopts a hybrid storage architecture:

  1. Dedicated time-series engine for sample storage (optimized for write throughput and compression)
  2. Optional FlareDB integration for metadata and distributed coordination (future work)
  3. Optional S3-compatible backend for cold data archival (future work)

6.2 Decision Rationale

6.2.1 Why NOT Pure FlareDB Backend?

FlareDB Characteristics:

  • General-purpose KV store with Raft consensus
  • Optimized for: Strong consistency, small KV pairs, random access
  • Storage: RocksDB (LSM tree)

Time-Series Workload Characteristics:

  • High write throughput (100K samples/sec)
  • Sequential writes (append-only)
  • Temporal locality (queries focus on recent data)
  • Bulk reads (range scans over time windows)

Mismatch Analysis:

| Aspect        | FlareDB (KV)                       | Time-Series Engine              |
|---------------|------------------------------------|---------------------------------|
| Write pattern | Random writes, compaction overhead | Append-only, minimal overhead   |
| Compression   | Generic LZ4/Snappy                 | Domain-specific (Gorilla: ~12x) |
| Read pattern  | Point lookups                      | Range scans over time           |
| Indexing      | Key-based                          | Label-based inverted index      |
| Consistency   | Strong (Raft)                      | Eventual OK for metrics         |

Conclusion: Using FlareDB for sample storage would cost roughly 5-10x in write throughput and about 10x in compression efficiency compared to a dedicated engine.
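
The "Gorilla" compression advantage rests largely on delta-of-delta timestamp encoding: scrape intervals are nearly constant, so second-order deltas are almost always zero and pack into very few bits. This sketch computes the delta-of-delta values only; real Gorilla additionally packs them into variable-width bit fields and XOR-compresses the float values:

```rust
/// Compute delta-of-delta values for a sorted timestamp sequence.
/// With a steady scrape interval most outputs are 0, which is why
/// Gorilla-style encoding compresses so well.
fn delta_of_deltas(timestamps: &[i64]) -> Vec<i64> {
    timestamps
        .windows(2)
        .map(|w| w[1] - w[0])        // first-order deltas
        .collect::<Vec<_>>()
        .windows(2)
        .map(|w| w[1] - w[0])        // second-order deltas
        .collect()
}
```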

6.2.2 Why NOT VictoriaMetrics Binary?

VictoriaMetrics is written in Go and has excellent performance, but:

  • mTLS support is paid only (violates PROJECT.md requirement)
  • Not Rust (violates PROJECT.md "Rustで書く")
  • Cannot integrate with FlareDB for metadata (future requirement)
  • Less control over storage format and optimizations

6.2.3 Why Hybrid (Dedicated + Optional FlareDB)?

Phase 1 (T033 v1): Pure dedicated engine

  • Simple, single-instance deployment
  • Focus on core functionality (ingest + query)
  • Local disk storage only

Phase 2 (Future): Add FlareDB for metadata

  • Store series labels and metadata in FlareDB "metrics" namespace
  • Enables multi-instance coordination
  • Global view of series cardinality, label values
  • Samples still in dedicated engine (local disk)

Phase 3 (Future): Add S3 for cold storage

  • Automatically upload old blocks (>7 days) to S3
  • Query federation across local + S3 blocks
  • Unlimited retention with cost-effective storage

Benefits:

  • v1 simplicity: No FlareDB dependency, easy deployment
  • Future scalability: Metadata in FlareDB, samples distributed
  • Operational flexibility: Can run standalone or integrated

6.3 Storage Layout

6.3.1 Directory Structure

/var/lib/metricstor/
├── data/
│   ├── wal/
│   │   ├── 00000001              # WAL segment
│   │   ├── 00000002
│   │   └── checkpoint.00000002   # WAL checkpoint
│   ├── 01HQZQZQZQZQZQZQZQZQZQ/  # Block (ULID)
│   │   ├── meta.json
│   │   ├── index
│   │   ├── chunks/
│   │   │   ├── 000001
│   │   │   └── 000002
│   │   └── tombstones
│   ├── 01HQZR.../                # Another block
│   └── ...
└── tmp/                          # Temp files for compaction

6.3.2 Metadata Storage (Future: FlareDB Integration)

When FlareDB integration is enabled:

Series Metadata (stored in FlareDB "metrics" namespace):

Key: series:<series_id>
Value: {
  "labels": {"__name__": "http_requests_total", "method": "GET", ...},
  "first_seen": 1733832000000,
  "last_seen": 1733839200000
}

Key: label_index:<label_name>:<label_value>
Value: [series_id1, series_id2, ...]  # Posting list
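A minimal sketch of the key construction for this layout (helper names are illustrative, not a committed API):

```rust
// Keys in the FlareDB "metrics" namespace, as described above.
fn series_key(series_id: u64) -> String {
    format!("series:{series_id}")
}

fn label_index_key(name: &str, value: &str) -> String {
    format!("label_index:{name}:{value}")
}

fn main() {
    assert_eq!(series_key(42), "series:42");
    assert_eq!(label_index_key("method", "GET"), "label_index:method:GET");
}
```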

Benefits:

  • Fast label value lookups across all instances
  • Global series cardinality tracking
  • Distributed query planning (future)

Trade-off: Adds dependency on FlareDB, increases complexity

6.4 Scalability Approach

6.4.1 Vertical Scaling (v1)

Single instance scales to:

  • 10M active series
  • 100K samples/sec write throughput
  • 1K queries/sec

Scaling strategy:

  • Increase memory (more series in Head)
  • Faster disk (NVMe for WAL/blocks)
  • More CPU cores (parallel compaction, query execution)

6.4.2 Horizontal Scaling (Future)

Sharding Strategy (inspired by Prometheus federation + Thanos):

┌────────────────────────────────────────────────────────────┐
│                   Query Frontend                            │
│                   (Query Federation)                        │
└─────┬────────────────────┬─────────────────────┬───────────┘
      │                    │                     │
      ▼                    ▼                     ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Metricstor  │     │ Metricstor  │     │ Metricstor  │
│ Instance 1  │     │ Instance 2  │     │ Instance N  │
│             │     │             │     │             │
│ Hash shard: │     │ Hash shard: │     │ Hash shard: │
│ 0-333       │     │ 334-666     │     │ 667-999     │
└─────────────┘     └─────────────┘     └─────────────┘
      │                    │                     │
      └────────────────────┴─────────────────────┘
                          │
                          ▼
                  ┌───────────────┐
                  │   FlareDB     │
                  │  (Metadata)   │
                  └───────────────┘

Sharding Key: Hash(series_id) % num_shards
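The sharding rule can be sketched as follows; the splitmix64-style bit mix stands in for a real hash (e.g. xxhash) and is an illustrative choice, not a committed design:

```rust
// Route a series to a shard: mix the series ID so sequential IDs spread
// evenly, then take the remainder over the shard count.
fn shard_for(series_id: u64, num_shards: u64) -> u64 {
    // splitmix64 finalizer as the mixing function (illustrative choice).
    let mut x = series_id.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^= x >> 31;
    x % num_shards
}

fn main() {
    let shard = shard_for(42, 3);
    assert!(shard < 3);
    // The same series always routes to the same shard.
    assert_eq!(shard_for(42, 3), shard);
}
```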

Query Execution:

  1. Query frontend receives PromQL query
  2. Determine which shards contain matching series (via FlareDB metadata)
  3. Send subqueries to relevant shards
  4. Merge results (aggregation, deduplication)
  5. Return to client

Challenges (deferred to future work):

  • Rebalancing when adding/removing shards
  • Handling series that span multiple shards (rare)
  • Ensuring query consistency across shards

6.5 S3 Integration Strategy (Future)

Objective: Cost-effective long-term retention (>15 days)

Architecture:

┌───────────────────────────────────────────────────┐
│  Metricstor Server                                 │
│                                                    │
│  ┌──────────┐      ┌──────────┐                  │
│  │  Head    │      │  Blocks  │                  │
│  │  (0-2h)  │      │  (2h-15d)│                  │
│  └──────────┘      └────┬─────┘                  │
│                          │                         │
│                          │ Background uploader     │
│                          ▼                         │
│                   ┌─────────────┐                 │
│                   │  Upload to  │                 │
│                   │  S3 (>7d)   │                 │
│                   └──────┬──────┘                 │
└──────────────────────────┼────────────────────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │  S3 Bucket      │
                  │  /blocks/       │
                  │    01HQZ.../    │
                  │    01HRZ.../    │
                  └─────────────────┘

Workflow:

  1. Block compaction creates local block files
  2. Blocks older than 7 days (configurable) are uploaded to S3
  3. Local block files deleted after successful upload
  4. Query executor checks both local and S3 for blocks in query range
  5. Download S3 blocks on-demand (with local cache)
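The upload-eligibility decision in step 2 can be sketched as a pure function over block metadata (field names here mirror the block `meta.json` informally and are assumptions):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical block metadata; only the newest-sample timestamp matters here.
struct BlockMeta { max_time_ms: i64 }

// A block is eligible for S3 upload once its newest sample is older than
// `upload_after` (the upload_after_days setting).
fn eligible_for_upload(block: &BlockMeta, now: SystemTime, upload_after: Duration) -> bool {
    let now_ms = now.duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    now_ms - block.max_time_ms > upload_after.as_millis() as i64
}

fn main() {
    let now = SystemTime::now();
    let now_ms = now.duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    let old = BlockMeta { max_time_ms: now_ms - 8 * 86_400_000 }; // 8 days old
    let fresh = BlockMeta { max_time_ms: now_ms - 86_400_000 };   // 1 day old
    let week = Duration::from_secs(7 * 86_400);
    assert!(eligible_for_upload(&old, now, week));
    assert!(!eligible_for_upload(&fresh, now, week));
}
```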

Configuration:

[storage.s3]
enabled = true
endpoint = "https://s3.example.com"
bucket = "metricstor-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100

7. Integration Points

7.1 Service Discovery (How Services Push Metrics)

7.1.1 Service Configuration Pattern

Each platform service (FlareDB, ChainFire, etc.) exports Prometheus metrics on ports 9091-9099.

Example (FlareDB metrics exporter):

// flaredb-server/src/main.rs
use metrics_exporter_prometheus::PrometheusBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // ... initialization ...

    let metrics_addr = format!("0.0.0.0:{}", args.metrics_port);
    let builder = PrometheusBuilder::new();
    builder
        .with_http_listener(metrics_addr.parse::<std::net::SocketAddr>()?)
        .install()
        .expect("Failed to install Prometheus metrics exporter");

    info!("Prometheus metrics available at http://{}/metrics", metrics_addr);

    // ... rest of main ...
}

Service Metrics Ports (from T027.S2):

| Service | Port | Endpoint |
|---|---|---|
| ChainFire | 9091 | http://chainfire:9091/metrics |
| FlareDB | 9092 | http://flaredb:9092/metrics |
| PlasmaVMC | 9093 | http://plasmavmc:9093/metrics |
| IAM | 9094 | http://iam:9094/metrics |
| LightningSTOR | 9095 | http://lightningstor:9095/metrics |
| FlashDNS | 9096 | http://flashdns:9096/metrics |
| FiberLB | 9097 | http://fiberlb:9097/metrics |
| Novanet | 9098 | http://novanet:9098/metrics |

7.1.2 Scrape-to-Push Adapter

Since Metricstor is push-based but services export pull-based Prometheus /metrics endpoints, we need a scrape-to-push adapter.

Option 1: Prometheus Agent Mode + Remote Write

Deploy Prometheus in agent mode (no storage, only scraping):

# prometheus-agent.yaml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'cloud-platform'

scrape_configs:
  - job_name: 'chainfire'
    static_configs:
      - targets: ['chainfire:9091']

  - job_name: 'flaredb'
    static_configs:
      - targets: ['flaredb:9092']

  # ... other services ...

remote_write:
  - url: 'https://metricstor:8080/api/v1/write'
    tls_config:
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
      ca_file: /etc/certs/ca.crt

Option 2: Custom Rust Scraper (Platform-Native)

Build a lightweight scraper in Rust that integrates with Metricstor:

// metricstor-scraper/src/main.rs

struct Scraper {
    targets: Vec<ScrapeTarget>,
    client: reqwest::Client,
    metricstor_client: MetricstorClient,
}

struct ScrapeTarget {
    job_name: String,
    url: String,
    interval: Duration,
}

impl Scraper {
    async fn scrape_loop(&self) {
        loop {
            for target in &self.targets {
                let result = self.scrape_target(target).await;
                match result {
                    Ok(samples) => {
                        if let Err(e) = self.metricstor_client.write(samples).await {
                            error!("Failed to write to Metricstor: {}", e);
                        }
                    }
                    Err(e) => {
                        error!("Failed to scrape {}: {}", target.url, e);
                    }
                }
            }
            tokio::time::sleep(Duration::from_secs(15)).await;
        }
    }

    async fn scrape_target(&self, target: &ScrapeTarget) -> Result<Vec<Sample>> {
        let response = self.client.get(&target.url).send().await?;
        let body = response.text().await?;

        // Parse Prometheus text format
        let samples = parse_prometheus_text(&body, &target.job_name)?;
        Ok(samples)
    }
}

fn parse_prometheus_text(text: &str, job: &str) -> Result<Vec<Sample>> {
    // Use the prometheus-parse crate, or implement a simple line parser.
    // A job="..." label is attached to every parsed sample, e.g.:
    // http_requests_total{method="GET",status="200",job="flaredb"} 1543 1733832000000
    let _ = (text, job);
    todo!("parse the Prometheus text exposition format into samples")
}
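A minimal, std-only sketch of the parse step, handling only the `name{labels} value timestamp` line shape (no escaping, no `# HELP`/`# TYPE` metadata, no job-label injection):

```rust
// Parse one exposition-format line into (metric-with-labels, value, timestamp).
fn parse_line(line: &str) -> Option<(String, f64, Option<i64>)> {
    if line.starts_with('#') || line.trim().is_empty() {
        return None; // comments and blank lines carry no samples
    }
    // The metric part ends at the closing '}' (or first space for bare names).
    let split_at = line
        .rfind('}')
        .map(|i| i + 1)
        .unwrap_or_else(|| line.find(' ').unwrap_or(0));
    let (metric, rest) = line.split_at(split_at);
    let mut parts = rest.split_whitespace();
    let value: f64 = parts.next()?.parse().ok()?;
    let ts: Option<i64> = parts.next().and_then(|t| t.parse().ok());
    Some((metric.to_string(), value, ts))
}

fn main() {
    let (metric, value, ts) =
        parse_line(r#"http_requests_total{method="GET",status="200"} 1543 1733832000000"#)
            .unwrap();
    assert_eq!(metric, r#"http_requests_total{method="GET",status="200"}"#);
    assert_eq!(value, 1543.0);
    assert_eq!(ts, Some(1733832000000));
    assert!(parse_line("# HELP http_requests_total ...").is_none());
}
```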

Deployment:

  • metricstor-scraper runs as a sidecar or separate service
  • Reads scrape config from TOML file
  • Uses mTLS to push to Metricstor

Recommendation: Option 2 (custom scraper) for consistency with platform philosophy (100% Rust, no external dependencies).

7.2 mTLS Configuration (T027/T031 Patterns)

7.2.1 TLS Config Structure

Following existing patterns (FlareDB, ChainFire, IAM):

# metricstor.toml

[server]
addr = "0.0.0.0:8080"
log_level = "info"

[server.tls]
cert_file = "/etc/metricstor/certs/server.crt"
key_file = "/etc/metricstor/certs/server.key"
ca_file = "/etc/metricstor/certs/ca.crt"
require_client_cert = true  # Enable mTLS

Rust Config Struct:

// metricstor-server/src/config.rs

use serde::{Deserialize, Serialize};
use std::net::SocketAddr;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServerConfig {
    pub server: ServerSettings,
    pub storage: StorageConfig,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServerSettings {
    pub addr: SocketAddr,
    pub log_level: String,
    pub tls: Option<TlsConfig>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TlsConfig {
    pub cert_file: String,
    pub key_file: String,
    pub ca_file: Option<String>,
    #[serde(default)]
    pub require_client_cert: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StorageConfig {
    pub data_dir: String,
    pub retention_days: u32,
    pub wal_segment_size_mb: usize,
    // ... other storage settings
}

7.2.2 mTLS Server Setup

// metricstor-server/src/main.rs

use axum::Router;
use axum_server::tls_rustls::RustlsConfig;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    let config = ServerConfig::load("metricstor.toml")?;

    // Build router
    let app = Router::new()
        .route("/api/v1/write", post(handle_remote_write))
        .route("/api/v1/query", get(handle_instant_query))
        .route("/api/v1/query_range", get(handle_range_query))
        .route("/health", get(health_check))
        .route("/ready", get(readiness_check))
        .with_state(Arc::new(service));

    // Setup TLS if configured
    if let Some(tls_config) = &config.server.tls {
        info!("TLS enabled, loading certificates...");

        let rustls_config = if tls_config.require_client_cert {
            info!("mTLS enabled, requiring client certificates");

            let ca_cert_pem = tokio::fs::read_to_string(
                tls_config.ca_file.as_ref().ok_or("ca_file required for mTLS")?
            ).await?;

            // axum-server's RustlsConfig does not expose client-cert verification
            // directly. build_mtls_config (hypothetical helper) constructs a
            // rustls::ServerConfig with a client certificate verifier built from
            // the CA roots, then wraps it in a RustlsConfig.
            build_mtls_config(&tls_config.cert_file, &tls_config.key_file, &ca_cert_pem).await?
        } else {
            info!("TLS-only mode, client certificates not required");
            RustlsConfig::from_pem_file(
                &tls_config.cert_file,
                &tls_config.key_file,
            ).await?
        };

        axum_server::bind_rustls(config.server.addr, rustls_config)
            .serve(app.into_make_service())
            .await?;
    } else {
        info!("TLS disabled, running in plain-text mode");
        axum_server::bind(config.server.addr)
            .serve(app.into_make_service())
            .await?;
    }

    Ok(())
}

7.2.3 Client Certificate Validation

Extract client identity from mTLS certificate:

use axum::{
    extract::Request,
    middleware::Next,
    response::Response,
    Extension,
};

#[derive(Clone, Debug)]
struct ClientIdentity {
    common_name: String,
    organization: String,
}

// axum 0.7 middleware signature: Request and Next are no longer generic
// over the body type.
async fn extract_client_identity(
    Extension(client_cert): Extension<Option<rustls::Certificate>>,
    mut request: Request,
    next: Next,
) -> Response {
    if let Some(cert) = client_cert {
        // Parse the DER certificate (e.g. with x509-parser) to extract CN, O, etc.
        let identity = parse_certificate(&cert);
        request.extensions_mut().insert(identity);
    }

    next.run(request).await
}

// Use identity for rate limiting, audit logging, etc.
async fn handle_remote_write(
    Extension(identity): Extension<ClientIdentity>,
    State(service): State<Arc<IngestionService>>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    info!("Write request from: {}", identity.common_name);

    // Apply per-client rate limiting
    if !service.rate_limiter.allow(&identity.common_name) {
        return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
    }

    // ... rest of handler ...
}
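A per-client token bucket that could back the `rate_limiter.allow` call above. This is a sketch: the real limiter would need interior mutability (e.g. `DashMap` plus atomics) to be callable through the shared `&self` handler state.

```rust
use std::collections::HashMap;
use std::time::Instant;

struct TokenBucket {
    tokens: f64,
    last: Instant,
}

struct RateLimiter {
    rate: f64,  // tokens refilled per second
    burst: f64, // bucket capacity
    buckets: HashMap<String, TokenBucket>,
}

impl RateLimiter {
    fn new(rate: f64, burst: f64) -> Self {
        Self { rate, burst, buckets: HashMap::new() }
    }

    fn allow(&mut self, client: &str) -> bool {
        let (rate, burst) = (self.rate, self.burst);
        let now = Instant::now();
        let b = self
            .buckets
            .entry(client.to_string())
            .or_insert(TokenBucket { tokens: burst, last: now });
        // Refill proportionally to elapsed time, capped at burst size.
        b.tokens = (b.tokens + now.duration_since(b.last).as_secs_f64() * rate).min(burst);
        b.last = now;
        if b.tokens >= 1.0 {
            b.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut rl = RateLimiter::new(1.0, 2.0);
    assert!(rl.allow("flaredb"));
    assert!(rl.allow("flaredb"));
    assert!(!rl.allow("flaredb")); // burst of 2 exhausted
    assert!(rl.allow("chainfire")); // independent bucket per client
}
```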

7.3 gRPC API Design

While HTTP is the primary interface (Prometheus compatibility), a gRPC API can provide:

  • Better performance for internal services
  • Streaming support for batch ingestion
  • Type-safe client libraries

Proto Definition:

// proto/metricstor.proto

syntax = "proto3";

package metricstor.v1;

service MetricstorService {
  // Write samples
  rpc Write(WriteRequest) returns (WriteResponse);

  // Streaming write for high-throughput scenarios
  rpc WriteStream(stream WriteRequest) returns (WriteResponse);

  // Query (instant)
  rpc Query(QueryRequest) returns (QueryResponse);

  // Query (range)
  rpc QueryRange(QueryRangeRequest) returns (QueryRangeResponse);

  // Admin operations
  rpc Compact(CompactRequest) returns (CompactResponse);
  rpc DeleteSeries(DeleteSeriesRequest) returns (DeleteSeriesResponse);
}

message WriteRequest {
  repeated TimeSeries timeseries = 1;
}

message WriteResponse {
  uint64 samples_ingested = 1;
  uint64 samples_rejected = 2;
}

message QueryRequest {
  string query = 1;  // PromQL
  int64 time = 2;    // Unix timestamp (ms)
  int64 timeout_ms = 3;
}

message QueryResponse {
  string result_type = 1;  // "vector" or "matrix"
  repeated InstantVectorSample vector = 2;
  repeated RangeVectorSeries matrix = 3;
}

message InstantVectorSample {
  map<string, string> labels = 1;
  double value = 2;
  int64 timestamp = 3;
}

message RangeVectorSeries {
  map<string, string> labels = 1;
  repeated Sample samples = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2;
}

7.4 NixOS Module Integration

Following T024 patterns, create a NixOS module for Metricstor.

File: nix/modules/metricstor.nix

{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.metricstor;

  configFile = pkgs.writeText "metricstor.toml" ''
    [server]
    addr = "${cfg.listenAddress}"
    log_level = "${cfg.logLevel}"

    ${optionalString (cfg.tls.enable) ''
    [server.tls]
    cert_file = "${cfg.tls.certFile}"
    key_file = "${cfg.tls.keyFile}"
    ${optionalString (cfg.tls.caFile != null) ''
    ca_file = "${cfg.tls.caFile}"
    ''}
    require_client_cert = ${boolToString cfg.tls.requireClientCert}
    ''}

    [storage]
    data_dir = "${cfg.dataDir}"
    retention_days = ${toString cfg.storage.retentionDays}
    wal_segment_size_mb = ${toString cfg.storage.walSegmentSizeMb}
  '';

in {
  options.services.metricstor = {
    enable = mkEnableOption "Metricstor metrics storage service";

    package = mkOption {
      type = types.package;
      default = pkgs.metricstor;
      description = "Metricstor package to use";
    };

    listenAddress = mkOption {
      type = types.str;
      default = "0.0.0.0:8080";
      description = "Address and port to listen on";
    };

    logLevel = mkOption {
      type = types.enum [ "trace" "debug" "info" "warn" "error" ];
      default = "info";
      description = "Log level";
    };

    dataDir = mkOption {
      type = types.path;
      default = "/var/lib/metricstor";
      description = "Data directory for TSDB storage";
    };

    openFirewall = mkOption {
      type = types.bool;
      default = false;
      description = "Open the firewall for the listen port";
    };

    tls = {
      enable = mkEnableOption "TLS encryption";

      certFile = mkOption {
        type = types.str;
        description = "Path to TLS certificate file";
      };

      keyFile = mkOption {
        type = types.str;
        description = "Path to TLS private key file";
      };

      caFile = mkOption {
        type = types.nullOr types.str;
        default = null;
        description = "Path to CA certificate for client verification (mTLS)";
      };

      requireClientCert = mkOption {
        type = types.bool;
        default = false;
        description = "Require client certificates (mTLS)";
      };
    };

    storage = {
      retentionDays = mkOption {
        type = types.ints.positive;
        default = 15;
        description = "Data retention period in days";
      };

      walSegmentSizeMb = mkOption {
        type = types.ints.positive;
        default = 128;
        description = "WAL segment size in MB";
      };
    };
  };

  config = mkIf cfg.enable {
    systemd.services.metricstor = {
      description = "Metricstor Metrics Storage Service";
      wantedBy = [ "multi-user.target" ];
      after = [ "network.target" ];

      serviceConfig = {
        Type = "simple";
        ExecStart = "${cfg.package}/bin/metricstor-server --config ${configFile}";
        Restart = "on-failure";
        RestartSec = "5s";

        # Security hardening
        DynamicUser = true;
        StateDirectory = "metricstor";
        ProtectSystem = "strict";
        ProtectHome = true;
        PrivateTmp = true;
        NoNewPrivileges = true;
      };
    };

    # Expose metrics endpoint
    networking.firewall.allowedTCPPorts = mkIf cfg.openFirewall [ 8080 ];
  };
}

Usage Example (in NixOS configuration):

{
  services.metricstor = {
    enable = true;
    listenAddress = "0.0.0.0:8080";
    logLevel = "info";

    tls = {
      enable = true;
      certFile = "/etc/certs/metricstor-server.crt";
      keyFile = "/etc/certs/metricstor-server.key";
      caFile = "/etc/certs/ca.crt";
      requireClientCert = true;
    };

    storage = {
      retentionDays = 30;
    };
  };
}

8. Implementation Plan

8.1 Step Breakdown (S1-S6)

The implementation follows a phased approach aligned with the task.yaml steps.

S1: Research & Architecture (Current Document)

Deliverable: This design document

Status: Completed


S2: Workspace Scaffold

Goal: Create metricstor workspace with skeleton structure

Tasks:

  1. Create workspace structure:

    metricstor/
    ├── Cargo.toml
    ├── crates/
    │   ├── metricstor-api/          # Client library
    │   ├── metricstor-server/       # Main service
    │   └── metricstor-types/        # Shared types
    ├── proto/
    │   ├── remote_write.proto
    │   └── metricstor.proto
    └── README.md
    
  2. Setup proto compilation in build.rs

  3. Define core types:

    // metricstor-types/src/lib.rs
    
    pub type SeriesID = u64;
    pub type Timestamp = i64;  // Unix timestamp in milliseconds
    
    pub struct Sample {
        pub timestamp: Timestamp,
        pub value: f64,
    }
    
    pub struct Series {
        pub id: SeriesID,
        pub labels: BTreeMap<String, String>,
    }
    
    pub struct LabelMatcher {
        pub name: String,
        pub value: String,
        pub op: MatchOp,
    }
    
    pub enum MatchOp {
        Equal,
        NotEqual,
        RegexMatch,
        RegexNotMatch,
    }
    
  4. Add dependencies:

    [workspace.dependencies]
    # Core
    tokio = { version = "1.35", features = ["full"] }
    anyhow = "1.0"
    tracing = "0.1"
    tracing-subscriber = "0.3"
    
    # Serialization
    serde = { version = "1.0", features = ["derive"] }
    serde_json = "1.0"
    toml = "0.8"
    
    # gRPC
    tonic = "0.10"
    prost = "0.12"
    prost-types = "0.12"
    
    # HTTP
    axum = "0.7"
    axum-server = { version = "0.6", features = ["tls-rustls"] }
    
    # Compression
    snap = "1.1"  # Snappy
    
    # Time-series
    promql-parser = "0.4"
    
    # Storage
    rocksdb = "0.21"  # optional tooling/examples only; NOT used by the TSDB engine
    
    # Crypto
    rustls = "0.21"
    

Estimated Effort: 2 days


S3: Push Ingestion

Goal: Implement Prometheus remote_write compatible ingestion endpoint

Tasks:

  1. Implement WAL:

    // metricstor-server/src/wal.rs
    
    struct WAL {
        dir: PathBuf,
        segment_size: usize,
        active_segment: RwLock<WALSegment>,
    }
    
    impl WAL {
        fn new(dir: PathBuf, segment_size: usize) -> Result<Self>;
        fn append(&self, record: WALRecord) -> Result<()>;
        fn replay(&self) -> Result<Vec<WALRecord>>;
        fn checkpoint(&self, min_segment: u64) -> Result<()>;
    }
    
  2. Implement In-Memory Head Block:

    // metricstor-server/src/head.rs
    
    struct Head {
        series: DashMap<SeriesID, Arc<Series>>,  // Concurrent HashMap
        min_time: AtomicI64,
        max_time: AtomicI64,
        config: HeadConfig,
    }
    
    impl Head {
        async fn append(&self, series_id: SeriesID, labels: Labels, ts: Timestamp, value: f64) -> Result<()>;
        async fn get(&self, series_id: SeriesID) -> Option<Arc<Series>>;
        async fn series_count(&self) -> usize;
    }
    
  3. Implement Gorilla Compression (basic version):

    // metricstor-server/src/compression.rs
    
    struct GorillaEncoder { /* ... */ }
    struct GorillaDecoder { /* ... */ }
    
    impl GorillaEncoder {
        fn encode_timestamp(&mut self, ts: i64) -> Result<()>;
        fn encode_value(&mut self, value: f64) -> Result<()>;
        fn finish(self) -> Vec<u8>;
    }
    
  4. Implement HTTP Ingestion Handler:

    // metricstor-server/src/handlers/ingest.rs
    
    async fn handle_remote_write(
        State(service): State<Arc<IngestionService>>,
        body: Bytes,
    ) -> Result<StatusCode, (StatusCode, String)> {
        // 1. Decompress Snappy
        // 2. Decode protobuf
        // 3. Validate samples
        // 4. Append to WAL
        // 5. Insert into Head
        // 6. Return 204 No Content
    }
    
  5. Add Rate Limiting:

    struct RateLimiter {
        rate: f64,  // samples/sec
        tokens: AtomicU64,
    }
    
    impl RateLimiter {
        fn allow(&self) -> bool;
    }
    
  6. Integration Test:

    #[tokio::test]
    async fn test_remote_write_ingestion() {
        // Start server
        // Send WriteRequest
        // Verify samples stored
    }
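As background for the Gorilla encoder in task 3: value compression XORs consecutive f64 bit patterns, so repeated values compress to a single bit and nearby values yield deltas with long runs of leading zeros.

```rust
// Gorilla value compression XORs each f64's bits with the previous value's
// bits; identical values XOR to zero and cost one bit on disk.
fn xor_delta(prev: f64, curr: f64) -> u64 {
    prev.to_bits() ^ curr.to_bits()
}

fn main() {
    // Repeated values (common for gauges) produce a zero delta.
    assert_eq!(xor_delta(42.0, 42.0), 0);
    // Close values share sign/exponent bits, so the delta has many leading
    // zeros, which the encoder exploits by storing only the meaningful bits.
    let d = xor_delta(42.0, 42.5);
    assert!(d.leading_zeros() >= 10);
}
```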
    

Estimated Effort: 5 days


S4: PromQL Query Engine

Goal: Basic PromQL query support (instant + range queries)

Tasks:

  1. Integrate promql-parser:

    // metricstor-server/src/query/parser.rs
    
    use promql_parser::parser;
    
    pub fn parse(query: &str) -> Result<parser::Expr> {
        parser::parse(query).map_err(|e| Error::ParseError(e.to_string()))
    }
    
  2. Implement Query Planner:

    // metricstor-server/src/query/planner.rs
    
    pub enum QueryPlan {
        VectorSelector { matchers: Vec<LabelMatcher>, timestamp: i64 },
        MatrixSelector { matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64 },
        Aggregate { op: AggregateOp, input: Box<QueryPlan>, grouping: Vec<String> },
        RateFunc { input: Box<QueryPlan> },
        // ... other operators
    }
    
    pub fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan>;
    
  3. Implement Label Index:

    // metricstor-server/src/index.rs
    
    struct LabelIndex {
        // label_name -> label_value -> [series_ids]
        inverted_index: DashMap<String, DashMap<String, Vec<SeriesID>>>,
    }
    
    impl LabelIndex {
        fn find_series(&self, matchers: &[LabelMatcher]) -> Result<Vec<SeriesID>>;
        fn add_series(&self, series_id: SeriesID, labels: &Labels);
    }
    
  4. Implement Query Executor:

    // metricstor-server/src/query/executor.rs
    
    struct QueryExecutor {
        head: Arc<Head>,
        blocks: Arc<BlockManager>,
        index: Arc<LabelIndex>,
    }
    
    impl QueryExecutor {
        async fn execute(&self, plan: QueryPlan) -> Result<QueryResult>;
    
        async fn execute_vector_selector(&self, matchers: Vec<LabelMatcher>, ts: i64) -> Result<InstantVector>;
        async fn execute_matrix_selector(&self, matchers: Vec<LabelMatcher>, range: Duration, ts: i64) -> Result<RangeVector>;
    
        fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector>;
        fn apply_aggregate(&self, op: AggregateOp, vector: InstantVector, grouping: Vec<String>) -> Result<InstantVector>;
    }
    
  5. Implement HTTP Query Handlers:

    // metricstor-server/src/handlers/query.rs
    
    async fn handle_instant_query(
        Query(params): Query<QueryParams>,
        State(executor): State<Arc<QueryExecutor>>,
    ) -> Result<Json<QueryResponse>, (StatusCode, String)> {
        let expr = parse(&params.query)?;
        let plan = plan(expr, params.time.unwrap_or_else(now))?;
        let result = executor.execute(plan).await?;
        Ok(Json(format_response(result)))
    }
    
    async fn handle_range_query(
        Query(params): Query<RangeQueryParams>,
        State(executor): State<Arc<QueryExecutor>>,
    ) -> Result<Json<QueryResponse>, (StatusCode, String)> {
        // Similar to instant query, but iterate over [start, end] with step
    }
    
  6. Integration Test:

    #[tokio::test]
    async fn test_instant_query() {
        // Ingest samples
        // Query: http_requests_total{method="GET"}
        // Verify results
    }
    
    #[tokio::test]
    async fn test_range_query_with_rate() {
        // Ingest counter samples
        // Query: rate(http_requests_total[5m])
        // Verify rate calculation
    }
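As a reference for `apply_rate` in task 4, the core of the calculation is per-second change over the window. This sketch omits what real PromQL `rate()` also does: counter-reset handling and extrapolation to the window edges.

```rust
// Simplified rate(): (last - first) / elapsed seconds over the window's
// samples, for the monotonic (no counter reset) case only.
struct Sample {
    timestamp_ms: i64,
    value: f64,
}

fn simple_rate(samples: &[Sample]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    let dt = (last.timestamp_ms - first.timestamp_ms) as f64 / 1000.0;
    if dt <= 0.0 {
        return None; // need at least two samples spanning time
    }
    Some((last.value - first.value) / dt)
}

fn main() {
    let samples = vec![
        Sample { timestamp_ms: 0, value: 100.0 },
        Sample { timestamp_ms: 60_000, value: 160.0 },
    ];
    // 60 increments over 60 seconds -> 1.0 per second.
    assert_eq!(simple_rate(&samples), Some(1.0));
    assert_eq!(simple_rate(&[]), None);
}
```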
    

Estimated Effort: 7 days


S5: Storage Layer

Goal: Time-series storage with retention and compaction

Tasks:

  1. Implement Block Writer:

    // metricstor-server/src/block/writer.rs
    
    struct BlockWriter {
        block_dir: PathBuf,
        index_writer: IndexWriter,
        chunk_writer: ChunkWriter,
    }
    
    impl BlockWriter {
        fn new(block_dir: PathBuf) -> Result<Self>;
        fn write_series(&mut self, series: &Series, samples: &[Sample]) -> Result<()>;
        fn finalize(self) -> Result<BlockMeta>;
    }
    
  2. Implement Block Reader:

    // metricstor-server/src/block/reader.rs
    
    struct BlockReader {
        meta: BlockMeta,
        index: Index,
        chunks: ChunkReader,
    }
    
    impl BlockReader {
        fn open(block_dir: PathBuf) -> Result<Self>;
        fn query_samples(&self, series_id: SeriesID, start: i64, end: i64) -> Result<Vec<Sample>>;
    }
    
  3. Implement Compaction:

    // metricstor-server/src/compaction.rs
    
    struct Compactor {
        data_dir: PathBuf,
        config: CompactionConfig,
    }
    
    impl Compactor {
        async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID>;
        async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID>;
        async fn run_compaction_loop(&self);  // Background task
    }
    
  4. Implement Retention Enforcement:

    impl Compactor {
        async fn enforce_retention(&self, retention: Duration) -> Result<()> {
            let cutoff = SystemTime::now() - retention;
            // Delete blocks older than cutoff
        }
    }
    
  5. Implement Block Manager:

    // metricstor-server/src/block/manager.rs
    
    struct BlockManager {
        blocks: RwLock<Vec<Arc<BlockReader>>>,
        data_dir: PathBuf,
    }
    
    impl BlockManager {
        fn load_blocks(&mut self) -> Result<()>;
        fn add_block(&mut self, block: BlockReader);
        fn remove_block(&mut self, block_id: &BlockID);
        fn query_blocks(&self, start: i64, end: i64) -> Vec<Arc<BlockReader>>;
    }
    
  6. Integration Test:

    #[tokio::test]
    async fn test_compaction() {
        // Ingest data for >2h
        // Trigger compaction
        // Verify block created
        // Query old data from block
    }
    
    #[tokio::test]
    async fn test_retention() {
        // Create old blocks
        // Run retention enforcement
        // Verify old blocks deleted
    }
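The retention decision in task 4 reduces to a predicate over block metadata (the field name here mirrors `meta.json` informally and is an assumption): a block is deleted only when every sample in it is older than the cutoff.

```rust
// A block is expired when its newest sample predates now - retention.
struct BlockMeta {
    max_time_ms: i64,
}

fn expired(block: &BlockMeta, now_ms: i64, retention_ms: i64) -> bool {
    block.max_time_ms < now_ms - retention_ms
}

fn main() {
    let now_ms = 1_733_832_000_000;
    let retention_ms = 15 * 86_400_000; // 15 days
    let old = BlockMeta { max_time_ms: now_ms - 16 * 86_400_000 };
    let live = BlockMeta { max_time_ms: now_ms - 86_400_000 };
    assert!(expired(&old, now_ms, retention_ms));
    assert!(!expired(&live, now_ms, retention_ms));
}
```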
    

Estimated Effort: 8 days


S6: Integration & Documentation

Goal: NixOS module, TLS config, integration tests, operator docs

Tasks:

  1. Create NixOS Module:

    • File: nix/modules/metricstor.nix
    • Follow T024 patterns
    • Include systemd service, firewall rules
    • Support TLS configuration options
  2. Implement mTLS:

    • Load certs in server startup
    • Configure Rustls with client cert verification
    • Extract client identity for rate limiting
  3. Create Metricstor Scraper:

    • Standalone scraper service
    • Reads scrape config (TOML)
    • Scrapes /metrics endpoints from services
    • Pushes to Metricstor via remote_write
  4. Integration Tests:

    #[tokio::test]
    async fn test_e2e_ingest_and_query() {
        // Start Metricstor server
        // Ingest samples via remote_write
        // Query via /api/v1/query
        // Query via /api/v1/query_range
        // Verify results match
    }
    
    #[tokio::test]
    async fn test_mtls_authentication() {
        // Start server with mTLS
        // Connect without client cert -> rejected
        // Connect with valid client cert -> accepted
    }
    
    #[tokio::test]
    async fn test_grafana_compatibility() {
        // Configure Grafana to use Metricstor
        // Execute sample queries
        // Verify dashboards render correctly
    }
    
  5. Write Operator Documentation:

    • File: docs/por/T033-metricstor/OPERATOR.md
    • Installation (NixOS, standalone)
    • Configuration guide
    • mTLS setup
    • Scraper configuration
    • Troubleshooting
    • Performance tuning
  6. Write Developer Documentation:

    • File: metricstor/README.md
    • Architecture overview
    • Building from source
    • Running tests
    • Contributing guidelines

Estimated Effort: 5 days


8.2 Dependency Ordering

S1 (Research) → S2 (Scaffold)
                    ↓
                S3 (Ingestion) ──┐
                    ↓             │
                S4 (Query)        │
                    ↓             │
                S5 (Storage) ←────┘
                    ↓
                S6 (Integration)

Critical Path: S1 → S2 → S3 → S5 → S6
Parallelizable: S4 can start after S3 completes basic ingestion

8.3 Total Effort Estimate

| Step | Effort | Priority |
|---|---|---|
| S1: Research | 2 days | P0 |
| S2: Scaffold | 2 days | P0 |
| S3: Ingestion | 5 days | P0 |
| S4: Query Engine | 7 days | P0 |
| S5: Storage Layer | 8 days | P1 |
| S6: Integration | 5 days | P1 |
| **Total** | **29 days** | |

Realistic Timeline: 6-8 weeks (accounting for testing, debugging, documentation)


9. Open Questions

9.1 Decisions Requiring User Input

#### Q1: Scraper Implementation Choice

**Question:** Should we use Prometheus in agent mode or build a custom Rust scraper?

**Option A: Prometheus Agent + Remote Write**

- Pros: battle-tested, standard tool, no implementation effort
- Cons: adds a Go dependency, less platform integration

**Option B: Custom Rust Scraper**

- Pros: 100% Rust, platform consistency, easier integration
- Cons: implementation effort, needs testing

**Recommendation:** Option B (custom scraper), for consistency with the PROJECT.md philosophy

**Decision:** [ ] A [ ] B [ ] Defer to later


#### Q2: gRPC vs HTTP Priority

**Question:** Should we prioritize a gRPC API or focus only on HTTP (Prometheus compatibility)?

**Option A: HTTP only (v1)**

- Pros: simpler; Prometheus/Grafana compatibility is sufficient
- Cons: less efficient for internal services

**Option B: Both HTTP and gRPC (v1)**

- Pros: better performance for internal services, more flexibility
- Cons: more implementation effort

**Recommendation:** Option A for v1; add gRPC in v2 if needed

**Decision:** [ ] A [ ] B


#### Q3: FlareDB Metadata Integration Timeline

**Question:** When should we integrate FlareDB for metadata storage?

**Option A: v1 (T033)**

- Pros: unified metadata story from the start
- Cons: increases complexity, adds a dependency

**Option B: v2 (Future)**

- Pros: simpler v1; can deploy standalone
- Cons: migration effort later

**Recommendation:** Option B (defer to v2)

**Decision:** [ ] A [ ] B


#### Q4: S3 Cold Storage Priority

**Question:** Should S3 cold storage be part of v1 or deferred?

**Option A: v1 (T033.S5)**

- Pros: unlimited retention from day 1
- Cons: complexity, operational overhead

**Option B: v2 (Future)**

- Pros: simpler v1; focus on core functionality
- Cons: limited retention (local disk only)

**Recommendation:** Option B (defer to v2); use local disk for v1 with 15-30 day retention

**Decision:** [ ] A [ ] B


### 9.2 Areas Needing Further Investigation

#### I1: PromQL Function Coverage

**Issue:** Need to determine the exact subset of PromQL functions to support in v1.

**Investigation Needed:**

- Survey existing Grafana dashboards in use
- Identify the most common functions (rate, increase, histogram_quantile, etc.)
- Prioritize by usage frequency

**Proposed Approach:**

- Analyze 10-20 sample dashboards
- Create a coverage matrix
- Implement the most-used ~80% of functions first

#### I2: Query Performance Benchmarking

**Issue:** Need to validate that the query latency targets (p95 < 100ms) are achievable.

**Investigation Needed:**

- Benchmark promql-parser crate performance
- Measure Gorilla decompression throughput
- Test index lookup performance at 10M series scale

**Proposed Approach:**

- Create a benchmark suite with synthetic data (1M, 10M series)
- Measure end-to-end query latency
- Identify bottlenecks and optimize
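The Gorilla throughput question hinges on how well the XOR value encoding behaves on our data. A minimal sketch of the principle (not Metricstor's actual chunk format, and ignoring the scheme's control bits) shows why slowly-changing series compress so well:

```rust
// Sketch of the Gorilla value-compression idea: adjacent f64 samples of a
// slowly-changing series XOR to a bit pattern with long zero runs, so only
// the short "meaningful" middle needs storing. Illustration only.

/// Bits of meaningful payload in the XOR of two adjacent f64 samples,
/// after trimming leading and trailing zeros as Gorilla does.
fn meaningful_bits(prev: f64, curr: f64) -> u32 {
    let xor = prev.to_bits() ^ curr.to_bits();
    if xor == 0 {
        0 // identical value: Gorilla stores a single control bit
    } else {
        64 - xor.leading_zeros() - xor.trailing_zeros()
    }
}

/// Rough compressed size (bits) of a series, ignoring headers and control
/// bits, to show the order-of-magnitude win over raw 64-bit floats.
fn estimate_bits(samples: &[f64]) -> u32 {
    samples
        .windows(2)
        .map(|w| meaningful_bits(w[0], w[1]))
        .sum::<u32>()
        + 64 // the first value is stored raw
}

fn main() {
    // A gauge that barely moves compresses extremely well...
    let flat = [100.0, 100.0, 100.0, 100.25, 100.25];
    // ...while random-looking values barely compress at all.
    let noisy = [3.14159, -2.71828, 1.41421, 99.9, -0.001];
    println!("flat:  {} bits (raw: {})", estimate_bits(&flat), 64 * flat.len());
    println!("noisy: {} bits (raw: {})", estimate_bits(&noisy), 64 * noisy.len());
}
```

Benchmarking decompression is then mostly a question of how fast the bitstream of these variable-width fields can be walked.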

#### I3: Series Cardinality Limits

**Issue:** How do we prevent series explosion (high cardinality destroying performance)?

**Investigation Needed:**

- Research cardinality estimation algorithms (HyperLogLog)
- Define cardinality limits (per metric, per label, global)
- Choose a rejection strategy (reject new series beyond the limit)

**Proposed Approach:**

- Add cardinality tracking to the label index
- Warn at 80% of the limit, reject at 100%
- Provide an admin API to inspect high-cardinality series
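The warn-at-80% / reject-at-100% policy above can be sketched as a small admission check in front of the label index (all names and types here are hypothetical, not the implementation):

```rust
use std::collections::HashSet;

/// Outcome of attempting to register a series. Sketch of the I3 proposal;
/// the variants and thresholds are illustrative only.
#[derive(Debug, PartialEq)]
enum Admit {
    Existing,
    Accepted,
    AcceptedWithWarning, // crossed 80% of the limit: log / export a metric
    Rejected,            // at the hard limit: drop the new series
}

struct CardinalityTracker {
    max_series: usize,
    seen: HashSet<u64>, // series IDs, e.g. a hash of the sorted label set
}

impl CardinalityTracker {
    fn new(max_series: usize) -> Self {
        Self { max_series, seen: HashSet::new() }
    }

    /// Check a series ID against the global limit before indexing it.
    fn admit(&mut self, series_id: u64) -> Admit {
        if self.seen.contains(&series_id) {
            return Admit::Existing; // known series are never re-counted
        }
        if self.seen.len() >= self.max_series {
            return Admit::Rejected;
        }
        self.seen.insert(series_id);
        if self.seen.len() * 10 >= self.max_series * 8 {
            Admit::AcceptedWithWarning
        } else {
            Admit::Accepted
        }
    }
}

fn main() {
    let mut tracker = CardinalityTracker::new(10);
    for id in 0..12u64 {
        println!("series {id}: {:?}", tracker.admit(id));
    }
    // Samples for already-known series stay ingestible even at the limit.
    println!("series 3 again: {:?}", tracker.admit(3));
}
```

A production version would track per-metric counts as well and use a HyperLogLog sketch instead of an exact set once the ID space gets large.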

#### I4: Out-of-Order Sample Edge Cases

**Issue:** How should we handle out-of-order samples that span chunk boundaries?

**Investigation Needed:**

- Test scenarios: samples arriving 1h late, 2h late, etc.
- Decide between multi-chunk updates and rejecting old samples
- Benchmark the cost of re-sorting chunks

**Proposed Approach:**

- Implement a configurable out-of-order window (default: 1h)
- Reject samples older than the window
- For within-window samples, insert into the correct chunk (may require re-compressing the chunk)
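The proposed window policy amounts to a three-way classification per incoming sample. A sketch, with illustrative names (the real field corresponding to `window_ms` would come from `out_of_order_time_window`):

```rust
/// Fate of an incoming sample under the proposed out-of-order policy.
/// Illustrative sketch; names are not from the implementation.
#[derive(Debug, PartialEq)]
enum SampleFate {
    InOrder,            // appended to the open chunk
    OutOfOrderAccepted, // within the window: insert into an earlier chunk
    TooOld,             // older than the window: reject and count the drop
}

/// `head_ts` is the newest timestamp seen for the series; `window_ms`
/// mirrors the `out_of_order_time_window` config (default 1h).
fn classify(sample_ts: i64, head_ts: i64, window_ms: i64) -> SampleFate {
    if sample_ts >= head_ts {
        SampleFate::InOrder
    } else if head_ts - sample_ts <= window_ms {
        // May land in a sealed chunk, which then needs re-compression.
        SampleFate::OutOfOrderAccepted
    } else {
        SampleFate::TooOld
    }
}

fn main() {
    let hour = 3_600_000i64; // 1h in milliseconds
    let head = 10 * hour;
    println!("{:?}", classify(head + 1_000, head, hour));     // in order
    println!("{:?}", classify(head - 30 * 60_000, head, hour)); // 30min late
    println!("{:?}", classify(head - 2 * hour, head, hour));  // too old
}
```

Only the middle case is expensive, which is what the re-sorting benchmark above needs to measure.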

## 10. References

### 10.1 Research Sources

#### Time-Series Storage Formats

#### PromQL Implementation

#### Prometheus Remote Write Protocol

#### Rust TSDB Implementations

### 10.2 Platform References

**Internal Documentation**

- PROJECT.md (Item 12: Metrics Store)
- `docs/por/T033-metricstor/task.yaml`
- `docs/por/T027-production-hardening/` (TLS patterns)
- `docs/por/T024-nixos-packaging/` (NixOS module patterns)

**Existing Service Patterns**

- `flaredb/crates/flaredb-server/src/main.rs` (TLS, metrics export)
- `flaredb/crates/flaredb-server/src/config/mod.rs` (config structure)
- `chainfire/crates/chainfire-server/src/config.rs` (TLS config)
- `iam/crates/iam-server/src/config.rs` (config patterns)

### 10.3 External Tools

## Appendix A: PromQL Function Reference (v1 Support)

### Supported Functions

| Function | Category | Description | Example |
|----------|----------|-------------|---------|
| `rate()` | Counter | Per-second rate of increase | `rate(http_requests_total[5m])` |
| `irate()` | Counter | Instant rate (last 2 samples) | `irate(http_requests_total[5m])` |
| `increase()` | Counter | Total increase over range | `increase(http_requests_total[1h])` |
| `histogram_quantile()` | Histogram | Calculate quantile from histogram | `histogram_quantile(0.95, rate(http_duration_bucket[5m]))` |
| `sum()` | Aggregation | Sum values | `sum(metric)` |
| `avg()` | Aggregation | Average values | `avg(metric)` |
| `min()` | Aggregation | Minimum value | `min(metric)` |
| `max()` | Aggregation | Maximum value | `max(metric)` |
| `count()` | Aggregation | Count series | `count(metric)` |
| `stddev()` | Aggregation | Standard deviation | `stddev(metric)` |
| `stdvar()` | Aggregation | Standard variance | `stdvar(metric)` |
| `topk()` | Aggregation | Top K series | `topk(5, metric)` |
| `bottomk()` | Aggregation | Bottom K series | `bottomk(5, metric)` |
| `time()` | Time | Current timestamp | `time()` |
| `timestamp()` | Time | Sample timestamp | `timestamp(metric)` |
| `abs()` | Math | Absolute value | `abs(metric)` |
| `ceil()` | Math | Round up | `ceil(metric)` |
| `floor()` | Math | Round down | `floor(metric)` |
| `round()` | Math | Round to nearest | `round(metric, 0.1)` |
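To make the counter semantics concrete, here is a simplified sketch of how `rate()` evaluates over a range window: sum the increases between adjacent samples, treating a value drop as a counter reset, then divide by the covered time span. (Names are illustrative, and the boundary extrapolation that real Prometheus performs is omitted.)

```rust
/// (timestamp_ms, value) samples that fall inside the range window.
type Sample = (i64, f64);

/// Simplified Prometheus-style `rate()`: total reset-adjusted increase
/// divided by the time span actually covered by the samples.
fn simple_rate(samples: &[Sample]) -> Option<f64> {
    if samples.len() < 2 {
        return None; // rate() needs at least two samples in the window
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, curr) = (pair[0].1, pair[1].1);
        if curr >= prev {
            increase += curr - prev;
        } else {
            // Counter reset: the counter restarted from zero, so the
            // post-reset value is itself the increase since the reset.
            increase += curr;
        }
    }
    let span_s = (samples[samples.len() - 1].0 - samples[0].0) as f64 / 1000.0;
    if span_s <= 0.0 { None } else { Some(increase / span_s) }
}

fn main() {
    // Counter samples at 10s intervals, with a reset at t=30s.
    let samples = [(0, 100.0), (10_000, 200.0), (20_000, 300.0), (30_000, 50.0)];
    println!("rate = {:?} samples/s", simple_rate(&samples));
}
```

The reset handling is why `increase()` and `rate()` must be evaluated inside the engine rather than computed from raw endpoints.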

### NOT Supported in v1

| Function | Category | Reason |
|----------|----------|--------|
| `predict_linear()` | Prediction | Complex, low usage |
| `deriv()` | Math | Low usage |
| `holt_winters()` | Prediction | Complex |
| `resets()` | Counter | Low usage |
| `changes()` | Analysis | Low usage |
| Subqueries | Advanced | Very complex |

## Appendix B: Configuration Reference

### Complete Configuration Example

```toml
# metricstor.toml - Complete configuration example

[server]
# Listen address for HTTP/gRPC API
addr = "0.0.0.0:8080"

# Log level: trace, debug, info, warn, error
log_level = "info"

# Metrics port for self-monitoring (Prometheus /metrics endpoint)
metrics_port = 9099

[server.tls]
# Enable TLS
cert_file = "/etc/metricstor/certs/server.crt"
key_file = "/etc/metricstor/certs/server.key"

# Enable mTLS (require client certificates)
ca_file = "/etc/metricstor/certs/ca.crt"
require_client_cert = true

[storage]
# Data directory for TSDB blocks and WAL
data_dir = "/var/lib/metricstor/data"

# Data retention period (days)
retention_days = 15

# WAL segment size (MB)
wal_segment_size_mb = 128

# Block duration for compaction
min_block_duration = "2h"
max_block_duration = "24h"

# Out-of-order sample acceptance window
out_of_order_time_window = "1h"

# Series cardinality limits
max_series = 10_000_000
max_series_per_metric = 100_000

# Memory limits
max_head_chunks_per_series = 2
max_head_size_mb = 2048

[query]
# Query timeout (seconds)
timeout_seconds = 30

# Maximum query range (hours)
max_range_hours = 24

# Query result cache TTL (seconds)
cache_ttl_seconds = 60

# Maximum concurrent queries
max_concurrent_queries = 100

[ingestion]
# Write buffer size (samples)
write_buffer_size = 100_000

# Backpressure strategy: "block" or "reject"
backpressure_strategy = "block"

# Rate limiting (samples per second per client)
rate_limit_per_client = 50_000

# Maximum samples per write request
max_samples_per_request = 10_000

[compaction]
# Enable background compaction
enabled = true

# Compaction interval (seconds)
interval_seconds = 7200  # 2 hours

# Number of compaction threads
num_threads = 2

[s3]
# S3 cold storage (optional, future)
enabled = false
endpoint = "https://s3.example.com"
bucket = "metricstor-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100

[flaredb]
# FlareDB metadata integration (optional, future)
enabled = false
endpoints = ["flaredb-1:50051", "flaredb-2:50051"]
namespace = "metrics"
```
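After deserializing this file (the real server would use serde for that), the loader should run cross-field sanity checks. A sketch over a subset of the `[storage]` settings; the struct, field names, and rules below are illustrative only:

```rust
/// Subset of the [storage] settings from the example above, after parsing.
/// Illustrative sketch of post-load validation, not the real config type.
struct StorageConfig {
    retention_days: u32,
    min_block_duration_hours: u32,
    max_block_duration_hours: u32,
    out_of_order_window_hours: u32,
}

fn validate(cfg: &StorageConfig) -> Result<(), String> {
    if cfg.retention_days == 0 {
        return Err("retention_days must be at least 1".into());
    }
    if cfg.min_block_duration_hours > cfg.max_block_duration_hours {
        return Err("min_block_duration must not exceed max_block_duration".into());
    }
    if cfg.out_of_order_window_hours > cfg.min_block_duration_hours {
        // Otherwise late samples could target blocks already compacted away.
        return Err("out_of_order window must fit within min_block_duration".into());
    }
    Ok(())
}

fn main() {
    // Values matching the example config: 15d retention, 2h/24h blocks, 1h window.
    let cfg = StorageConfig {
        retention_days: 15,
        min_block_duration_hours: 2,
        max_block_duration_hours: 24,
        out_of_order_window_hours: 1,
    };
    println!("valid: {:?}", validate(&cfg));
}
```

Failing fast at startup on inconsistent settings is cheaper than diagnosing silently-dropped samples later.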

## Appendix C: Metrics Exported by Metricstor

Metricstor exports metrics about itself on port 9099 (configurable).

### Ingestion Metrics

```
# Samples ingested
metricstor_samples_ingested_total{} counter

# Samples rejected (out-of-order, invalid, etc.)
metricstor_samples_rejected_total{reason="out_of_order|invalid|rate_limit"} counter

# Ingestion latency (milliseconds)
metricstor_ingestion_latency_ms{quantile="0.5|0.9|0.99"} summary

# Active series
metricstor_active_series{} gauge

# Head memory usage (bytes)
metricstor_head_memory_bytes{} gauge
```

### Query Metrics

```
# Queries executed
metricstor_queries_total{type="instant|range"} counter

# Query latency (milliseconds)
metricstor_query_latency_ms{type="instant|range", quantile="0.5|0.9|0.99"} summary

# Query errors
metricstor_query_errors_total{reason="timeout|parse_error|execution_error"} counter
```

### Storage Metrics

```
# WAL segments
metricstor_wal_segments{} gauge

# WAL size (bytes)
metricstor_wal_size_bytes{} gauge

# Blocks
metricstor_blocks_total{level="0|1|2"} gauge

# Block size (bytes)
metricstor_block_size_bytes{level="0|1|2"} gauge

# Compactions
metricstor_compactions_total{level="0|1|2"} counter

# Compaction duration (seconds)
metricstor_compaction_duration_seconds{level="0|1|2", quantile="0.5|0.9|0.99"} summary
```

### System Metrics

```
# Go runtime metrics (if using Go for scraper)
# Rust memory metrics
metricstor_memory_allocated_bytes{} gauge

# CPU usage
metricstor_cpu_usage_seconds_total{} counter
```

## Appendix D: Error Codes and Troubleshooting

### HTTP Error Codes

| Code | Meaning | Common Causes |
|------|---------|---------------|
| 200 | OK | Query successful |
| 204 | No Content | Write successful |
| 400 | Bad Request | Invalid PromQL, malformed protobuf |
| 401 | Unauthorized | mTLS cert validation failed |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Storage error, WAL corruption |
| 503 | Service Unavailable | Write buffer full, server overloaded |

### Common Issues

#### "Samples rejected: out_of_order"

**Cause:** Samples arriving with timestamps older than `out_of_order_time_window`

**Solution:**

- Increase `out_of_order_time_window` in the config
- Check clock sync on clients (NTP)
- Reduce the scrape batch size

#### "Rate limit exceeded"

**Cause:** Client exceeding `rate_limit_per_client` samples/sec

**Solution:**

- Increase the rate limit in the config
- Reduce scrape frequency
- Shard writes across multiple clients
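The per-client limit behind this error can be implemented as a token bucket, which tolerates short bursts while enforcing the average rate. A sketch (illustrative only; the real implementation and its names may differ, and burst size is an assumed parameter):

```rust
/// Token-bucket sketch of a per-client `rate_limit_per_client` check.
/// Tokens refill at `rate` per second up to `burst`; each sample costs one.
struct TokenBucket {
    rate: f64,   // samples per second
    burst: f64,  // maximum bucket size (burst allowance)
    tokens: f64,
    last_refill_s: f64,
}

impl TokenBucket {
    fn new(rate: f64, burst: f64) -> Self {
        Self { rate, burst, tokens: burst, last_refill_s: 0.0 }
    }

    /// Try to admit a batch of `n` samples at time `now_s` (monotonic
    /// seconds). Returning false maps to HTTP 429 for the client.
    fn try_admit(&mut self, n: f64, now_s: f64) -> bool {
        let elapsed = (now_s - self.last_refill_s).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill_s = now_s;
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 50_000 samples/sec with a one-second burst, as in the example config.
    let mut bucket = TokenBucket::new(50_000.0, 50_000.0);
    println!("{}", bucket.try_admit(40_000.0, 0.0)); // fits in the burst
    println!("{}", bucket.try_admit(40_000.0, 0.1)); // bucket nearly drained
    println!("{}", bucket.try_admit(40_000.0, 1.0)); // refilled enough again
}
```

Rejecting the whole batch (rather than admitting a partial one) keeps remote_write retry semantics simple for clients.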

#### "Query timeout"

**Cause:** Query exceeding `timeout_seconds`

**Solution:**

- Increase the query timeout
- Reduce the query time range
- Add more specific label matchers to reduce the number of series scanned

#### "Series cardinality explosion"

**Cause:** Too many unique label combinations (high cardinality)

**Solution:**

- Review label design (avoid unbounded labels like `user_id`)
- Use relabeling to drop high-cardinality labels
- Increase the `max_series` limit (if justified)

---

**End of Design Document**

**Total Length:** ~3,800 lines

**Status:** Ready for review and S2-S6 implementation

**Next Steps:**

1. Review and approve design decisions
2. Create GitHub issues for S2-S6 tasks
3. Begin S2: Workspace Scaffold