photoncloud-monorepo/docs/por/T033-metricstor/DESIGN.md

Metricstor Design Document

Project: Metricstor - VictoriaMetrics OSS Replacement
Task: T033.S1 Research & Architecture
Version: 1.0
Date: 2025-12-10
Author: PeerB


Table of Contents

  1. Executive Summary
  2. Requirements
  3. Time-Series Storage Model
  4. Push Ingestion API
  5. PromQL Query Engine
  6. Storage Backend Architecture
  7. Integration Points
  8. Implementation Plan
  9. Open Questions
  10. References

1. Executive Summary

1.1 Overview

Metricstor is a fully open-source, distributed time-series database designed as a replacement for VictoriaMetrics, motivated by the fact that VictoriaMetrics gates mTLS support behind its paid enterprise tier. As the final component (Item 12/12) of PROJECT.md, Metricstor completes the observability stack for the Japanese cloud platform.

1.2 High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Service Mesh                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐           │
│  │ FlareDB  │ │ ChainFire│ │ PlasmaVMC│ │  IAM     │  ...      │
│  │ :9092    │ │ :9091    │ │ :9093    │ │ :9094    │           │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘           │
│       │            │            │            │                   │
│       └────────────┴────────────┴────────────┘                   │
│                         │                                        │
│                         │ Push (remote_write)                    │
│                         │ mTLS                                   │
│                         ▼                                        │
│              ┌──────────────────────┐                            │
│              │   Metricstor Server  │                            │
│              │  ┌────────────────┐  │                            │
│              │  │  Ingestion API │  │ ← Prometheus remote_write  │
│              │  │  (gRPC/HTTP)   │  │                            │
│              │  └────────┬───────┘  │                            │
│              │           │          │                            │
│              │  ┌────────▼───────┐  │                            │
│              │  │  Write Buffer  │  │                            │
│              │  │  (In-Memory)   │  │                            │
│              │  └────────┬───────┘  │                            │
│              │           │          │                            │
│              │  ┌────────▼───────┐  │                            │
│              │  │  Storage Engine│  │                            │
│              │  │  ┌──────────┐  │  │                            │
│              │  │  │ Head     │  │  │ ← WAL + In-Memory Index    │
│              │  │  │ (Active) │  │  │                            │
│              │  │  └────┬─────┘  │  │                            │
│              │  │       │        │  │                            │
│              │  │  ┌────▼─────┐  │  │                            │
│              │  │  │ Blocks   │  │  │ ← Immutable, Compressed    │
│              │  │  │ (TSDB)   │  │  │                            │
│              │  │  └──────────┘  │  │                            │
│              │  └────────────────┘  │                            │
│              │           │          │                            │
│              │  ┌────────▼───────┐  │                            │
│              │  │  Query Engine  │  │ ← PromQL Execution         │
│              │  │  (PromQL AST)  │  │                            │
│              │  └────────┬───────┘  │                            │
│              │           │          │                            │
│              └───────────┼──────────┘                            │
│                          │                                        │
│                          │ Query (HTTP)                           │
│                          │ mTLS                                   │
│                          ▼                                        │
│              ┌──────────────────────┐                            │
│              │   Grafana / Clients  │                            │
│              └──────────────────────┘                            │
└─────────────────────────────────────────────────────────────────┘

                  ┌─────────────────────┐
                  │  FlareDB Cluster    │  ← Metadata (optional)
                  │  (Metadata Store)   │
                  └─────────────────────┘

                  ┌─────────────────────┐
                  │  S3-Compatible      │  ← Cold Storage (future)
                  │  Object Storage     │
                  └─────────────────────┘

1.3 Key Design Decisions

  1. Storage Format: Hybrid approach using Prometheus TSDB block design with Gorilla compression

    • Rationale: Battle-tested, excellent compression (1-2 bytes/sample), widely understood
  2. Storage Backend: Dedicated time-series engine with optional FlareDB metadata integration

    • Rationale: Time-series workloads have unique access patterns; KV stores not optimal for sample storage
    • FlareDB reserved for metadata (series labels, index) in distributed scenarios
  3. PromQL Subset: Support 80% of common use cases (instant/range queries, basic aggregations, rate/increase)

    • Rationale: Full PromQL compatibility is complex; focus on practical operator needs
  4. Push Model: Prometheus remote_write v1.0 protocol via HTTP + gRPC APIs

    • Rationale: Standard protocol, Snappy compression built-in, client library availability
  5. mTLS Integration: Consistent with T027/T031 patterns (cert_file, key_file, ca_file, require_client_cert)

    • Rationale: Unified security model across all platform services

1.4 Success Criteria

  • Accept metrics from 8+ services (ports 9091-9099) via remote_write
  • Query latency <100ms for instant queries (p95)
  • Compression ratio ≥10:1 (target: 1.5-2 bytes/sample)
  • Support 100K samples/sec write throughput per instance
  • PromQL queries cover 80% of Grafana dashboard use cases
  • Zero vendor lock-in (100% OSS, no paid features)

2. Requirements

2.1 Functional Requirements

FR-1: Push-Based Metric Ingestion

  • FR-1.1: Accept Prometheus remote_write v1.0 protocol (HTTP POST)
  • FR-1.2: Support Snappy-compressed protobuf payloads
  • FR-1.3: Validate metric names and labels per Prometheus naming conventions
  • FR-1.4: Handle out-of-order samples within a configurable time window (default: 1h)
  • FR-1.5: Deduplicate duplicate samples (same timestamp + labels)
  • FR-1.6: Return backpressure signals (HTTP 429/503) when buffer is full

FR-2: PromQL Query Engine

  • FR-2.1: Support instant queries (/api/v1/query)
  • FR-2.2: Support range queries (/api/v1/query_range)
  • FR-2.3: Support label queries (/api/v1/label/<name>/values, /api/v1/labels)
  • FR-2.4: Support series metadata queries (/api/v1/series)
  • FR-2.5: Implement core PromQL functions (see Section 5.2)
  • FR-2.6: Support Prometheus HTTP API JSON response format

FR-3: Time-Series Storage

  • FR-3.1: Store samples with millisecond timestamp precision
  • FR-3.2: Support configurable retention periods (default: 15 days, configurable 1-365 days)
  • FR-3.3: Automatic background compaction of blocks
  • FR-3.4: Crash recovery via Write-Ahead Log (WAL)
  • FR-3.5: Series cardinality limits to prevent explosion (default: 10M series)

FR-4: Security & Authentication

  • FR-4.1: mTLS support for ingestion and query APIs
  • FR-4.2: Optional basic authentication for HTTP endpoints
  • FR-4.3: Rate limiting per client (based on mTLS certificate CN or IP)

FR-5: Operational Features

  • FR-5.1: Prometheus-compatible /metrics endpoint for self-monitoring
  • FR-5.2: Health check endpoints (/health, /ready)
  • FR-5.3: Admin API for series deletion, compaction trigger
  • FR-5.4: TOML configuration file support
  • FR-5.5: Environment variable overrides

2.2 Non-Functional Requirements

NFR-1: Performance

  • NFR-1.1: Ingestion throughput: ≥100K samples/sec per instance
  • NFR-1.2: Query latency (p95): <100ms for instant queries, <500ms for range queries (1h window)
  • NFR-1.3: Compression ratio: ≥10:1 (target: 1.5-2 bytes/sample)
  • NFR-1.4: Memory usage: <2GB for 1M active series

NFR-2: Scalability

  • NFR-2.1: Vertical scaling: Support 10M active series per instance
  • NFR-2.2: Horizontal scaling: Support sharding across multiple instances (future work)
  • NFR-2.3: Storage: Support local disk + optional S3-compatible backend for cold data

NFR-3: Reliability

  • NFR-3.1: No data loss for committed samples (WAL durability)
  • NFR-3.2: Graceful degradation under load (reject writes with backpressure, not crash)
  • NFR-3.3: Crash recovery time: <30s for 10M series

NFR-4: Maintainability

  • NFR-4.1: Codebase consistency with other platform services (FlareDB, ChainFire patterns)
  • NFR-4.2: 100% Rust, no CGO dependencies
  • NFR-4.3: Comprehensive unit and integration tests
  • NFR-4.4: Operator documentation with runbooks

NFR-5: Compatibility

  • NFR-5.1: Prometheus remote_write v1.0 protocol compatibility
  • NFR-5.2: Prometheus HTTP API compatibility (subset: query, query_range, labels, series)
  • NFR-5.3: Grafana data source compatibility

2.3 Out of Scope (Explicitly Not Supported in v1)

  • Prometheus remote_read protocol (pull-based; platform uses push)
  • Full PromQL compatibility (complex subqueries, advanced functions)
  • Multi-tenancy (single-tenant per instance; use multiple instances for multi-tenant)
  • Distributed query federation (single-instance queries only)
  • Recording rules and alerting (use separate Prometheus/Alertmanager for this)

3. Time-Series Storage Model

3.1 Data Model

3.1.1 Metric Structure

A time-series metric in Metricstor follows the Prometheus data model:

metric_name{label1="value1", label2="value2", ...} value timestamp

Example:

http_requests_total{method="GET", status="200", service="flaredb"} 1543 1733832000000

Components:

  • Metric Name: Identifier for the measurement (e.g., http_requests_total)

    • Must match regex: [a-zA-Z_:][a-zA-Z0-9_:]*
  • Labels: Key-value pairs for dimensionality (e.g., {method="GET", status="200"})

    • Label names: [a-zA-Z_][a-zA-Z0-9_]*
    • Label values: Any UTF-8 string
    • Reserved labels: __name__ (stores metric name), labels starting with __ are internal
  • Value: Float64 sample value

  • Timestamp: Millisecond precision (int64 milliseconds since Unix epoch)

3.1.2 Series Identification

A series is uniquely identified by its metric name + label set:

// Pseudo-code representation
struct SeriesID {
    hash: u64,  // FNV-1a hash of sorted labels
}

struct Series {
    id: SeriesID,
    labels: BTreeMap<String, String>,  // Sorted for consistent hashing
    chunks: Vec<ChunkRef>,
}

Series ID calculation:

  1. Sort labels lexicographically (including __name__ label)
  2. Concatenate as: label1_name + \0 + label1_value + \0 + label2_name + \0 + ...
  3. Compute FNV-1a 64-bit hash
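The three steps above can be sketched directly in Rust. This is an illustrative stand-alone version (function names are ours, not a fixed API); a `BTreeMap` gives the sorted iteration order that makes the byte stream canonical:

```rust
use std::collections::BTreeMap;

/// One FNV-1a round per byte: xor, then multiply by the FNV prime.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// FNV-1a 64-bit hash over the sorted label set, with '\0' separators
/// between each label name and value as described above.
fn series_id(labels: &BTreeMap<String, String>) -> u64 {
    const FNV_OFFSET_BASIS: u64 = 0xcbf29ce484222325;
    let mut hash = FNV_OFFSET_BASIS;
    // BTreeMap iterates in sorted key order, so insertion order is irrelevant.
    for (name, value) in labels {
        hash = fnv1a(hash, name.as_bytes());
        hash = fnv1a(hash, &[0]);
        hash = fnv1a(hash, value.as_bytes());
        hash = fnv1a(hash, &[0]);
    }
    hash
}

fn main() {
    let mut labels = BTreeMap::new();
    labels.insert("__name__".to_string(), "http_requests_total".to_string());
    labels.insert("method".to_string(), "GET".to_string());
    println!("series id: {:016x}", series_id(&labels));
}
```

Because the map is sorted before hashing, the same label set always yields the same series ID regardless of the order labels arrive in.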

3.2 Storage Format

3.2.1 Architecture Overview

Metricstor uses a hybrid storage architecture inspired by Prometheus TSDB and Gorilla:

┌─────────────────────────────────────────────────────────────────┐
│                        Memory Layer (Head)                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
│  │  Series Map  │  │  WAL Segment │  │ Write Buffer │          │
│  │  (In-Memory  │  │  (Disk)      │  │ (MPSC Queue) │          │
│  │   Index)     │  │              │  │              │          │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘          │
│         │                 │                 │                    │
│         └─────────────────┴─────────────────┘                    │
│                           │                                      │
│                           ▼                                      │
│              ┌─────────────────────────┐                         │
│              │    Active Chunks        │                         │
│              │  (Gorilla-compressed)   │                         │
│              │   - 2h time windows     │                         │
│              │   - Delta-of-delta TS   │                         │
│              │   - XOR float encoding  │                         │
│              └─────────────────────────┘                         │
└─────────────────────────────────────────────────────────────────┘
                           │
                           │ Compaction Trigger
                           │ (every 2h or on shutdown)
                           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Disk Layer (Blocks)                       │
│                                                                   │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐ │
│  │  Block 1        │  │  Block 2        │  │  Block N        │ │
│  │  [0h - 2h)      │  │  [2h - 4h)      │  │  [Nh - (N+2)h)  │ │
│  │                 │  │                 │  │                 │ │
│  │  ├─ meta.json   │  │  ├─ meta.json   │  │  ├─ meta.json   │ │
│  │  ├─ index       │  │  ├─ index       │  │  ├─ index       │ │
│  │  ├─ chunks/000  │  │  ├─ chunks/000  │  │  ├─ chunks/000  │ │
│  │  └─ tombstones  │  │  └─ tombstones  │  │  └─ tombstones  │ │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

3.2.2 Write-Ahead Log (WAL)

Purpose: Durability and crash recovery

Format: Append-only log segments (128MB default size)

WAL Structure:
data/
  wal/
    00000001  ← Segment 1 (128MB)
    00000002  ← Segment 2 (active)

WAL Record Format (inspired by LevelDB):

┌───────────────────────────────────────────────────┐
│  CRC32 (4 bytes)                                  │
├───────────────────────────────────────────────────┤
│  Length (4 bytes, little-endian)                  │
├───────────────────────────────────────────────────┤
│  Type (1 byte): FULL | FIRST | MIDDLE | LAST      │
├───────────────────────────────────────────────────┤
│  Payload (variable):                              │
│    - Record Type (1 byte): Series | Samples       │
│    - Series ID (8 bytes)                          │
│    - Labels (length-prefixed strings)             │
│    - Samples (varint timestamp, float64 value)    │
└───────────────────────────────────────────────────┘

WAL Operations:

  • Append: Every write appends to active segment
  • Checkpoint: Snapshot of in-memory state to disk blocks
  • Truncate: Delete segments older than oldest in-memory data
  • Replay: On startup, replay WAL segments to rebuild in-memory state

Rust Implementation Sketch:

struct WAL {
    dir: PathBuf,
    segment_size: usize,  // 128MB default
    active_segment: File,
    active_segment_num: u64,
}

impl WAL {
    fn append(&mut self, record: &WALRecord) -> Result<()> {
        let encoded = record.encode();
        let crc = crc32(&encoded);

        // Rotate segment if needed
        if self.active_segment.metadata()?.len() + encoded.len() as u64 > self.segment_size as u64 {
            self.rotate_segment()?;
        }

        self.active_segment.write_all(&crc.to_le_bytes())?;
        self.active_segment.write_all(&(encoded.len() as u32).to_le_bytes())?;
        self.active_segment.write_all(&encoded)?;
        self.active_segment.sync_all()?;  // fsync for durability
        Ok(())
    }

    fn replay(&self) -> Result<Vec<WALRecord>> {
        // Read all segments in order, verify CRCs, and decode records.
        // Used on startup for crash recovery (omitted in this sketch).
        todo!()
    }
}

3.2.3 In-Memory Head Block

Purpose: Accept recent writes, maintain hot data for fast queries

Structure:

struct Head {
    series: RwLock<HashMap<SeriesID, Arc<Series>>>,
    min_time: AtomicI64,
    max_time: AtomicI64,
    chunk_size: Duration,  // 2h default
    wal: Arc<WAL>,
}

struct Series {
    id: SeriesID,
    labels: BTreeMap<String, String>,
    chunks: RwLock<Vec<Chunk>>,
}

struct Chunk {
    min_time: i64,
    max_time: i64,
    samples: CompressedSamples,  // Gorilla encoding
}

Chunk Lifecycle:

  1. Creation: New chunk created when first sample arrives or previous chunk is full
  2. Active: Chunk accepts samples in time window [min_time, min_time + 2h)
  3. Full: Chunk reaches 2h window, new chunk created for subsequent samples
  4. Compaction: Full chunks compacted to disk blocks

Memory Limits:

  • Max series: 10M (configurable)
  • Max chunks per series: 2 (active + previous, covering 4h)
  • Eviction: LRU eviction of inactive series (no samples in 4h)
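The chunk lifecycle above can be illustrated with a simplified, single-threaded sketch (the `SketchChunk`/`SketchSeries` names are illustrative; the real Head wraps this in the locks shown earlier, stores Gorilla-compressed bytes rather than a `Vec`, and additionally honors FR-1.4's out-of-order window, which is ignored here for brevity):

```rust
const CHUNK_WINDOW_MS: i64 = 2 * 60 * 60 * 1000; // 2h chunk window

struct SketchChunk {
    min_time: i64,
    max_time: i64,
    samples: Vec<(i64, f64)>, // stand-in for Gorilla-compressed data
}

#[derive(Default)]
struct SketchSeries {
    chunks: Vec<SketchChunk>,
}

impl SketchSeries {
    fn append(&mut self, ts: i64, value: f64) -> Result<(), &'static str> {
        match self.chunks.last_mut() {
            // Timestamp not newer than the latest sample: reject
            // (covers both out-of-order samples and exact duplicates).
            Some(c) if ts <= c.max_time => Err("out of order"),
            // Active chunk still has room in its 2h window.
            Some(c) if ts < c.min_time + CHUNK_WINDOW_MS => {
                c.samples.push((ts, value));
                c.max_time = ts;
                Ok(())
            }
            // First sample, or window exhausted: cut a new chunk.
            _ => {
                self.chunks.push(SketchChunk {
                    min_time: ts,
                    max_time: ts,
                    samples: vec![(ts, value)],
                });
                Ok(())
            }
        }
    }
}

fn main() {
    let mut s = SketchSeries::default();
    s.append(0, 1.0).unwrap();
    s.append(15_000, 2.0).unwrap();
    s.append(CHUNK_WINDOW_MS, 3.0).unwrap(); // crosses the window: new chunk
    assert_eq!(s.chunks.len(), 2);
    assert!(s.append(500, 9.9).is_err()); // out-of-order: rejected
}
```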

3.2.4 Disk Blocks (Immutable)

Purpose: Long-term storage of compacted time-series data

Block Structure (inspired by Prometheus TSDB):

data/
  01HQZQZQZQZQZQZQZQZQZQ/  ← Block directory (ULID)
    meta.json              ← Metadata
    index                  ← Inverted index
    chunks/
      000001               ← Chunk file
      000002
      ...
    tombstones             ← Deleted series/samples

meta.json Format:

{
  "ulid": "01HQZQZQZQZQZQZQZQZQZQ",
  "minTime": 1733832000000,
  "maxTime": 1733839200000,
  "stats": {
    "numSamples": 1500000,
    "numSeries": 5000,
    "numChunks": 10000
  },
  "compaction": {
    "level": 1,
    "sources": ["01HQZQZ..."]
  },
  "version": 1
}

Index File Format (simplified):

The index file provides fast lookups of series by labels.

┌────────────────────────────────────────────────┐
│  Magic Number (4 bytes): 0xBADA55A0           │
├────────────────────────────────────────────────┤
│  Version (1 byte): 1                           │
├────────────────────────────────────────────────┤
│  Symbol Table Section                          │
│    - Sorted strings (label names/values)       │
│    - Offset table for binary search            │
├────────────────────────────────────────────────┤
│  Series Section                                │
│    - SeriesID → Chunk Refs mapping             │
│    - (series_id, labels, chunk_offsets)        │
├────────────────────────────────────────────────┤
│  Label Index Section (Inverted Index)          │
│    - label_name → [series_ids]                 │
│    - (label_name, label_value) → [series_ids]  │
├────────────────────────────────────────────────┤
│  Postings Section                              │
│    - Sorted posting lists for label matchers   │
│    - Compressed with varint + bit packing      │
├────────────────────────────────────────────────┤
│  TOC (Table of Contents)                       │
│    - Offsets to each section                   │
└────────────────────────────────────────────────┘
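To make the Label Index and Postings sections concrete: resolving a query with multiple label matchers reduces to intersecting sorted posting lists of series IDs. A minimal sketch (decompression of the varint-packed lists is omitted):

```rust
use std::cmp::Ordering;

/// Intersect two sorted posting lists of series IDs, as when resolving a
/// conjunction of label matchers, e.g. {method="GET", status="200"}.
fn intersect_postings(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    // Classic merge-style walk over both lists: O(len(a) + len(b)).
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    let get = [1, 3, 5, 7, 9]; // series with method="GET"
    let ok = [3, 4, 5, 8, 9];  // series with status="200"
    assert_eq!(intersect_postings(&get, &ok), vec![3, 5, 9]);
}
```

Keeping posting lists sorted is what makes this linear-time walk possible; the same shape extends to n-way intersections for queries with more matchers.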

Chunks File Format:

Chunk File (chunks/000001):
┌────────────────────────────────────────────────┐
│  Chunk 1:                                      │
│    ├─ Length (4 bytes)                         │
│    ├─ Encoding (1 byte): Gorilla = 0x01        │
│    ├─ MinTime (8 bytes)                        │
│    ├─ MaxTime (8 bytes)                        │
│    ├─ NumSamples (4 bytes)                     │
│    └─ Compressed Data (variable)               │
├────────────────────────────────────────────────┤
│  Chunk 2: ...                                  │
└────────────────────────────────────────────────┘
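A minimal writer for one record of this layout might look as follows. Byte order is an assumption here: the format above only fixes little-endian for the WAL, and this sketch assumes the same for chunk files.

```rust
/// Append one chunk record in the layout above:
/// Length (4) | Encoding (1) | MinTime (8) | MaxTime (8) | NumSamples (4) | data.
fn write_chunk_record(
    out: &mut Vec<u8>,
    min_time: i64,
    max_time: i64,
    num_samples: u32,
    compressed: &[u8],
) {
    const ENCODING_GORILLA: u8 = 0x01;
    // The length field covers everything after the length field itself.
    let len = (1 + 8 + 8 + 4 + compressed.len()) as u32;
    out.extend_from_slice(&len.to_le_bytes());
    out.push(ENCODING_GORILLA);
    out.extend_from_slice(&min_time.to_le_bytes());
    out.extend_from_slice(&max_time.to_le_bytes());
    out.extend_from_slice(&num_samples.to_le_bytes());
    out.extend_from_slice(compressed);
}

fn main() {
    let mut buf = Vec::new();
    write_chunk_record(&mut buf, 1733832000000, 1733839200000, 3, &[0xAA, 0xBB]);
    // 4-byte length prefix + 21-byte fixed header + 2 data bytes
    assert_eq!(buf.len(), 4 + 21 + 2);
    assert_eq!(buf[4], 0x01); // encoding byte: Gorilla
}
```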

3.3 Compression Strategy

3.3.1 Gorilla Compression Algorithm

Metricstor uses Gorilla compression from Facebook's paper (VLDB 2015), achieving ~12x compression.

Timestamp Compression (Delta-of-Delta):

Example timestamps (ms):
  t0 = 1733832000000
  t1 = 1733832015000  (Δ1 = 15000)
  t2 = 1733832030000  (Δ2 = 15000)
  t3 = 1733832045000  (Δ3 = 15000)

Delta-of-delta:
  D1 = Δ1 = 15000                      → first delta stored directly in 14 bits (Gorilla block-header convention)
  D2 = Δ2 - Δ1 = 15000 - 15000 = 0     → encode in 1 bit (0)
  D3 = Δ3 - Δ2 = 15000 - 15000 = 0     → encode in 1 bit (0)

Encoding:
  - If D = 0: write 1 bit "0"
  - If D in [-63, 64): write "10" + 7 bits
  - If D in [-255, 256): write "110" + 9 bits
  - If D in [-2047, 2048): write "1110" + 12 bits
  - Otherwise: write "1111" + 32 bits

96% of timestamps compress to 1 bit!

Value Compression (XOR Encoding):

Example values (float64):
  v0 = 1543.0
  v1 = 1543.5
  v2 = 1543.7

XOR compression:
  XOR(v0, v1) = 0x40981C0000000000 XOR 0x40981E0000000000
              = 0x0000020000000000
              → Leading zeros: 22, Trailing zeros: 41, Significant bits: 1
              → Encode: control bits "11" + 5-bit leading + 6-bit length + 1 significant bit

  XOR(v1, v2) → 1543.7 is not exactly representable in float64, so the XOR
                has a much wider significant-bit window; encoded the same
                way with new control bits

Encoding:
  - If v_i == v_(i-1): write 1 bit "0"
  - If XOR has same leading/trailing zeros as previous: write "10" + significant bits
  - Otherwise: write "11" + 5-bit leading + 6-bit length + significant bits

51% of values compress to 1 bit!

Rust Implementation Sketch:

struct GorillaEncoder {
    bit_writer: BitWriter,
    prev_timestamp: i64,
    prev_delta: i64,
    prev_value: f64,
    prev_leading_zeros: u8,
    prev_trailing_zeros: u8,
}

impl GorillaEncoder {
    fn encode_timestamp(&mut self, timestamp: i64) -> Result<()> {
        let delta = timestamp - self.prev_timestamp;
        let delta_of_delta = delta - self.prev_delta;

        if delta_of_delta == 0 {
            self.bit_writer.write_bit(0)?;
        } else if delta_of_delta >= -63 && delta_of_delta < 64 {
            self.bit_writer.write_bits(0b10, 2)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 7)?;
        } else if delta_of_delta >= -255 && delta_of_delta < 256 {
            self.bit_writer.write_bits(0b110, 3)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 9)?;
        } else if delta_of_delta >= -2047 && delta_of_delta < 2048 {
            self.bit_writer.write_bits(0b1110, 4)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 12)?;
        } else {
            self.bit_writer.write_bits(0b1111, 4)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 32)?;
        }

        self.prev_timestamp = timestamp;
        self.prev_delta = delta;
        Ok(())
    }

    fn encode_value(&mut self, value: f64) -> Result<()> {
        let bits = value.to_bits();
        let xor = bits ^ self.prev_value.to_bits();

        if xor == 0 {
            self.bit_writer.write_bit(0)?;
        } else {
            let leading = xor.leading_zeros() as u8;
            let trailing = xor.trailing_zeros() as u8;
            // (A full implementation caps `leading` at 31 so it fits the 5-bit field.)

            if leading >= self.prev_leading_zeros && trailing >= self.prev_trailing_zeros {
                // Control "10": reuse the previous window, so the decoder
                // already knows its width; shift by the *previous* trailing
                // zeros and write the *previous* window's bit count.
                self.bit_writer.write_bits(0b10, 2)?;
                let significant_bits = 64 - self.prev_leading_zeros - self.prev_trailing_zeros;
                self.bit_writer.write_bits(xor >> self.prev_trailing_zeros, significant_bits as usize)?;
            } else {
                // Control "11": emit a new window header, then the payload.
                let significant_bits = 64 - leading - trailing;
                self.bit_writer.write_bits(0b11, 2)?;
                self.bit_writer.write_bits(leading as u64, 5)?;
                self.bit_writer.write_bits(significant_bits as u64, 6)?;
                self.bit_writer.write_bits(xor >> trailing, significant_bits as usize)?;

                self.prev_leading_zeros = leading;
                self.prev_trailing_zeros = trailing;
            }
        }

        self.prev_value = value;
        Ok(())
    }
}

3.3.2 Compression Performance Targets

Based on research and production systems:

Metric                 Target               Reference
Average bytes/sample   1.5-2.0              Prometheus (1-2), Gorilla (1.37), M3DB (1.45)
Compression ratio      10-12x               Gorilla (12x), InfluxDB TSM (45x for specific workloads)
Encode throughput      >500K samples/sec    Gorilla paper: 700K/sec
Decode throughput      >1M samples/sec      Gorilla paper: 1.2M/sec

3.4 Retention and Compaction Policies

3.4.1 Retention Policy

Default Retention: 15 days

Configurable Parameters:

[storage]
retention_days = 15         # Keep data for 15 days
min_block_duration = "2h"   # Minimum block size
max_block_duration = "24h"  # Maximum block size after compaction

Retention Enforcement:

  • Background task runs every 1h
  • Deletes blocks where max_time < now() - retention_duration
  • Deletes old WAL segments

3.4.2 Compaction Strategy

Purpose:

  1. Merge small blocks into larger blocks (reduce file count)
  2. Remove deleted samples (tombstones)
  3. Improve query performance (fewer blocks to scan)

Compaction Levels (inspired by LevelDB):

Level 0: 2h blocks (compacted from Head)
Level 1: 12h blocks (merge 6 L0 blocks)
Level 2: 24h blocks (merge 2 L1 blocks)

Compaction Trigger:

  • Time-based: Every 2h, compact Head → Level 0 block
  • Count-based: When L0 has >4 blocks, compact → L1
  • Manual: Admin API endpoint /api/v1/admin/compact

Compaction Algorithm:

1. Select blocks to compact (same level, adjacent time ranges)
2. Create new block directory (ULID)
3. Iterate all series in selected blocks:
   a. Merge chunks from all blocks
   b. Apply tombstones (skip deleted samples)
   c. Re-compress merged chunks
   d. Write to new block chunks file
4. Build new index (merge posting lists)
5. Write meta.json
6. Atomically rename block directory
7. Delete source blocks

Rust Implementation Sketch:

struct Compactor {
    data_dir: PathBuf,
    retention: Duration,
}

impl Compactor {
    async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID> {
        let block_id = ULID::new();
        let block_dir = self.data_dir.join(block_id.to_string());
        std::fs::create_dir_all(&block_dir)?;

        let mut index_writer = IndexWriter::new(&block_dir.join("index"))?;
        let mut chunk_writer = ChunkWriter::new(&block_dir.join("chunks/000001"))?;

        let series_map = head.series.read().await;
        for (series_id, series) in series_map.iter() {
            let chunks = series.chunks.read().await;
            for chunk in chunks.iter() {
                if chunk.is_full() {
                    let chunk_ref = chunk_writer.write_chunk(&chunk.samples)?;
                    index_writer.add_series(*series_id, &series.labels, chunk_ref)?;
                }
            }
        }

        index_writer.finalize()?;
        chunk_writer.finalize()?;

        let meta = BlockMeta {
            ulid: block_id,
            min_time: head.min_time.load(Ordering::Relaxed),
            max_time: head.max_time.load(Ordering::Relaxed),
            stats: compute_stats(&block_dir)?,
            compaction: CompactionMeta { level: 0, sources: vec![] },
            version: 1,
        };
        write_meta(&block_dir.join("meta.json"), &meta)?;

        Ok(block_id)
    }

    async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID> {
        // Merge multiple blocks into one: same flow as compact_head_to_l0,
        // but reading series and chunks from the existing source blocks
        // (omitted in this sketch).
        todo!()
    }

    async fn enforce_retention(&self) -> Result<()> {
        let cutoff = SystemTime::now() - self.retention;
        let cutoff_ms = cutoff.duration_since(UNIX_EPOCH)?.as_millis() as i64;

        for entry in std::fs::read_dir(&self.data_dir)? {
            let path = entry?.path();
            if !path.is_dir() { continue; }

            let meta_path = path.join("meta.json");
            if !meta_path.exists() { continue; }

            let meta: BlockMeta = serde_json::from_reader(File::open(meta_path)?)?;
            if meta.max_time < cutoff_ms {
                std::fs::remove_dir_all(&path)?;
                info!("Deleted expired block: {}", meta.ulid);
            }
        }
        Ok(())
    }
}

4. Push Ingestion API

4.1 Prometheus Remote Write Protocol

4.1.1 Protocol Overview

Specification: Prometheus Remote Write v1.0
Transport: HTTP/1.1 or HTTP/2
Encoding: Protocol Buffers (protobuf v3)
Compression: Snappy (required)

Reference: Prometheus Remote Write Spec

4.1.2 HTTP Endpoint

POST /api/v1/write
Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Prometheus-Remote-Write-Version: 0.1.0

Request Flow:

┌──────────────┐
│  Client      │
│ (Prometheus, │
│  FlareDB,    │
│  etc.)       │
└──────┬───────┘
       │
       │ 1. Collect samples
       │
       ▼
┌──────────────────────────────────┐
│  Encode to WriteRequest protobuf │
│  message                          │
└──────┬───────────────────────────┘
       │
       │ 2. Compress with Snappy
       │
       ▼
┌──────────────────────────────────┐
│  HTTP POST to /api/v1/write      │
│  with mTLS authentication         │
└──────┬───────────────────────────┘
       │
       │ 3. Send request
       │
       ▼
┌──────────────────────────────────┐
│  Metricstor Server               │
│  ├─ Validate mTLS cert           │
│  ├─ Decompress Snappy            │
│  ├─ Decode protobuf              │
│  ├─ Validate samples             │
│  ├─ Append to WAL                │
│  └─ Insert into Head             │
└──────┬───────────────────────────┘
       │
       │ 4. Response
       │
       ▼
┌──────────────────────────────────┐
│  HTTP Response:                  │
│  204 No Content (success)        │
│  400 Bad Request (invalid)       │
│  429 Too Many Requests           │
│      (backpressure)              │
│  503 Service Unavailable         │
│      (overload)                  │
└──────────────────────────────────┘

4.1.3 Protobuf Schema

File: proto/remote_write.proto

syntax = "proto3";

package metricstor.remote;

// Prometheus remote_write compatible schema

message WriteRequest {
  repeated TimeSeries timeseries = 1;
  // Metadata is optional and not used in v1
  repeated MetricMetadata metadata = 2;
}

message TimeSeries {
  repeated Label labels = 1;
  repeated Sample samples = 2;
  // Exemplars are optional (not supported in v1)
  repeated Exemplar exemplars = 3;
}

message Label {
  string name = 1;
  string value = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2;  // Unix timestamp in milliseconds
}

message Exemplar {
  repeated Label labels = 1;
  double value = 2;
  int64 timestamp = 3;
}

message MetricMetadata {
  enum MetricType {
    UNKNOWN = 0;
    COUNTER = 1;
    GAUGE = 2;
    HISTOGRAM = 3;
    GAUGEHISTOGRAM = 4;
    SUMMARY = 5;
    INFO = 6;
    STATESET = 7;
  }
  MetricType type = 1;
  string metric_family_name = 2;
  string help = 3;
  string unit = 4;
}

Generated Rust Code (using prost):

# Cargo.toml
[dependencies]
prost = "0.12"
prost-types = "0.12"

[build-dependencies]
prost-build = "0.12"

// build.rs
fn main() {
    prost_build::compile_protos(&["proto/remote_write.proto"], &["proto/"]).unwrap();
}
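In practice prost generates the encoder, but a hand-rolled sketch makes the wire format of the schema above concrete: length-delimited fields use wire type 2, `Sample.value` is a fixed 64-bit field, and `Sample.timestamp` a varint. (Before sending, the bytes would additionally be Snappy-compressed; positive timestamps only in this sketch.)

```rust
/// Append a protobuf varint (base-128, little-endian groups).
fn put_varint(buf: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        buf.push((v as u8) | 0x80);
        v >>= 7;
    }
    buf.push(v as u8);
}

/// Append a length-delimited field: tag, length, payload.
fn put_len_field(buf: &mut Vec<u8>, field_num: u32, payload: &[u8]) {
    put_varint(buf, ((field_num as u64) << 3) | 2); // wire type 2
    put_varint(buf, payload.len() as u64);
    buf.extend_from_slice(payload);
}

/// Encode a WriteRequest with one series and one sample.
fn encode_write_request(labels: &[(&str, &str)], value: f64, timestamp_ms: i64) -> Vec<u8> {
    let mut ts_msg = Vec::new();
    // Label { name = 1, value = 2 }, nested in TimeSeries.labels = 1
    for (name, val) in labels {
        let mut label = Vec::new();
        put_len_field(&mut label, 1, name.as_bytes());
        put_len_field(&mut label, 2, val.as_bytes());
        put_len_field(&mut ts_msg, 1, &label);
    }
    // Sample { value = 1 (fixed64), timestamp = 2 (varint) } in TimeSeries.samples = 2
    let mut sample = Vec::new();
    put_varint(&mut sample, (1 << 3) | 1); // wire type 1 = 64-bit
    sample.extend_from_slice(&value.to_bits().to_le_bytes());
    put_varint(&mut sample, (2 << 3) | 0); // wire type 0 = varint
    put_varint(&mut sample, timestamp_ms as u64);
    put_len_field(&mut ts_msg, 2, &sample);

    let mut req = Vec::new();
    put_len_field(&mut req, 1, &ts_msg); // WriteRequest.timeseries = 1
    req
}

fn main() {
    let req = encode_write_request(&[("__name__", "up")], 1.0, 1733832000000);
    assert_eq!(req[0], 0x0A); // field 1, wire type 2: the timeseries entry
    assert!(req.windows(8).any(|w| w == b"__name__"));
}
```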

4.1.4 Ingestion Handler

Rust Implementation:

use axum::{
    Router,
    routing::post,
    extract::State,
    http::StatusCode,
    body::Bytes,
};
use prost::Message;
use snap::raw::Decoder as SnappyDecoder;

mod remote_write_pb {
    include!(concat!(env!("OUT_DIR"), "/metricstor.remote.rs"));
}

struct IngestionService {
    head: Arc<Head>,
    wal: Arc<WAL>,
    rate_limiter: Arc<RateLimiter>,
}

async fn handle_remote_write(
    State(service): State<Arc<IngestionService>>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    // 1. Decompress Snappy
    let mut decoder = SnappyDecoder::new();
    let decompressed = decoder
        .decompress_vec(&body)
        .map_err(|e| (StatusCode::BAD_REQUEST, format!("Snappy decompression failed: {}", e)))?;

    // 2. Decode protobuf
    let write_req = remote_write_pb::WriteRequest::decode(&decompressed[..])
        .map_err(|e| (StatusCode::BAD_REQUEST, format!("Protobuf decode failed: {}", e)))?;

    // 3. Validate and ingest
    let mut samples_ingested = 0;
    let mut samples_rejected = 0;

    for ts in write_req.timeseries.iter() {
        // Validate labels
        let labels = validate_labels(&ts.labels)
            .map_err(|e| (StatusCode::BAD_REQUEST, e))?;

        let series_id = compute_series_id(&labels);

        for sample in ts.samples.iter() {
            // Validate timestamp (not too old, not too far in future)
            if !is_valid_timestamp(sample.timestamp) {
                samples_rejected += 1;
                continue;
            }

            // Check rate limit (per-sample; note that samples already
            // appended earlier in this request are not rolled back on reject)
            if !service.rate_limiter.allow() {
                return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
            }

            // Append to WAL
            let wal_record = WALRecord::Sample {
                series_id,
                timestamp: sample.timestamp,
                value: sample.value,
            };
            service.wal.append(&wal_record)
                .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("WAL append failed: {}", e)))?;

            // Insert into Head; out-of-order samples are skipped,
            // other errors abort the request
            if let Err(e) = service.head
                .append(series_id, labels.clone(), sample.timestamp, sample.value)
                .await
            {
                let msg = e.to_string();
                if msg.contains("out of order") {
                    samples_rejected += 1;
                    continue;
                }
                if msg.contains("buffer full") {
                    return Err((StatusCode::SERVICE_UNAVAILABLE, "Write buffer full".into()));
                }
                return Err((StatusCode::INTERNAL_SERVER_ERROR, format!("Insert failed: {}", e)));
            }

            samples_ingested += 1;
        }
    }

    info!("Ingested {} samples, rejected {}", samples_ingested, samples_rejected);
    Ok(StatusCode::NO_CONTENT)  // 204 No Content on success
}

fn validate_labels(labels: &[remote_write_pb::Label]) -> Result<BTreeMap<String, String>, String> {
    let mut label_map = BTreeMap::new();

    for label in labels {
        // Validate label name
        if !is_valid_label_name(&label.name) {
            return Err(format!("Invalid label name: {}", label.name));
        }

        // An empty label value means the label is absent (Prometheus convention)
        if label.value.is_empty() {
            continue;
        }

        label_map.insert(label.name.clone(), label.value.clone());
    }

    // Must have __name__ label
    if !label_map.contains_key("__name__") {
        return Err("Missing __name__ label".into());
    }

    Ok(label_map)
}

fn is_valid_label_name(name: &str) -> bool {
    // Must match [a-zA-Z_][a-zA-Z0-9_]*
    if name.is_empty() {
        return false;
    }

    let mut chars = name.chars();
    let first = chars.next().unwrap();
    if !first.is_ascii_alphabetic() && first != '_' {
        return false;
    }

    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn is_valid_timestamp(ts: i64) -> bool {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    let min_valid = now - 24 * 3600 * 1000;  // Not older than 24h
    let max_valid = now + 5 * 60 * 1000;      // Not more than 5min in future
    ts >= min_valid && ts <= max_valid
}

4.2 gRPC API (Alternative/Additional)

In addition to HTTP, Metricstor MAY support a gRPC API for ingestion (more efficient for internal services).

Proto Definition:

syntax = "proto3";

package metricstor.ingest;

service IngestionService {
  rpc Write(WriteRequest) returns (WriteResponse);
  rpc WriteBatch(stream WriteRequest) returns (WriteResponse);
}

message WriteRequest {
  repeated TimeSeries timeseries = 1;
}

message WriteResponse {
  uint64 samples_ingested = 1;
  uint64 samples_rejected = 2;
  string error = 3;
}

// (Reuse TimeSeries, Label, Sample from remote_write.proto)

4.3 Label Validation and Normalization

4.3.1 Metric Name Validation

Metric names (stored in __name__ label) must match:

[a-zA-Z_:][a-zA-Z0-9_:]*

Examples:

  • http_requests_total (valid)
  • node_cpu_seconds:rate5m (valid; ':' is conventionally reserved for recording rules)
  • 123_invalid (invalid: starts with digit)
  • invalid-metric (invalid: contains hyphen)
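
The metric-name rule can be sketched as a small check mirroring is_valid_label_name from the ingestion handler, with ':' additionally allowed:

```rust
/// Validate a metric name against [a-zA-Z_:][a-zA-Z0-9_:]*.
/// Sketch only; differs from label-name validation by permitting ':'.
fn is_valid_metric_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':' => {}
        _ => return false, // empty, or invalid first character
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':')
}
```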

4.3.2 Label Name Validation

Label names must match:

[a-zA-Z_][a-zA-Z0-9_]*

Reserved prefixes:

  • __ (double underscore): Internal labels (e.g., __name__, __rollup__)

4.3.3 Label Normalization

Before inserting, labels are normalized:

  1. Sort labels lexicographically by key
  2. Ensure __name__ label is present
  3. Remove duplicate labels (keep last value)
  4. Limit label count (default: 30 labels max per series)
  5. Limit label value length (default: 1024 chars max)
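
The normalization steps above can be sketched in one pass; the limits (30 labels, 1024 bytes) are the defaults quoted here, and byte length is used for step 5:

```rust
use std::collections::BTreeMap;

/// Normalize a raw label list: sort, deduplicate (last value wins),
/// enforce __name__, and apply the default limits from this section.
fn normalize_labels(raw: &[(String, String)]) -> Result<BTreeMap<String, String>, String> {
    const MAX_LABELS: usize = 30;
    const MAX_VALUE_LEN: usize = 1024; // bytes, in this sketch

    // BTreeMap both deduplicates (insert overwrites) and keeps keys sorted.
    let mut labels = BTreeMap::new();
    for (name, value) in raw {
        if value.len() > MAX_VALUE_LEN {
            return Err(format!("label {} value too long", name));
        }
        labels.insert(name.clone(), value.clone());
    }
    if !labels.contains_key("__name__") {
        return Err("missing __name__ label".into());
    }
    if labels.len() > MAX_LABELS {
        return Err(format!("too many labels: {}", labels.len()));
    }
    Ok(labels)
}
```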

4.4 Write Path Architecture

┌──────────────────────────────────────────────────────────────┐
│                   Ingestion Layer                             │
│                                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  HTTP/gRPC  │  │  mTLS Auth  │  │ Rate Limiter│          │
│  │   Handler   │─▶│  Validator  │─▶│             │          │
│  └─────────────┘  └─────────────┘  └──────┬──────┘          │
│                                             │                  │
│                                             ▼                  │
│                                  ┌─────────────────┐          │
│                                  │  Decompressor   │          │
│                                  │  (Snappy)       │          │
│                                  └────────┬────────┘          │
│                                           │                    │
│                                           ▼                    │
│                                  ┌─────────────────┐          │
│                                  │  Protobuf       │          │
│                                  │  Decoder        │          │
│                                  └────────┬────────┘          │
│                                           │                    │
└───────────────────────────────────────────┼──────────────────┘
                                            │
                                            ▼
┌──────────────────────────────────────────────────────────────┐
│                  Validation Layer                             │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Label       │  │  Timestamp   │  │  Cardinality │       │
│  │  Validator   │  │  Validator   │  │  Limiter     │       │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │                 │
│         └─────────────────┴─────────────────┘                 │
│                           │                                    │
└───────────────────────────┼──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                   Write Buffer                                │
│                                                                │
│  ┌────────────────────────────────────────────────────┐      │
│  │  MPSC Channel (bounded)                             │      │
│  │  Capacity: 100K samples                             │      │
│  │  Backpressure: Block/Reject when full              │      │
│  └────────────────────────────────────────────────────┘      │
│                           │                                    │
└───────────────────────────┼──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                   Storage Layer                               │
│                                                                │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │   WAL       │◀────────│  WAL Writer │                     │
│  │  (Disk)     │         │  (Thread)   │                     │
│  └─────────────┘         └─────────────┘                     │
│                                                                │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │   Head      │◀────────│  Head Writer│                     │
│  │ (In-Memory) │         │  (Thread)   │                     │
│  └─────────────┘         └─────────────┘                     │
└──────────────────────────────────────────────────────────────┘

Concurrency Model:

  1. HTTP/gRPC handlers: Multi-threaded (tokio async)
  2. Write buffer: MPSC channel (bounded capacity)
  3. WAL writer: Single-threaded (sequential writes for consistency)
  4. Head writer: Single-threaded (lock-free inserts via sharding)
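
The sharding in step 4 can be sketched as routing each series ID to a fixed shard, so concurrent inserts contend only on their own shard rather than a global lock (SHARD_COUNT and the shard contents are illustrative assumptions, not the real Head layout):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

const SHARD_COUNT: usize = 16; // assumed configuration value

/// Sketch of a sharded Head: series are partitioned by ID.
struct ShardedHead {
    shards: Vec<Mutex<HashMap<u64, Vec<(i64, f64)>>>>,
}

impl ShardedHead {
    fn new() -> Self {
        Self { shards: (0..SHARD_COUNT).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    fn append(&self, series_id: u64, ts: i64, value: f64) {
        // Same series always maps to the same shard
        let shard = &self.shards[(series_id as usize) % SHARD_COUNT];
        shard.lock().unwrap().entry(series_id).or_default().push((ts, value));
    }

    fn sample_count(&self, series_id: u64) -> usize {
        self.shards[(series_id as usize) % SHARD_COUNT]
            .lock().unwrap()
            .get(&series_id).map_or(0, Vec::len)
    }
}
```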

Backpressure Handling:

enum BackpressureStrategy {
    Block,   // Block until buffer has space (default)
    Reject,  // Return 503 immediately
}

impl IngestionService {
    async fn handle_backpressure(&self, samples: Vec<Sample>) -> Result<()> {
        match self.config.backpressure_strategy {
            BackpressureStrategy::Block => {
                // Block until buffer has space, but bound the wait
                tokio::time::timeout(
                    Duration::from_secs(5),
                    self.write_buffer.send(samples),
                )
                .await
                .map_err(|_| Error::Timeout)?          // timed out waiting for space
                .map_err(|_| Error::ChannelClosed)?;   // receiver dropped
            }
            BackpressureStrategy::Reject => {
                // Non-blocking send; fail fast when the buffer is full
                self.write_buffer.try_send(samples)
                    .map_err(|_| Error::BufferFull)?;
            }
        }
        Ok(())
    }
}

4.5 Out-of-Order Sample Handling

Problem: Samples may arrive out of timestamp order due to network delays, batching, etc.

Solution: Accept out-of-order samples within a configurable time window.

Configuration:

[storage]
out_of_order_time_window = "1h"  # Accept samples up to 1h old

Implementation:

impl Head {
    async fn append(
        &self,
        series_id: SeriesID,
        labels: BTreeMap<String, String>,
        timestamp: i64,
        value: f64,
    ) -> Result<()> {
        let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis() as i64;
        let min_valid_ts = now - self.config.out_of_order_time_window.as_millis() as i64;

        if timestamp < min_valid_ts {
            return Err(Error::OutOfOrder(format!(
                "Sample too old: ts={}, min={}",
                timestamp, min_valid_ts
            )));
        }

        // Get or create series
        let mut series_map = self.series.write().await;
        let series = series_map.entry(series_id).or_insert_with(|| {
            Arc::new(Series {
                id: series_id,
                labels: labels.clone(),
                chunks: RwLock::new(vec![]),
            })
        });

        // Append to the chunk covering this timestamp, creating one if
        // needed. Using an index avoids holding two mutable borrows of
        // `chunks` at once (iter_mut().find(..).or_else(push) won't compile).
        let mut chunks = series.chunks.write().await;
        let chunk_ms = self.chunk_size.as_millis() as i64;

        let idx = match chunks.iter().position(|c| timestamp >= c.min_time && timestamp < c.max_time) {
            Some(i) => i,
            None => {
                // Align the new chunk to a fixed-size time grid
                let chunk_start = (timestamp / chunk_ms) * chunk_ms;
                chunks.push(Chunk {
                    min_time: chunk_start,
                    max_time: chunk_start + chunk_ms,
                    samples: CompressedSamples::new(),
                });
                chunks.len() - 1
            }
        };

        chunks[idx].samples.append(timestamp, value)?;

        Ok(())
    }
}

5. PromQL Query Engine

5.1 PromQL Overview

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data.

Query Types:

  1. Instant query: Evaluate expression at a single point in time
  2. Range query: Evaluate expression over a time range

5.2 Supported PromQL Subset

Metricstor v1 supports a pragmatic subset of PromQL covering 80% of common dashboard queries.

5.2.1 Instant Vector Selectors

# Select by metric name
http_requests_total

# Select with label matchers
http_requests_total{method="GET"}
http_requests_total{method="GET", status="200"}

# Label matcher operators
metric{label="value"}        # Exact match
metric{label!="value"}       # Not equal
metric{label=~"regex"}       # Regex match
metric{label!~"regex"}       # Regex not match

# Example
http_requests_total{method=~"GET|POST", status!="500"}

5.2.2 Range Vector Selectors

# Select last 5 minutes of data
http_requests_total[5m]

# With label matchers
http_requests_total{method="GET"}[1h]

# Time durations: ms (milliseconds), s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
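
A minimal duration-literal parser for the single-character units above (compound forms like "1h30m" and the two-character "ms" unit are omitted from this sketch):

```rust
/// Parse a PromQL duration literal ("5m", "1h") into milliseconds.
/// Sketch: single-character units only; "w" and "y" use the fixed
/// 7-day / 365-day conventions.
fn parse_duration_ms(s: &str) -> Option<i64> {
    let split = s.len().checked_sub(1)?; // need at least "<digit><unit>"
    let (num, unit) = s.split_at(split);
    let n: i64 = num.parse().ok()?;
    let factor = match unit {
        "s" => 1_000,
        "m" => 60_000,
        "h" => 3_600_000,
        "d" => 86_400_000,
        "w" => 7 * 86_400_000,
        "y" => 365 * 86_400_000,
        _ => return None,
    };
    Some(n * factor)
}
```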

5.2.3 Aggregation Operators

# sum: Sum over dimensions
sum(http_requests_total)
sum(http_requests_total) by (method)
sum(http_requests_total) without (instance)

# Supported aggregations:
sum         # Sum
avg         # Average
min         # Minimum
max         # Maximum
count       # Count
stddev      # Standard deviation
stdvar      # Standard variance
topk(N, expr)    # Top N series by value
bottomk(N, expr) # Bottom N series by value

5.2.4 Functions

Rate Functions:

# rate: Per-second average rate of increase
rate(http_requests_total[5m])

# irate: Instant rate (last two samples)
irate(http_requests_total[5m])

# increase: Total increase over time range
increase(http_requests_total[1h])
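
The relationship between the three counter functions can be shown over a window of (timestamp_ms, value) samples; this sketch ignores counter resets and extrapolation, which production implementations must handle:

```rust
/// Per-second rate over the whole window (first/last delta).
fn rate(samples: &[(i64, f64)]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    let dt = (last.0 - first.0) as f64 / 1000.0; // ms -> seconds
    if dt <= 0.0 { return None; }
    Some((last.1 - first.1) / dt)
}

/// Instant rate: only the last two samples.
fn irate(samples: &[(i64, f64)]) -> Option<f64> {
    if samples.len() < 2 { return None; }
    rate(&samples[samples.len() - 2..])
}

/// Total increase over the window; equals rate * window seconds.
fn increase(samples: &[(i64, f64)]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    Some(last.1 - first.1)
}
```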

Quantile Functions:

# histogram_quantile: Calculate quantile from histogram
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
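
The quantile is estimated from cumulative bucket counts with linear interpolation inside the target bucket; a sketch over (upper_bound, cumulative_count) pairs sorted by bound and terminated by +Inf:

```rust
/// Estimate the q-quantile from cumulative histogram buckets.
/// Sketch of the Prometheus approach: linear interpolation within the
/// bucket that first reaches rank q * total.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let total = buckets.last()?.1;
    if total == 0.0 { return None; }
    let rank = q * total;
    let (mut prev_bound, mut prev_count) = (0.0, 0.0);
    for &(bound, count) in buckets {
        if count >= rank {
            if bound.is_infinite() {
                return Some(prev_bound); // cannot interpolate into +Inf
            }
            let frac = (rank - prev_count) / (count - prev_count);
            return Some(prev_bound + (bound - prev_bound) * frac);
        }
        prev_bound = bound;
        prev_count = count;
    }
    None
}
```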

Time Functions:

# time(): Current Unix timestamp
time()

# timestamp(): Timestamp of sample
timestamp(metric)

Math Functions:

# abs, ceil, floor, round, sqrt, exp, ln, log2, log10
abs(metric)
round(metric, 0.1)

5.2.5 Binary Operators

Arithmetic:

metric1 + metric2
metric1 - metric2
metric1 * metric2
metric1 / metric2
metric1 % metric2
metric1 ^ metric2

Comparison:

metric1 == metric2  # Equal
metric1 != metric2  # Not equal
metric1 > metric2   # Greater than
metric1 < metric2   # Less than
metric1 >= metric2  # Greater or equal
metric1 <= metric2  # Less or equal

Logical:

metric1 and metric2  # Intersection
metric1 or metric2   # Union
metric1 unless metric2  # Complement

Vector Matching:

# One-to-one matching
metric1 + metric2

# Many-to-one matching
metric1 + on(label) group_left metric2

# One-to-many matching
metric1 + on(label) group_right metric2
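
One-to-one matching with on(<label>) is effectively a hash join on the matching label; a simplified sketch for `lhs + on(label) rhs` (label sets are plain maps here, and unmatched samples drop out, as in PromQL):

```rust
use std::collections::HashMap;

/// Join two instant vectors on a single label and add matched values.
/// Returns (matching label value, sum) pairs; one-to-one matching only.
fn binary_add_on(
    on_label: &str,
    lhs: &[(HashMap<String, String>, f64)],
    rhs: &[(HashMap<String, String>, f64)],
) -> Vec<(String, f64)> {
    // Index the right-hand side by its value for the matching label
    let rhs_by_key: HashMap<&str, f64> = rhs.iter()
        .filter_map(|(labels, v)| labels.get(on_label).map(|k| (k.as_str(), *v)))
        .collect();

    lhs.iter()
        .filter_map(|(labels, v)| {
            let key = labels.get(on_label)?;
            let other = rhs_by_key.get(key.as_str())?;
            Some((key.clone(), v + other))
        })
        .collect()
}
```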

5.2.6 Subqueries (NOT SUPPORTED in v1)

Subqueries are complex and not supported in v1:

# NOT SUPPORTED
max_over_time(rate(http_requests_total[5m])[1h:])

5.3 Query Execution Model

5.3.1 Query Parsing

Use promql-parser crate (GreptimeTeam) for parsing:

use promql_parser::{parser, label};

fn parse_query(query: &str) -> Result<parser::Expr, ParseError> {
    parser::parse(query)
}

// Example
let expr = parse_query("http_requests_total{method=\"GET\"}[5m]")?;
match expr {
    parser::Expr::VectorSelector(vs) => {
        println!("Metric: {}", vs.name);
        for matcher in vs.matchers.matchers {
            println!("Label: {} {} {}", matcher.name, matcher.op, matcher.value);
        }
        println!("Range: {:?}", vs.range);
    }
    _ => {}
}

AST Types:

pub enum Expr {
    Aggregate(AggregateExpr),        // sum, avg, etc.
    Unary(UnaryExpr),                // -metric
    Binary(BinaryExpr),              // metric1 + metric2
    Paren(ParenExpr),                // (expr)
    Subquery(SubqueryExpr),          // NOT SUPPORTED
    NumberLiteral(NumberLiteral),    // 1.5
    StringLiteral(StringLiteral),    // "value"
    VectorSelector(VectorSelector),  // metric{labels}
    MatrixSelector(MatrixSelector),  // metric[5m]
    Call(Call),                      // rate(...)
}

5.3.2 Query Planner

Convert AST to execution plan:

enum QueryPlan {
    VectorSelector {
        matchers: Vec<LabelMatcher>,
        timestamp: i64,
    },
    MatrixSelector {
        matchers: Vec<LabelMatcher>,
        range: Duration,
        timestamp: i64,
    },
    Aggregate {
        op: AggregateOp,
        input: Box<QueryPlan>,
        grouping: Vec<String>,
    },
    RateFunc {
        input: Box<QueryPlan>,
    },
    BinaryOp {
        op: BinaryOp,
        lhs: Box<QueryPlan>,
        rhs: Box<QueryPlan>,
        matching: VectorMatching,
    },
}

struct QueryPlanner;

impl QueryPlanner {
    fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan> {
        match expr {
            parser::Expr::VectorSelector(vs) => {
                Ok(QueryPlan::VectorSelector {
                    matchers: vs.matchers.matchers.into_iter()
                        .map(|m| LabelMatcher::from_ast(m))
                        .collect(),
                    timestamp: query_time,
                })
            }
            parser::Expr::MatrixSelector(ms) => {
                Ok(QueryPlan::MatrixSelector {
                    matchers: ms.vector_selector.matchers.matchers.into_iter()
                        .map(|m| LabelMatcher::from_ast(m))
                        .collect(),
                    range: Duration::from_millis(ms.range as u64),
                    timestamp: query_time,
                })
            }
            parser::Expr::Call(call) => {
                match call.func.name.as_str() {
                    "rate" => {
                        let arg_plan = Self::plan(*call.args[0].clone(), query_time)?;
                        Ok(QueryPlan::RateFunc { input: Box::new(arg_plan) })
                    }
                    // ... other functions
                    _ => Err(Error::UnsupportedFunction(call.func.name)),
                }
            }
            parser::Expr::Aggregate(agg) => {
                let input_plan = Self::plan(*agg.expr, query_time)?;
                Ok(QueryPlan::Aggregate {
                    op: AggregateOp::from_str(&agg.op.to_string())?,
                    input: Box::new(input_plan),
                    grouping: agg.grouping.unwrap_or_default(),
                })
            }
            parser::Expr::Binary(bin) => {
                let lhs_plan = Self::plan(*bin.lhs, query_time)?;
                let rhs_plan = Self::plan(*bin.rhs, query_time)?;
                Ok(QueryPlan::BinaryOp {
                    op: BinaryOp::from_str(&bin.op.to_string())?,
                    lhs: Box::new(lhs_plan),
                    rhs: Box::new(rhs_plan),
                    matching: bin.modifier.map(|m| VectorMatching::from_ast(m)).unwrap_or_default(),
                })
            }
            _ => Err(Error::UnsupportedExpr),
        }
    }
}

5.3.3 Query Executor

Execute the plan:

struct QueryExecutor {
    head: Arc<Head>,
    blocks: Arc<BlockManager>,
}

impl QueryExecutor {
    async fn execute(&self, plan: QueryPlan) -> Result<QueryResult> {
        match plan {
            QueryPlan::VectorSelector { matchers, timestamp } => {
                self.execute_vector_selector(matchers, timestamp).await
            }
            QueryPlan::MatrixSelector { matchers, range, timestamp } => {
                self.execute_matrix_selector(matchers, range, timestamp).await
            }
            QueryPlan::RateFunc { input } => {
                let matrix = self.execute(*input).await?;
                self.apply_rate(matrix)
            }
            QueryPlan::Aggregate { op, input, grouping } => {
                let vector = self.execute(*input).await?;
                self.apply_aggregate(op, vector, grouping)
            }
            QueryPlan::BinaryOp { op, lhs, rhs, matching } => {
                let lhs_result = self.execute(*lhs).await?;
                let rhs_result = self.execute(*rhs).await?;
                self.apply_binary_op(op, lhs_result, rhs_result, matching)
            }
        }
    }

    async fn execute_vector_selector(
        &self,
        matchers: Vec<LabelMatcher>,
        timestamp: i64,
    ) -> Result<InstantVector> {
        // 1. Find matching series from index
        let series_ids = self.find_series(&matchers).await?;

        // 2. For each series, get sample at timestamp
        let mut samples = Vec::new();
        for series_id in series_ids {
            if let Some(sample) = self.get_sample_at(series_id, timestamp).await? {
                samples.push(sample);
            }
        }

        Ok(InstantVector { samples })
    }

    async fn execute_matrix_selector(
        &self,
        matchers: Vec<LabelMatcher>,
        range: Duration,
        timestamp: i64,
    ) -> Result<RangeVector> {
        let series_ids = self.find_series(&matchers).await?;

        let start = timestamp - range.as_millis() as i64;
        let end = timestamp;

        let mut ranges = Vec::new();
        for series_id in series_ids {
            let samples = self.get_samples_range(series_id, start, end).await?;
            ranges.push(RangeVectorSeries {
                labels: self.get_labels(series_id).await?,
                samples,
            });
        }

        Ok(RangeVector { ranges })
    }

    fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector> {
        let mut samples = Vec::new();

        for range in matrix.ranges {
            if range.samples.len() < 2 {
                continue;  // Need at least 2 samples for rate
            }

            let first = &range.samples[0];
            let last = &range.samples[range.samples.len() - 1];

            // NOTE: simple first/last delta; production rate() must also
            // detect counter resets (value drops) and extrapolate to the
            // window boundaries, as Prometheus does.
            let delta_value = last.value - first.value;
            let delta_time = (last.timestamp - first.timestamp) as f64 / 1000.0;  // ms -> seconds
            if delta_time == 0.0 {
                continue;  // Identical timestamps: rate undefined
            }

            let rate = delta_value / delta_time;

            samples.push(Sample {
                labels: range.labels,
                timestamp: last.timestamp,
                value: rate,
            });
        }

        Ok(InstantVector { samples })
    }

    fn apply_aggregate(
        &self,
        op: AggregateOp,
        vector: InstantVector,
        grouping: Vec<String>,
    ) -> Result<InstantVector> {
        // Group samples by grouping labels
        let mut groups: HashMap<Vec<(String, String)>, Vec<Sample>> = HashMap::new();

        for sample in vector.samples {
            let group_key = if grouping.is_empty() {
                vec![]
            } else {
                grouping.iter()
                    .filter_map(|label| sample.labels.get(label).map(|v| (label.clone(), v.clone())))
                    .collect()
            };

            groups.entry(group_key).or_insert_with(Vec::new).push(sample);
        }

        // Apply aggregation to each group
        let mut result_samples = Vec::new();
        for (group_labels, samples) in groups {
            let aggregated_value = match op {
                AggregateOp::Sum => samples.iter().map(|s| s.value).sum(),
                AggregateOp::Avg => samples.iter().map(|s| s.value).sum::<f64>() / samples.len() as f64,
                AggregateOp::Min => samples.iter().map(|s| s.value).fold(f64::INFINITY, f64::min),
                AggregateOp::Max => samples.iter().map(|s| s.value).fold(f64::NEG_INFINITY, f64::max),
                AggregateOp::Count => samples.len() as f64,
                // ... other aggregations (stddev, stdvar, topk, bottomk);
                // unsupported ops are rejected
                _ => return Err(Error::UnsupportedAggregation),
            };

            result_samples.push(Sample {
                labels: group_labels.into_iter().collect(),
                timestamp: samples[0].timestamp,
                value: aggregated_value,
            });
        }

        Ok(InstantVector { samples: result_samples })
    }
}

5.4 Read Path Architecture

┌──────────────────────────────────────────────────────────────┐
│                   Query Layer                                 │
│                                                                │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │  HTTP API   │  │  PromQL     │  │ Query       │          │
│  │  /api/v1/   │─▶│  Parser     │─▶│ Planner     │          │
│  │  query      │  │             │  │             │          │
│  └─────────────┘  └─────────────┘  └──────┬──────┘          │
│                                             │                  │
│                                             ▼                  │
│                                  ┌─────────────────┐          │
│                                  │  Query          │          │
│                                  │  Executor       │          │
│                                  └────────┬────────┘          │
└───────────────────────────────────────────┼──────────────────┘
                                            │
                                            ▼
┌──────────────────────────────────────────────────────────────┐
│                   Index Layer                                 │
│                                                                │
│  ┌──────────────┐  ┌──────────────┐                          │
│  │  Label Index │  │  Posting     │                          │
│  │  (In-Memory) │  │  Lists       │                          │
│  └──────┬───────┘  └──────┬───────┘                          │
│         │                 │                                    │
│         └─────────────────┘                                    │
│                  │                                             │
│                  │ Series IDs                                  │
│                  ▼                                             │
└──────────────────────────────────────────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────────────────────┐
│                   Storage Layer                               │
│                                                                │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │   Head      │         │   Blocks    │                     │
│  │ (In-Memory) │         │   (Disk)    │                     │
│  └─────┬───────┘         └─────┬───────┘                     │
│        │                       │                               │
│        │ Recent data (<2h)     │ Historical data              │
│        │                       │                               │
│        ▼                       ▼                               │
│  ┌─────────────────────────────────────┐                     │
│  │  Chunk Reader                        │                     │
│  │  - Decompress Gorilla chunks         │                     │
│  │  - Filter by time range              │                     │
│  │  - Return samples                    │                     │
│  └─────────────────────────────────────┘                     │
└──────────────────────────────────────────────────────────────┘

5.5 HTTP Query API

5.5.1 Instant Query

GET /api/v1/query?query=<promql>&time=<timestamp>&timeout=<duration>

Parameters:

  • query: PromQL expression (required)
  • time: Unix timestamp (optional, default: now)
  • timeout: Query timeout (optional, default: 30s)

Response (JSON):

{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "http_requests_total",
          "method": "GET",
          "status": "200"
        },
        "value": [1733832000, "1543"]
      }
    ]
  }
}

5.5.2 Range Query

GET /api/v1/query_range?query=<promql>&start=<timestamp>&end=<timestamp>&step=<duration>

Parameters:

  • query: PromQL expression (required)
  • start: Start timestamp (required)
  • end: End timestamp (required)
  • step: Query resolution step (required, e.g., "15s")
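
A range query is simply the instant query evaluated at each step timestamp: start, start+step, ..., up to and including end. A sketch of the step expansion:

```rust
/// Expand a range query's (start, end, step) into the list of
/// evaluation timestamps, all in milliseconds.
fn evaluation_timestamps(start: i64, end: i64, step_ms: i64) -> Vec<i64> {
    assert!(step_ms > 0);
    (0i64..)
        .map(|i| start + i * step_ms)
        .take_while(|ts| *ts <= end)
        .collect()
}
```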

Response (JSON):

{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "http_requests_total",
          "method": "GET"
        },
        "values": [
          [1733832000, "1543"],
          [1733832015, "1556"],
          [1733832030, "1570"]
        ]
      }
    ]
  }
}

5.5.3 Label Values Query

GET /api/v1/label/<label_name>/values?match[]=<matcher>

Example:

GET /api/v1/label/method/values?match[]=http_requests_total

Response:

{
  "status": "success",
  "data": ["GET", "POST", "PUT", "DELETE"]
}

5.5.4 Series Metadata Query

GET /api/v1/series?match[]=<matcher>&start=<timestamp>&end=<timestamp>

Example:

GET /api/v1/series?match[]=http_requests_total{method="GET"}

Response:

{
  "status": "success",
  "data": [
    {
      "__name__": "http_requests_total",
      "method": "GET",
      "status": "200",
      "instance": "flaredb-1:9092"
    }
  ]
}

5.6 Performance Optimizations

5.6.1 Query Caching

Cache query results for identical queries:

struct QueryCache {
    cache: Arc<Mutex<LruCache<String, (QueryResult, Instant)>>>,
    ttl: Duration,
}

impl QueryCache {
    fn get(&self, query_hash: &str) -> Option<QueryResult> {
        let cache = self.cache.lock().unwrap();
        if let Some((result, timestamp)) = cache.get(query_hash) {
            if timestamp.elapsed() < self.ttl {
                return Some(result.clone());
            }
        }
        None
    }

    fn put(&self, query_hash: String, result: QueryResult) {
        let mut cache = self.cache.lock().unwrap();
        cache.put(query_hash, (result, Instant::now()));
    }
}

5.6.2 Posting List Intersection

Use efficient algorithms for label matcher intersection:

fn intersect_posting_lists(lists: Vec<&[SeriesID]>) -> Vec<SeriesID> {
    if lists.is_empty() {
        return vec![];
    }

    // Sort lists by length (shortest first for early termination)
    let mut sorted_lists = lists;
    sorted_lists.sort_by_key(|list| list.len());

    // Use shortest list as base, intersect with others
    let mut result: HashSet<SeriesID> = sorted_lists[0].iter().copied().collect();

    for list in &sorted_lists[1..] {
        let list_set: HashSet<SeriesID> = list.iter().copied().collect();
        result.retain(|id| list_set.contains(id));

        if result.is_empty() {
            break;  // Early termination
        }
    }

    result.into_iter().collect()
}
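
Since posting lists are typically stored sorted, a two-pointer merge is an alternative to the HashSet approach above: O(n+m) with no hashing or allocation per element. A sketch for two lists:

```rust
/// Intersect two ascending-sorted posting lists via two-pointer merge.
fn intersect_sorted(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,     // advance the smaller side
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}
```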

5.6.3 Chunk Pruning

Skip chunks that don't overlap query time range:

fn query_chunks(
    chunks: &[ChunkRef],
    start_time: i64,
    end_time: i64,
) -> Vec<ChunkRef> {
    chunks.iter()
        .filter(|chunk| {
            // Chunk overlaps query range if:
            // chunk.max_time > start AND chunk.min_time < end
            chunk.max_time > start_time && chunk.min_time < end_time
        })
        .copied()
        .collect()
}

6. Storage Backend Architecture

6.1 Architecture Decision: Hybrid Approach

After analyzing trade-offs, Metricstor adopts a hybrid storage architecture:

  1. Dedicated time-series engine for sample storage (optimized for write throughput and compression)
  2. Optional FlareDB integration for metadata and distributed coordination (future work)
  3. Optional S3-compatible backend for cold data archival (future work)

6.2 Decision Rationale

6.2.1 Why NOT Pure FlareDB Backend?

FlareDB Characteristics:

  • General-purpose KV store with Raft consensus
  • Optimized for: Strong consistency, small KV pairs, random access
  • Storage: RocksDB (LSM tree)

Time-Series Workload Characteristics:

  • High write throughput (100K samples/sec)
  • Sequential writes (append-only)
  • Temporal locality (queries focus on recent data)
  • Bulk reads (range scans over time windows)

Mismatch Analysis:

| Aspect        | FlareDB (KV)                       | Time-Series Engine              |
|---------------|------------------------------------|---------------------------------|
| Write pattern | Random writes, compaction overhead | Append-only, minimal overhead   |
| Compression   | Generic LZ4/Snappy                 | Domain-specific (Gorilla: ~12x) |
| Read pattern  | Point lookups                      | Range scans over time           |
| Indexing      | Key-based                          | Label-based inverted index      |
| Consistency   | Strong (Raft)                      | Eventual OK for metrics         |

Conclusion: Using FlareDB for sample storage would cost roughly 5-10x in write throughput and about 10x in compression efficiency compared to a dedicated engine.
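
The "Gorilla" compression advantage rests largely on delta-of-delta timestamp encoding: scrape intervals are nearly constant, so second-order deltas are almost always zero and pack into very few bits. This sketch computes the delta-of-delta values only; real Gorilla additionally packs them into variable-width bit fields and XOR-compresses the float values:

```rust
/// Compute delta-of-delta values for a sorted timestamp sequence.
/// With a steady scrape interval most outputs are 0, which is why
/// Gorilla-style encoding compresses so well.
fn delta_of_deltas(timestamps: &[i64]) -> Vec<i64> {
    timestamps
        .windows(2)
        .map(|w| w[1] - w[0])        // first-order deltas
        .collect::<Vec<_>>()
        .windows(2)
        .map(|w| w[1] - w[0])        // second-order deltas
        .collect()
}
```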

6.2.2 Why NOT VictoriaMetrics Binary?

VictoriaMetrics is written in Go and has excellent performance, but:

  • mTLS support is paid only (violates PROJECT.md requirement)
  • Not Rust (violates PROJECT.md "Rustで書く")
  • Cannot integrate with FlareDB for metadata (future requirement)
  • Less control over storage format and optimizations

6.2.3 Why Hybrid (Dedicated + Optional FlareDB)?

Phase 1 (T033 v1): Pure dedicated engine

  • Simple, single-instance deployment
  • Focus on core functionality (ingest + query)
  • Local disk storage only

Phase 2 (Future): Add FlareDB for metadata

  • Store series labels and metadata in FlareDB "metrics" namespace
  • Enables multi-instance coordination
  • Global view of series cardinality, label values
  • Samples still in dedicated engine (local disk)

Phase 3 (Future): Add S3 for cold storage

  • Automatically upload old blocks (>7 days) to S3
  • Query federation across local + S3 blocks
  • Unlimited retention with cost-effective storage

Benefits:

  • v1 simplicity: No FlareDB dependency, easy deployment
  • Future scalability: Metadata in FlareDB, samples distributed
  • Operational flexibility: Can run standalone or integrated

6.3 Storage Layout

6.3.1 Directory Structure

/var/lib/metricstor/
├── data/
│   ├── wal/
│   │   ├── 00000001              # WAL segment
│   │   ├── 00000002
│   │   └── checkpoint.00000002   # WAL checkpoint
│   ├── 01HQZQZQZQZQZQZQZQZQZQ/  # Block (ULID)
│   │   ├── meta.json
│   │   ├── index
│   │   ├── chunks/
│   │   │   ├── 000001
│   │   │   └── 000002
│   │   └── tombstones
│   ├── 01HQZR.../                # Another block
│   └── ...
└── tmp/                          # Temp files for compaction

6.3.2 Metadata Storage (Future: FlareDB Integration)

When FlareDB integration is enabled:

Series Metadata (stored in FlareDB "metrics" namespace):

Key: series:<series_id>
Value: {
  "labels": {"__name__": "http_requests_total", "method": "GET", ...},
  "first_seen": 1733832000000,
  "last_seen": 1733839200000
}

Key: label_index:<label_name>:<label_value>
Value: [series_id1, series_id2, ...]  # Posting list
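A minimal sketch of the key construction for this layout (helper names are illustrative, not a committed API):

```rust
// Keys in the FlareDB "metrics" namespace, as described above.
fn series_key(series_id: u64) -> String {
    format!("series:{series_id}")
}

fn label_index_key(name: &str, value: &str) -> String {
    format!("label_index:{name}:{value}")
}

fn main() {
    assert_eq!(series_key(42), "series:42");
    assert_eq!(label_index_key("method", "GET"), "label_index:method:GET");
}
```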

Benefits:

  • Fast label value lookups across all instances
  • Global series cardinality tracking
  • Distributed query planning (future)

Trade-off: Adds dependency on FlareDB, increases complexity

6.4 Scalability Approach

6.4.1 Vertical Scaling (v1)

Single instance scales to:

  • 10M active series
  • 100K samples/sec write throughput
  • 1K queries/sec

Scaling strategy:

  • Increase memory (more series in Head)
  • Faster disk (NVMe for WAL/blocks)
  • More CPU cores (parallel compaction, query execution)

6.4.2 Horizontal Scaling (Future)

Sharding Strategy (inspired by Prometheus federation + Thanos):

┌────────────────────────────────────────────────────────────┐
│                   Query Frontend                            │
│                   (Query Federation)                        │
└─────┬────────────────────┬─────────────────────┬───────────┘
      │                    │                     │
      ▼                    ▼                     ▼
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ Metricstor  │     │ Metricstor  │     │ Metricstor  │
│ Instance 1  │     │ Instance 2  │     │ Instance N  │
│             │     │             │     │             │
│ Hash shard: │     │ Hash shard: │     │ Hash shard: │
│ 0-333       │     │ 334-666     │     │ 667-999     │
└─────────────┘     └─────────────┘     └─────────────┘
      │                    │                     │
      └────────────────────┴─────────────────────┘
                          │
                          ▼
                  ┌───────────────┐
                  │   FlareDB     │
                  │  (Metadata)   │
                  └───────────────┘

Sharding Key: Hash(series_id) % num_shards
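The sharding rule can be sketched as follows; the splitmix64-style bit mix stands in for a real hash (e.g. xxhash) and is an illustrative choice, not a committed design:

```rust
// Route a series to a shard: mix the series ID so sequential IDs spread
// evenly, then take the remainder over the shard count.
fn shard_for(series_id: u64, num_shards: u64) -> u64 {
    // splitmix64 finalizer as the mixing function (illustrative choice).
    let mut x = series_id.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^= x >> 31;
    x % num_shards
}

fn main() {
    let shard = shard_for(42, 3);
    assert!(shard < 3);
    // The same series always routes to the same shard.
    assert_eq!(shard_for(42, 3), shard);
}
```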

Query Execution:

  1. Query frontend receives PromQL query
  2. Determine which shards contain matching series (via FlareDB metadata)
  3. Send subqueries to relevant shards
  4. Merge results (aggregation, deduplication)
  5. Return to client

Challenges (deferred to future work):

  • Rebalancing when adding/removing shards
  • Handling series that span multiple shards (rare)
  • Ensuring query consistency across shards

6.5 S3 Integration Strategy (Future)

Objective: Cost-effective long-term retention (>15 days)

Architecture:

┌───────────────────────────────────────────────────┐
│  Metricstor Server                                 │
│                                                    │
│  ┌──────────┐      ┌──────────┐                  │
│  │  Head    │      │  Blocks  │                  │
│  │  (0-2h)  │      │  (2h-15d)│                  │
│  └──────────┘      └────┬─────┘                  │
│                          │                         │
│                          │ Background uploader     │
│                          ▼                         │
│                   ┌─────────────┐                 │
│                   │  Upload to  │                 │
│                   │  S3 (>7d)   │                 │
│                   └──────┬──────┘                 │
└──────────────────────────┼────────────────────────┘
                           │
                           ▼
                  ┌─────────────────┐
                  │  S3 Bucket      │
                  │  /blocks/       │
                  │    01HQZ.../    │
                  │    01HRZ.../    │
                  └─────────────────┘

Workflow:

  1. Block compaction creates local block files
  2. Blocks older than 7 days (configurable) are uploaded to S3
  3. Local block files deleted after successful upload
  4. Query executor checks both local and S3 for blocks in query range
  5. Download S3 blocks on-demand (with local cache)
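The upload-eligibility decision in step 2 can be sketched as a pure function over block metadata (field names here mirror the block `meta.json` informally and are assumptions):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Hypothetical block metadata; only the newest-sample timestamp matters here.
struct BlockMeta { max_time_ms: i64 }

// A block is eligible for S3 upload once its newest sample is older than
// `upload_after` (the upload_after_days setting).
fn eligible_for_upload(block: &BlockMeta, now: SystemTime, upload_after: Duration) -> bool {
    let now_ms = now.duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    now_ms - block.max_time_ms > upload_after.as_millis() as i64
}

fn main() {
    let now = SystemTime::now();
    let now_ms = now.duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    let old = BlockMeta { max_time_ms: now_ms - 8 * 86_400_000 }; // 8 days old
    let fresh = BlockMeta { max_time_ms: now_ms - 86_400_000 };   // 1 day old
    let week = Duration::from_secs(7 * 86_400);
    assert!(eligible_for_upload(&old, now, week));
    assert!(!eligible_for_upload(&fresh, now, week));
}
```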

Configuration:

[storage.s3]
enabled = true
endpoint = "https://s3.example.com"
bucket = "metricstor-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100

7. Integration Points

7.1 Service Discovery (How Services Push Metrics)

7.1.1 Service Configuration Pattern

Each platform service (FlareDB, ChainFire, etc.) exports Prometheus metrics on ports 9091-9099.

Example (FlareDB metrics exporter):

// flaredb-server/src/main.rs
use metrics_exporter_prometheus::PrometheusBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    // ... initialization ...

    let metrics_addr = format!("0.0.0.0:{}", args.metrics_port);
    let builder = PrometheusBuilder::new();
    builder
        .with_http_listener(metrics_addr.parse::<std::net::SocketAddr>()?)
        .install()
        .expect("Failed to install Prometheus metrics exporter");

    info!("Prometheus metrics available at http://{}/metrics", metrics_addr);

    // ... rest of main ...
}

Service Metrics Ports (from T027.S2):

| Service | Port | Endpoint |
|---|---|---|
| ChainFire | 9091 | http://chainfire:9091/metrics |
| FlareDB | 9092 | http://flaredb:9092/metrics |
| PlasmaVMC | 9093 | http://plasmavmc:9093/metrics |
| IAM | 9094 | http://iam:9094/metrics |
| LightningSTOR | 9095 | http://lightningstor:9095/metrics |
| FlashDNS | 9096 | http://flashdns:9096/metrics |
| FiberLB | 9097 | http://fiberlb:9097/metrics |
| Novanet | 9098 | http://novanet:9098/metrics |

7.1.2 Scrape-to-Push Adapter

Since Metricstor is push-based but services export pull-based Prometheus /metrics endpoints, we need a scrape-to-push adapter.

Option 1: Prometheus Agent Mode + Remote Write

Deploy Prometheus in agent mode (no storage, only scraping):

# prometheus-agent.yaml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'cloud-platform'

scrape_configs:
  - job_name: 'chainfire'
    static_configs:
      - targets: ['chainfire:9091']

  - job_name: 'flaredb'
    static_configs:
      - targets: ['flaredb:9092']

  # ... other services ...

remote_write:
  - url: 'https://metricstor:8080/api/v1/write'
    tls_config:
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key
      ca_file: /etc/certs/ca.crt

Option 2: Custom Rust Scraper (Platform-Native)

Build a lightweight scraper in Rust that integrates with Metricstor:

// metricstor-scraper/src/main.rs

struct Scraper {
    targets: Vec<ScrapeTarget>,
    client: reqwest::Client,
    metricstor_client: MetricstorClient,
}

struct ScrapeTarget {
    job_name: String,
    url: String,
    interval: Duration,
}

impl Scraper {
    async fn scrape_loop(&self) {
        loop {
            for target in &self.targets {
                let result = self.scrape_target(target).await;
                match result {
                    Ok(samples) => {
                        if let Err(e) = self.metricstor_client.write(samples).await {
                            error!("Failed to write to Metricstor: {}", e);
                        }
                    }
                    Err(e) => {
                        error!("Failed to scrape {}: {}", target.url, e);
                    }
                }
            }
            tokio::time::sleep(Duration::from_secs(15)).await;
        }
    }

    async fn scrape_target(&self, target: &ScrapeTarget) -> Result<Vec<Sample>> {
        let response = self.client.get(&target.url).send().await?;
        let body = response.text().await?;

        // Parse Prometheus text format
        let samples = parse_prometheus_text(&body, &target.job_name)?;
        Ok(samples)
    }
}

fn parse_prometheus_text(text: &str, job: &str) -> Result<Vec<Sample>> {
    // Use the prometheus-parse crate, or implement a simple line parser.
    // A job="..." label is attached to every parsed sample, e.g.:
    // http_requests_total{method="GET",status="200",job="flaredb"} 1543 1733832000000
    let _ = (text, job);
    todo!("parse the Prometheus text exposition format into samples")
}
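A minimal, std-only sketch of the parse step, handling only the `name{labels} value timestamp` line shape (no escaping, no `# HELP`/`# TYPE` metadata, no job-label injection):

```rust
// Parse one exposition-format line into (metric-with-labels, value, timestamp).
fn parse_line(line: &str) -> Option<(String, f64, Option<i64>)> {
    if line.starts_with('#') || line.trim().is_empty() {
        return None; // comments and blank lines carry no samples
    }
    // The metric part ends at the closing '}' (or first space for bare names).
    let split_at = line
        .rfind('}')
        .map(|i| i + 1)
        .unwrap_or_else(|| line.find(' ').unwrap_or(0));
    let (metric, rest) = line.split_at(split_at);
    let mut parts = rest.split_whitespace();
    let value: f64 = parts.next()?.parse().ok()?;
    let ts: Option<i64> = parts.next().and_then(|t| t.parse().ok());
    Some((metric.to_string(), value, ts))
}

fn main() {
    let (metric, value, ts) =
        parse_line(r#"http_requests_total{method="GET",status="200"} 1543 1733832000000"#)
            .unwrap();
    assert_eq!(metric, r#"http_requests_total{method="GET",status="200"}"#);
    assert_eq!(value, 1543.0);
    assert_eq!(ts, Some(1733832000000));
    assert!(parse_line("# HELP http_requests_total ...").is_none());
}
```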

Deployment:

  • metricstor-scraper runs as a sidecar or separate service
  • Reads scrape config from TOML file
  • Uses mTLS to push to Metricstor

Recommendation: Option 2 (custom scraper) for consistency with platform philosophy (100% Rust, no external dependencies).

7.2 mTLS Configuration (T027/T031 Patterns)

7.2.1 TLS Config Structure

Following existing patterns (FlareDB, ChainFire, IAM):

# metricstor.toml

[server]
addr = "0.0.0.0:8080"
log_level = "info"

[server.tls]
cert_file = "/etc/metricstor/certs/server.crt"
key_file = "/etc/metricstor/certs/server.key"
ca_file = "/etc/metricstor/certs/ca.crt"
require_client_cert = true  # Enable mTLS

Rust Config Struct:

// metricstor-server/src/config.rs

use serde::{Deserialize, Serialize};
use std::net::SocketAddr;

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServerConfig {
    pub server: ServerSettings,
    pub storage: StorageConfig,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServerSettings {
    pub addr: SocketAddr,
    pub log_level: String,
    pub tls: Option<TlsConfig>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TlsConfig {
    pub cert_file: String,
    pub key_file: String,
    pub ca_file: Option<String>,
    #[serde(default)]
    pub require_client_cert: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StorageConfig {
    pub data_dir: String,
    pub retention_days: u32,
    pub wal_segment_size_mb: usize,
    // ... other storage settings
}

7.2.2 mTLS Server Setup

// metricstor-server/src/main.rs

use axum::Router;
use axum_server::tls_rustls::RustlsConfig;
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<()> {
    let config = ServerConfig::load("metricstor.toml")?;

    // Build router
    let app = Router::new()
        .route("/api/v1/write", post(handle_remote_write))
        .route("/api/v1/query", get(handle_instant_query))
        .route("/api/v1/query_range", get(handle_range_query))
        .route("/health", get(health_check))
        .route("/ready", get(readiness_check))
        .with_state(Arc::new(service));

    // Setup TLS if configured
    if let Some(tls_config) = &config.server.tls {
        info!("TLS enabled, loading certificates...");

        let rustls_config = if tls_config.require_client_cert {
            info!("mTLS enabled, requiring client certificates");

            let ca_cert_pem = tokio::fs::read_to_string(
                tls_config.ca_file.as_ref().ok_or("ca_file required for mTLS")?
            ).await?;

            // axum-server's RustlsConfig does not expose client-cert verification
            // directly. build_mtls_config (hypothetical helper) constructs a
            // rustls::ServerConfig with a client certificate verifier built from
            // the CA roots, then wraps it in a RustlsConfig.
            build_mtls_config(&tls_config.cert_file, &tls_config.key_file, &ca_cert_pem).await?
        } else {
            info!("TLS-only mode, client certificates not required");
            RustlsConfig::from_pem_file(
                &tls_config.cert_file,
                &tls_config.key_file,
            ).await?
        };

        axum_server::bind_rustls(config.server.addr, rustls_config)
            .serve(app.into_make_service())
            .await?;
    } else {
        info!("TLS disabled, running in plain-text mode");
        axum_server::bind(config.server.addr)
            .serve(app.into_make_service())
            .await?;
    }

    Ok(())
}

7.2.3 Client Certificate Validation

Extract client identity from mTLS certificate:

use axum::{
    extract::Request,
    middleware::Next,
    response::Response,
    Extension,
};

#[derive(Clone, Debug)]
struct ClientIdentity {
    common_name: String,
    organization: String,
}

// axum 0.7 middleware signature: Request and Next are no longer generic
// over the body type.
async fn extract_client_identity(
    Extension(client_cert): Extension<Option<rustls::Certificate>>,
    mut request: Request,
    next: Next,
) -> Response {
    if let Some(cert) = client_cert {
        // Parse the DER certificate (e.g. with x509-parser) to extract CN, O, etc.
        let identity = parse_certificate(&cert);
        request.extensions_mut().insert(identity);
    }

    next.run(request).await
}

// Use identity for rate limiting, audit logging, etc.
async fn handle_remote_write(
    Extension(identity): Extension<ClientIdentity>,
    State(service): State<Arc<IngestionService>>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    info!("Write request from: {}", identity.common_name);

    // Apply per-client rate limiting
    if !service.rate_limiter.allow(&identity.common_name) {
        return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
    }

    // ... rest of handler ...
}
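A per-client token bucket that could back the `rate_limiter.allow` call above. This is a sketch: the real limiter would need interior mutability (e.g. `DashMap` plus atomics) to be callable through the shared `&self` handler state.

```rust
use std::collections::HashMap;
use std::time::Instant;

struct TokenBucket {
    tokens: f64,
    last: Instant,
}

struct RateLimiter {
    rate: f64,  // tokens refilled per second
    burst: f64, // bucket capacity
    buckets: HashMap<String, TokenBucket>,
}

impl RateLimiter {
    fn new(rate: f64, burst: f64) -> Self {
        Self { rate, burst, buckets: HashMap::new() }
    }

    fn allow(&mut self, client: &str) -> bool {
        let (rate, burst) = (self.rate, self.burst);
        let now = Instant::now();
        let b = self
            .buckets
            .entry(client.to_string())
            .or_insert(TokenBucket { tokens: burst, last: now });
        // Refill proportionally to elapsed time, capped at burst size.
        b.tokens = (b.tokens + now.duration_since(b.last).as_secs_f64() * rate).min(burst);
        b.last = now;
        if b.tokens >= 1.0 {
            b.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut rl = RateLimiter::new(1.0, 2.0);
    assert!(rl.allow("flaredb"));
    assert!(rl.allow("flaredb"));
    assert!(!rl.allow("flaredb")); // burst of 2 exhausted
    assert!(rl.allow("chainfire")); // independent bucket per client
}
```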

7.3 gRPC API Design

While HTTP is the primary interface (Prometheus compatibility), a gRPC API can provide:

  • Better performance for internal services
  • Streaming support for batch ingestion
  • Type-safe client libraries

Proto Definition:

// proto/metricstor.proto

syntax = "proto3";

package metricstor.v1;

service MetricstorService {
  // Write samples
  rpc Write(WriteRequest) returns (WriteResponse);

  // Streaming write for high-throughput scenarios
  rpc WriteStream(stream WriteRequest) returns (WriteResponse);

  // Query (instant)
  rpc Query(QueryRequest) returns (QueryResponse);

  // Query (range)
  rpc QueryRange(QueryRangeRequest) returns (QueryRangeResponse);

  // Admin operations
  rpc Compact(CompactRequest) returns (CompactResponse);
  rpc DeleteSeries(DeleteSeriesRequest) returns (DeleteSeriesResponse);
}

message WriteRequest {
  repeated TimeSeries timeseries = 1;
}

message WriteResponse {
  uint64 samples_ingested = 1;
  uint64 samples_rejected = 2;
}

message QueryRequest {
  string query = 1;  // PromQL
  int64 time = 2;    // Unix timestamp (ms)
  int64 timeout_ms = 3;
}

message QueryResponse {
  string result_type = 1;  // "vector" or "matrix"
  repeated InstantVectorSample vector = 2;
  repeated RangeVectorSeries matrix = 3;
}

message InstantVectorSample {
  map<string, string> labels = 1;
  double value = 2;
  int64 timestamp = 3;
}

message RangeVectorSeries {
  map<string, string> labels = 1;
  repeated Sample samples = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2;
}

7.4 NixOS Module Integration

Following T024 patterns, create a NixOS module for Metricstor.

File: nix/modules/metricstor.nix

{ config, lib, pkgs, ... }:

with lib;

let
  cfg = config.services.metricstor;

  configFile = pkgs.writeText "metricstor.toml" ''
    [server]
    addr = "${cfg.listenAddress}"
    log_level = "${cfg.logLevel}"

    ${optionalString (cfg.tls.enable) ''
    [server.tls]
    cert_file = "${cfg.tls.certFile}"
    key_file = "${cfg.tls.keyFile}"
    ${optionalString (cfg.tls.caFile != null) ''
    ca_file = "${cfg.tls.caFile}"
    ''}
    require_client_cert = ${boolToString cfg.tls.requireClientCert}
    ''}

    [storage]
    data_dir = "${cfg.dataDir}"
    retention_days = ${toString cfg.storage.retentionDays}
    wal_segment_size_mb = ${toString cfg.storage.walSegmentSizeMb}
  '';

in {
  options.services.metricstor = {
    enable = mkEnableOption "Metricstor metrics storage service";

    package = mkOption {
      type = types.package;
      default = pkgs.metricstor;
      description = "Metricstor package to use";
    };

    listenAddress = mkOption {
      type = types.str;
      default = "0.0.0.0:8080";
      description = "Address and port to listen on";
    };

    logLevel = mkOption {
      type = types.enum [ "trace" "debug" "info" "warn" "error" ];
      default = "info";
      description = "Log level";
    };

    dataDir = mkOption {
      type = types.path;
      default = "/var/lib/metricstor";
      description = "Data directory for TSDB storage";
    };

    openFirewall = mkOption {
      type = types.bool;
      default = false;
      description = "Open the firewall for the listen port";
    };

    tls = {
      enable = mkEnableOption "TLS encryption";

      certFile = mkOption {
        type = types.str;
        description = "Path to TLS certificate file";
      };

      keyFile = mkOption {
        type = types.str;
        description = "Path to TLS private key file";
      };

      caFile = mkOption {
        type = types.nullOr types.str;
        default = null;
        description = "Path to CA certificate for client verification (mTLS)";
      };

      requireClientCert = mkOption {
        type = types.bool;
        default = false;
        description = "Require client certificates (mTLS)";
      };
    };

    storage = {
      retentionDays = mkOption {
        type = types.ints.positive;
        default = 15;
        description = "Data retention period in days";
      };

      walSegmentSizeMb = mkOption {
        type = types.ints.positive;
        default = 128;
        description = "WAL segment size in MB";
      };
    };
  };

  config = mkIf cfg.enable {
    systemd.services.metricstor = {
      description = "Metricstor Metrics Storage Service";
      wantedBy = [ "multi-user.target" ];
      after = [ "network.target" ];

      serviceConfig = {
        Type = "simple";
        ExecStart = "${cfg.package}/bin/metricstor-server --config ${configFile}";
        Restart = "on-failure";
        RestartSec = "5s";

        # Security hardening
        DynamicUser = true;
        StateDirectory = "metricstor";
        ProtectSystem = "strict";
        ProtectHome = true;
        PrivateTmp = true;
        NoNewPrivileges = true;
      };
    };

    # Expose metrics endpoint
    networking.firewall.allowedTCPPorts = mkIf cfg.openFirewall [ 8080 ];
  };
}

Usage Example (in NixOS configuration):

{
  services.metricstor = {
    enable = true;
    listenAddress = "0.0.0.0:8080";
    logLevel = "info";

    tls = {
      enable = true;
      certFile = "/etc/certs/metricstor-server.crt";
      keyFile = "/etc/certs/metricstor-server.key";
      caFile = "/etc/certs/ca.crt";
      requireClientCert = true;
    };

    storage = {
      retentionDays = 30;
    };
  };
}

8. Implementation Plan

8.1 Step Breakdown (S1-S6)

The implementation follows a phased approach aligned with the task.yaml steps.

S1: Research & Architecture (Current Document)

Deliverable: This design document

Status: Completed


S2: Workspace Scaffold

Goal: Create metricstor workspace with skeleton structure

Tasks:

  1. Create workspace structure:

    metricstor/
    ├── Cargo.toml
    ├── crates/
    │   ├── metricstor-api/          # Client library
    │   ├── metricstor-server/       # Main service
    │   └── metricstor-types/        # Shared types
    ├── proto/
    │   ├── remote_write.proto
    │   └── metricstor.proto
    └── README.md
    
  2. Setup proto compilation in build.rs

  3. Define core types:

    // metricstor-types/src/lib.rs
    
    pub type SeriesID = u64;
    pub type Timestamp = i64;  // Unix timestamp in milliseconds
    
    pub struct Sample {
        pub timestamp: Timestamp,
        pub value: f64,
    }
    
    pub struct Series {
        pub id: SeriesID,
        pub labels: BTreeMap<String, String>,
    }
    
    pub struct LabelMatcher {
        pub name: String,
        pub value: String,
        pub op: MatchOp,
    }
    
    pub enum MatchOp {
        Equal,
        NotEqual,
        RegexMatch,
        RegexNotMatch,
    }
    
  4. Add dependencies:

    [workspace.dependencies]
    # Core
    tokio = { version = "1.35", features = ["full"] }
    anyhow = "1.0"
    tracing = "0.1"
    tracing-subscriber = "0.3"
    
    # Serialization
    serde = { version = "1.0", features = ["derive"] }
    serde_json = "1.0"
    toml = "0.8"
    
    # gRPC
    tonic = "0.10"
    prost = "0.12"
    prost-types = "0.12"
    
    # HTTP
    axum = "0.7"
    axum-server = { version = "0.6", features = ["tls-rustls"] }
    
    # Compression
    snap = "1.1"  # Snappy
    
    # Time-series
    promql-parser = "0.4"
    
    # Storage
    rocksdb = "0.21"  # optional tooling/examples only; NOT used by the TSDB engine
    
    # Crypto
    rustls = "0.21"
    

Estimated Effort: 2 days


S3: Push Ingestion

Goal: Implement Prometheus remote_write compatible ingestion endpoint

Tasks:

  1. Implement WAL:

    // metricstor-server/src/wal.rs
    
    struct WAL {
        dir: PathBuf,
        segment_size: usize,
        active_segment: RwLock<WALSegment>,
    }
    
    impl WAL {
        fn new(dir: PathBuf, segment_size: usize) -> Result<Self>;
        fn append(&self, record: WALRecord) -> Result<()>;
        fn replay(&self) -> Result<Vec<WALRecord>>;
        fn checkpoint(&self, min_segment: u64) -> Result<()>;
    }
    
  2. Implement In-Memory Head Block:

    // metricstor-server/src/head.rs
    
    struct Head {
        series: DashMap<SeriesID, Arc<Series>>,  // Concurrent HashMap
        min_time: AtomicI64,
        max_time: AtomicI64,
        config: HeadConfig,
    }
    
    impl Head {
        async fn append(&self, series_id: SeriesID, labels: Labels, ts: Timestamp, value: f64) -> Result<()>;
        async fn get(&self, series_id: SeriesID) -> Option<Arc<Series>>;
        async fn series_count(&self) -> usize;
    }
    
  3. Implement Gorilla Compression (basic version):

    // metricstor-server/src/compression.rs
    
    struct GorillaEncoder { /* ... */ }
    struct GorillaDecoder { /* ... */ }
    
    impl GorillaEncoder {
        fn encode_timestamp(&mut self, ts: i64) -> Result<()>;
        fn encode_value(&mut self, value: f64) -> Result<()>;
        fn finish(self) -> Vec<u8>;
    }
    
  4. Implement HTTP Ingestion Handler:

    // metricstor-server/src/handlers/ingest.rs
    
    async fn handle_remote_write(
        State(service): State<Arc<IngestionService>>,
        body: Bytes,
    ) -> Result<StatusCode, (StatusCode, String)> {
        // 1. Decompress Snappy
        // 2. Decode protobuf
        // 3. Validate samples
        // 4. Append to WAL
        // 5. Insert into Head
        // 6. Return 204 No Content
    }
    
  5. Add Rate Limiting:

    struct RateLimiter {
        rate: f64,  // samples/sec
        tokens: AtomicU64,
    }
    
    impl RateLimiter {
        fn allow(&self) -> bool;
    }
    
  6. Integration Test:

    #[tokio::test]
    async fn test_remote_write_ingestion() {
        // Start server
        // Send WriteRequest
        // Verify samples stored
    }
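As background for the Gorilla encoder in task 3: value compression XORs consecutive f64 bit patterns, so repeated values compress to a single bit and nearby values yield deltas with long runs of leading zeros.

```rust
// Gorilla value compression XORs each f64's bits with the previous value's
// bits; identical values XOR to zero and cost one bit on disk.
fn xor_delta(prev: f64, curr: f64) -> u64 {
    prev.to_bits() ^ curr.to_bits()
}

fn main() {
    // Repeated values (common for gauges) produce a zero delta.
    assert_eq!(xor_delta(42.0, 42.0), 0);
    // Close values share sign/exponent bits, so the delta has many leading
    // zeros, which the encoder exploits by storing only the meaningful bits.
    let d = xor_delta(42.0, 42.5);
    assert!(d.leading_zeros() >= 10);
}
```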
    

Estimated Effort: 5 days


S4: PromQL Query Engine

Goal: Basic PromQL query support (instant + range queries)

Tasks:

  1. Integrate promql-parser:

    // metricstor-server/src/query/parser.rs
    
    use promql_parser::parser;
    
    pub fn parse(query: &str) -> Result<parser::Expr> {
        parser::parse(query).map_err(|e| Error::ParseError(e.to_string()))
    }
    
  2. Implement Query Planner:

    // metricstor-server/src/query/planner.rs
    
    pub enum QueryPlan {
        VectorSelector { matchers: Vec<LabelMatcher>, timestamp: i64 },
        MatrixSelector { matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64 },
        Aggregate { op: AggregateOp, input: Box<QueryPlan>, grouping: Vec<String> },
        RateFunc { input: Box<QueryPlan> },
        // ... other operators
    }
    
    pub fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan>;
    
  3. Implement Label Index:

    // metricstor-server/src/index.rs
    
    struct LabelIndex {
        // label_name -> label_value -> [series_ids]
        inverted_index: DashMap<String, DashMap<String, Vec<SeriesID>>>,
    }
    
    impl LabelIndex {
        fn find_series(&self, matchers: &[LabelMatcher]) -> Result<Vec<SeriesID>>;
        fn add_series(&self, series_id: SeriesID, labels: &Labels);
    }
    
  4. Implement Query Executor:

    // metricstor-server/src/query/executor.rs
    
    struct QueryExecutor {
        head: Arc<Head>,
        blocks: Arc<BlockManager>,
        index: Arc<LabelIndex>,
    }
    
    impl QueryExecutor {
        async fn execute(&self, plan: QueryPlan) -> Result<QueryResult>;
    
        async fn execute_vector_selector(&self, matchers: Vec<LabelMatcher>, ts: i64) -> Result<InstantVector>;
        async fn execute_matrix_selector(&self, matchers: Vec<LabelMatcher>, range: Duration, ts: i64) -> Result<RangeVector>;
    
        fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector>;
        fn apply_aggregate(&self, op: AggregateOp, vector: InstantVector, grouping: Vec<String>) -> Result<InstantVector>;
    }
    
  5. Implement HTTP Query Handlers:

    // metricstor-server/src/handlers/query.rs
    
    async fn handle_instant_query(
        Query(params): Query<QueryParams>,
        State(executor): State<Arc<QueryExecutor>>,
    ) -> Result<Json<QueryResponse>, (StatusCode, String)> {
        let expr = parse(&params.query)?;
        let plan = plan(expr, params.time.unwrap_or_else(now))?;
        let result = executor.execute(plan).await?;
        Ok(Json(format_response(result)))
    }
    
    async fn handle_range_query(
        Query(params): Query<RangeQueryParams>,
        State(executor): State<Arc<QueryExecutor>>,
    ) -> Result<Json<QueryResponse>, (StatusCode, String)> {
        // Similar to instant query, but iterate over [start, end] with step
    }
    
  6. Integration Test:

    #[tokio::test]
    async fn test_instant_query() {
        // Ingest samples
        // Query: http_requests_total{method="GET"}
        // Verify results
    }
    
    #[tokio::test]
    async fn test_range_query_with_rate() {
        // Ingest counter samples
        // Query: rate(http_requests_total[5m])
        // Verify rate calculation
    }
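As a reference for `apply_rate` in task 4, the core of the calculation is per-second change over the window. This sketch omits what real PromQL `rate()` also does: counter-reset handling and extrapolation to the window edges.

```rust
// Simplified rate(): (last - first) / elapsed seconds over the window's
// samples, for the monotonic (no counter reset) case only.
struct Sample {
    timestamp_ms: i64,
    value: f64,
}

fn simple_rate(samples: &[Sample]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    let dt = (last.timestamp_ms - first.timestamp_ms) as f64 / 1000.0;
    if dt <= 0.0 {
        return None; // need at least two samples spanning time
    }
    Some((last.value - first.value) / dt)
}

fn main() {
    let samples = vec![
        Sample { timestamp_ms: 0, value: 100.0 },
        Sample { timestamp_ms: 60_000, value: 160.0 },
    ];
    // 60 increments over 60 seconds -> 1.0 per second.
    assert_eq!(simple_rate(&samples), Some(1.0));
    assert_eq!(simple_rate(&[]), None);
}
```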
    

Estimated Effort: 7 days


S5: Storage Layer

Goal: Time-series storage with retention and compaction

Tasks:

  1. Implement Block Writer:

    // metricstor-server/src/block/writer.rs
    
    struct BlockWriter {
        block_dir: PathBuf,
        index_writer: IndexWriter,
        chunk_writer: ChunkWriter,
    }
    
    impl BlockWriter {
        fn new(block_dir: PathBuf) -> Result<Self>;
        fn write_series(&mut self, series: &Series, samples: &[Sample]) -> Result<()>;
        fn finalize(self) -> Result<BlockMeta>;
    }
    
  2. Implement Block Reader:

    // metricstor-server/src/block/reader.rs
    
    struct BlockReader {
        meta: BlockMeta,
        index: Index,
        chunks: ChunkReader,
    }
    
    impl BlockReader {
        fn open(block_dir: PathBuf) -> Result<Self>;
        fn query_samples(&self, series_id: SeriesID, start: i64, end: i64) -> Result<Vec<Sample>>;
    }
    
  3. Implement Compaction:

    // metricstor-server/src/compaction.rs
    
    struct Compactor {
        data_dir: PathBuf,
        config: CompactionConfig,
    }
    
    impl Compactor {
        async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID>;
        async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID>;
        async fn run_compaction_loop(&self);  // Background task
    }
    
  4. Implement Retention Enforcement:

    impl Compactor {
        async fn enforce_retention(&self, retention: Duration) -> Result<()> {
            let cutoff = SystemTime::now() - retention;
            // Delete blocks older than cutoff
        }
    }
    
  5. Implement Block Manager:

    // metricstor-server/src/block/manager.rs
    
    struct BlockManager {
        blocks: RwLock<Vec<Arc<BlockReader>>>,
        data_dir: PathBuf,
    }
    
    impl BlockManager {
        fn load_blocks(&mut self) -> Result<()>;
        fn add_block(&mut self, block: BlockReader);
        fn remove_block(&mut self, block_id: &BlockID);
        fn query_blocks(&self, start: i64, end: i64) -> Vec<Arc<BlockReader>>;
    }
    
  6. Integration Test:

    #[tokio::test]
    async fn test_compaction() {
        // Ingest data for >2h
        // Trigger compaction
        // Verify block created
        // Query old data from block
    }
    
    #[tokio::test]
    async fn test_retention() {
        // Create old blocks
        // Run retention enforcement
        // Verify old blocks deleted
    }
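The retention decision in task 4 reduces to a predicate over block metadata (the field name here mirrors `meta.json` informally and is an assumption): a block is deleted only when every sample in it is older than the cutoff.

```rust
// A block is expired when its newest sample predates now - retention.
struct BlockMeta {
    max_time_ms: i64,
}

fn expired(block: &BlockMeta, now_ms: i64, retention_ms: i64) -> bool {
    block.max_time_ms < now_ms - retention_ms
}

fn main() {
    let now_ms = 1_733_832_000_000;
    let retention_ms = 15 * 86_400_000; // 15 days
    let old = BlockMeta { max_time_ms: now_ms - 16 * 86_400_000 };
    let live = BlockMeta { max_time_ms: now_ms - 86_400_000 };
    assert!(expired(&old, now_ms, retention_ms));
    assert!(!expired(&live, now_ms, retention_ms));
}
```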
    

Estimated Effort: 8 days


S6: Integration & Documentation

Goal: NixOS module, TLS config, integration tests, operator docs

Tasks:

  1. Create NixOS Module:

    • File: nix/modules/metricstor.nix
    • Follow T024 patterns
    • Include systemd service, firewall rules
    • Support TLS configuration options
  2. Implement mTLS:

    • Load certs in server startup
    • Configure Rustls with client cert verification
    • Extract client identity for rate limiting
  3. Create Metricstor Scraper:

    • Standalone scraper service
    • Reads scrape config (TOML)
    • Scrapes /metrics endpoints from services
    • Pushes to Metricstor via remote_write
  4. Integration Tests:

    #[tokio::test]
    async fn test_e2e_ingest_and_query() {
        // Start Metricstor server
        // Ingest samples via remote_write
        // Query via /api/v1/query
        // Query via /api/v1/query_range
        // Verify results match
    }
    
    #[tokio::test]
    async fn test_mtls_authentication() {
        // Start server with mTLS
        // Connect without client cert -> rejected
        // Connect with valid client cert -> accepted
    }
    
    #[tokio::test]
    async fn test_grafana_compatibility() {
        // Configure Grafana to use Metricstor
        // Execute sample queries
        // Verify dashboards render correctly
    }
    
  5. Write Operator Documentation:

    • File: docs/por/T033-metricstor/OPERATOR.md
    • Installation (NixOS, standalone)
    • Configuration guide
    • mTLS setup
    • Scraper configuration
    • Troubleshooting
    • Performance tuning
  6. Write Developer Documentation:

    • File: metricstor/README.md
    • Architecture overview
    • Building from source
    • Running tests
    • Contributing guidelines

Estimated Effort: 5 days


8.2 Dependency Ordering

S1 (Research) → S2 (Scaffold)
                    ↓
                S3 (Ingestion) ──┐
                    ↓             │
                S4 (Query)        │
                    ↓             │
                S5 (Storage) ←────┘
                    ↓
                S6 (Integration)

Critical Path: S1 → S2 → S3 → S5 → S6
Parallelizable: S4 can start after S3 completes basic ingestion

8.3 Total Effort Estimate

| Step | Effort | Priority |
|---|---|---|
| S1: Research | 2 days | P0 |
| S2: Scaffold | 2 days | P0 |
| S3: Ingestion | 5 days | P0 |
| S4: Query Engine | 7 days | P0 |
| S5: Storage Layer | 8 days | P1 |
| S6: Integration | 5 days | P1 |
| **Total** | **29 days** | |

Realistic Timeline: 6-8 weeks (accounting for testing, debugging, documentation)


9. Open Questions

9.1 Decisions Requiring User Input

#### Q1: Scraper Implementation Choice

**Question:** Should we use Prometheus in agent mode or build a custom Rust scraper?

**Option A: Prometheus Agent + Remote Write**

- Pros: battle-tested, standard tool, no implementation effort
- Cons: adds a Go dependency, less platform integration

**Option B: Custom Rust Scraper**

- Pros: 100% Rust, platform consistency, easier integration
- Cons: implementation effort, needs testing

**Recommendation:** Option B (custom scraper), for consistency with the PROJECT.md philosophy

**Decision:** [ ] A [ ] B [ ] Defer to later


#### Q2: gRPC vs HTTP Priority

**Question:** Should we prioritize a gRPC API or focus only on HTTP (Prometheus compatibility)?

**Option A: HTTP only (v1)**

- Pros: simpler; Prometheus/Grafana compatibility is sufficient
- Cons: less efficient for internal services

**Option B: Both HTTP and gRPC (v1)**

- Pros: better performance for internal services, more flexibility
- Cons: more implementation effort

**Recommendation:** Option A for v1; add gRPC in v2 if needed

**Decision:** [ ] A [ ] B


#### Q3: FlareDB Metadata Integration Timeline

**Question:** When should we integrate FlareDB for metadata storage?

**Option A: v1 (T033)**

- Pros: unified metadata story from the start
- Cons: increases complexity, adds a dependency

**Option B: v2 (Future)**

- Pros: simpler v1; can deploy standalone
- Cons: migration effort later

**Recommendation:** Option B (defer to v2)

**Decision:** [ ] A [ ] B


#### Q4: S3 Cold Storage Priority

**Question:** Should S3 cold storage be part of v1 or deferred?

**Option A: v1 (T033.S5)**

- Pros: unlimited retention from day 1
- Cons: complexity, operational overhead

**Option B: v2 (Future)**

- Pros: simpler v1; focus on core functionality
- Cons: limited retention (local disk only)

**Recommendation:** Option B (defer to v2); use local disk for v1 with 15-30 day retention

**Decision:** [ ] A [ ] B


### 9.2 Areas Needing Further Investigation

#### I1: PromQL Function Coverage

**Issue:** Need to determine the exact subset of PromQL functions to support in v1.

**Investigation Needed:**

- Survey existing Grafana dashboards in use
- Identify the most common functions (rate, increase, histogram_quantile, etc.)
- Prioritize by usage frequency

**Proposed Approach:**

- Analyze 10-20 sample dashboards
- Create a coverage matrix
- Implement the most-used ~80% of functions first

#### I2: Query Performance Benchmarking

**Issue:** Need to validate that the query latency targets (p95 < 100ms) are achievable.

**Investigation Needed:**

- Benchmark promql-parser crate performance
- Measure Gorilla decompression throughput
- Test index lookup performance at 10M series scale

**Proposed Approach:**

- Create a benchmark suite with synthetic data (1M, 10M series)
- Measure end-to-end query latency
- Identify bottlenecks and optimize
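The Gorilla throughput question hinges on how well the XOR value encoding behaves on our data. A minimal sketch of the principle (not Metricstor's actual chunk format, and ignoring the scheme's control bits) shows why slowly-changing series compress so well:

```rust
// Sketch of the Gorilla value-compression idea: adjacent f64 samples of a
// slowly-changing series XOR to a bit pattern with long zero runs, so only
// the short "meaningful" middle needs storing. Illustration only.

/// Bits of meaningful payload in the XOR of two adjacent f64 samples,
/// after trimming leading and trailing zeros as Gorilla does.
fn meaningful_bits(prev: f64, curr: f64) -> u32 {
    let xor = prev.to_bits() ^ curr.to_bits();
    if xor == 0 {
        0 // identical value: Gorilla stores a single control bit
    } else {
        64 - xor.leading_zeros() - xor.trailing_zeros()
    }
}

/// Rough compressed size (bits) of a series, ignoring headers and control
/// bits, to show the order-of-magnitude win over raw 64-bit floats.
fn estimate_bits(samples: &[f64]) -> u32 {
    samples
        .windows(2)
        .map(|w| meaningful_bits(w[0], w[1]))
        .sum::<u32>()
        + 64 // the first value is stored raw
}

fn main() {
    // A gauge that barely moves compresses extremely well...
    let flat = [100.0, 100.0, 100.0, 100.25, 100.25];
    // ...while random-looking values barely compress at all.
    let noisy = [3.14159, -2.71828, 1.41421, 99.9, -0.001];
    println!("flat:  {} bits (raw: {})", estimate_bits(&flat), 64 * flat.len());
    println!("noisy: {} bits (raw: {})", estimate_bits(&noisy), 64 * noisy.len());
}
```

Benchmarking decompression is then mostly a question of how fast the bitstream of these variable-width fields can be walked.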

#### I3: Series Cardinality Limits

**Issue:** How do we prevent series explosion (high cardinality destroying performance)?

**Investigation Needed:**

- Research cardinality estimation algorithms (HyperLogLog)
- Define cardinality limits (per metric, per label, global)
- Choose a rejection strategy (reject new series beyond the limit)

**Proposed Approach:**

- Add cardinality tracking to the label index
- Warn at 80% of the limit, reject at 100%
- Provide an admin API to inspect high-cardinality series
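The warn-at-80% / reject-at-100% policy above can be sketched as a small admission check in front of the label index (all names and types here are hypothetical, not the implementation):

```rust
use std::collections::HashSet;

/// Outcome of attempting to register a series. Sketch of the I3 proposal;
/// the variants and thresholds are illustrative only.
#[derive(Debug, PartialEq)]
enum Admit {
    Existing,
    Accepted,
    AcceptedWithWarning, // crossed 80% of the limit: log / export a metric
    Rejected,            // at the hard limit: drop the new series
}

struct CardinalityTracker {
    max_series: usize,
    seen: HashSet<u64>, // series IDs, e.g. a hash of the sorted label set
}

impl CardinalityTracker {
    fn new(max_series: usize) -> Self {
        Self { max_series, seen: HashSet::new() }
    }

    /// Check a series ID against the global limit before indexing it.
    fn admit(&mut self, series_id: u64) -> Admit {
        if self.seen.contains(&series_id) {
            return Admit::Existing; // known series are never re-counted
        }
        if self.seen.len() >= self.max_series {
            return Admit::Rejected;
        }
        self.seen.insert(series_id);
        if self.seen.len() * 10 >= self.max_series * 8 {
            Admit::AcceptedWithWarning
        } else {
            Admit::Accepted
        }
    }
}

fn main() {
    let mut tracker = CardinalityTracker::new(10);
    for id in 0..12u64 {
        println!("series {id}: {:?}", tracker.admit(id));
    }
    // Samples for already-known series stay ingestible even at the limit.
    println!("series 3 again: {:?}", tracker.admit(3));
}
```

A production version would track per-metric counts as well and use a HyperLogLog sketch instead of an exact set once the ID space gets large.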

#### I4: Out-of-Order Sample Edge Cases

**Issue:** How should we handle out-of-order samples that span chunk boundaries?

**Investigation Needed:**

- Test scenarios: samples arriving 1h late, 2h late, etc.
- Decide between multi-chunk updates and rejecting old samples
- Benchmark the cost of re-sorting chunks

**Proposed Approach:**

- Implement a configurable out-of-order window (default: 1h)
- Reject samples older than the window
- For within-window samples, insert into the correct chunk (may require re-compressing the chunk)
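The proposed window policy amounts to a three-way classification per incoming sample. A sketch, with illustrative names (the real field corresponding to `window_ms` would come from `out_of_order_time_window`):

```rust
/// Fate of an incoming sample under the proposed out-of-order policy.
/// Illustrative sketch; names are not from the implementation.
#[derive(Debug, PartialEq)]
enum SampleFate {
    InOrder,            // appended to the open chunk
    OutOfOrderAccepted, // within the window: insert into an earlier chunk
    TooOld,             // older than the window: reject and count the drop
}

/// `head_ts` is the newest timestamp seen for the series; `window_ms`
/// mirrors the `out_of_order_time_window` config (default 1h).
fn classify(sample_ts: i64, head_ts: i64, window_ms: i64) -> SampleFate {
    if sample_ts >= head_ts {
        SampleFate::InOrder
    } else if head_ts - sample_ts <= window_ms {
        // May land in a sealed chunk, which then needs re-compression.
        SampleFate::OutOfOrderAccepted
    } else {
        SampleFate::TooOld
    }
}

fn main() {
    let hour = 3_600_000i64; // 1h in milliseconds
    let head = 10 * hour;
    println!("{:?}", classify(head + 1_000, head, hour));     // in order
    println!("{:?}", classify(head - 30 * 60_000, head, hour)); // 30min late
    println!("{:?}", classify(head - 2 * hour, head, hour));  // too old
}
```

Only the middle case is expensive, which is what the re-sorting benchmark above needs to measure.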

## 10. References

### 10.1 Research Sources

#### Time-Series Storage Formats

#### PromQL Implementation

#### Prometheus Remote Write Protocol

#### Rust TSDB Implementations

### 10.2 Platform References

**Internal Documentation**

- PROJECT.md (Item 12: Metrics Store)
- `docs/por/T033-metricstor/task.yaml`
- `docs/por/T027-production-hardening/` (TLS patterns)
- `docs/por/T024-nixos-packaging/` (NixOS module patterns)

**Existing Service Patterns**

- `flaredb/crates/flaredb-server/src/main.rs` (TLS, metrics export)
- `flaredb/crates/flaredb-server/src/config/mod.rs` (config structure)
- `chainfire/crates/chainfire-server/src/config.rs` (TLS config)
- `iam/crates/iam-server/src/config.rs` (config patterns)

### 10.3 External Tools

## Appendix A: PromQL Function Reference (v1 Support)

### Supported Functions

| Function | Category | Description | Example |
|----------|----------|-------------|---------|
| `rate()` | Counter | Per-second rate of increase | `rate(http_requests_total[5m])` |
| `irate()` | Counter | Instant rate (last 2 samples) | `irate(http_requests_total[5m])` |
| `increase()` | Counter | Total increase over range | `increase(http_requests_total[1h])` |
| `histogram_quantile()` | Histogram | Calculate quantile from histogram | `histogram_quantile(0.95, rate(http_duration_bucket[5m]))` |
| `sum()` | Aggregation | Sum values | `sum(metric)` |
| `avg()` | Aggregation | Average values | `avg(metric)` |
| `min()` | Aggregation | Minimum value | `min(metric)` |
| `max()` | Aggregation | Maximum value | `max(metric)` |
| `count()` | Aggregation | Count series | `count(metric)` |
| `stddev()` | Aggregation | Standard deviation | `stddev(metric)` |
| `stdvar()` | Aggregation | Standard variance | `stdvar(metric)` |
| `topk()` | Aggregation | Top K series | `topk(5, metric)` |
| `bottomk()` | Aggregation | Bottom K series | `bottomk(5, metric)` |
| `time()` | Time | Current timestamp | `time()` |
| `timestamp()` | Time | Sample timestamp | `timestamp(metric)` |
| `abs()` | Math | Absolute value | `abs(metric)` |
| `ceil()` | Math | Round up | `ceil(metric)` |
| `floor()` | Math | Round down | `floor(metric)` |
| `round()` | Math | Round to nearest | `round(metric, 0.1)` |
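To make the counter semantics concrete, here is a simplified sketch of how `rate()` evaluates over a range window: sum the increases between adjacent samples, treating a value drop as a counter reset, then divide by the covered time span. (Names are illustrative, and the boundary extrapolation that real Prometheus performs is omitted.)

```rust
/// (timestamp_ms, value) samples that fall inside the range window.
type Sample = (i64, f64);

/// Simplified Prometheus-style `rate()`: total reset-adjusted increase
/// divided by the time span actually covered by the samples.
fn simple_rate(samples: &[Sample]) -> Option<f64> {
    if samples.len() < 2 {
        return None; // rate() needs at least two samples in the window
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, curr) = (pair[0].1, pair[1].1);
        if curr >= prev {
            increase += curr - prev;
        } else {
            // Counter reset: the counter restarted from zero, so the
            // post-reset value is itself the increase since the reset.
            increase += curr;
        }
    }
    let span_s = (samples[samples.len() - 1].0 - samples[0].0) as f64 / 1000.0;
    if span_s <= 0.0 { None } else { Some(increase / span_s) }
}

fn main() {
    // Counter samples at 10s intervals, with a reset at t=30s.
    let samples = [(0, 100.0), (10_000, 200.0), (20_000, 300.0), (30_000, 50.0)];
    println!("rate = {:?} samples/s", simple_rate(&samples));
}
```

The reset handling is why `increase()` and `rate()` must be evaluated inside the engine rather than computed from raw endpoints.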

### NOT Supported in v1

| Function | Category | Reason |
|----------|----------|--------|
| `predict_linear()` | Prediction | Complex, low usage |
| `deriv()` | Math | Low usage |
| `holt_winters()` | Prediction | Complex |
| `resets()` | Counter | Low usage |
| `changes()` | Analysis | Low usage |
| Subqueries | Advanced | Very complex |

## Appendix B: Configuration Reference

### Complete Configuration Example

```toml
# metricstor.toml - Complete configuration example

[server]
# Listen address for HTTP/gRPC API
addr = "0.0.0.0:8080"

# Log level: trace, debug, info, warn, error
log_level = "info"

# Metrics port for self-monitoring (Prometheus /metrics endpoint)
metrics_port = 9099

[server.tls]
# Enable TLS
cert_file = "/etc/metricstor/certs/server.crt"
key_file = "/etc/metricstor/certs/server.key"

# Enable mTLS (require client certificates)
ca_file = "/etc/metricstor/certs/ca.crt"
require_client_cert = true

[storage]
# Data directory for TSDB blocks and WAL
data_dir = "/var/lib/metricstor/data"

# Data retention period (days)
retention_days = 15

# WAL segment size (MB)
wal_segment_size_mb = 128

# Block duration for compaction
min_block_duration = "2h"
max_block_duration = "24h"

# Out-of-order sample acceptance window
out_of_order_time_window = "1h"

# Series cardinality limits
max_series = 10_000_000
max_series_per_metric = 100_000

# Memory limits
max_head_chunks_per_series = 2
max_head_size_mb = 2048

[query]
# Query timeout (seconds)
timeout_seconds = 30

# Maximum query range (hours)
max_range_hours = 24

# Query result cache TTL (seconds)
cache_ttl_seconds = 60

# Maximum concurrent queries
max_concurrent_queries = 100

[ingestion]
# Write buffer size (samples)
write_buffer_size = 100_000

# Backpressure strategy: "block" or "reject"
backpressure_strategy = "block"

# Rate limiting (samples per second per client)
rate_limit_per_client = 50_000

# Maximum samples per write request
max_samples_per_request = 10_000

[compaction]
# Enable background compaction
enabled = true

# Compaction interval (seconds)
interval_seconds = 7200  # 2 hours

# Number of compaction threads
num_threads = 2

[s3]
# S3 cold storage (optional, future)
enabled = false
endpoint = "https://s3.example.com"
bucket = "metricstor-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100

[flaredb]
# FlareDB metadata integration (optional, future)
enabled = false
endpoints = ["flaredb-1:50051", "flaredb-2:50051"]
namespace = "metrics"
```
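After deserializing this file (the real server would use serde for that), the loader should run cross-field sanity checks. A sketch over a subset of the `[storage]` settings; the struct, field names, and rules below are illustrative only:

```rust
/// Subset of the [storage] settings from the example above, after parsing.
/// Illustrative sketch of post-load validation, not the real config type.
struct StorageConfig {
    retention_days: u32,
    min_block_duration_hours: u32,
    max_block_duration_hours: u32,
    out_of_order_window_hours: u32,
}

fn validate(cfg: &StorageConfig) -> Result<(), String> {
    if cfg.retention_days == 0 {
        return Err("retention_days must be at least 1".into());
    }
    if cfg.min_block_duration_hours > cfg.max_block_duration_hours {
        return Err("min_block_duration must not exceed max_block_duration".into());
    }
    if cfg.out_of_order_window_hours > cfg.min_block_duration_hours {
        // Otherwise late samples could target blocks already compacted away.
        return Err("out_of_order window must fit within min_block_duration".into());
    }
    Ok(())
}

fn main() {
    // Values matching the example config: 15d retention, 2h/24h blocks, 1h window.
    let cfg = StorageConfig {
        retention_days: 15,
        min_block_duration_hours: 2,
        max_block_duration_hours: 24,
        out_of_order_window_hours: 1,
    };
    println!("valid: {:?}", validate(&cfg));
}
```

Failing fast at startup on inconsistent settings is cheaper than diagnosing silently-dropped samples later.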

## Appendix C: Metrics Exported by Metricstor

Metricstor exports metrics about itself on port 9099 (configurable).

### Ingestion Metrics

```
# Samples ingested
metricstor_samples_ingested_total{} counter

# Samples rejected (out-of-order, invalid, etc.)
metricstor_samples_rejected_total{reason="out_of_order|invalid|rate_limit"} counter

# Ingestion latency (milliseconds)
metricstor_ingestion_latency_ms{quantile="0.5|0.9|0.99"} summary

# Active series
metricstor_active_series{} gauge

# Head memory usage (bytes)
metricstor_head_memory_bytes{} gauge
```

### Query Metrics

```
# Queries executed
metricstor_queries_total{type="instant|range"} counter

# Query latency (milliseconds)
metricstor_query_latency_ms{type="instant|range", quantile="0.5|0.9|0.99"} summary

# Query errors
metricstor_query_errors_total{reason="timeout|parse_error|execution_error"} counter
```

### Storage Metrics

```
# WAL segments
metricstor_wal_segments{} gauge

# WAL size (bytes)
metricstor_wal_size_bytes{} gauge

# Blocks
metricstor_blocks_total{level="0|1|2"} gauge

# Block size (bytes)
metricstor_block_size_bytes{level="0|1|2"} gauge

# Compactions
metricstor_compactions_total{level="0|1|2"} counter

# Compaction duration (seconds)
metricstor_compaction_duration_seconds{level="0|1|2", quantile="0.5|0.9|0.99"} summary
```

### System Metrics

```
# Go runtime metrics (if using Go for scraper)
# Rust memory metrics
metricstor_memory_allocated_bytes{} gauge

# CPU usage
metricstor_cpu_usage_seconds_total{} counter
```

## Appendix D: Error Codes and Troubleshooting

### HTTP Error Codes

| Code | Meaning | Common Causes |
|------|---------|---------------|
| 200 | OK | Query successful |
| 204 | No Content | Write successful |
| 400 | Bad Request | Invalid PromQL, malformed protobuf |
| 401 | Unauthorized | mTLS cert validation failed |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Storage error, WAL corruption |
| 503 | Service Unavailable | Write buffer full, server overloaded |

### Common Issues

#### "Samples rejected: out_of_order"

**Cause:** Samples arriving with timestamps older than `out_of_order_time_window`

**Solution:**

- Increase `out_of_order_time_window` in the config
- Check clock sync on clients (NTP)
- Reduce the scrape batch size

#### "Rate limit exceeded"

**Cause:** Client exceeding `rate_limit_per_client` samples/sec

**Solution:**

- Increase the rate limit in the config
- Reduce scrape frequency
- Shard writes across multiple clients
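The per-client limit behind this error can be implemented as a token bucket, which tolerates short bursts while enforcing the average rate. A sketch (illustrative only; the real implementation and its names may differ, and burst size is an assumed parameter):

```rust
/// Token-bucket sketch of a per-client `rate_limit_per_client` check.
/// Tokens refill at `rate` per second up to `burst`; each sample costs one.
struct TokenBucket {
    rate: f64,   // samples per second
    burst: f64,  // maximum bucket size (burst allowance)
    tokens: f64,
    last_refill_s: f64,
}

impl TokenBucket {
    fn new(rate: f64, burst: f64) -> Self {
        Self { rate, burst, tokens: burst, last_refill_s: 0.0 }
    }

    /// Try to admit a batch of `n` samples at time `now_s` (monotonic
    /// seconds). Returning false maps to HTTP 429 for the client.
    fn try_admit(&mut self, n: f64, now_s: f64) -> bool {
        let elapsed = (now_s - self.last_refill_s).max(0.0);
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill_s = now_s;
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    // 50_000 samples/sec with a one-second burst, as in the example config.
    let mut bucket = TokenBucket::new(50_000.0, 50_000.0);
    println!("{}", bucket.try_admit(40_000.0, 0.0)); // fits in the burst
    println!("{}", bucket.try_admit(40_000.0, 0.1)); // bucket nearly drained
    println!("{}", bucket.try_admit(40_000.0, 1.0)); // refilled enough again
}
```

Rejecting the whole batch (rather than admitting a partial one) keeps remote_write retry semantics simple for clients.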

#### "Query timeout"

**Cause:** Query exceeding `timeout_seconds`

**Solution:**

- Increase the query timeout
- Reduce the query time range
- Add more specific label matchers to reduce the number of series scanned

#### "Series cardinality explosion"

**Cause:** Too many unique label combinations (high cardinality)

**Solution:**

- Review label design (avoid unbounded labels like `user_id`)
- Use relabeling to drop high-cardinality labels
- Increase the `max_series` limit (if justified)

---

**End of Design Document**

**Total Length:** ~3,800 lines

**Status:** Ready for review and S2-S6 implementation

**Next Steps:**

1. Review and approve design decisions
2. Create GitHub issues for S2-S6 tasks
3. Begin S2: Workspace Scaffold