# Nightlight Design Document **Project:** Nightlight - VictoriaMetrics OSS Replacement **Task:** T033.S1 Research & Architecture **Version:** 1.0 **Date:** 2025-12-10 **Author:** PeerB --- ## Table of Contents 1. [Executive Summary](#1-executive-summary) 2. [Requirements](#2-requirements) 3. [Time-Series Storage Model](#3-time-series-storage-model) 4. [Push Ingestion API](#4-push-ingestion-api) 5. [PromQL Query Engine](#5-promql-query-engine) 6. [Storage Backend Architecture](#6-storage-backend-architecture) 7. [Integration Points](#7-integration-points) 8. [Implementation Plan](#8-implementation-plan) 9. [Open Questions](#9-open-questions) 10. [References](#10-references) --- ## 1. Executive Summary ### 1.1 Overview Nightlight is a fully open-source, distributed time-series database designed as a replacement for VictoriaMetrics, addressing the critical requirement that VictoriaMetrics' mTLS support is a paid feature. As the final component (Item 12/12) of PROJECT.md, Nightlight completes the observability stack for the Japanese cloud platform. ### 1.2 High-Level Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ Service Mesh │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ FlareDB │ │ ChainFire│ │ PlasmaVMC│ │ IAM │ ... 
│ │ │ :9092 │ │ :9091 │ │ :9093 │ │ :9094 │ │ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │ │ │ │ │ │ │ └────────────┴────────────┴────────────┘ │ │ │ │ │ │ Push (remote_write) │ │ │ mTLS │ │ ▼ │ │ ┌──────────────────────┐ │ │ │ Nightlight Server │ │ │ │ ┌────────────────┐ │ │ │ │ │ Ingestion API │ │ ← Prometheus remote_write │ │ │ │ (gRPC/HTTP) │ │ │ │ │ └────────┬───────┘ │ │ │ │ │ │ │ │ │ ┌────────▼───────┐ │ │ │ │ │ Write Buffer │ │ │ │ │ │ (In-Memory) │ │ │ │ │ └────────┬───────┘ │ │ │ │ │ │ │ │ │ ┌────────▼───────┐ │ │ │ │ │ Storage Engine│ │ │ │ │ │ ┌──────────┐ │ │ │ │ │ │ │ Head │ │ │ ← WAL + In-Memory Index │ │ │ │ │ (Active) │ │ │ │ │ │ │ └────┬─────┘ │ │ │ │ │ │ │ │ │ │ │ │ │ ┌────▼─────┐ │ │ │ │ │ │ │ Blocks │ │ │ ← Immutable, Compressed │ │ │ │ │ (TSDB) │ │ │ │ │ │ │ └──────────┘ │ │ │ │ │ └────────────────┘ │ │ │ │ │ │ │ │ │ ┌────────▼───────┐ │ │ │ │ │ Query Engine │ │ ← PromQL Execution │ │ │ │ (PromQL AST) │ │ │ │ │ └────────┬───────┘ │ │ │ │ │ │ │ │ └───────────┼──────────┘ │ │ │ │ │ │ Query (HTTP) │ │ │ mTLS │ │ ▼ │ │ ┌──────────────────────┐ │ │ │ Grafana / Clients │ │ │ └──────────────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ┌─────────────────────┐ │ FlareDB Cluster │ ← Metadata (optional) │ (Metadata Store) │ └─────────────────────┘ ┌─────────────────────┐ │ S3-Compatible │ ← Cold Storage (future) │ Object Storage │ └─────────────────────┘ ``` ### 1.3 Key Design Decisions 1. **Storage Format**: Hybrid approach using Prometheus TSDB block design with Gorilla compression - **Rationale**: Battle-tested, excellent compression (1-2 bytes/sample), widely understood 2. **Storage Backend**: Dedicated time-series engine with optional FlareDB metadata integration - **Rationale**: Time-series workloads have unique access patterns; KV stores not optimal for sample storage - FlareDB reserved for metadata (series labels, index) in distributed scenarios 3. 
**PromQL Subset**: Support 80% of common use cases (instant/range queries, basic aggregations, rate/increase) - **Rationale**: Full PromQL compatibility is complex; focus on practical operator needs 4. **Push Model**: Prometheus remote_write v1.0 protocol via HTTP + gRPC APIs - **Rationale**: Standard protocol, Snappy compression built-in, client library availability 5. **mTLS Integration**: Consistent with T027/T031 patterns (cert_file, key_file, ca_file, require_client_cert) - **Rationale**: Unified security model across all platform services ### 1.4 Success Criteria - Accept metrics from 8+ services (ports 9091-9099) via remote_write - Query latency <100ms for instant queries (p95) - Compression ratio ≥10:1 (target: 1.5-2 bytes/sample) - Support 100K samples/sec write throughput per instance - PromQL queries cover 80% of Grafana dashboard use cases - Zero vendor lock-in (100% OSS, no paid features) --- ## 2. Requirements ### 2.1 Functional Requirements #### FR-1: Push-Based Metric Ingestion - **FR-1.1**: Accept Prometheus remote_write v1.0 protocol (HTTP POST) - **FR-1.2**: Support Snappy-compressed protobuf payloads - **FR-1.3**: Validate metric names and labels per Prometheus naming conventions - **FR-1.4**: Handle out-of-order samples within a configurable time window (default: 1h) - **FR-1.5**: Deduplicate samples that share the same timestamp and label set - **FR-1.6**: Return backpressure signals (HTTP 429/503) when buffer is full #### FR-2: PromQL Query Engine - **FR-2.1**: Support instant queries (`/api/v1/query`) - **FR-2.2**: Support range queries (`/api/v1/query_range`) - **FR-2.3**: Support label queries (`/api/v1/label/<name>/values`, `/api/v1/labels`) - **FR-2.4**: Support series metadata queries (`/api/v1/series`) - **FR-2.5**: Implement core PromQL functions (see Section 5.2) - **FR-2.6**: Support Prometheus HTTP API JSON response format #### FR-3: Time-Series Storage - **FR-3.1**: Store samples with millisecond timestamp precision - **FR-3.2**: Support
configurable retention periods (default: 15 days, configurable 1-365 days) - **FR-3.3**: Automatic background compaction of blocks - **FR-3.4**: Crash recovery via Write-Ahead Log (WAL) - **FR-3.5**: Series cardinality limits to prevent explosion (default: 10M series) #### FR-4: Security & Authentication - **FR-4.1**: mTLS support for ingestion and query APIs - **FR-4.2**: Optional basic authentication for HTTP endpoints - **FR-4.3**: Rate limiting per client (based on mTLS certificate CN or IP) #### FR-5: Operational Features - **FR-5.1**: Prometheus-compatible `/metrics` endpoint for self-monitoring - **FR-5.2**: Health check endpoints (`/health`, `/ready`) - **FR-5.3**: Admin API for series deletion, compaction trigger - **FR-5.4**: TOML configuration file support - **FR-5.5**: Environment variable overrides ### 2.2 Non-Functional Requirements #### NFR-1: Performance - **NFR-1.1**: Ingestion throughput: ≥100K samples/sec per instance - **NFR-1.2**: Query latency (p95): <100ms for instant queries, <500ms for range queries (1h window) - **NFR-1.3**: Compression ratio: ≥10:1 (target: 1.5-2 bytes/sample) - **NFR-1.4**: Memory usage: <2GB for 1M active series #### NFR-2: Scalability - **NFR-2.1**: Vertical scaling: Support 10M active series per instance - **NFR-2.2**: Horizontal scaling: Support sharding across multiple instances (future work) - **NFR-2.3**: Storage: Support local disk + optional S3-compatible backend for cold data #### NFR-3: Reliability - **NFR-3.1**: No data loss for committed samples (WAL durability) - **NFR-3.2**: Graceful degradation under load (reject writes with backpressure, not crash) - **NFR-3.3**: Crash recovery time: <30s for 10M series #### NFR-4: Maintainability - **NFR-4.1**: Codebase consistency with other platform services (FlareDB, ChainFire patterns) - **NFR-4.2**: 100% Rust, no C/FFI dependencies - **NFR-4.3**: Comprehensive unit and integration tests - **NFR-4.4**: Operator documentation with runbooks #### NFR-5: Compatibility -
**NFR-5.1**: Prometheus remote_write v1.0 protocol compatibility - **NFR-5.2**: Prometheus HTTP API compatibility (subset: query, query_range, labels, series) - **NFR-5.3**: Grafana data source compatibility ### 2.3 Out of Scope (Explicitly Not Supported in v1) - Prometheus remote_read protocol (pull-based; platform uses push) - Full PromQL compatibility (complex subqueries, advanced functions) - Multi-tenancy (single-tenant per instance; use multiple instances for multi-tenant) - Distributed query federation (single-instance queries only) - Recording rules and alerting (use separate Prometheus/Alertmanager for this) --- ## 3. Time-Series Storage Model ### 3.1 Data Model #### 3.1.1 Metric Structure A time-series metric in Nightlight follows the Prometheus data model: ``` metric_name{label1="value1", label2="value2", ...} value timestamp ``` **Example:** ``` http_requests_total{method="GET", status="200", service="flaredb"} 1543 1733832000000 ``` Components: - **Metric Name**: Identifier for the measurement (e.g., `http_requests_total`) - Must match regex: `[a-zA-Z_:][a-zA-Z0-9_:]*` - **Labels**: Key-value pairs for dimensionality (e.g., `{method="GET", status="200"}`) - Label names: `[a-zA-Z_][a-zA-Z0-9_]*` - Label values: Any UTF-8 string - Reserved labels: `__name__` (stores metric name), labels starting with `__` are internal - **Value**: Float64 sample value - **Timestamp**: Millisecond precision (int64 milliseconds since Unix epoch) #### 3.1.2 Series Identification A **series** is uniquely identified by its metric name + label set: ```rust // Pseudo-code representation struct SeriesID { hash: u64, // FNV-1a hash of sorted labels } struct Series { id: SeriesID, labels: BTreeMap<String, String>, // Sorted for consistent hashing chunks: Vec<Chunk>, } ``` Series ID calculation: 1. Sort labels lexicographically (including `__name__` label) 2. Concatenate as: `label1_name + \0 + label1_value + \0 + label2_name + \0 + ...` 3.
Compute FNV-1a 64-bit hash ### 3.2 Storage Format #### 3.2.1 Architecture Overview Nightlight uses a **hybrid storage architecture** inspired by Prometheus TSDB and Gorilla: ``` ┌─────────────────────────────────────────────────────────────────┐ │ Memory Layer (Head) │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Series Map │ │ WAL Segment │ │ Write Buffer │ │ │ │ (In-Memory │ │ (Disk) │ │ (MPSC Queue) │ │ │ │ Index) │ │ │ │ │ │ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │ │ │ │ │ │ └─────────────────┴─────────────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────────┐ │ │ │ Active Chunks │ │ │ │ (Gorilla-compressed) │ │ │ │ - 2h time windows │ │ │ │ - Delta-of-delta TS │ │ │ │ - XOR float encoding │ │ │ └─────────────────────────┘ │ └─────────────────────────────────────────────────────────────────┘ │ │ Compaction Trigger │ (every 2h or on shutdown) ▼ ┌─────────────────────────────────────────────────────────────────┐ │ Disk Layer (Blocks) │ │ │ │ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ │ │ Block 1 │ │ Block 2 │ │ Block N │ │ │ │ [0h - 2h) │ │ [2h - 4h) │ │ [Nh - (N+2)h) │ │ │ │ │ │ │ │ │ │ │ │ ├─ meta.json │ │ ├─ meta.json │ │ ├─ meta.json │ │ │ │ ├─ index │ │ ├─ index │ │ ├─ index │ │ │ │ ├─ chunks/000 │ │ ├─ chunks/000 │ │ ├─ chunks/000 │ │ │ │ └─ tombstones │ │ └─ tombstones │ │ └─ tombstones │ │ │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` #### 3.2.2 Write-Ahead Log (WAL) **Purpose**: Durability and crash recovery **Format**: Append-only log segments (128MB default size) ``` WAL Structure: data/ wal/ 00000001 ← Segment 1 (128MB) 00000002 ← Segment 2 (active) ``` **WAL Record Format** (inspired by LevelDB): ``` ┌───────────────────────────────────────────────────┐ │ CRC32 (4 bytes) │ ├───────────────────────────────────────────────────┤ │ Length (4 bytes, little-endian) │ ├───────────────────────────────────────────────────┤ │ 
Type (1 byte): FULL | FIRST | MIDDLE | LAST │ ├───────────────────────────────────────────────────┤ │ Payload (variable): │ │ - Record Type (1 byte): Series | Samples │ │ - Series ID (8 bytes) │ │ - Labels (length-prefixed strings) │ │ - Samples (varint timestamp, float64 value) │ └───────────────────────────────────────────────────┘ ``` **WAL Operations**: - **Append**: Every write appends to active segment - **Checkpoint**: Snapshot of in-memory state to disk blocks - **Truncate**: Delete segments older than oldest in-memory data - **Replay**: On startup, replay WAL segments to rebuild in-memory state **Rust Implementation Sketch**: ```rust struct WAL { dir: PathBuf, segment_size: usize, // 128MB default active_segment: File, active_segment_num: u64, } impl WAL { fn append(&mut self, record: &WALRecord) -> Result<()> { let encoded = record.encode(); let crc = crc32(&encoded); // Rotate segment if needed (+8 accounts for the CRC and length header) if self.active_segment.metadata()?.len() as usize + encoded.len() + 8 > self.segment_size { self.rotate_segment()?; } self.active_segment.write_all(&crc.to_le_bytes())?; self.active_segment.write_all(&(encoded.len() as u32).to_le_bytes())?; self.active_segment.write_all(&encoded)?; self.active_segment.sync_all()?; // fsync for durability Ok(()) } fn replay(&self) -> Result<Vec<WALRecord>> { // Read all segments and decode records // Used on startup for crash recovery } } ``` #### 3.2.3 In-Memory Head Block **Purpose**: Accept recent writes, maintain hot data for fast queries **Structure**: ```rust struct Head { series: RwLock<HashMap<SeriesID, Arc<Series>>>, min_time: AtomicI64, max_time: AtomicI64, chunk_size: Duration, // 2h default wal: Arc<WAL>, } struct Series { id: SeriesID, labels: BTreeMap<String, String>, chunks: RwLock<Vec<Chunk>>, } struct Chunk { min_time: i64, max_time: i64, samples: CompressedSamples, // Gorilla encoding } ``` **Chunk Lifecycle**: 1. **Creation**: New chunk created when first sample arrives or previous chunk is full 2. **Active**: Chunk accepts samples in time window [min_time, min_time + 2h) 3.
**Full**: Chunk reaches 2h window, new chunk created for subsequent samples 4. **Compaction**: Full chunks compacted to disk blocks **Memory Limits**: - Max series: 10M (configurable) - Max chunks per series: 2 (active + previous, covering 4h) - Eviction: LRU eviction of inactive series (no samples in 4h) #### 3.2.4 Disk Blocks (Immutable) **Purpose**: Long-term storage of compacted time-series data **Block Structure** (inspired by Prometheus TSDB): ``` data/ 01HQZQZQZQZQZQZQZQZQZQ/ ← Block directory (ULID) meta.json ← Metadata index ← Inverted index chunks/ 000001 ← Chunk file 000002 ... tombstones ← Deleted series/samples ``` **meta.json Format**: ```json { "ulid": "01HQZQZQZQZQZQZQZQZQZQ", "minTime": 1733832000000, "maxTime": 1733839200000, "stats": { "numSamples": 1500000, "numSeries": 5000, "numChunks": 10000 }, "compaction": { "level": 1, "sources": ["01HQZQZ..."] }, "version": 1 } ``` **Index File Format** (simplified): The index file provides fast lookups of series by labels. ``` ┌────────────────────────────────────────────────┐ │ Magic Number (4 bytes): 0xBADA55A0 │ ├────────────────────────────────────────────────┤ │ Version (1 byte): 1 │ ├────────────────────────────────────────────────┤ │ Symbol Table Section │ │ - Sorted strings (label names/values) │ │ - Offset table for binary search │ ├────────────────────────────────────────────────┤ │ Series Section │ │ - SeriesID → Chunk Refs mapping │ │ - (series_id, labels, chunk_offsets) │ ├────────────────────────────────────────────────┤ │ Label Index Section (Inverted Index) │ │ - label_name → [series_ids] │ │ - (label_name, label_value) → [series_ids] │ ├────────────────────────────────────────────────┤ │ Postings Section │ │ - Sorted posting lists for label matchers │ │ - Compressed with varint + bit packing │ ├────────────────────────────────────────────────┤ │ TOC (Table of Contents) │ │ - Offsets to each section │ └────────────────────────────────────────────────┘ ``` **Chunks File Format**: ``` Chunk 
File (chunks/000001): ┌────────────────────────────────────────────────┐ │ Chunk 1: │ │ ├─ Length (4 bytes) │ │ ├─ Encoding (1 byte): Gorilla = 0x01 │ │ ├─ MinTime (8 bytes) │ │ ├─ MaxTime (8 bytes) │ │ ├─ NumSamples (4 bytes) │ │ └─ Compressed Data (variable) │ ├────────────────────────────────────────────────┤ │ Chunk 2: ... │ └────────────────────────────────────────────────┘ ``` ### 3.3 Compression Strategy #### 3.3.1 Gorilla Compression Algorithm Nightlight uses **Gorilla compression** from Facebook's paper (VLDB 2015), achieving ~12x compression. **Timestamp Compression (Delta-of-Delta)**: ``` Example timestamps (ms): t0 = 1733832000000 t1 = 1733832015000 (Δ1 = 15000) t2 = 1733832030000 (Δ2 = 15000) t3 = 1733832045000 (Δ3 = 15000) Delta-of-delta: Δ1 = 15000 → the first delta in a block is stored directly in 14 bits (no delta-of-delta yet) D2 = Δ2 - Δ1 = 15000 - 15000 = 0 → encode in 1 bit (0) D3 = Δ3 - Δ2 = 15000 - 15000 = 0 → encode in 1 bit (0) Encoding: - If D = 0: write 1 bit "0" - If D in [-63, 64): write "10" + 7 bits - If D in [-255, 256): write "110" + 9 bits - If D in [-2047, 2048): write "1110" + 12 bits - Otherwise: write "1111" + 32 bits 96% of timestamps compress to 1 bit! ``` **Value Compression (XOR Encoding)**: ``` Example values (float64): v0 = 1543.0 v1 = 1543.5 v2 = 1543.7 XOR compression: XOR(v0, v1) = 0x40981C0000000000 XOR 0x40981E0000000000 = 0x0000020000000000 → Leading zeros: 22, Trailing zeros: 41, 1 significant bit → Encode: control bits "11" + 5-bit leading-zero count + 6-bit length + 1 significant bit XOR(v1, v2): v2 = 1543.7 has a long repeating mantissa → many significant bits, same control-bit scheme Encoding: - If v_i == v_(i-1): write 1 bit "0" - If the XOR's significant bits fit inside the previous window (leading ≥ previous leading, trailing ≥ previous trailing): write "10" + significant bits, reusing the previous window width - Otherwise: write "11" + 5-bit leading + 6-bit length + significant bits 51% of values compress to 1 bit!
``` **Rust Implementation Sketch**: ```rust struct GorillaEncoder { bit_writer: BitWriter, prev_timestamp: i64, prev_delta: i64, prev_value: f64, prev_leading_zeros: u8, prev_trailing_zeros: u8, } impl GorillaEncoder { fn encode_timestamp(&mut self, timestamp: i64) -> Result<()> { let delta = timestamp - self.prev_timestamp; let delta_of_delta = delta - self.prev_delta; if delta_of_delta == 0 { self.bit_writer.write_bit(0)?; } else if delta_of_delta >= -63 && delta_of_delta < 64 { self.bit_writer.write_bits(0b10, 2)?; self.bit_writer.write_bits(delta_of_delta as u64, 7)?; } else if delta_of_delta >= -255 && delta_of_delta < 256 { self.bit_writer.write_bits(0b110, 3)?; self.bit_writer.write_bits(delta_of_delta as u64, 9)?; } else if delta_of_delta >= -2047 && delta_of_delta < 2048 { self.bit_writer.write_bits(0b1110, 4)?; self.bit_writer.write_bits(delta_of_delta as u64, 12)?; } else { self.bit_writer.write_bits(0b1111, 4)?; self.bit_writer.write_bits(delta_of_delta as u64, 32)?; } self.prev_timestamp = timestamp; self.prev_delta = delta; Ok(()) } fn encode_value(&mut self, value: f64) -> Result<()> { let bits = value.to_bits(); let xor = bits ^ self.prev_value.to_bits(); if xor == 0 { self.bit_writer.write_bit(0)?; } else { let leading = (xor.leading_zeros() as u8).min(31); let trailing = xor.trailing_zeros() as u8; let significant_bits = 64 - leading - trailing; if leading >= self.prev_leading_zeros && trailing >= self.prev_trailing_zeros { self.bit_writer.write_bits(0b10, 2)?; let prev_significant = 64 - self.prev_leading_zeros - self.prev_trailing_zeros; let mask = (1u64 << prev_significant) - 1; let significant = (xor >> self.prev_trailing_zeros) & mask; self.bit_writer.write_bits(significant, prev_significant as usize)?; } else { self.bit_writer.write_bits(0b11, 2)?; self.bit_writer.write_bits(leading as u64, 5)?; self.bit_writer.write_bits(significant_bits as u64, 6)?; let mask = (1u64 << significant_bits) - 1; let significant = (xor >> trailing) & mask; self.bit_writer.write_bits(significant, significant_bits as usize)?; self.prev_leading_zeros = leading;
self.prev_trailing_zeros = trailing; } } self.prev_value = value; Ok(()) } } ``` #### 3.3.2 Compression Performance Targets Based on research and production systems: | Metric | Target | Reference | |--------|--------|-----------| | Average bytes/sample | 1.5-2.0 | Prometheus (1-2), Gorilla (1.37), M3DB (1.45) | | Compression ratio | 10-12x | Gorilla (12x), InfluxDB TSM (45x for specific workloads) | | Encode throughput | >500K samples/sec | Gorilla paper: 700K/sec | | Decode throughput | >1M samples/sec | Gorilla paper: 1.2M/sec | ### 3.4 Retention and Compaction Policies #### 3.4.1 Retention Policy **Default Retention**: 15 days **Configurable Parameters**: ```toml [storage] retention_days = 15 # Keep data for 15 days min_block_duration = "2h" # Minimum block size max_block_duration = "24h" # Maximum block size after compaction ``` **Retention Enforcement**: - Background task runs every 1h - Deletes blocks where `max_time < now() - retention_duration` - Deletes old WAL segments #### 3.4.2 Compaction Strategy **Purpose**: 1. Merge small blocks into larger blocks (reduce file count) 2. Remove deleted samples (tombstones) 3. Improve query performance (fewer blocks to scan) **Compaction Levels** (inspired by LevelDB): ``` Level 0: 2h blocks (compacted from Head) Level 1: 12h blocks (merge 6 L0 blocks) Level 2: 24h blocks (merge 2 L1 blocks) ``` **Compaction Trigger**: - **Time-based**: Every 2h, compact Head → Level 0 block - **Count-based**: When L0 has >4 blocks, compact → L1 - **Manual**: Admin API endpoint `/api/v1/admin/compact` **Compaction Algorithm**: ``` 1. Select blocks to compact (same level, adjacent time ranges) 2. Create new block directory (ULID) 3. Iterate all series in selected blocks: a. Merge chunks from all blocks b. Apply tombstones (skip deleted samples) c. Re-compress merged chunks d. Write to new block chunks file 4. Build new index (merge posting lists) 5. Write meta.json 6. Atomically rename block directory 7.
Delete source blocks ``` **Rust Implementation Sketch**: ```rust struct Compactor { data_dir: PathBuf, retention: Duration, } impl Compactor { async fn compact_head_to_l0(&self, head: &Head) -> Result<ULID> { let block_id = ULID::new(); let block_dir = self.data_dir.join(block_id.to_string()); std::fs::create_dir_all(block_dir.join("chunks"))?; let mut index_writer = IndexWriter::new(&block_dir.join("index"))?; let mut chunk_writer = ChunkWriter::new(&block_dir.join("chunks/000001"))?; let series_map = head.series.read().await; for (series_id, series) in series_map.iter() { let chunks = series.chunks.read().await; for chunk in chunks.iter() { if chunk.is_full() { let chunk_ref = chunk_writer.write_chunk(&chunk.samples)?; index_writer.add_series(*series_id, &series.labels, chunk_ref)?; } } } index_writer.finalize()?; chunk_writer.finalize()?; let meta = BlockMeta { ulid: block_id, min_time: head.min_time.load(Ordering::Relaxed), max_time: head.max_time.load(Ordering::Relaxed), stats: compute_stats(&block_dir)?, compaction: CompactionMeta { level: 0, sources: vec![] }, version: 1, }; write_meta(&block_dir.join("meta.json"), &meta)?; Ok(block_id) } async fn compact_blocks(&self, source_blocks: Vec<ULID>) -> Result<ULID> { // Merge multiple blocks into one // Similar to compact_head_to_l0, but reads from existing blocks } async fn enforce_retention(&self) -> Result<()> { let cutoff = SystemTime::now() - self.retention; let cutoff_ms = cutoff.duration_since(UNIX_EPOCH)?.as_millis() as i64; for entry in std::fs::read_dir(&self.data_dir)? { let path = entry?.path(); if !path.is_dir() { continue; } let meta_path = path.join("meta.json"); if !meta_path.exists() { continue; } let meta: BlockMeta = serde_json::from_reader(File::open(meta_path)?)?; if meta.max_time < cutoff_ms { std::fs::remove_dir_all(&path)?; info!("Deleted expired block: {}", meta.ulid); } } Ok(()) } } ``` --- ## 4.
Push Ingestion API ### 4.1 Prometheus Remote Write Protocol #### 4.1.1 Protocol Overview **Specification**: Prometheus Remote Write v1.0 **Transport**: HTTP/1.1 or HTTP/2 **Encoding**: Protocol Buffers (protobuf v3) **Compression**: Snappy (required) **Reference**: [Prometheus Remote Write Spec](https://prometheus.io/docs/specs/prw/remote_write_spec/) #### 4.1.2 HTTP Endpoint ``` POST /api/v1/write Content-Type: application/x-protobuf Content-Encoding: snappy X-Prometheus-Remote-Write-Version: 0.1.0 ``` **Request Flow**: ``` ┌──────────────┐ │ Client │ │ (Prometheus, │ │ FlareDB, │ │ etc.) │ └──────┬───────┘ │ │ 1. Collect samples │ ▼ ┌──────────────────────────────────┐ │ Encode to WriteRequest protobuf │ │ message │ └──────┬───────────────────────────┘ │ │ 2. Compress with Snappy │ ▼ ┌──────────────────────────────────┐ │ HTTP POST to /api/v1/write │ │ with mTLS authentication │ └──────┬───────────────────────────┘ │ │ 3. Send request │ ▼ ┌──────────────────────────────────┐ │ Nightlight Server │ │ ├─ Validate mTLS cert │ │ ├─ Decompress Snappy │ │ ├─ Decode protobuf │ │ ├─ Validate samples │ │ ├─ Append to WAL │ │ └─ Insert into Head │ └──────┬───────────────────────────┘ │ │ 4. 
Response │ ▼ ┌──────────────────────────────────┐ │ HTTP Response: │ │ 200 OK (success) │ │ 400 Bad Request (invalid) │ │ 429 Too Many Requests (backpressure) │ │ 503 Service Unavailable (overload) │ └──────────────────────────────────┘ ``` #### 4.1.3 Protobuf Schema **File**: `proto/remote_write.proto` ```protobuf syntax = "proto3"; package nightlight.remote; // Prometheus remote_write compatible schema message WriteRequest { repeated TimeSeries timeseries = 1; // Metadata is optional and not used in v1 repeated MetricMetadata metadata = 2; } message TimeSeries { repeated Label labels = 1; repeated Sample samples = 2; // Exemplars are optional (not supported in v1) repeated Exemplar exemplars = 3; } message Label { string name = 1; string value = 2; } message Sample { double value = 1; int64 timestamp = 2; // Unix timestamp in milliseconds } message Exemplar { repeated Label labels = 1; double value = 2; int64 timestamp = 3; } message MetricMetadata { enum MetricType { UNKNOWN = 0; COUNTER = 1; GAUGE = 2; HISTOGRAM = 3; GAUGEHISTOGRAM = 4; SUMMARY = 5; INFO = 6; STATESET = 7; } MetricType type = 1; string metric_family_name = 2; string help = 3; string unit = 4; } ``` **Generated Rust Code** (using `prost`): ```toml # Cargo.toml [dependencies] prost = "0.12" prost-types = "0.12" [build-dependencies] prost-build = "0.12" ``` ```rust // build.rs fn main() { prost_build::compile_protos(&["proto/remote_write.proto"], &["proto/"]).unwrap(); } ``` #### 4.1.4 Ingestion Handler **Rust Implementation**: ```rust use axum::{ Router, routing::post, extract::State, http::StatusCode, body::Bytes, }; use prost::Message; use snap::raw::Decoder as SnappyDecoder; mod remote_write_pb { include!(concat!(env!("OUT_DIR"), "/nightlight.remote.rs")); } struct IngestionService { head: Arc<Head>, wal: Arc<WAL>, rate_limiter: Arc<RateLimiter>, } async fn handle_remote_write( State(service): State<Arc<IngestionService>>, body: Bytes, ) -> Result<StatusCode, (StatusCode, String)> { // 1.
Decompress Snappy let mut decoder = SnappyDecoder::new(); let decompressed = decoder .decompress_vec(&body) .map_err(|e| (StatusCode::BAD_REQUEST, format!("Snappy decompression failed: {}", e)))?; // 2. Decode protobuf let write_req = remote_write_pb::WriteRequest::decode(&decompressed[..]) .map_err(|e| (StatusCode::BAD_REQUEST, format!("Protobuf decode failed: {}", e)))?; // 3. Validate and ingest let mut samples_ingested = 0; let mut samples_rejected = 0; for ts in write_req.timeseries.iter() { // Validate labels let labels = validate_labels(&ts.labels) .map_err(|e| (StatusCode::BAD_REQUEST, e))?; let series_id = compute_series_id(&labels); for sample in ts.samples.iter() { // Validate timestamp (not too old, not too far in future) if !is_valid_timestamp(sample.timestamp) { samples_rejected += 1; continue; } // Check rate limit if !service.rate_limiter.allow() { return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into())); } // Append to WAL let wal_record = WALRecord::Sample { series_id, timestamp: sample.timestamp, value: sample.value, }; service.wal.append(&wal_record) .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("WAL append failed: {}", e)))?; // Insert into Head if let Err(e) = service.head.append(series_id, labels.clone(), sample.timestamp, sample.value).await { let msg = e.to_string(); if msg.contains("out of order") { samples_rejected += 1; continue; } if msg.contains("buffer full") { return Err((StatusCode::SERVICE_UNAVAILABLE, "Write buffer full".into())); } return Err((StatusCode::INTERNAL_SERVER_ERROR, format!("Insert failed: {}", msg))); } samples_ingested += 1; } } info!("Ingested {} samples, rejected {}", samples_ingested, samples_rejected); Ok(StatusCode::NO_CONTENT) // 204 No Content on success } fn validate_labels(labels: &[remote_write_pb::Label]) -> Result<BTreeMap<String, String>, String> { let mut label_map = BTreeMap::new(); for label in labels { // Validate label name if !is_valid_label_name(&label.name) { return
Err(format!("Invalid label name: {}", label.name)); } // Validate label value (any UTF-8) if label.value.is_empty() { return Err(format!("Empty label value for label: {}", label.name)); } label_map.insert(label.name.clone(), label.value.clone()); } // Must have __name__ label if !label_map.contains_key("__name__") { return Err("Missing __name__ label".into()); } Ok(label_map) } fn is_valid_label_name(name: &str) -> bool { // Must match [a-zA-Z_][a-zA-Z0-9_]* if name.is_empty() { return false; } let mut chars = name.chars(); let first = chars.next().unwrap(); if !first.is_ascii_alphabetic() && first != '_' { return false; } chars.all(|c| c.is_ascii_alphanumeric() || c == '_') } fn is_valid_timestamp(ts: i64) -> bool { let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as i64; let min_valid = now - 24 * 3600 * 1000; // Not older than 24h let max_valid = now + 5 * 60 * 1000; // Not more than 5min in future ts >= min_valid && ts <= max_valid } ``` ### 4.2 gRPC API (Alternative/Additional) In addition to HTTP, Nightlight MAY support a gRPC API for ingestion (more efficient for internal services). 
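Whichever transport carries the write, each label set must map to the same `SeriesID`. A minimal, self-contained sketch of the FNV-1a scheme described in Section 3.1.2, assuming the standard FNV-1a 64-bit offset basis and prime and no external crates (`fnv1a_64` is a hypothetical helper name; `compute_series_id` mirrors the function used in the handler above):

```rust
use std::collections::BTreeMap;

/// FNV-1a 64-bit hash over raw bytes (standard offset basis and prime).
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Compute a SeriesID: a BTreeMap iterates in sorted key order, so the
/// concatenation `name \0 value \0 ...` is deterministic regardless of
/// the order labels arrived in.
fn compute_series_id(labels: &BTreeMap<String, String>) -> u64 {
    let mut buf = Vec::new();
    for (name, value) in labels {
        buf.extend_from_slice(name.as_bytes());
        buf.push(0);
        buf.extend_from_slice(value.as_bytes());
        buf.push(0);
    }
    fnv1a_64(&buf)
}

fn main() {
    let mut a = BTreeMap::new();
    a.insert("__name__".to_string(), "http_requests_total".to_string());
    a.insert("method".to_string(), "GET".to_string());

    // Insertion order must not matter: the sorted map normalizes it.
    let mut b = BTreeMap::new();
    b.insert("method".to_string(), "GET".to_string());
    b.insert("__name__".to_string(), "http_requests_total".to_string());

    assert_eq!(compute_series_id(&a), compute_series_id(&b));
    println!("series id: {:#018x}", compute_series_id(&a));
}
```

The `\0` separator prevents ambiguous concatenations such as (`ab`, `c`) vs. (`a`, `bc`) from colliding by construction.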
**Proto Definition**: ```protobuf syntax = "proto3"; package nightlight.ingest; service IngestionService { rpc Write(WriteRequest) returns (WriteResponse); rpc WriteBatch(stream WriteRequest) returns (WriteResponse); } message WriteRequest { repeated TimeSeries timeseries = 1; } message WriteResponse { uint64 samples_ingested = 1; uint64 samples_rejected = 2; string error = 3; } // (Reuse TimeSeries, Label, Sample from remote_write.proto) ``` ### 4.3 Label Validation and Normalization #### 4.3.1 Metric Name Validation Metric names (stored in `__name__` label) must match: ``` [a-zA-Z_:][a-zA-Z0-9_:]* ``` Examples: - ✅ `http_requests_total` - ✅ `node_cpu_seconds:rate5m` - ❌ `123_invalid` (starts with digit) - ❌ `invalid-metric` (contains hyphen) #### 4.3.2 Label Name Validation Label names must match: ``` [a-zA-Z_][a-zA-Z0-9_]* ``` Reserved prefixes: - `__` (double underscore): Internal labels (e.g., `__name__`, `__rollup__`) #### 4.3.3 Label Normalization Before inserting, labels are normalized: 1. Sort labels lexicographically by key 2. Ensure `__name__` label is present 3. Remove duplicate labels (keep last value) 4. Limit label count (default: 30 labels max per series) 5. 
Limit label value length (default: 1024 chars max) ### 4.4 Write Path Architecture ``` ┌──────────────────────────────────────────────────────────────┐ │ Ingestion Layer │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ HTTP/gRPC │ │ mTLS Auth │ │ Rate Limiter│ │ │ │ Handler │─▶│ Validator │─▶│ │ │ │ └─────────────┘ └─────────────┘ └──────┬──────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Decompressor │ │ │ │ (Snappy) │ │ │ └────────┬────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Protobuf │ │ │ │ Decoder │ │ │ └────────┬────────┘ │ │ │ │ └───────────────────────────────────────────┼──────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────┐ │ Validation Layer │ │ │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Label │ │ Timestamp │ │ Cardinality │ │ │ │ Validator │ │ Validator │ │ Limiter │ │ │ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │ │ │ │ │ │ └─────────────────┴─────────────────┘ │ │ │ │ └───────────────────────────┼──────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────┐ │ Write Buffer │ │ │ │ ┌────────────────────────────────────────────────────┐ │ │ │ MPSC Channel (bounded) │ │ │ │ Capacity: 100K samples │ │ │ │ Backpressure: Block/Reject when full │ │ │ └────────────────────────────────────────────────────┘ │ │ │ │ └───────────────────────────┼──────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────┐ │ Storage Layer │ │ │ │ ┌─────────────┐ ┌─────────────┐ │ │ │ WAL │◀────────│ WAL Writer │ │ │ │ (Disk) │ │ (Thread) │ │ │ └─────────────┘ └─────────────┘ │ │ │ │ ┌─────────────┐ ┌─────────────┐ │ │ │ Head │◀────────│ Head Writer│ │ │ │ (In-Memory) │ │ (Thread) │ │ │ └─────────────┘ └─────────────┘ │ └──────────────────────────────────────────────────────────────┘ ``` **Concurrency Model**: 1. **HTTP/gRPC handlers**: Multi-threaded (tokio async) 2. 
**Write buffer**: MPSC channel (bounded capacity) 3. **WAL writer**: Single-threaded (sequential writes for consistency) 4. **Head writer**: Single-threaded (lock-free inserts via sharding) **Backpressure Handling**: ```rust enum BackpressureStrategy { Block, // Block until buffer has space (default) Reject, // Return 503 immediately } impl IngestionService { async fn handle_backpressure(&self, samples: Vec<Sample>) -> Result<()> { match self.config.backpressure_strategy { BackpressureStrategy::Block => { // Try to send with timeout; a timeout or a closed buffer is an error tokio::time::timeout( Duration::from_secs(5), self.write_buffer.send(samples) ).await .map_err(|_| Error::Timeout)? .map_err(|_| Error::BufferFull)?; } BackpressureStrategy::Reject => { // Try non-blocking send self.write_buffer.try_send(samples) .map_err(|_| Error::BufferFull)?; } } Ok(()) } } ``` ### 4.5 Out-of-Order Sample Handling **Problem**: Samples may arrive out of timestamp order due to network delays, batching, etc. **Solution**: Accept out-of-order samples within a configurable time window. **Configuration**: ```toml [storage] out_of_order_time_window = "1h" # Accept samples up to 1h old ``` **Implementation**: ```rust impl Head { async fn append( &self, series_id: SeriesID, labels: BTreeMap<String, String>, timestamp: i64, value: f64, ) -> Result<()> { let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis() as i64; let min_valid_ts = now - self.config.out_of_order_time_window.as_millis() as i64; if timestamp < min_valid_ts { return Err(Error::OutOfOrder(format!( "Sample too old: ts={}, min={}", timestamp, min_valid_ts ))); } // Get or create series let mut series_map = self.series.write().await; let series = series_map.entry(series_id).or_insert_with(|| { Arc::new(Series { id: series_id, labels: labels.clone(), chunks: RwLock::new(vec![]), }) }); // Append to appropriate chunk let mut chunks = series.chunks.write().await; // Find chunk that covers this timestamp let chunk = chunks.iter_mut() .find(|c| timestamp >= c.min_time && timestamp < c.max_time) .or_else(|| { // Create
new chunk if needed let chunk_start = (timestamp / self.chunk_size.as_millis() as i64) * self.chunk_size.as_millis() as i64; let chunk_end = chunk_start + self.chunk_size.as_millis() as i64; let new_chunk = Chunk { min_time: chunk_start, max_time: chunk_end, samples: CompressedSamples::new(), }; chunks.push(new_chunk); chunks.last_mut() }) .unwrap(); chunk.samples.append(timestamp, value)?; Ok(()) } } ``` --- ## 5. PromQL Query Engine ### 5.1 PromQL Overview **PromQL** (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data. **Query Types**: 1. **Instant query**: Evaluate expression at a single point in time 2. **Range query**: Evaluate expression over a time range ### 5.2 Supported PromQL Subset Nightlight v1 supports a **pragmatic subset** of PromQL covering 80% of common dashboard queries. #### 5.2.1 Instant Vector Selectors ```promql # Select by metric name http_requests_total # Select with label matchers http_requests_total{method="GET"} http_requests_total{method="GET", status="200"} # Label matcher operators metric{label="value"} # Exact match metric{label!="value"} # Not equal metric{label=~"regex"} # Regex match metric{label!~"regex"} # Regex not match # Example http_requests_total{method=~"GET|POST", status!="500"} ``` #### 5.2.2 Range Vector Selectors ```promql # Select last 5 minutes of data http_requests_total[5m] # With label matchers http_requests_total{method="GET"}[1h] # Time durations: s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years) ``` #### 5.2.3 Aggregation Operators ```promql # sum: Sum over dimensions sum(http_requests_total) sum(http_requests_total) by (method) sum(http_requests_total) without (instance) # Supported aggregations: sum # Sum avg # Average min # Minimum max # Maximum count # Count stddev # Standard deviation stdvar # Standard variance topk(N, <vector>) # Top N series by value bottomk(N, <vector>) # Bottom N series by value ``` #### 5.2.4 Functions **Rate Functions**: ```promql
# rate: Per-second average rate of increase rate(http_requests_total[5m]) # irate: Instant rate (last two samples) irate(http_requests_total[5m]) # increase: Total increase over time range increase(http_requests_total[1h]) ``` **Quantile Functions**: ```promql # histogram_quantile: Calculate quantile from histogram histogram_quantile(0.95, rate(http_request_duration_bucket[5m])) ``` **Time Functions**: ```promql # time(): Current Unix timestamp time() # timestamp(): Timestamp of sample timestamp(metric) ``` **Math Functions**: ```promql # abs, ceil, floor, round, sqrt, exp, ln, log2, log10 abs(metric) round(metric, 0.1) ``` #### 5.2.5 Binary Operators **Arithmetic**: ```promql metric1 + metric2 metric1 - metric2 metric1 * metric2 metric1 / metric2 metric1 % metric2 metric1 ^ metric2 ``` **Comparison**: ```promql metric1 == metric2 # Equal metric1 != metric2 # Not equal metric1 > metric2 # Greater than metric1 < metric2 # Less than metric1 >= metric2 # Greater or equal metric1 <= metric2 # Less or equal ``` **Logical**: ```promql metric1 and metric2 # Intersection metric1 or metric2 # Union metric1 unless metric2 # Complement ``` **Vector Matching**: ```promql # One-to-one matching metric1 + metric2 # Many-to-one matching metric1 + on(label) group_left metric2 # One-to-many matching metric1 + on(label) group_right metric2 ``` #### 5.2.6 Subqueries (NOT SUPPORTED in v1) Subqueries are complex and not supported in v1: ```promql # NOT SUPPORTED max_over_time(rate(http_requests_total[5m])[1h:]) ``` ### 5.3 Query Execution Model #### 5.3.1 Query Parsing Use **promql-parser** crate (GreptimeTeam) for parsing: ```rust use promql_parser::{parser, label}; fn parse_query(query: &str) -> Result<parser::Expr, String> { parser::parse(query) } // Example: a selector with a range ([5m]) parses as a MatrixSelector let expr = parse_query("http_requests_total{method=\"GET\"}[5m]")?; match expr { parser::Expr::MatrixSelector(vs) => { println!("Metric: {:?}", vs.vector_selector.name); for matcher in vs.vector_selector.matchers.matchers { println!("Label: {} {} {}", matcher.name, matcher.op,
matcher.value); } println!("Range: {:?}", vs.range); } _ => {} } } ``` **AST Types**: ```rust pub enum Expr { Aggregate(AggregateExpr), // sum, avg, etc. Unary(UnaryExpr), // -metric Binary(BinaryExpr), // metric1 + metric2 Paren(ParenExpr), // (expr) Subquery(SubqueryExpr), // NOT SUPPORTED NumberLiteral(NumberLiteral), // 1.5 StringLiteral(StringLiteral), // "value" VectorSelector(VectorSelector), // metric{labels} MatrixSelector(MatrixSelector), // metric[5m] Call(Call), // rate(...) } ``` #### 5.3.2 Query Planner Convert AST to execution plan: ```rust enum QueryPlan { VectorSelector { matchers: Vec<LabelMatcher>, timestamp: i64, }, MatrixSelector { matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64, }, Aggregate { op: AggregateOp, input: Box<QueryPlan>, grouping: Vec<String>, }, RateFunc { input: Box<QueryPlan>, }, BinaryOp { op: BinaryOp, lhs: Box<QueryPlan>, rhs: Box<QueryPlan>, matching: VectorMatching, }, } struct QueryPlanner; impl QueryPlanner { fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan> { match expr { parser::Expr::VectorSelector(vs) => { Ok(QueryPlan::VectorSelector { matchers: vs.matchers.matchers.into_iter() .map(|m| LabelMatcher::from_ast(m)) .collect(), timestamp: query_time, }) } parser::Expr::MatrixSelector(ms) => { Ok(QueryPlan::MatrixSelector { matchers: ms.vector_selector.matchers.matchers.into_iter() .map(|m| LabelMatcher::from_ast(m)) .collect(), range: Duration::from_millis(ms.range as u64), timestamp: query_time, }) } parser::Expr::Call(call) => { match call.func.name.as_str() { "rate" => { let arg_plan = Self::plan(*call.args[0].clone(), query_time)?; Ok(QueryPlan::RateFunc { input: Box::new(arg_plan) }) } // ...
other functions _ => Err(Error::UnsupportedFunction(call.func.name)), } } parser::Expr::Aggregate(agg) => { let input_plan = Self::plan(*agg.expr, query_time)?; Ok(QueryPlan::Aggregate { op: AggregateOp::from_str(&agg.op.to_string())?, input: Box::new(input_plan), grouping: agg.grouping.unwrap_or_default(), }) } parser::Expr::Binary(bin) => { let lhs_plan = Self::plan(*bin.lhs, query_time)?; let rhs_plan = Self::plan(*bin.rhs, query_time)?; Ok(QueryPlan::BinaryOp { op: BinaryOp::from_str(&bin.op.to_string())?, lhs: Box::new(lhs_plan), rhs: Box::new(rhs_plan), matching: bin.modifier.map(|m| VectorMatching::from_ast(m)).unwrap_or_default(), }) } _ => Err(Error::UnsupportedExpr), } } } ``` #### 5.3.3 Query Executor Execute the plan: ```rust struct QueryExecutor { head: Arc<Head>, blocks: Arc<Blocks>, } impl QueryExecutor { async fn execute(&self, plan: QueryPlan) -> Result<QueryResult> { match plan { QueryPlan::VectorSelector { matchers, timestamp } => { self.execute_vector_selector(matchers, timestamp).await } QueryPlan::MatrixSelector { matchers, range, timestamp } => { self.execute_matrix_selector(matchers, range, timestamp).await } QueryPlan::RateFunc { input } => { let matrix = self.execute(*input).await?; self.apply_rate(matrix) } QueryPlan::Aggregate { op, input, grouping } => { let vector = self.execute(*input).await?; self.apply_aggregate(op, vector, grouping) } QueryPlan::BinaryOp { op, lhs, rhs, matching } => { let lhs_result = self.execute(*lhs).await?; let rhs_result = self.execute(*rhs).await?; self.apply_binary_op(op, lhs_result, rhs_result, matching) } } } async fn execute_vector_selector( &self, matchers: Vec<LabelMatcher>, timestamp: i64, ) -> Result<InstantVector> { // 1. Find matching series from index let series_ids = self.find_series(&matchers).await?; // 2. For each series, get sample at timestamp let mut samples = Vec::new(); for series_id in series_ids { if let Some(sample) = self.get_sample_at(series_id, timestamp).await?
{ samples.push(sample); } } Ok(InstantVector { samples }) } async fn execute_matrix_selector( &self, matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64, ) -> Result<RangeVector> { let series_ids = self.find_series(&matchers).await?; let start = timestamp - range.as_millis() as i64; let end = timestamp; let mut ranges = Vec::new(); for series_id in series_ids { let samples = self.get_samples_range(series_id, start, end).await?; ranges.push(RangeVectorSeries { labels: self.get_labels(series_id).await?, samples, }); } Ok(RangeVector { ranges }) } fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector> { let mut samples = Vec::new(); for range in matrix.ranges { if range.samples.len() < 2 { continue; // Need at least 2 samples for rate } let first = &range.samples[0]; let last = &range.samples[range.samples.len() - 1]; let delta_value = last.value - first.value; let delta_time = (last.timestamp - first.timestamp) as f64 / 1000.0; // Convert to seconds let rate = delta_value / delta_time; // (counter resets are not handled in this sketch) samples.push(Sample { labels: range.labels, timestamp: last.timestamp, value: rate, }); } Ok(InstantVector { samples }) } fn apply_aggregate( &self, op: AggregateOp, vector: InstantVector, grouping: Vec<String>, ) -> Result<InstantVector> { // Group samples by grouping labels let mut groups: HashMap<Vec<(String, String)>, Vec<Sample>> = HashMap::new(); for sample in vector.samples { let group_key = if grouping.is_empty() { vec![] } else { grouping.iter() .filter_map(|label| sample.labels.get(label).map(|v| (label.clone(), v.clone()))) .collect() }; groups.entry(group_key).or_insert_with(Vec::new).push(sample); } // Apply aggregation to each group let mut result_samples = Vec::new(); for (group_labels, samples) in groups { let aggregated_value = match op { AggregateOp::Sum => samples.iter().map(|s| s.value).sum(), AggregateOp::Avg => samples.iter().map(|s| s.value).sum::<f64>() / samples.len() as f64, AggregateOp::Min => samples.iter().map(|s| s.value).fold(f64::INFINITY, f64::min), AggregateOp::Max => samples.iter().map(|s| s.value).fold(f64::NEG_INFINITY,
f64::max), AggregateOp::Count => samples.len() as f64, // ... other aggregations }; result_samples.push(Sample { labels: group_labels.into_iter().collect(), timestamp: samples[0].timestamp, value: aggregated_value, }); } Ok(InstantVector { samples: result_samples }) } } ``` ### 5.4 Read Path Architecture ``` ┌──────────────────────────────────────────────────────────────┐ │ Query Layer │ │ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ HTTP API │ │ PromQL │ │ Query │ │ │ │ /api/v1/ │─▶│ Parser │─▶│ Planner │ │ │ │ query │ │ │ │ │ │ │ └─────────────┘ └─────────────┘ └──────┬──────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Query │ │ │ │ Executor │ │ │ └────────┬────────┘ │ └───────────────────────────────────────────┼──────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────┐ │ Index Layer │ │ │ │ ┌──────────────┐ ┌──────────────┐ │ │ │ Label Index │ │ Posting │ │ │ │ (In-Memory) │ │ Lists │ │ │ └──────┬───────┘ └──────┬───────┘ │ │ │ │ │ │ └─────────────────┘ │ │ │ │ │ │ Series IDs │ │ ▼ │ └──────────────────────────────────────────────────────────────┘ │ ▼ ┌──────────────────────────────────────────────────────────────┐ │ Storage Layer │ │ │ │ ┌─────────────┐ ┌─────────────┐ │ │ │ Head │ │ Blocks │ │ │ │ (In-Memory) │ │ (Disk) │ │ │ └─────┬───────┘ └─────┬───────┘ │ │ │ │ │ │ │ Recent data (<2h) │ Historical data │ │ │ │ │ │ ▼ ▼ │ │ ┌─────────────────────────────────────┐ │ │ │ Chunk Reader │ │ │ │ - Decompress Gorilla chunks │ │ │ │ - Filter by time range │ │ │ │ - Return samples │ │ │ └─────────────────────────────────────┘ │ └──────────────────────────────────────────────────────────────┘ ``` ### 5.5 HTTP Query API #### 5.5.1 Instant Query ``` GET /api/v1/query?query=<promql>&time=<unix_ts>&timeout=<duration> ``` **Parameters**: - `query`: PromQL expression (required) - `time`: Unix timestamp (optional, default: now) - `timeout`: Query timeout (optional, default: 30s) **Response** (JSON): ```json { "status": "success", "data": { "resultType":
"vector", "result": [ { "metric": { "__name__": "http_requests_total", "method": "GET", "status": "200" }, "value": [1733832000, "1543"] } ] } } ``` #### 5.5.2 Range Query ``` GET /api/v1/query_range?query=&start=&end=&step= ``` **Parameters**: - `query`: PromQL expression (required) - `start`: Start timestamp (required) - `end`: End timestamp (required) - `step`: Query resolution step (required, e.g., "15s") **Response** (JSON): ```json { "status": "success", "data": { "resultType": "matrix", "result": [ { "metric": { "__name__": "http_requests_total", "method": "GET" }, "values": [ [1733832000, "1543"], [1733832015, "1556"], [1733832030, "1570"] ] } ] } } ``` #### 5.5.3 Label Values Query ``` GET /api/v1/label//values?match[]= ``` **Example**: ``` GET /api/v1/label/method/values?match[]=http_requests_total ``` **Response**: ```json { "status": "success", "data": ["GET", "POST", "PUT", "DELETE"] } ``` #### 5.5.4 Series Metadata Query ``` GET /api/v1/series?match[]=&start=&end= ``` **Example**: ``` GET /api/v1/series?match[]=http_requests_total{method="GET"} ``` **Response**: ```json { "status": "success", "data": [ { "__name__": "http_requests_total", "method": "GET", "status": "200", "instance": "flaredb-1:9092" } ] } ``` ### 5.6 Performance Optimizations #### 5.6.1 Query Caching Cache query results for identical queries: ```rust struct QueryCache { cache: Arc>>, ttl: Duration, } impl QueryCache { fn get(&self, query_hash: &str) -> Option { let cache = self.cache.lock().unwrap(); if let Some((result, timestamp)) = cache.get(query_hash) { if timestamp.elapsed() < self.ttl { return Some(result.clone()); } } None } fn put(&self, query_hash: String, result: QueryResult) { let mut cache = self.cache.lock().unwrap(); cache.put(query_hash, (result, Instant::now())); } } ``` #### 5.6.2 Posting List Intersection Use efficient algorithms for label matcher intersection: ```rust fn intersect_posting_lists(lists: Vec<&[SeriesID]>) -> Vec { if lists.is_empty() { return vec![]; 
} // Sort lists by length (shortest first for early termination) let mut sorted_lists = lists; sorted_lists.sort_by_key(|list| list.len()); // Use shortest list as base, intersect with others let mut result: HashSet<SeriesID> = sorted_lists[0].iter().copied().collect(); for list in &sorted_lists[1..] { let list_set: HashSet<SeriesID> = list.iter().copied().collect(); result.retain(|id| list_set.contains(id)); if result.is_empty() { break; // Early termination } } result.into_iter().collect() } ``` #### 5.6.3 Chunk Pruning Skip chunks that don't overlap query time range: ```rust fn query_chunks( chunks: &[ChunkRef], start_time: i64, end_time: i64, ) -> Vec<ChunkRef> { chunks.iter() .filter(|chunk| { // Chunk overlaps query range if: // chunk.max_time > start AND chunk.min_time < end chunk.max_time > start_time && chunk.min_time < end_time }) .copied() .collect() } ``` --- ## 6. Storage Backend Architecture ### 6.1 Architecture Decision: Hybrid Approach After analyzing trade-offs, Nightlight adopts a **hybrid storage architecture**: 1. **Dedicated time-series engine** for sample storage (optimized for write throughput and compression) 2. **Optional FlareDB integration** for metadata and distributed coordination (future work) 3. **Optional S3-compatible backend** for cold data archival (future work) ### 6.2 Decision Rationale #### 6.2.1 Why NOT Pure FlareDB Backend?
**FlareDB Characteristics**: - General-purpose KV store with Raft consensus - Optimized for: Strong consistency, small KV pairs, random access - Storage: RocksDB (LSM tree) **Time-Series Workload Characteristics**: - High write throughput (100K samples/sec) - Sequential writes (append-only) - Temporal locality (queries focus on recent data) - Bulk reads (range scans over time windows) **Mismatch Analysis**: | Aspect | FlareDB (KV) | Time-Series Engine | |--------|--------------|-------------------| | Write pattern | Random writes, compaction overhead | Append-only, minimal overhead | | Compression | Generic LZ4/Snappy | Domain-specific (Gorilla: 12x) | | Read pattern | Point lookups | Range scans over time | | Indexing | Key-based | Label-based inverted index | | Consistency | Strong (Raft) | Eventual OK for metrics | **Conclusion**: Using FlareDB for sample storage would cost roughly 5-10x in write throughput and 10x in compression efficiency. #### 6.2.2 Why NOT VictoriaMetrics Binary? VictoriaMetrics is written in Go and has excellent performance, but: - mTLS support is **paid only** (violates PROJECT.md requirement) - Not Rust (violates the PROJECT.md requirement "Rustで書く", i.e. "write in Rust") - Cannot integrate with FlareDB for metadata (future requirement) - Less control over storage format and optimizations #### 6.2.3 Why Hybrid (Dedicated + Optional FlareDB)?
**Phase 1 (T033 v1)**: Pure dedicated engine - Simple, single-instance deployment - Focus on core functionality (ingest + query) - Local disk storage only **Phase 2 (Future)**: Add FlareDB for metadata - Store series labels and metadata in FlareDB "metrics" namespace - Enables multi-instance coordination - Global view of series cardinality, label values - Samples still in dedicated engine (local disk) **Phase 3 (Future)**: Add S3 for cold storage - Automatically upload old blocks (>7 days) to S3 - Query federation across local + S3 blocks - Unlimited retention with cost-effective storage **Benefits**: - v1 simplicity: No FlareDB dependency, easy deployment - Future scalability: Metadata in FlareDB, samples distributed - Operational flexibility: Can run standalone or integrated ### 6.3 Storage Layout #### 6.3.1 Directory Structure ``` /var/lib/nightlight/ ├── data/ │ ├── wal/ │ │ ├── 00000001 # WAL segment │ │ ├── 00000002 │ │ └── checkpoint.00000002 # WAL checkpoint │ ├── 01HQZQZQZQZQZQZQZQZQZQ/ # Block (ULID) │ │ ├── meta.json │ │ ├── index │ │ ├── chunks/ │ │ │ ├── 000001 │ │ │ └── 000002 │ │ └── tombstones │ ├── 01HQZR.../ # Another block │ └── ... └── tmp/ # Temp files for compaction ``` #### 6.3.2 Metadata Storage (Future: FlareDB Integration) When FlareDB integration is enabled: **Series Metadata** (stored in FlareDB "metrics" namespace): ``` Key: series:<series_id> Value: { "labels": {"__name__": "http_requests_total", "method": "GET", ...}, "first_seen": 1733832000000, "last_seen": 1733839200000 } Key: label_index:<label_name>:<label_value> Value: [series_id1, series_id2, ...]
# Posting list ``` **Benefits**: - Fast label value lookups across all instances - Global series cardinality tracking - Distributed query planning (future) **Trade-off**: Adds dependency on FlareDB, increases complexity ### 6.4 Scalability Approach #### 6.4.1 Vertical Scaling (v1) Single instance scales to: - 10M active series - 100K samples/sec write throughput - 1K queries/sec **Scaling strategy**: - Increase memory (more series in Head) - Faster disk (NVMe for WAL/blocks) - More CPU cores (parallel compaction, query execution) #### 6.4.2 Horizontal Scaling (Future) **Sharding Strategy** (inspired by Prometheus federation + Thanos): ``` ┌────────────────────────────────────────────────────────────┐ │ Query Frontend │ │ (Query Federation) │ └─────┬────────────────────┬─────────────────────┬───────────┘ │ │ │ ▼ ▼ ▼ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Nightlight │ │ Nightlight │ │ Nightlight │ │ Instance 1 │ │ Instance 2 │ │ Instance N │ │ │ │ │ │ │ │ Hash shard: │ │ Hash shard: │ │ Hash shard: │ │ 0-333 │ │ 334-666 │ │ 667-999 │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │ └────────────────────┴─────────────────────┘ │ ▼ ┌───────────────┐ │ FlareDB │ │ (Metadata) │ └───────────────┘ ``` **Sharding Key**: Hash(series_id) % num_shards **Query Execution**: 1. Query frontend receives PromQL query 2. Determine which shards contain matching series (via FlareDB metadata) 3. Send subqueries to relevant shards 4. Merge results (aggregation, deduplication) 5. 
Return to client **Challenges** (deferred to future work): - Rebalancing when adding/removing shards - Handling series that span multiple shards (rare) - Ensuring query consistency across shards ### 6.5 S3 Integration Strategy (Future) **Objective**: Cost-effective long-term retention (>15 days) **Architecture**: ``` ┌───────────────────────────────────────────────────┐ │ Nightlight Server │ │ │ │ ┌──────────┐ ┌──────────┐ │ │ │ Head │ │ Blocks │ │ │ │ (0-2h) │ │ (2h-15d)│ │ │ └──────────┘ └────┬─────┘ │ │ │ │ │ │ Background uploader │ │ ▼ │ │ ┌─────────────┐ │ │ │ Upload to │ │ │ │ S3 (>7d) │ │ │ └──────┬──────┘ │ └──────────────────────────┼────────────────────────┘ │ ▼ ┌─────────────────┐ │ S3 Bucket │ │ /blocks/ │ │ 01HQZ.../ │ │ 01HRZ.../ │ └─────────────────┘ ``` **Workflow**: 1. Block compaction creates local block files 2. Blocks older than 7 days (configurable) are uploaded to S3 3. Local block files deleted after successful upload 4. Query executor checks both local and S3 for blocks in query range 5. Download S3 blocks on-demand (with local cache) **Configuration**: ```toml [storage.s3] enabled = true endpoint = "https://s3.example.com" bucket = "nightlight-blocks" access_key_id = "..." secret_access_key = "..." upload_after_days = 7 local_cache_size_gb = 100 ``` --- ## 7. Integration Points ### 7.1 Service Discovery (How Services Push Metrics) #### 7.1.1 Service Configuration Pattern Each platform service (FlareDB, ChainFire, etc.) exports Prometheus metrics on ports 9091-9099. **Example** (FlareDB metrics exporter): ```rust // flaredb-server/src/main.rs use metrics_exporter_prometheus::PrometheusBuilder; use std::net::SocketAddr; #[tokio::main] async fn main() -> Result<()> { // ... initialization ... let metrics_addr = format!("0.0.0.0:{}", args.metrics_port); let builder = PrometheusBuilder::new(); builder .with_http_listener(metrics_addr.parse::<SocketAddr>()?)
.install() .expect("Failed to install Prometheus metrics exporter"); info!("Prometheus metrics available at http://{}/metrics", metrics_addr); // ... rest of main ... } ``` **Service Metrics Ports** (from T027.S2): | Service | Port | Endpoint | |---------|------|----------| | ChainFire | 9091 | http://chainfire:9091/metrics | | FlareDB | 9092 | http://flaredb:9092/metrics | | PlasmaVMC | 9093 | http://plasmavmc:9093/metrics | | IAM | 9094 | http://iam:9094/metrics | | LightningSTOR | 9095 | http://lightningstor:9095/metrics | | FlashDNS | 9096 | http://flashdns:9096/metrics | | FiberLB | 9097 | http://fiberlb:9097/metrics | | Prismnet | 9098 | http://prismnet:9098/metrics | #### 7.1.2 Scrape-to-Push Adapter Since Nightlight is **push-based** but services export **pull-based** Prometheus `/metrics` endpoints, we need a scrape-to-push adapter. **Option 1**: Prometheus Agent Mode + Remote Write Deploy Prometheus in agent mode (no storage, only scraping): ```yaml # prometheus-agent.yaml global: scrape_interval: 15s external_labels: cluster: 'cloud-platform' scrape_configs: - job_name: 'chainfire' static_configs: - targets: ['chainfire:9091'] - job_name: 'flaredb' static_configs: - targets: ['flaredb:9092'] # ... other services ... 
remote_write: - url: 'https://nightlight:8080/api/v1/write' tls_config: cert_file: /etc/certs/client.crt key_file: /etc/certs/client.key ca_file: /etc/certs/ca.crt ``` **Option 2**: Custom Rust Scraper (Platform-Native) Build a lightweight scraper in Rust that integrates with Nightlight: ```rust // nightlight-scraper/src/main.rs struct Scraper { targets: Vec<ScrapeTarget>, client: reqwest::Client, nightlight_client: NightlightClient, } struct ScrapeTarget { job_name: String, url: String, interval: Duration, } impl Scraper { async fn scrape_loop(&self) { loop { for target in &self.targets { let result = self.scrape_target(target).await; match result { Ok(samples) => { if let Err(e) = self.nightlight_client.write(samples).await { error!("Failed to write to Nightlight: {}", e); } } Err(e) => { error!("Failed to scrape {}: {}", target.url, e); } } } tokio::time::sleep(Duration::from_secs(15)).await; } } async fn scrape_target(&self, target: &ScrapeTarget) -> Result<Vec<Sample>> { let response = self.client.get(&target.url).send().await?; let body = response.text().await?; // Parse Prometheus text format let samples = parse_prometheus_text(&body, &target.job_name)?; Ok(samples) } } fn parse_prometheus_text(text: &str, job: &str) -> Result<Vec<Sample>> { // Use prometheus-parse crate or implement simple parser // Example output: // http_requests_total{method="GET",status="200",job="flaredb"} 1543 1733832000000 let _ = (text, job); todo!("parse the Prometheus text exposition format into samples") } ``` **Deployment**: - `nightlight-scraper` runs as a sidecar or separate service - Reads scrape config from TOML file - Uses mTLS to push to Nightlight **Recommendation**: Option 2 (custom scraper) for consistency with platform philosophy (100% Rust, no external dependencies).
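The `parse_prometheus_text` stub is where most of the scraper's logic lives. Below is a std-only sketch of the line-oriented parsing it would perform; the `ParsedSample` type and `parse_line` function are illustrative names, not part of the Nightlight API, and escaped quotes inside label values as well as `# HELP`/`# TYPE` metadata handling are deliberately out of scope:

```rust
// Illustrative sketch: parse one line of the Prometheus text exposition
// format, e.g. `http_requests_total{method="GET"} 1543 1733832000000`.
#[derive(Debug, PartialEq)]
pub struct ParsedSample {
    pub name: String,
    pub labels: Vec<(String, String)>,
    pub value: f64,
    pub timestamp_ms: Option<i64>,
}

/// Returns None for comments, blank lines, and lines that don't parse.
pub fn parse_line(line: &str) -> Option<ParsedSample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None; // skip `# HELP` / `# TYPE` and blank lines
    }
    // Split "<name>{<labels>}" from "<value> [<timestamp_ms>]"
    let (series, rest) = match line.find('{') {
        Some(_) => {
            let close = line.rfind('}')?;
            (&line[..close + 1], line[close + 1..].trim())
        }
        None => {
            let idx = line.find(char::is_whitespace)?;
            (&line[..idx], line[idx..].trim())
        }
    };
    // Split the metric name from the label pairs
    let (name, labels) = match series.find('{') {
        Some(open) => {
            let body = &series[open + 1..series.len() - 1];
            let labels = body
                .split(',')
                .filter(|p| !p.trim().is_empty())
                .filter_map(|pair| {
                    let (k, v) = pair.split_once('=')?;
                    Some((k.trim().to_string(), v.trim().trim_matches('"').to_string()))
                })
                .collect();
            (series[..open].to_string(), labels)
        }
        None => (series.to_string(), Vec::new()),
    };
    // Value, then optional millisecond timestamp
    let mut fields = rest.split_whitespace();
    let value: f64 = fields.next()?.parse().ok()?;
    let timestamp_ms = fields.next().and_then(|t| t.parse().ok());
    Some(ParsedSample { name, labels, value, timestamp_ms })
}
```

A production parser would also inject the `job` label as shown in the stub's example output and handle escape sequences; the prometheus-parse crate covers both.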
### 7.2 mTLS Configuration (T027/T031 Patterns) #### 7.2.1 TLS Config Structure Following existing patterns (FlareDB, ChainFire, IAM): ```toml # nightlight.toml [server] addr = "0.0.0.0:8080" log_level = "info" [server.tls] cert_file = "/etc/nightlight/certs/server.crt" key_file = "/etc/nightlight/certs/server.key" ca_file = "/etc/nightlight/certs/ca.crt" require_client_cert = true # Enable mTLS ``` **Rust Config Struct**: ```rust // nightlight-server/src/config.rs use serde::{Deserialize, Serialize}; use std::net::SocketAddr; #[derive(Debug, Clone, Serialize, Deserialize)] pub struct ServerConfig { pub server: ServerSettings, pub storage: StorageConfig, } #[derive(Debug, Clone, Serialize, Deserialize)] pub struct ServerSettings { pub addr: SocketAddr, pub log_level: String, pub tls: Option<TlsConfig>, } #[derive(Debug, Clone, Serialize, Deserialize)] pub struct TlsConfig { pub cert_file: String, pub key_file: String, pub ca_file: Option<String>, #[serde(default)] pub require_client_cert: bool, } #[derive(Debug, Clone, Serialize, Deserialize)] pub struct StorageConfig { pub data_dir: String, pub retention_days: u32, pub wal_segment_size_mb: usize, // ...
other storage settings } ``` #### 7.2.2 mTLS Server Setup ```rust // nightlight-server/src/main.rs use axum::Router; use axum_server::tls_rustls::RustlsConfig; use std::sync::Arc; #[tokio::main] async fn main() -> Result<()> { let config = ServerConfig::load("nightlight.toml")?; // Build router let app = Router::new() .route("/api/v1/write", post(handle_remote_write)) .route("/api/v1/query", get(handle_instant_query)) .route("/api/v1/query_range", get(handle_range_query)) .route("/health", get(health_check)) .route("/ready", get(readiness_check)) .with_state(Arc::new(service)); // Setup TLS if configured if let Some(tls_config) = &config.server.tls { info!("TLS enabled, loading certificates..."); let rustls_config = if tls_config.require_client_cert { info!("mTLS enabled, requiring client certificates"); let ca_cert_pem = tokio::fs::read_to_string( tls_config.ca_file.as_ref().ok_or("ca_file required for mTLS")? ).await?; // NOTE: sketch only; axum-server's RustlsConfig has no client-cert helper. // A real implementation builds a rustls::ServerConfig with a client // certificate verifier and passes it via RustlsConfig::from_config. RustlsConfig::from_pem_file( &tls_config.cert_file, &tls_config.key_file, ) .await? .with_client_cert_verifier(ca_cert_pem) } else { info!("TLS-only mode, client certificates not required"); RustlsConfig::from_pem_file( &tls_config.cert_file, &tls_config.key_file, ).await? }; axum_server::bind_rustls(config.server.addr, rustls_config) .serve(app.into_make_service()) .await?; } else { info!("TLS disabled, running in plain-text mode"); axum_server::bind(config.server.addr) .serve(app.into_make_service()) .await?; } Ok(()) } ``` #### 7.2.3 Client Certificate Validation Extract client identity from mTLS certificate: ```rust use axum::{ http::Request, middleware::Next, response::Response, Extension, }; use axum_server::tls_rustls::RustlsAcceptor; #[derive(Clone, Debug)] struct ClientIdentity { common_name: String, organization: String, } async fn extract_client_identity( Extension(client_cert): Extension<Option<Certificate>>, mut request: Request, next: Next, ) -> Response { if let Some(cert) = client_cert { // Parse certificate to extract CN, O, etc.
let identity = parse_certificate(&cert); request.extensions_mut().insert(identity); } next.run(request).await } // Use identity for rate limiting, audit logging, etc. async fn handle_remote_write( Extension(identity): Extension<ClientIdentity>, State(service): State<Arc<NightlightService>>, body: Bytes, ) -> Result<StatusCode, (StatusCode, String)> { info!("Write request from: {}", identity.common_name); // Apply per-client rate limiting if !service.rate_limiter.allow(&identity.common_name) { return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into())); } // ... rest of handler ... } ``` ### 7.3 gRPC API Design While HTTP is the primary interface (Prometheus compatibility), a gRPC API can provide: - Better performance for internal services - Streaming support for batch ingestion - Type-safe client libraries **Proto Definition**: ```protobuf // proto/nightlight.proto syntax = "proto3"; package nightlight.v1; service NightlightService { // Write samples rpc Write(WriteRequest) returns (WriteResponse); // Streaming write for high-throughput scenarios rpc WriteStream(stream WriteRequest) returns (WriteResponse); // Query (instant) rpc Query(QueryRequest) returns (QueryResponse); // Query (range) rpc QueryRange(QueryRangeRequest) returns (QueryRangeResponse); // Admin operations rpc Compact(CompactRequest) returns (CompactResponse); rpc DeleteSeries(DeleteSeriesRequest) returns (DeleteSeriesResponse); } message WriteRequest { repeated TimeSeries timeseries = 1; } message WriteResponse { uint64 samples_ingested = 1; uint64 samples_rejected = 2; } message QueryRequest { string query = 1; // PromQL int64 time = 2; // Unix timestamp (ms) int64 timeout_ms = 3; } message QueryResponse { string result_type = 1; // "vector" or "matrix" repeated InstantVectorSample vector = 2; repeated RangeVectorSeries matrix = 3; } message InstantVectorSample { map<string, string> labels = 1; double value = 2; int64 timestamp = 3; } message RangeVectorSeries { map<string, string> labels = 1; repeated Sample samples = 2; } message Sample { double value = 1; int64 timestamp = 2; } ```
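As a sketch of how a client library might feed the `WriteStream` RPC, the helper below chunks a large set of series into bounded `WriteRequest` messages. The structs are plain-Rust mirrors of the proto messages above (a real client would use the tonic/prost-generated types), and the batch size is an illustrative tuning knob, not a Nightlight constant:

```rust
use std::collections::HashMap;

// Plain-Rust mirrors of the proto messages, for illustration only.
#[derive(Debug, Clone)]
pub struct Sample { pub value: f64, pub timestamp: i64 }

#[derive(Debug, Clone)]
pub struct TimeSeries {
    pub labels: HashMap<String, String>,
    pub samples: Vec<Sample>,
}

#[derive(Debug)]
pub struct WriteRequest { pub timeseries: Vec<TimeSeries> }

/// Split a large set of series into bounded WriteRequests so that each
/// message sent on the WriteStream RPC stays well under gRPC's default
/// 4 MiB message limit. A series is never split across requests.
pub fn batch_for_stream(
    series: Vec<TimeSeries>,
    max_samples_per_request: usize,
) -> Vec<WriteRequest> {
    let mut requests = Vec::new();
    let mut current = WriteRequest { timeseries: Vec::new() };
    let mut current_samples = 0;
    for ts in series {
        let n = ts.samples.len();
        // Flush the current request if adding this series would overflow it
        if current_samples + n > max_samples_per_request && !current.timeseries.is_empty() {
            requests.push(std::mem::replace(
                &mut current,
                WriteRequest { timeseries: Vec::new() },
            ));
            current_samples = 0;
        }
        current_samples += n;
        current.timeseries.push(ts);
    }
    if !current.timeseries.is_empty() {
        requests.push(current);
    }
    requests
}
```

Streaming bounded requests keeps each gRPC message small while still amortizing per-RPC overhead across many series.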
### 7.4 NixOS Module Integration Following T024 patterns, create a NixOS module for Nightlight. **File**: `nix/modules/nightlight.nix` ```nix { config, lib, pkgs, ... }: with lib; let cfg = config.services.nightlight; configFile = pkgs.writeText "nightlight.toml" '' [server] addr = "${cfg.listenAddress}" log_level = "${cfg.logLevel}" ${optionalString (cfg.tls.enable) '' [server.tls] cert_file = "${cfg.tls.certFile}" key_file = "${cfg.tls.keyFile}" ${optionalString (cfg.tls.caFile != null) '' ca_file = "${cfg.tls.caFile}" ''} require_client_cert = ${boolToString cfg.tls.requireClientCert} ''} [storage] data_dir = "${cfg.dataDir}" retention_days = ${toString cfg.storage.retentionDays} wal_segment_size_mb = ${toString cfg.storage.walSegmentSizeMb} ''; in { options.services.nightlight = { enable = mkEnableOption "Nightlight metrics storage service"; package = mkOption { type = types.package; default = pkgs.nightlight; description = "Nightlight package to use"; }; listenAddress = mkOption { type = types.str; default = "0.0.0.0:8080"; description = "Address and port to listen on"; }; logLevel = mkOption { type = types.enum [ "trace" "debug" "info" "warn" "error" ]; default = "info"; description = "Log level"; }; dataDir = mkOption { type = types.path; default = "/var/lib/nightlight"; description = "Data directory for TSDB storage"; }; tls = { enable = mkEnableOption "TLS encryption"; certFile = mkOption { type = types.str; description = "Path to TLS certificate file"; }; keyFile = mkOption { type = types.str; description = "Path to TLS private key file"; }; caFile = mkOption { type = types.nullOr types.str; default = null; description = "Path to CA certificate for client verification (mTLS)"; }; requireClientCert = mkOption { type = types.bool; default = false; description = "Require client certificates (mTLS)"; }; }; storage = { retentionDays = mkOption { type = types.ints.positive; default = 15; description = "Data retention period in days"; }; walSegmentSizeMb = 
      mkOption {
        type = types.ints.positive;
        default = 128;
        description = "WAL segment size in MB";
      };
    };

    openFirewall = mkOption {
      type = types.bool;
      default = false;
      description = "Whether to open the firewall for the listen port";
    };
  };

  config = mkIf cfg.enable {
    systemd.services.nightlight = {
      description = "Nightlight Metrics Storage Service";
      wantedBy = [ "multi-user.target" ];
      after = [ "network.target" ];

      serviceConfig = {
        Type = "simple";
        ExecStart = "${cfg.package}/bin/nightlight-server --config ${configFile}";
        Restart = "on-failure";
        RestartSec = "5s";

        # Security hardening
        DynamicUser = true;
        StateDirectory = "nightlight";
        ProtectSystem = "strict";
        ProtectHome = true;
        PrivateTmp = true;
        NoNewPrivileges = true;
      };
    };

    # Expose metrics endpoint
    networking.firewall.allowedTCPPorts = mkIf cfg.openFirewall [ 8080 ];
  };
}
```

**Usage Example** (in NixOS configuration):

```nix
{
  services.nightlight = {
    enable = true;
    listenAddress = "0.0.0.0:8080";
    logLevel = "info";

    tls = {
      enable = true;
      certFile = "/etc/certs/nightlight-server.crt";
      keyFile = "/etc/certs/nightlight-server.key";
      caFile = "/etc/certs/ca.crt";
      requireClientCert = true;
    };

    storage = {
      retentionDays = 30;
    };
  };
}
```

---

## 8. Implementation Plan

### 8.1 Step Breakdown (S1-S6)

The implementation follows a phased approach aligned with the task.yaml steps.

#### **S1: Research & Architecture** ✅ (Current Document)

**Deliverable**: This design document
**Status**: Completed

---

#### **S2: Workspace Scaffold**

**Goal**: Create nightlight workspace with skeleton structure

**Tasks**:

1. Create workspace structure:
   ```
   nightlight/
   ├── Cargo.toml
   ├── crates/
   │   ├── nightlight-api/     # Client library
   │   ├── nightlight-server/  # Main service
   │   └── nightlight-types/   # Shared types
   ├── proto/
   │   ├── remote_write.proto
   │   └── nightlight.proto
   └── README.md
   ```
2. Setup proto compilation in build.rs
3.
Define core types:

```rust
// nightlight-types/src/lib.rs
use std::collections::BTreeMap;

pub type SeriesID = u64;
pub type Timestamp = i64; // Unix timestamp in milliseconds

pub struct Sample {
    pub timestamp: Timestamp,
    pub value: f64,
}

pub struct Series {
    pub id: SeriesID,
    pub labels: BTreeMap<String, String>,
}

pub struct LabelMatcher {
    pub name: String,
    pub value: String,
    pub op: MatchOp,
}

pub enum MatchOp {
    Equal,
    NotEqual,
    RegexMatch,
    RegexNotMatch,
}
```

4. Add dependencies:

```toml
[workspace.dependencies]
# Core
tokio = { version = "1.35", features = ["full"] }
anyhow = "1.0"
tracing = "0.1"
tracing-subscriber = "0.3"

# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
toml = "0.8"

# gRPC
tonic = "0.10"
prost = "0.12"
prost-types = "0.12"

# HTTP
axum = "0.7"
axum-server = { version = "0.6", features = ["tls-rustls"] }

# Compression
snap = "1.1" # Snappy

# Time-series
promql-parser = "0.4"

# Storage
rocksdb = "0.21" # (NOT for TSDB, only for examples)

# Crypto
rustls = "0.21"
```

**Estimated Effort**: 2 days

---

#### **S3: Push Ingestion**

**Goal**: Implement Prometheus remote_write compatible ingestion endpoint

**Tasks**:

1. **Implement WAL**:

```rust
// nightlight-server/src/wal.rs
struct WAL {
    dir: PathBuf,
    segment_size: usize,
    active_segment: RwLock<Segment>,
}

impl WAL {
    fn new(dir: PathBuf, segment_size: usize) -> Result<Self>;
    fn append(&self, record: WALRecord) -> Result<()>;
    fn replay(&self) -> Result<Vec<WALRecord>>;
    fn checkpoint(&self, min_segment: u64) -> Result<()>;
}
```

2. **Implement In-Memory Head Block**:

```rust
// nightlight-server/src/head.rs
struct Head {
    series: DashMap<SeriesID, RwLock<MemSeries>>, // Concurrent HashMap
    min_time: AtomicI64,
    max_time: AtomicI64,
    config: HeadConfig,
}

impl Head {
    async fn append(&self, series_id: SeriesID, labels: Labels, ts: Timestamp, value: f64) -> Result<()>;
    async fn get(&self, series_id: SeriesID) -> Option<Vec<Sample>>;
    async fn series_count(&self) -> usize;
}
```

3.
**Implement Gorilla Compression** (basic version):

```rust
// nightlight-server/src/compression.rs
struct GorillaEncoder { /* ... */ }
struct GorillaDecoder { /* ... */ }

impl GorillaEncoder {
    fn encode_timestamp(&mut self, ts: i64) -> Result<()>;
    fn encode_value(&mut self, value: f64) -> Result<()>;
    fn finish(self) -> Vec<u8>;
}
```

4. **Implement HTTP Ingestion Handler**:

```rust
// nightlight-server/src/handlers/ingest.rs
async fn handle_remote_write(
    State(service): State<Arc<IngestService>>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    // 1. Decompress Snappy
    // 2. Decode protobuf
    // 3. Validate samples
    // 4. Append to WAL
    // 5. Insert into Head
    // 6. Return 204 No Content
}
```

5. **Add Rate Limiting**:

```rust
struct RateLimiter {
    rate: f64, // samples/sec
    tokens: AtomicU64,
}

impl RateLimiter {
    fn allow(&self) -> bool;
}
```

6. **Integration Test**:

```rust
#[tokio::test]
async fn test_remote_write_ingestion() {
    // Start server
    // Send WriteRequest
    // Verify samples stored
}
```

**Estimated Effort**: 5 days

---

#### **S4: PromQL Query Engine**

**Goal**: Basic PromQL query support (instant + range queries)

**Tasks**:

1. **Integrate promql-parser**:

```rust
// nightlight-server/src/query/parser.rs
use promql_parser::parser;

pub fn parse(query: &str) -> Result<parser::Expr> {
    parser::parse(query).map_err(|e| Error::ParseError(e.to_string()))
}
```

2. **Implement Query Planner**:

```rust
// nightlight-server/src/query/planner.rs
pub enum QueryPlan {
    VectorSelector { matchers: Vec<LabelMatcher>, timestamp: i64 },
    MatrixSelector { matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64 },
    Aggregate { op: AggregateOp, input: Box<QueryPlan>, grouping: Vec<String> },
    RateFunc { input: Box<QueryPlan> },
    // ... other operators
}

pub fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan>;
```

3.
**Implement Label Index**:

```rust
// nightlight-server/src/index.rs
struct LabelIndex {
    // label_name -> label_value -> [series_ids]
    inverted_index: DashMap<String, DashMap<String, Vec<SeriesID>>>,
}

impl LabelIndex {
    fn find_series(&self, matchers: &[LabelMatcher]) -> Result<Vec<SeriesID>>;
    fn add_series(&self, series_id: SeriesID, labels: &Labels);
}
```

4. **Implement Query Executor**:

```rust
// nightlight-server/src/query/executor.rs
struct QueryExecutor {
    head: Arc<Head>,
    blocks: Arc<BlockManager>,
    index: Arc<LabelIndex>,
}

impl QueryExecutor {
    async fn execute(&self, plan: QueryPlan) -> Result<QueryResult>;
    async fn execute_vector_selector(&self, matchers: Vec<LabelMatcher>, ts: i64) -> Result<InstantVector>;
    async fn execute_matrix_selector(&self, matchers: Vec<LabelMatcher>, range: Duration, ts: i64) -> Result<RangeVector>;
    fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector>;
    fn apply_aggregate(&self, op: AggregateOp, vector: InstantVector, grouping: Vec<String>) -> Result<InstantVector>;
}
```

5. **Implement HTTP Query Handlers**:

```rust
// nightlight-server/src/handlers/query.rs
async fn handle_instant_query(
    Query(params): Query<InstantQueryParams>,
    State(executor): State<Arc<QueryExecutor>>,
) -> Result<Json<QueryResponse>, (StatusCode, String)> {
    let expr = parse(&params.query)?;
    let plan = plan(expr, params.time.unwrap_or_else(now))?;
    let result = executor.execute(plan).await?;
    Ok(Json(format_response(result)))
}

async fn handle_range_query(
    Query(params): Query<RangeQueryParams>,
    State(executor): State<Arc<QueryExecutor>>,
) -> Result<Json<QueryResponse>, (StatusCode, String)> {
    // Similar to instant query, but iterate over [start, end] with step
}
```

6. **Integration Test**:

```rust
#[tokio::test]
async fn test_instant_query() {
    // Ingest samples
    // Query: http_requests_total{method="GET"}
    // Verify results
}

#[tokio::test]
async fn test_range_query_with_rate() {
    // Ingest counter samples
    // Query: rate(http_requests_total[5m])
    // Verify rate calculation
}
```

**Estimated Effort**: 7 days

---

#### **S5: Storage Layer**

**Goal**: Time-series storage with retention and compaction

**Tasks**:

1.
**Implement Block Writer**:

```rust
// nightlight-server/src/block/writer.rs
struct BlockWriter {
    block_dir: PathBuf,
    index_writer: IndexWriter,
    chunk_writer: ChunkWriter,
}

impl BlockWriter {
    fn new(block_dir: PathBuf) -> Result<Self>;
    fn write_series(&mut self, series: &Series, samples: &[Sample]) -> Result<()>;
    fn finalize(self) -> Result<BlockMeta>;
}
```

2. **Implement Block Reader**:

```rust
// nightlight-server/src/block/reader.rs
struct BlockReader {
    meta: BlockMeta,
    index: Index,
    chunks: ChunkReader,
}

impl BlockReader {
    fn open(block_dir: PathBuf) -> Result<Self>;
    fn query_samples(&self, series_id: SeriesID, start: i64, end: i64) -> Result<Vec<Sample>>;
}
```

3. **Implement Compaction**:

```rust
// nightlight-server/src/compaction.rs
struct Compactor {
    data_dir: PathBuf,
    config: CompactionConfig,
}

impl Compactor {
    async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockMeta>;
    async fn compact_blocks(&self, source_blocks: Vec<BlockMeta>) -> Result<BlockMeta>;
    async fn run_compaction_loop(&self); // Background task
}
```

4. **Implement Retention Enforcement**:

```rust
impl Compactor {
    async fn enforce_retention(&self, retention: Duration) -> Result<()> {
        let cutoff = SystemTime::now() - retention;
        // Delete blocks older than cutoff
    }
}
```

5. **Implement Block Manager**:

```rust
// nightlight-server/src/block/manager.rs
struct BlockManager {
    blocks: RwLock<Vec<Arc<BlockReader>>>,
    data_dir: PathBuf,
}

impl BlockManager {
    fn load_blocks(&mut self) -> Result<()>;
    fn add_block(&mut self, block: BlockReader);
    fn remove_block(&mut self, block_id: &BlockID);
    fn query_blocks(&self, start: i64, end: i64) -> Vec<Arc<BlockReader>>;
}
```

6.
**Integration Test**: ```rust #[tokio::test] async fn test_compaction() { // Ingest data for >2h // Trigger compaction // Verify block created // Query old data from block } #[tokio::test] async fn test_retention() { // Create old blocks // Run retention enforcement // Verify old blocks deleted } ``` **Estimated Effort**: 8 days --- #### **S6: Integration & Documentation** **Goal**: NixOS module, TLS config, integration tests, operator docs **Tasks**: 1. **Create NixOS Module**: - File: `nix/modules/nightlight.nix` - Follow T024 patterns - Include systemd service, firewall rules - Support TLS configuration options 2. **Implement mTLS**: - Load certs in server startup - Configure Rustls with client cert verification - Extract client identity for rate limiting 3. **Create Nightlight Scraper**: - Standalone scraper service - Reads scrape config (TOML) - Scrapes `/metrics` endpoints from services - Pushes to Nightlight via remote_write 4. **Integration Tests**: ```rust #[tokio::test] async fn test_e2e_ingest_and_query() { // Start Nightlight server // Ingest samples via remote_write // Query via /api/v1/query // Query via /api/v1/query_range // Verify results match } #[tokio::test] async fn test_mtls_authentication() { // Start server with mTLS // Connect without client cert -> rejected // Connect with valid client cert -> accepted } #[tokio::test] async fn test_grafana_compatibility() { // Configure Grafana to use Nightlight // Execute sample queries // Verify dashboards render correctly } ``` 5. **Write Operator Documentation**: - **File**: `docs/por/T033-nightlight/OPERATOR.md` - Installation (NixOS, standalone) - Configuration guide - mTLS setup - Scraper configuration - Troubleshooting - Performance tuning 6. 
**Write Developer Documentation**: - **File**: `nightlight/README.md` - Architecture overview - Building from source - Running tests - Contributing guidelines **Estimated Effort**: 5 days --- ### 8.2 Dependency Ordering ``` S1 (Research) → S2 (Scaffold) ↓ S3 (Ingestion) ──┐ ↓ │ S4 (Query) │ ↓ │ S5 (Storage) ←────┘ ↓ S6 (Integration) ``` **Critical Path**: S1 → S2 → S3 → S5 → S6 **Parallelizable**: S4 can start after S3 completes basic ingestion ### 8.3 Total Effort Estimate | Step | Effort | Priority | |------|--------|----------| | S1: Research | 2 days | P0 | | S2: Scaffold | 2 days | P0 | | S3: Ingestion | 5 days | P0 | | S4: Query Engine | 7 days | P0 | | S5: Storage Layer | 8 days | P1 | | S6: Integration | 5 days | P1 | | **Total** | **29 days** | | **Realistic Timeline**: 6-8 weeks (accounting for testing, debugging, documentation) --- ## 9. Open Questions ### 9.1 Decisions Requiring User Input #### Q1: Scraper Implementation Choice **Question**: Should we use Prometheus in agent mode or build a custom Rust scraper? **Option A**: Prometheus Agent + Remote Write - **Pros**: Battle-tested, standard tool, no implementation effort - **Cons**: Adds Go dependency, less platform integration **Option B**: Custom Rust Scraper - **Pros**: 100% Rust, platform consistency, easier integration - **Cons**: Implementation effort, needs testing **Recommendation**: Option B (custom scraper) for consistency with PROJECT.md philosophy **Decision**: [ ] A [ ] B [ ] Defer to later --- #### Q2: gRPC vs HTTP Priority **Question**: Should we prioritize gRPC API or focus only on HTTP (Prometheus compatibility)? 
**Option A**: HTTP only (v1) - **Pros**: Simpler, Prometheus/Grafana compatibility is sufficient - **Cons**: Less efficient for internal services **Option B**: Both HTTP and gRPC (v1) - **Pros**: Better performance for internal services, more flexibility - **Cons**: More implementation effort **Recommendation**: Option A for v1, add gRPC in v2 if needed **Decision**: [ ] A [ ] B --- #### Q3: FlareDB Metadata Integration Timeline **Question**: When should we integrate FlareDB for metadata storage? **Option A**: v1 (T033) - **Pros**: Unified metadata story from the start - **Cons**: Increases complexity, adds dependency **Option B**: v2 (Future) - **Pros**: Simpler v1, can deploy standalone - **Cons**: Migration effort later **Recommendation**: Option B (defer to v2) **Decision**: [ ] A [ ] B --- #### Q4: S3 Cold Storage Priority **Question**: Should S3 cold storage be part of v1 or deferred? **Option A**: v1 (T033.S5) - **Pros**: Unlimited retention from day 1 - **Cons**: Complexity, operational overhead **Option B**: v2 (Future) - **Pros**: Simpler v1, focus on core functionality - **Cons**: Limited retention (local disk only) **Recommendation**: Option B (defer to v2), use local disk for v1 with 15-30 day retention **Decision**: [ ] A [ ] B --- ### 9.2 Areas Needing Further Investigation #### I1: PromQL Function Coverage **Issue**: Need to determine exact subset of PromQL functions to support in v1. **Investigation Needed**: - Survey existing Grafana dashboards in use - Identify most common functions (rate, increase, histogram_quantile, etc.) - Prioritize by usage frequency **Proposed Approach**: - Analyze 10-20 sample dashboards - Create coverage matrix - Implement top 80% functions first --- #### I2: Query Performance Benchmarking **Issue**: Need to validate query latency targets (p95 <100ms) are achievable. 
**Investigation Needed**: - Benchmark promql-parser crate performance - Measure Gorilla decompression throughput - Test index lookup performance at 10M series scale **Proposed Approach**: - Create benchmark suite with synthetic data (1M, 10M series) - Measure end-to-end query latency - Identify bottlenecks and optimize --- #### I3: Series Cardinality Limits **Issue**: How to prevent series explosion (high cardinality killing performance)? **Investigation Needed**: - Research cardinality estimation algorithms (HyperLogLog) - Define cardinality limits (per metric, per label, global) - Implement rejection strategy (reject new series beyond limit) **Proposed Approach**: - Add cardinality tracking to label index - Implement warnings at 80% limit, rejection at 100% - Provide admin API to inspect high-cardinality series --- #### I4: Out-of-Order Sample Edge Cases **Issue**: How to handle out-of-order samples spanning chunk boundaries? **Investigation Needed**: - Test scenarios: samples arriving 1h late, 2h late, etc. - Determine if we need multi-chunk updates or reject old samples - Benchmark impact of re-sorting chunks **Proposed Approach**: - Implement configurable out-of-order window (default: 1h) - Reject samples older than window - For within-window samples, insert into correct chunk (may require chunk re-compression) --- ## 10. 
References ### 10.1 Research Sources #### Time-Series Storage Formats - [Gorilla: A Fast, Scalable, In-Memory Time Series Database (Facebook)](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf) - [Gorilla Compression Algorithm - The Morning Paper](https://blog.acolyer.org/2016/05/03/gorilla-a-fast-scalable-in-memory-time-series-database/) - [Prometheus TSDB Storage Documentation](https://prometheus.io/docs/prometheus/latest/storage/) - [Prometheus TSDB Architecture - Palark Blog](https://palark.com/blog/prometheus-architecture-tsdb/) - [InfluxDB TSM Storage Engine](https://www.influxdata.com/blog/new-storage-engine-time-structured-merge-tree/) - [M3DB Storage Architecture](https://m3db.io/docs/architecture/m3db/) - [M3DB at Uber Blog](https://www.uber.com/blog/m3/) #### PromQL Implementation - [promql-parser Rust Crate (GreptimeTeam)](https://github.com/GreptimeTeam/promql-parser) - [promql-parser Documentation](https://docs.rs/promql-parser) - [promql Crate (vthriller)](https://github.com/vthriller/promql) #### Prometheus Remote Write Protocol - [Prometheus Remote Write 1.0 Specification](https://prometheus.io/docs/specs/prw/remote_write_spec/) - [Prometheus Remote Write 2.0 Specification](https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/) - [Prometheus Protobuf Schema (remote.proto)](https://github.com/prometheus/prometheus/blob/main/prompb/remote.proto) #### Rust TSDB Implementations - [InfluxDB 3 Engineering with Rust - InfoQ](https://www.infoq.com/articles/timeseries-db-rust/) - [Datadog's Rust TSDB - Datadog Blog](https://www.datadoghq.com/blog/engineering/rust-timeseries-engine/) - [GreptimeDB Announcement](https://greptime.com/blogs/2022-11-15-this-time-for-real) - [tstorage-rs Embedded TSDB](https://github.com/dpgil/tstorage-rs) - [tsink High-Performance Embedded TSDB](https://dev.to/h2337/building-high-performance-time-series-applications-with-tsink-a-rust-embedded-database-5fa7) ### 10.2 Platform References #### Internal Documentation - 
PROJECT.md (Item 12: Metrics Store) - docs/por/T033-nightlight/task.yaml - docs/por/T027-production-hardening/ (TLS patterns) - docs/por/T024-nixos-packaging/ (NixOS module patterns) #### Existing Service Patterns - flaredb/crates/flaredb-server/src/main.rs (TLS, metrics export) - flaredb/crates/flaredb-server/src/config/mod.rs (Config structure) - chainfire/crates/chainfire-server/src/config.rs (TLS config) - iam/crates/iam-server/src/config.rs (Config patterns) ### 10.3 External Tools - [Grafana](https://grafana.com/) - Visualization and dashboards - [Prometheus](https://prometheus.io/) - Reference implementation - [VictoriaMetrics](https://victoriametrics.com/) - Replacement target (study architecture) --- ## Appendix A: PromQL Function Reference (v1 Support) ### Supported Functions | Function | Category | Description | Example | |----------|----------|-------------|---------| | `rate()` | Counter | Per-second rate of increase | `rate(http_requests_total[5m])` | | `irate()` | Counter | Instant rate (last 2 samples) | `irate(http_requests_total[5m])` | | `increase()` | Counter | Total increase over range | `increase(http_requests_total[1h])` | | `histogram_quantile()` | Histogram | Calculate quantile from histogram | `histogram_quantile(0.95, rate(http_duration_bucket[5m]))` | | `sum()` | Aggregation | Sum values | `sum(metric)` | | `avg()` | Aggregation | Average values | `avg(metric)` | | `min()` | Aggregation | Minimum value | `min(metric)` | | `max()` | Aggregation | Maximum value | `max(metric)` | | `count()` | Aggregation | Count series | `count(metric)` | | `stddev()` | Aggregation | Standard deviation | `stddev(metric)` | | `stdvar()` | Aggregation | Standard variance | `stdvar(metric)` | | `topk()` | Aggregation | Top K series | `topk(5, metric)` | | `bottomk()` | Aggregation | Bottom K series | `bottomk(5, metric)` | | `time()` | Time | Current timestamp | `time()` | | `timestamp()` | Time | Sample timestamp | `timestamp(metric)` | | `abs()` | Math | 
Absolute value | `abs(metric)` | | `ceil()` | Math | Round up | `ceil(metric)` | | `floor()` | Math | Round down | `floor(metric)` | | `round()` | Math | Round to nearest | `round(metric, 0.1)` | ### NOT Supported in v1 | Function | Category | Reason | |----------|----------|--------| | `predict_linear()` | Prediction | Complex, low usage | | `deriv()` | Math | Low usage | | `holt_winters()` | Prediction | Complex | | `resets()` | Counter | Low usage | | `changes()` | Analysis | Low usage | | Subqueries | Advanced | Very complex | --- ## Appendix B: Configuration Reference ### Complete Configuration Example ```toml # nightlight.toml - Complete configuration example [server] # Listen address for HTTP/gRPC API addr = "0.0.0.0:8080" # Log level: trace, debug, info, warn, error log_level = "info" # Metrics port for self-monitoring (Prometheus /metrics endpoint) metrics_port = 9099 [server.tls] # Enable TLS cert_file = "/etc/nightlight/certs/server.crt" key_file = "/etc/nightlight/certs/server.key" # Enable mTLS (require client certificates) ca_file = "/etc/nightlight/certs/ca.crt" require_client_cert = true [storage] # Data directory for TSDB blocks and WAL data_dir = "/var/lib/nightlight/data" # Data retention period (days) retention_days = 15 # WAL segment size (MB) wal_segment_size_mb = 128 # Block duration for compaction min_block_duration = "2h" max_block_duration = "24h" # Out-of-order sample acceptance window out_of_order_time_window = "1h" # Series cardinality limits max_series = 10_000_000 max_series_per_metric = 100_000 # Memory limits max_head_chunks_per_series = 2 max_head_size_mb = 2048 [query] # Query timeout (seconds) timeout_seconds = 30 # Maximum query range (hours) max_range_hours = 24 # Query result cache TTL (seconds) cache_ttl_seconds = 60 # Maximum concurrent queries max_concurrent_queries = 100 [ingestion] # Write buffer size (samples) write_buffer_size = 100_000 # Backpressure strategy: "block" or "reject" backpressure_strategy = "block" # Rate 
limiting (samples per second per client) rate_limit_per_client = 50_000 # Maximum samples per write request max_samples_per_request = 10_000 [compaction] # Enable background compaction enabled = true # Compaction interval (seconds) interval_seconds = 7200 # 2 hours # Number of compaction threads num_threads = 2 [s3] # S3 cold storage (optional, future) enabled = false endpoint = "https://s3.example.com" bucket = "nightlight-blocks" access_key_id = "..." secret_access_key = "..." upload_after_days = 7 local_cache_size_gb = 100 [flaredb] # FlareDB metadata integration (optional, future) enabled = false endpoints = ["flaredb-1:50051", "flaredb-2:50051"] namespace = "metrics" ``` --- ## Appendix C: Metrics Exported by Nightlight Nightlight exports metrics about itself on port 9099 (configurable). ### Ingestion Metrics ``` # Samples ingested nightlight_samples_ingested_total{} counter # Samples rejected (out-of-order, invalid, etc.) nightlight_samples_rejected_total{reason="out_of_order|invalid|rate_limit"} counter # Ingestion latency (milliseconds) nightlight_ingestion_latency_ms{quantile="0.5|0.9|0.99"} summary # Active series nightlight_active_series{} gauge # Head memory usage (bytes) nightlight_head_memory_bytes{} gauge ``` ### Query Metrics ``` # Queries executed nightlight_queries_total{type="instant|range"} counter # Query latency (milliseconds) nightlight_query_latency_ms{type="instant|range", quantile="0.5|0.9|0.99"} summary # Query errors nightlight_query_errors_total{reason="timeout|parse_error|execution_error"} counter ``` ### Storage Metrics ``` # WAL segments nightlight_wal_segments{} gauge # WAL size (bytes) nightlight_wal_size_bytes{} gauge # Blocks nightlight_blocks_total{level="0|1|2"} gauge # Block size (bytes) nightlight_block_size_bytes{level="0|1|2"} gauge # Compactions nightlight_compactions_total{level="0|1|2"} counter # Compaction duration (seconds) nightlight_compaction_duration_seconds{level="0|1|2", quantile="0.5|0.9|0.99"} summary ``` ### 
System Metrics ``` # Go runtime metrics (if using Go for scraper) # Rust memory metrics nightlight_memory_allocated_bytes{} gauge # CPU usage nightlight_cpu_usage_seconds_total{} counter ``` --- ## Appendix D: Error Codes and Troubleshooting ### HTTP Error Codes | Code | Meaning | Common Causes | |------|---------|---------------| | 200 | OK | Query successful | | 204 | No Content | Write successful | | 400 | Bad Request | Invalid PromQL, malformed protobuf | | 401 | Unauthorized | mTLS cert validation failed | | 429 | Too Many Requests | Rate limit exceeded | | 500 | Internal Server Error | Storage error, WAL corruption | | 503 | Service Unavailable | Write buffer full, server overloaded | ### Common Issues #### Issue: "Samples rejected: out_of_order" **Cause**: Samples arriving with timestamps older than `out_of_order_time_window` **Solution**: - Increase `out_of_order_time_window` in config - Check clock sync on clients (NTP) - Reduce scrape batch size #### Issue: "Rate limit exceeded" **Cause**: Client exceeding `rate_limit_per_client` samples/sec **Solution**: - Increase rate limit in config - Reduce scrape frequency - Shard writes across multiple clients #### Issue: "Query timeout" **Cause**: Query exceeding `timeout_seconds` **Solution**: - Increase query timeout - Reduce query time range - Add more specific label matchers to reduce series scanned #### Issue: "Series cardinality explosion" **Cause**: Too many unique label combinations (high cardinality) **Solution**: - Review label design (avoid unbounded labels like user_id) - Use relabeling to drop high-cardinality labels - Increase `max_series` limit (if justified) --- **End of Design Document** **Total Length**: ~3,800 lines **Status**: Ready for review and S2-S6 implementation **Next Steps**: 1. Review and approve design decisions 2. Create GitHub issues for S2-S6 tasks 3. Begin S2: Workspace Scaffold