# Nightlight Design Document

**Project:** Nightlight - VictoriaMetrics OSS Replacement
**Task:** T033.S1 Research & Architecture
**Version:** 1.0
**Date:** 2025-12-10
**Author:** PeerB

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [Requirements](#2-requirements)
3. [Time-Series Storage Model](#3-time-series-storage-model)
4. [Push Ingestion API](#4-push-ingestion-api)
5. [PromQL Query Engine](#5-promql-query-engine)
6. [Storage Backend Architecture](#6-storage-backend-architecture)
7. [Integration Points](#7-integration-points)
8. [Implementation Plan](#8-implementation-plan)
9. [Open Questions](#9-open-questions)
10. [References](#10-references)

---
## 1. Executive Summary

### 1.1 Overview

Nightlight is a fully open-source, distributed time-series database designed as a replacement for VictoriaMetrics, addressing the critical requirement that VictoriaMetrics' mTLS support is a paid feature. As the final component (Item 12/12) of PROJECT.md, Nightlight completes the observability stack for the Japanese cloud platform.

### 1.2 High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         Service Mesh                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ FlareDB  │  │ ChainFire│  │ PlasmaVMC│  │   IAM    │  ...    │
│  │  :9092   │  │  :9091   │  │  :9093   │  │  :9094   │         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │                                           │
│                     │ Push (remote_write)                       │
│                     │ mTLS                                      │
│                     ▼                                           │
│         ┌──────────────────────┐                                │
│         │  Nightlight Server   │                                │
│         │  ┌────────────────┐  │                                │
│         │  │ Ingestion API  │  │  ← Prometheus remote_write     │
│         │  │  (gRPC/HTTP)   │  │                                │
│         │  └────────┬───────┘  │                                │
│         │           │          │                                │
│         │  ┌────────▼───────┐  │                                │
│         │  │  Write Buffer  │  │                                │
│         │  │  (In-Memory)   │  │                                │
│         │  └────────┬───────┘  │                                │
│         │           │          │                                │
│         │  ┌────────▼───────┐  │                                │
│         │  │ Storage Engine │  │                                │
│         │  │  ┌──────────┐  │  │                                │
│         │  │  │   Head   │  │  │  ← WAL + In-Memory Index       │
│         │  │  │ (Active) │  │  │                                │
│         │  │  └────┬─────┘  │  │                                │
│         │  │       │        │  │                                │
│         │  │  ┌────▼─────┐  │  │                                │
│         │  │  │  Blocks  │  │  │  ← Immutable, Compressed       │
│         │  │  │  (TSDB)  │  │  │                                │
│         │  │  └──────────┘  │  │                                │
│         │  └────────────────┘  │                                │
│         │           │          │                                │
│         │  ┌────────▼───────┐  │                                │
│         │  │  Query Engine  │  │  ← PromQL Execution            │
│         │  │  (PromQL AST)  │  │                                │
│         │  └────────┬───────┘  │                                │
│         │           │          │                                │
│         └───────────┼──────────┘                                │
│                     │                                           │
│                     │ Query (HTTP)                              │
│                     │ mTLS                                      │
│                     ▼                                           │
│         ┌──────────────────────┐                                │
│         │  Grafana / Clients   │                                │
│         └──────────────────────┘                                │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────┐
│   FlareDB Cluster   │  ← Metadata (optional)
│  (Metadata Store)   │
└─────────────────────┘

┌─────────────────────┐
│   S3-Compatible     │  ← Cold Storage (future)
│   Object Storage    │
└─────────────────────┘
```

### 1.3 Key Design Decisions

1. **Storage Format**: Hybrid approach using the Prometheus TSDB block design with Gorilla compression
   - **Rationale**: Battle-tested, excellent compression (1-2 bytes/sample), widely understood

2. **Storage Backend**: Dedicated time-series engine with optional FlareDB metadata integration
   - **Rationale**: Time-series workloads have unique access patterns; KV stores are not optimal for sample storage
   - FlareDB is reserved for metadata (series labels, index) in distributed scenarios

3. **PromQL Subset**: Support 80% of common use cases (instant/range queries, basic aggregations, rate/increase)
   - **Rationale**: Full PromQL compatibility is complex; focus on practical operator needs

4. **Push Model**: Prometheus remote_write v1.0 protocol via HTTP + gRPC APIs
   - **Rationale**: Standard protocol, built-in Snappy compression, client library availability

5. **mTLS Integration**: Consistent with T027/T031 patterns (cert_file, key_file, ca_file, require_client_cert)
   - **Rationale**: Unified security model across all platform services

### 1.4 Success Criteria

- Accept metrics from 8+ services (ports 9091-9099) via remote_write
- Query latency <100ms for instant queries (p95)
- Compression ratio ≥10:1 (target: 1.5-2 bytes/sample)
- Support 100K samples/sec write throughput per instance
- PromQL queries cover 80% of Grafana dashboard use cases
- Zero vendor lock-in (100% OSS, no paid features)

---

## 2. Requirements

### 2.1 Functional Requirements

#### FR-1: Push-Based Metric Ingestion
- **FR-1.1**: Accept Prometheus remote_write v1.0 protocol (HTTP POST)
- **FR-1.2**: Support Snappy-compressed protobuf payloads
- **FR-1.3**: Validate metric names and labels per Prometheus naming conventions
- **FR-1.4**: Handle out-of-order samples within a configurable time window (default: 1h)
- **FR-1.5**: Deduplicate samples with the same timestamp and label set
- **FR-1.6**: Return backpressure signals (HTTP 429/503) when the buffer is full

#### FR-2: PromQL Query Engine
- **FR-2.1**: Support instant queries (`/api/v1/query`)
- **FR-2.2**: Support range queries (`/api/v1/query_range`)
- **FR-2.3**: Support label queries (`/api/v1/label/<name>/values`, `/api/v1/labels`)
- **FR-2.4**: Support series metadata queries (`/api/v1/series`)
- **FR-2.5**: Implement core PromQL functions (see Section 5.2)
- **FR-2.6**: Support the Prometheus HTTP API JSON response format

#### FR-3: Time-Series Storage
- **FR-3.1**: Store samples with millisecond timestamp precision
- **FR-3.2**: Support configurable retention periods (default: 15 days, configurable 1-365 days)
- **FR-3.3**: Automatic background compaction of blocks
- **FR-3.4**: Crash recovery via Write-Ahead Log (WAL)
- **FR-3.5**: Series cardinality limits to prevent explosion (default: 10M series)

#### FR-4: Security & Authentication
- **FR-4.1**: mTLS support for ingestion and query APIs
- **FR-4.2**: Optional basic authentication for HTTP endpoints
- **FR-4.3**: Rate limiting per client (based on mTLS certificate CN or IP)

#### FR-5: Operational Features
- **FR-5.1**: Prometheus-compatible `/metrics` endpoint for self-monitoring
- **FR-5.2**: Health check endpoints (`/health`, `/ready`)
- **FR-5.3**: Admin API for series deletion and compaction triggers
- **FR-5.4**: TOML configuration file support
- **FR-5.5**: Environment variable overrides

### 2.2 Non-Functional Requirements

#### NFR-1: Performance
- **NFR-1.1**: Ingestion throughput: ≥100K samples/sec per instance
- **NFR-1.2**: Query latency (p95): <100ms for instant queries, <500ms for range queries (1h window)
- **NFR-1.3**: Compression ratio: ≥10:1 (target: 1.5-2 bytes/sample)
- **NFR-1.4**: Memory usage: <2GB for 1M active series
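
As a sanity check on NFR-1.3: an uncompressed sample is 16 bytes (int64 millisecond timestamp + float64 value), so the 10:1 floor corresponds to 1.6 bytes/sample, inside the stated target band. A trivial check:

```rust
fn main() {
    // Raw sample: 8-byte int64 timestamp + 8-byte float64 value = 16 bytes.
    let raw_bytes = (std::mem::size_of::<i64>() + std::mem::size_of::<f64>()) as f64;
    assert_eq!(raw_bytes, 16.0);

    // Bytes/sample required to hit the 10:1 ratio floor...
    let needed = raw_bytes / 10.0;
    assert_eq!(needed, 1.6);

    // ...which sits inside the stated 1.5-2.0 bytes/sample target band.
    assert!((1.5..=2.0).contains(&needed));
}
```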

#### NFR-2: Scalability
- **NFR-2.1**: Vertical scaling: support 10M active series per instance
- **NFR-2.2**: Horizontal scaling: support sharding across multiple instances (future work)
- **NFR-2.3**: Storage: support local disk plus an optional S3-compatible backend for cold data

#### NFR-3: Reliability
- **NFR-3.1**: No data loss for committed samples (WAL durability)
- **NFR-3.2**: Graceful degradation under load (reject writes with backpressure rather than crashing)
- **NFR-3.3**: Crash recovery time: <30s for 10M series

#### NFR-4: Maintainability
- **NFR-4.1**: Codebase consistency with other platform services (FlareDB, ChainFire patterns)
- **NFR-4.2**: 100% Rust, no native C dependencies
- **NFR-4.3**: Comprehensive unit and integration tests
- **NFR-4.4**: Operator documentation with runbooks

#### NFR-5: Compatibility
- **NFR-5.1**: Prometheus remote_write v1.0 protocol compatibility
- **NFR-5.2**: Prometheus HTTP API compatibility (subset: query, query_range, labels, series)
- **NFR-5.3**: Grafana data source compatibility

### 2.3 Out of Scope (Explicitly Not Supported in v1)

- Prometheus remote_read protocol (pull-based; the platform uses push)
- Full PromQL compatibility (complex subqueries, advanced functions)
- Multi-tenancy (single tenant per instance; use multiple instances for multi-tenant setups)
- Distributed query federation (single-instance queries only)
- Recording rules and alerting (use a separate Prometheus/Alertmanager for this)

---

## 3. Time-Series Storage Model

### 3.1 Data Model

#### 3.1.1 Metric Structure

A time-series metric in Nightlight follows the Prometheus data model:

```
metric_name{label1="value1", label2="value2", ...} value timestamp
```

**Example:**
```
http_requests_total{method="GET", status="200", service="flaredb"} 1543 1733832000000
```

Components:
- **Metric Name**: Identifier for the measurement (e.g., `http_requests_total`)
  - Must match the regex `[a-zA-Z_:][a-zA-Z0-9_:]*`
- **Labels**: Key-value pairs for dimensionality (e.g., `{method="GET", status="200"}`)
  - Label names: `[a-zA-Z_][a-zA-Z0-9_]*`
  - Label values: any UTF-8 string
  - Reserved labels: `__name__` (stores the metric name); labels starting with `__` are internal
- **Value**: float64 sample value
- **Timestamp**: Millisecond precision (int64 milliseconds since the Unix epoch)
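
The metric-name rule above can be checked without a regex crate; a minimal sketch (`is_valid_metric_name` is an illustrative helper, not part of the codebase — note metric names, unlike label names, allow colons):

```rust
/// Check a metric name against the Prometheus pattern
/// [a-zA-Z_:][a-zA-Z0-9_:]* without a regex dependency.
fn is_valid_metric_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':' => {}
        _ => return false, // empty, or invalid first character
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':')
}

fn main() {
    assert!(is_valid_metric_name("http_requests_total"));
    assert!(is_valid_metric_name("node:cpu:rate5m")); // colons allowed in metric names
    assert!(!is_valid_metric_name("2xx_responses"));  // may not start with a digit
    assert!(!is_valid_metric_name(""));
}
```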

#### 3.1.2 Series Identification

A **series** is uniquely identified by its metric name + label set:

```rust
// Pseudo-code representation
struct SeriesID {
    hash: u64, // FNV-1a hash of sorted labels
}

struct Series {
    id: SeriesID,
    labels: BTreeMap<String, String>, // Sorted for consistent hashing
    chunks: Vec<ChunkRef>,
}
```

Series ID calculation:
1. Sort labels lexicographically (including the `__name__` label)
2. Concatenate as `label1_name + \0 + label1_value + \0 + label2_name + \0 + ...`
3. Compute the FNV-1a 64-bit hash

### 3.2 Storage Format

#### 3.2.1 Architecture Overview

Nightlight uses a **hybrid storage architecture** inspired by Prometheus TSDB and Gorilla:

```
┌─────────────────────────────────────────────────────────────────┐
│                      Memory Layer (Head)                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │  Series Map  │  │  WAL Segment │  │ Write Buffer │           │
│  │  (In-Memory  │  │    (Disk)    │  │ (MPSC Queue) │           │
│  │    Index)    │  │              │  │              │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
│         │                 │                 │                   │
│         └─────────────────┴─────────────────┘                   │
│                           │                                     │
│                           ▼                                     │
│              ┌─────────────────────────┐                        │
│              │      Active Chunks      │                        │
│              │  (Gorilla-compressed)   │                        │
│              │  - 2h time windows      │                        │
│              │  - Delta-of-delta TS    │                        │
│              │  - XOR float encoding   │                        │
│              └─────────────────────────┘                        │
└─────────────────────────────────────────────────────────────────┘
                            │
                            │ Compaction Trigger
                            │ (every 2h or on shutdown)
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Disk Layer (Blocks)                       │
│                                                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │    Block 1      │  │    Block 2      │  │    Block N      │  │
│  │   [0h - 2h)     │  │   [2h - 4h)     │  │  [Nh - (N+2)h)  │  │
│  │                 │  │                 │  │                 │  │
│  │  ├─ meta.json   │  │  ├─ meta.json   │  │  ├─ meta.json   │  │
│  │  ├─ index       │  │  ├─ index       │  │  ├─ index       │  │
│  │  ├─ chunks/000  │  │  ├─ chunks/000  │  │  ├─ chunks/000  │  │
│  │  └─ tombstones  │  │  └─ tombstones  │  │  └─ tombstones  │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

#### 3.2.2 Write-Ahead Log (WAL)

**Purpose**: Durability and crash recovery

**Format**: Append-only log segments (128MB default size)

```
WAL Structure:
data/
  wal/
    00000001   ← Segment 1 (128MB)
    00000002   ← Segment 2 (active)
```

**WAL Record Format** (inspired by LevelDB):

```
┌───────────────────────────────────────────────────┐
│ CRC32 (4 bytes)                                   │
├───────────────────────────────────────────────────┤
│ Length (4 bytes, little-endian)                   │
├───────────────────────────────────────────────────┤
│ Type (1 byte): FULL | FIRST | MIDDLE | LAST       │
├───────────────────────────────────────────────────┤
│ Payload (variable):                               │
│   - Record Type (1 byte): Series | Samples        │
│   - Series ID (8 bytes)                           │
│   - Labels (length-prefixed strings)              │
│   - Samples (varint timestamp, float64 value)     │
└───────────────────────────────────────────────────┘
```

**WAL Operations**:
- **Append**: Every write appends to the active segment
- **Checkpoint**: Snapshot of in-memory state to disk blocks
- **Truncate**: Delete segments older than the oldest in-memory data
- **Replay**: On startup, replay WAL segments to rebuild in-memory state

**Rust Implementation Sketch**:

```rust
struct WAL {
    dir: PathBuf,
    segment_size: usize, // 128MB default
    active_segment: File,
    active_segment_num: u64,
}

impl WAL {
    fn append(&mut self, record: &WALRecord) -> Result<()> {
        let encoded = record.encode();
        let crc = crc32(&encoded);

        // Rotate the segment if this record would overflow it
        // (8 bytes of framing: 4-byte CRC + 4-byte length).
        if self.active_segment.metadata()?.len() as usize + encoded.len() + 8 > self.segment_size {
            self.rotate_segment()?;
        }

        self.active_segment.write_all(&crc.to_le_bytes())?;
        self.active_segment.write_all(&(encoded.len() as u32).to_le_bytes())?;
        self.active_segment.write_all(&encoded)?;
        self.active_segment.sync_all()?; // fsync for durability
        Ok(())
    }

    fn replay(&self) -> Result<Vec<WALRecord>> {
        // Read all segments and decode records.
        // Used on startup for crash recovery.
        todo!()
    }
}
```

#### 3.2.3 In-Memory Head Block

**Purpose**: Accept recent writes, keep hot data in memory for fast queries

**Structure**:

```rust
struct Head {
    series: RwLock<HashMap<SeriesID, Arc<Series>>>,
    min_time: AtomicI64,
    max_time: AtomicI64,
    chunk_size: Duration, // 2h default
    wal: Arc<WAL>,
}

struct Series {
    id: SeriesID,
    labels: BTreeMap<String, String>,
    chunks: RwLock<Vec<Chunk>>,
}

struct Chunk {
    min_time: i64,
    max_time: i64,
    samples: CompressedSamples, // Gorilla encoding
}
```

**Chunk Lifecycle**:
1. **Creation**: A new chunk is created when the first sample arrives or the previous chunk is full
2. **Active**: The chunk accepts samples in the time window [min_time, min_time + 2h)
3. **Full**: The chunk reaches its 2h window; a new chunk is created for subsequent samples
4. **Compaction**: Full chunks are compacted to disk blocks

**Memory Limits**:
- Max series: 10M (configurable)
- Max chunks per series: 2 (active + previous, covering 4h)
- Eviction: LRU eviction of inactive series (no samples in 4h)

#### 3.2.4 Disk Blocks (Immutable)

**Purpose**: Long-term storage of compacted time-series data

**Block Structure** (inspired by Prometheus TSDB):

```
data/
  01HQZQZQZQZQZQZQZQZQZQ/   ← Block directory (ULID)
    meta.json               ← Metadata
    index                   ← Inverted index
    chunks/
      000001                ← Chunk file
      000002
      ...
    tombstones              ← Deleted series/samples
```

**meta.json Format**:

```json
{
  "ulid": "01HQZQZQZQZQZQZQZQZQZQ",
  "minTime": 1733832000000,
  "maxTime": 1733839200000,
  "stats": {
    "numSamples": 1500000,
    "numSeries": 5000,
    "numChunks": 10000
  },
  "compaction": {
    "level": 1,
    "sources": ["01HQZQZ..."]
  },
  "version": 1
}
```

**Index File Format** (simplified):

The index file provides fast lookups of series by labels.

```
┌────────────────────────────────────────────────┐
│ Magic Number (4 bytes): 0xBADA55A0             │
├────────────────────────────────────────────────┤
│ Version (1 byte): 1                            │
├────────────────────────────────────────────────┤
│ Symbol Table Section                           │
│   - Sorted strings (label names/values)        │
│   - Offset table for binary search             │
├────────────────────────────────────────────────┤
│ Series Section                                 │
│   - SeriesID → Chunk Refs mapping              │
│   - (series_id, labels, chunk_offsets)         │
├────────────────────────────────────────────────┤
│ Label Index Section (Inverted Index)           │
│   - label_name → [series_ids]                  │
│   - (label_name, label_value) → [series_ids]   │
├────────────────────────────────────────────────┤
│ Postings Section                               │
│   - Sorted posting lists for label matchers    │
│   - Compressed with varint + bit packing       │
├────────────────────────────────────────────────┤
│ TOC (Table of Contents)                        │
│   - Offsets to each section                    │
└────────────────────────────────────────────────┘
```
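
At query time, each label matcher resolves to a posting list (a sorted list of series IDs), and a conjunction of matchers is the intersection of those lists. A std-only sketch of that merge step, on an in-memory representation chosen for illustration:

```rust
/// Intersect two sorted, deduplicated posting lists of series IDs,
/// as a query planner would when ANDing two label matchers.
fn intersect_postings(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    // Series matching {service="flaredb"} and {status="200"} respectively.
    let by_service = [3, 7, 9, 12, 40];
    let by_status = [1, 7, 12, 15];
    assert_eq!(intersect_postings(&by_service, &by_status), vec![7, 12]);
}
```

The linear merge runs in O(|a| + |b|); on-disk posting lists stay varint-compressed and are decoded lazily.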

**Chunks File Format**:

```
Chunk File (chunks/000001):
┌────────────────────────────────────────────────┐
│ Chunk 1:                                       │
│   ├─ Length (4 bytes)                          │
│   ├─ Encoding (1 byte): Gorilla = 0x01         │
│   ├─ MinTime (8 bytes)                         │
│   ├─ MaxTime (8 bytes)                         │
│   ├─ NumSamples (4 bytes)                      │
│   └─ Compressed Data (variable)                │
├────────────────────────────────────────────────┤
│ Chunk 2: ...                                   │
└────────────────────────────────────────────────┘
```

### 3.3 Compression Strategy

#### 3.3.1 Gorilla Compression Algorithm

Nightlight uses **Gorilla compression** from Facebook's VLDB 2015 paper, which achieves roughly 12x compression.

**Timestamp Compression (Delta-of-Delta)**:

```
Example timestamps (ms):
  t0 = 1733832000000
  t1 = 1733832015000   (Δ1 = 15000)
  t2 = 1733832030000   (Δ2 = 15000)
  t3 = 1733832045000   (Δ3 = 15000)

Delta-of-delta:
  Δ1 is the first delta and is stored directly (14 bits in Gorilla's header)
  D2 = Δ2 - Δ1 = 15000 - 15000 = 0   → encode in 1 bit (0)
  D3 = Δ3 - Δ2 = 15000 - 15000 = 0   → encode in 1 bit (0)

Encoding:
- If D = 0: write 1 bit "0"
- If D in [-63, 64): write "10" + 7 bits
- If D in [-255, 256): write "110" + 9 bits
- If D in [-2047, 2048): write "1110" + 12 bits
- Otherwise: write "1111" + 32 bits

96% of timestamps compress to 1 bit!
```

**Value Compression (XOR Encoding)**:

```
Example values (float64):
  v0 = 1543.0   (bits: 0x40981C0000000000)
  v1 = 1543.5   (bits: 0x40981E0000000000)
  v2 = 1543.7

XOR compression:
  XOR(v0, v1) = 0x40981C0000000000 XOR 0x40981E0000000000
              = 0x0000020000000000
  → Leading zeros: 22, trailing zeros: 41, significant bits: 1
  → Encode: control bits "11" + 5-bit leading count + 6-bit length + 1 significant bit

  XOR(v1, v2) → similar pattern, encoded with control bits

Encoding:
- If v_i == v_(i-1): write 1 bit "0"
- If the XOR fits the previous leading/trailing-zero window: write "10" + significant bits
- Otherwise: write "11" + 5-bit leading count + 6-bit length + significant bits

51% of values compress to 1 bit!
```
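
The worked example can be checked mechanically with std-only Rust:

```rust
fn main() {
    let v0: f64 = 1543.0;
    let v1: f64 = 1543.5;

    // XOR of the raw IEEE-754 bit patterns of consecutive values.
    let xor = v0.to_bits() ^ v1.to_bits();

    assert_eq!(xor, 0x0000_0200_0000_0000);
    assert_eq!(xor.leading_zeros(), 22);
    assert_eq!(xor.trailing_zeros(), 41);

    // Only one meaningful bit survives between the two zero runs.
    let significant_bits = 64 - xor.leading_zeros() - xor.trailing_zeros();
    assert_eq!(significant_bits, 1);
}
```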

**Rust Implementation Sketch**:

```rust
struct GorillaEncoder {
    bit_writer: BitWriter,
    prev_timestamp: i64,
    prev_delta: i64,
    prev_value: f64,
    prev_leading_zeros: u8,
    prev_trailing_zeros: u8,
}

impl GorillaEncoder {
    fn encode_timestamp(&mut self, timestamp: i64) -> Result<()> {
        let delta = timestamp - self.prev_timestamp;
        let delta_of_delta = delta - self.prev_delta;

        // Note: negative values rely on the writer masking to the stated width.
        if delta_of_delta == 0 {
            self.bit_writer.write_bit(0)?;
        } else if delta_of_delta >= -63 && delta_of_delta < 64 {
            self.bit_writer.write_bits(0b10, 2)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 7)?;
        } else if delta_of_delta >= -255 && delta_of_delta < 256 {
            self.bit_writer.write_bits(0b110, 3)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 9)?;
        } else if delta_of_delta >= -2047 && delta_of_delta < 2048 {
            self.bit_writer.write_bits(0b1110, 4)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 12)?;
        } else {
            self.bit_writer.write_bits(0b1111, 4)?;
            self.bit_writer.write_bits(delta_of_delta as u64, 32)?;
        }

        self.prev_timestamp = timestamp;
        self.prev_delta = delta;
        Ok(())
    }

    fn encode_value(&mut self, value: f64) -> Result<()> {
        let bits = value.to_bits();
        let xor = bits ^ self.prev_value.to_bits();

        if xor == 0 {
            self.bit_writer.write_bit(0)?;
        } else {
            let leading = xor.leading_zeros() as u8;
            let trailing = xor.trailing_zeros() as u8;

            if leading >= self.prev_leading_zeros && trailing >= self.prev_trailing_zeros {
                // Control "10": reuse the previous window. The decoder only
                // knows the previous leading/length, so the bit count and
                // shift must come from the *previous* window, not this XOR.
                self.bit_writer.write_bits(0b10, 2)?;
                let significant_bits = 64 - self.prev_leading_zeros - self.prev_trailing_zeros;
                let mask = (1u64 << significant_bits) - 1;
                let significant = (xor >> self.prev_trailing_zeros) & mask;
                self.bit_writer.write_bits(significant, significant_bits as usize)?;
            } else {
                // Control "11": emit a new window (5-bit leading count,
                // 6-bit significant length), then the meaningful bits.
                let significant_bits = 64 - leading - trailing;
                self.bit_writer.write_bits(0b11, 2)?;
                self.bit_writer.write_bits(leading as u64, 5)?;
                self.bit_writer.write_bits(significant_bits as u64, 6)?;
                let mask = (1u64 << significant_bits) - 1;
                let significant = (xor >> trailing) & mask;
                self.bit_writer.write_bits(significant, significant_bits as usize)?;

                self.prev_leading_zeros = leading;
                self.prev_trailing_zeros = trailing;
            }
        }

        self.prev_value = value;
        Ok(())
    }
}
```

#### 3.3.2 Compression Performance Targets

Based on research and production systems:

| Metric | Target | Reference |
|--------|--------|-----------|
| Average bytes/sample | 1.5-2.0 | Prometheus (1-2), Gorilla (1.37), M3DB (1.45) |
| Compression ratio | 10-12x | Gorilla (12x), InfluxDB TSM (45x for specific workloads) |
| Encode throughput | >500K samples/sec | Gorilla paper: 700K/sec |
| Decode throughput | >1M samples/sec | Gorilla paper: 1.2M/sec |

### 3.4 Retention and Compaction Policies

#### 3.4.1 Retention Policy

**Default Retention**: 15 days

**Configurable Parameters**:

```toml
[storage]
retention_days = 15          # Keep data for 15 days
min_block_duration = "2h"    # Minimum block size
max_block_duration = "24h"   # Maximum block size after compaction
```

**Retention Enforcement**:
- A background task runs every 1h
- Deletes blocks where `max_time < now() - retention_duration`
- Deletes old WAL segments

#### 3.4.2 Compaction Strategy

**Purpose**:
1. Merge small blocks into larger blocks (reduce file count)
2. Remove deleted samples (tombstones)
3. Improve query performance (fewer blocks to scan)

**Compaction Levels** (inspired by LevelDB):

```
Level 0:  2h blocks (compacted from Head)
Level 1: 12h blocks (merge 6 L0 blocks)
Level 2: 24h blocks (merge 2 L1 blocks)
```

**Compaction Trigger**:
- **Time-based**: Every 2h, compact Head → Level 0 block
- **Count-based**: When L0 has >4 blocks, compact → L1
- **Manual**: Admin API endpoint `/api/v1/admin/compact`

**Compaction Algorithm**:

```
1. Select blocks to compact (same level, adjacent time ranges)
2. Create a new block directory (ULID)
3. Iterate all series in the selected blocks:
   a. Merge chunks from all blocks
   b. Apply tombstones (skip deleted samples)
   c. Re-compress merged chunks
   d. Write to the new block's chunks file
4. Build the new index (merge posting lists)
5. Write meta.json
6. Atomically rename the block directory
7. Delete the source blocks
```

**Rust Implementation Sketch**:

```rust
struct Compactor {
    data_dir: PathBuf,
    retention: Duration,
}

impl Compactor {
    async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID> {
        let block_id = ULID::new();
        let block_dir = self.data_dir.join(block_id.to_string());
        // Create the block directory including the chunks/ subdirectory.
        std::fs::create_dir_all(block_dir.join("chunks"))?;

        let mut index_writer = IndexWriter::new(&block_dir.join("index"))?;
        let mut chunk_writer = ChunkWriter::new(&block_dir.join("chunks/000001"))?;

        let series_map = head.series.read().await;
        for (series_id, series) in series_map.iter() {
            let chunks = series.chunks.read().await;
            for chunk in chunks.iter() {
                if chunk.is_full() {
                    let chunk_ref = chunk_writer.write_chunk(&chunk.samples)?;
                    index_writer.add_series(*series_id, &series.labels, chunk_ref)?;
                }
            }
        }

        index_writer.finalize()?;
        chunk_writer.finalize()?;

        let meta = BlockMeta {
            ulid: block_id,
            min_time: head.min_time.load(Ordering::Relaxed),
            max_time: head.max_time.load(Ordering::Relaxed),
            stats: compute_stats(&block_dir)?,
            compaction: CompactionMeta { level: 0, sources: vec![] },
            version: 1,
        };
        write_meta(&block_dir.join("meta.json"), &meta)?;

        Ok(block_id)
    }

    async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID> {
        // Merge multiple blocks into one.
        // Similar to compact_head_to_l0, but reads from existing blocks.
        todo!()
    }

    async fn enforce_retention(&self) -> Result<()> {
        let cutoff = SystemTime::now() - self.retention;
        let cutoff_ms = cutoff.duration_since(UNIX_EPOCH)?.as_millis() as i64;

        for entry in std::fs::read_dir(&self.data_dir)? {
            let path = entry?.path();
            if !path.is_dir() { continue; }

            let meta_path = path.join("meta.json");
            if !meta_path.exists() { continue; }

            let meta: BlockMeta = serde_json::from_reader(File::open(meta_path)?)?;
            if meta.max_time < cutoff_ms {
                std::fs::remove_dir_all(&path)?;
                info!("Deleted expired block: {}", meta.ulid);
            }
        }
        Ok(())
    }
}
```

---

## 4. Push Ingestion API

### 4.1 Prometheus Remote Write Protocol

#### 4.1.1 Protocol Overview

- **Specification**: Prometheus Remote Write v1.0
- **Transport**: HTTP/1.1 or HTTP/2
- **Encoding**: Protocol Buffers (protobuf v3)
- **Compression**: Snappy (required)

**Reference**: [Prometheus Remote Write Spec](https://prometheus.io/docs/specs/prw/remote_write_spec/)

#### 4.1.2 HTTP Endpoint

```
POST /api/v1/write
Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Prometheus-Remote-Write-Version: 0.1.0
```

**Request Flow**:

```
┌──────────────┐
│    Client    │
│ (Prometheus, │
│   FlareDB,   │
│   etc.)      │
└──────┬───────┘
       │
       │ 1. Collect samples
       │
       ▼
┌──────────────────────────────────┐
│ Encode to WriteRequest protobuf  │
│ message                          │
└──────┬───────────────────────────┘
       │
       │ 2. Compress with Snappy
       │
       ▼
┌──────────────────────────────────┐
│ HTTP POST to /api/v1/write       │
│ with mTLS authentication         │
└──────┬───────────────────────────┘
       │
       │ 3. Send request
       │
       ▼
┌──────────────────────────────────┐
│ Nightlight Server                │
│  ├─ Validate mTLS cert           │
│  ├─ Decompress Snappy            │
│  ├─ Decode protobuf              │
│  ├─ Validate samples             │
│  ├─ Append to WAL                │
│  └─ Insert into Head             │
└──────┬───────────────────────────┘
       │
       │ 4. Response
       │
       ▼
┌──────────────────────────────────┐
│ HTTP Response:                   │
│   2xx (success)                  │
│   400 Bad Request (invalid)      │
│   429 Too Many Requests          │
│       (backpressure)             │
│   503 Service Unavailable        │
│       (overload)                 │
└──────────────────────────────────┘
```

#### 4.1.3 Protobuf Schema

**File**: `proto/remote_write.proto`

```protobuf
syntax = "proto3";

package nightlight.remote;

// Prometheus remote_write compatible schema

message WriteRequest {
  repeated TimeSeries timeseries = 1;
  // Metadata is optional and not used in v1
  repeated MetricMetadata metadata = 2;
}

message TimeSeries {
  repeated Label labels = 1;
  repeated Sample samples = 2;
  // Exemplars are optional (not supported in v1)
  repeated Exemplar exemplars = 3;
}

message Label {
  string name = 1;
  string value = 2;
}

message Sample {
  double value = 1;
  int64 timestamp = 2; // Unix timestamp in milliseconds
}

message Exemplar {
  repeated Label labels = 1;
  double value = 2;
  int64 timestamp = 3;
}

message MetricMetadata {
  enum MetricType {
    UNKNOWN = 0;
    COUNTER = 1;
    GAUGE = 2;
    HISTOGRAM = 3;
    GAUGEHISTOGRAM = 4;
    SUMMARY = 5;
    INFO = 6;
    STATESET = 7;
  }
  MetricType type = 1;
  string metric_family_name = 2;
  string help = 3;
  string unit = 4;
}
```

**Generated Rust Code** (using `prost`):

```toml
# Cargo.toml
[dependencies]
prost = "0.12"
prost-types = "0.12"

[build-dependencies]
prost-build = "0.12"
```
```rust
// build.rs
fn main() {
    prost_build::compile_protos(&["proto/remote_write.proto"], &["proto/"]).unwrap();
}
```

#### 4.1.4 Ingestion Handler

**Rust Implementation**:

```rust
|
|
use axum::{
|
|
Router,
|
|
routing::post,
|
|
extract::State,
|
|
http::StatusCode,
|
|
body::Bytes,
|
|
};
|
|
use prost::Message;
|
|
use snap::raw::Decoder as SnappyDecoder;
|
|
|
|
mod remote_write_pb {
|
|
include!(concat!(env!("OUT_DIR"), "/nightlight.remote.rs"));
|
|
}
|
|
|
|
struct IngestionService {
|
|
head: Arc<Head>,
|
|
wal: Arc<WAL>,
|
|
rate_limiter: Arc<RateLimiter>,
|
|
}
|
|
|
|
async fn handle_remote_write(
    State(service): State<Arc<IngestionService>>,
    body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
    // 1. Decompress Snappy
    let mut decoder = SnappyDecoder::new();
    let decompressed = decoder
        .decompress_vec(&body)
        .map_err(|e| (StatusCode::BAD_REQUEST, format!("Snappy decompression failed: {}", e)))?;

    // 2. Decode protobuf
    let write_req = remote_write_pb::WriteRequest::decode(&decompressed[..])
        .map_err(|e| (StatusCode::BAD_REQUEST, format!("Protobuf decode failed: {}", e)))?;

    // 3. Validate and ingest
    let mut samples_ingested = 0;
    let mut samples_rejected = 0;

    for ts in write_req.timeseries.iter() {
        // Validate labels
        let labels = validate_labels(&ts.labels)
            .map_err(|e| (StatusCode::BAD_REQUEST, e))?;

        let series_id = compute_series_id(&labels);

        for sample in ts.samples.iter() {
            // Validate timestamp (not too old, not too far in the future)
            if !is_valid_timestamp(sample.timestamp) {
                samples_rejected += 1;
                continue;
            }

            // Check rate limit
            if !service.rate_limiter.allow() {
                return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
            }

            // Append to WAL
            let wal_record = WALRecord::Sample {
                series_id,
                timestamp: sample.timestamp,
                value: sample.value,
            };
            service.wal.append(&wal_record)
                .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("WAL append failed: {}", e)))?;

            // Insert into Head: out-of-order samples are counted and skipped;
            // other errors abort the request
            match service.head.append(series_id, labels.clone(), sample.timestamp, sample.value).await {
                Ok(()) => samples_ingested += 1,
                Err(e) if e.to_string().contains("out of order") => {
                    samples_rejected += 1;
                }
                Err(_) if false => unreachable!(),
                Err(e) if e.to_string().contains("buffer full") => {
                    return Err((StatusCode::SERVICE_UNAVAILABLE, "Write buffer full".into()));
                }
                Err(e) => {
                    return Err((StatusCode::INTERNAL_SERVER_ERROR, format!("Insert failed: {}", e)));
                }
            }
        }
    }

    info!("Ingested {} samples, rejected {}", samples_ingested, samples_rejected);
    Ok(StatusCode::NO_CONTENT) // 204 No Content on success
}

fn validate_labels(labels: &[remote_write_pb::Label]) -> Result<BTreeMap<String, String>, String> {
    let mut label_map = BTreeMap::new();

    for label in labels {
        // Validate label name
        if !is_valid_label_name(&label.name) {
            return Err(format!("Invalid label name: {}", label.name));
        }

        // Validate label value (any UTF-8, but not empty)
        if label.value.is_empty() {
            return Err(format!("Empty label value for label: {}", label.name));
        }

        label_map.insert(label.name.clone(), label.value.clone());
    }

    // Must have __name__ label
    if !label_map.contains_key("__name__") {
        return Err("Missing __name__ label".into());
    }

    Ok(label_map)
}

fn is_valid_label_name(name: &str) -> bool {
    // Must match [a-zA-Z_][a-zA-Z0-9_]*
    if name.is_empty() {
        return false;
    }

    let mut chars = name.chars();
    let first = chars.next().unwrap();
    if !first.is_ascii_alphabetic() && first != '_' {
        return false;
    }

    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn is_valid_timestamp(ts: i64) -> bool {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
    let min_valid = now - 24 * 3600 * 1000; // Not older than 24h
    let max_valid = now + 5 * 60 * 1000;    // Not more than 5min in the future
    ts >= min_valid && ts <= max_valid
}
```

### 4.2 gRPC API (Alternative/Additional)

In addition to HTTP, Nightlight MAY support a gRPC API for ingestion (more efficient for internal services).

**Proto Definition**:

```protobuf
syntax = "proto3";

package nightlight.ingest;

service IngestionService {
  rpc Write(WriteRequest) returns (WriteResponse);
  rpc WriteBatch(stream WriteRequest) returns (WriteResponse);
}

message WriteRequest {
  repeated TimeSeries timeseries = 1;
}

message WriteResponse {
  uint64 samples_ingested = 1;
  uint64 samples_rejected = 2;
  string error = 3;
}

// (Reuse TimeSeries, Label, Sample from remote_write.proto)
```

### 4.3 Label Validation and Normalization

#### 4.3.1 Metric Name Validation

Metric names (stored in the `__name__` label) must match:
```
[a-zA-Z_:][a-zA-Z0-9_:]*
```

Examples:
- ✅ `http_requests_total`
- ✅ `node_cpu_seconds:rate5m`
- ❌ `123_invalid` (starts with digit)
- ❌ `invalid-metric` (contains hyphen)

#### 4.3.2 Label Name Validation

Label names must match:
```
[a-zA-Z_][a-zA-Z0-9_]*
```

Reserved prefixes:
- `__` (double underscore): Internal labels (e.g., `__name__`, `__rollup__`)

#### 4.3.3 Label Normalization

Before inserting, labels are normalized:
1. Sort labels lexicographically by key
2. Ensure the `__name__` label is present
3. Remove duplicate labels (keep the last value)
4. Limit label count (default: 30 labels max per series)
5. Limit label value length (default: 1024 chars max)

### 4.4 Write Path Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                       Ingestion Layer                        │
│                                                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│  │ HTTP/gRPC   │    │ mTLS Auth   │    │ Rate Limiter│       │
│  │ Handler     │───▶│ Validator   │───▶│             │       │
│  └─────────────┘    └─────────────┘    └──────┬──────┘       │
│                                               │              │
│                                               ▼              │
│                                      ┌─────────────────┐     │
│                                      │  Decompressor   │     │
│                                      │  (Snappy)       │     │
│                                      └────────┬────────┘     │
│                                               │              │
│                                               ▼              │
│                                      ┌─────────────────┐     │
│                                      │  Protobuf       │     │
│                                      │  Decoder        │     │
│                                      └────────┬────────┘     │
│                                               │              │
└───────────────────────────────────────────────┼──────────────┘
                                                │
                                                ▼
┌──────────────────────────────────────────────────────────────┐
│                      Validation Layer                        │
│                                                              │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐      │
│  │ Label        │   │ Timestamp    │   │ Cardinality  │      │
│  │ Validator    │   │ Validator    │   │ Limiter      │      │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘      │
│         │                  │                  │              │
│         └──────────────────┴──────────────────┘              │
│                            │                                 │
└────────────────────────────┼─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│                        Write Buffer                          │
│                                                              │
│  ┌────────────────────────────────────────────────────┐      │
│  │ MPSC Channel (bounded)                             │      │
│  │ Capacity: 100K samples                             │      │
│  │ Backpressure: Block/Reject when full               │      │
│  └────────────────────────────────────────────────────┘      │
│                            │                                 │
└────────────────────────────┼─────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────┐
│                        Storage Layer                         │
│                                                              │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │ WAL         │◀────────│ WAL Writer  │                     │
│  │ (Disk)      │         │ (Thread)    │                     │
│  └─────────────┘         └─────────────┘                     │
│                                                              │
│  ┌─────────────┐         ┌─────────────┐                     │
│  │ Head        │◀────────│ Head Writer │                     │
│  │ (In-Memory) │         │ (Thread)    │                     │
│  └─────────────┘         └─────────────┘                     │
└──────────────────────────────────────────────────────────────┘
```

**Concurrency Model**:

1. **HTTP/gRPC handlers**: Multi-threaded (tokio async)
2. **Write buffer**: MPSC channel (bounded capacity)
3. **WAL writer**: Single-threaded (sequential writes for consistency)
4. **Head writer**: Single-threaded (lock-free inserts via sharding)

**Backpressure Handling**:

```rust
enum BackpressureStrategy {
    Block,  // Block until the buffer has space (default)
    Reject, // Return 503 immediately
}

impl IngestionService {
    async fn handle_backpressure(&self, samples: Vec<Sample>) -> Result<()> {
        match self.config.backpressure_strategy {
            BackpressureStrategy::Block => {
                // Try to send with a timeout; the outer error is the timeout,
                // the inner error means the channel is closed
                tokio::time::timeout(
                    Duration::from_secs(5),
                    self.write_buffer.send(samples),
                )
                .await
                .map_err(|_| Error::Timeout)?
                .map_err(|_| Error::BufferFull)?;
            }
            BackpressureStrategy::Reject => {
                // Non-blocking send; the caller maps BufferFull to HTTP 503
                self.write_buffer.try_send(samples)
                    .map_err(|_| Error::BufferFull)?;
            }
        }
        Ok(())
    }
}
```

### 4.5 Out-of-Order Sample Handling

**Problem**: Samples may arrive out of timestamp order due to network delays, batching, etc.

**Solution**: Accept out-of-order samples within a configurable time window.

**Configuration**:
```toml
[storage]
out_of_order_time_window = "1h" # Accept samples up to 1h old
```

**Implementation**:

```rust
impl Head {
    async fn append(
        &self,
        series_id: SeriesID,
        labels: BTreeMap<String, String>,
        timestamp: i64,
        value: f64,
    ) -> Result<()> {
        let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis() as i64;
        let min_valid_ts = now - self.config.out_of_order_time_window.as_millis() as i64;

        if timestamp < min_valid_ts {
            return Err(Error::OutOfOrder(format!(
                "Sample too old: ts={}, min={}",
                timestamp, min_valid_ts
            )));
        }

        // Get or create the series
        let mut series_map = self.series.write().await;
        let series = series_map.entry(series_id).or_insert_with(|| {
            Arc::new(Series {
                id: series_id,
                labels: labels.clone(),
                chunks: RwLock::new(vec![]),
            })
        });

        // Append to the appropriate chunk
        let mut chunks = series.chunks.write().await;

        // Find the chunk that covers this timestamp, or create a new one.
        // (Working with an index instead of iter_mut().find().or_else()
        // avoids borrowing `chunks` mutably twice.)
        let idx = match chunks.iter().position(|c| timestamp >= c.min_time && timestamp < c.max_time) {
            Some(idx) => idx,
            None => {
                // Align the new chunk to a fixed-size time bucket
                let chunk_ms = self.chunk_size.as_millis() as i64;
                let chunk_start = (timestamp / chunk_ms) * chunk_ms;
                chunks.push(Chunk {
                    min_time: chunk_start,
                    max_time: chunk_start + chunk_ms,
                    samples: CompressedSamples::new(),
                });
                chunks.len() - 1
            }
        };

        chunks[idx].samples.append(timestamp, value)?;

        Ok(())
    }
}
```

---

## 5. PromQL Query Engine

### 5.1 PromQL Overview

**PromQL** (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data.

**Query Types**:
1. **Instant query**: Evaluate an expression at a single point in time
2. **Range query**: Evaluate an expression over a time range

### 5.2 Supported PromQL Subset

Nightlight v1 supports a **pragmatic subset** of PromQL covering roughly 80% of common dashboard queries.

#### 5.2.1 Instant Vector Selectors

```promql
# Select by metric name
http_requests_total

# Select with label matchers
http_requests_total{method="GET"}
http_requests_total{method="GET", status="200"}

# Label matcher operators
metric{label="value"}   # Exact match
metric{label!="value"}  # Not equal
metric{label=~"regex"}  # Regex match
metric{label!~"regex"}  # Regex not match

# Example
http_requests_total{method=~"GET|POST", status!="500"}
```

#### 5.2.2 Range Vector Selectors

```promql
# Select the last 5 minutes of data
http_requests_total[5m]

# With label matchers
http_requests_total{method="GET"}[1h]

# Time durations: s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
```

#### 5.2.3 Aggregation Operators

```promql
# sum: Sum over dimensions
sum(http_requests_total)
sum(http_requests_total) by (method)
sum(http_requests_total) without (instance)

# Supported aggregations:
sum              # Sum
avg              # Average
min              # Minimum
max              # Maximum
count            # Count
stddev           # Standard deviation
stdvar           # Standard variance
topk(N, expr)    # Top N series by value
bottomk(N, expr) # Bottom N series by value
```

#### 5.2.4 Functions

**Rate Functions**:
```promql
# rate: Per-second average rate of increase
rate(http_requests_total[5m])

# irate: Instant rate (based on the last two samples)
irate(http_requests_total[5m])

# increase: Total increase over the time range
increase(http_requests_total[1h])
```

**Quantile Functions**:
```promql
# histogram_quantile: Calculate a quantile from histogram buckets
histogram_quantile(0.95, rate(http_request_duration_bucket[5m]))
```

**Time Functions**:
```promql
# time(): Current Unix timestamp
time()

# timestamp(): Timestamp of each sample
timestamp(metric)
```

**Math Functions**:
```promql
# abs, ceil, floor, round, sqrt, exp, ln, log2, log10
abs(metric)
round(metric, 0.1)
```

#### 5.2.5 Binary Operators

**Arithmetic**:
```promql
metric1 + metric2
metric1 - metric2
metric1 * metric2
metric1 / metric2
metric1 % metric2
metric1 ^ metric2
```

**Comparison**:
```promql
metric1 == metric2  # Equal
metric1 != metric2  # Not equal
metric1 > metric2   # Greater than
metric1 < metric2   # Less than
metric1 >= metric2  # Greater or equal
metric1 <= metric2  # Less or equal
```

**Logical**:
```promql
metric1 and metric2     # Intersection
metric1 or metric2      # Union
metric1 unless metric2  # Complement
```

**Vector Matching**:
```promql
# One-to-one matching
metric1 + metric2

# Many-to-one matching
metric1 + on(label) group_left metric2

# One-to-many matching
metric1 + on(label) group_right metric2
```

#### 5.2.6 Subqueries (NOT SUPPORTED in v1)

Subqueries are complex and not supported in v1:
```promql
# NOT SUPPORTED
max_over_time(rate(http_requests_total[5m])[1h:])
```

### 5.3 Query Execution Model

#### 5.3.1 Query Parsing

Use the **promql-parser** crate (GreptimeTeam) for parsing:

```rust
use promql_parser::{parser, label};

fn parse_query(query: &str) -> Result<parser::Expr, ParseError> {
    parser::parse(query)
}

// Example
let expr = parse_query("http_requests_total{method=\"GET\"}[5m]")?;
match expr {
    parser::Expr::VectorSelector(vs) => {
        println!("Metric: {}", vs.name);
        for matcher in vs.matchers.matchers {
            println!("Label: {} {} {}", matcher.name, matcher.op, matcher.value);
        }
        println!("Range: {:?}", vs.range);
    }
    _ => {}
}
```

**AST Types**:

```rust
pub enum Expr {
    Aggregate(AggregateExpr),       // sum, avg, etc.
    Unary(UnaryExpr),               // -metric
    Binary(BinaryExpr),             // metric1 + metric2
    Paren(ParenExpr),               // (expr)
    Subquery(SubqueryExpr),         // NOT SUPPORTED
    NumberLiteral(NumberLiteral),   // 1.5
    StringLiteral(StringLiteral),   // "value"
    VectorSelector(VectorSelector), // metric{labels}
    MatrixSelector(MatrixSelector), // metric[5m]
    Call(Call),                     // rate(...)
}
```

#### 5.3.2 Query Planner

Convert the AST to an execution plan:

```rust
enum QueryPlan {
    VectorSelector {
        matchers: Vec<LabelMatcher>,
        timestamp: i64,
    },
    MatrixSelector {
        matchers: Vec<LabelMatcher>,
        range: Duration,
        timestamp: i64,
    },
    Aggregate {
        op: AggregateOp,
        input: Box<QueryPlan>,
        grouping: Vec<String>,
    },
    RateFunc {
        input: Box<QueryPlan>,
    },
    BinaryOp {
        op: BinaryOp,
        lhs: Box<QueryPlan>,
        rhs: Box<QueryPlan>,
        matching: VectorMatching,
    },
}

struct QueryPlanner;

impl QueryPlanner {
    fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan> {
        match expr {
            parser::Expr::VectorSelector(vs) => {
                Ok(QueryPlan::VectorSelector {
                    matchers: vs.matchers.matchers.into_iter()
                        .map(|m| LabelMatcher::from_ast(m))
                        .collect(),
                    timestamp: query_time,
                })
            }
            parser::Expr::MatrixSelector(ms) => {
                Ok(QueryPlan::MatrixSelector {
                    matchers: ms.vector_selector.matchers.matchers.into_iter()
                        .map(|m| LabelMatcher::from_ast(m))
                        .collect(),
                    range: Duration::from_millis(ms.range as u64),
                    timestamp: query_time,
                })
            }
            parser::Expr::Call(call) => {
                match call.func.name.as_str() {
                    "rate" => {
                        let arg_plan = Self::plan(*call.args[0].clone(), query_time)?;
                        Ok(QueryPlan::RateFunc { input: Box::new(arg_plan) })
                    }
                    // ... other functions
                    _ => Err(Error::UnsupportedFunction(call.func.name)),
                }
            }
            parser::Expr::Aggregate(agg) => {
                let input_plan = Self::plan(*agg.expr, query_time)?;
                Ok(QueryPlan::Aggregate {
                    op: AggregateOp::from_str(&agg.op.to_string())?,
                    input: Box::new(input_plan),
                    grouping: agg.grouping.unwrap_or_default(),
                })
            }
            parser::Expr::Binary(bin) => {
                let lhs_plan = Self::plan(*bin.lhs, query_time)?;
                let rhs_plan = Self::plan(*bin.rhs, query_time)?;
                Ok(QueryPlan::BinaryOp {
                    op: BinaryOp::from_str(&bin.op.to_string())?,
                    lhs: Box::new(lhs_plan),
                    rhs: Box::new(rhs_plan),
                    matching: bin.modifier.map(|m| VectorMatching::from_ast(m)).unwrap_or_default(),
                })
            }
            _ => Err(Error::UnsupportedExpr),
        }
    }
}
```

#### 5.3.3 Query Executor

Execute the plan:

```rust
struct QueryExecutor {
    head: Arc<Head>,
    blocks: Arc<BlockManager>,
}

impl QueryExecutor {
    async fn execute(&self, plan: QueryPlan) -> Result<QueryResult> {
        match plan {
            QueryPlan::VectorSelector { matchers, timestamp } => {
                self.execute_vector_selector(matchers, timestamp).await
            }
            QueryPlan::MatrixSelector { matchers, range, timestamp } => {
                self.execute_matrix_selector(matchers, range, timestamp).await
            }
            QueryPlan::RateFunc { input } => {
                let matrix = self.execute(*input).await?;
                self.apply_rate(matrix)
            }
            QueryPlan::Aggregate { op, input, grouping } => {
                let vector = self.execute(*input).await?;
                self.apply_aggregate(op, vector, grouping)
            }
            QueryPlan::BinaryOp { op, lhs, rhs, matching } => {
                let lhs_result = self.execute(*lhs).await?;
                let rhs_result = self.execute(*rhs).await?;
                self.apply_binary_op(op, lhs_result, rhs_result, matching)
            }
        }
    }

    async fn execute_vector_selector(
        &self,
        matchers: Vec<LabelMatcher>,
        timestamp: i64,
    ) -> Result<InstantVector> {
        // 1. Find matching series from the index
        let series_ids = self.find_series(&matchers).await?;

        // 2. For each series, get the sample at the query timestamp
        let mut samples = Vec::new();
        for series_id in series_ids {
            if let Some(sample) = self.get_sample_at(series_id, timestamp).await? {
                samples.push(sample);
            }
        }

        Ok(InstantVector { samples })
    }

    async fn execute_matrix_selector(
        &self,
        matchers: Vec<LabelMatcher>,
        range: Duration,
        timestamp: i64,
    ) -> Result<RangeVector> {
        let series_ids = self.find_series(&matchers).await?;

        let start = timestamp - range.as_millis() as i64;
        let end = timestamp;

        let mut ranges = Vec::new();
        for series_id in series_ids {
            let samples = self.get_samples_range(series_id, start, end).await?;
            ranges.push(RangeVectorSeries {
                labels: self.get_labels(series_id).await?,
                samples,
            });
        }

        Ok(RangeVector { ranges })
    }

    fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector> {
        let mut samples = Vec::new();

        for range in matrix.ranges {
            if range.samples.len() < 2 {
                continue; // Need at least 2 samples for rate
            }

            let first = &range.samples[0];
            let last = &range.samples[range.samples.len() - 1];

            let delta_value = last.value - first.value;
            let delta_time = (last.timestamp - first.timestamp) as f64 / 1000.0; // ms -> seconds

            let rate = delta_value / delta_time;

            samples.push(Sample {
                labels: range.labels,
                timestamp: last.timestamp,
                value: rate,
            });
        }

        Ok(InstantVector { samples })
    }

    fn apply_aggregate(
        &self,
        op: AggregateOp,
        vector: InstantVector,
        grouping: Vec<String>,
    ) -> Result<InstantVector> {
        // Group samples by the grouping labels
        let mut groups: HashMap<Vec<(String, String)>, Vec<Sample>> = HashMap::new();

        for sample in vector.samples {
            let group_key = if grouping.is_empty() {
                vec![]
            } else {
                grouping.iter()
                    .filter_map(|label| sample.labels.get(label).map(|v| (label.clone(), v.clone())))
                    .collect()
            };

            groups.entry(group_key).or_insert_with(Vec::new).push(sample);
        }

        // Apply the aggregation to each group
        let mut result_samples = Vec::new();
        for (group_labels, samples) in groups {
            let aggregated_value = match op {
                AggregateOp::Sum => samples.iter().map(|s| s.value).sum(),
                AggregateOp::Avg => samples.iter().map(|s| s.value).sum::<f64>() / samples.len() as f64,
                AggregateOp::Min => samples.iter().map(|s| s.value).fold(f64::INFINITY, f64::min),
                AggregateOp::Max => samples.iter().map(|s| s.value).fold(f64::NEG_INFINITY, f64::max),
                AggregateOp::Count => samples.len() as f64,
                // ... other aggregations
            };

            result_samples.push(Sample {
                labels: group_labels.into_iter().collect(),
                timestamp: samples[0].timestamp,
                value: aggregated_value,
            });
        }

        Ok(InstantVector { samples: result_samples })
    }
}
```

### 5.4 Read Path Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                         Query Layer                          │
│                                                              │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐       │
│  │ HTTP API    │    │ PromQL      │    │ Query       │       │
│  │ /api/v1/    │───▶│ Parser      │───▶│ Planner     │       │
│  │ query       │    │             │    │             │       │
│  └─────────────┘    └─────────────┘    └──────┬──────┘       │
│                                               │              │
│                                               ▼              │
│                                      ┌─────────────────┐     │
│                                      │  Query          │     │
│                                      │  Executor       │     │
│                                      └────────┬────────┘     │
└───────────────────────────────────────────────┼──────────────┘
                                                │
                                                ▼
┌──────────────────────────────────────────────────────────────┐
│                         Index Layer                          │
│                                                              │
│        ┌──────────────┐        ┌──────────────┐              │
│        │ Label Index  │        │ Posting      │              │
│        │ (In-Memory)  │        │ Lists        │              │
│        └──────┬───────┘        └──────┬───────┘              │
│               │                       │                      │
│               └───────────┬───────────┘                      │
│                           │                                  │
│                           │ Series IDs                       │
│                           ▼                                  │
└───────────────────────────┼──────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────────┐
│                        Storage Layer                         │
│                                                              │
│     ┌─────────────┐              ┌─────────────┐             │
│     │ Head        │              │ Blocks      │             │
│     │ (In-Memory) │              │ (Disk)      │             │
│     └─────┬───────┘              └─────┬───────┘             │
│           │                            │                     │
│           │ Recent data (<2h)          │ Historical data     │
│           │                            │                     │
│           ▼                            ▼                     │
│  ┌─────────────────────────────────────┐                     │
│  │            Chunk Reader             │                     │
│  │  - Decompress Gorilla chunks        │                     │
│  │  - Filter by time range             │                     │
│  │  - Return samples                   │                     │
│  └─────────────────────────────────────┘                     │
└──────────────────────────────────────────────────────────────┘
```

### 5.5 HTTP Query API

#### 5.5.1 Instant Query

```
GET /api/v1/query?query=<promql>&time=<timestamp>&timeout=<duration>
```

**Parameters**:
- `query`: PromQL expression (required)
- `time`: Unix timestamp (optional, default: now)
- `timeout`: Query timeout (optional, default: 30s)

**Response** (JSON):

```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "http_requests_total",
          "method": "GET",
          "status": "200"
        },
        "value": [1733832000, "1543"]
      }
    ]
  }
}
```

#### 5.5.2 Range Query

```
GET /api/v1/query_range?query=<promql>&start=<timestamp>&end=<timestamp>&step=<duration>
```

**Parameters**:
- `query`: PromQL expression (required)
- `start`: Start timestamp (required)
- `end`: End timestamp (required)
- `step`: Query resolution step (required, e.g., "15s")

**Response** (JSON):

```json
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {
          "__name__": "http_requests_total",
          "method": "GET"
        },
        "values": [
          [1733832000, "1543"],
          [1733832015, "1556"],
          [1733832030, "1570"]
        ]
      }
    ]
  }
}
```

#### 5.5.3 Label Values Query

```
GET /api/v1/label/<label_name>/values?match[]=<matcher>
```

**Example**:
```
GET /api/v1/label/method/values?match[]=http_requests_total
```

**Response**:
```json
{
  "status": "success",
  "data": ["GET", "POST", "PUT", "DELETE"]
}
```

#### 5.5.4 Series Metadata Query

```
GET /api/v1/series?match[]=<matcher>&start=<timestamp>&end=<timestamp>
```

**Example**:
```
GET /api/v1/series?match[]=http_requests_total{method="GET"}
```

**Response**:
```json
{
  "status": "success",
  "data": [
    {
      "__name__": "http_requests_total",
      "method": "GET",
      "status": "200",
      "instance": "flaredb-1:9092"
    }
  ]
}
```

### 5.6 Performance Optimizations

#### 5.6.1 Query Caching

Cache query results for identical queries:

```rust
struct QueryCache {
    cache: Arc<Mutex<LruCache<String, (QueryResult, Instant)>>>,
    ttl: Duration,
}

impl QueryCache {
    fn get(&self, query_hash: &str) -> Option<QueryResult> {
        // LruCache::get takes &mut self (it updates recency), so the
        // guard must be mutable
        let mut cache = self.cache.lock().unwrap();
        if let Some((result, timestamp)) = cache.get(query_hash) {
            if timestamp.elapsed() < self.ttl {
                return Some(result.clone());
            }
        }
        None
    }

    fn put(&self, query_hash: String, result: QueryResult) {
        let mut cache = self.cache.lock().unwrap();
        cache.put(query_hash, (result, Instant::now()));
    }
}
```

#### 5.6.2 Posting List Intersection

Use efficient algorithms for label matcher intersection:

```rust
fn intersect_posting_lists(lists: Vec<&[SeriesID]>) -> Vec<SeriesID> {
    if lists.is_empty() {
        return vec![];
    }

    // Sort lists by length (shortest first for early termination)
    let mut sorted_lists = lists;
    sorted_lists.sort_by_key(|list| list.len());

    // Use the shortest list as the base, intersect with the others
    let mut result: HashSet<SeriesID> = sorted_lists[0].iter().copied().collect();

    for list in &sorted_lists[1..] {
        let list_set: HashSet<SeriesID> = list.iter().copied().collect();
        result.retain(|id| list_set.contains(id));

        if result.is_empty() {
            break; // Early termination
        }
    }

    result.into_iter().collect()
}
```

#### 5.6.3 Chunk Pruning

Skip chunks that don't overlap the query time range:

```rust
fn query_chunks(
    chunks: &[ChunkRef],
    start_time: i64,
    end_time: i64,
) -> Vec<ChunkRef> {
    chunks.iter()
        .filter(|chunk| {
            // A chunk overlaps the query range iff:
            // chunk.max_time > start AND chunk.min_time < end
            chunk.max_time > start_time && chunk.min_time < end_time
        })
        .copied()
        .collect()
}
```

---

## 6. Storage Backend Architecture

### 6.1 Architecture Decision: Hybrid Approach

After analyzing the trade-offs, Nightlight adopts a **hybrid storage architecture**:

1. **Dedicated time-series engine** for sample storage (optimized for write throughput and compression)
2. **Optional FlareDB integration** for metadata and distributed coordination (future work)
3. **Optional S3-compatible backend** for cold data archival (future work)

### 6.2 Decision Rationale

#### 6.2.1 Why NOT a Pure FlareDB Backend?

**FlareDB Characteristics**:
- General-purpose KV store with Raft consensus
- Optimized for: strong consistency, small KV pairs, random access
- Storage: RocksDB (LSM tree)

**Time-Series Workload Characteristics**:
- High write throughput (100K samples/sec)
- Sequential writes (append-only)
- Temporal locality (queries focus on recent data)
- Bulk reads (range scans over time windows)

**Mismatch Analysis**:

| Aspect | FlareDB (KV) | Time-Series Engine |
|--------|--------------|--------------------|
| Write pattern | Random writes, compaction overhead | Append-only, minimal overhead |
| Compression | Generic LZ4/Snappy | Domain-specific (Gorilla: ~12x) |
| Read pattern | Point lookups | Range scans over time |
| Indexing | Key-based | Label-based inverted index |
| Consistency | Strong (Raft) | Eventual OK for metrics |

**Conclusion**: Using FlareDB for sample storage would sacrifice 5-10x write throughput and roughly 10x compression efficiency.

#### 6.2.2 Why NOT the VictoriaMetrics Binary?

VictoriaMetrics is written in Go and has excellent performance, but:

- mTLS support is **paid only** (violates the PROJECT.md requirement)
- Not Rust (violates PROJECT.md's "write it in Rust" requirement)
- Cannot integrate with FlareDB for metadata (future requirement)
- Less control over the storage format and optimizations

#### 6.2.3 Why Hybrid (Dedicated + Optional FlareDB)?

**Phase 1 (T033 v1)**: Pure dedicated engine
- Simple, single-instance deployment
- Focus on core functionality (ingest + query)
- Local disk storage only

**Phase 2 (Future)**: Add FlareDB for metadata
- Store series labels and metadata in the FlareDB "metrics" namespace
- Enables multi-instance coordination
- Global view of series cardinality and label values
- Samples stay in the dedicated engine (local disk)

**Phase 3 (Future)**: Add S3 for cold storage
- Automatically upload old blocks (>7 days) to S3
- Query federation across local + S3 blocks
- Unlimited retention with cost-effective storage

**Benefits**:
- v1 simplicity: no FlareDB dependency, easy deployment
- Future scalability: metadata in FlareDB, samples distributed
- Operational flexibility: can run standalone or integrated

### 6.3 Storage Layout

#### 6.3.1 Directory Structure

```
/var/lib/nightlight/
├── data/
│   ├── wal/
│   │   ├── 00000001              # WAL segment
│   │   ├── 00000002
│   │   └── checkpoint.00000002   # WAL checkpoint
│   ├── 01HQZQZQZQZQZQZQZQZQZQ/   # Block (ULID)
│   │   ├── meta.json
│   │   ├── index
│   │   ├── chunks/
│   │   │   ├── 000001
│   │   │   └── 000002
│   │   └── tombstones
│   ├── 01HQZR.../                # Another block
│   └── ...
└── tmp/                          # Temp files for compaction
```

#### 6.3.2 Metadata Storage (Future: FlareDB Integration)

When FlareDB integration is enabled:

**Series Metadata** (stored in the FlareDB "metrics" namespace):

```
Key: series:<series_id>
Value: {
  "labels": {"__name__": "http_requests_total", "method": "GET", ...},
  "first_seen": 1733832000000,
  "last_seen": 1733839200000
}

Key: label_index:<label_name>:<label_value>
Value: [series_id1, series_id2, ...]  # Posting list
```

**Benefits**:
- Fast label value lookups across all instances
- Global series cardinality tracking
- Distributed query planning (future)

**Trade-off**: Adds a dependency on FlareDB and increases operational complexity.

### 6.4 Scalability Approach

#### 6.4.1 Vertical Scaling (v1)

A single instance scales to:
- 10M active series
- 100K samples/sec write throughput
- 1K queries/sec

**Scaling strategy**:
- Increase memory (more series in the Head)
- Faster disk (NVMe for WAL/blocks)
- More CPU cores (parallel compaction, query execution)

#### 6.4.2 Horizontal Scaling (Future)

**Sharding Strategy** (inspired by Prometheus federation + Thanos):

```
┌────────────────────────────────────────────────────────────┐
│                      Query Frontend                        │
│                   (Query Federation)                       │
└─────┬────────────────────┬─────────────────────┬───────────┘
      │                    │                     │
      ▼                    ▼                     ▼
┌─────────────┐    ┌─────────────┐       ┌─────────────┐
│ Nightlight  │    │ Nightlight  │       │ Nightlight  │
│ Instance 1  │    │ Instance 2  │       │ Instance N  │
│             │    │             │       │             │
│ Hash shard: │    │ Hash shard: │       │ Hash shard: │
│ 0-333       │    │ 334-666     │       │ 667-999     │
└─────────────┘    └─────────────┘       └─────────────┘
      │                    │                     │
      └────────────────────┴─────────────────────┘
                           │
                           ▼
                   ┌───────────────┐
                   │    FlareDB    │
                   │   (Metadata)  │
                   └───────────────┘
```

**Sharding Key**: Hash(series_id) % num_shards

**Query Execution**:
1. The query frontend receives a PromQL query
2. Determine which shards contain matching series (via FlareDB metadata)
3. Send subqueries to the relevant shards
4. Merge the results (aggregation, deduplication)
5. Return to the client

**Challenges** (deferred to future work):
- Rebalancing when adding/removing shards
- Handling series that span multiple shards (rare)
- Ensuring query consistency across shards

### 6.5 S3 Integration Strategy (Future)

**Objective**: Cost-effective long-term retention (>15 days)

**Architecture**:

```
┌───────────────────────────────────────────────────┐
│                 Nightlight Server                 │
│                                                   │
│  ┌──────────┐        ┌──────────┐                 │
│  │  Head    │        │  Blocks  │                 │
│  │  (0-2h)  │        │ (2h-15d) │                 │
│  └──────────┘        └────┬─────┘                 │
│                           │                       │
│                           │ Background uploader   │
│                           ▼                       │
│                    ┌─────────────┐                │
│                    │ Upload to   │                │
│                    │ S3 (>7d)    │                │
│                    └──────┬──────┘                │
└───────────────────────────┼───────────────────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │    S3 Bucket    │
                   │    /blocks/     │
                   │    01HQZ.../    │
                   │    01HRZ.../    │
                   └─────────────────┘
```

**Workflow**:
|
|
1. Block compaction creates local block files
|
|
2. Blocks older than 7 days (configurable) are uploaded to S3
|
|
3. Local block files deleted after successful upload
|
|
4. Query executor checks both local and S3 for blocks in query range
|
|
5. Download S3 blocks on-demand (with local cache)
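The upload decision in step 2 is a simple age check against the block's newest sample; a sketch (the function name is illustrative, `upload_after_days` mirrors the config key):

```rust
/// Decide whether a sealed block should be uploaded to S3: true once the
/// block's newest sample is older than `upload_after_days` (sketch).
fn should_upload(block_max_time_ms: i64, now_ms: i64, upload_after_days: u32) -> bool {
    let age_ms = now_ms - block_max_time_ms;
    age_ms >= upload_after_days as i64 * 24 * 60 * 60 * 1000
}

fn main() {
    let day_ms: i64 = 24 * 60 * 60 * 1000;
    let now = 100 * day_ms;
    assert!(should_upload(92 * day_ms, now, 7));  // 8 days old -> upload
    assert!(!should_upload(95 * day_ms, now, 7)); // 5 days old -> keep local
}
```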
|
|
|
|
**Configuration**:
|
|
```toml
|
|
[storage.s3]
|
|
enabled = true
|
|
endpoint = "https://s3.example.com"
|
|
bucket = "nightlight-blocks"
|
|
access_key_id = "..."
|
|
secret_access_key = "..."
|
|
upload_after_days = 7
|
|
local_cache_size_gb = 100
|
|
```
|
|
|
|
---
|
|
|
|
## 7. Integration Points
|
|
|
|
### 7.1 Service Discovery (How Services Push Metrics)
|
|
|
|
#### 7.1.1 Service Configuration Pattern
|
|
|
|
Each platform service (FlareDB, ChainFire, etc.) exports Prometheus metrics on ports 9091-9099.
|
|
|
|
**Example** (FlareDB metrics exporter):
|
|
|
|
```rust
|
|
// flaredb-server/src/main.rs
|
|
use metrics_exporter_prometheus::PrometheusBuilder;
|
|
|
|
#[tokio::main]
|
|
async fn main() -> Result<()> {
|
|
// ... initialization ...
|
|
|
|
let metrics_addr = format!("0.0.0.0:{}", args.metrics_port);
|
|
let builder = PrometheusBuilder::new();
|
|
builder
|
|
.with_http_listener(metrics_addr.parse::<std::net::SocketAddr>()?)
|
|
.install()
|
|
.expect("Failed to install Prometheus metrics exporter");
|
|
|
|
info!("Prometheus metrics available at http://{}/metrics", metrics_addr);
|
|
|
|
// ... rest of main ...
|
|
}
|
|
```
|
|
|
|
**Service Metrics Ports** (from T027.S2):
|
|
|
|
| Service | Port | Endpoint |
|
|
|---------|------|----------|
|
|
| ChainFire | 9091 | http://chainfire:9091/metrics |
|
|
| FlareDB | 9092 | http://flaredb:9092/metrics |
|
|
| PlasmaVMC | 9093 | http://plasmavmc:9093/metrics |
|
|
| IAM | 9094 | http://iam:9094/metrics |
|
|
| LightningSTOR | 9095 | http://lightningstor:9095/metrics |
|
|
| FlashDNS | 9096 | http://flashdns:9096/metrics |
|
|
| FiberLB | 9097 | http://fiberlb:9097/metrics |
|
|
| Prismnet | 9098 | http://prismnet:9098/metrics |
|
|
|
|
#### 7.1.2 Scrape-to-Push Adapter
|
|
|
|
Since Nightlight is **push-based** but services export **pull-based** Prometheus `/metrics` endpoints, we need a scrape-to-push adapter.
|
|
|
|
**Option 1**: Prometheus Agent Mode + Remote Write
|
|
|
|
Deploy Prometheus in agent mode (no storage, only scraping):
|
|
|
|
```yaml
|
|
# prometheus-agent.yaml
|
|
global:
|
|
scrape_interval: 15s
|
|
external_labels:
|
|
cluster: 'cloud-platform'
|
|
|
|
scrape_configs:
|
|
- job_name: 'chainfire'
|
|
static_configs:
|
|
- targets: ['chainfire:9091']
|
|
|
|
- job_name: 'flaredb'
|
|
static_configs:
|
|
- targets: ['flaredb:9092']
|
|
|
|
# ... other services ...
|
|
|
|
remote_write:
|
|
- url: 'https://nightlight:8080/api/v1/write'
|
|
tls_config:
|
|
cert_file: /etc/certs/client.crt
|
|
key_file: /etc/certs/client.key
|
|
ca_file: /etc/certs/ca.crt
|
|
```
|
|
|
|
**Option 2**: Custom Rust Scraper (Platform-Native)
|
|
|
|
Build a lightweight scraper in Rust that integrates with Nightlight:
|
|
|
|
```rust
|
|
// nightlight-scraper/src/main.rs
|
|
|
|
struct Scraper {
|
|
targets: Vec<ScrapeTarget>,
|
|
client: reqwest::Client,
|
|
nightlight_client: NightlightClient,
|
|
}
|
|
|
|
struct ScrapeTarget {
|
|
job_name: String,
|
|
url: String,
|
|
interval: Duration,
|
|
}
|
|
|
|
impl Scraper {
|
|
async fn scrape_loop(&self) {
|
|
loop {
|
|
for target in &self.targets {
|
|
let result = self.scrape_target(target).await;
|
|
match result {
|
|
Ok(samples) => {
|
|
if let Err(e) = self.nightlight_client.write(samples).await {
|
|
error!("Failed to write to Nightlight: {}", e);
|
|
}
|
|
}
|
|
Err(e) => {
|
|
error!("Failed to scrape {}: {}", target.url, e);
|
|
}
|
|
}
|
|
}
|
|
tokio::time::sleep(Duration::from_secs(15)).await;
|
|
}
|
|
}
|
|
|
|
async fn scrape_target(&self, target: &ScrapeTarget) -> Result<Vec<Sample>> {
|
|
let response = self.client.get(&target.url).send().await?;
|
|
let body = response.text().await?;
|
|
|
|
// Parse Prometheus text format
|
|
let samples = parse_prometheus_text(&body, &target.job_name)?;
|
|
Ok(samples)
|
|
}
|
|
}
|
|
|
|
fn parse_prometheus_text(text: &str, job: &str) -> Result<Vec<Sample>> {
    // Use the prometheus-parse crate or implement a simple parser.
    // Example input line (the `job` label is attached to every parsed series):
    // http_requests_total{method="GET",status="200",job="flaredb"} 1543 1733832000000
    todo!()
}
|
|
```
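As a rough illustration of the text exposition format (label parsing and escaping omitted), a single sample line splits into metric name, value, and optional timestamp. This is only a sketch, not the parser the design commits to:

```rust
/// Parse one label-free Prometheus text-format line, e.g.
/// `process_cpu_seconds_total 42.5 1733832000000` (sketch only).
fn parse_sample_line(line: &str) -> Option<(String, f64, Option<i64>)> {
    let mut parts = line.split_whitespace();
    let name = parts.next()?.to_string();
    let value: f64 = parts.next()?.parse().ok()?;
    let timestamp = parts.next().and_then(|t| t.parse().ok());
    Some((name, value, timestamp))
}

fn main() {
    let (name, value, ts) = parse_sample_line("up 1 1733832000000").unwrap();
    assert_eq!(name, "up");
    assert_eq!(value, 1.0);
    assert_eq!(ts, Some(1733832000000));
}
```

A production parser must additionally handle `{label="value"}` pairs, `# HELP`/`# TYPE` comments, and escape sequences, which is why the prometheus-parse crate is the likelier choice.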
|
|
|
|
**Deployment**:
|
|
- `nightlight-scraper` runs as a sidecar or separate service
|
|
- Reads scrape config from TOML file
|
|
- Uses mTLS to push to Nightlight
|
|
|
|
**Recommendation**: Option 2 (custom scraper) for consistency with platform philosophy (100% Rust, no external dependencies).
|
|
|
|
### 7.2 mTLS Configuration (T027/T031 Patterns)
|
|
|
|
#### 7.2.1 TLS Config Structure
|
|
|
|
Following existing patterns (FlareDB, ChainFire, IAM):
|
|
|
|
```toml
|
|
# nightlight.toml
|
|
|
|
[server]
|
|
addr = "0.0.0.0:8080"
|
|
log_level = "info"
|
|
|
|
[server.tls]
|
|
cert_file = "/etc/nightlight/certs/server.crt"
|
|
key_file = "/etc/nightlight/certs/server.key"
|
|
ca_file = "/etc/nightlight/certs/ca.crt"
|
|
require_client_cert = true # Enable mTLS
|
|
```
|
|
|
|
**Rust Config Struct**:
|
|
|
|
```rust
|
|
// nightlight-server/src/config.rs
|
|
|
|
use serde::{Deserialize, Serialize};
|
|
use std::net::SocketAddr;
|
|
|
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
pub struct ServerConfig {
|
|
pub server: ServerSettings,
|
|
pub storage: StorageConfig,
|
|
}
|
|
|
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
pub struct ServerSettings {
|
|
pub addr: SocketAddr,
|
|
pub log_level: String,
|
|
pub tls: Option<TlsConfig>,
|
|
}
|
|
|
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
pub struct TlsConfig {
|
|
pub cert_file: String,
|
|
pub key_file: String,
|
|
pub ca_file: Option<String>,
|
|
#[serde(default)]
|
|
pub require_client_cert: bool,
|
|
}
|
|
|
|
#[derive(Debug, Clone, Serialize, Deserialize)]
|
|
pub struct StorageConfig {
|
|
pub data_dir: String,
|
|
pub retention_days: u32,
|
|
pub wal_segment_size_mb: usize,
|
|
// ... other storage settings
|
|
}
|
|
```
|
|
|
|
#### 7.2.2 mTLS Server Setup
|
|
|
|
```rust
|
|
// nightlight-server/src/main.rs
|
|
|
|
use axum::Router;
|
|
use axum_server::tls_rustls::RustlsConfig;
|
|
use std::sync::Arc;
|
|
|
|
#[tokio::main]
|
|
async fn main() -> Result<()> {
|
|
let config = ServerConfig::load("nightlight.toml")?;
|
|
|
|
// Build router
|
|
let app = Router::new()
|
|
.route("/api/v1/write", post(handle_remote_write))
|
|
.route("/api/v1/query", get(handle_instant_query))
|
|
.route("/api/v1/query_range", get(handle_range_query))
|
|
.route("/health", get(health_check))
|
|
.route("/ready", get(readiness_check))
|
|
.with_state(Arc::new(service));
|
|
|
|
// Setup TLS if configured
|
|
if let Some(tls_config) = &config.server.tls {
|
|
info!("TLS enabled, loading certificates...");
|
|
|
|
        let rustls_config = if tls_config.require_client_cert {
            info!("mTLS enabled, requiring client certificates");

            // axum-server's RustlsConfig has no client-cert helper, so build a
            // rustls ServerConfig with a client certificate verifier directly.
            let ca_path = tls_config.ca_file.as_ref().ok_or("ca_file required for mTLS")?;
            let ca_pem = tokio::fs::read(ca_path).await?;

            let mut roots = rustls::RootCertStore::empty();
            for der in rustls_pemfile::certs(&mut ca_pem.as_slice())? {
                roots.add(&rustls::Certificate(der))?;
            }
            let verifier = rustls::server::AllowAnyAuthenticatedClient::new(roots).boxed();

            // load_certs / load_private_key: small PEM-loading helpers (elided).
            let server_config = rustls::ServerConfig::builder()
                .with_safe_defaults()
                .with_client_cert_verifier(verifier)
                .with_single_cert(load_certs(&tls_config.cert_file)?, load_private_key(&tls_config.key_file)?)?;

            RustlsConfig::from_config(Arc::new(server_config))
|
|
} else {
|
|
info!("TLS-only mode, client certificates not required");
|
|
RustlsConfig::from_pem_file(
|
|
&tls_config.cert_file,
|
|
&tls_config.key_file,
|
|
).await?
|
|
};
|
|
|
|
axum_server::bind_rustls(config.server.addr, rustls_config)
|
|
.serve(app.into_make_service())
|
|
.await?;
|
|
} else {
|
|
info!("TLS disabled, running in plain-text mode");
|
|
axum_server::bind(config.server.addr)
|
|
.serve(app.into_make_service())
|
|
.await?;
|
|
}
|
|
|
|
Ok(())
|
|
}
|
|
```
|
|
|
|
#### 7.2.3 Client Certificate Validation
|
|
|
|
Extract client identity from mTLS certificate:
|
|
|
|
```rust
|
|
use axum::{
|
|
http::Request,
|
|
middleware::Next,
|
|
response::Response,
|
|
Extension,
|
|
};
|
|
|
|
|
|
#[derive(Clone, Debug)]
|
|
struct ClientIdentity {
|
|
common_name: String,
|
|
organization: String,
|
|
}
|
|
|
|
async fn extract_client_identity<B>(
|
|
Extension(client_cert): Extension<Option<rustls::Certificate>>,
|
|
mut request: Request<B>,
|
|
next: Next<B>,
|
|
) -> Response {
|
|
if let Some(cert) = client_cert {
|
|
// Parse certificate to extract CN, O, etc.
|
|
let identity = parse_certificate(&cert);
|
|
request.extensions_mut().insert(identity);
|
|
}
|
|
|
|
next.run(request).await
|
|
}
|
|
|
|
// Use identity for rate limiting, audit logging, etc.
|
|
async fn handle_remote_write(
|
|
Extension(identity): Extension<ClientIdentity>,
|
|
State(service): State<Arc<IngestionService>>,
|
|
body: Bytes,
|
|
) -> Result<StatusCode, (StatusCode, String)> {
|
|
info!("Write request from: {}", identity.common_name);
|
|
|
|
// Apply per-client rate limiting
|
|
if !service.rate_limiter.allow(&identity.common_name) {
|
|
return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
|
|
}
|
|
|
|
// ... rest of handler ...
|
|
}
|
|
```
|
|
|
|
### 7.3 gRPC API Design
|
|
|
|
While HTTP is the primary interface (Prometheus compatibility), a gRPC API can provide:
|
|
- Better performance for internal services
|
|
- Streaming support for batch ingestion
|
|
- Type-safe client libraries
|
|
|
|
**Proto Definition**:
|
|
|
|
```protobuf
|
|
// proto/nightlight.proto
|
|
|
|
syntax = "proto3";
|
|
|
|
package nightlight.v1;
|
|
|
|
service NightlightService {
|
|
// Write samples
|
|
rpc Write(WriteRequest) returns (WriteResponse);
|
|
|
|
// Streaming write for high-throughput scenarios
|
|
rpc WriteStream(stream WriteRequest) returns (WriteResponse);
|
|
|
|
// Query (instant)
|
|
rpc Query(QueryRequest) returns (QueryResponse);
|
|
|
|
// Query (range)
|
|
rpc QueryRange(QueryRangeRequest) returns (QueryRangeResponse);
|
|
|
|
// Admin operations
|
|
rpc Compact(CompactRequest) returns (CompactResponse);
|
|
rpc DeleteSeries(DeleteSeriesRequest) returns (DeleteSeriesResponse);
|
|
}
|
|
|
|
message WriteRequest {
|
|
repeated TimeSeries timeseries = 1;
|
|
}
|
|
|
|
message WriteResponse {
|
|
uint64 samples_ingested = 1;
|
|
uint64 samples_rejected = 2;
|
|
}
|
|
|
|
message QueryRequest {
|
|
string query = 1; // PromQL
|
|
int64 time = 2; // Unix timestamp (ms)
|
|
int64 timeout_ms = 3;
|
|
}
|
|
|
|
message QueryResponse {
|
|
string result_type = 1; // "vector" or "matrix"
|
|
repeated InstantVectorSample vector = 2;
|
|
repeated RangeVectorSeries matrix = 3;
|
|
}
|
|
|
|
message InstantVectorSample {
|
|
map<string, string> labels = 1;
|
|
double value = 2;
|
|
int64 timestamp = 3;
|
|
}
|
|
|
|
message RangeVectorSeries {
|
|
map<string, string> labels = 1;
|
|
repeated Sample samples = 2;
|
|
}
|
|
|
|
message Sample {
|
|
double value = 1;
|
|
int64 timestamp = 2;
|
|
}
|
|
```
|
|
|
|
### 7.4 NixOS Module Integration
|
|
|
|
Following T024 patterns, create a NixOS module for Nightlight.
|
|
|
|
**File**: `nix/modules/nightlight.nix`
|
|
|
|
```nix
|
|
{ config, lib, pkgs, ... }:
|
|
|
|
with lib;
|
|
|
|
let
|
|
cfg = config.services.nightlight;
|
|
|
|
configFile = pkgs.writeText "nightlight.toml" ''
|
|
[server]
|
|
addr = "${cfg.listenAddress}"
|
|
log_level = "${cfg.logLevel}"
|
|
|
|
${optionalString (cfg.tls.enable) ''
|
|
[server.tls]
|
|
cert_file = "${cfg.tls.certFile}"
|
|
key_file = "${cfg.tls.keyFile}"
|
|
${optionalString (cfg.tls.caFile != null) ''
|
|
ca_file = "${cfg.tls.caFile}"
|
|
''}
|
|
require_client_cert = ${boolToString cfg.tls.requireClientCert}
|
|
''}
|
|
|
|
[storage]
|
|
data_dir = "${cfg.dataDir}"
|
|
retention_days = ${toString cfg.storage.retentionDays}
|
|
wal_segment_size_mb = ${toString cfg.storage.walSegmentSizeMb}
|
|
'';
|
|
|
|
in {
|
|
options.services.nightlight = {
|
|
enable = mkEnableOption "Nightlight metrics storage service";
|
|
|
|
package = mkOption {
|
|
type = types.package;
|
|
default = pkgs.nightlight;
|
|
description = "Nightlight package to use";
|
|
};
|
|
|
|
listenAddress = mkOption {
|
|
type = types.str;
|
|
default = "0.0.0.0:8080";
|
|
description = "Address and port to listen on";
|
|
};
|
|
|
|
logLevel = mkOption {
|
|
type = types.enum [ "trace" "debug" "info" "warn" "error" ];
|
|
default = "info";
|
|
description = "Log level";
|
|
};
|
|
|
|
dataDir = mkOption {
|
|
type = types.path;
|
|
default = "/var/lib/nightlight";
|
|
description = "Data directory for TSDB storage";
|
|
};
|
|
|
|
tls = {
|
|
enable = mkEnableOption "TLS encryption";
|
|
|
|
certFile = mkOption {
|
|
type = types.str;
|
|
description = "Path to TLS certificate file";
|
|
};
|
|
|
|
keyFile = mkOption {
|
|
type = types.str;
|
|
description = "Path to TLS private key file";
|
|
};
|
|
|
|
caFile = mkOption {
|
|
type = types.nullOr types.str;
|
|
default = null;
|
|
description = "Path to CA certificate for client verification (mTLS)";
|
|
};
|
|
|
|
requireClientCert = mkOption {
|
|
type = types.bool;
|
|
default = false;
|
|
description = "Require client certificates (mTLS)";
|
|
};
|
|
};
|
|
|
|
storage = {
|
|
retentionDays = mkOption {
|
|
type = types.ints.positive;
|
|
default = 15;
|
|
description = "Data retention period in days";
|
|
};
|
|
|
|
walSegmentSizeMb = mkOption {
|
|
type = types.ints.positive;
|
|
default = 128;
|
|
description = "WAL segment size in MB";
|
|
};
|
|
    };

    openFirewall = mkOption {
      type = types.bool;
      default = false;
      description = "Open the firewall for the Nightlight listen port";
    };
|
|
};
|
|
|
|
config = mkIf cfg.enable {
|
|
systemd.services.nightlight = {
|
|
description = "Nightlight Metrics Storage Service";
|
|
wantedBy = [ "multi-user.target" ];
|
|
after = [ "network.target" ];
|
|
|
|
serviceConfig = {
|
|
Type = "simple";
|
|
ExecStart = "${cfg.package}/bin/nightlight-server --config ${configFile}";
|
|
Restart = "on-failure";
|
|
RestartSec = "5s";
|
|
|
|
# Security hardening
|
|
DynamicUser = true;
|
|
StateDirectory = "nightlight";
|
|
ProtectSystem = "strict";
|
|
ProtectHome = true;
|
|
PrivateTmp = true;
|
|
NoNewPrivileges = true;
|
|
};
|
|
};
|
|
|
|
# Expose metrics endpoint
|
|
networking.firewall.allowedTCPPorts = mkIf cfg.openFirewall [ 8080 ];
|
|
};
|
|
}
|
|
```
|
|
|
|
**Usage Example** (in NixOS configuration):
|
|
|
|
```nix
|
|
{
|
|
services.nightlight = {
|
|
enable = true;
|
|
listenAddress = "0.0.0.0:8080";
|
|
logLevel = "info";
|
|
|
|
tls = {
|
|
enable = true;
|
|
certFile = "/etc/certs/nightlight-server.crt";
|
|
keyFile = "/etc/certs/nightlight-server.key";
|
|
caFile = "/etc/certs/ca.crt";
|
|
requireClientCert = true;
|
|
};
|
|
|
|
storage = {
|
|
retentionDays = 30;
|
|
};
|
|
};
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## 8. Implementation Plan
|
|
|
|
### 8.1 Step Breakdown (S1-S6)
|
|
|
|
The implementation follows a phased approach aligned with the task.yaml steps.
|
|
|
|
#### **S1: Research & Architecture** ✅ (Current Document)
|
|
|
|
**Deliverable**: This design document
|
|
|
|
**Status**: Completed
|
|
|
|
---
|
|
|
|
#### **S2: Workspace Scaffold**
|
|
|
|
**Goal**: Create nightlight workspace with skeleton structure
|
|
|
|
**Tasks**:
|
|
1. Create workspace structure:
|
|
```
|
|
nightlight/
|
|
├── Cargo.toml
|
|
├── crates/
|
|
│ ├── nightlight-api/ # Client library
|
|
│ ├── nightlight-server/ # Main service
|
|
│ └── nightlight-types/ # Shared types
|
|
├── proto/
|
|
│ ├── remote_write.proto
|
|
│ └── nightlight.proto
|
|
└── README.md
|
|
```
|
|
|
|
2. Setup proto compilation in build.rs
|
|
|
|
3. Define core types:
|
|
```rust
|
|
// nightlight-types/src/lib.rs
|
|
|
|
pub type SeriesID = u64;
|
|
pub type Timestamp = i64; // Unix timestamp in milliseconds
|
|
|
|
pub struct Sample {
|
|
pub timestamp: Timestamp,
|
|
pub value: f64,
|
|
}
|
|
|
|
pub struct Series {
|
|
pub id: SeriesID,
|
|
pub labels: BTreeMap<String, String>,
|
|
}
|
|
|
|
pub struct LabelMatcher {
|
|
pub name: String,
|
|
pub value: String,
|
|
pub op: MatchOp,
|
|
}
|
|
|
|
pub enum MatchOp {
|
|
Equal,
|
|
NotEqual,
|
|
RegexMatch,
|
|
RegexNotMatch,
|
|
}
|
|
```
|
|
|
|
4. Add dependencies:
|
|
```toml
|
|
[workspace.dependencies]
|
|
# Core
|
|
tokio = { version = "1.35", features = ["full"] }
|
|
anyhow = "1.0"
|
|
tracing = "0.1"
|
|
tracing-subscriber = "0.3"
|
|
|
|
# Serialization
|
|
serde = { version = "1.0", features = ["derive"] }
|
|
serde_json = "1.0"
|
|
toml = "0.8"
|
|
|
|
# gRPC
|
|
tonic = "0.10"
|
|
prost = "0.12"
|
|
prost-types = "0.12"
|
|
|
|
# HTTP
|
|
axum = "0.7"
|
|
axum-server = { version = "0.6", features = ["tls-rustls"] }
|
|
|
|
# Compression
|
|
snap = "1.1" # Snappy
|
|
|
|
# Time-series
|
|
promql-parser = "0.4"
|
|
|
|
# Storage
|
|
rocksdb = "0.21" # (NOT for TSDB, only for examples)
|
|
|
|
# Crypto
|
|
rustls = "0.21"
|
|
```
|
|
|
|
**Estimated Effort**: 2 days
|
|
|
|
---
|
|
|
|
#### **S3: Push Ingestion**
|
|
|
|
**Goal**: Implement Prometheus remote_write compatible ingestion endpoint
|
|
|
|
**Tasks**:
|
|
|
|
1. **Implement WAL**:
|
|
```rust
|
|
// nightlight-server/src/wal.rs
|
|
|
|
struct WAL {
|
|
dir: PathBuf,
|
|
segment_size: usize,
|
|
active_segment: RwLock<WALSegment>,
|
|
}
|
|
|
|
impl WAL {
|
|
fn new(dir: PathBuf, segment_size: usize) -> Result<Self>;
|
|
fn append(&self, record: WALRecord) -> Result<()>;
|
|
fn replay(&self) -> Result<Vec<WALRecord>>;
|
|
fn checkpoint(&self, min_segment: u64) -> Result<()>;
|
|
}
|
|
```
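On disk, each `WALRecord` would typically be framed as a length prefix followed by the payload (a CRC would normally follow; omitted here). A sketch of that framing under those assumptions:

```rust
/// Frame a WAL payload as a 4-byte little-endian length plus the bytes
/// (sketch; real records would also carry a checksum).
fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32).to_le_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Read one framed payload back, or None if the buffer is truncated.
fn unframe(buf: &[u8]) -> Option<&[u8]> {
    let len = u32::from_le_bytes(buf.get(..4)?.try_into().ok()?) as usize;
    buf.get(4..4 + len)
}

fn main() {
    let framed = frame(b"sample-record");
    assert_eq!(unframe(&framed), Some(&b"sample-record"[..]));
    assert_eq!(unframe(&[1, 2]), None); // truncated header
}
```

Replay (`WAL::replay`) then amounts to reading frames sequentially until a truncated or checksum-failing record marks the crash point.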
|
|
|
|
2. **Implement In-Memory Head Block**:
|
|
```rust
|
|
// nightlight-server/src/head.rs
|
|
|
|
struct Head {
|
|
series: DashMap<SeriesID, Arc<Series>>, // Concurrent HashMap
|
|
min_time: AtomicI64,
|
|
max_time: AtomicI64,
|
|
config: HeadConfig,
|
|
}
|
|
|
|
impl Head {
|
|
async fn append(&self, series_id: SeriesID, labels: Labels, ts: Timestamp, value: f64) -> Result<()>;
|
|
async fn get(&self, series_id: SeriesID) -> Option<Arc<Series>>;
|
|
async fn series_count(&self) -> usize;
|
|
}
|
|
```
|
|
|
|
3. **Implement Gorilla Compression** (basic version):
|
|
```rust
|
|
// nightlight-server/src/compression.rs
|
|
|
|
struct GorillaEncoder { /* ... */ }
|
|
struct GorillaDecoder { /* ... */ }
|
|
|
|
impl GorillaEncoder {
|
|
fn encode_timestamp(&mut self, ts: i64) -> Result<()>;
|
|
fn encode_value(&mut self, value: f64) -> Result<()>;
|
|
fn finish(self) -> Vec<u8>;
|
|
}
|
|
```
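The core of Gorilla's value compression is XORing each value's bits with its predecessor's and encoding only the meaningful bits. A sketch of that first step (the subsequent bit-packing is elided):

```rust
/// XOR a value against its predecessor and report the leading/trailing zero
/// counts Gorilla uses to size the encoded window (sketch).
fn xor_window(prev: f64, curr: f64) -> (u64, u32, u32) {
    let xor = prev.to_bits() ^ curr.to_bits();
    (xor, xor.leading_zeros(), xor.trailing_zeros())
}

fn main() {
    // Identical values XOR to zero -> Gorilla stores a single '0' bit.
    let (xor, _, _) = xor_window(12.0, 12.0);
    assert_eq!(xor, 0);

    // Nearby values share sign/exponent bits, so many leading zeros remain.
    let (xor, leading, _) = xor_window(12.0, 12.5);
    assert!(xor != 0 && leading >= 8);
}
```

This is why slowly changing gauges and monotonic counters compress so well: consecutive samples differ in only a handful of mantissa bits.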
|
|
|
|
4. **Implement HTTP Ingestion Handler**:
|
|
```rust
|
|
// nightlight-server/src/handlers/ingest.rs
|
|
|
|
async fn handle_remote_write(
|
|
State(service): State<Arc<IngestionService>>,
|
|
body: Bytes,
|
|
) -> Result<StatusCode, (StatusCode, String)> {
|
|
// 1. Decompress Snappy
|
|
// 2. Decode protobuf
|
|
// 3. Validate samples
|
|
// 4. Append to WAL
|
|
// 5. Insert into Head
|
|
// 6. Return 204 No Content
|
|
}
|
|
```
|
|
|
|
5. **Add Rate Limiting**:
|
|
```rust
|
|
struct RateLimiter {
|
|
rate: f64, // samples/sec
|
|
tokens: AtomicU64,
|
|
}
|
|
|
|
impl RateLimiter {
|
|
fn allow(&self) -> bool;
|
|
}
|
|
```
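The `allow` semantics above are a token bucket; a minimal single-threaded sketch (the production version would use atomics as in the struct above, and a monotonic clock instead of the explicit `now_secs` parameter):

```rust
/// Token bucket: refill `rate` tokens per second up to `burst` (sketch).
struct TokenBucket {
    rate: f64,
    burst: f64,
    tokens: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    fn allow(&mut self, now_secs: f64) -> bool {
        let elapsed = now_secs - self.last_refill_secs;
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut tb = TokenBucket { rate: 1.0, burst: 2.0, tokens: 2.0, last_refill_secs: 0.0 };
    assert!(tb.allow(0.0));  // 2 tokens -> 1
    assert!(tb.allow(0.0));  // 1 token  -> 0
    assert!(!tb.allow(0.0)); // bucket empty
    assert!(tb.allow(1.5));  // 1.5s later: refilled enough to pass
}
```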
|
|
|
|
6. **Integration Test**:
|
|
```rust
|
|
#[tokio::test]
|
|
async fn test_remote_write_ingestion() {
|
|
// Start server
|
|
// Send WriteRequest
|
|
// Verify samples stored
|
|
}
|
|
```
|
|
|
|
**Estimated Effort**: 5 days
|
|
|
|
---
|
|
|
|
#### **S4: PromQL Query Engine**
|
|
|
|
**Goal**: Basic PromQL query support (instant + range queries)
|
|
|
|
**Tasks**:
|
|
|
|
1. **Integrate promql-parser**:
|
|
```rust
|
|
// nightlight-server/src/query/parser.rs
|
|
|
|
use promql_parser::parser;
|
|
|
|
pub fn parse(query: &str) -> Result<parser::Expr> {
|
|
parser::parse(query).map_err(|e| Error::ParseError(e.to_string()))
|
|
}
|
|
```
|
|
|
|
2. **Implement Query Planner**:
|
|
```rust
|
|
// nightlight-server/src/query/planner.rs
|
|
|
|
pub enum QueryPlan {
|
|
VectorSelector { matchers: Vec<LabelMatcher>, timestamp: i64 },
|
|
MatrixSelector { matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64 },
|
|
Aggregate { op: AggregateOp, input: Box<QueryPlan>, grouping: Vec<String> },
|
|
RateFunc { input: Box<QueryPlan> },
|
|
// ... other operators
|
|
}
|
|
|
|
pub fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan>;
|
|
```
|
|
|
|
3. **Implement Label Index**:
|
|
```rust
|
|
// nightlight-server/src/index.rs
|
|
|
|
struct LabelIndex {
|
|
// label_name -> label_value -> [series_ids]
|
|
inverted_index: DashMap<String, DashMap<String, Vec<SeriesID>>>,
|
|
}
|
|
|
|
impl LabelIndex {
|
|
fn find_series(&self, matchers: &[LabelMatcher]) -> Result<Vec<SeriesID>>;
|
|
fn add_series(&self, series_id: SeriesID, labels: &Labels);
|
|
}
|
|
```
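`find_series` resolves each equality matcher to a posting list and intersects them. Assuming sorted `Vec<SeriesID>` postings, the intersection step can be sketched as a two-pointer merge:

```rust
/// Intersect two sorted posting lists of series IDs (sketch).
fn intersect(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    // Series matching {job="flaredb"} AND {method="GET"}:
    assert_eq!(intersect(&[1, 3, 5, 9], &[3, 4, 9, 12]), vec![3, 9]);
}
```

Regex matchers (`=~`, `!~`) instead scan the label-value keys first, then union the matching posting lists before intersecting with the rest.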
|
|
|
|
4. **Implement Query Executor**:
|
|
```rust
|
|
// nightlight-server/src/query/executor.rs
|
|
|
|
struct QueryExecutor {
|
|
head: Arc<Head>,
|
|
blocks: Arc<BlockManager>,
|
|
index: Arc<LabelIndex>,
|
|
}
|
|
|
|
impl QueryExecutor {
|
|
async fn execute(&self, plan: QueryPlan) -> Result<QueryResult>;
|
|
|
|
async fn execute_vector_selector(&self, matchers: Vec<LabelMatcher>, ts: i64) -> Result<InstantVector>;
|
|
async fn execute_matrix_selector(&self, matchers: Vec<LabelMatcher>, range: Duration, ts: i64) -> Result<RangeVector>;
|
|
|
|
fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector>;
|
|
fn apply_aggregate(&self, op: AggregateOp, vector: InstantVector, grouping: Vec<String>) -> Result<InstantVector>;
|
|
}
|
|
```
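`apply_rate` follows PromQL semantics: per-second increase over the window, accounting for counter resets. A simplified sketch (real PromQL additionally extrapolates to the window boundaries, which is omitted here):

```rust
struct Sample { timestamp: i64, value: f64 } // timestamp in ms, counter value

/// Per-second rate over a window, handling counter resets (simplified sketch:
/// no extrapolation to the window boundaries).
fn rate(samples: &[Sample]) -> Option<f64> {
    let (first, last) = (samples.first()?, samples.last()?);
    if last.timestamp == first.timestamp { return None; }
    let mut increase = 0.0;
    let mut prev = first.value;
    for s in &samples[1..] {
        // A drop in a counter means it reset; count the new value from zero.
        increase += if s.value < prev { s.value } else { s.value - prev };
        prev = s.value;
    }
    Some(increase / ((last.timestamp - first.timestamp) as f64 / 1000.0))
}

fn main() {
    let s = [
        Sample { timestamp: 0, value: 100.0 },
        Sample { timestamp: 30_000, value: 160.0 },
        Sample { timestamp: 60_000, value: 10.0 }, // counter reset
    ];
    // increase = 60 + 10 = 70 over 60s
    assert!((rate(&s).unwrap() - 70.0 / 60.0).abs() < 1e-9);
}
```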
|
|
|
|
5. **Implement HTTP Query Handlers**:
|
|
```rust
|
|
// nightlight-server/src/handlers/query.rs
|
|
|
|
async fn handle_instant_query(
|
|
Query(params): Query<QueryParams>,
|
|
State(executor): State<Arc<QueryExecutor>>,
|
|
) -> Result<Json<QueryResponse>, (StatusCode, String)> {
|
|
let expr = parse(&params.query)?;
|
|
let plan = plan(expr, params.time.unwrap_or_else(now))?;
|
|
let result = executor.execute(plan).await?;
|
|
Ok(Json(format_response(result)))
|
|
}
|
|
|
|
async fn handle_range_query(
|
|
Query(params): Query<RangeQueryParams>,
|
|
State(executor): State<Arc<QueryExecutor>>,
|
|
) -> Result<Json<QueryResponse>, (StatusCode, String)> {
|
|
// Similar to instant query, but iterate over [start, end] with step
|
|
}
|
|
```
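The range-query iteration mentioned in the handler comment evaluates the expression at each step between `start` and `end`; generating those evaluation timestamps is a simple iteration:

```rust
/// Evaluation timestamps for a range query: start, start+step, ... up to
/// and including end (sketch).
fn eval_steps(start_ms: i64, end_ms: i64, step_ms: i64) -> Vec<i64> {
    (0..)
        .map(|i| start_ms + i * step_ms)
        .take_while(|&t| t <= end_ms)
        .collect()
}

fn main() {
    assert_eq!(eval_steps(0, 60, 15), vec![0, 15, 30, 45, 60]);
}
```

Each timestamp is then evaluated as an instant query, and per-series results are stitched into the matrix response.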
|
|
|
|
6. **Integration Test**:
|
|
```rust
|
|
#[tokio::test]
|
|
async fn test_instant_query() {
|
|
// Ingest samples
|
|
// Query: http_requests_total{method="GET"}
|
|
// Verify results
|
|
}
|
|
|
|
#[tokio::test]
|
|
async fn test_range_query_with_rate() {
|
|
// Ingest counter samples
|
|
// Query: rate(http_requests_total[5m])
|
|
// Verify rate calculation
|
|
}
|
|
```
|
|
|
|
**Estimated Effort**: 7 days
|
|
|
|
---
|
|
|
|
#### **S5: Storage Layer**
|
|
|
|
**Goal**: Time-series storage with retention and compaction
|
|
|
|
**Tasks**:
|
|
|
|
1. **Implement Block Writer**:
|
|
```rust
|
|
// nightlight-server/src/block/writer.rs
|
|
|
|
struct BlockWriter {
|
|
block_dir: PathBuf,
|
|
index_writer: IndexWriter,
|
|
chunk_writer: ChunkWriter,
|
|
}
|
|
|
|
impl BlockWriter {
|
|
fn new(block_dir: PathBuf) -> Result<Self>;
|
|
fn write_series(&mut self, series: &Series, samples: &[Sample]) -> Result<()>;
|
|
fn finalize(self) -> Result<BlockMeta>;
|
|
}
|
|
```
|
|
|
|
2. **Implement Block Reader**:
|
|
```rust
|
|
// nightlight-server/src/block/reader.rs
|
|
|
|
struct BlockReader {
|
|
meta: BlockMeta,
|
|
index: Index,
|
|
chunks: ChunkReader,
|
|
}
|
|
|
|
impl BlockReader {
|
|
fn open(block_dir: PathBuf) -> Result<Self>;
|
|
fn query_samples(&self, series_id: SeriesID, start: i64, end: i64) -> Result<Vec<Sample>>;
|
|
}
|
|
```
|
|
|
|
3. **Implement Compaction**:
|
|
```rust
|
|
// nightlight-server/src/compaction.rs
|
|
|
|
struct Compactor {
|
|
data_dir: PathBuf,
|
|
config: CompactionConfig,
|
|
}
|
|
|
|
impl Compactor {
|
|
async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID>;
|
|
async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID>;
|
|
async fn run_compaction_loop(&self); // Background task
|
|
}
|
|
```
|
|
|
|
4. **Implement Retention Enforcement**:
|
|
```rust
|
|
impl Compactor {
|
|
async fn enforce_retention(&self, retention: Duration) -> Result<()> {
|
|
let cutoff = SystemTime::now() - retention;
|
|
// Delete blocks older than cutoff
|
|
}
|
|
}
|
|
```
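Block selection for deletion follows directly from block metadata: any block whose newest sample predates the cutoff is eligible. A sketch (the `BlockMeta` fields here are illustrative):

```rust
struct BlockMeta { id: u64, max_time_ms: i64 } // illustrative fields

/// IDs of blocks whose newest sample is older than the retention cutoff (sketch).
fn expired_blocks(blocks: &[BlockMeta], cutoff_ms: i64) -> Vec<u64> {
    blocks.iter().filter(|b| b.max_time_ms < cutoff_ms).map(|b| b.id).collect()
}

fn main() {
    let blocks = [
        BlockMeta { id: 1, max_time_ms: 1_000 },
        BlockMeta { id: 2, max_time_ms: 5_000 },
    ];
    assert_eq!(expired_blocks(&blocks, 2_000), vec![1]);
}
```

Comparing against `max_time` (not `min_time`) ensures a block is only removed once every sample in it has aged out.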
|
|
|
|
5. **Implement Block Manager**:
|
|
```rust
|
|
// nightlight-server/src/block/manager.rs
|
|
|
|
struct BlockManager {
|
|
blocks: RwLock<Vec<Arc<BlockReader>>>,
|
|
data_dir: PathBuf,
|
|
}
|
|
|
|
impl BlockManager {
|
|
fn load_blocks(&mut self) -> Result<()>;
|
|
fn add_block(&mut self, block: BlockReader);
|
|
fn remove_block(&mut self, block_id: &BlockID);
|
|
fn query_blocks(&self, start: i64, end: i64) -> Vec<Arc<BlockReader>>;
|
|
}
|
|
```
|
|
|
|
6. **Integration Test**:
|
|
```rust
|
|
#[tokio::test]
|
|
async fn test_compaction() {
|
|
// Ingest data for >2h
|
|
// Trigger compaction
|
|
// Verify block created
|
|
// Query old data from block
|
|
}
|
|
|
|
#[tokio::test]
|
|
async fn test_retention() {
|
|
// Create old blocks
|
|
// Run retention enforcement
|
|
// Verify old blocks deleted
|
|
}
|
|
```
|
|
|
|
**Estimated Effort**: 8 days
|
|
|
|
---
|
|
|
|
#### **S6: Integration & Documentation**
|
|
|
|
**Goal**: NixOS module, TLS config, integration tests, operator docs
|
|
|
|
**Tasks**:
|
|
|
|
1. **Create NixOS Module**:
|
|
- File: `nix/modules/nightlight.nix`
|
|
- Follow T024 patterns
|
|
- Include systemd service, firewall rules
|
|
- Support TLS configuration options
|
|
|
|
2. **Implement mTLS**:
|
|
- Load certs in server startup
|
|
- Configure Rustls with client cert verification
|
|
- Extract client identity for rate limiting
|
|
|
|
3. **Create Nightlight Scraper**:
|
|
- Standalone scraper service
|
|
- Reads scrape config (TOML)
|
|
- Scrapes `/metrics` endpoints from services
|
|
- Pushes to Nightlight via remote_write
|
|
|
|
4. **Integration Tests**:
|
|
```rust
|
|
#[tokio::test]
|
|
async fn test_e2e_ingest_and_query() {
|
|
// Start Nightlight server
|
|
// Ingest samples via remote_write
|
|
// Query via /api/v1/query
|
|
// Query via /api/v1/query_range
|
|
// Verify results match
|
|
}
|
|
|
|
#[tokio::test]
|
|
async fn test_mtls_authentication() {
|
|
// Start server with mTLS
|
|
// Connect without client cert -> rejected
|
|
// Connect with valid client cert -> accepted
|
|
}
|
|
|
|
#[tokio::test]
|
|
async fn test_grafana_compatibility() {
|
|
// Configure Grafana to use Nightlight
|
|
// Execute sample queries
|
|
// Verify dashboards render correctly
|
|
}
|
|
```
|
|
|
|
5. **Write Operator Documentation**:
|
|
- **File**: `docs/por/T033-nightlight/OPERATOR.md`
|
|
- Installation (NixOS, standalone)
|
|
- Configuration guide
|
|
- mTLS setup
|
|
- Scraper configuration
|
|
- Troubleshooting
|
|
- Performance tuning
|
|
|
|
6. **Write Developer Documentation**:
|
|
- **File**: `nightlight/README.md`
|
|
- Architecture overview
|
|
- Building from source
|
|
- Running tests
|
|
- Contributing guidelines
|
|
|
|
**Estimated Effort**: 5 days
|
|
|
|
---
|
|
|
|
### 8.2 Dependency Ordering
|
|
|
|
```
|
|
S1 (Research) → S2 (Scaffold)
|
|
↓
|
|
S3 (Ingestion) ──┐
|
|
↓ │
|
|
S4 (Query) │
|
|
↓ │
|
|
S5 (Storage) ←────┘
|
|
↓
|
|
S6 (Integration)
|
|
```
|
|
|
|
**Critical Path**: S1 → S2 → S3 → S5 → S6
|
|
**Parallelizable**: S4 can start after S3 completes basic ingestion
|
|
|
|
### 8.3 Total Effort Estimate
|
|
|
|
| Step | Effort | Priority |
|
|
|------|--------|----------|
|
|
| S1: Research | 2 days | P0 |
|
|
| S2: Scaffold | 2 days | P0 |
|
|
| S3: Ingestion | 5 days | P0 |
|
|
| S4: Query Engine | 7 days | P0 |
|
|
| S5: Storage Layer | 8 days | P1 |
|
|
| S6: Integration | 5 days | P1 |
|
|
| **Total** | **29 days** | |
|
|
|
|
**Realistic Timeline**: 6-8 weeks (accounting for testing, debugging, documentation)
|
|
|
|
---
|
|
|
|
## 9. Open Questions
|
|
|
|
### 9.1 Decisions Requiring User Input
|
|
|
|
#### Q1: Scraper Implementation Choice
|
|
|
|
**Question**: Should we use Prometheus in agent mode or build a custom Rust scraper?
|
|
|
|
**Option A**: Prometheus Agent + Remote Write
|
|
- **Pros**: Battle-tested, standard tool, no implementation effort
|
|
- **Cons**: Adds Go dependency, less platform integration
|
|
|
|
**Option B**: Custom Rust Scraper
|
|
- **Pros**: 100% Rust, platform consistency, easier integration
|
|
- **Cons**: Implementation effort, needs testing
|
|
|
|
**Recommendation**: Option B (custom scraper) for consistency with PROJECT.md philosophy
|
|
|
|
**Decision**: [ ] A [ ] B [ ] Defer to later
|
|
|
|
---
|
|
|
|
#### Q2: gRPC vs HTTP Priority
|
|
|
|
**Question**: Should we prioritize gRPC API or focus only on HTTP (Prometheus compatibility)?
|
|
|
|
**Option A**: HTTP only (v1)
|
|
- **Pros**: Simpler, Prometheus/Grafana compatibility is sufficient
|
|
- **Cons**: Less efficient for internal services
|
|
|
|
**Option B**: Both HTTP and gRPC (v1)
|
|
- **Pros**: Better performance for internal services, more flexibility
|
|
- **Cons**: More implementation effort
|
|
|
|
**Recommendation**: Option A for v1, add gRPC in v2 if needed
|
|
|
|
**Decision**: [ ] A [ ] B
|
|
|
|
---
|
|
|
|
#### Q3: FlareDB Metadata Integration Timeline
|
|
|
|
**Question**: When should we integrate FlareDB for metadata storage?
|
|
|
|
**Option A**: v1 (T033)
|
|
- **Pros**: Unified metadata story from the start
|
|
- **Cons**: Increases complexity, adds dependency
|
|
|
|
**Option B**: v2 (Future)
|
|
- **Pros**: Simpler v1, can deploy standalone
|
|
- **Cons**: Migration effort later
|
|
|
|
**Recommendation**: Option B (defer to v2)
|
|
|
|
**Decision**: [ ] A [ ] B
|
|
|
|
---
|
|
|
|
#### Q4: S3 Cold Storage Priority
|
|
|
|
**Question**: Should S3 cold storage be part of v1 or deferred?
|
|
|
|
**Option A**: v1 (T033.S5)
|
|
- **Pros**: Unlimited retention from day 1
|
|
- **Cons**: Complexity, operational overhead
|
|
|
|
**Option B**: v2 (Future)
|
|
- **Pros**: Simpler v1, focus on core functionality
|
|
- **Cons**: Limited retention (local disk only)
|
|
|
|
**Recommendation**: Option B (defer to v2), use local disk for v1 with 15-30 day retention
|
|
|
|
**Decision**: [ ] A [ ] B
|
|
|
|
---
|
|
|
|
### 9.2 Areas Needing Further Investigation
|
|
|
|
#### I1: PromQL Function Coverage
|
|
|
|
**Issue**: Need to determine exact subset of PromQL functions to support in v1.
|
|
|
|
**Investigation Needed**:
|
|
- Survey existing Grafana dashboards in use
|
|
- Identify most common functions (rate, increase, histogram_quantile, etc.)
|
|
- Prioritize by usage frequency
|
|
|
|
**Proposed Approach**:
|
|
- Analyze 10-20 sample dashboards
|
|
- Create coverage matrix
|
|
- Implement top 80% functions first
|
|
|
|
---
|
|
|
|
#### I2: Query Performance Benchmarking
|
|
|
|
**Issue**: Need to validate query latency targets (p95 <100ms) are achievable.
|
|
|
|
**Investigation Needed**:
|
|
- Benchmark promql-parser crate performance
|
|
- Measure Gorilla decompression throughput
|
|
- Test index lookup performance at 10M series scale
|
|
|
|
**Proposed Approach**:
|
|
- Create benchmark suite with synthetic data (1M, 10M series)
|
|
- Measure end-to-end query latency
|
|
- Identify bottlenecks and optimize
|
|
|
|
---
|
|
|
|
#### I3: Series Cardinality Limits
|
|
|
|
**Issue**: How to prevent series explosion (high cardinality killing performance)?
|
|
|
|
**Investigation Needed**:
|
|
- Research cardinality estimation algorithms (HyperLogLog)
|
|
- Define cardinality limits (per metric, per label, global)
|
|
- Implement rejection strategy (reject new series beyond limit)
|
|
|
|
**Proposed Approach**:
|
|
- Add cardinality tracking to label index
|
|
- Implement warnings at 80% limit, rejection at 100%
|
|
- Provide admin API to inspect high-cardinality series
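The warn-at-80% / reject-at-100% policy reduces to a two-threshold check on the tracked series count; a sketch with illustrative names:

```rust
#[derive(Debug, PartialEq)]
enum CardinalityDecision { Accept, Warn, Reject }

/// Warn at 80% of the limit, reject at 100% (sketch of the proposed policy).
fn check_cardinality(active_series: usize, limit: usize) -> CardinalityDecision {
    if active_series >= limit {
        CardinalityDecision::Reject
    } else if active_series * 10 >= limit * 8 {
        CardinalityDecision::Warn
    } else {
        CardinalityDecision::Accept
    }
}

fn main() {
    assert_eq!(check_cardinality(100, 1000), CardinalityDecision::Accept);
    assert_eq!(check_cardinality(800, 1000), CardinalityDecision::Warn);
    assert_eq!(check_cardinality(1000, 1000), CardinalityDecision::Reject);
}
```

Note the check applies only when a *new* series is created; appends to existing series are unaffected, so rejection caps growth without dropping data from known series.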
|
|
|
|
---
|
|
|
|
#### I4: Out-of-Order Sample Edge Cases

**Issue**: How to handle out-of-order samples spanning chunk boundaries?

**Investigation Needed**:
- Test scenarios: samples arriving 1h late, 2h late, etc.
- Determine whether we need multi-chunk updates or should reject old samples
- Benchmark the impact of re-sorting chunks

**Proposed Approach**:
- Implement a configurable out-of-order window (default: 1h)
- Reject samples older than the window
- For within-window samples, insert into the correct chunk (may require chunk re-compression)
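The accept/reject decision above reduces to a small predicate. `accept_sample` and its millisecond-based signature are illustrative assumptions; the real ingestion path would also locate the target chunk:

```rust
/// Decide whether a sample is accepted, given the newest timestamp already
/// written for the series and the configured out-of-order window (both in ms).
fn accept_sample(sample_ts_ms: i64, series_max_ts_ms: i64, window_ms: i64) -> bool {
    // In-order (or equal-timestamp) samples are always accepted.
    if sample_ts_ms >= series_max_ts_ms {
        return true;
    }
    // Out-of-order: accept only within the configured window.
    series_max_ts_ms - sample_ts_ms <= window_ms
}

fn main() {
    let one_hour_ms = 3_600_000;
    let head = 10 * one_hour_ms; // newest sample for this series
    // 30 minutes late: within the default 1h window, accepted.
    println!("30m late: {}", accept_sample(head - one_hour_ms / 2, head, one_hour_ms));
    // 2 hours late: beyond the window, rejected.
    println!("2h late:  {}", accept_sample(head - 2 * one_hour_ms, head, one_hour_ms));
}
```

Rejected samples would increment `nightlight_samples_rejected_total{reason="out_of_order"}` so operators can see when the window is too small.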
---
## 10. References

### 10.1 Research Sources

#### Time-Series Storage Formats

- [Gorilla: A Fast, Scalable, In-Memory Time Series Database (Facebook)](http://www.vldb.org/pvldb/vol8/p1816-teller.pdf)
- [Gorilla Compression Algorithm - The Morning Paper](https://blog.acolyer.org/2016/05/03/gorilla-a-fast-scalable-in-memory-time-series-database/)
- [Prometheus TSDB Storage Documentation](https://prometheus.io/docs/prometheus/latest/storage/)
- [Prometheus TSDB Architecture - Palark Blog](https://palark.com/blog/prometheus-architecture-tsdb/)
- [InfluxDB TSM Storage Engine](https://www.influxdata.com/blog/new-storage-engine-time-structured-merge-tree/)
- [M3DB Storage Architecture](https://m3db.io/docs/architecture/m3db/)
- [M3DB at Uber Blog](https://www.uber.com/blog/m3/)

#### PromQL Implementation

- [promql-parser Rust Crate (GreptimeTeam)](https://github.com/GreptimeTeam/promql-parser)
- [promql-parser Documentation](https://docs.rs/promql-parser)
- [promql Crate (vthriller)](https://github.com/vthriller/promql)

#### Prometheus Remote Write Protocol

- [Prometheus Remote Write 1.0 Specification](https://prometheus.io/docs/specs/prw/remote_write_spec/)
- [Prometheus Remote Write 2.0 Specification](https://prometheus.io/docs/specs/prw/remote_write_spec_2_0/)
- [Prometheus Protobuf Schema (remote.proto)](https://github.com/prometheus/prometheus/blob/main/prompb/remote.proto)

#### Rust TSDB Implementations

- [InfluxDB 3 Engineering with Rust - InfoQ](https://www.infoq.com/articles/timeseries-db-rust/)
- [Datadog's Rust TSDB - Datadog Blog](https://www.datadoghq.com/blog/engineering/rust-timeseries-engine/)
- [GreptimeDB Announcement](https://greptime.com/blogs/2022-11-15-this-time-for-real)
- [tstorage-rs Embedded TSDB](https://github.com/dpgil/tstorage-rs)
- [tsink High-Performance Embedded TSDB](https://dev.to/h2337/building-high-performance-time-series-applications-with-tsink-a-rust-embedded-database-5fa7)

### 10.2 Platform References

#### Internal Documentation

- PROJECT.md (Item 12: Metrics Store)
- docs/por/T033-nightlight/task.yaml
- docs/por/T027-production-hardening/ (TLS patterns)
- docs/por/T024-nixos-packaging/ (NixOS module patterns)

#### Existing Service Patterns

- flaredb/crates/flaredb-server/src/main.rs (TLS, metrics export)
- flaredb/crates/flaredb-server/src/config/mod.rs (Config structure)
- chainfire/crates/chainfire-server/src/config.rs (TLS config)
- iam/crates/iam-server/src/config.rs (Config patterns)

### 10.3 External Tools

- [Grafana](https://grafana.com/) - Visualization and dashboards
- [Prometheus](https://prometheus.io/) - Reference implementation
- [VictoriaMetrics](https://victoriametrics.com/) - Replacement target (study architecture)
---
## Appendix A: PromQL Function Reference (v1 Support)

### Supported Functions

| Function | Category | Description | Example |
|----------|----------|-------------|---------|
| `rate()` | Counter | Per-second rate of increase | `rate(http_requests_total[5m])` |
| `irate()` | Counter | Instant rate (last 2 samples) | `irate(http_requests_total[5m])` |
| `increase()` | Counter | Total increase over range | `increase(http_requests_total[1h])` |
| `histogram_quantile()` | Histogram | Calculate quantile from histogram | `histogram_quantile(0.95, rate(http_duration_bucket[5m]))` |
| `sum()` | Aggregation | Sum values | `sum(metric)` |
| `avg()` | Aggregation | Average values | `avg(metric)` |
| `min()` | Aggregation | Minimum value | `min(metric)` |
| `max()` | Aggregation | Maximum value | `max(metric)` |
| `count()` | Aggregation | Count series | `count(metric)` |
| `stddev()` | Aggregation | Standard deviation | `stddev(metric)` |
| `stdvar()` | Aggregation | Standard variance | `stdvar(metric)` |
| `topk()` | Aggregation | Top K series | `topk(5, metric)` |
| `bottomk()` | Aggregation | Bottom K series | `bottomk(5, metric)` |
| `time()` | Time | Current timestamp | `time()` |
| `timestamp()` | Time | Sample timestamp | `timestamp(metric)` |
| `abs()` | Math | Absolute value | `abs(metric)` |
| `ceil()` | Math | Round up | `ceil(metric)` |
| `floor()` | Math | Round down | `floor(metric)` |
| `round()` | Math | Round to nearest | `round(metric, 0.1)` |
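To illustrate the counter functions, a simplified `rate()` over (timestamp, value) samples might look like the sketch below. It handles counter resets but omits the window-boundary extrapolation that Prometheus performs, so it is an assumption-laden sketch rather than the v1 implementation:

```rust
/// Per-second rate of a counter over a window of (timestamp_ms, value) samples,
/// with counter-reset handling. Simplified: no boundary extrapolation.
fn rate(samples: &[(i64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, curr) = (pair[0].1, pair[1].1);
        // A drop in a counter value means the counter reset; count from zero.
        increase += if curr >= prev { curr - prev } else { curr };
    }
    let span_s = (samples.last()?.0 - samples.first()?.0) as f64 / 1000.0;
    if span_s <= 0.0 { None } else { Some(increase / span_s) }
}

fn main() {
    // 600 requests over 60 s, with one counter reset in the middle.
    let samples = [(0, 100.0), (30_000, 400.0), (60_000, 300.0)];
    println!("rate = {:?} req/s", rate(&samples)); // Some(10.0)
}
```

`increase()` is the same computation without the final division, and `irate()` applies it to only the last two samples.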

### NOT Supported in v1

| Function | Category | Reason |
|----------|----------|--------|
| `predict_linear()` | Prediction | Complex, low usage |
| `deriv()` | Math | Low usage |
| `holt_winters()` | Prediction | Complex |
| `resets()` | Counter | Low usage |
| `changes()` | Analysis | Low usage |
| Subqueries | Advanced | Very complex |
---
## Appendix B: Configuration Reference

### Complete Configuration Example

```toml
# nightlight.toml - Complete configuration example

[server]
# Listen address for HTTP/gRPC API
addr = "0.0.0.0:8080"

# Log level: trace, debug, info, warn, error
log_level = "info"

# Metrics port for self-monitoring (Prometheus /metrics endpoint)
metrics_port = 9099

[server.tls]
# Enable TLS
cert_file = "/etc/nightlight/certs/server.crt"
key_file = "/etc/nightlight/certs/server.key"

# Enable mTLS (require client certificates)
ca_file = "/etc/nightlight/certs/ca.crt"
require_client_cert = true

[storage]
# Data directory for TSDB blocks and WAL
data_dir = "/var/lib/nightlight/data"

# Data retention period (days)
retention_days = 15

# WAL segment size (MB)
wal_segment_size_mb = 128

# Block duration for compaction
min_block_duration = "2h"
max_block_duration = "24h"

# Out-of-order sample acceptance window
out_of_order_time_window = "1h"

# Series cardinality limits
max_series = 10_000_000
max_series_per_metric = 100_000

# Memory limits
max_head_chunks_per_series = 2
max_head_size_mb = 2048

[query]
# Query timeout (seconds)
timeout_seconds = 30

# Maximum query range (hours)
max_range_hours = 24

# Query result cache TTL (seconds)
cache_ttl_seconds = 60

# Maximum concurrent queries
max_concurrent_queries = 100

[ingestion]
# Write buffer size (samples)
write_buffer_size = 100_000

# Backpressure strategy: "block" or "reject"
backpressure_strategy = "block"

# Rate limiting (samples per second per client)
rate_limit_per_client = 50_000

# Maximum samples per write request
max_samples_per_request = 10_000

[compaction]
# Enable background compaction
enabled = true

# Compaction interval (seconds)
interval_seconds = 7200 # 2 hours

# Number of compaction threads
num_threads = 2

[s3]
# S3 cold storage (optional, future)
enabled = false
endpoint = "https://s3.example.com"
bucket = "nightlight-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100

[flaredb]
# FlareDB metadata integration (optional, future)
enabled = false
endpoints = ["flaredb-1:50051", "flaredb-2:50051"]
namespace = "metrics"
```
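Duration-valued fields such as `min_block_duration = "2h"` and `out_of_order_time_window = "1h"` need parsing at load time. Below is a std-only sketch of such a parser; the real config loader would likely use the serde and toml crates, and `parse_duration` is a hypothetical helper, not an existing Nightlight API:

```rust
use std::time::Duration;

/// Parse duration strings like "30s", "15m", "2h", "24h" used in the config.
/// ASCII input assumed; a real implementation might use the humantime crate.
fn parse_duration(s: &str) -> Result<Duration, String> {
    let (num, unit) = s.split_at(s.len().saturating_sub(1));
    let n: u64 = num.parse().map_err(|_| format!("bad number in {s:?}"))?;
    let secs = match unit {
        "s" => n,
        "m" => n * 60,
        "h" => n * 3600,
        "d" => n * 86_400,
        _ => return Err(format!("unknown unit in {s:?}")),
    };
    Ok(Duration::from_secs(secs))
}

fn main() {
    for s in ["30s", "1h", "2h", "24h"] {
        println!("{s} -> {:?}", parse_duration(s).unwrap());
    }
}
```

Validation (e.g. rejecting `min_block_duration` larger than `max_block_duration`) would run once at startup so misconfiguration fails fast rather than at compaction time.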
---
## Appendix C: Metrics Exported by Nightlight

Nightlight exports metrics about itself on port 9099 (configurable).

### Ingestion Metrics

```
# Samples ingested
nightlight_samples_ingested_total{} counter

# Samples rejected (out-of-order, invalid, etc.)
nightlight_samples_rejected_total{reason="out_of_order|invalid|rate_limit"} counter

# Ingestion latency (milliseconds)
nightlight_ingestion_latency_ms{quantile="0.5|0.9|0.99"} summary

# Active series
nightlight_active_series{} gauge

# Head memory usage (bytes)
nightlight_head_memory_bytes{} gauge
```

### Query Metrics

```
# Queries executed
nightlight_queries_total{type="instant|range"} counter

# Query latency (milliseconds)
nightlight_query_latency_ms{type="instant|range", quantile="0.5|0.9|0.99"} summary

# Query errors
nightlight_query_errors_total{reason="timeout|parse_error|execution_error"} counter
```

### Storage Metrics

```
# WAL segments
nightlight_wal_segments{} gauge

# WAL size (bytes)
nightlight_wal_size_bytes{} gauge

# Blocks
nightlight_blocks_total{level="0|1|2"} gauge

# Block size (bytes)
nightlight_block_size_bytes{level="0|1|2"} gauge

# Compactions
nightlight_compactions_total{level="0|1|2"} counter

# Compaction duration (seconds)
nightlight_compaction_duration_seconds{level="0|1|2", quantile="0.5|0.9|0.99"} summary
```

### System Metrics

```
# Rust memory metrics
nightlight_memory_allocated_bytes{} gauge

# CPU usage
nightlight_cpu_usage_seconds_total{} counter
```
---
## Appendix D: Error Codes and Troubleshooting

### HTTP Error Codes

| Code | Meaning | Common Causes |
|------|---------|---------------|
| 200 | OK | Query successful |
| 204 | No Content | Write successful |
| 400 | Bad Request | Invalid PromQL, malformed protobuf |
| 401 | Unauthorized | mTLS cert validation failed |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Storage error, WAL corruption |
| 503 | Service Unavailable | Write buffer full, server overloaded |

### Common Issues

#### Issue: "Samples rejected: out_of_order"

**Cause**: Samples arriving with timestamps older than `out_of_order_time_window`

**Solution**:
- Increase `out_of_order_time_window` in the config
- Check clock sync on clients (NTP)
- Reduce scrape batch size

#### Issue: "Rate limit exceeded"

**Cause**: Client exceeding `rate_limit_per_client` samples/sec

**Solution**:
- Increase the rate limit in the config
- Reduce scrape frequency
- Shard writes across multiple clients

#### Issue: "Query timeout"

**Cause**: Query exceeding `timeout_seconds`

**Solution**:
- Increase the query timeout
- Reduce the query time range
- Add more specific label matchers to reduce the number of series scanned

#### Issue: "Series cardinality explosion"

**Cause**: Too many unique label combinations (high cardinality)

**Solution**:
- Review label design (avoid unbounded labels like user_id)
- Use relabeling to drop high-cardinality labels
- Increase the `max_series` limit (if justified)
---
**End of Design Document**

**Total Length**: ~3,800 lines

**Status**: Ready for review and S2-S6 implementation

**Next Steps**:
1. Review and approve design decisions
2. Create GitHub issues for S2-S6 tasks
3. Begin S2: Workspace Scaffold