Nightlight Design Document
Project: Nightlight - VictoriaMetrics OSS Replacement
Task: T033.S1 Research & Architecture
Version: 1.0
Date: 2025-12-10
Author: PeerB
Table of Contents
- Executive Summary
- Requirements
- Time-Series Storage Model
- Push Ingestion API
- PromQL Query Engine
- Storage Backend Architecture
- Integration Points
- Implementation Plan
- Open Questions
- References
1. Executive Summary
1.1 Overview
Nightlight is a fully open-source, distributed time-series database designed as a replacement for VictoriaMetrics, addressing the critical requirement that VictoriaMetrics' mTLS support is a paid feature. As the final component (Item 12/12) of PROJECT.md, Nightlight completes the observability stack for the Japanese cloud platform.
1.2 High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Service Mesh │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ FlareDB │ │ ChainFire│ │ PlasmaVMC│ │ IAM │ ... │
│ │ :9092 │ │ :9091 │ │ :9093 │ │ :9094 │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
│ └────────────┴────────────┴────────────┘ │
│ │ │
│ │ Push (remote_write) │
│ │ mTLS │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Nightlight Server │ │
│ │ ┌────────────────┐ │ │
│ │ │ Ingestion API │ │ ← Prometheus remote_write │
│ │ │ (gRPC/HTTP) │ │ │
│ │ └────────┬───────┘ │ │
│ │ │ │ │
│ │ ┌────────▼───────┐ │ │
│ │ │ Write Buffer │ │ │
│ │ │ (In-Memory) │ │ │
│ │ └────────┬───────┘ │ │
│ │ │ │ │
│ │ ┌────────▼───────┐ │ │
│ │ │ Storage Engine│ │ │
│ │ │ ┌──────────┐ │ │ │
│ │ │ │ Head │ │ │ ← WAL + In-Memory Index │
│ │ │ │ (Active) │ │ │ │
│ │ │ └────┬─────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ┌────▼─────┐ │ │ │
│ │ │ │ Blocks │ │ │ ← Immutable, Compressed │
│ │ │ │ (TSDB) │ │ │ │
│ │ │ └──────────┘ │ │ │
│ │ └────────────────┘ │ │
│ │ │ │ │
│ │ ┌────────▼───────┐ │ │
│ │ │ Query Engine │ │ ← PromQL Execution │
│ │ │ (PromQL AST) │ │ │
│ │ └────────┬───────┘ │ │
│ │ │ │ │
│ └───────────┼──────────┘ │
│ │ │
│ │ Query (HTTP) │
│ │ mTLS │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ Grafana / Clients │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────┐
│ FlareDB Cluster │ ← Metadata (optional)
│ (Metadata Store) │
└─────────────────────┘
┌─────────────────────┐
│ S3-Compatible │ ← Cold Storage (future)
│ Object Storage │
└─────────────────────┘
1.3 Key Design Decisions
- Storage Format: Hybrid approach using Prometheus TSDB block design with Gorilla compression
  - Rationale: Battle-tested, excellent compression (1-2 bytes/sample), widely understood
- Storage Backend: Dedicated time-series engine with optional FlareDB metadata integration
  - Rationale: Time-series workloads have unique access patterns; KV stores are not optimal for sample storage
  - FlareDB reserved for metadata (series labels, index) in distributed scenarios
- PromQL Subset: Support 80% of common use cases (instant/range queries, basic aggregations, rate/increase)
  - Rationale: Full PromQL compatibility is complex; focus on practical operator needs
- Push Model: Prometheus remote_write v1.0 protocol via HTTP + gRPC APIs
  - Rationale: Standard protocol, Snappy compression built-in, client library availability
- mTLS Integration: Consistent with T027/T031 patterns (cert_file, key_file, ca_file, require_client_cert)
  - Rationale: Unified security model across all platform services
1.4 Success Criteria
- Accept metrics from 8+ services (ports 9091-9099) via remote_write
- Query latency <100ms for instant queries (p95)
- Compression ratio ≥10:1 (target: 1.5-2 bytes/sample)
- Support 100K samples/sec write throughput per instance
- PromQL queries cover 80% of Grafana dashboard use cases
- Zero vendor lock-in (100% OSS, no paid features)
2. Requirements
2.1 Functional Requirements
FR-1: Push-Based Metric Ingestion
- FR-1.1: Accept Prometheus remote_write v1.0 protocol (HTTP POST)
- FR-1.2: Support Snappy-compressed protobuf payloads
- FR-1.3: Validate metric names and labels per Prometheus naming conventions
- FR-1.4: Handle out-of-order samples within a configurable time window (default: 1h)
- FR-1.5: Deduplicate duplicate samples (same timestamp + labels)
- FR-1.6: Return backpressure signals (HTTP 429/503) when buffer is full
FR-2: PromQL Query Engine
- FR-2.1: Support instant queries (/api/v1/query)
- FR-2.2: Support range queries (/api/v1/query_range)
- FR-2.3: Support label queries (/api/v1/label/<name>/values, /api/v1/labels)
- FR-2.4: Support series metadata queries (/api/v1/series)
- FR-2.5: Implement core PromQL functions (see Section 5.2)
- FR-2.6: Support Prometheus HTTP API JSON response format
FR-3: Time-Series Storage
- FR-3.1: Store samples with millisecond timestamp precision
- FR-3.2: Support configurable retention periods (default: 15 days, configurable 1-365 days)
- FR-3.3: Automatic background compaction of blocks
- FR-3.4: Crash recovery via Write-Ahead Log (WAL)
- FR-3.5: Series cardinality limits to prevent explosion (default: 10M series)
FR-4: Security & Authentication
- FR-4.1: mTLS support for ingestion and query APIs
- FR-4.2: Optional basic authentication for HTTP endpoints
- FR-4.3: Rate limiting per client (based on mTLS certificate CN or IP)
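For illustration only, the mTLS settings could take the same TOML shape used by T027/T031; the section name and file paths below are assumptions, not final configuration keys:

```toml
# Hypothetical TLS section; field names follow the
# cert_file / key_file / ca_file / require_client_cert pattern (T027/T031)
[server.tls]
cert_file = "/etc/nightlight/tls/server.crt"
key_file = "/etc/nightlight/tls/server.key"
ca_file = "/etc/nightlight/tls/ca.crt"
require_client_cert = true
```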
FR-5: Operational Features
- FR-5.1: Prometheus-compatible /metrics endpoint for self-monitoring
- FR-5.2: Health check endpoints (/health, /ready)
- FR-5.3: Admin API for series deletion, compaction trigger
- FR-5.4: TOML configuration file support
- FR-5.5: Environment variable overrides
2.2 Non-Functional Requirements
NFR-1: Performance
- NFR-1.1: Ingestion throughput: ≥100K samples/sec per instance
- NFR-1.2: Query latency (p95): <100ms for instant queries, <500ms for range queries (1h window)
- NFR-1.3: Compression ratio: ≥10:1 (target: 1.5-2 bytes/sample)
- NFR-1.4: Memory usage: <2GB for 1M active series
NFR-2: Scalability
- NFR-2.1: Vertical scaling: Support 10M active series per instance
- NFR-2.2: Horizontal scaling: Support sharding across multiple instances (future work)
- NFR-2.3: Storage: Support local disk + optional S3-compatible backend for cold data
NFR-3: Reliability
- NFR-3.1: No data loss for committed samples (WAL durability)
- NFR-3.2: Graceful degradation under load (reject writes with backpressure, not crash)
- NFR-3.3: Crash recovery time: <30s for 10M series
NFR-4: Maintainability
- NFR-4.1: Codebase consistency with other platform services (FlareDB, ChainFire patterns)
- NFR-4.2: 100% Rust, no CGO dependencies
- NFR-4.3: Comprehensive unit and integration tests
- NFR-4.4: Operator documentation with runbooks
NFR-5: Compatibility
- NFR-5.1: Prometheus remote_write v1.0 protocol compatibility
- NFR-5.2: Prometheus HTTP API compatibility (subset: query, query_range, labels, series)
- NFR-5.3: Grafana data source compatibility
2.3 Out of Scope (Explicitly Not Supported in v1)
- Prometheus remote_read protocol (pull-based; platform uses push)
- Full PromQL compatibility (complex subqueries, advanced functions)
- Multi-tenancy (single-tenant per instance; use multiple instances for multi-tenant)
- Distributed query federation (single-instance queries only)
- Recording rules and alerting (use separate Prometheus/Alertmanager for this)
3. Time-Series Storage Model
3.1 Data Model
3.1.1 Metric Structure
A time-series metric in Nightlight follows the Prometheus data model:
metric_name{label1="value1", label2="value2", ...} value timestamp
Example:
http_requests_total{method="GET", status="200", service="flaredb"} 1543 1733832000000
Components:
- Metric Name: Identifier for the measurement (e.g., http_requests_total)
  - Must match regex: [a-zA-Z_:][a-zA-Z0-9_:]*
- Labels: Key-value pairs for dimensionality (e.g., {method="GET", status="200"})
  - Label names: [a-zA-Z_][a-zA-Z0-9_]*
  - Label values: Any UTF-8 string
  - Reserved labels: __name__ (stores the metric name); labels starting with __ are internal
- Value: Float64 sample value
- Timestamp: Millisecond precision (int64 milliseconds since Unix epoch)
3.1.2 Series Identification
A series is uniquely identified by its metric name + label set:
// Pseudo-code representation
struct SeriesID {
hash: u64, // FNV-1a hash of sorted labels
}
struct Series {
id: SeriesID,
labels: BTreeMap<String, String>, // Sorted for consistent hashing
chunks: Vec<ChunkRef>,
}
Series ID calculation:
1. Sort labels lexicographically (including the __name__ label)
2. Concatenate as: label1_name + \0 + label1_value + \0 + label2_name + \0 + ...
3. Compute the FNV-1a 64-bit hash of the concatenation
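A std-only sketch of this scheme (the helper names are illustrative, not the final API):

```rust
use std::collections::BTreeMap;

const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

fn fnv1a_update(hash: &mut u64, bytes: &[u8]) {
    for &b in bytes {
        *hash ^= b as u64;
        *hash = hash.wrapping_mul(FNV_PRIME);
    }
}

/// Hash sorted "name\0value\0" pairs; BTreeMap iteration order is the
/// lexicographic sort required by the scheme above.
fn series_id(labels: &BTreeMap<String, String>) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for (name, value) in labels {
        fnv1a_update(&mut hash, name.as_bytes());
        fnv1a_update(&mut hash, &[0]);
        fnv1a_update(&mut hash, value.as_bytes());
        fnv1a_update(&mut hash, &[0]);
    }
    hash
}

fn main() {
    let mut a = BTreeMap::new();
    a.insert("__name__".to_string(), "http_requests_total".to_string());
    a.insert("method".to_string(), "GET".to_string());

    // Same labels inserted in a different order produce the same ID.
    let mut b = BTreeMap::new();
    b.insert("method".to_string(), "GET".to_string());
    b.insert("__name__".to_string(), "http_requests_total".to_string());
    assert_eq!(series_id(&a), series_id(&b));

    // Any label change produces a different ID.
    let mut c = a.clone();
    c.insert("status".to_string(), "200".to_string());
    assert_ne!(series_id(&a), series_id(&c));
}
```

The \0 separators matter: without them, {a="bc"} and {ab="c"} would hash to the same byte stream.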
3.2 Storage Format
3.2.1 Architecture Overview
Nightlight uses a hybrid storage architecture inspired by Prometheus TSDB and Gorilla:
┌─────────────────────────────────────────────────────────────────┐
│ Memory Layer (Head) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Series Map │ │ WAL Segment │ │ Write Buffer │ │
│ │ (In-Memory │ │ (Disk) │ │ (MPSC Queue) │ │
│ │ Index) │ │ │ │ │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┴─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Active Chunks │ │
│ │ (Gorilla-compressed) │ │
│ │ - 2h time windows │ │
│ │ - Delta-of-delta TS │ │
│ │ - XOR float encoding │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
│ Compaction Trigger
│ (every 2h or on shutdown)
▼
┌─────────────────────────────────────────────────────────────────┐
│ Disk Layer (Blocks) │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Block 1 │ │ Block 2 │ │ Block N │ │
│ │ [0h - 2h) │ │ [2h - 4h) │ │ [Nh - (N+2)h) │ │
│ │ │ │ │ │ │ │
│ │ ├─ meta.json │ │ ├─ meta.json │ │ ├─ meta.json │ │
│ │ ├─ index │ │ ├─ index │ │ ├─ index │ │
│ │ ├─ chunks/000 │ │ ├─ chunks/000 │ │ ├─ chunks/000 │ │
│ │ └─ tombstones │ │ └─ tombstones │ │ └─ tombstones │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
3.2.2 Write-Ahead Log (WAL)
Purpose: Durability and crash recovery
Format: Append-only log segments (128MB default size)
WAL Structure:
data/
wal/
00000001 ← Segment 1 (128MB)
00000002 ← Segment 2 (active)
WAL Record Format (inspired by LevelDB):
┌───────────────────────────────────────────────────┐
│ CRC32 (4 bytes) │
├───────────────────────────────────────────────────┤
│ Length (4 bytes, little-endian) │
├───────────────────────────────────────────────────┤
│ Type (1 byte): FULL | FIRST | MIDDLE | LAST │
├───────────────────────────────────────────────────┤
│ Payload (variable): │
│ - Record Type (1 byte): Series | Samples │
│ - Series ID (8 bytes) │
│ - Labels (length-prefixed strings) │
│ - Samples (varint timestamp, float64 value) │
└───────────────────────────────────────────────────┘
WAL Operations:
- Append: Every write appends to active segment
- Checkpoint: Snapshot of in-memory state to disk blocks
- Truncate: Delete segments older than oldest in-memory data
- Replay: On startup, replay WAL segments to rebuild in-memory state
Rust Implementation Sketch:
struct WAL {
dir: PathBuf,
segment_size: usize, // 128MB default
active_segment: File,
active_segment_num: u64,
}
impl WAL {
fn append(&mut self, record: &WALRecord) -> Result<()> {
let encoded = record.encode();
let crc = crc32(&encoded);
// Rotate segment if needed
if self.active_segment.metadata()?.len() as usize + encoded.len() + 8 > self.segment_size {
self.rotate_segment()?;
}
self.active_segment.write_all(&crc.to_le_bytes())?;
self.active_segment.write_all(&(encoded.len() as u32).to_le_bytes())?;
self.active_segment.write_all(&encoded)?;
self.active_segment.sync_all()?; // fsync for durability
Ok(())
}
fn replay(&self) -> Result<Vec<WALRecord>> {
// Read all segments, verify CRCs, and decode records.
// Used on startup for crash recovery.
todo!()
}
}
3.2.3 In-Memory Head Block
Purpose: Accept recent writes, maintain hot data for fast queries
Structure:
struct Head {
series: RwLock<HashMap<SeriesID, Arc<Series>>>,
min_time: AtomicI64,
max_time: AtomicI64,
chunk_size: Duration, // 2h default
wal: Arc<WAL>,
}
struct Series {
id: SeriesID,
labels: BTreeMap<String, String>,
chunks: RwLock<Vec<Chunk>>,
}
struct Chunk {
min_time: i64,
max_time: i64,
samples: CompressedSamples, // Gorilla encoding
}
Chunk Lifecycle:
- Creation: New chunk created when first sample arrives or previous chunk is full
- Active: Chunk accepts samples in time window [min_time, min_time + 2h)
- Full: Chunk reaches 2h window, new chunk created for subsequent samples
- Compaction: Full chunks compacted to disk blocks
Memory Limits:
- Max series: 10M (configurable)
- Max chunks per series: 2 (active + previous, covering 4h)
- Eviction: LRU eviction of inactive series (no samples in 4h)
3.2.4 Disk Blocks (Immutable)
Purpose: Long-term storage of compacted time-series data
Block Structure (inspired by Prometheus TSDB):
data/
01HQZQZQZQZQZQZQZQZQZQ/ ← Block directory (ULID)
meta.json ← Metadata
index ← Inverted index
chunks/
000001 ← Chunk file
000002
...
tombstones ← Deleted series/samples
meta.json Format:
{
"ulid": "01HQZQZQZQZQZQZQZQZQZQ",
"minTime": 1733832000000,
"maxTime": 1733839200000,
"stats": {
"numSamples": 1500000,
"numSeries": 5000,
"numChunks": 10000
},
"compaction": {
"level": 1,
"sources": ["01HQZQZ..."]
},
"version": 1
}
Index File Format (simplified):
The index file provides fast lookups of series by labels.
┌────────────────────────────────────────────────┐
│ Magic Number (4 bytes): 0xBADA55A0 │
├────────────────────────────────────────────────┤
│ Version (1 byte): 1 │
├────────────────────────────────────────────────┤
│ Symbol Table Section │
│ - Sorted strings (label names/values) │
│ - Offset table for binary search │
├────────────────────────────────────────────────┤
│ Series Section │
│ - SeriesID → Chunk Refs mapping │
│ - (series_id, labels, chunk_offsets) │
├────────────────────────────────────────────────┤
│ Label Index Section (Inverted Index) │
│ - label_name → [series_ids] │
│ - (label_name, label_value) → [series_ids] │
├────────────────────────────────────────────────┤
│ Postings Section │
│ - Sorted posting lists for label matchers │
│ - Compressed with varint + bit packing │
├────────────────────────────────────────────────┤
│ TOC (Table of Contents) │
│ - Offsets to each section │
└────────────────────────────────────────────────┘
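To show how the postings section is consumed: a query with two label matchers loads each matcher's sorted posting list and intersects them with a linear merge. A std-only sketch (the function name is illustrative):

```rust
/// Intersect two sorted posting lists of series IDs, as an AND of two
/// label matchers would over the Postings Section.
fn intersect_postings(a: &[u64], b: &[u64]) -> Vec<u64> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

fn main() {
    // e.g. series matching method="GET" vs. series matching status="200"
    let get = [1, 3, 5, 7, 9];
    let ok = [2, 3, 5, 8, 9];
    assert_eq!(intersect_postings(&get, &ok), vec![3, 5, 9]);
}
```

Keeping posting lists sorted is what makes this O(n+m) instead of requiring hashing or repeated binary searches.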
Chunks File Format:
Chunk File (chunks/000001):
┌────────────────────────────────────────────────┐
│ Chunk 1: │
│ ├─ Length (4 bytes) │
│ ├─ Encoding (1 byte): Gorilla = 0x01 │
│ ├─ MinTime (8 bytes) │
│ ├─ MaxTime (8 bytes) │
│ ├─ NumSamples (4 bytes) │
│ └─ Compressed Data (variable) │
├────────────────────────────────────────────────┤
│ Chunk 2: ... │
└────────────────────────────────────────────────┘
3.3 Compression Strategy
3.3.1 Gorilla Compression Algorithm
Nightlight uses Gorilla compression from Facebook's paper (VLDB 2015), achieving ~12x compression.
Timestamp Compression (Delta-of-Delta):
Example timestamps (ms):
t0 = 1733832000000
t1 = 1733832015000 (Δ1 = 15000)
t2 = 1733832030000 (Δ2 = 15000)
t3 = 1733832045000 (Δ3 = 15000)
Delta-of-delta:
D1 = Δ1 - Δ0 = 15000 - 0 = 15000 → encode in 14 bits
D2 = Δ2 - Δ1 = 15000 - 15000 = 0 → encode in 1 bit (0)
D3 = Δ3 - Δ2 = 15000 - 15000 = 0 → encode in 1 bit (0)
Encoding:
- If D = 0: write 1 bit "0"
- If D in [-63, 64): write "10" + 7 bits
- If D in [-255, 256): write "110" + 9 bits
- If D in [-2047, 2048): write "1110" + 12 bits
- Otherwise: write "1111" + 32 bits
96% of timestamps compress to 1 bit!
Value Compression (XOR Encoding):
Example values (float64):
v0 = 1543.0
v1 = 1543.5
v2 = 1543.7
XOR compression:
XOR(v0, v1) = 0x40981C0000000000 XOR 0x40981E0000000000
            = 0x0000020000000000
→ Leading zeros: 22, Trailing zeros: 41 (1 significant bit)
→ Encode: control bits "11" + 5-bit leading count + 6-bit length + 1 significant bit
XOR(v1, v2) → similar pattern: a short run of significant bits, encoded with control bits
Encoding:
- If v_i == v_(i-1): write 1 bit "0"
- If XOR has same leading/trailing zeros as previous: write "10" + significant bits
- Otherwise: write "11" + 5-bit leading + 6-bit length + significant bits
51% of values compress to 1 bit!
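The example values can be verified with a few lines of std-only Rust (the hex constants are the IEEE-754 bit patterns of 1543.0 and 1543.5):

```rust
fn main() {
    let v0 = 1543.0f64.to_bits();
    let v1 = 1543.5f64.to_bits();
    assert_eq!(v0, 0x4098_1C00_0000_0000);
    assert_eq!(v1, 0x4098_1E00_0000_0000);

    let xor = v0 ^ v1;
    assert_eq!(xor, 0x0000_0200_0000_0000);
    assert_eq!(xor.leading_zeros(), 22);
    assert_eq!(xor.trailing_zeros(), 41);
    // Only 64 - 22 - 41 = 1 significant bit needs to be stored; with
    // "11" + 5-bit leading count + 6-bit length, the whole sample
    // costs 14 bits instead of 64.
    assert_eq!(64 - xor.leading_zeros() - xor.trailing_zeros(), 1);
}
```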
Rust Implementation Sketch:
struct GorillaEncoder {
bit_writer: BitWriter,
prev_timestamp: i64,
prev_delta: i64,
prev_value: f64,
prev_leading_zeros: u8,
prev_trailing_zeros: u8,
}
impl GorillaEncoder {
fn encode_timestamp(&mut self, timestamp: i64) -> Result<()> {
let delta = timestamp - self.prev_timestamp;
let delta_of_delta = delta - self.prev_delta;
if delta_of_delta == 0 {
self.bit_writer.write_bit(0)?;
} else if delta_of_delta >= -63 && delta_of_delta < 64 {
self.bit_writer.write_bits(0b10, 2)?;
self.bit_writer.write_bits(delta_of_delta as u64, 7)?;
} else if delta_of_delta >= -255 && delta_of_delta < 256 {
self.bit_writer.write_bits(0b110, 3)?;
self.bit_writer.write_bits(delta_of_delta as u64, 9)?;
} else if delta_of_delta >= -2047 && delta_of_delta < 2048 {
self.bit_writer.write_bits(0b1110, 4)?;
self.bit_writer.write_bits(delta_of_delta as u64, 12)?;
} else {
self.bit_writer.write_bits(0b1111, 4)?;
self.bit_writer.write_bits(delta_of_delta as u64, 32)?;
}
self.prev_timestamp = timestamp;
self.prev_delta = delta;
Ok(())
}
fn encode_value(&mut self, value: f64) -> Result<()> {
let bits = value.to_bits();
let xor = bits ^ self.prev_value.to_bits();
if xor == 0 {
self.bit_writer.write_bit(0)?;
} else {
let leading = xor.leading_zeros() as u8;
let trailing = xor.trailing_zeros() as u8;
let significant_bits = 64 - leading - trailing;
if leading >= self.prev_leading_zeros && trailing >= self.prev_trailing_zeros {
// Reuse the previous (leading, trailing) window: the decoder remembers
// that window and reads exactly that many significant bits back.
let prev_significant_bits = 64 - self.prev_leading_zeros - self.prev_trailing_zeros;
self.bit_writer.write_bits(0b10, 2)?;
let significant = xor >> self.prev_trailing_zeros;
self.bit_writer.write_bits(significant, prev_significant_bits as usize)?;
} else {
self.bit_writer.write_bits(0b11, 2)?;
self.bit_writer.write_bits(leading as u64, 5)?;
self.bit_writer.write_bits(significant_bits as u64, 6)?;
let mask = (1u64 << significant_bits) - 1;
let significant = (xor >> trailing) & mask;
self.bit_writer.write_bits(significant, significant_bits as usize)?;
self.prev_leading_zeros = leading;
self.prev_trailing_zeros = trailing;
}
}
self.prev_value = value;
Ok(())
}
}
3.3.2 Compression Performance Targets
Based on research and production systems:
| Metric | Target | Reference |
|---|---|---|
| Average bytes/sample | 1.5-2.0 | Prometheus (1-2), Gorilla (1.37), M3DB (1.45) |
| Compression ratio | 10-12x | Gorilla (12x), InfluxDB TSM (45x for specific workloads) |
| Encode throughput | >500K samples/sec | Gorilla paper: 700K/sec |
| Decode throughput | >1M samples/sec | Gorilla paper: 1.2M/sec |
3.4 Retention and Compaction Policies
3.4.1 Retention Policy
Default Retention: 15 days
Configurable Parameters:
[storage]
retention_days = 15 # Keep data for 15 days
min_block_duration = "2h" # Minimum block size
max_block_duration = "24h" # Maximum block size after compaction
Retention Enforcement:
- A background task runs every 1h
- Deletes blocks where max_time < now() - retention_duration
- Deletes old WAL segments
3.4.2 Compaction Strategy
Purpose:
- Merge small blocks into larger blocks (reduce file count)
- Remove deleted samples (tombstones)
- Improve query performance (fewer blocks to scan)
Compaction Levels (inspired by LevelDB):
Level 0: 2h blocks (compacted from Head)
Level 1: 12h blocks (merge 6 L0 blocks)
Level 2: 24h blocks (merge 2 L1 blocks)
Compaction Trigger:
- Time-based: Every 2h, compact Head → Level 0 block
- Count-based: When L0 has >4 blocks, compact → L1
- Manual: Admin API endpoint /api/v1/admin/compact
Compaction Algorithm:
1. Select blocks to compact (same level, adjacent time ranges)
2. Create new block directory (ULID)
3. Iterate all series in selected blocks:
a. Merge chunks from all blocks
b. Apply tombstones (skip deleted samples)
c. Re-compress merged chunks
d. Write to new block chunks file
4. Build new index (merge posting lists)
5. Write meta.json
6. Atomically rename block directory
7. Delete source blocks
Rust Implementation Sketch:
struct Compactor {
data_dir: PathBuf,
retention: Duration,
}
impl Compactor {
async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID> {
let block_id = ULID::new();
let block_dir = self.data_dir.join(block_id.to_string());
std::fs::create_dir_all(&block_dir)?;
let mut index_writer = IndexWriter::new(&block_dir.join("index"))?;
let mut chunk_writer = ChunkWriter::new(&block_dir.join("chunks/000001"))?;
let series_map = head.series.read().await;
for (series_id, series) in series_map.iter() {
let chunks = series.chunks.read().await;
for chunk in chunks.iter() {
if chunk.is_full() {
let chunk_ref = chunk_writer.write_chunk(&chunk.samples)?;
index_writer.add_series(*series_id, &series.labels, chunk_ref)?;
}
}
}
index_writer.finalize()?;
chunk_writer.finalize()?;
let meta = BlockMeta {
ulid: block_id,
min_time: head.min_time.load(Ordering::Relaxed),
max_time: head.max_time.load(Ordering::Relaxed),
stats: compute_stats(&block_dir)?,
compaction: CompactionMeta { level: 0, sources: vec![] },
version: 1,
};
write_meta(&block_dir.join("meta.json"), &meta)?;
Ok(block_id)
}
async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID> {
// Merge multiple blocks into one.
// Similar to compact_head_to_l0, but reads from existing blocks.
todo!()
}
async fn enforce_retention(&self) -> Result<()> {
let cutoff = SystemTime::now() - self.retention;
let cutoff_ms = cutoff.duration_since(UNIX_EPOCH)?.as_millis() as i64;
for entry in std::fs::read_dir(&self.data_dir)? {
let path = entry?.path();
if !path.is_dir() { continue; }
let meta_path = path.join("meta.json");
if !meta_path.exists() { continue; }
let meta: BlockMeta = serde_json::from_reader(File::open(meta_path)?)?;
if meta.max_time < cutoff_ms {
std::fs::remove_dir_all(&path)?;
info!("Deleted expired block: {}", meta.ulid);
}
}
Ok(())
}
}
4. Push Ingestion API
4.1 Prometheus Remote Write Protocol
4.1.1 Protocol Overview
- Specification: Prometheus Remote Write v1.0
- Transport: HTTP/1.1 or HTTP/2
- Encoding: Protocol Buffers (proto3)
- Compression: Snappy (required)
Reference: Prometheus Remote Write Spec
4.1.2 HTTP Endpoint
POST /api/v1/write
Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Prometheus-Remote-Write-Version: 0.1.0
Request Flow:
┌──────────────┐
│ Client │
│ (Prometheus, │
│ FlareDB, │
│ etc.) │
└──────┬───────┘
│
│ 1. Collect samples
│
▼
┌──────────────────────────────────┐
│ Encode to WriteRequest protobuf │
│ message │
└──────┬───────────────────────────┘
│
│ 2. Compress with Snappy
│
▼
┌──────────────────────────────────┐
│ HTTP POST to /api/v1/write │
│ with mTLS authentication │
└──────┬───────────────────────────┘
│
│ 3. Send request
│
▼
┌──────────────────────────────────┐
│ Nightlight Server │
│ ├─ Validate mTLS cert │
│ ├─ Decompress Snappy │
│ ├─ Decode protobuf │
│ ├─ Validate samples │
│ ├─ Append to WAL │
│ └─ Insert into Head │
└──────┬───────────────────────────┘
│
│ 4. Response
│
▼
┌──────────────────────────────────┐
│ HTTP Response: │
│ 200 OK (success) │
│ 400 Bad Request (invalid) │
│ 429 Too Many Requests (backpressure) │
│ 503 Service Unavailable (overload) │
└──────────────────────────────────┘
4.1.3 Protobuf Schema
File: proto/remote_write.proto
syntax = "proto3";
package nightlight.remote;
// Prometheus remote_write compatible schema
message WriteRequest {
repeated TimeSeries timeseries = 1;
// Metadata is optional and not used in v1
repeated MetricMetadata metadata = 2;
}
message TimeSeries {
repeated Label labels = 1;
repeated Sample samples = 2;
// Exemplars are optional (not supported in v1)
repeated Exemplar exemplars = 3;
}
message Label {
string name = 1;
string value = 2;
}
message Sample {
double value = 1;
int64 timestamp = 2; // Unix timestamp in milliseconds
}
message Exemplar {
repeated Label labels = 1;
double value = 2;
int64 timestamp = 3;
}
message MetricMetadata {
enum MetricType {
UNKNOWN = 0;
COUNTER = 1;
GAUGE = 2;
HISTOGRAM = 3;
GAUGEHISTOGRAM = 4;
SUMMARY = 5;
INFO = 6;
STATESET = 7;
}
MetricType type = 1;
string metric_family_name = 2;
string help = 3;
string unit = 4;
}
Generated Rust Code (using prost):
# Cargo.toml
[dependencies]
prost = "0.12"
prost-types = "0.12"
[build-dependencies]
prost-build = "0.12"
// build.rs
fn main() {
prost_build::compile_protos(&["proto/remote_write.proto"], &["proto/"]).unwrap();
}
4.1.4 Ingestion Handler
Rust Implementation:
use axum::{
Router,
routing::post,
extract::State,
http::StatusCode,
body::Bytes,
};
use prost::Message;
use snap::raw::Decoder as SnappyDecoder;
mod remote_write_pb {
include!(concat!(env!("OUT_DIR"), "/nightlight.remote.rs"));
}
struct IngestionService {
head: Arc<Head>,
wal: Arc<WAL>,
rate_limiter: Arc<RateLimiter>,
}
async fn handle_remote_write(
State(service): State<Arc<IngestionService>>,
body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
// 1. Decompress Snappy
let mut decoder = SnappyDecoder::new();
let decompressed = decoder
.decompress_vec(&body)
.map_err(|e| (StatusCode::BAD_REQUEST, format!("Snappy decompression failed: {}", e)))?;
// 2. Decode protobuf
let write_req = remote_write_pb::WriteRequest::decode(&decompressed[..])
.map_err(|e| (StatusCode::BAD_REQUEST, format!("Protobuf decode failed: {}", e)))?;
// 3. Validate and ingest
let mut samples_ingested = 0;
let mut samples_rejected = 0;
for ts in write_req.timeseries.iter() {
// Validate labels
let labels = validate_labels(&ts.labels)
.map_err(|e| (StatusCode::BAD_REQUEST, e))?;
let series_id = compute_series_id(&labels);
for sample in ts.samples.iter() {
// Validate timestamp (not too old, not too far in future)
if !is_valid_timestamp(sample.timestamp) {
samples_rejected += 1;
continue;
}
// Check rate limit
if !service.rate_limiter.allow() {
return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
}
// Append to WAL
let wal_record = WALRecord::Sample {
series_id,
timestamp: sample.timestamp,
value: sample.value,
};
service.wal.append(&wal_record)
.map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, format!("WAL append failed: {}", e)))?;
// Insert into Head
match service.head.append(series_id, labels.clone(), sample.timestamp, sample.value).await {
Ok(()) => samples_ingested += 1,
Err(e) if e.to_string().contains("out of order") => samples_rejected += 1,
Err(e) if e.to_string().contains("buffer full") => {
return Err((StatusCode::SERVICE_UNAVAILABLE, "Write buffer full".into()));
}
Err(e) => {
return Err((StatusCode::INTERNAL_SERVER_ERROR, format!("Insert failed: {}", e)));
}
}
}
}
info!("Ingested {} samples, rejected {}", samples_ingested, samples_rejected);
Ok(StatusCode::NO_CONTENT) // 204 No Content on success
}
fn validate_labels(labels: &[remote_write_pb::Label]) -> Result<BTreeMap<String, String>, String> {
let mut label_map = BTreeMap::new();
for label in labels {
// Validate label name
if !is_valid_label_name(&label.name) {
return Err(format!("Invalid label name: {}", label.name));
}
// Validate label value (any UTF-8)
if label.value.is_empty() {
return Err(format!("Empty label value for label: {}", label.name));
}
label_map.insert(label.name.clone(), label.value.clone());
}
// Must have __name__ label
if !label_map.contains_key("__name__") {
return Err("Missing __name__ label".into());
}
Ok(label_map)
}
fn is_valid_label_name(name: &str) -> bool {
// Must match [a-zA-Z_][a-zA-Z0-9_]*
if name.is_empty() {
return false;
}
let mut chars = name.chars();
let first = chars.next().unwrap();
if !first.is_ascii_alphabetic() && first != '_' {
return false;
}
chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
fn is_valid_timestamp(ts: i64) -> bool {
let now = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_millis() as i64;
let min_valid = now - 24 * 3600 * 1000; // Not older than 24h
let max_valid = now + 5 * 60 * 1000; // Not more than 5min in future
ts >= min_valid && ts <= max_valid
}
4.2 gRPC API (Alternative/Additional)
In addition to HTTP, Nightlight MAY support a gRPC API for ingestion (more efficient for internal services).
Proto Definition:
syntax = "proto3";
package nightlight.ingest;
service IngestionService {
rpc Write(WriteRequest) returns (WriteResponse);
rpc WriteBatch(stream WriteRequest) returns (WriteResponse);
}
message WriteRequest {
repeated TimeSeries timeseries = 1;
}
message WriteResponse {
uint64 samples_ingested = 1;
uint64 samples_rejected = 2;
string error = 3;
}
// (Reuse TimeSeries, Label, Sample from remote_write.proto)
4.3 Label Validation and Normalization
4.3.1 Metric Name Validation
Metric names (stored in __name__ label) must match:
[a-zA-Z_:][a-zA-Z0-9_:]*
Examples:
- ✅ http_requests_total
- ✅ node_cpu_seconds:rate5m
- ❌ 123_invalid (starts with digit)
- ❌ invalid-metric (contains hyphen)
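A sketch of this check in std-only Rust; it mirrors the label-name validator in Section 4.1.4 but additionally permits ':' (the function name is illustrative):

```rust
/// Check a metric name against [a-zA-Z_:][a-zA-Z0-9_:]*
fn is_valid_metric_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' || c == ':' => {}
        _ => return false, // empty, or starts with a digit/other char
    }
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == ':')
}

fn main() {
    assert!(is_valid_metric_name("http_requests_total"));
    assert!(is_valid_metric_name("node_cpu_seconds:rate5m"));
    assert!(!is_valid_metric_name("123_invalid")); // starts with digit
    assert!(!is_valid_metric_name("invalid-metric")); // contains hyphen
    assert!(!is_valid_metric_name(""));
}
```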
4.3.2 Label Name Validation
Label names must match:
[a-zA-Z_][a-zA-Z0-9_]*
Reserved prefixes:
- __ (double underscore): Internal labels (e.g., __name__, __rollup__)
4.3.3 Label Normalization
Before inserting, labels are normalized:
- Sort labels lexicographically by key
- Ensure the __name__ label is present
- Remove duplicate labels (keep last value)
- Limit label count (default: 30 labels max per series)
- Limit label value length (default: 1024 chars max)
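The normalization steps above can be sketched in std-only Rust; a BTreeMap gives both the lexicographic sort and last-value-wins deduplication for free (function name and limits are illustrative, taken from the defaults listed above):

```rust
use std::collections::BTreeMap;

const MAX_LABELS: usize = 30;
const MAX_LABEL_VALUE_LEN: usize = 1024;

fn normalize_labels(pairs: Vec<(String, String)>) -> Result<BTreeMap<String, String>, String> {
    let mut labels = BTreeMap::new();
    for (name, value) in pairs {
        if value.len() > MAX_LABEL_VALUE_LEN {
            return Err(format!("label value too long for {}", name));
        }
        labels.insert(name, value); // duplicate names: last value wins
    }
    if !labels.contains_key("__name__") {
        return Err("missing __name__ label".into());
    }
    if labels.len() > MAX_LABELS {
        return Err(format!("too many labels: {}", labels.len()));
    }
    Ok(labels)
}

fn main() {
    let labels = normalize_labels(vec![
        ("method".into(), "GET".into()),
        ("__name__".into(), "http_requests_total".into()),
        ("method".into(), "POST".into()), // duplicate: last value wins
    ])
    .unwrap();
    assert_eq!(labels.get("method").map(String::as_str), Some("POST"));
    assert!(normalize_labels(vec![("x".into(), "1".into())]).is_err());
}
```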
4.4 Write Path Architecture
┌──────────────────────────────────────────────────────────────┐
│ Ingestion Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ HTTP/gRPC │ │ mTLS Auth │ │ Rate Limiter│ │
│ │ Handler │─▶│ Validator │─▶│ │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Decompressor │ │
│ │ (Snappy) │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Protobuf │ │
│ │ Decoder │ │
│ └────────┬────────┘ │
│ │ │
└───────────────────────────────────────────┼──────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Validation Layer │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Label │ │ Timestamp │ │ Cardinality │ │
│ │ Validator │ │ Validator │ │ Limiter │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┴─────────────────┘ │
│ │ │
└───────────────────────────┼──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Write Buffer │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ MPSC Channel (bounded) │ │
│ │ Capacity: 100K samples │ │
│ │ Backpressure: Block/Reject when full │ │
│ └────────────────────────────────────────────────────┘ │
│ │ │
└───────────────────────────┼──────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ WAL │◀────────│ WAL Writer │ │
│ │ (Disk) │ │ (Thread) │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Head │◀────────│ Head Writer│ │
│ │ (In-Memory) │ │ (Thread) │ │
│ └─────────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────────────┘
Concurrency Model:
- HTTP/gRPC handlers: Multi-threaded (tokio async)
- Write buffer: MPSC channel (bounded capacity)
- WAL writer: Single-threaded (sequential writes for consistency)
- Head writer: Single-threaded (lock-free inserts via sharding)
Backpressure Handling:
enum BackpressureStrategy {
Block, // Block until buffer has space (default)
Reject, // Return 503 immediately
}
impl IngestionService {
    async fn handle_backpressure(&self, samples: Vec<Sample>) -> Result<()> {
        match self.config.backpressure_strategy {
            BackpressureStrategy::Block => {
                // Block, but give up after 5s so a stalled storage layer
                // cannot wedge the handler forever.
                tokio::time::timeout(
                    Duration::from_secs(5),
                    self.write_buffer.send(samples),
                )
                .await
                .map_err(|_| Error::Timeout)?      // timeout elapsed
                .map_err(|_| Error::BufferFull)?;  // channel closed
            }
            BackpressureStrategy::Reject => {
                // Non-blocking send; a full buffer maps to HTTP 503.
                self.write_buffer.try_send(samples)
                    .map_err(|_| Error::BufferFull)?;
            }
        }
        Ok(())
    }
}
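The Reject strategy's non-blocking send can be illustrated with the standard library's bounded channel (the server itself uses a tokio MPSC channel, as noted above; `try_enqueue` and its accept/reject counts are illustrative only):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Demonstrate bounded-channel backpressure with std's sync_channel:
/// once the buffer is full, try_send fails instead of blocking, which
/// is exactly what maps to the Reject strategy's 503 response.
fn try_enqueue(capacity: usize, items: &[u32]) -> (usize, usize) {
    let (tx, _rx) = sync_channel(capacity); // receiver never drains here
    let (mut accepted, mut rejected) = (0, 0);
    for &item in items {
        match tx.try_send(item) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => rejected += 1, // would return 503
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    (accepted, rejected)
}
```

With capacity 2 and four items, the first two are accepted and the rest rejected, mirroring a full write buffer under the Reject strategy.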
4.5 Out-of-Order Sample Handling
Problem: Samples may arrive out of timestamp order due to network delays, batching, etc.
Solution: Accept out-of-order samples within a configurable time window.
Configuration:
[storage]
out_of_order_time_window = "1h" # Accept samples up to 1h old
Implementation:
impl Head {
    async fn append(
        &self,
        series_id: SeriesID,
        labels: BTreeMap<String, String>,
        timestamp: i64,
        value: f64,
    ) -> Result<()> {
        let now = SystemTime::now().duration_since(UNIX_EPOCH)?.as_millis() as i64;
        let min_valid_ts = now - self.config.out_of_order_time_window.as_millis() as i64;
        if timestamp < min_valid_ts {
            return Err(Error::OutOfOrder(format!(
                "Sample too old: ts={}, min={}",
                timestamp, min_valid_ts
            )));
        }
        // Get or create series
        let mut series_map = self.series.write().await;
        let series = series_map.entry(series_id).or_insert_with(|| {
            Arc::new(Series {
                id: series_id,
                labels: labels.clone(),
                chunks: RwLock::new(vec![]),
            })
        });
        // Find the chunk covering this timestamp; create one if none does.
        // (Two-phase lookup: resolving the index first avoids holding an
        // iterator borrow across the push.)
        let mut chunks = series.chunks.write().await;
        let idx = match chunks
            .iter()
            .position(|c| timestamp >= c.min_time && timestamp < c.max_time)
        {
            Some(i) => i,
            None => {
                let chunk_ms = self.chunk_size.as_millis() as i64;
                // Align down to the chunk boundary
                let chunk_start = (timestamp / chunk_ms) * chunk_ms;
                chunks.push(Chunk {
                    min_time: chunk_start,
                    max_time: chunk_start + chunk_ms,
                    samples: CompressedSamples::new(),
                });
                chunks.len() - 1
            }
        };
        chunks[idx].samples.append(timestamp, value)?;
        Ok(())
    }
}
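The chunk-boundary alignment used in append above is worth isolating as a pure function; a minimal sketch (note that integer division truncates toward zero, so pre-epoch negative timestamps would need floor division instead):

```rust
/// Align a millisecond timestamp down to the start of its chunk,
/// as Head::append does when creating a new chunk. Assumes
/// non-negative timestamps; chunk_size_ms must be > 0.
fn chunk_start(timestamp_ms: i64, chunk_size_ms: i64) -> i64 {
    (timestamp_ms / chunk_size_ms) * chunk_size_ms
}
```

For a 1h chunk size (3,600,000 ms), a sample at t=7,250,000 lands in the chunk starting at 7,200,000.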
5. PromQL Query Engine
5.1 PromQL Overview
PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data.
Query Types:
- Instant query: Evaluate expression at a single point in time
- Range query: Evaluate expression over a time range
5.2 Supported PromQL Subset
Nightlight v1 supports a pragmatic subset of PromQL covering 80% of common dashboard queries.
5.2.1 Instant Vector Selectors
# Select by metric name
http_requests_total
# Select with label matchers
http_requests_total{method="GET"}
http_requests_total{method="GET", status="200"}
# Label matcher operators
metric{label="value"} # Exact match
metric{label!="value"} # Not equal
metric{label=~"regex"} # Regex match
metric{label!~"regex"} # Regex not match
# Example
http_requests_total{method=~"GET|POST", status!="500"}
5.2.2 Range Vector Selectors
# Select last 5 minutes of data
http_requests_total[5m]
# With label matchers
http_requests_total{method="GET"}[1h]
# Time durations: s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
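The duration grammar above can be handled by a short helper; a minimal sketch, assuming a single unit and whole numbers only (real PromQL also accepts compound forms like "1h30m" and "ms"):

```rust
use std::time::Duration;

/// Parse a PromQL-style duration such as "5m" or "1h".
/// Sketch only: one unit suffix, whole-number magnitude.
fn parse_duration(s: &str) -> Option<Duration> {
    let split = s.len().checked_sub(1)?;
    let (num, unit) = s.split_at(split);
    let n: u64 = num.parse().ok()?;
    let secs = match unit {
        "s" => n,
        "m" => n * 60,
        "h" => n * 3_600,
        "d" => n * 86_400,
        "w" => n * 604_800,
        "y" => n * 31_536_000,
        _ => return None,
    };
    Some(Duration::from_secs(secs))
}
```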
5.2.3 Aggregation Operators
# sum: Sum over dimensions
sum(http_requests_total)
sum(http_requests_total) by (method)
sum(http_requests_total) without (instance)
# Supported aggregations:
sum # Sum
avg # Average
min # Minimum
max # Maximum
count # Count
stddev # Standard deviation
stdvar # Standard variance
topk(N, expr)    # Top N series by value
bottomk(N, expr) # Bottom N series by value
5.2.4 Functions
Rate Functions:
# rate: Per-second average rate of increase
rate(http_requests_total[5m])
# irate: Instant rate (last two samples)
irate(http_requests_total[5m])
# increase: Total increase over time range
increase(http_requests_total[1h])
Quantile Functions:
# histogram_quantile: Calculate quantile from histogram
histogram_quantile(0.95, rate(http_request_duration_bucket[5m]))
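A sketch of the interpolation histogram_quantile performs over cumulative le-buckets (Prometheus additionally special-cases the lowest bucket's lower bound and NaN handling; this version assumes buckets sorted ascending and ending with +Inf):

```rust
/// Estimate a quantile from cumulative histogram buckets using linear
/// interpolation within the bucket where the target rank falls.
/// Buckets are (upper_bound, cumulative_count) pairs.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> f64 {
    let total = buckets.last().map(|b| b.1).unwrap_or(0.0);
    let rank = q * total;
    let (mut prev_bound, mut prev_count) = (0.0, 0.0);
    for &(bound, count) in buckets {
        if count >= rank {
            if bound.is_infinite() {
                return prev_bound; // rank falls in the +Inf bucket
            }
            let in_bucket = count - prev_count;
            if in_bucket == 0.0 {
                return bound;
            }
            // Interpolate linearly within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket;
        }
        prev_bound = bound;
        prev_count = count;
    }
    f64::NAN
}
```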
Time Functions:
# time(): Current Unix timestamp
time()
# timestamp(): Timestamp of sample
timestamp(metric)
Math Functions:
# abs, ceil, floor, round, sqrt, exp, ln, log2, log10
abs(metric)
round(metric, 0.1)
5.2.5 Binary Operators
Arithmetic:
metric1 + metric2
metric1 - metric2
metric1 * metric2
metric1 / metric2
metric1 % metric2
metric1 ^ metric2
Comparison:
metric1 == metric2 # Equal
metric1 != metric2 # Not equal
metric1 > metric2 # Greater than
metric1 < metric2 # Less than
metric1 >= metric2 # Greater or equal
metric1 <= metric2 # Less or equal
Logical:
metric1 and metric2 # Intersection
metric1 or metric2 # Union
metric1 unless metric2 # Complement
Vector Matching:
# One-to-one matching
metric1 + metric2
# Many-to-one matching
metric1 + on(label) group_left metric2
# One-to-many matching
metric1 + on(label) group_right metric2
5.2.6 Subqueries (NOT SUPPORTED in v1)
Subqueries are complex and not supported in v1:
# NOT SUPPORTED
max_over_time(rate(http_requests_total[5m])[1h:])
5.3 Query Execution Model
5.3.1 Query Parsing
Use promql-parser crate (GreptimeTeam) for parsing:
use promql_parser::{parser, label};
fn parse_query(query: &str) -> Result<parser::Expr, ParseError> {
parser::parse(query)
}
// Example
let expr = parse_query("http_requests_total{method=\"GET\"}")?;
match expr {
    parser::Expr::VectorSelector(vs) => {
        // The metric name is optional in the AST (e.g. {__name__="x"})
        println!("Metric: {:?}", vs.name);
        for matcher in vs.matchers.matchers {
            println!("Label: {} {:?} {}", matcher.name, matcher.op, matcher.value);
        }
    }
    _ => {}
}
AST Types:
pub enum Expr {
Aggregate(AggregateExpr), // sum, avg, etc.
Unary(UnaryExpr), // -metric
Binary(BinaryExpr), // metric1 + metric2
Paren(ParenExpr), // (expr)
Subquery(SubqueryExpr), // NOT SUPPORTED
NumberLiteral(NumberLiteral), // 1.5
StringLiteral(StringLiteral), // "value"
VectorSelector(VectorSelector), // metric{labels}
MatrixSelector(MatrixSelector), // metric[5m]
Call(Call), // rate(...)
}
5.3.2 Query Planner
Convert AST to execution plan:
enum QueryPlan {
VectorSelector {
matchers: Vec<LabelMatcher>,
timestamp: i64,
},
MatrixSelector {
matchers: Vec<LabelMatcher>,
range: Duration,
timestamp: i64,
},
Aggregate {
op: AggregateOp,
input: Box<QueryPlan>,
grouping: Vec<String>,
},
RateFunc {
input: Box<QueryPlan>,
},
BinaryOp {
op: BinaryOp,
lhs: Box<QueryPlan>,
rhs: Box<QueryPlan>,
matching: VectorMatching,
},
}
struct QueryPlanner;
impl QueryPlanner {
fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan> {
match expr {
parser::Expr::VectorSelector(vs) => {
Ok(QueryPlan::VectorSelector {
matchers: vs.matchers.matchers.into_iter()
.map(|m| LabelMatcher::from_ast(m))
.collect(),
timestamp: query_time,
})
}
parser::Expr::MatrixSelector(ms) => {
Ok(QueryPlan::MatrixSelector {
matchers: ms.vector_selector.matchers.matchers.into_iter()
.map(|m| LabelMatcher::from_ast(m))
.collect(),
range: ms.range, // promql-parser exposes the range as a std::time::Duration
timestamp: query_time,
})
}
parser::Expr::Call(call) => {
match call.func.name.as_str() {
"rate" => {
let arg_plan = Self::plan(*call.args[0].clone(), query_time)?;
Ok(QueryPlan::RateFunc { input: Box::new(arg_plan) })
}
// ... other functions
_ => Err(Error::UnsupportedFunction(call.func.name)),
}
}
parser::Expr::Aggregate(agg) => {
let input_plan = Self::plan(*agg.expr, query_time)?;
Ok(QueryPlan::Aggregate {
op: AggregateOp::from_str(&agg.op.to_string())?,
input: Box::new(input_plan),
grouping: agg.grouping.unwrap_or_default(),
})
}
parser::Expr::Binary(bin) => {
let lhs_plan = Self::plan(*bin.lhs, query_time)?;
let rhs_plan = Self::plan(*bin.rhs, query_time)?;
Ok(QueryPlan::BinaryOp {
op: BinaryOp::from_str(&bin.op.to_string())?,
lhs: Box::new(lhs_plan),
rhs: Box::new(rhs_plan),
matching: bin.modifier.map(|m| VectorMatching::from_ast(m)).unwrap_or_default(),
})
}
_ => Err(Error::UnsupportedExpr),
}
}
}
5.3.3 Query Executor
Execute the plan:
struct QueryExecutor {
head: Arc<Head>,
blocks: Arc<BlockManager>,
}
impl QueryExecutor {
async fn execute(&self, plan: QueryPlan) -> Result<QueryResult> {
match plan {
QueryPlan::VectorSelector { matchers, timestamp } => {
self.execute_vector_selector(matchers, timestamp).await
}
QueryPlan::MatrixSelector { matchers, range, timestamp } => {
self.execute_matrix_selector(matchers, range, timestamp).await
}
QueryPlan::RateFunc { input } => {
let matrix = self.execute(*input).await?;
self.apply_rate(matrix)
}
QueryPlan::Aggregate { op, input, grouping } => {
let vector = self.execute(*input).await?;
self.apply_aggregate(op, vector, grouping)
}
QueryPlan::BinaryOp { op, lhs, rhs, matching } => {
let lhs_result = self.execute(*lhs).await?;
let rhs_result = self.execute(*rhs).await?;
self.apply_binary_op(op, lhs_result, rhs_result, matching)
}
}
}
async fn execute_vector_selector(
&self,
matchers: Vec<LabelMatcher>,
timestamp: i64,
) -> Result<InstantVector> {
// 1. Find matching series from index
let series_ids = self.find_series(&matchers).await?;
// 2. For each series, get sample at timestamp
let mut samples = Vec::new();
for series_id in series_ids {
if let Some(sample) = self.get_sample_at(series_id, timestamp).await? {
samples.push(sample);
}
}
Ok(InstantVector { samples })
}
async fn execute_matrix_selector(
&self,
matchers: Vec<LabelMatcher>,
range: Duration,
timestamp: i64,
) -> Result<RangeVector> {
let series_ids = self.find_series(&matchers).await?;
let start = timestamp - range.as_millis() as i64;
let end = timestamp;
let mut ranges = Vec::new();
for series_id in series_ids {
let samples = self.get_samples_range(series_id, start, end).await?;
ranges.push(RangeVectorSeries {
labels: self.get_labels(series_id).await?,
samples,
});
}
Ok(RangeVector { ranges })
}
fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector> {
let mut samples = Vec::new();
for range in matrix.ranges {
if range.samples.len() < 2 {
continue; // Need at least 2 samples for rate
}
let first = &range.samples[0];
let last = &range.samples[range.samples.len() - 1];
// NOTE: a full rate() implementation also detects counter resets
// (value decreasing) and extrapolates to the window edges; this
// sketch uses the raw first/last delta.
let delta_value = last.value - first.value;
let delta_time = (last.timestamp - first.timestamp) as f64 / 1000.0; // ms -> seconds
let rate = delta_value / delta_time;
samples.push(Sample {
labels: range.labels,
timestamp: last.timestamp,
value: rate,
});
}
Ok(InstantVector { samples })
}
fn apply_aggregate(
&self,
op: AggregateOp,
vector: InstantVector,
grouping: Vec<String>,
) -> Result<InstantVector> {
// Group samples by grouping labels
let mut groups: HashMap<Vec<(String, String)>, Vec<Sample>> = HashMap::new();
for sample in vector.samples {
let group_key = if grouping.is_empty() {
vec![]
} else {
grouping.iter()
.filter_map(|label| sample.labels.get(label).map(|v| (label.clone(), v.clone())))
.collect()
};
groups.entry(group_key).or_insert_with(Vec::new).push(sample);
}
// Apply aggregation to each group
let mut result_samples = Vec::new();
for (group_labels, samples) in groups {
let aggregated_value = match op {
AggregateOp::Sum => samples.iter().map(|s| s.value).sum(),
AggregateOp::Avg => samples.iter().map(|s| s.value).sum::<f64>() / samples.len() as f64,
AggregateOp::Min => samples.iter().map(|s| s.value).fold(f64::INFINITY, f64::min),
AggregateOp::Max => samples.iter().map(|s| s.value).fold(f64::NEG_INFINITY, f64::max),
AggregateOp::Count => samples.len() as f64,
// ... other aggregations
};
result_samples.push(Sample {
labels: group_labels.into_iter().collect(),
timestamp: samples[0].timestamp,
value: aggregated_value,
});
}
Ok(InstantVector { samples: result_samples })
}
}
5.4 Read Path Architecture
┌──────────────────────────────────────────────────────────────┐
│ Query Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ HTTP API │ │ PromQL │ │ Query │ │
│ │ /api/v1/ │─▶│ Parser │─▶│ Planner │ │
│ │ query │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Query │ │
│ │ Executor │ │
│ └────────┬────────┘ │
└───────────────────────────────────────────┼──────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Index Layer │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Label Index │ │ Posting │ │
│ │ (In-Memory) │ │ Lists │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ └─────────────────┘ │
│ │ │
│ │ Series IDs │
│ ▼ │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Storage Layer │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Head │ │ Blocks │ │
│ │ (In-Memory) │ │ (Disk) │ │
│ └─────┬───────┘ └─────┬───────┘ │
│ │ │ │
│ │ Recent data (<2h) │ Historical data │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Chunk Reader │ │
│ │ - Decompress Gorilla chunks │ │
│ │ - Filter by time range │ │
│ │ - Return samples │ │
│ └─────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
5.5 HTTP Query API
5.5.1 Instant Query
GET /api/v1/query?query=<promql>&time=<timestamp>&timeout=<duration>
Parameters:
- query: PromQL expression (required)
- time: Unix timestamp (optional, default: now)
- timeout: Query timeout (optional, default: 30s)
Response (JSON):
{
"status": "success",
"data": {
"resultType": "vector",
"result": [
{
"metric": {
"__name__": "http_requests_total",
"method": "GET",
"status": "200"
},
"value": [1733832000, "1543"]
}
]
}
}
5.5.2 Range Query
GET /api/v1/query_range?query=<promql>&start=<timestamp>&end=<timestamp>&step=<duration>
Parameters:
- query: PromQL expression (required)
- start: Start timestamp (required)
- end: End timestamp (required)
- step: Query resolution step (required, e.g., "15s")
Response (JSON):
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {
"__name__": "http_requests_total",
"method": "GET"
},
"values": [
[1733832000, "1543"],
[1733832015, "1556"],
[1733832030, "1570"]
]
}
]
}
}
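A client issuing the range query above needs to URL-encode the PromQL expression; a sketch of building the request URL (`range_query_url` is a hypothetical client helper, and the hand-rolled encoding covers only the characters common in selectors — a real client should use a URL-encoding library):

```rust
/// Build a /api/v1/query_range URL from its parameters.
/// Hypothetical helper; the endpoint path matches the API above.
fn range_query_url(base: &str, query: &str, start: i64, end: i64, step: &str) -> String {
    // Percent-encode only the PromQL-selector characters; illustrative only.
    let q: String = query
        .chars()
        .map(|c| match c {
            '{' => "%7B".to_string(),
            '}' => "%7D".to_string(),
            '"' => "%22".to_string(),
            '=' => "%3D".to_string(),
            ' ' => "%20".to_string(),
            _ => c.to_string(),
        })
        .collect();
    format!("{base}/api/v1/query_range?query={q}&start={start}&end={end}&step={step}")
}
```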
5.5.3 Label Values Query
GET /api/v1/label/<label_name>/values?match[]=<matcher>
Example:
GET /api/v1/label/method/values?match[]=http_requests_total
Response:
{
"status": "success",
"data": ["GET", "POST", "PUT", "DELETE"]
}
5.5.4 Series Metadata Query
GET /api/v1/series?match[]=<matcher>&start=<timestamp>&end=<timestamp>
Example:
GET /api/v1/series?match[]=http_requests_total{method="GET"}
Response:
{
"status": "success",
"data": [
{
"__name__": "http_requests_total",
"method": "GET",
"status": "200",
"instance": "flaredb-1:9092"
}
]
}
5.6 Performance Optimizations
5.6.1 Query Caching
Cache query results for identical queries:
struct QueryCache {
cache: Arc<Mutex<LruCache<String, (QueryResult, Instant)>>>,
ttl: Duration,
}
impl QueryCache {
fn get(&self, query_hash: &str) -> Option<QueryResult> {
    // LruCache::get takes &mut self (it updates recency), hence `mut`.
    let mut cache = self.cache.lock().unwrap();
    if let Some((result, timestamp)) = cache.get(query_hash) {
        if timestamp.elapsed() < self.ttl {
            return Some(result.clone());
        }
    }
    None
}
fn put(&self, query_hash: String, result: QueryResult) {
let mut cache = self.cache.lock().unwrap();
cache.put(query_hash, (result, Instant::now()));
}
}
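For the cache key, hashing the query text together with a bucket-aligned evaluation timestamp lets repeated dashboard refreshes within the bucket share an entry; a sketch (`query_cache_key` and the bucket width are illustrative, and DefaultHasher should be replaced by a pinned hash if keys must be stable across processes):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a cache key from the query text and the evaluation timestamp,
/// aligned down to a bucket so nearby evaluations collide on purpose.
fn query_cache_key(query: &str, timestamp_ms: i64, bucket_ms: i64) -> String {
    let aligned = timestamp_ms - timestamp_ms % bucket_ms;
    let mut h = DefaultHasher::new();
    query.hash(&mut h);
    aligned.hash(&mut h);
    format!("{:016x}", h.finish())
}
```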
5.6.2 Posting List Intersection
Use efficient algorithms for label matcher intersection:
fn intersect_posting_lists(lists: Vec<&[SeriesID]>) -> Vec<SeriesID> {
if lists.is_empty() {
return vec![];
}
// Sort lists by length (shortest first for early termination)
let mut sorted_lists = lists;
sorted_lists.sort_by_key(|list| list.len());
// Use shortest list as base, intersect with others
let mut result: HashSet<SeriesID> = sorted_lists[0].iter().copied().collect();
for list in &sorted_lists[1..] {
let list_set: HashSet<SeriesID> = list.iter().copied().collect();
result.retain(|id| list_set.contains(id));
if result.is_empty() {
break; // Early termination
}
}
result.into_iter().collect()
}
5.6.3 Chunk Pruning
Skip chunks that don't overlap query time range:
fn query_chunks(
chunks: &[ChunkRef],
start_time: i64,
end_time: i64,
) -> Vec<ChunkRef> {
chunks.iter()
.filter(|chunk| {
// Chunk overlaps query range if:
// chunk.max_time > start AND chunk.min_time < end
chunk.max_time > start_time && chunk.min_time < end_time
})
.copied()
.collect()
}
6. Storage Backend Architecture
6.1 Architecture Decision: Hybrid Approach
After analyzing trade-offs, Nightlight adopts a hybrid storage architecture:
- Dedicated time-series engine for sample storage (optimized for write throughput and compression)
- Optional FlareDB integration for metadata and distributed coordination (future work)
- Optional S3-compatible backend for cold data archival (future work)
6.2 Decision Rationale
6.2.1 Why NOT Pure FlareDB Backend?
FlareDB Characteristics:
- General-purpose KV store with Raft consensus
- Optimized for: Strong consistency, small KV pairs, random access
- Storage: RocksDB (LSM tree)
Time-Series Workload Characteristics:
- High write throughput (100K samples/sec)
- Sequential writes (append-only)
- Temporal locality (queries focus on recent data)
- Bulk reads (range scans over time windows)
Mismatch Analysis:
| Aspect | FlareDB (KV) | Time-Series Engine |
|---|---|---|
| Write pattern | Random writes, compaction overhead | Append-only, minimal overhead |
| Compression | Generic LZ4/Snappy | Domain-specific (Gorilla: 12x) |
| Read pattern | Point lookups | Range scans over time |
| Indexing | Key-based | Label-based inverted index |
| Consistency | Strong (Raft) | Eventual OK for metrics |
Conclusion: Using FlareDB for sample storage would cost an estimated 5-10x in write throughput and roughly 10x in compression efficiency compared with a dedicated time-series engine.
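The "domain-specific compression" row refers to Gorilla-style encoding. A minimal sketch of the timestamp half of that idea, delta-of-delta encoding (kept at the integer level here; the actual Gorilla format bit-packs these values, which is where the ~12x ratio comes from):

```rust
/// Delta-of-delta encode a sorted list of millisecond timestamps.
/// Regularly spaced samples -- the common case for scraped metrics --
/// collapse to runs of zeros, which compress extremely well.
fn delta_of_delta(ts: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(ts.len());
    let (mut prev, mut prev_delta) = (0i64, 0i64);
    for (i, &t) in ts.iter().enumerate() {
        if i == 0 {
            out.push(t); // first timestamp stored raw
        } else {
            let delta = t - prev;
            out.push(delta - prev_delta);
            prev_delta = delta;
        }
        prev = t;
    }
    out
}
```

Four samples scraped exactly 1s apart encode as the raw start plus `[1000, 0, 0]` — one non-trivial value regardless of series length.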
6.2.2 Why NOT VictoriaMetrics Binary?
VictoriaMetrics is written in Go and has excellent performance, but:
- mTLS support is paid only (violates PROJECT.md requirement)
- Not Rust (violates PROJECT.md "Rustで書く")
- Cannot integrate with FlareDB for metadata (future requirement)
- Less control over storage format and optimizations
6.2.3 Why Hybrid (Dedicated + Optional FlareDB)?
Phase 1 (T033 v1): Pure dedicated engine
- Simple, single-instance deployment
- Focus on core functionality (ingest + query)
- Local disk storage only
Phase 2 (Future): Add FlareDB for metadata
- Store series labels and metadata in FlareDB "metrics" namespace
- Enables multi-instance coordination
- Global view of series cardinality, label values
- Samples still in dedicated engine (local disk)
Phase 3 (Future): Add S3 for cold storage
- Automatically upload old blocks (>7 days) to S3
- Query federation across local + S3 blocks
- Unlimited retention with cost-effective storage
Benefits:
- v1 simplicity: No FlareDB dependency, easy deployment
- Future scalability: Metadata in FlareDB, samples distributed
- Operational flexibility: Can run standalone or integrated
6.3 Storage Layout
6.3.1 Directory Structure
/var/lib/nightlight/
├── data/
│ ├── wal/
│ │ ├── 00000001 # WAL segment
│ │ ├── 00000002
│ │ └── checkpoint.00000002 # WAL checkpoint
│ ├── 01HQZQZQZQZQZQZQZQZQZQ/ # Block (ULID)
│ │ ├── meta.json
│ │ ├── index
│ │ ├── chunks/
│ │ │ ├── 000001
│ │ │ └── 000002
│ │ └── tombstones
│ ├── 01HQZR.../ # Another block
│ └── ...
└── tmp/ # Temp files for compaction
6.3.2 Metadata Storage (Future: FlareDB Integration)
When FlareDB integration is enabled:
Series Metadata (stored in FlareDB "metrics" namespace):
Key: series:<series_id>
Value: {
"labels": {"__name__": "http_requests_total", "method": "GET", ...},
"first_seen": 1733832000000,
"last_seen": 1733839200000
}
Key: label_index:<label_name>:<label_value>
Value: [series_id1, series_id2, ...] # Posting list
Benefits:
- Fast label value lookups across all instances
- Global series cardinality tracking
- Distributed query planning (future)
Trade-off: Adds dependency on FlareDB, increases complexity
6.4 Scalability Approach
6.4.1 Vertical Scaling (v1)
Single instance scales to:
- 10M active series
- 100K samples/sec write throughput
- 1K queries/sec
Scaling strategy:
- Increase memory (more series in Head)
- Faster disk (NVMe for WAL/blocks)
- More CPU cores (parallel compaction, query execution)
6.4.2 Horizontal Scaling (Future)
Sharding Strategy (inspired by Prometheus federation + Thanos):
┌────────────────────────────────────────────────────────────┐
│ Query Frontend │
│ (Query Federation) │
└─────┬────────────────────┬─────────────────────┬───────────┘
│ │ │
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Nightlight │ │ Nightlight │ │ Nightlight │
│ Instance 1 │ │ Instance 2 │ │ Instance N │
│ │ │ │ │ │
│ Hash shard: │ │ Hash shard: │ │ Hash shard: │
│ 0-333 │ │ 334-666 │ │ 667-999 │
└─────────────┘ └─────────────┘ └─────────────┘
│ │ │
└────────────────────┴─────────────────────┘
│
▼
┌───────────────┐
│ FlareDB │
│ (Metadata) │
└───────────────┘
Sharding Key: Hash(series_id) % num_shards
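The sharding rule above is a one-liner; a sketch (DefaultHasher stands in for a real hash — a production deployment would pin a stable function such as xxhash so shard assignments survive restarts and version upgrades):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a series ID to a shard via Hash(series_id) % num_shards,
/// matching the sharding key above. num_shards must be > 0.
fn shard_for(series_id: u64, num_shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    series_id.hash(&mut h);
    h.finish() % num_shards
}
```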
Query Execution:
- Query frontend receives PromQL query
- Determine which shards contain matching series (via FlareDB metadata)
- Send subqueries to relevant shards
- Merge results (aggregation, deduplication)
- Return to client
Challenges (deferred to future work):
- Rebalancing when adding/removing shards
- Handling series that span multiple shards (rare)
- Ensuring query consistency across shards
6.5 S3 Integration Strategy (Future)
Objective: Cost-effective long-term retention (>15 days)
Architecture:
┌───────────────────────────────────────────────────┐
│ Nightlight Server │
│ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Head │ │ Blocks │ │
│ │ (0-2h) │ │ (2h-15d)│ │
│ └──────────┘ └────┬─────┘ │
│ │ │
│ │ Background uploader │
│ ▼ │
│ ┌─────────────┐ │
│ │ Upload to │ │
│ │ S3 (>7d) │ │
│ └──────┬──────┘ │
└──────────────────────────┼────────────────────────┘
│
▼
┌─────────────────┐
│ S3 Bucket │
│ /blocks/ │
│ 01HQZ.../ │
│ 01HRZ.../ │
└─────────────────┘
Workflow:
- Block compaction creates local block files
- Blocks older than 7 days (configurable) are uploaded to S3
- Local block files deleted after successful upload
- Query executor checks both local and S3 for blocks in query range
- Download S3 blocks on-demand (with local cache)
Configuration:
[storage.s3]
enabled = true
endpoint = "https://s3.example.com"
bucket = "nightlight-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100
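The upload decision in step 2 of the workflow reduces to an age check against upload_after_days; a sketch (`should_upload` is a hypothetical helper name, and block age is taken from the block's newest sample):

```rust
use std::time::Duration;

/// Decide whether a block is old enough to move to S3, per the
/// upload_after_days setting above. block_max_time_ms is the block's
/// newest sample in Unix milliseconds; now_ms is the current time.
fn should_upload(block_max_time_ms: i64, now_ms: i64, upload_after_days: u64) -> bool {
    let cutoff = Duration::from_secs(upload_after_days * 86_400).as_millis() as i64;
    now_ms - block_max_time_ms > cutoff
}
```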
7. Integration Points
7.1 Service Discovery (How Services Push Metrics)
7.1.1 Service Configuration Pattern
Each platform service (FlareDB, ChainFire, etc.) exports Prometheus metrics on ports 9091-9099.
Example (FlareDB metrics exporter):
// flaredb-server/src/main.rs
use metrics_exporter_prometheus::PrometheusBuilder;
#[tokio::main]
async fn main() -> Result<()> {
// ... initialization ...
let metrics_addr = format!("0.0.0.0:{}", args.metrics_port);
let builder = PrometheusBuilder::new();
builder
.with_http_listener(metrics_addr.parse::<std::net::SocketAddr>()?)
.install()
.expect("Failed to install Prometheus metrics exporter");
info!("Prometheus metrics available at http://{}/metrics", metrics_addr);
// ... rest of main ...
}
Service Metrics Ports (from T027.S2):
| Service | Port | Endpoint |
|---|---|---|
| ChainFire | 9091 | http://chainfire:9091/metrics |
| FlareDB | 9092 | http://flaredb:9092/metrics |
| PlasmaVMC | 9093 | http://plasmavmc:9093/metrics |
| IAM | 9094 | http://iam:9094/metrics |
| LightningSTOR | 9095 | http://lightningstor:9095/metrics |
| FlashDNS | 9096 | http://flashdns:9096/metrics |
| FiberLB | 9097 | http://fiberlb:9097/metrics |
| Prismnet | 9098 | http://prismnet:9098/metrics |
7.1.2 Scrape-to-Push Adapter
Since Nightlight is push-based but services export pull-based Prometheus /metrics endpoints, we need a scrape-to-push adapter.
Option 1: Prometheus Agent Mode + Remote Write
Deploy Prometheus in agent mode (no storage, only scraping):
# prometheus-agent.yaml
global:
scrape_interval: 15s
external_labels:
cluster: 'cloud-platform'
scrape_configs:
- job_name: 'chainfire'
static_configs:
- targets: ['chainfire:9091']
- job_name: 'flaredb'
static_configs:
- targets: ['flaredb:9092']
# ... other services ...
remote_write:
- url: 'https://nightlight:8080/api/v1/write'
tls_config:
cert_file: /etc/certs/client.crt
key_file: /etc/certs/client.key
ca_file: /etc/certs/ca.crt
Option 2: Custom Rust Scraper (Platform-Native)
Build a lightweight scraper in Rust that integrates with Nightlight:
// nightlight-scraper/src/main.rs
struct Scraper {
targets: Vec<ScrapeTarget>,
client: reqwest::Client,
nightlight_client: NightlightClient,
}
struct ScrapeTarget {
job_name: String,
url: String,
interval: Duration,
}
impl Scraper {
async fn scrape_loop(&self) {
loop {
for target in &self.targets {
let result = self.scrape_target(target).await;
match result {
Ok(samples) => {
if let Err(e) = self.nightlight_client.write(samples).await {
error!("Failed to write to Nightlight: {}", e);
}
}
Err(e) => {
error!("Failed to scrape {}: {}", target.url, e);
}
}
}
tokio::time::sleep(Duration::from_secs(15)).await;
}
}
async fn scrape_target(&self, target: &ScrapeTarget) -> Result<Vec<Sample>> {
let response = self.client.get(&target.url).send().await?;
let body = response.text().await?;
// Parse Prometheus text format
let samples = parse_prometheus_text(&body, &target.job_name)?;
Ok(samples)
}
}
fn parse_prometheus_text(text: &str, job: &str) -> Result<Vec<Sample>> {
    // Use the prometheus-parse crate, or implement a simple line parser.
    // Example input line:
    //   http_requests_total{method="GET",status="200"} 1543 1733832000000
    // Each parsed sample gets a job="<job>" label attached.
    todo!("parse exposition format, attach job label")
}
Deployment:
- nightlight-scraper runs as a sidecar or separate service
- Reads scrape config from a TOML file
- Uses mTLS to push to Nightlight
Recommendation: Option 2 (custom scraper) for consistency with platform philosophy (100% Rust, no external dependencies).
7.2 mTLS Configuration (T027/T031 Patterns)
7.2.1 TLS Config Structure
Following existing patterns (FlareDB, ChainFire, IAM):
# nightlight.toml
[server]
addr = "0.0.0.0:8080"
log_level = "info"
[server.tls]
cert_file = "/etc/nightlight/certs/server.crt"
key_file = "/etc/nightlight/certs/server.key"
ca_file = "/etc/nightlight/certs/ca.crt"
require_client_cert = true # Enable mTLS
Rust Config Struct:
// nightlight-server/src/config.rs
use serde::{Deserialize, Serialize};
use std::net::SocketAddr;
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServerConfig {
pub server: ServerSettings,
pub storage: StorageConfig,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ServerSettings {
pub addr: SocketAddr,
pub log_level: String,
pub tls: Option<TlsConfig>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct TlsConfig {
pub cert_file: String,
pub key_file: String,
pub ca_file: Option<String>,
#[serde(default)]
pub require_client_cert: bool,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct StorageConfig {
pub data_dir: String,
pub retention_days: u32,
pub wal_segment_size_mb: usize,
// ... other storage settings
}
7.2.2 mTLS Server Setup
// nightlight-server/src/main.rs
use axum::Router;
use axum_server::tls_rustls::RustlsConfig;
use std::sync::Arc;
#[tokio::main]
async fn main() -> Result<()> {
let config = ServerConfig::load("nightlight.toml")?;
// Build router
let app = Router::new()
.route("/api/v1/write", post(handle_remote_write))
.route("/api/v1/query", get(handle_instant_query))
.route("/api/v1/query_range", get(handle_range_query))
.route("/health", get(health_check))
.route("/ready", get(readiness_check))
.with_state(Arc::new(service));
// Setup TLS if configured
if let Some(tls_config) = &config.server.tls {
info!("TLS enabled, loading certificates...");
    let rustls_config = if tls_config.require_client_cert {
        info!("mTLS enabled, requiring client certificates");
        // axum_server's RustlsConfig has no client-cert option of its own:
        // build a rustls::ServerConfig with a client certificate verifier
        // (WebPkiClientVerifier over the CA bundle) and wrap it via
        // RustlsConfig::from_config. build_mtls_server_config is a local
        // helper (not shown) that does exactly that.
        let ca_file = tls_config.ca_file.as_ref().ok_or("ca_file required for mTLS")?;
        let server_config = build_mtls_server_config(
            &tls_config.cert_file,
            &tls_config.key_file,
            ca_file,
        ).await?;
        RustlsConfig::from_config(Arc::new(server_config))
    } else {
        info!("TLS-only mode, client certificates not required");
        RustlsConfig::from_pem_file(
            &tls_config.cert_file,
            &tls_config.key_file,
        ).await?
    };
axum_server::bind_rustls(config.server.addr, rustls_config)
.serve(app.into_make_service())
.await?;
} else {
info!("TLS disabled, running in plain-text mode");
axum_server::bind(config.server.addr)
.serve(app.into_make_service())
.await?;
}
Ok(())
}
7.2.3 Client Certificate Validation
Extract client identity from mTLS certificate:
use axum::{
http::Request,
middleware::Next,
response::Response,
Extension,
};
use axum_server::tls_rustls::RustlsAcceptor;
#[derive(Clone, Debug)]
struct ClientIdentity {
common_name: String,
organization: String,
}
async fn extract_client_identity<B>(
Extension(client_cert): Extension<Option<rustls::Certificate>>,
mut request: Request<B>,
next: Next<B>,
) -> Response {
if let Some(cert) = client_cert {
// Parse certificate to extract CN, O, etc.
let identity = parse_certificate(&cert);
request.extensions_mut().insert(identity);
}
next.run(request).await
}
// Use identity for rate limiting, audit logging, etc.
async fn handle_remote_write(
Extension(identity): Extension<ClientIdentity>,
State(service): State<Arc<IngestionService>>,
body: Bytes,
) -> Result<StatusCode, (StatusCode, String)> {
info!("Write request from: {}", identity.common_name);
// Apply per-client rate limiting
if !service.rate_limiter.allow(&identity.common_name) {
return Err((StatusCode::TOO_MANY_REQUESTS, "Rate limit exceeded".into()));
}
// ... rest of handler ...
}
7.3 gRPC API Design
While HTTP is the primary interface (Prometheus compatibility), a gRPC API can provide:
- Better performance for internal services
- Streaming support for batch ingestion
- Type-safe client libraries
Proto Definition:
// proto/nightlight.proto
syntax = "proto3";
package nightlight.v1;
service NightlightService {
// Write samples
rpc Write(WriteRequest) returns (WriteResponse);
// Streaming write for high-throughput scenarios
rpc WriteStream(stream WriteRequest) returns (WriteResponse);
// Query (instant)
rpc Query(QueryRequest) returns (QueryResponse);
// Query (range)
rpc QueryRange(QueryRangeRequest) returns (QueryRangeResponse);
// Admin operations
rpc Compact(CompactRequest) returns (CompactResponse);
rpc DeleteSeries(DeleteSeriesRequest) returns (DeleteSeriesResponse);
}
message WriteRequest {
repeated TimeSeries timeseries = 1;
}
message WriteResponse {
uint64 samples_ingested = 1;
uint64 samples_rejected = 2;
}
message QueryRequest {
string query = 1; // PromQL
int64 time = 2; // Unix timestamp (ms)
int64 timeout_ms = 3;
}
message QueryResponse {
string result_type = 1; // "vector" or "matrix"
repeated InstantVectorSample vector = 2;
repeated RangeVectorSeries matrix = 3;
}
message InstantVectorSample {
map<string, string> labels = 1;
double value = 2;
int64 timestamp = 3;
}
message RangeVectorSeries {
map<string, string> labels = 1;
repeated Sample samples = 2;
}
message Sample {
double value = 1;
int64 timestamp = 2;
}
7.4 NixOS Module Integration
Following T024 patterns, create a NixOS module for Nightlight.
File: nix/modules/nightlight.nix
{ config, lib, pkgs, ... }:
with lib;
let
cfg = config.services.nightlight;
configFile = pkgs.writeText "nightlight.toml" ''
[server]
addr = "${cfg.listenAddress}"
log_level = "${cfg.logLevel}"
${optionalString (cfg.tls.enable) ''
[server.tls]
cert_file = "${cfg.tls.certFile}"
key_file = "${cfg.tls.keyFile}"
${optionalString (cfg.tls.caFile != null) ''
ca_file = "${cfg.tls.caFile}"
''}
require_client_cert = ${boolToString cfg.tls.requireClientCert}
''}
[storage]
data_dir = "${cfg.dataDir}"
retention_days = ${toString cfg.storage.retentionDays}
wal_segment_size_mb = ${toString cfg.storage.walSegmentSizeMb}
'';
in {
options.services.nightlight = {
enable = mkEnableOption "Nightlight metrics storage service";
package = mkOption {
type = types.package;
default = pkgs.nightlight;
description = "Nightlight package to use";
};
listenAddress = mkOption {
type = types.str;
default = "0.0.0.0:8080";
description = "Address and port to listen on";
};
logLevel = mkOption {
type = types.enum [ "trace" "debug" "info" "warn" "error" ];
default = "info";
description = "Log level";
};
dataDir = mkOption {
type = types.path;
default = "/var/lib/nightlight";
description = "Data directory for TSDB storage";
};
tls = {
enable = mkEnableOption "TLS encryption";
certFile = mkOption {
type = types.str;
description = "Path to TLS certificate file";
};
keyFile = mkOption {
type = types.str;
description = "Path to TLS private key file";
};
caFile = mkOption {
type = types.nullOr types.str;
default = null;
description = "Path to CA certificate for client verification (mTLS)";
};
requireClientCert = mkOption {
type = types.bool;
default = false;
description = "Require client certificates (mTLS)";
};
};
storage = {
retentionDays = mkOption {
type = types.ints.positive;
default = 15;
description = "Data retention period in days";
};
walSegmentSizeMb = mkOption {
type = types.ints.positive;
default = 128;
description = "WAL segment size in MB";
};
};
openFirewall = mkOption {
type = types.bool;
default = false;
description = "Whether to open port 8080 in the firewall";
};
};
config = mkIf cfg.enable {
systemd.services.nightlight = {
description = "Nightlight Metrics Storage Service";
wantedBy = [ "multi-user.target" ];
after = [ "network.target" ];
serviceConfig = {
Type = "simple";
ExecStart = "${cfg.package}/bin/nightlight-server --config ${configFile}";
Restart = "on-failure";
RestartSec = "5s";
# Security hardening
DynamicUser = true;
StateDirectory = "nightlight";
ProtectSystem = "strict";
ProtectHome = true;
PrivateTmp = true;
NoNewPrivileges = true;
};
};
# Expose metrics endpoint
networking.firewall.allowedTCPPorts = mkIf cfg.openFirewall [ 8080 ];
};
}
Usage Example (in NixOS configuration):
{
services.nightlight = {
enable = true;
listenAddress = "0.0.0.0:8080";
logLevel = "info";
tls = {
enable = true;
certFile = "/etc/certs/nightlight-server.crt";
keyFile = "/etc/certs/nightlight-server.key";
caFile = "/etc/certs/ca.crt";
requireClientCert = true;
};
storage = {
retentionDays = 30;
};
};
}
8. Implementation Plan
8.1 Step Breakdown (S1-S6)
The implementation follows a phased approach aligned with the task.yaml steps.
S1: Research & Architecture ✅ (Current Document)
Deliverable: This design document
Status: Completed
S2: Workspace Scaffold
Goal: Create nightlight workspace with skeleton structure
Tasks:
- Create workspace structure:
  nightlight/
  ├── Cargo.toml
  ├── crates/
  │   ├── nightlight-api/      # Client library
  │   ├── nightlight-server/   # Main service
  │   └── nightlight-types/    # Shared types
  ├── proto/
  │   ├── remote_write.proto
  │   └── nightlight.proto
  └── README.md
- Setup proto compilation in build.rs
- Define core types:
  // nightlight-types/src/lib.rs
  pub type SeriesID = u64;
  pub type Timestamp = i64; // Unix timestamp in milliseconds
  pub struct Sample {
      pub timestamp: Timestamp,
      pub value: f64,
  }
  pub struct Series {
      pub id: SeriesID,
      pub labels: BTreeMap<String, String>,
  }
  pub struct LabelMatcher {
      pub name: String,
      pub value: String,
      pub op: MatchOp,
  }
  pub enum MatchOp {
      Equal,
      NotEqual,
      RegexMatch,
      RegexNotMatch,
  }
- Add dependencies:
  [workspace.dependencies]
  # Core
  tokio = { version = "1.35", features = ["full"] }
  anyhow = "1.0"
  tracing = "0.1"
  tracing-subscriber = "0.3"
  # Serialization
  serde = { version = "1.0", features = ["derive"] }
  serde_json = "1.0"
  toml = "0.8"
  # gRPC
  tonic = "0.10"
  prost = "0.12"
  prost-types = "0.12"
  # HTTP
  axum = "0.7"
  axum-server = { version = "0.6", features = ["tls-rustls"] }
  # Compression
  snap = "1.1"  # Snappy
  # Time-series
  promql-parser = "0.4"
  # Storage
  rocksdb = "0.21"  # (NOT for TSDB, only for examples)
  # Crypto
  rustls = "0.21"
Estimated Effort: 2 days
S3: Push Ingestion
Goal: Implement Prometheus remote_write compatible ingestion endpoint
Tasks:
- Implement WAL:
  // nightlight-server/src/wal.rs
  struct WAL {
      dir: PathBuf,
      segment_size: usize,
      active_segment: RwLock<WALSegment>,
  }
  impl WAL {
      fn new(dir: PathBuf, segment_size: usize) -> Result<Self>;
      fn append(&self, record: WALRecord) -> Result<()>;
      fn replay(&self) -> Result<Vec<WALRecord>>;
      fn checkpoint(&self, min_segment: u64) -> Result<()>;
  }
- Implement In-Memory Head Block:
  // nightlight-server/src/head.rs
  struct Head {
      series: DashMap<SeriesID, Arc<Series>>, // Concurrent HashMap
      min_time: AtomicI64,
      max_time: AtomicI64,
      config: HeadConfig,
  }
  impl Head {
      async fn append(&self, series_id: SeriesID, labels: Labels, ts: Timestamp, value: f64) -> Result<()>;
      async fn get(&self, series_id: SeriesID) -> Option<Arc<Series>>;
      async fn series_count(&self) -> usize;
  }
- Implement Gorilla Compression (basic version):
  // nightlight-server/src/compression.rs
  struct GorillaEncoder { /* ... */ }
  struct GorillaDecoder { /* ... */ }
  impl GorillaEncoder {
      fn encode_timestamp(&mut self, ts: i64) -> Result<()>;
      fn encode_value(&mut self, value: f64) -> Result<()>;
      fn finish(self) -> Vec<u8>;
  }
- Implement HTTP Ingestion Handler:
  // nightlight-server/src/handlers/ingest.rs
  async fn handle_remote_write(
      State(service): State<Arc<IngestionService>>,
      body: Bytes,
  ) -> Result<StatusCode, (StatusCode, String)> {
      // 1. Decompress Snappy
      // 2. Decode protobuf
      // 3. Validate samples
      // 4. Append to WAL
      // 5. Insert into Head
      // 6. Return 204 No Content
  }
- Add Rate Limiting:
  struct RateLimiter {
      rate: f64, // samples/sec
      tokens: AtomicU64,
  }
  impl RateLimiter {
      fn allow(&self) -> bool;
  }
- Integration Test:
  #[tokio::test]
  async fn test_remote_write_ingestion() {
      // Start server
      // Send WriteRequest
      // Verify samples stored
  }
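The core idea behind the Gorilla value encoding referenced above is XOR-ing each value's bit pattern with the previous one, so that repeated or slowly-changing values produce mostly-zero deltas that compress to a few bits. A minimal sketch of just that XOR step (the leading/trailing-zero bit packing of the real encoder is omitted):

```rust
// Gorilla-style XOR value deltas: consecutive samples that repeat XOR
// to zero, which a real encoder then stores as a single control bit.
fn xor_deltas(values: &[f64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|v| {
            let bits = v.to_bits();
            let delta = bits ^ prev;
            prev = bits;
            delta
        })
        .collect()
}

fn main() {
    // A flat gauge produces zero deltas after the first sample.
    let deltas = xor_deltas(&[42.0, 42.0, 42.0, 43.0]);
    assert_eq!(deltas[1], 0);
    assert_eq!(deltas[2], 0);
    assert_ne!(deltas[3], 0);
    println!("{:?}", deltas);
}
```

Timestamps get the analogous treatment with delta-of-delta integer encoding, which is why regular scrape intervals compress so well.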
Estimated Effort: 5 days
S4: PromQL Query Engine
Goal: Basic PromQL query support (instant + range queries)
Tasks:
- Integrate promql-parser:
  // nightlight-server/src/query/parser.rs
  use promql_parser::parser;
  pub fn parse(query: &str) -> Result<parser::Expr> {
      parser::parse(query).map_err(|e| Error::ParseError(e.to_string()))
  }
- Implement Query Planner:
  // nightlight-server/src/query/planner.rs
  pub enum QueryPlan {
      VectorSelector { matchers: Vec<LabelMatcher>, timestamp: i64 },
      MatrixSelector { matchers: Vec<LabelMatcher>, range: Duration, timestamp: i64 },
      Aggregate { op: AggregateOp, input: Box<QueryPlan>, grouping: Vec<String> },
      RateFunc { input: Box<QueryPlan> },
      // ... other operators
  }
  pub fn plan(expr: parser::Expr, query_time: i64) -> Result<QueryPlan>;
- Implement Label Index:
  // nightlight-server/src/index.rs
  struct LabelIndex {
      // label_name -> label_value -> [series_ids]
      inverted_index: DashMap<String, DashMap<String, Vec<SeriesID>>>,
  }
  impl LabelIndex {
      fn find_series(&self, matchers: &[LabelMatcher]) -> Result<Vec<SeriesID>>;
      fn add_series(&self, series_id: SeriesID, labels: &Labels);
  }
- Implement Query Executor:
  // nightlight-server/src/query/executor.rs
  struct QueryExecutor {
      head: Arc<Head>,
      blocks: Arc<BlockManager>,
      index: Arc<LabelIndex>,
  }
  impl QueryExecutor {
      async fn execute(&self, plan: QueryPlan) -> Result<QueryResult>;
      async fn execute_vector_selector(&self, matchers: Vec<LabelMatcher>, ts: i64) -> Result<InstantVector>;
      async fn execute_matrix_selector(&self, matchers: Vec<LabelMatcher>, range: Duration, ts: i64) -> Result<RangeVector>;
      fn apply_rate(&self, matrix: RangeVector) -> Result<InstantVector>;
      fn apply_aggregate(&self, op: AggregateOp, vector: InstantVector, grouping: Vec<String>) -> Result<InstantVector>;
  }
- Implement HTTP Query Handlers:
  // nightlight-server/src/handlers/query.rs
  async fn handle_instant_query(
      Query(params): Query<QueryParams>,
      State(executor): State<Arc<QueryExecutor>>,
  ) -> Result<Json<QueryResponse>, (StatusCode, String)> {
      let expr = parse(&params.query)?;
      let plan = plan(expr, params.time.unwrap_or_else(now))?;
      let result = executor.execute(plan).await?;
      Ok(Json(format_response(result)))
  }
  async fn handle_range_query(
      Query(params): Query<RangeQueryParams>,
      State(executor): State<Arc<QueryExecutor>>,
  ) -> Result<Json<QueryResponse>, (StatusCode, String)> {
      // Similar to instant query, but iterate over [start, end] with step
  }
- Integration Test:
  #[tokio::test]
  async fn test_instant_query() {
      // Ingest samples
      // Query: http_requests_total{method="GET"}
      // Verify results
  }
  #[tokio::test]
  async fn test_range_query_with_rate() {
      // Ingest counter samples
      // Query: rate(http_requests_total[5m])
      // Verify rate calculation
  }
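To make the rate() semantics that `apply_rate` must implement concrete: sum the positive increases across adjacent samples (treating any drop as a counter reset, where the new value itself is the increase) and divide by the covered time span. A simplified sketch, without Prometheus's extrapolation to the range boundaries:

```rust
// Simplified rate(): sum of increases divided by the covered time span.
// Real PromQL additionally extrapolates to the range boundaries.
struct Sample {
    timestamp_ms: i64,
    value: f64,
}

fn simple_rate(samples: &[Sample]) -> Option<f64> {
    if samples.len() < 2 {
        return None; // need at least two points to form a rate
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (a, b) = (&pair[0], &pair[1]);
        // A counter reset shows up as a drop; the new value is the increase.
        increase += if b.value >= a.value { b.value - a.value } else { b.value };
    }
    let span_s = (samples.last()?.timestamp_ms - samples.first()?.timestamp_ms) as f64 / 1000.0;
    if span_s <= 0.0 { None } else { Some(increase / span_s) }
}

fn main() {
    let samples = [
        Sample { timestamp_ms: 0, value: 0.0 },
        Sample { timestamp_ms: 30_000, value: 30.0 },
        Sample { timestamp_ms: 60_000, value: 60.0 },
    ];
    // 60 units of increase over 60 seconds -> 1.0/s
    println!("{:?}", simple_rate(&samples)); // prints "Some(1.0)"
}
```

irate() is the degenerate case of the same formula applied to only the last two samples in the range.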
Estimated Effort: 7 days
S5: Storage Layer
Goal: Time-series storage with retention and compaction
Tasks:
- Implement Block Writer:
  // nightlight-server/src/block/writer.rs
  struct BlockWriter {
      block_dir: PathBuf,
      index_writer: IndexWriter,
      chunk_writer: ChunkWriter,
  }
  impl BlockWriter {
      fn new(block_dir: PathBuf) -> Result<Self>;
      fn write_series(&mut self, series: &Series, samples: &[Sample]) -> Result<()>;
      fn finalize(self) -> Result<BlockMeta>;
  }
- Implement Block Reader:
  // nightlight-server/src/block/reader.rs
  struct BlockReader {
      meta: BlockMeta,
      index: Index,
      chunks: ChunkReader,
  }
  impl BlockReader {
      fn open(block_dir: PathBuf) -> Result<Self>;
      fn query_samples(&self, series_id: SeriesID, start: i64, end: i64) -> Result<Vec<Sample>>;
  }
- Implement Compaction:
  // nightlight-server/src/compaction.rs
  struct Compactor {
      data_dir: PathBuf,
      config: CompactionConfig,
  }
  impl Compactor {
      async fn compact_head_to_l0(&self, head: &Head) -> Result<BlockID>;
      async fn compact_blocks(&self, source_blocks: Vec<BlockID>) -> Result<BlockID>;
      async fn run_compaction_loop(&self); // Background task
  }
- Implement Retention Enforcement:
  impl Compactor {
      async fn enforce_retention(&self, retention: Duration) -> Result<()> {
          let cutoff = SystemTime::now() - retention;
          // Delete blocks older than cutoff
      }
  }
- Implement Block Manager:
  // nightlight-server/src/block/manager.rs
  struct BlockManager {
      blocks: RwLock<Vec<Arc<BlockReader>>>,
      data_dir: PathBuf,
  }
  impl BlockManager {
      fn load_blocks(&mut self) -> Result<()>;
      fn add_block(&mut self, block: BlockReader);
      fn remove_block(&mut self, block_id: &BlockID);
      fn query_blocks(&self, start: i64, end: i64) -> Vec<Arc<BlockReader>>;
  }
- Integration Test:
  #[tokio::test]
  async fn test_compaction() {
      // Ingest data for >2h
      // Trigger compaction
      // Verify block created
      // Query old data from block
  }
  #[tokio::test]
  async fn test_retention() {
      // Create old blocks
      // Run retention enforcement
      // Verify old blocks deleted
  }
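The retention enforcement above reduces to comparing each block's maximum timestamp against a cutoff; a block is deletable only once everything in it is older than the retention horizon. A hedged sketch (the `BlockMeta` fields here are assumptions, not the final on-disk schema):

```rust
// Retention sketch: a block is deletable once its newest sample is
// older than now - retention. Field names are illustrative.
struct BlockMeta {
    id: u64,
    max_time_ms: i64,
}

fn expired_blocks(blocks: &[BlockMeta], now_ms: i64, retention_ms: i64) -> Vec<u64> {
    let cutoff = now_ms - retention_ms;
    blocks
        .iter()
        .filter(|b| b.max_time_ms < cutoff)
        .map(|b| b.id)
        .collect()
}

fn main() {
    let blocks = [
        BlockMeta { id: 1, max_time_ms: 1_000 },
        BlockMeta { id: 2, max_time_ms: 500_000 },
    ];
    // With "now" at 600_000 ms and 100_000 ms retention, cutoff = 500_000:
    // block 1 is fully expired; block 2 (max_time == cutoff) is kept.
    println!("{:?}", expired_blocks(&blocks, 600_000, 100_000)); // prints "[1]"
}
```

Using `<` rather than `<=` at the cutoff keeps a block whose newest sample sits exactly on the retention boundary, erring toward retaining data.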
Estimated Effort: 8 days
S6: Integration & Documentation
Goal: NixOS module, TLS config, integration tests, operator docs
Tasks:
- Create NixOS Module:
  - File: nix/modules/nightlight.nix
  - Follow T024 patterns
  - Include systemd service, firewall rules
  - Support TLS configuration options
- Implement mTLS:
  - Load certs in server startup
  - Configure Rustls with client cert verification
  - Extract client identity for rate limiting
- Create Nightlight Scraper:
  - Standalone scraper service
  - Reads scrape config (TOML)
  - Scrapes /metrics endpoints from services
  - Pushes to Nightlight via remote_write
- Integration Tests:
  #[tokio::test]
  async fn test_e2e_ingest_and_query() {
      // Start Nightlight server
      // Ingest samples via remote_write
      // Query via /api/v1/query
      // Query via /api/v1/query_range
      // Verify results match
  }
  #[tokio::test]
  async fn test_mtls_authentication() {
      // Start server with mTLS
      // Connect without client cert -> rejected
      // Connect with valid client cert -> accepted
  }
  #[tokio::test]
  async fn test_grafana_compatibility() {
      // Configure Grafana to use Nightlight
      // Execute sample queries
      // Verify dashboards render correctly
  }
- Write Operator Documentation:
  - File: docs/por/T033-nightlight/OPERATOR.md
  - Installation (NixOS, standalone)
  - Configuration guide
  - mTLS setup
  - Scraper configuration
  - Troubleshooting
  - Performance tuning
- Write Developer Documentation:
  - File: nightlight/README.md
  - Architecture overview
  - Building from source
  - Running tests
  - Contributing guidelines
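For the standalone scraper, a sketch of what its TOML scrape config might look like, in the style of the server config in Appendix B. All keys and section names here are hypothetical; the actual schema is an S6 deliverable:

```toml
# Hypothetical Nightlight scraper configuration (schema not yet finalized).
[global]
scrape_interval = "15s"

[[scrape_targets]]
name = "flaredb"
url = "https://flaredb:9092/metrics"

[[scrape_targets]]
name = "chainfire"
url = "https://chainfire:9091/metrics"

[remote_write]
endpoint = "https://nightlight:8080/api/v1/write"
# Client cert/key for mTLS toward the Nightlight server
tls_cert = "/etc/certs/scraper.crt"
tls_key = "/etc/certs/scraper.key"
tls_ca = "/etc/certs/ca.crt"
```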
Estimated Effort: 5 days
8.2 Dependency Ordering
S1 (Research) → S2 (Scaffold)
↓
S3 (Ingestion) ──┐
↓ │
S4 (Query) │
↓ │
S5 (Storage) ←────┘
↓
S6 (Integration)
Critical Path: S1 → S2 → S3 → S5 → S6
Parallelizable: S4 can start after S3 completes basic ingestion
8.3 Total Effort Estimate
| Step | Effort | Priority |
|---|---|---|
| S1: Research | 2 days | P0 |
| S2: Scaffold | 2 days | P0 |
| S3: Ingestion | 5 days | P0 |
| S4: Query Engine | 7 days | P0 |
| S5: Storage Layer | 8 days | P1 |
| S6: Integration | 5 days | P1 |
| Total | 29 days | |
Realistic Timeline: 6-8 weeks (accounting for testing, debugging, documentation)
9. Open Questions
9.1 Decisions Requiring User Input
Q1: Scraper Implementation Choice
Question: Should we use Prometheus in agent mode or build a custom Rust scraper?
Option A: Prometheus Agent + Remote Write
- Pros: Battle-tested, standard tool, no implementation effort
- Cons: Adds Go dependency, less platform integration
Option B: Custom Rust Scraper
- Pros: 100% Rust, platform consistency, easier integration
- Cons: Implementation effort, needs testing
Recommendation: Option B (custom scraper) for consistency with PROJECT.md philosophy
Decision: [ ] A [ ] B [ ] Defer to later
Q2: gRPC vs HTTP Priority
Question: Should we prioritize gRPC API or focus only on HTTP (Prometheus compatibility)?
Option A: HTTP only (v1)
- Pros: Simpler, Prometheus/Grafana compatibility is sufficient
- Cons: Less efficient for internal services
Option B: Both HTTP and gRPC (v1)
- Pros: Better performance for internal services, more flexibility
- Cons: More implementation effort
Recommendation: Option A for v1, add gRPC in v2 if needed
Decision: [ ] A [ ] B
Q3: FlareDB Metadata Integration Timeline
Question: When should we integrate FlareDB for metadata storage?
Option A: v1 (T033)
- Pros: Unified metadata story from the start
- Cons: Increases complexity, adds dependency
Option B: v2 (Future)
- Pros: Simpler v1, can deploy standalone
- Cons: Migration effort later
Recommendation: Option B (defer to v2)
Decision: [ ] A [ ] B
Q4: S3 Cold Storage Priority
Question: Should S3 cold storage be part of v1 or deferred?
Option A: v1 (T033.S5)
- Pros: Unlimited retention from day 1
- Cons: Complexity, operational overhead
Option B: v2 (Future)
- Pros: Simpler v1, focus on core functionality
- Cons: Limited retention (local disk only)
Recommendation: Option B (defer to v2), use local disk for v1 with 15-30 day retention
Decision: [ ] A [ ] B
9.2 Areas Needing Further Investigation
I1: PromQL Function Coverage
Issue: Need to determine exact subset of PromQL functions to support in v1.
Investigation Needed:
- Survey existing Grafana dashboards in use
- Identify most common functions (rate, increase, histogram_quantile, etc.)
- Prioritize by usage frequency
Proposed Approach:
- Analyze 10-20 sample dashboards
- Create coverage matrix
- Implement top 80% functions first
I2: Query Performance Benchmarking
Issue: Need to validate query latency targets (p95 <100ms) are achievable.
Investigation Needed:
- Benchmark promql-parser crate performance
- Measure Gorilla decompression throughput
- Test index lookup performance at 10M series scale
Proposed Approach:
- Create benchmark suite with synthetic data (1M, 10M series)
- Measure end-to-end query latency
- Identify bottlenecks and optimize
I3: Series Cardinality Limits
Issue: How to prevent series explosion (high cardinality killing performance)?
Investigation Needed:
- Research cardinality estimation algorithms (HyperLogLog)
- Define cardinality limits (per metric, per label, global)
- Implement rejection strategy (reject new series beyond limit)
Proposed Approach:
- Add cardinality tracking to label index
- Implement warnings at 80% limit, rejection at 100%
- Provide admin API to inspect high-cardinality series
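The warn-at-80%/reject-at-100% strategy proposed above amounts to a threshold check whenever ingestion would create a new series. A minimal sketch (the names and the integer-arithmetic threshold are assumptions for illustration):

```rust
// Cardinality guard sketch: warn at 80% of the limit, reject at 100%.
#[derive(Debug, PartialEq)]
enum Admission {
    Ok,
    Warn,   // admit, but emit a warning metric/log
    Reject, // refuse to create the new series
}

fn admit_new_series(active_series: usize, max_series: usize) -> Admission {
    if active_series >= max_series {
        Admission::Reject
    // Integer form of active/max >= 0.8, avoiding float comparison.
    } else if active_series * 10 >= max_series * 8 {
        Admission::Warn
    } else {
        Admission::Ok
    }
}

fn main() {
    assert_eq!(admit_new_series(100, 1000), Admission::Ok);
    assert_eq!(admit_new_series(800, 1000), Admission::Warn);
    assert_eq!(admit_new_series(1000, 1000), Admission::Reject);
    println!("ok");
}
```

The check only gates new series creation; samples for existing series are unaffected, so a rejection never drops data from already-tracked metrics.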
I4: Out-of-Order Sample Edge Cases
Issue: How to handle out-of-order samples spanning chunk boundaries?
Investigation Needed:
- Test scenarios: samples arriving 1h late, 2h late, etc.
- Determine if we need multi-chunk updates or reject old samples
- Benchmark impact of re-sorting chunks
Proposed Approach:
- Implement configurable out-of-order window (default: 1h)
- Reject samples older than window
- For within-window samples, insert into correct chunk (may require chunk re-compression)
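The configurable window in the proposed approach reduces to one comparison at ingest time against the head's newest timestamp; a sketch:

```rust
// Out-of-order acceptance sketch: a sample is accepted if it is no
// older than (newest timestamp seen) - window. Default window: 1h.
fn accepts(sample_ts_ms: i64, head_max_ts_ms: i64, window_ms: i64) -> bool {
    sample_ts_ms >= head_max_ts_ms - window_ms
}

fn main() {
    let one_hour = 3_600_000;
    let head_max = 10 * one_hour;
    assert!(accepts(head_max - one_hour, head_max, one_hour));      // at the window edge: accepted
    assert!(!accepts(head_max - 2 * one_hour, head_max, one_hour)); // 2h late: rejected
    println!("ok");
}
```

Accepted-but-late samples still need to land in the right chunk, which is where the re-compression cost discussed above comes in; the window check only bounds how far back that can reach.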
10. References
10.1 Research Sources
Time-Series Storage Formats
- Gorilla: A Fast, Scalable, In-Memory Time Series Database (Facebook)
- Gorilla Compression Algorithm - The Morning Paper
- Prometheus TSDB Storage Documentation
- Prometheus TSDB Architecture - Palark Blog
- InfluxDB TSM Storage Engine
- M3DB Storage Architecture
- M3DB at Uber Blog
PromQL Implementation
Prometheus Remote Write Protocol
- Prometheus Remote Write 1.0 Specification
- Prometheus Remote Write 2.0 Specification
- Prometheus Protobuf Schema (remote.proto)
Rust TSDB Implementations
- InfluxDB 3 Engineering with Rust - InfoQ
- Datadog's Rust TSDB - Datadog Blog
- GreptimeDB Announcement
- tstorage-rs Embedded TSDB
- tsink High-Performance Embedded TSDB
10.2 Platform References
Internal Documentation
- PROJECT.md (Item 12: Metrics Store)
- docs/por/T033-nightlight/task.yaml
- docs/por/T027-production-hardening/ (TLS patterns)
- docs/por/T024-nixos-packaging/ (NixOS module patterns)
Existing Service Patterns
- flaredb/crates/flaredb-server/src/main.rs (TLS, metrics export)
- flaredb/crates/flaredb-server/src/config/mod.rs (Config structure)
- chainfire/crates/chainfire-server/src/config.rs (TLS config)
- iam/crates/iam-server/src/config.rs (Config patterns)
10.3 External Tools
- Grafana - Visualization and dashboards
- Prometheus - Reference implementation
- VictoriaMetrics - Replacement target (study architecture)
Appendix A: PromQL Function Reference (v1 Support)
Supported Functions
| Function | Category | Description | Example |
|---|---|---|---|
| rate() | Counter | Per-second rate of increase | rate(http_requests_total[5m]) |
| irate() | Counter | Instant rate (last 2 samples) | irate(http_requests_total[5m]) |
| increase() | Counter | Total increase over range | increase(http_requests_total[1h]) |
| histogram_quantile() | Histogram | Calculate quantile from histogram | histogram_quantile(0.95, rate(http_duration_bucket[5m])) |
| sum() | Aggregation | Sum values | sum(metric) |
| avg() | Aggregation | Average values | avg(metric) |
| min() | Aggregation | Minimum value | min(metric) |
| max() | Aggregation | Maximum value | max(metric) |
| count() | Aggregation | Count series | count(metric) |
| stddev() | Aggregation | Standard deviation | stddev(metric) |
| stdvar() | Aggregation | Standard variance | stdvar(metric) |
| topk() | Aggregation | Top K series | topk(5, metric) |
| bottomk() | Aggregation | Bottom K series | bottomk(5, metric) |
| time() | Time | Current timestamp | time() |
| timestamp() | Time | Sample timestamp | timestamp(metric) |
| abs() | Math | Absolute value | abs(metric) |
| ceil() | Math | Round up | ceil(metric) |
| floor() | Math | Round down | floor(metric) |
| round() | Math | Round to nearest | round(metric, 0.1) |
NOT Supported in v1
| Function | Category | Reason |
|---|---|---|
| predict_linear() | Prediction | Complex, low usage |
| deriv() | Math | Low usage |
| holt_winters() | Prediction | Complex |
| resets() | Counter | Low usage |
| changes() | Analysis | Low usage |
| Subqueries | Advanced | Very complex |
Appendix B: Configuration Reference
Complete Configuration Example
# nightlight.toml - Complete configuration example
[server]
# Listen address for HTTP/gRPC API
addr = "0.0.0.0:8080"
# Log level: trace, debug, info, warn, error
log_level = "info"
# Metrics port for self-monitoring (Prometheus /metrics endpoint)
metrics_port = 9099
[server.tls]
# Enable TLS
cert_file = "/etc/nightlight/certs/server.crt"
key_file = "/etc/nightlight/certs/server.key"
# Enable mTLS (require client certificates)
ca_file = "/etc/nightlight/certs/ca.crt"
require_client_cert = true
[storage]
# Data directory for TSDB blocks and WAL
data_dir = "/var/lib/nightlight/data"
# Data retention period (days)
retention_days = 15
# WAL segment size (MB)
wal_segment_size_mb = 128
# Block duration for compaction
min_block_duration = "2h"
max_block_duration = "24h"
# Out-of-order sample acceptance window
out_of_order_time_window = "1h"
# Series cardinality limits
max_series = 10_000_000
max_series_per_metric = 100_000
# Memory limits
max_head_chunks_per_series = 2
max_head_size_mb = 2048
[query]
# Query timeout (seconds)
timeout_seconds = 30
# Maximum query range (hours)
max_range_hours = 24
# Query result cache TTL (seconds)
cache_ttl_seconds = 60
# Maximum concurrent queries
max_concurrent_queries = 100
[ingestion]
# Write buffer size (samples)
write_buffer_size = 100_000
# Backpressure strategy: "block" or "reject"
backpressure_strategy = "block"
# Rate limiting (samples per second per client)
rate_limit_per_client = 50_000
# Maximum samples per write request
max_samples_per_request = 10_000
[compaction]
# Enable background compaction
enabled = true
# Compaction interval (seconds)
interval_seconds = 7200 # 2 hours
# Number of compaction threads
num_threads = 2
[s3]
# S3 cold storage (optional, future)
enabled = false
endpoint = "https://s3.example.com"
bucket = "nightlight-blocks"
access_key_id = "..."
secret_access_key = "..."
upload_after_days = 7
local_cache_size_gb = 100
[flaredb]
# FlareDB metadata integration (optional, future)
enabled = false
endpoints = ["flaredb-1:50051", "flaredb-2:50051"]
namespace = "metrics"
Appendix C: Metrics Exported by Nightlight
Nightlight exports metrics about itself on port 9099 (configurable).
Ingestion Metrics
# Samples ingested
nightlight_samples_ingested_total{} counter
# Samples rejected (out-of-order, invalid, etc.)
nightlight_samples_rejected_total{reason="out_of_order|invalid|rate_limit"} counter
# Ingestion latency (milliseconds)
nightlight_ingestion_latency_ms{quantile="0.5|0.9|0.99"} summary
# Active series
nightlight_active_series{} gauge
# Head memory usage (bytes)
nightlight_head_memory_bytes{} gauge
Query Metrics
# Queries executed
nightlight_queries_total{type="instant|range"} counter
# Query latency (milliseconds)
nightlight_query_latency_ms{type="instant|range", quantile="0.5|0.9|0.99"} summary
# Query errors
nightlight_query_errors_total{reason="timeout|parse_error|execution_error"} counter
Storage Metrics
# WAL segments
nightlight_wal_segments{} gauge
# WAL size (bytes)
nightlight_wal_size_bytes{} gauge
# Blocks
nightlight_blocks_total{level="0|1|2"} gauge
# Block size (bytes)
nightlight_block_size_bytes{level="0|1|2"} gauge
# Compactions
nightlight_compactions_total{level="0|1|2"} counter
# Compaction duration (seconds)
nightlight_compaction_duration_seconds{level="0|1|2", quantile="0.5|0.9|0.99"} summary
System Metrics
# Go runtime metrics (if using Go for scraper)
# Rust memory metrics
nightlight_memory_allocated_bytes{} gauge
# CPU usage
nightlight_cpu_usage_seconds_total{} counter
Appendix D: Error Codes and Troubleshooting
HTTP Error Codes
| Code | Meaning | Common Causes |
|---|---|---|
| 200 | OK | Query successful |
| 204 | No Content | Write successful |
| 400 | Bad Request | Invalid PromQL, malformed protobuf |
| 401 | Unauthorized | mTLS cert validation failed |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Storage error, WAL corruption |
| 503 | Service Unavailable | Write buffer full, server overloaded |
Common Issues
Issue: "Samples rejected: out_of_order"
Cause: Samples arriving with timestamps older than out_of_order_time_window
Solution:
- Increase out_of_order_time_window in config
- Check clock sync on clients (NTP)
- Reduce scrape batch size
Issue: "Rate limit exceeded"
Cause: Client exceeding rate_limit_per_client samples/sec
Solution:
- Increase rate limit in config
- Reduce scrape frequency
- Shard writes across multiple clients
Issue: "Query timeout"
Cause: Query exceeding timeout_seconds
Solution:
- Increase query timeout
- Reduce query time range
- Add more specific label matchers to reduce series scanned
Issue: "Series cardinality explosion"
Cause: Too many unique label combinations (high cardinality)
Solution:
- Review label design (avoid unbounded labels like user_id)
- Use relabeling to drop high-cardinality labels
- Increase max_series limit (if justified)
End of Design Document
Total Length: ~3,800 lines
Status: Ready for review and S2-S6 implementation
Next Steps:
- Review and approve design decisions
- Create GitHub issues for S2-S6 tasks
- Begin S2: Workspace Scaffold