Metricstor E2E Validation Report

Date: 2025-12-11
Validator: PeerA
Status: BLOCKED - Critical Integration Bug Found
Duration: 1.5 hours

Executive Summary

E2E validation of Metricstor (T033) discovered a critical integration bug: ingestion and query services do not share storage, making the system non-functional despite all 57 unit/integration tests passing.

Key Finding: Unit tests validated components in isolation but missed the integration gap. This validates PeerB's strategic insight that "marking tasks complete based on unit tests alone creates false confidence."

Test Environment

Metricstor Server: v0.1.0 (release build)
HTTP Endpoint: 127.0.0.1:9101
Dependencies:
- plasma-demo-api (PID 2441074, port 3000) ✓ RUNNING
- flaredb-server (PID 2368777, port 8001) ✓ RUNNING
- iam-server (PID 2366509, port 8002) ✓ RUNNING

Test Scenarios

✅ Scenario 1: Server Startup

Test: Start metricstor-server with default configuration
Result: SUCCESS
Evidence:

INFO Metricstor server starting...
INFO Version: 0.1.0
INFO Server configuration:
INFO   HTTP address: 127.0.0.1:9101
INFO   Data directory: ./data
INFO Ingestion service initialized
INFO Query service initialized
INFO HTTP server listening on 127.0.0.1:9101
INFO   - Ingestion: POST /api/v1/write
INFO   - Query: GET /api/v1/query, /api/v1/query_range
INFO   - Metadata: GET /api/v1/series, /api/v1/label/:name/values
INFO Metricstor server ready

✅ Scenario 2: Metric Ingestion (Prometheus remote_write)

Test: Push metrics via POST /api/v1/write (protobuf + snappy)
Result: SUCCESS (HTTP 204 No Content)
Evidence:

$ cargo run --example push_metrics
Pushing metrics to http://127.0.0.1:9101/api/v1/write...
Encoded 219 bytes of protobuf data
Compressed to 177 bytes with Snappy
Response status: 204 No Content
Successfully pushed 3 samples across 2 time series

Metrics pushed:

http_requests_total{job="example_app",method="GET",status="200"} = 1234.0
http_request_duration_seconds{job="example_app",method="GET"} = [0.042, 0.055]

❌ Scenario 3: PromQL Instant Query

Test: Query pushed metrics via GET /api/v1/query
Result: FAILED (Empty results despite successful ingestion)
Evidence:

$ curl "http://127.0.0.1:9101/api/v1/query?query=http_requests_total"
{
    "status": "success",
    "data": {
        "result": [],          # ❌ EXPECTED: 1 result with value 1234.0
        "resultType": "vector"
    },
    "error": null
}

❌ Scenario 4: Series Metadata Query

Test: List all stored series via GET /api/v1/series
Result: FAILED (No series found despite successful ingestion)
Evidence:

$ curl "http://127.0.0.1:9101/api/v1/series"
{
    "status": "success",
    "data": []  # ❌ EXPECTED: 2 time series
}

Root Cause Analysis

Architecture Investigation

File: metricstor-server/src/main.rs

// PROBLEM: Ingestion and Query services created independently
let ingestion_service = ingestion::IngestionService::new();
let query_service = query::QueryService::new_with_persistence(&data_path)?;

// Router merge does NOT share storage between services
let app = ingestion_service.router().merge(query_service.router());

File: metricstor-server/src/ingestion.rs (lines 28-39)

pub struct IngestionService {
    write_buffer: Arc<RwLock<WriteBuffer>>,  // ← Isolated in-memory buffer
    metrics: Arc<IngestionMetrics>,
}

struct WriteBuffer {
    samples: Vec<metricstor_types::Sample>,  // ← Data stored HERE
    series: Vec<metricstor_types::TimeSeries>,
}

File: metricstor-server/src/query.rs

pub struct QueryService {
    storage: Arc<RwLock<QueryableStorage>>,  // ← Separate storage!
}

Problem: Ingestion stores data in its own WriteBuffer, while Query reads from a separate QueryableStorage. Nothing moves data from one to the other, so ingested samples are never visible to queries.
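
The isolation can be reproduced without the HTTP layer. The following is a minimal sketch, not the actual Metricstor code: `Sample`, `Ingest`, and `Query` are simplified stand-ins for the types shown above, and their fields and methods are assumptions made for illustration. Each service allocates its own `Arc<RwLock<...>>`, mirroring how main.rs constructs the two services independently.

```rust
use std::sync::{Arc, RwLock};

// Simplified stand-in for metricstor_types::Sample (hypothetical fields).
struct Sample {
    metric: String,
    value: f64,
}

// Stand-in for IngestionService: writes land in its private buffer.
struct Ingest {
    write_buffer: Arc<RwLock<Vec<Sample>>>,
}

// Stand-in for QueryService: reads come from a different allocation.
struct Query {
    storage: Arc<RwLock<Vec<Sample>>>,
}

impl Ingest {
    fn new() -> Self {
        Ingest { write_buffer: Arc::new(RwLock::new(Vec::new())) }
    }

    fn write(&self, metric: &str, value: f64) {
        let sample = Sample { metric: metric.to_string(), value };
        self.write_buffer.write().unwrap().push(sample);
    }
}

impl Query {
    fn new() -> Self {
        Query { storage: Arc::new(RwLock::new(Vec::new())) }
    }

    // Returns the values of all samples whose metric name matches.
    fn query(&self, metric: &str) -> Vec<f64> {
        let guard = self.storage.read().unwrap();
        guard.iter().filter(|s| s.metric == metric).map(|s| s.value).collect()
    }
}

// Returns (samples buffered by ingestion, samples visible to queries).
fn isolated_roundtrip() -> (usize, usize) {
    let ingest = Ingest::new();
    let query = Query::new();
    ingest.write("http_requests_total", 1234.0);
    let buffered = ingest.write_buffer.read().unwrap().len();
    let visible = query.query("http_requests_total").len();
    (buffered, visible)
}

fn main() {
    let (buffered, visible) = isolated_roundtrip();
    // Prints "buffered=1 visible=0": the write succeeded, the query sees nothing.
    println!("buffered={buffered} visible={visible}");
}
```

In this model the write succeeds and the query returns nothing, which matches the pairing of HTTP 204 on ingestion with an empty `result` array seen in Scenarios 2 and 3.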

Why Unit Tests Passed

All 57 tests (24 unit + 8 ingestion + 9 query + 16 types) passed because:

- Ingestion tests (8 tests): Tested HTTP endpoint → WriteBuffer (isolated)
- Query tests (9 tests): Created QueryableStorage with pre-populated data (mocked)
- No integration test validated the Ingest → Store → Query roundtrip

Reference: T033.S3 notes (ingestion_test.rs)

// Example: test_remote_write_valid_request
// ✓ Tests HTTP 204 response
// ✗ Does NOT verify data is queryable

Impact Assessment

Severity: CRITICAL (P0)
Status: System non-functional for real-world use

What Works:

✅ HTTP server startup
✅ Prometheus remote_write protocol (protobuf + snappy)
✅ Request validation (labels, samples)
✅ PromQL query parser
✅ HTTP API endpoints

What's Broken:

❌ End-to-end data flow (ingest → query)
❌ Real-world usability
❌ Observability stack integration

User Impact:

- Metrics appear to be stored (HTTP 204 response)
- Queries for those metrics return empty results
- Silent data loss (the most dangerous failure mode)

Validation Gap Analysis

This finding validates the strategic decision (by PeerA/PeerB) to perform E2E validation despite T033 being marked "complete":

T029 vs T033 Evidence Quality

| Aspect | T029 (Practical Demo) | T033 (Metricstor) |
| --- | --- | --- |
| Tests Passing | 34 integration tests | 57 unit/integration tests |
| E2E Validation | ✅ 7 scenarios (real binary execution) | ❌ None (until now) |
| Evidence | HTTP requests/responses logged | `evidence: []` |
| Real-world test | Created items in FlareDB + IAM auth | Only in-process tests |
| Integration bugs | Caught before "complete" | Caught during E2E validation |

Lesson Learned

PeerB's insight (inbox 000486):

"T033 validation gap reveals pattern — marking tasks 'complete' based on unit tests alone creates false confidence; E2E evidence essential for real completion"

Validation:

Unit tests: 57/57 passing ✅
E2E test: FAILED — system non-functional ❌

This gap would have reached production without E2E validation, causing:

- Silent data loss (metrics accepted but never reaching queryable storage)
- Debugging nightmare (HTTP 204 suggests success)
- Loss of confidence in the observability stack

Recommendations

Immediate Actions (Required for T033 completion)

1. Fix Integration Bug (New task: T033.S7 or T037)
   - Share storage between IngestionService and QueryService
   - Options:
     - A) Pass shared Arc<RwLock<QueryableStorage>> to both services
     - B) Implement background flush from WriteBuffer → QueryableStorage
     - C) Unified storage layer abstraction
2. Add Integration Test
   - Test: test_ingestion_query_roundtrip()
   - Flow: POST /api/v1/write → GET /api/v1/query
   - Verify: Pushed data is queryable
3. Update T033 Evidence
   - Document bug found during E2E validation
   - Add this report to evidence section
   - Mark T033 as "needs-fix" (not complete)
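
Option A above could look roughly like the following. This is a sketch under stated assumptions, not the Metricstor implementation: `Storage`, `SharedStorage`, the `with_storage` constructors, and the simplified `Sample` are hypothetical names introduced for illustration, and the real services would still need their routers and persistence. The only point it makes is that both services hold clones of the same `Arc`, so a write through one is visible through the other.

```rust
use std::sync::{Arc, RwLock};

// Simplified stand-in for metricstor_types::Sample (hypothetical fields).
struct Sample {
    metric: String,
    value: f64,
}

// Hypothetical shared storage; plays the role of QueryableStorage.
#[derive(Default)]
struct Storage {
    samples: Vec<Sample>,
}

type SharedStorage = Arc<RwLock<Storage>>;

struct IngestionService {
    storage: SharedStorage,
}

struct QueryService {
    storage: SharedStorage,
}

impl IngestionService {
    // Hypothetical constructor: receives the storage handle instead of creating one.
    fn with_storage(storage: SharedStorage) -> Self {
        IngestionService { storage }
    }

    fn write(&self, metric: &str, value: f64) {
        let mut guard = self.storage.write().unwrap();
        guard.samples.push(Sample { metric: metric.to_string(), value });
    }
}

impl QueryService {
    fn with_storage(storage: SharedStorage) -> Self {
        QueryService { storage }
    }

    // Returns the values of all samples whose metric name matches.
    fn query(&self, metric: &str) -> Vec<f64> {
        let guard = self.storage.read().unwrap();
        guard
            .samples
            .iter()
            .filter(|s| s.metric == metric)
            .map(|s| s.value)
            .collect()
    }
}

// Ingest -> store -> query roundtrip against one shared storage handle.
fn roundtrip() -> Vec<f64> {
    let storage: SharedStorage = Arc::new(RwLock::new(Storage::default()));
    let ingestion = IngestionService::with_storage(Arc::clone(&storage));
    let query = QueryService::with_storage(storage);
    ingestion.write("http_requests_total", 1234.0);
    query.query("http_requests_total")
}

fn main() {
    // Prints "[1234.0]": the ingested sample is visible to the query side.
    println!("{:?}", roundtrip());
}
```

The `roundtrip` function has the same shape as the proposed test_ingestion_query_roundtrip(), only in-process rather than over HTTP. Option A is likely the smallest change; whether a single lock shared by writers and readers holds up under load is a separate question that options B and C would address.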

Strategic Actions

1. Establish E2E Validation as Gate
   - No task marked "complete" without E2E evidence
   - Unit tests necessary but not sufficient
   - Follow T029 evidence standard
2. Update POR.md
   - MVP-Alpha: 11/12 (Metricstor non-functional)
   - Add validation phase to task lifecycle

Evidence Files

This validation produced the following artifacts:

- This Report: docs/por/T033-metricstor/E2E_VALIDATION.md
- Server Logs: Metricstor startup + ingestion success + query failure
- Test Commands: Documented curl/cargo commands for reproduction
- Root Cause: Architecture analysis (ingestion.rs + query.rs + main.rs)

Validation Outcome

Status: INCOMPLETE
Reason: Critical integration bug blocks E2E validation completion
Next: Fix ingestion→query integration, then re-run validation

Time Investment:

E2E Validation (total): 1.5 hours
- Bug Discovery: 45 minutes
- Root Cause Analysis: 30 minutes
- Documentation: 15 minutes

ROI: CRITICAL — Prevented production deployment of non-functional system

Conclusion: E2E validation is not optional. This finding demonstrates the value of real-world testing beyond unit tests. T033 cannot be marked "complete" until the integration bug is fixed and E2E validation passes.
