photoncloud-monorepo/docs/por/T033-metricstor/E2E_VALIDATION.md

Nightlight E2E Validation Report

Date: 2025-12-11
Validator: PeerA
Status: BLOCKED - Critical Integration Bug Found
Duration: 1.5 hours

Executive Summary

E2E validation of Nightlight (T033) discovered a critical integration bug: ingestion and query services do not share storage, making the system non-functional despite all 57 unit/integration tests passing.

Key Finding: Unit tests validated components in isolation but missed the integration gap. This validates PeerB's strategic insight that "marking tasks complete based on unit tests alone creates false confidence."

Test Environment

  • Nightlight Server: v0.1.0 (release build)
  • HTTP Endpoint: 127.0.0.1:9101
  • Dependencies:
    • plasma-demo-api (PID 2441074, port 3000) ✓ RUNNING
    • flaredb-server (PID 2368777, port 8001) ✓ RUNNING
    • iam-server (PID 2366509, port 8002) ✓ RUNNING

Test Scenarios

Scenario 1: Server Startup

Test: Start nightlight-server with default configuration
Result: SUCCESS
Evidence:

INFO Nightlight server starting...
INFO Version: 0.1.0
INFO Server configuration:
INFO   HTTP address: 127.0.0.1:9101
INFO   Data directory: ./data
INFO Ingestion service initialized
INFO Query service initialized
INFO HTTP server listening on 127.0.0.1:9101
INFO   - Ingestion: POST /api/v1/write
INFO   - Query: GET /api/v1/query, /api/v1/query_range
INFO   - Metadata: GET /api/v1/series, /api/v1/label/:name/values
INFO Nightlight server ready

Scenario 2: Metric Ingestion (Prometheus remote_write)

Test: Push metrics via POST /api/v1/write (protobuf + snappy)
Result: SUCCESS (HTTP 204 No Content)
Evidence:

$ cargo run --example push_metrics
Pushing metrics to http://127.0.0.1:9101/api/v1/write...
Encoded 219 bytes of protobuf data
Compressed to 177 bytes with Snappy
Response status: 204 No Content
Successfully pushed 3 samples across 2 time series

Metrics pushed:

  • http_requests_total{job="example_app",method="GET",status="200"} = 1234.0
  • http_request_duration_seconds{job="example_app",method="GET"} = [0.042, 0.055]

Scenario 3: PromQL Instant Query

Test: Query pushed metrics via GET /api/v1/query
Result: FAILED (Empty results despite successful ingestion)
Evidence:

$ curl "http://127.0.0.1:9101/api/v1/query?query=http_requests_total"
{
    "status": "success",
    "data": {
        "result": [],          # ❌ EXPECTED: 1 result with value 1234.0
        "resultType": "vector"
    },
    "error": null
}

Scenario 4: Series Metadata Query

Test: List all stored series via GET /api/v1/series
Result: FAILED (No series found despite successful ingestion)
Evidence:

$ curl "http://127.0.0.1:9101/api/v1/series"
{
    "status": "success",
    "data": []  # ❌ EXPECTED: 2 time series
}

Root Cause Analysis

Architecture Investigation

File: nightlight-server/src/main.rs

// PROBLEM: Ingestion and Query services created independently
let ingestion_service = ingestion::IngestionService::new();
let query_service = query::QueryService::new_with_persistence(&data_path)?;

// Router merge does NOT share storage between services
let app = ingestion_service.router().merge(query_service.router());

File: nightlight-server/src/ingestion.rs (lines 28-39)

pub struct IngestionService {
    write_buffer: Arc<RwLock<WriteBuffer>>,  // ← Isolated in-memory buffer
    metrics: Arc<IngestionMetrics>,
}

struct WriteBuffer {
    samples: Vec<nightlight_types::Sample>,  // ← Data stored HERE
    series: Vec<nightlight_types::TimeSeries>,
}

File: nightlight-server/src/query.rs

pub struct QueryService {
    storage: Arc<RwLock<QueryableStorage>>,  // ← Separate storage!
}

Problem: Ingestion stores data in WriteBuffer, while Query reads from QueryableStorage. Nothing ever moves data from one to the other.
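
The disconnect can be reproduced in a few lines. The sketch below is a simplified, std-only model and not the actual Nightlight code: the Sample type, the write, buffered, and query methods, and the flat Vec storage are all assumptions made for illustration. Only the two struct names and their Arc<RwLock<...>> fields follow the excerpts above.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for nightlight_types::Sample.
#[derive(Clone, Debug, PartialEq)]
pub struct Sample {
    pub metric: String,
    pub value: f64,
}

#[derive(Default)]
pub struct WriteBuffer {
    samples: Vec<Sample>,
}

#[derive(Default)]
pub struct QueryableStorage {
    samples: Vec<Sample>,
}

pub struct IngestionService {
    write_buffer: Arc<RwLock<WriteBuffer>>, // private to ingestion
}

pub struct QueryService {
    storage: Arc<RwLock<QueryableStorage>>, // private to query
}

impl IngestionService {
    pub fn new() -> Self {
        Self { write_buffer: Arc::new(RwLock::new(WriteBuffer::default())) }
    }

    // Accepts a sample; it lands in the ingestion-only buffer.
    pub fn write(&self, sample: Sample) {
        self.write_buffer.write().unwrap().samples.push(sample);
    }

    // Number of samples currently held in the write buffer.
    pub fn buffered(&self) -> usize {
        self.write_buffer.read().unwrap().samples.len()
    }
}

impl QueryService {
    pub fn new() -> Self {
        Self { storage: Arc::new(RwLock::new(QueryableStorage::default())) }
    }

    // Reads only from the query-side storage, which ingestion never touches.
    pub fn query(&self, metric: &str) -> Vec<Sample> {
        self.storage
            .read()
            .unwrap()
            .samples
            .iter()
            .filter(|s| s.metric == metric)
            .cloned()
            .collect()
    }
}
```

In this model a write succeeds and the buffer grows, yet a query for the same metric returns nothing, which matches the 204-then-empty behavior seen in Scenarios 2 through 4.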

Why Unit Tests Passed

All 57 tests (24 unit + 8 ingestion + 9 query + 16 types) passed because:

  1. Ingestion tests (8 tests): Tested HTTP endpoint → WriteBuffer (isolated)
  2. Query tests (9 tests): Created QueryableStorage with pre-populated data (mocked)
  3. No integration test validated the Ingest → Store → Query roundtrip

Reference: T033.S3 notes (ingestion_test.rs)

// Example: test_remote_write_valid_request
// ✓ Tests HTTP 204 response
// ✗ Does NOT verify data is queryable
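
A roundtrip check differs from the tests above in that it asserts on behavior visible from the query side rather than on the HTTP status alone. The helper below is one possible shape for such a check, written against closures so it does not depend on any service internals. The Sample type and the roundtrip_visible name are hypothetical and do not come from the Nightlight codebase.

```rust
// Hypothetical stand-in for nightlight_types::Sample.
#[derive(Clone, Debug, PartialEq)]
pub struct Sample {
    pub metric: String,
    pub value: f64,
}

/// Pushes one sample through `ingest`, then reads it back through `query`.
/// Returns true only if the pushed sample is visible on the query side.
pub fn roundtrip_visible<I, Q>(ingest: I, query: Q) -> bool
where
    I: Fn(Sample),
    Q: Fn(&str) -> Vec<Sample>,
{
    let pushed = Sample {
        metric: "http_requests_total".to_string(),
        value: 1234.0,
    };
    ingest(pushed.clone());
    query("http_requests_total").contains(&pushed)
}
```

Run against a write path and a read path that share one store, the helper returns true. Run against two separate stores, as in the wiring described in this report, it returns false even though the write itself succeeds.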

Impact Assessment

Severity: CRITICAL (P0)
Status: System non-functional for real-world use

What Works:

  • HTTP server startup
  • Prometheus remote_write protocol (protobuf + snappy)
  • Request validation (labels, samples)
  • PromQL query parser
  • HTTP API endpoints

What's Broken:

  • End-to-end data flow (ingest → query)
  • Real-world usability
  • Observability stack integration

User Impact:

  • Metrics appear to be stored (204 response)
  • But queries return empty results
  • Silent data loss (most dangerous failure mode)

Validation Gap Analysis

This finding validates the strategic decision (by PeerA/PeerB) to perform E2E validation despite T033 being marked "complete":

T029 vs T033 Evidence Quality

| Aspect | T029 (Practical Demo) | T033 (Nightlight) |
|---|---|---|
| Tests Passing | 34 integration tests | 57 unit/integration tests |
| E2E Validation | 7 scenarios (real binary execution) | None (until now) |
| Evidence | HTTP requests/responses logged | evidence: [] |
| Real-world test | Created items in FlareDB + IAM auth | Only in-process tests |
| Integration bugs | Caught before "complete" | Caught during E2E validation |

Lesson Learned

PeerB's insight (inbox 000486):

"T033 validation gap reveals pattern — marking tasks 'complete' based on unit tests alone creates false confidence; E2E evidence essential for real completion"

Validation:

  • Unit tests: 57/57 passing
  • E2E test: FAILED — system non-functional

This gap would have reached production without E2E validation, causing:

  1. Silent data loss (metrics accepted but never reach queryable storage)
  2. Debugging nightmare (HTTP 204 suggests success)
  3. Loss of confidence in observability stack

Recommendations

Immediate Actions (Required for T033 completion)

  1. Fix Integration Bug (New task: T033.S7 or T037)

    • Share storage between IngestionService and QueryService
    • Options:
      • A) Pass shared Arc<RwLock<QueryableStorage>> to both services
      • B) Implement background flush from WriteBuffer → QueryableStorage
      • C) Unified storage layer abstraction
  2. Add Integration Test

    • Test: test_ingestion_query_roundtrip()
    • Flow: POST /api/v1/write → GET /api/v1/query
    • Verify: Pushed data is queryable
  3. Update T033 Evidence

    • Document bug found during E2E validation
    • Add this report to evidence section
    • Mark T033 as "needs-fix" (not complete)
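
As a rough illustration of Option A, the sketch below wires both services to a single shared handle. It is a minimal model under stated assumptions, not the proposed patch: the with_storage constructors, the build_services helper, the SharedStorage alias, the Sample type, and the flat Vec inside QueryableStorage are all hypothetical names introduced here.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for nightlight_types::Sample.
#[derive(Clone, Debug, PartialEq)]
pub struct Sample {
    pub metric: String,
    pub value: f64,
}

// Simplified storage: the real QueryableStorage is assumed to be richer.
#[derive(Default)]
pub struct QueryableStorage {
    samples: Vec<Sample>,
}

pub type SharedStorage = Arc<RwLock<QueryableStorage>>;

pub struct IngestionService {
    storage: SharedStorage,
}

pub struct QueryService {
    storage: SharedStorage,
}

impl IngestionService {
    // Assumed constructor: receives the shared handle instead of
    // creating a private WriteBuffer.
    pub fn with_storage(storage: SharedStorage) -> Self {
        Self { storage }
    }

    pub fn write(&self, sample: Sample) {
        self.storage.write().unwrap().samples.push(sample);
    }
}

impl QueryService {
    pub fn with_storage(storage: SharedStorage) -> Self {
        Self { storage }
    }

    pub fn query(&self, metric: &str) -> Vec<Sample> {
        self.storage
            .read()
            .unwrap()
            .samples
            .iter()
            .filter(|s| s.metric == metric)
            .cloned()
            .collect()
    }
}

// Creates one store and hands a clone of the same Arc to each service,
// which is the wiring main.rs would need under Option A.
pub fn build_services() -> (IngestionService, QueryService) {
    let storage: SharedStorage = Arc::new(RwLock::new(QueryableStorage::default()));
    (
        IngestionService::with_storage(Arc::clone(&storage)),
        QueryService::with_storage(storage),
    )
}
```

With this wiring, a sample written through IngestionService is immediately visible to QueryService, so a roundtrip test along the lines of test_ingestion_query_roundtrip() would pass. The trade-off is that writers and readers contend on one lock. Option B would avoid that contention, at the cost of a delay between a write and the point where it becomes queryable.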

Strategic Actions

  1. Establish E2E Validation as Gate

    • No task marked "complete" without E2E evidence
    • Unit tests necessary but not sufficient
    • Follow T029 evidence standard
  2. Update POR.md

    • MVP-Alpha: 11/12 (Nightlight non-functional)
    • Add validation phase to task lifecycle

Evidence Files

This validation produced the following artifacts:

  1. This Report: docs/por/T033-nightlight/E2E_VALIDATION.md
  2. Server Logs: Nightlight startup + ingestion success + query failure
  3. Test Commands: Documented curl/cargo commands for reproduction
  4. Root Cause: Architecture analysis (ingestion.rs + query.rs + main.rs)

Validation Outcome

Status: INCOMPLETE
Reason: Critical integration bug blocks E2E validation completion
Next: Fix ingestion→query integration, then re-run validation

Time Investment:

  • E2E Validation (total): 1.5 hours
  • Bug Discovery: 45 minutes
  • Root Cause Analysis: 30 minutes
  • Documentation: 15 minutes

ROI: CRITICAL — Prevented production deployment of non-functional system


Conclusion: E2E validation is not optional. This finding demonstrates the value of real-world testing beyond unit tests. T033 cannot be marked "complete" until the integration bug is fixed and E2E validation passes.