Chainfire T003 Feature Gap Analysis

Audit Date: 2025-12-08
Spec Version: 1.0
Implementation Path: /home/centra/cloud/chainfire/crates/


Executive Summary

Total Features Analyzed: 32
Implemented: 16 (50.0%)
Partially Implemented: 7 (21.9%, counting SWIM Gossip, which is present but whose integration is unclear)
Missing: 9 (28.1%)

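The core KV operations, Raft consensus, Watch functionality, and basic cluster management are implemented and functional. Critical gaps exist in TTL/Lease management, read consistency controls, and transaction completeness. Production readiness is blocked by the missing Lease service and undefined read consistency (both P0), followed by incomplete transaction responses and missing metrics and health checks (P1). Authentication is also absent, but the spec marks it as "Planned" and it is tracked here as P2.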


Feature Gap Matrix

| Feature | Spec Section | Status | Priority | Complexity | Notes |
|---|---|---|---|---|---|
| Lease Service (TTL) | 8.3, 4.1 | Missing | P0 | Medium (3-5d) | Protocol has lease field but no Lease gRPC service; critical for production |
| TTL Expiration Logic | 4.1, spec lines 22-23 | Missing | P0 | Medium (3-5d) | lease_id stored but no background expiration worker |
| Read Consistency Levels | 4.1 | Missing | P0 | Small (1-2d) | Local/Serializable/Linearizable not implemented; all reads have undefined consistency |
| Range Ops in Transactions | 4.2, kv_service.rs:224-229 | ⚠️ Partial | P1 | Small (1-2d) | RequestOp has RangeRequest but returns dummy Delete op (kv_service.rs:224-229) |
| Transaction Responses | 3.1, kv_service.rs:194 | ⚠️ Partial | P1 | Small (1-2d) | TxnResponse.responses is empty vec; TODO comment in code |
| Point-in-Time Reads | 3.1, 7.3 | ⚠️ Partial | P1 | Medium (3-5d) | RangeRequest has revision field but KvStore doesn't use it |
| StorageBackend Trait | 3.3 | Missing | P1 | Medium (3-5d) | Spec defines trait (lines 166-174) but not in chainfire-core |
| Prometheus Metrics | 7.2 | Missing | P1 | Small (1-2d) | Spec mentions endpoint but no implementation |
| Health Check Service | 7.2 | Missing | P1 | Small (1d) | gRPC health check not visible |
| Authentication | 6.1 | Missing | P2 | Large (1w+) | Spec says "Planned"; mTLS for peers, tokens for clients |
| Authorization/RBAC | 6.2 | Missing | P2 | Large (1w+) | Requires IAM integration |
| Namespace Quotas | 6.3 | Missing | P2 | Medium (3-5d) | Per-namespace resource limits |
| KV Service - Range | 3.1 | Implemented | - | - | Single key, range scan, prefix scan all working |
| KV Service - Put | 3.1 | Implemented | - | - | Including prev_kv support |
| KV Service - Delete | 3.1 | Implemented | - | - | Single and range delete working |
| KV Service - Txn (Basic) | 3.1 | Implemented | - | - | Compare conditions and basic ops working |
| Watch Service | 3.1 | Implemented | - | - | Bidirectional streaming, create/cancel/progress |
| Cluster Service - All | 3.1 | Implemented | - | - | MemberAdd/Remove/List/Status all present |
| Client Library - Core | 3.2 | Implemented | - | - | Connect, put, get, delete, CAS implemented |
| Client - Prefix Scan | 3.2 | Implemented | - | - | get_prefix method exists |
| ClusterEventHandler | 3.3 | Implemented | - | - | All 8 callbacks defined in callbacks.rs |
| KvEventHandler | 3.3 | Implemented | - | - | on_key_changed, on_key_deleted, on_prefix_changed |
| ClusterBuilder | 3.4 | Implemented | - | - | Embeddable library with builder pattern |
| MVCC Support | 4.3 | Implemented | - | - | Global revision counter, create/mod revisions tracked |
| RocksDB Storage | 4.3 | Implemented | - | - | Column families: raft_logs, raft_meta, key_value, snapshot |
| Raft Integration | 2.0 | Implemented | - | - | OpenRaft 0.9 integrated, Vote/AppendEntries/Snapshot RPCs |
| SWIM Gossip | 2.1 | ⚠️ Present | P2 | - | chainfire-gossip crate exists but integration unclear |
| Server Binary | 7.1 | Implemented | - | - | CLI with config file, env vars, bootstrap support |
| Config Management | 5.0 | Implemented | - | - | TOML config, env vars, CLI overrides |
| Watch - Historical Replay | 3.1 | ⚠️ Partial | P2 | Medium (3-5d) | start_revision exists in proto but historical storage unclear |
| Snapshot & Backup | 7.3 | ⚠️ Partial | P2 | Small (1-2d) | Raft snapshot exists but manual backup procedure not documented |
| etcd Compatibility | 8.3 | ⚠️ Partial | P2 | - | API similar but package names differ; missing Lease service breaks compatibility |

Critical Gaps (P0)

1. Lease Service & TTL Expiration

Impact: Blocks production use cases requiring automatic key expiration (sessions, locks, ephemeral data)

Evidence:

  • /home/centra/cloud/chainfire/proto/chainfire.proto has no Lease service definition
  • KvEntry has lease_id: Option<i64> field (types/kv.rs:23) but no expiration logic
  • No background worker to delete expired keys
  • etcd compatibility broken without Lease service

Fix Required:

  1. Add Lease service to proto: LeaseGrant, LeaseRevoke, LeaseKeepAlive, LeaseTimeToLive
  2. Implement lease storage and expiration worker in chainfire-storage
  3. Wire lease_id checks to KV operations
  4. Add lease_id index for efficient expiration queries
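
The lease bookkeeping behind steps 2-4 could be prototyped roughly as follows. This is a minimal in-memory sketch, not Chainfire code: LeaseTable, its method names, and the caller-supplied millisecond clock are all hypothetical, and a real implementation would persist leases and replicate grants and revocations through Raft.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical in-memory lease table (illustrative names only).
#[derive(Default)]
pub struct LeaseTable {
    next_id: i64,
    leases: HashMap<i64, Lease>,
    /// (expiry, lease_id) index so the worker only touches leases that are due.
    by_expiry: BTreeSet<(u64, i64)>,
}

struct Lease {
    ttl_ms: u64,
    expires_at_ms: u64,
    keys: Vec<Vec<u8>>,
}

impl LeaseTable {
    /// LeaseGrant: create a lease that expires `ttl_ms` after `now_ms`.
    pub fn grant(&mut self, ttl_ms: u64, now_ms: u64) -> i64 {
        self.next_id += 1;
        let id = self.next_id;
        let expires_at_ms = now_ms + ttl_ms;
        self.leases.insert(id, Lease { ttl_ms, expires_at_ms, keys: Vec::new() });
        self.by_expiry.insert((expires_at_ms, id));
        id
    }

    /// Called from Put: returns false for an unknown lease so the write can be rejected.
    pub fn attach(&mut self, id: i64, key: Vec<u8>) -> bool {
        match self.leases.get_mut(&id) {
            Some(lease) => {
                lease.keys.push(key);
                true
            }
            None => false,
        }
    }

    /// LeaseKeepAlive: push the expiry forward by the original TTL.
    pub fn keep_alive(&mut self, id: i64, now_ms: u64) -> Option<u64> {
        let lease = self.leases.get_mut(&id)?;
        self.by_expiry.remove(&(lease.expires_at_ms, id));
        lease.expires_at_ms = now_ms + lease.ttl_ms;
        self.by_expiry.insert((lease.expires_at_ms, id));
        Some(lease.ttl_ms)
    }

    /// LeaseRevoke: drop the lease and return the keys that must be deleted.
    pub fn revoke(&mut self, id: i64) -> Vec<Vec<u8>> {
        match self.leases.remove(&id) {
            Some(lease) => {
                self.by_expiry.remove(&(lease.expires_at_ms, id));
                lease.keys
            }
            None => Vec::new(),
        }
    }

    /// One pass of the expiration worker: remove every lease due at `now_ms`
    /// and return the attached keys for deletion.
    pub fn expire(&mut self, now_ms: u64) -> Vec<Vec<u8>> {
        let mut keys = Vec::new();
        while let Some((at, id)) = self.by_expiry.first().copied() {
            if at > now_ms {
                break;
            }
            self.by_expiry.remove(&(at, id));
            if let Some(lease) = self.leases.remove(&id) {
                keys.extend(lease.keys);
            }
        }
        keys
    }
}
```

Passing the clock in as a parameter keeps the expiry logic deterministic, which matters if expirations are applied through the replicated state machine rather than decided independently on each node.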

2. Read Consistency Levels

Impact: Cannot guarantee linearizable reads; stale reads possible on followers

Evidence:

  • Spec defines ReadConsistency enum (spec lines 208-215)
  • No implementation in chainfire-storage or chainfire-api
  • RangeRequest in kv_service.rs always reads from local storage without consistency checks

Fix Required:

  1. Add consistency parameter to RangeRequest
  2. Implement leader verification for Linearizable reads
  3. Add committed index check for Serializable reads
  4. Default to Linearizable for safety
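
A gate in front of the read path, following the four steps above, might look like this. It is a sketch under assumptions: the three level names come from the matrix, but RaftView, ReadError, check_read, and the exact meaning given to each level are hypothetical and should be checked against spec section 4.1.

```rust
/// Consistency levels named in the spec; Linearizable is the default for safety.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ReadConsistency {
    Local,
    Serializable,
    #[default]
    Linearizable,
}

/// Hypothetical snapshot of the Raft state visible to the read path.
pub struct RaftView {
    pub is_leader: bool,
    /// True once leadership has been confirmed with a quorum for this read.
    pub leadership_confirmed: bool,
    pub committed_index: u64,
    pub applied_index: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReadError {
    NotLeader,
    LeadershipUnconfirmed,
    NotCaughtUp { applied: u64, committed: u64 },
}

/// Decide whether a read at `level` may be served from local storage right now.
pub fn check_read(level: ReadConsistency, raft: &RaftView) -> Result<(), ReadError> {
    let caught_up = || {
        if raft.applied_index >= raft.committed_index {
            Ok(())
        } else {
            Err(ReadError::NotCaughtUp {
                applied: raft.applied_index,
                committed: raft.committed_index,
            })
        }
    };
    match level {
        // Assumption: Local reads accept whatever the node has applied.
        ReadConsistency::Local => Ok(()),
        // Assumption: Serializable requires everything known-committed to be applied.
        ReadConsistency::Serializable => caught_up(),
        // Linearizable additionally requires a verified leader.
        ReadConsistency::Linearizable => {
            if !raft.is_leader {
                return Err(ReadError::NotLeader);
            }
            if !raft.leadership_confirmed {
                return Err(ReadError::LeadershipUnconfirmed);
            }
            caught_up()
        }
    }
}
```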

3. Range Operations in Transactions

Impact: Cannot atomically read-then-write in transactions; limits CAS use cases

Evidence:

// /home/centra/cloud/chainfire/crates/chainfire-api/src/kv_service.rs:224-229
crate::proto::request_op::Request::RequestRange(_) => {
    // Range operations in transactions are not supported yet
    TxnOp::Delete { key: vec![] }  // Returns dummy operation!
}

Fix Required:

  1. Extend chainfire_types::command::TxnOp to include Range variant
  2. Update state_machine.rs to handle read operations in transactions
  3. Return range results in TxnResponse.responses
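
One possible shape for the fix, sketched against a plain BTreeMap rather than the real state machine. The Range variant, TxnOpResponse, and apply_ops are hypothetical names; the existing TxnOp in chainfire_types may carry different fields.

```rust
use std::collections::BTreeMap;

/// Transaction operations, with a hypothetical Range variant added.
#[derive(Debug, Clone)]
pub enum TxnOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
    /// Read all keys in [start, end) as part of the transaction.
    Range { start: Vec<u8>, end: Vec<u8> },
}

/// One response per executed operation, in order.
#[derive(Debug, PartialEq, Eq)]
pub enum TxnOpResponse {
    Put,
    Delete { deleted: bool },
    Range { kvs: Vec<(Vec<u8>, Vec<u8>)> },
}

/// Execute the selected branch and collect a response for every op.
pub fn apply_ops(store: &mut BTreeMap<Vec<u8>, Vec<u8>>, ops: &[TxnOp]) -> Vec<TxnOpResponse> {
    let mut responses = Vec::with_capacity(ops.len());
    for op in ops {
        let response = match op {
            TxnOp::Put { key, value } => {
                store.insert(key.clone(), value.clone());
                TxnOpResponse::Put
            }
            TxnOp::Delete { key } => TxnOpResponse::Delete {
                deleted: store.remove(key).is_some(),
            },
            TxnOp::Range { start, end } => {
                // BTreeMap::range panics when start > end, so guard first.
                let kvs = if start < end {
                    store
                        .range(start.clone()..end.clone())
                        .map(|(k, v)| (k.clone(), v.clone()))
                        .collect()
                } else {
                    Vec::new()
                };
                TxnOpResponse::Range { kvs }
            }
        };
        responses.push(response);
    }
    responses
}
```

Because the ops run in order against the same store, a Range that follows a Put in the same branch observes that write.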

Important Gaps (P1)

4. Transaction Response Completeness

Evidence:

// /home/centra/cloud/chainfire/crates/chainfire-api/src/kv_service.rs:194
Ok(Response::new(TxnResponse {
    header: Some(self.make_header(response.revision)),
    succeeded: response.succeeded,
    responses: vec![], // TODO: fill in responses
}))

Fix: Collect operation results during txn execution and populate responses vector


5. Point-in-Time Reads (MVCC Historical Queries)

Evidence:

  • RangeRequest has revision field (proto/chainfire.proto:78)
  • KvStore.range() doesn't use revision parameter
  • No revision-indexed storage in RocksDB

Fix: Implement versioned key storage or revision-based snapshots
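
The versioned-key option can be illustrated with a composite (key, revision) index. This is a sketch only: VersionedStore and its methods are hypothetical, it uses an in-memory BTreeMap instead of a RocksDB column family, and it treats revision 0 as "latest", following the etcd convention for RangeRequest.

```rust
use std::collections::BTreeMap;

/// Hypothetical versioned store: every write is kept under (key, revision).
#[derive(Default)]
pub struct VersionedStore {
    revision: u64,
    /// A None value is a tombstone left by a delete.
    entries: BTreeMap<(Vec<u8>, u64), Option<Vec<u8>>>,
}

impl VersionedStore {
    /// Write a new version of `key` and return the revision it was written at.
    pub fn put(&mut self, key: &[u8], value: &[u8]) -> u64 {
        self.revision += 1;
        self.entries.insert((key.to_vec(), self.revision), Some(value.to_vec()));
        self.revision
    }

    /// Record a tombstone instead of removing history.
    pub fn delete(&mut self, key: &[u8]) -> u64 {
        self.revision += 1;
        self.entries.insert((key.to_vec(), self.revision), None);
        self.revision
    }

    /// Read `key` as of `revision`; 0 means the latest revision.
    pub fn get_at(&self, key: &[u8], revision: u64) -> Option<&[u8]> {
        let rev = if revision == 0 { self.revision } else { revision };
        self.entries
            .range((key.to_vec(), 0)..=(key.to_vec(), rev))
            .next_back() // newest version of this key at or before `rev`
            .and_then(|(_, value)| value.as_deref())
    }
}
```

A production version would also need compaction, since this layout never discards old versions.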


6. StorageBackend Trait Abstraction

Evidence:

  • Spec defines trait (lines 166-174) for pluggable backends
  • chainfire-storage is RocksDB-only
  • No trait in chainfire-core/src/

Fix: Extract trait and implement for RocksDB; enables memory backend testing
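
To show what the extraction buys, here is a sketch of a backend trait with an in-memory implementation suitable for tests. The method set below is an assumption and does not reproduce the trait in spec lines 166-174; MemoryBackend and count_prefix are likewise hypothetical.

```rust
use std::collections::BTreeMap;
use std::convert::Infallible;

/// Hypothetical trait shape; align the methods with the spec before adopting.
pub trait StorageBackend {
    type Error: std::fmt::Debug;

    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, Self::Error>;
    fn put(&mut self, key: &[u8], value: &[u8]) -> Result<(), Self::Error>;
    /// Returns true if the key existed.
    fn delete(&mut self, key: &[u8]) -> Result<bool, Self::Error>;
    fn scan_prefix(&self, prefix: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>)>, Self::Error>;
}

/// In-memory backend for unit tests; a RocksDB-backed type would implement the same trait.
#[derive(Default)]
pub struct MemoryBackend {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl StorageBackend for MemoryBackend {
    type Error = Infallible;

    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, Self::Error> {
        Ok(self.data.get(key).cloned())
    }

    fn put(&mut self, key: &[u8], value: &[u8]) -> Result<(), Self::Error> {
        self.data.insert(key.to_vec(), value.to_vec());
        Ok(())
    }

    fn delete(&mut self, key: &[u8]) -> Result<bool, Self::Error> {
        Ok(self.data.remove(key).is_some())
    }

    fn scan_prefix(&self, prefix: &[u8]) -> Result<Vec<(Vec<u8>, Vec<u8>)>, Self::Error> {
        Ok(self
            .data
            .range(prefix.to_vec()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect())
    }
}

/// Code written against the trait works with any backend.
pub fn count_prefix<B: StorageBackend>(backend: &B, prefix: &[u8]) -> Result<usize, B::Error> {
    Ok(backend.scan_prefix(prefix)?.len())
}
```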


7. Observability

Gaps:

  • No Prometheus metrics (spec mentions endpoint at 7.2)
  • No gRPC health check service
  • Limited structured logging

Fix: Add metrics crate, implement health checks, expose /metrics endpoint
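
For a sense of scale, the body served at /metrics is plain text in the Prometheus exposition format. The dependency-free sketch below renders a counter in that format; the Counter type and the metric name chainfire_kv_put_total are hypothetical, and a real implementation would more likely use an existing metrics crate, as the fix suggests.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical monotonically increasing counter.
pub struct Counter {
    name: &'static str,
    help: &'static str,
    value: AtomicU64,
}

impl Counter {
    pub const fn new(name: &'static str, help: &'static str) -> Self {
        Counter { name, help, value: AtomicU64::new(0) }
    }

    /// Increment from any thread; Relaxed is enough for a statistics counter.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Render HELP, TYPE, and the sample line in Prometheus text format.
    pub fn render(&self) -> String {
        format!(
            "# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n",
            name = self.name,
            help = self.help,
            value = self.value.load(Ordering::Relaxed)
        )
    }
}

/// Concatenate all counters into the body served at /metrics.
pub fn render_metrics(counters: &[&Counter]) -> String {
    counters.iter().map(|c| c.render()).collect()
}
```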


Nice-to-Have Gaps (P2)

  • Authentication/Authorization: Spec marks as "Planned" - mTLS and RBAC
  • Namespace Quotas: Resource limits per tenant
  • SWIM Gossip Integration: chainfire-gossip crate exists but usage unclear
  • Watch Historical Replay: start_revision in proto but storage unclear
  • Advanced etcd Compat: Package name differences, field naming variations

Key Findings

Strengths

  1. Solid Core Implementation: KV operations, Raft consensus, and basic transactions work well
  2. Watch System: Fully functional with bidirectional streaming and event dispatch
  3. Client Library: Well-designed with CAS and convenience methods
  4. Architecture: Clean separation of concerns across crates
  5. Testing: State machine has unit tests for core operations

Weaknesses

  1. Incomplete Transactions: Missing range ops and unpopulated responses break advanced use cases
  2. No TTL Support: Critical for production; requires full Lease service implementation
  3. Undefined Read Consistency: Dangerous for distributed systems; needs immediate attention
  4. Limited Observability: The lack of metrics and health checks hinders production deployment

Blockers for Production

  1. Lease service implementation (P0)
  2. Read consistency guarantees (P0)
  3. Transaction completeness (P1)
  4. Basic metrics/health checks (P1)

Recommendations

Phase 1: Production Readiness (2-3 weeks)

  1. Implement Lease service and TTL expiration worker
  2. Add read consistency levels (default to Linearizable)
  3. Complete transaction responses
  4. Add basic Prometheus metrics and health checks

Phase 2: Feature Completeness (1-2 weeks)

  1. Support range operations in transactions
  2. Implement point-in-time reads
  3. Extract StorageBackend trait
  4. Document and test SWIM gossip integration

Phase 3: Hardening (2-3 weeks)

  1. Add authentication (mTLS for peers)
  2. Implement basic authorization
  3. Add namespace quotas
  4. Comprehensive integration tests

Appendix: Implementation Evidence

Transaction Compare Logic

Location: /home/centra/cloud/chainfire/crates/chainfire-storage/src/state_machine.rs:148-228

  • Supports Version, CreateRevision, ModRevision, Value comparisons
  • Handles Equal, NotEqual, Greater, Less operators
  • Atomic execution of success/failure ops
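
The comparison step described above can be pictured as follows. This is a simplified sketch, not the code at state_machine.rs:148-228: the type and field names are hypothetical, and treating a missing key as a zeroed entry is an assumption.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

#[derive(Clone, Copy)]
pub enum CompareResult {
    Equal,
    NotEqual,
    Greater,
    Less,
}

pub enum CompareTarget {
    Version(u64),
    CreateRevision(u64),
    ModRevision(u64),
    Value(Vec<u8>),
}

pub struct Compare {
    pub key: Vec<u8>,
    pub result: CompareResult,
    pub target: CompareTarget,
}

pub struct Entry {
    pub version: u64,
    pub create_revision: u64,
    pub mod_revision: u64,
    pub value: Vec<u8>,
}

/// Evaluate one condition. Assumption: a missing key compares as a zeroed entry.
pub fn evaluate(cmp: &Compare, entry: Option<&Entry>) -> bool {
    let (version, create, modified, value) = match entry {
        Some(e) => (e.version, e.create_revision, e.mod_revision, e.value.as_slice()),
        None => (0, 0, 0, &b""[..]),
    };
    // Order the stored field relative to the expected operand.
    let ord = match &cmp.target {
        CompareTarget::Version(v) => version.cmp(v),
        CompareTarget::CreateRevision(v) => create.cmp(v),
        CompareTarget::ModRevision(v) => modified.cmp(v),
        CompareTarget::Value(v) => value.cmp(v.as_slice()),
    };
    match cmp.result {
        CompareResult::Equal => ord == Ordering::Equal,
        CompareResult::NotEqual => ord != Ordering::Equal,
        CompareResult::Greater => ord == Ordering::Greater,
        CompareResult::Less => ord == Ordering::Less,
    }
}

/// The success branch runs only if every condition holds; otherwise the failure branch runs.
pub fn all_hold(compares: &[Compare], store: &HashMap<Vec<u8>, Entry>) -> bool {
    compares.iter().all(|c| evaluate(c, store.get(&c.key)))
}
```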

Watch Implementation

Location: /home/centra/cloud/chainfire/crates/chainfire-watch/

  • WatchRegistry with event dispatch
  • WatchStream for bidirectional gRPC
  • KeyMatcher for prefix/range watches
  • Integration with state machine (state_machine.rs:82-88)
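
As an illustration of prefix and range matching, a matcher might be structured like this. Only the name KeyMatcher comes from the crate; the variants, the matches method, and prefix_range_end are hypothetical. The helper follows the etcd convention of deriving a range end from a prefix by incrementing its last byte.

```rust
/// Hypothetical matcher shape for exact, prefix, and range watches.
pub enum KeyMatcher {
    Exact(Vec<u8>),
    Prefix(Vec<u8>),
    /// Half-open interval [start, end).
    Range { start: Vec<u8>, end: Vec<u8> },
}

impl KeyMatcher {
    /// Decide whether a changed key should be dispatched to this watch.
    pub fn matches(&self, key: &[u8]) -> bool {
        match self {
            KeyMatcher::Exact(k) => key == k.as_slice(),
            KeyMatcher::Prefix(p) => key.starts_with(p),
            KeyMatcher::Range { start, end } => start.as_slice() <= key && key < end.as_slice(),
        }
    }
}

/// Smallest key greater than every key with `prefix`, or None if there is no
/// upper bound (empty prefix or all bytes 0xff).
pub fn prefix_range_end(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    while let Some(last) = end.pop() {
        if last < 0xff {
            end.push(last + 1);
            return Some(end);
        }
        // Trailing 0xff bytes are dropped and the carry moves left.
    }
    None
}
```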

Client CAS Example

Location: /home/centra/cloud/chainfire/chainfire-client/src/client.rs:228-299

  • Uses transactions for compare-and-swap
  • Returns CasOutcome with current/new versions
  • ⚠️ Fallback read on failure uses range op (demonstrates txn range gap)
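
The control flow of a version-based compare-and-swap, including the read of the current version on failure, can be sketched on a local map. This does not reproduce client.rs:228-299: Store, the CasOutcome fields, and the convention that a missing key has version 0 are assumptions. In the real client the compare and the write travel together in one transaction.

```rust
use std::collections::HashMap;

/// Hypothetical outcome shape; the real CasOutcome fields may differ.
#[derive(Debug, PartialEq, Eq)]
pub struct CasOutcome {
    pub success: bool,
    pub current_version: u64,
    pub new_version: u64,
}

/// Local stand-in for the cluster: key -> (version, value).
#[derive(Default)]
pub struct Store {
    entries: HashMap<Vec<u8>, (u64, Vec<u8>)>,
}

impl Store {
    /// Write `value` only if the key is still at `expected_version`.
    /// Assumption: a key that does not exist has version 0.
    pub fn compare_and_swap(&mut self, key: &[u8], expected_version: u64, value: &[u8]) -> CasOutcome {
        let current = self.entries.get(key).map(|(v, _)| *v).unwrap_or(0);
        if current != expected_version {
            // Failure branch: report the version the caller should retry against.
            return CasOutcome { success: false, current_version: current, new_version: 0 };
        }
        let new_version = current + 1;
        self.entries.insert(key.to_vec(), (new_version, value.to_vec()));
        CasOutcome { success: true, current_version: current, new_version }
    }
}
```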

Report Generated: 2025-12-08
Auditor: Claude Code Agent
Next Review: After Phase 1 implementation