# Chainfire T003 Feature Gap Analysis
**Audit Date:** 2025-12-08
**Spec Version:** 1.0
**Implementation Path:** `/home/centra/cloud/chainfire/crates/`
---
## Executive Summary
**Total Features Analyzed:** 32
**Implemented:** 16 (50.0%)
**Partially Implemented:** 7 (21.9%)
**Missing:** 9 (28.1%)
The core KV operations, Raft consensus, Watch functionality, and basic cluster management are implemented and functional. Critical gaps exist in TTL/Lease management, read consistency controls, and transaction completeness. Production readiness is blocked by the missing Lease service and undefined read consistency (both P0), followed by incomplete transactions and missing metrics and health checks (P1). Authentication is also absent but is tracked as P2.

---
## Feature Gap Matrix
| Feature | Spec Section | Status | Priority | Complexity | Notes |
|---------|--------------|--------|----------|------------|-------|
| **Lease Service (TTL)** | 8.3, 4.1 | ❌ Missing | P0 | Medium (3-5d) | Protocol has lease field but no Lease gRPC service; critical for production |
| **TTL Expiration Logic** | 4.1, spec line 22-23 | ❌ Missing | P0 | Medium (3-5d) | lease_id stored but no background expiration worker |
| **Read Consistency Levels** | 4.1 | ❌ Missing | P0 | Small (1-2d) | Local/Serializable/Linearizable not implemented; all reads currently have undefined consistency |
| **Range Ops in Transactions** | 4.2 | ⚠️ Partial | P1 | Small (1-2d) | RequestOp has RangeRequest but the handler returns a dummy Delete op (kv_service.rs:224-229) |
| **Transaction Responses** | 3.1 | ⚠️ Partial | P1 | Small (1-2d) | TxnResponse.responses is an empty vec; TODO comment in code (kv_service.rs:194) |
| **Point-in-Time Reads** | 3.1, 7.3 | ⚠️ Partial | P1 | Medium (3-5d) | RangeRequest has revision field but KvStore doesn't use it |
| **StorageBackend Trait** | 3.3 | ❌ Missing | P1 | Medium (3-5d) | Spec defines trait (lines 166-174) but not in chainfire-core |
| **Prometheus Metrics** | 7.2 | ❌ Missing | P1 | Small (1-2d) | Spec mentions endpoint but no implementation |
| **Health Check Service** | 7.2 | ❌ Missing | P1 | Small (1d) | gRPC health check not visible |
| **Authentication** | 6.1 | ❌ Missing | P2 | Large (1w+) | Spec says "Planned"; mTLS for peers, tokens for clients |
| **Authorization/RBAC** | 6.2 | ❌ Missing | P2 | Large (1w+) | Requires IAM integration |
| **Namespace Quotas** | 6.3 | ❌ Missing | P2 | Medium (3-5d) | Per-namespace resource limits |
| **KV Service - Range** | 3.1 | ✅ Implemented | - | - | Single key, range scan, prefix scan all working |
| **KV Service - Put** | 3.1 | ✅ Implemented | - | - | Including prev_kv support |
| **KV Service - Delete** | 3.1 | ✅ Implemented | - | - | Single and range delete working |
| **KV Service - Txn (Basic)** | 3.1 | ✅ Implemented | - | - | Compare conditions and basic ops working |
| **Watch Service** | 3.1 | ✅ Implemented | - | - | Bidirectional streaming, create/cancel/progress |
| **Cluster Service - All** | 3.1 | ✅ Implemented | - | - | MemberAdd/Remove/List/Status all present |
| **Client Library - Core** | 3.2 | ✅ Implemented | - | - | Connect, put, get, delete, CAS implemented |
| **Client - Prefix Scan** | 3.2 | ✅ Implemented | - | - | get_prefix method exists |
| **ClusterEventHandler** | 3.3 | ✅ Implemented | - | - | All 8 callbacks defined in callbacks.rs |
| **KvEventHandler** | 3.3 | ✅ Implemented | - | - | on_key_changed, on_key_deleted, on_prefix_changed |
| **ClusterBuilder** | 3.4 | ✅ Implemented | - | - | Embeddable library with builder pattern |
| **MVCC Support** | 4.3 | ✅ Implemented | - | - | Global revision counter, create/mod revisions tracked |
| **RocksDB Storage** | 4.3 | ✅ Implemented | - | - | Column families: raft_logs, raft_meta, key_value, snapshot |
| **Raft Integration** | 2.0 | ✅ Implemented | - | - | OpenRaft 0.9 integrated, Vote/AppendEntries/Snapshot RPCs |
| **SWIM Gossip** | 2.1 | ⚠️ Present | P2 | - | chainfire-gossip crate exists but integration unclear |
| **Server Binary** | 7.1 | ✅ Implemented | - | - | CLI with config file, env vars, bootstrap support |
| **Config Management** | 5.0 | ✅ Implemented | - | - | TOML config, env vars, CLI overrides |
| **Watch - Historical Replay** | 3.1 | ⚠️ Partial | P2 | Medium (3-5d) | start_revision exists in proto but historical storage unclear |
| **Snapshot & Backup** | 7.3 | ⚠️ Partial | P2 | Small (1-2d) | Raft snapshot exists but manual backup procedure not documented |
| **etcd Compatibility** | 8.3 | ⚠️ Partial | P2 | - | API similar but package names differ; missing Lease service breaks compatibility |
---
## Critical Gaps (P0)
### 1. Lease Service & TTL Expiration
**Impact:** Blocks production use cases requiring automatic key expiration (sessions, locks, ephemeral data)
**Evidence:**
- `/home/centra/cloud/chainfire/proto/chainfire.proto` has no `Lease` service definition
- `KvEntry` has `lease_id: Option<i64>` field (types/kv.rs:23) but no expiration logic
- No background worker to delete expired keys
- etcd compatibility broken without Lease service
**Fix Required:**
1. Add Lease service to proto: `LeaseGrant`, `LeaseRevoke`, `LeaseKeepAlive`, `LeaseTimeToLive`
2. Implement lease storage and expiration worker in chainfire-storage
3. Wire lease_id checks to KV operations
4. Add lease_id index for efficient expiration queries
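
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the Chainfire design: `LeaseTable`, its method names, and the millisecond clock passed in by the caller are all hypothetical. It only shows how a per-lease key index lets one pass of the expiration worker find the keys to delete. In a replicated deployment those deletions would presumably have to be proposed through Raft so that every node expires the same keys.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical lease record; times are milliseconds on a caller-supplied clock.
#[derive(Debug)]
struct Lease {
    ttl_ms: u64,
    expires_at_ms: u64,
    keys: BTreeSet<Vec<u8>>, // lease_id -> keys index (step 4)
}

#[derive(Default)]
struct LeaseTable {
    next_id: i64,
    leases: HashMap<i64, Lease>,
}

impl LeaseTable {
    /// LeaseGrant: create a lease that expires `ttl_ms` after `now_ms`.
    fn grant(&mut self, ttl_ms: u64, now_ms: u64) -> i64 {
        self.next_id += 1;
        let lease = Lease { ttl_ms, expires_at_ms: now_ms + ttl_ms, keys: BTreeSet::new() };
        self.leases.insert(self.next_id, lease);
        self.next_id
    }

    /// LeaseKeepAlive: push the deadline out by the original TTL.
    fn keep_alive(&mut self, id: i64, now_ms: u64) -> bool {
        match self.leases.get_mut(&id) {
            Some(l) => { l.expires_at_ms = now_ms + l.ttl_ms; true }
            None => false,
        }
    }

    /// Called when a Put carries a lease_id; unknown leases are rejected (step 3).
    fn attach(&mut self, id: i64, key: &[u8]) -> bool {
        match self.leases.get_mut(&id) {
            Some(l) => { l.keys.insert(key.to_vec()); true }
            None => false,
        }
    }

    /// LeaseRevoke: drop the lease and return the keys the caller must delete.
    fn revoke(&mut self, id: i64) -> Vec<Vec<u8>> {
        self.leases.remove(&id).map(|l| l.keys.into_iter().collect()).unwrap_or_default()
    }

    /// One pass of the expiration worker: revoke every lease whose deadline has passed.
    fn expire(&mut self, now_ms: u64) -> Vec<Vec<u8>> {
        let mut dead: Vec<i64> = self.leases.iter()
            .filter(|(_, l)| l.expires_at_ms <= now_ms)
            .map(|(id, _)| *id)
            .collect();
        dead.sort();
        dead.into_iter().flat_map(|id| self.revoke(id)).collect()
    }
}
```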
---
### 2. Read Consistency Levels
**Impact:** Cannot guarantee linearizable reads; stale reads possible on followers
**Evidence:**
- Spec defines `ReadConsistency` enum (spec lines 208-215)
- No implementation in chainfire-storage or chainfire-api
- RangeRequest in kv_service.rs always reads from local storage without consistency checks
**Fix Required:**
1. Add consistency parameter to RangeRequest
2. Implement leader verification for Linearizable reads
3. Add committed index check for Serializable reads
4. Default to Linearizable for safety
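
One way to read the steps above is as a small decision function evaluated before each Range request is served. The sketch below is an assumption about how that could look, not the spec's definition: only the three level names come from the spec, while `NodeView`, `ReadDecision`, and `decide_read` are hypothetical. The Linearizable path follows the common ReadIndex-style pattern of confirming leadership and then waiting for the applied index to catch up.

```rust
/// Consistency levels named in the spec; Linearizable is the default (step 4).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum ReadConsistency {
    Local,
    Serializable,
    #[default]
    Linearizable,
}

/// Hypothetical snapshot of the Raft state a node sees when a read arrives.
struct NodeView {
    is_leader: bool,
    leadership_confirmed: bool, // e.g. a heartbeat round acknowledged by a quorum
    committed_index: u64,
    applied_index: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum ReadDecision {
    Serve,
    WaitForApply { until_index: u64 },
    ConfirmLeadership,
    RedirectToLeader,
}

/// Serve only once everything known to be committed has been applied locally.
fn serve_when_applied(node: &NodeView) -> ReadDecision {
    if node.applied_index >= node.committed_index {
        ReadDecision::Serve
    } else {
        ReadDecision::WaitForApply { until_index: node.committed_index }
    }
}

fn decide_read(level: ReadConsistency, node: &NodeView) -> ReadDecision {
    match level {
        // Local: answer from whatever this node has, possibly stale.
        ReadConsistency::Local => ReadDecision::Serve,
        // Serializable: committed index check (step 3).
        ReadConsistency::Serializable => serve_when_applied(node),
        // Linearizable: leader verification first (step 2).
        ReadConsistency::Linearizable if !node.is_leader => ReadDecision::RedirectToLeader,
        ReadConsistency::Linearizable if !node.leadership_confirmed => ReadDecision::ConfirmLeadership,
        ReadConsistency::Linearizable => serve_when_applied(node),
    }
}
```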
---
### 3. Range Operations in Transactions
**Impact:** Cannot atomically read-then-write in transactions; limits CAS use cases. Rated P1 in the matrix above.
**Evidence:**
```rust
// /home/centra/cloud/chainfire/crates/chainfire-api/src/kv_service.rs:224-229
crate::proto::request_op::Request::RequestRange(_) => {
    // Range operations in transactions are not supported yet
    TxnOp::Delete { key: vec![] } // Returns dummy operation!
}
```
**Fix Required:**
1. Extend `chainfire_types::command::TxnOp` to include `Range` variant
2. Update state_machine.rs to handle read operations in transactions
3. Return range results in TxnResponse.responses
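
A rough sketch of the fix, assuming a simplified `TxnOp` and an in-memory `BTreeMap` in place of the real state machine. The variant fields and the `TxnOpResponse` type are hypothetical and do not mirror `chainfire_types::command::TxnOp`. The point is that each operation, including a read, yields one response entry, which is also what `TxnResponse.responses` needs.

```rust
use std::collections::BTreeMap;

/// Hypothetical shape of `TxnOp` once a `Range` variant is added.
enum TxnOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
    Range { start: Vec<u8>, end: Vec<u8> }, // half-open [start, end)
}

/// One response per operation, in the order the operations were submitted.
#[derive(Debug, PartialEq)]
enum TxnOpResponse {
    Put,
    Delete { deleted: bool },
    Range { kvs: Vec<(Vec<u8>, Vec<u8>)> },
}

fn apply_ops(store: &mut BTreeMap<Vec<u8>, Vec<u8>>, ops: Vec<TxnOp>) -> Vec<TxnOpResponse> {
    ops.into_iter()
        .map(|op| match op {
            TxnOp::Put { key, value } => {
                store.insert(key, value);
                TxnOpResponse::Put
            }
            TxnOp::Delete { key } => TxnOpResponse::Delete { deleted: store.remove(&key).is_some() },
            // An empty or inverted interval matches nothing (BTreeMap::range would panic on it).
            TxnOp::Range { start, end } if start >= end => TxnOpResponse::Range { kvs: vec![] },
            // Reads see the writes made earlier in the same transaction.
            TxnOp::Range { start, end } => TxnOpResponse::Range {
                kvs: store.range(start..end).map(|(k, v)| (k.clone(), v.clone())).collect(),
            },
        })
        .collect()
}
```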
---
## Important Gaps (P1)
### 4. Transaction Response Completeness
**Evidence:**
```rust
// /home/centra/cloud/chainfire/crates/chainfire-api/src/kv_service.rs:194
Ok(Response::new(TxnResponse {
    header: Some(self.make_header(response.revision)),
    succeeded: response.succeeded,
    responses: vec![], // TODO: fill in responses
}))
```
**Fix:** Collect each operation's result during txn execution and populate the `responses` vector.

---
### 5. Point-in-Time Reads (MVCC Historical Queries)
**Evidence:**
- RangeRequest has `revision` field (proto/chainfire.proto:78)
- KvStore.range() doesn't use revision parameter
- No revision-indexed storage in RocksDB
**Fix:** Implement versioned key storage or revision-based snapshots
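
The versioned-key option might look like the sketch below, which keeps every write under a `(key, revision)` pair so that a read at revision `r` returns the newest entry at or before `r`. `VersionedStore` is hypothetical and uses an in-memory `BTreeMap` rather than RocksDB. Treating revision 0 as "latest" is an assumption borrowed from etcd. Compaction of old revisions is not covered.

```rust
use std::collections::BTreeMap;

/// Hypothetical versioned store: every write is kept under (key, revision).
/// `None` marks a tombstone written by a delete.
#[derive(Default)]
struct VersionedStore {
    revision: u64, // global revision counter
    entries: BTreeMap<(Vec<u8>, u64), Option<Vec<u8>>>,
}

impl VersionedStore {
    fn put(&mut self, key: &[u8], value: &[u8]) -> u64 {
        self.revision += 1;
        self.entries.insert((key.to_vec(), self.revision), Some(value.to_vec()));
        self.revision
    }

    fn delete(&mut self, key: &[u8]) -> u64 {
        self.revision += 1;
        self.entries.insert((key.to_vec(), self.revision), None);
        self.revision
    }

    /// Read `key` as of `revision`; 0 is treated as "latest" (assumption).
    fn get_at(&self, key: &[u8], revision: u64) -> Option<&[u8]> {
        let rev = if revision == 0 { self.revision } else { revision };
        self.entries
            .range((key.to_vec(), 0)..=(key.to_vec(), rev))
            .next_back() // newest entry for this key at or before `rev`
            .and_then(|(_, v)| v.as_deref()) // a tombstone reads as absent
    }
}
```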
---
### 6. StorageBackend Trait Abstraction
**Evidence:**
- Spec defines trait (lines 166-174) for pluggable backends
- chainfire-storage is RocksDB-only
- No trait in chainfire-core/src/
**Fix:** Extract the trait and implement it for RocksDB; this also enables an in-memory backend for testing.
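
To illustrate what the extraction buys, here is a sketch with a hypothetical trait and an in-memory implementation. The method set is a guess, since the spec's actual signature (lines 166-174) is not reproduced in this report. A RocksDB implementation would sit behind the same trait.

```rust
use std::collections::BTreeMap;

/// Hypothetical trait; the spec's actual definition may differ.
trait StorageBackend {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn put(&mut self, key: &[u8], value: &[u8]);
    fn delete(&mut self, key: &[u8]) -> bool;
    fn scan_prefix(&self, prefix: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)>;
}

/// In-memory backend intended for unit tests.
#[derive(Default)]
struct MemoryBackend {
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl StorageBackend for MemoryBackend {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.data.get(key).cloned()
    }

    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.data.insert(key.to_vec(), value.to_vec());
    }

    fn delete(&mut self, key: &[u8]) -> bool {
        self.data.remove(key).is_some()
    }

    fn scan_prefix(&self, prefix: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
        // Keys are ordered, so start at the prefix and stop at the first non-match.
        self.data
            .range(prefix.to_vec()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect()
    }
}

/// Code written against the trait runs unchanged on any backend.
fn count_prefix(backend: &dyn StorageBackend, prefix: &[u8]) -> usize {
    backend.scan_prefix(prefix).len()
}
```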
---
### 7. Observability
**Gaps:**
- No Prometheus metrics (spec mentions endpoint at 7.2)
- No gRPC health check service
- Limited structured logging
**Fix:** Add metrics crate, implement health checks, expose /metrics endpoint
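
As a rough idea of the metrics half of this fix, the sketch below keeps two counters and renders them in the Prometheus text exposition format. It uses only the standard library; a real implementation would more likely use an existing metrics crate. The metric names and the `Metrics` type are illustrative and not taken from the spec. Serving the text over HTTP at `/metrics` and the gRPC health service are left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters; names are illustrative only.
#[derive(Default)]
struct Metrics {
    kv_puts_total: AtomicU64,
    kv_ranges_total: AtomicU64,
}

impl Metrics {
    fn record_put(&self) {
        self.kv_puts_total.fetch_add(1, Ordering::Relaxed);
    }

    fn record_range(&self) {
        self.kv_ranges_total.fetch_add(1, Ordering::Relaxed);
    }

    /// Render all counters as HELP, TYPE, and sample lines.
    fn render(&self) -> String {
        let mut out = String::new();
        for (name, help, value) in [
            ("chainfire_kv_puts_total", "Put requests served.", self.kv_puts_total.load(Ordering::Relaxed)),
            ("chainfire_kv_ranges_total", "Range requests served.", self.kv_ranges_total.load(Ordering::Relaxed)),
        ] {
            out.push_str(&format!("# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n"));
        }
        out
    }
}
```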
---
## Nice-to-Have Gaps (P2)
- **Authentication/Authorization:** Spec marks as "Planned" - mTLS and RBAC
- **Namespace Quotas:** Resource limits per tenant
- **SWIM Gossip Integration:** chainfire-gossip crate exists but usage unclear
- **Watch Historical Replay:** start_revision in proto but storage unclear
- **Advanced etcd Compat:** Package name differences, field naming variations
---
## Key Findings
### Strengths
1. **Solid Core Implementation:** KV operations, Raft consensus, and basic transactions work well
2. **Watch System:** Fully functional with bidirectional streaming and event dispatch
3. **Client Library:** Well-designed with CAS and convenience methods
4. **Architecture:** Clean separation of concerns across crates
5. **Testing:** State machine has unit tests for core operations
### Weaknesses
1. **Incomplete Transactions:** Missing range ops and unpopulated responses break advanced use cases
2. **No TTL Support:** Critical for production; requires full Lease service implementation
3. **Undefined Read Consistency:** Dangerous for distributed systems; needs immediate attention
4. **Limited Observability:** The absence of metrics and health checks hinders production deployment
### Blockers for Production
1. Lease service implementation (P0)
2. Read consistency guarantees (P0)
3. Transaction completeness (P1)
4. Basic metrics/health checks (P1)
---
## Recommendations
### Phase 1: Production Readiness (2-3 weeks)
1. Implement Lease service and TTL expiration worker
2. Add read consistency levels (default to Linearizable)
3. Complete transaction responses
4. Add basic Prometheus metrics and health checks
### Phase 2: Feature Completeness (1-2 weeks)
1. Support range operations in transactions
2. Implement point-in-time reads
3. Extract StorageBackend trait
4. Document and test SWIM gossip integration
### Phase 3: Hardening (2-3 weeks)
1. Add authentication (mTLS for peers)
2. Implement basic authorization
3. Add namespace quotas
4. Comprehensive integration tests
---
## Appendix: Implementation Evidence
### Transaction Compare Logic
**Location:** `/home/centra/cloud/chainfire/crates/chainfire-storage/src/state_machine.rs:148-228`
- ✅ Supports Version, CreateRevision, ModRevision, Value comparisons
- ✅ Handles Equal, NotEqual, Greater, Less operators
- ✅ Atomic execution of success/failure ops
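
For readers unfamiliar with this style of compare, the four targets and four operators listed above can be pictured as follows. This is a standalone sketch, not the code in state_machine.rs: the type names are hypothetical. Treating a missing key as version and revision 0 with an empty value is an assumption borrowed from etcd.

```rust
use std::cmp::Ordering;

/// Per-key metadata; field names follow the report's wording.
struct KvMeta {
    version: u64,
    create_revision: u64,
    mod_revision: u64,
    value: Vec<u8>,
}

/// What to compare, together with the value to compare against.
enum CompareTarget {
    Version(u64),
    CreateRevision(u64),
    ModRevision(u64),
    Value(Vec<u8>),
}

#[derive(Clone, Copy)]
enum CompareOp {
    Equal,
    NotEqual,
    Greater,
    Less,
}

/// Evaluate `entry <op> target`; a missing key compares as zero / empty (assumption).
fn evaluate(entry: Option<&KvMeta>, op: CompareOp, target: &CompareTarget) -> bool {
    let empty: &[u8] = &[];
    let ord = match target {
        CompareTarget::Version(v) => entry.map_or(0, |e| e.version).cmp(v),
        CompareTarget::CreateRevision(r) => entry.map_or(0, |e| e.create_revision).cmp(r),
        CompareTarget::ModRevision(r) => entry.map_or(0, |e| e.mod_revision).cmp(r),
        CompareTarget::Value(v) => entry.map_or(empty, |e| e.value.as_slice()).cmp(v.as_slice()),
    };
    match op {
        CompareOp::Equal => ord == Ordering::Equal,
        CompareOp::NotEqual => ord != Ordering::Equal,
        CompareOp::Greater => ord == Ordering::Greater,
        CompareOp::Less => ord == Ordering::Less,
    }
}
```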
### Watch Implementation
**Location:** `/home/centra/cloud/chainfire/crates/chainfire-watch/`
- ✅ WatchRegistry with event dispatch
- ✅ WatchStream for bidirectional gRPC
- ✅ KeyMatcher for prefix/range watches
- ✅ Integration with state machine (state_machine.rs:82-88)
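
The prefix/range matching mentioned above can be illustrated with a small enum. The real `KeyMatcher` in chainfire-watch may be shaped differently; the variants and the half-open range convention here are assumptions.

```rust
/// Hypothetical matcher deciding whether a changed key belongs to a watch.
enum KeyMatcher {
    Exact(Vec<u8>),
    Prefix(Vec<u8>),
    Range { start: Vec<u8>, end: Vec<u8> }, // half-open [start, end)
}

impl KeyMatcher {
    fn matches(&self, key: &[u8]) -> bool {
        match self {
            KeyMatcher::Exact(k) => key == k.as_slice(),
            KeyMatcher::Prefix(p) => key.starts_with(p),
            // Byte-wise lexicographic comparison, the same ordering a sorted KV store uses.
            KeyMatcher::Range { start, end } => start.as_slice() <= key && key < end.as_slice(),
        }
    }
}
```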
### Client CAS Example
**Location:** `/home/centra/cloud/chainfire/chainfire-client/src/client.rs:228-299`
- ✅ Uses transactions for compare-and-swap
- ✅ Returns CasOutcome with current/new versions
- ⚠️ Fallback read on failure uses range op (demonstrates txn range gap)
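
The semantics of version-based compare-and-swap can be shown against a local map. This is a sketch only: `Store` is hypothetical, and the fields of `CasOutcome` are guessed from the description above rather than copied from chainfire-client. The failure branch returns the current version directly. Per the note above, the real client currently obtains it with a separate fallback read.

```rust
use std::collections::HashMap;

/// Hypothetical outcome type; the real `CasOutcome` may carry different fields.
#[derive(Debug, PartialEq)]
struct CasOutcome {
    success: bool,
    current_version: u64, // version observed before the write
    new_version: u64,     // version after a successful write, 0 on failure
}

/// Stand-in for the server side: key -> (version, value).
#[derive(Default)]
struct Store {
    data: HashMap<Vec<u8>, (u64, Vec<u8>)>,
}

impl Store {
    /// Write `value` only if the key's version equals `expected_version`
    /// (0 means the key must not exist yet).
    fn compare_and_swap(&mut self, key: &[u8], expected_version: u64, value: &[u8]) -> CasOutcome {
        let current = self.data.get(key).map_or(0, |(v, _)| *v);
        if current != expected_version {
            return CasOutcome { success: false, current_version: current, new_version: 0 };
        }
        let new_version = current + 1;
        self.data.insert(key.to_vec(), (new_version, value.to_vec()));
        CasOutcome { success: true, current_version: current, new_version }
    }
}
```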
---
**Report Generated:** 2025-12-08
**Auditor:** Claude Code Agent
**Next Review:** After Phase 1 implementation