id: T004
name: P0 Critical Fixes - Production Blockers
status: complete
created: 2025-12-08
completed: 2025-12-08
owner: peerB
goal: Resolve all 6 P0 blockers identified in T003 gap analysis
description: |
  Fix critical gaps that block production deployment. Priority order:
  FlareDB persistence (data loss) > Chainfire (etcd compat) > IAM (K8s deploy)
acceptance:
  - All 6 P0 fixes implemented and tested
  - No regressions in existing tests
  - R4 risk (FlareDB data loss) closed
steps:
  - step: S1
    action: FlareDB persistent Raft storage
    priority: P0-CRITICAL
    status: complete
    complexity: large
    estimate: 1-2 weeks
    location: flaredb-raft/src/persistent_storage.rs, raft_node.rs, store.rs
    completed: 2025-12-08
    notes: |
      Implemented persistent Raft storage:
      - New `new_persistent()` constructor uses RocksDB via PersistentFlareStore
      - Snapshot persistence to RocksDB (data + metadata)
      - Startup recovery: loads snapshot, restores state machine
      - Fixed state machine serialization (bincode for tuple map keys)
      - FlareDB server now uses persistent storage by default
      - Added test: test_snapshot_persistence_and_recovery
  - step: S2
    action: Chainfire lease service
    priority: P0
    status: complete
    complexity: medium
    estimate: 3-5 days
    location: chainfire.proto, lease.rs, lease_store.rs, lease_service.rs
    completed: 2025-12-08
    notes: |
      Implemented full Lease service for etcd compatibility:
      - Proto: LeaseGrant, LeaseRevoke, LeaseKeepAlive, LeaseTimeToLive, LeaseLeases RPCs
      - Types: Lease, LeaseData, LeaseId in chainfire-types
      - Storage: LeaseStore with grant/revoke/refresh/attach_key/detach_key/export/import
      - State machine: handles LeaseGrant/Revoke/Refresh commands and key attachment
      - Service: LeaseServiceImpl in chainfire-api with streaming keep-alive
      - Integration: Put/Delete auto-attach/detach keys to/from leases
  - step: S3
    action: Chainfire read consistency
    priority: P0
    status: complete
    complexity: small
    estimate: 1-2 days
    location: kv_service.rs, chainfire.proto
    completed: 2025-12-08
    notes: |
      Implemented linearizable/serializable read modes:
      - Added `serializable` field to RangeRequest in chainfire.proto
      - When serializable=false (the default), calls linearizable_read() before reading
      - linearizable_read() uses OpenRaft's ensure_linearizable() for consistency
      - Updated all client RangeRequest usages with explicit serializable flags
  - step: S4
    action: Chainfire range in transactions
    priority: P0
    status: complete
    complexity: small
    estimate: 1-2 days
    location: kv_service.rs, command.rs, state_machine.rs
    completed: 2025-12-08
    notes: |
      Fixed Range operations in transactions:
      - Added TxnOp::Range variant to chainfire-types/command.rs
      - Updated state_machine.rs to handle Range ops (read-only, no state change)
      - Fixed convert_ops in kv_service.rs to convert RequestRange properly
      - Removed the dummy Delete op workaround
  - step: S5
    action: IAM health endpoints
    priority: P0
    status: complete
    complexity: small
    estimate: 1 day
    completed: 2025-12-08
    notes: |
      Added the gRPC health service (grpc.health.v1.Health) using tonic-health,
      so K8s can use gRPC health probes for liveness/readiness. The IamAuthz,
      IamToken, and IamAdmin services all report SERVING status.
  - step: S6
    action: IAM metrics
    priority: P0
    status: complete
    complexity: small
    estimate: 1-2 days
    completed: 2025-12-08
    notes: |
      Added Prometheus metrics using metrics-exporter-prometheus.
      Serves metrics at http://0.0.0.0:{metrics_port}/metrics (default 9090).
      Pre-defined counters: authz_requests, allowed, denied, token_issued.
      Pre-defined histogram: request_duration_seconds.
parallel_track: |
  After S5+S6 complete (IAM P0s, ~3 days), PlasmaVMC spec design can begin
  while S1 (FlareDB persistence) continues.
notes: |
  Strategic decision: Modified (B) Parallel approach. FlareDB persistence is
  the critical path - start immediately. The small fixes (S3-S6) can be done
  in parallel by multiple developers.
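A minimal sketch of the snapshot save/recover cycle described in S1. This is an illustration, not the real implementation: the actual code persists snapshots to RocksDB via `PersistentFlareStore` and serializes with bincode, whereas this stdlib-only version (function names `save_snapshot`/`recover_snapshot` are hypothetical) uses a tab-separated file to make the startup-recovery flow concrete:

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

// Persist the state machine so a restart does not lose committed data.
// Sketch only: the real store writes data + metadata to RocksDB.
fn save_snapshot(path: &Path, state: &BTreeMap<String, String>) -> io::Result<()> {
    let mut buf = String::new();
    for (k, v) in state {
        buf.push_str(k);
        buf.push('\t');
        buf.push_str(v);
        buf.push('\n');
    }
    fs::write(path, buf)
}

// On startup, load the last snapshot and restore the state machine.
fn recover_snapshot(path: &Path) -> io::Result<BTreeMap<String, String>> {
    let mut state = BTreeMap::new();
    // First boot: no snapshot yet, start from an empty state machine.
    if !path.exists() {
        return Ok(state);
    }
    for line in fs::read_to_string(path)?.lines() {
        if let Some((k, v)) = line.split_once('\t') {
            state.insert(k.to_string(), v.to_string());
        }
    }
    Ok(state)
}
```

The key property, exercised by test_snapshot_persistence_and_recovery in the real code, is that a recover after save reproduces the exact pre-restart state.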
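An illustrative, stdlib-only sketch of the lease bookkeeping described in S2. The type and method names (`LeaseStore`, `grant`, `revoke`, `attach_key`) follow the task notes, but everything else here is assumed for illustration; the real store in lease_store.rs also handles refresh/TTL expiry and is replicated through the Raft state machine:

```rust
use std::collections::{HashMap, HashSet};

type LeaseId = i64;

#[derive(Debug)]
struct Lease {
    id: LeaseId,
    ttl_secs: i64,
    keys: HashSet<Vec<u8>>, // keys currently attached to this lease
}

#[derive(Default)]
struct LeaseStore {
    leases: HashMap<LeaseId, Lease>,
    next_id: LeaseId,
}

impl LeaseStore {
    // LeaseGrant: create a lease with the requested TTL, return its id.
    fn grant(&mut self, ttl_secs: i64) -> LeaseId {
        self.next_id += 1;
        let id = self.next_id;
        self.leases.insert(id, Lease { id, ttl_secs, keys: HashSet::new() });
        id
    }

    // Called when a Put carries a lease id; false if the lease is unknown.
    fn attach_key(&mut self, id: LeaseId, key: &[u8]) -> bool {
        match self.leases.get_mut(&id) {
            Some(lease) => {
                lease.keys.insert(key.to_vec());
                true
            }
            None => false,
        }
    }

    // LeaseRevoke: drop the lease and return the keys that must be deleted.
    fn revoke(&mut self, id: LeaseId) -> Vec<Vec<u8>> {
        self.leases
            .remove(&id)
            .map(|l| l.keys.into_iter().collect())
            .unwrap_or_default()
    }
}
```

Revoking returns the attached keys so the caller can delete them, matching etcd's semantics where revoking a lease removes every key bound to it.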
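A sketch of the S4 fix: a `TxnOp::Range` variant that the state machine answers read-only, instead of the old dummy-Delete workaround. The enum shape and `apply` function below are hypothetical simplifications (the real variant lives in chainfire-types/command.rs), with a `BTreeMap` standing in for the replicated state machine:

```rust
use std::collections::BTreeMap;

enum TxnOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
    // Read-only: returns matching entries, never mutates state.
    Range { start: Vec<u8>, end: Vec<u8> },
}

enum TxnResult {
    Ok,
    Kvs(Vec<(Vec<u8>, Vec<u8>)>),
}

fn apply(op: &TxnOp, state: &mut BTreeMap<Vec<u8>, Vec<u8>>) -> TxnResult {
    match op {
        TxnOp::Put { key, value } => {
            state.insert(key.clone(), value.clone());
            TxnResult::Ok
        }
        TxnOp::Delete { key } => {
            state.remove(key);
            TxnResult::Ok
        }
        // The point of S4: Range is handled in apply() but causes no
        // state change, so transactions can mix reads with writes.
        TxnOp::Range { start, end } => TxnResult::Kvs(
            state
                .range(start.clone()..end.clone())
                .map(|(k, v)| (k.clone(), v.clone()))
                .collect(),
        ),
    }
}
```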
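For S6, the exporter ultimately serves Prometheus text exposition format at the `/metrics` endpoint. The real service uses metrics-exporter-prometheus, which does this rendering internally; the hypothetical helper below just shows the wire format a scraper would see for one of the pre-defined counters:

```rust
// Render one counter in Prometheus text exposition format:
// a HELP line, a TYPE line, then the sample itself.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n")
}
```

For example, `render_counter("authz_requests", "Total authorization requests", 42)` produces the three lines a Prometheus scrape of the authz_requests counter would contain.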