# photoncloud-monorepo/docs/por/T004-p0-fixes/task.yaml
id: T004
name: P0 Critical Fixes - Production Blockers
status: complete
created: 2025-12-08
completed: 2025-12-08
owner: peerB
goal: Resolve all 6 P0 blockers identified in T003 gap analysis
description: |
  Fix critical gaps that block production deployment.
  Priority order: FlareDB persistence (data loss) > Chainfire (etcd compat) > IAM (K8s deploy)
acceptance:
  - All 6 P0 fixes implemented and tested
  - No regressions in existing tests
  - R4 risk (FlareDB data loss) closed
steps:
  - step: S1
    action: FlareDB persistent Raft storage
    priority: P0-CRITICAL
    status: complete
    complexity: large
    estimate: 1-2 weeks
    location: flaredb-raft/src/persistent_storage.rs, raft_node.rs, store.rs
    completed: 2025-12-08
    notes: |
      Implemented persistent Raft storage with:
      - New `new_persistent()` constructor uses RocksDB via PersistentFlareStore
      - Snapshot persistence to RocksDB (data + metadata)
      - Startup recovery: loads snapshot, restores state machine
      - Fixed state machine serialization (bincode, which supports tuple map keys)
      - FlareDB server now uses persistent storage by default
      - Added test: test_snapshot_persistence_and_recovery
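      As a sketch, the snapshot/recovery cycle looks like this (a
      simplified, std-only Rust model, NOT the real FlareDB code: the
      hand-rolled encoding stands in for bincode, and a Vec<u8> stands
      in for the RocksDB value written via PersistentFlareStore):

      ```rust
      use std::collections::HashMap;

      #[derive(Default, PartialEq, Debug)]
      struct StateMachine {
          // Tuple map keys are the reason bincode was chosen: JSON-style
          // formats require string keys, bincode encodes tuples directly.
          data: HashMap<(String, u64), String>,
      }

      // Length-prefixed string encoding (stand-in for bincode).
      fn put_str(buf: &mut Vec<u8>, s: &str) {
          buf.extend((s.len() as u32).to_le_bytes());
          buf.extend(s.as_bytes());
      }

      fn get_str(buf: &[u8], pos: &mut usize) -> String {
          let len = u32::from_le_bytes(buf[*pos..*pos + 4].try_into().unwrap()) as usize;
          *pos += 4;
          let s = String::from_utf8(buf[*pos..*pos + len].to_vec()).unwrap();
          *pos += len;
          s
      }

      impl StateMachine {
          // Snapshot = serialized copy of the whole state machine.
          fn snapshot(&self) -> Vec<u8> {
              let mut buf = Vec::new();
              buf.extend((self.data.len() as u32).to_le_bytes());
              for ((key, rev), val) in &self.data {
                  put_str(&mut buf, key);
                  buf.extend(rev.to_le_bytes());
                  put_str(&mut buf, val);
              }
              buf
          }

          // Startup recovery: rebuild the state machine from the blob alone.
          fn restore(blob: &[u8]) -> Self {
              let n = u32::from_le_bytes(blob[0..4].try_into().unwrap());
              let mut pos = 4;
              let mut data = HashMap::new();
              for _ in 0..n {
                  let key = get_str(blob, &mut pos);
                  let rev = u64::from_le_bytes(blob[pos..pos + 8].try_into().unwrap());
                  pos += 8;
                  let val = get_str(blob, &mut pos);
                  data.insert((key, rev), val);
              }
              StateMachine { data }
          }
      }

      fn main() {
          let mut sm = StateMachine::default();
          sm.data.insert(("users/alice".into(), 1), "active".into());
          sm.data.insert(("users/bob".into(), 2), "pending".into());

          let disk = sm.snapshot();                     // persist (RocksDB write)
          let recovered = StateMachine::restore(&disk); // simulated restart

          assert_eq!(recovered, sm);
          println!("recovered {} entries after restart", recovered.data.len());
      }
      ```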
  - step: S2
    action: Chainfire lease service
    priority: P0
    status: complete
    complexity: medium
    estimate: 3-5 days
    location: chainfire.proto, lease.rs, lease_store.rs, lease_service.rs
    completed: 2025-12-08
    notes: |
      Implemented full Lease service for etcd compatibility:
      - Proto: LeaseGrant, LeaseRevoke, LeaseKeepAlive, LeaseTimeToLive, LeaseLeases RPCs
      - Types: Lease, LeaseData, LeaseId in chainfire-types
      - Storage: LeaseStore with grant/revoke/refresh/attach_key/detach_key/export/import
      - State machine: handles LeaseGrant/Revoke/Refresh commands, key attachment
      - Service: LeaseServiceImpl in chainfire-api with streaming keep-alive
      - Integration: Put/Delete auto-attach/detach keys to/from leases
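      A toy model of the lease lifecycle (a hypothetical simplified API
      in std-only Rust, NOT the real LeaseStore; TTL expiry, refresh,
      and export/import are omitted):

      ```rust
      use std::collections::{HashMap, HashSet};

      struct Lease {
          ttl_seconds: i64,
          keys: HashSet<String>, // keys currently attached to this lease
      }

      #[derive(Default)]
      struct LeaseStore {
          leases: HashMap<i64, Lease>,
          next_id: i64,
      }

      impl LeaseStore {
          fn grant(&mut self, ttl_seconds: i64) -> i64 {
              self.next_id += 1;
              self.leases.insert(self.next_id, Lease { ttl_seconds, keys: HashSet::new() });
              self.next_id
          }

          fn ttl(&self, id: i64) -> Option<i64> {
              self.leases.get(&id).map(|l| l.ttl_seconds)
          }

          // Called by Put when a request carries a lease id.
          fn attach_key(&mut self, id: i64, key: &str) -> bool {
              match self.leases.get_mut(&id) {
                  Some(l) => { l.keys.insert(key.to_string()); true }
                  None => false, // etcd reports "lease not found" here
              }
          }

          // Called by Delete so keys don't linger on the lease.
          fn detach_key(&mut self, id: i64, key: &str) {
              if let Some(l) = self.leases.get_mut(&id) {
                  l.keys.remove(key);
              }
          }

          // Revoking a lease deletes every attached key; the caller
          // (the state machine) performs the actual KV deletes.
          fn revoke(&mut self, id: i64) -> Vec<String> {
              self.leases
                  .remove(&id)
                  .map(|l| l.keys.into_iter().collect())
                  .unwrap_or_default()
          }
      }

      fn main() {
          let mut store = LeaseStore::default();
          let id = store.grant(30);
          assert_eq!(store.ttl(id), Some(30));

          assert!(store.attach_key(id, "/locks/a"));
          assert!(store.attach_key(id, "/locks/b"));
          store.detach_key(id, "/locks/b");

          let mut to_delete = store.revoke(id);
          to_delete.sort();
          assert_eq!(to_delete, vec!["/locks/a".to_string()]);
          println!("revoke deleted {} attached key(s)", to_delete.len());
      }
      ```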
  - step: S3
    action: Chainfire read consistency
    priority: P0
    status: complete
    complexity: small
    estimate: 1-2 days
    location: kv_service.rs, chainfire.proto
    completed: 2025-12-08
    notes: |
      Implemented linearizable/serializable read modes:
      - Added `serializable` field to RangeRequest in chainfire.proto
      - When serializable=false (the default), the server calls linearizable_read() before reading
      - linearizable_read() uses OpenRaft's ensure_linearizable() for consistency
      - Updated all client RangeRequest usages with explicit serializable flags
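      The difference between the two read modes can be shown with a toy
      single-node model (std-only Rust, NOT the real Chainfire code;
      OpenRaft's ensure_linearizable() is approximated here by "apply
      everything committed before serving the read"):

      ```rust
      use std::collections::HashMap;

      struct Node {
          log: Vec<(String, String)>, // committed but possibly unapplied entries
          applied: usize,
          kv: HashMap<String, String>,
      }

      impl Node {
          fn apply_next(&mut self) {
              let (k, v) = self.log[self.applied].clone();
              self.kv.insert(k, v);
              self.applied += 1;
          }

          fn range(&mut self, key: &str, serializable: bool) -> Option<String> {
              if !serializable {
                  // Linearizable path: catch up to everything committed
                  // when the read arrived before answering.
                  let read_index = self.log.len();
                  while self.applied < read_index {
                      self.apply_next();
                  }
              }
              // Serializable path reads local state as-is (may lag).
              self.kv.get(key).cloned()
          }
      }

      fn main() {
          let mut n = Node { log: vec![], applied: 0, kv: HashMap::new() };
          n.log.push(("k".into(), "v1".into())); // committed, not yet applied

          assert_eq!(n.range("k", true), None);               // stale serializable read
          assert_eq!(n.range("k", false), Some("v1".into())); // linearizable read
          println!("linearizable read saw the committed write");
      }
      ```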
  - step: S4
    action: Chainfire range in transactions
    priority: P0
    status: complete
    complexity: small
    estimate: 1-2 days
    location: kv_service.rs, command.rs, state_machine.rs
    completed: 2025-12-08
    notes: |
      Fixed Range operations in transactions:
      - Added TxnOp::Range variant to chainfire-types/command.rs
      - Updated state_machine.rs to handle Range ops (read-only, no state change)
      - Fixed convert_ops in kv_service.rs to convert RequestRange properly
      - Removed the dummy Delete op workaround
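      A minimal sketch of the new read-only op (hypothetical simplified
      types in std-only Rust, NOT the real chainfire-types definitions):

      ```rust
      use std::collections::BTreeMap;

      enum TxnOp {
          Put { key: String, value: String },
          Delete { key: String },
          Range { start: String, end: String }, // read-only: no state change
      }

      #[derive(Debug, PartialEq)]
      enum TxnResult {
          Put,
          Deleted,
          Kvs(Vec<(String, String)>),
      }

      fn apply(state: &mut BTreeMap<String, String>, op: &TxnOp) -> TxnResult {
          match op {
              TxnOp::Put { key, value } => {
                  state.insert(key.clone(), value.clone());
                  TxnResult::Put
              }
              TxnOp::Delete { key } => {
                  state.remove(key);
                  TxnResult::Deleted
              }
              // Answered from current state; the state machine is not mutated,
              // so no dummy Delete op is needed to carry the read.
              TxnOp::Range { start, end } => TxnResult::Kvs(
                  state
                      .range(start.clone()..end.clone())
                      .map(|(k, v)| (k.clone(), v.clone()))
                      .collect(),
              ),
          }
      }

      fn main() {
          let mut state = BTreeMap::new();
          for op in [
              TxnOp::Put { key: "a".into(), value: "1".into() },
              TxnOp::Put { key: "b".into(), value: "2".into() },
              TxnOp::Put { key: "c".into(), value: "3".into() },
          ] {
              apply(&mut state, &op);
          }
          // End key is exclusive, as in etcd ranges: [a, c) matches a and b.
          let got = apply(&mut state, &TxnOp::Range { start: "a".into(), end: "c".into() });
          assert_eq!(
              got,
              TxnResult::Kvs(vec![("a".into(), "1".into()), ("b".into(), "2".into())])
          );
          if let TxnResult::Kvs(kvs) = got {
              println!("range returned {} kvs", kvs.len());
          }
      }
      ```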
  - step: S5
    action: IAM health endpoints
    priority: P0
    status: complete
    complexity: small
    estimate: 1 day
    completed: 2025-12-08
    notes: |
      Added gRPC health service (grpc.health.v1.Health) using tonic-health,
      so K8s can use gRPC health probes for liveness/readiness.
      Services: IamAuthz, IamToken, IamAdmin all report SERVING status.
  - step: S6
    action: IAM metrics
    priority: P0
    status: complete
    complexity: small
    estimate: 1-2 days
    completed: 2025-12-08
    notes: |
      Added Prometheus metrics using metrics-exporter-prometheus.
      Serves metrics at http://0.0.0.0:{metrics_port}/metrics (default port 9090).
      Pre-defined counters: authz_requests, allowed, denied, token_issued.
      Pre-defined histogram: request_duration_seconds.
parallel_track: |
  After S5+S6 complete (IAM P0s, ~3 days), PlasmaVMC spec design can begin
  while S1 (FlareDB persistence) continues.
notes: |
  Strategic decision: adopt option B, the modified parallel approach.
  FlareDB persistence (S1) is the critical path - start it immediately.
  The small fixes (S3-S6) can be done in parallel by multiple developers.