id: T027
name: Production Hardening
goal: Transform MVP stack into a production-grade, observable, and highly available platform.
status: active
priority: P1
owner: peerB
created: 2025-12-10
depends_on: [T026]
blocks: []

context: |
  With MVP functionality verified (T026), the platform must be hardened for
  production usage. This involves ensuring high availability (HA),
  comprehensive observability (metrics/logs), and security (TLS).

  This task focuses on Non-Functional Requirements (NFRs). Functional gaps
  (deferred P1s) will be handled in T028.

acceptance:
  - All components use a unified configuration approach (clap + config file or env)
  - Full observability stack (Prometheus/Grafana/Loki) operational via NixOS
  - All services exporting metrics and logs to the stack
  - Chainfire and FlareDB verified in 3-node HA cluster
  - TLS enabled for all inter-service communication (optional for internal, required for external)
  - Chaos testing (kill node, verify recovery) passed
  - Ops documentation (Backup/Restore, Upgrade) created

steps:
  - step: S0
    name: Config Unification
    done: All components use unified configuration (clap + config file/env)
    status: complete
    owner: peerB
    priority: P0

  - step: S1
    name: Observability Stack
    done: Prometheus, Grafana, and Loki deployed and scraping targets
    status: pending
    owner: peerB
    priority: P0

  - step: S2
    name: Service Telemetry Integration
    done: All components (Chainfire, FlareDB, IAM, k8shost) dashboards functional
    status: pending
    owner: peerB
    priority: P0

  - step: S3
    name: HA Clustering Verification
    done: 3-node Chainfire/FlareDB cluster survives single node failure
    status: pending
    owner: peerB
    priority: P0

  - step: S4
    name: Security Hardening
    done: mTLS/TLS enabled where appropriate, secrets management verified
    status: pending
    owner: peerB
    priority: P1

  - step: S5
    name: Ops Documentation
    done: Runbooks for common operations (Scale out, Restore, Upgrade)
    status: pending
    owner: peerB
    priority: P1

evidence: []
notes: |
  Separated from functional feature work (T028).