photoncloud-monorepo/docs/por/T027-production-hardening/task.yaml

id: T027
name: Production Hardening
goal: Transform MVP stack into a production-grade, observable, and highly available platform.
status: active
priority: P1
owner: peerB
created: 2025-12-10
depends_on: [T026]
blocks: []

context: |
  With MVP functionality verified (T026), the platform must be hardened for
  production usage. This involves ensuring high availability (HA), comprehensive
  observability (metrics/logs), and security (TLS).

  This task focuses on Non-Functional Requirements (NFRs). Functional gaps
  (deferred P1s) will be handled in T028.

acceptance:
  - All components use a unified configuration approach (clap + config file or env)
  - Full observability stack (Prometheus/Grafana/Loki) operational via NixOS
  - All services exporting metrics and logs to the stack
  - Chainfire and FlareDB verified in 3-node HA cluster
  - TLS enabled for all inter-service communication (optional for internal, required for external)
  - Chaos testing (kill node, verify recovery) passed
  - Ops documentation (Backup/Restore, Upgrade) created

steps:
  - step: S0
    name: Config Unification
    done: All components use unified configuration (clap + config file/env)
    status: complete
    owner: peerB
    priority: P0

  - step: S1
    name: Observability Stack
    done: Prometheus, Grafana, and Loki deployed and scraping targets
    status: pending
    owner: peerB
    priority: P0

  - step: S2
    name: Service Telemetry Integration
    done: All components (Chainfire, FlareDB, IAM, k8shost) dashboards functional
    status: pending
    owner: peerB
    priority: P0

  - step: S3
    name: HA Clustering Verification
    done: 3-node Chainfire/FlareDB cluster survives single node failure
    status: pending
    owner: peerB
    priority: P0

  - step: S4
    name: Security Hardening
    done: mTLS/TLS enabled where appropriate, secrets management verified
    status: pending
    owner: peerB
    priority: P1

  - step: S5
    name: Ops Documentation
    done: Runbooks for common operations (Scale out, Restore, Upgrade)
    status: pending
    owner: peerB
    priority: P1

evidence: []
notes: |
  Separated from functional feature work (T028).