Ops / Security / Observability Audit: SoccerPredictAI
Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7) — /skill-ml-system-audit full (audit 10/12)
Scope: K8s deployment, secrets, CORS, HA, Prometheus, Grafana, drift monitoring
Baseline: docs/validation/20260424/10_ops_security_observability_audit.md
Delta vs baseline
k8s/helm/ns_soccer-api/, src/app/main.py, src/app/metrics.py, src/monitoring/, src/app/config/security.py unchanged since 2026-04-26. Untracked dev-only files (docker/grafana/, docker/prometheus.dev.yml, docker/Dockerfile.evidently) remain untracked and not wired into K8s manifests. Baseline findings remain in force.
Confirmed posture
HA: all components replicas=1 (FastAPI, worker-api, worker-ml, RabbitMQ, Redis). PostgreSQL/MinIO not in this Helm chart. HPA template exists (cpu=70%, mem=80%, min=1, max=3) but disabled in values.yaml.
Probes: FastAPI liveness + readiness on /healthcheck/. Celery worker liveness — needs verification of celery inspect ping.
Resources: FastAPI 1Gi/500m, worker-ml 2Gi/500m, worker-api 1Gi/500m. PVCs use storageClassName: manual (HostPath single-node).
Auth: X-Token only on /sources/*. /predict/*, /monitoring/*, /livescores/ open. CORS allow_origins=["*"]. Secrets via SOPS+age + pydantic-settings (clean).
Metrics (Prometheus): request_count, request_latency_seconds, prediction_count, prediction_timeouts_total, prediction_latency_seconds, inference_latency_seconds, prediction_confidence (per outcome), model_info. Scrape on FastAPI :8000/metrics and Celery ML :9091/metrics.
Grafana: dev-only docker/grafana/ files exist (untracked); not deployed in K8s → still "Planned".
Drift / monitoring: src/monitoring/ empty. docker/Dockerfile.evidently untracked, no service wired in. Evidently integration not implemented.
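To make the missing drift check concrete, the following is a minimal, stdlib-only sketch of one common drift statistic, the Population Stability Index (PSI). It is not the Evidently API and not code from this repository: the function name `psi`, the bin count, and the idea of feeding it a window of model inputs or `prediction_confidence` values are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a current sample.

    Hypothetical helper: bins are equal-width over the reference range.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(expected), max(expected)
    # Fall back to width 1.0 when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp out-of-range values into the first or last bin.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p_ref = proportions(expected)
    p_cur = proportions(actual)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


if __name__ == "__main__":
    reference = [i / 100 for i in range(100)]
    shifted = [0.5 + i / 200 for i in range(100)]
    print(f"PSI same sample:    {psi(reference, reference):.3f}")
    print(f"PSI shifted sample: {psi(reference, shifted):.3f}")
```

A commonly cited rule of thumb treats PSI below roughly 0.1 as stable and above roughly 0.25 as a significant shift. Those thresholds are a convention, not something this audit established, and would need tuning against real traffic.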
Risk register (re-confirmed)
| ID | Severity | Description | Status |
|---|---|---|---|
| OPS-01 | P1 | All components replicas=1, no HA | Open (= R8) |
| OPS-02 | P1 | /predict/* and /monitoring/* unauthenticated | Open (= SRV-01) |
| OPS-03 | P1 | CORS allow_origins=["*"] | Open |
| OPS-04 | P1 | Grafana not deployed in K8s — dashboards missing | Open |
| OPS-05 | P1 | src/monitoring/ empty — no drift detection | Open (= R7) |
| OPS-06 | P2 | FastAPI HPA template present but disabled | Open |
| OPS-07 | P2 | Celery worker liveness probe needs verification | Open |
| OPS-08 | P2 | storageClassName: manual (HostPath) PVCs not portable | Open |
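OPS-02 and OPS-03 can both be addressed with a small amount of framework-independent logic. The sketch below is an illustration under stated assumptions, not the project's code. The function names `parse_allowed_origins` and `token_matches` are hypothetical, as is the choice of a comma-separated setting for the origin allowlist. It is also not known from this audit how the existing X-Token check on /sources/* is implemented.

```python
import hmac
from typing import List, Optional
from urllib.parse import urlsplit


def parse_allowed_origins(raw: str) -> List[str]:
    """Turn a comma-separated setting into a validated CORS origin allowlist.

    Hypothetical helper: rejects "*" and anything that is not scheme://host[:port].
    """
    origins: List[str] = []
    for item in raw.split(","):
        origin = item.strip().rstrip("/")
        if not origin:
            continue  # tolerate stray commas and blanks
        parts = urlsplit(origin)
        if parts.scheme not in ("http", "https") or not parts.netloc or parts.path:
            raise ValueError(f"invalid CORS origin: {item!r}")
        origins.append(origin)
    return origins


def token_matches(presented: Optional[str], expected: str) -> bool:
    """Compare a presented X-Token value with the configured secret in constant time."""
    if not presented or not expected:
        return False
    return hmac.compare_digest(presented.encode(), expected.encode())


if __name__ == "__main__":
    print(parse_allowed_origins("https://app.example.com, http://localhost:3000/"))
    print(token_matches("s3cret", "s3cret"), token_matches(None, "s3cret"))
```

In a FastAPI application, the validated list could be passed as the `allow_origins` argument of `CORSMiddleware` in place of `["*"]`, and `token_matches` could back a dependency applied to the /predict/* and /monitoring/* routers. Both wiring steps are left out here because they depend on how src/app/main.py is structured.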
Summary
| Aspect | Status |
|---|---|
| Resource limits + probes for FastAPI | ✅ |
| Prometheus instrumentation coverage | ✅ |
| Secrets management (SOPS+age, pydantic-settings) | ✅ |
| Logging (no print) | ✅ |
| HA / replicas | ❌ (OPS-01) |
| Auth on inference / monitoring endpoints | ❌ (OPS-02) |
| Tight CORS | ❌ (OPS-03) |
| Grafana dashboards in production | ❌ (OPS-04) |
| Drift detection | ❌ (OPS-05) |
Recommendation: The dev-only docker/grafana/ and docker/Dockerfile.evidently artifacts are not yet promoted to K8s. Once promoted (and committed), re-run this audit to update OPS-04 and OPS-05.
See baseline §1–§3 for code-level detail.