Ops / Security / Observability Audit — SoccerPredictAI

Date: 2026-04-28
Auditor: GitHub Copilot (Claude Opus 4.7), /skill-ml-system-audit full (audit 10/12)
Scope: K8s deployment, secrets, CORS, HA, Prometheus, Grafana, drift monitoring
Baseline: docs/validation/20260424/10_ops_security_observability_audit.md


Delta vs baseline

No changes since 2026-04-26 to k8s/helm/ns_soccer-api/, src/app/main.py, src/app/metrics.py, src/monitoring/, or src/app/config/security.py. The dev-only files docker/grafana/, docker/prometheus.dev.yml, and docker/Dockerfile.evidently are still untracked and are not wired into the K8s manifests. Baseline findings remain in force.


Confirmed posture

HA: all components replicas=1 (FastAPI, worker-api, worker-ml, RabbitMQ, Redis). PostgreSQL/MinIO not in this Helm chart. HPA template exists (cpu=70%, mem=80%, min=1, max=3) but disabled in values.yaml.
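To make the disabled HPA's thresholds concrete, the sketch below applies the scaling rule documented for the Kubernetes HorizontalPodAutoscaler, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), to the template values listed above (cpu=70%, mem=80%, min=1, max=3). The helper name and the utilization figures are illustrative assumptions, and the controller's tolerance band and stabilization window are deliberately left out.

```python
import math

# Targets and bounds taken from the HPA template described above.
TARGETS = {"cpu": 70.0, "mem": 80.0}  # target average utilization, percent
MIN_REPLICAS, MAX_REPLICAS = 1, 3


def desired_replicas(current_replicas: int, utilization: dict[str, float]) -> int:
    """Return the replica count the HPA rule would ask for.

    Hypothetical helper: computes ceil(current * observed / target) for each
    metric, keeps the largest proposal, then clamps it to [min, max].
    """
    proposals = [
        math.ceil(current_replicas * utilization[name] / target)
        for name, target in TARGETS.items()
    ]
    return max(MIN_REPLICAS, min(MAX_REPLICAS, max(proposals)))


# Illustrative inputs only, not measurements from the cluster.
print(desired_replicas(1, {"cpu": 140.0, "mem": 60.0}))  # CPU at twice its target
print(desired_replicas(2, {"cpu": 200.0, "mem": 60.0}))  # proposal capped by max=3
```

With the HPA disabled and replicas=1, neither of these scale-ups would happen, which is the gap tracked by OPS-01 and OPS-06.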

Probes: FastAPI liveness + readiness on /healthcheck/. Celery worker liveness probe still needs verification (celery inspect ping).
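One way to verify the Celery liveness check is to wrap celery inspect ping in a small script that an exec probe could call. The sketch below is a minimal version under stated assumptions: the app module "proj.worker" and node name "celery@pod-1" are hypothetical placeholders, and it assumes the CLI exits non-zero when no node replies, which should be confirmed against the pinned Celery version.

```python
import subprocess
from typing import Callable


def build_ping_command(app: str, node: str) -> list[str]:
    """Build a `celery inspect ping` call aimed at a single worker node."""
    return ["celery", "-A", app, "inspect", "ping", "-d", node]


def worker_is_alive(
    app: str,
    node: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    timeout_s: float = 10.0,
) -> bool:
    """Return True only if the ping command exits 0 within the timeout.

    `run` is injectable so the logic can be exercised without a broker.
    A missing binary or a hung command both count as "not alive".
    """
    try:
        result = run(build_ping_command(app, node),
                     capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


# Hypothetical names; substitute the real Celery app module and node name.
print(build_ping_command("proj.worker", "celery@pod-1"))
```

A probe entry point would call worker_is_alive and exit with a non-zero status when it returns False. Targeting one node with -d keeps the check from passing merely because a different worker answered.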

Resources (memory/CPU): FastAPI 1Gi/500m, worker-ml 2Gi/500m, worker-api 1Gi/500m. PVCs use storageClassName: manual (HostPath, single-node).

Auth: X-Token only on /sources/*. /predict/*, /monitoring/*, /livescores/ open. CORS allow_origins=["*"]. Secrets via SOPS+age + pydantic-settings (clean).
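The checks that would close the auth and CORS gaps can be sketched framework-free. This is an illustration, not the project's code: the origin https://app.example.com is a hypothetical placeholder, and the protected prefixes simply extend the existing /sources/* rule to the two endpoint groups flagged as open risks. /livescores/ is left public here because the risk register does not flag it.

```python
import hmac
from typing import Optional

# Hypothetical allow-list; replaces the wildcard origin.
ALLOWED_ORIGINS = ("https://app.example.com",)

# /sources/ is already protected; the other two are the proposed additions.
PROTECTED_PREFIXES = ("/sources/", "/predict/", "/monitoring/")


def origin_allowed(origin: str) -> bool:
    """Exact-match check against the allow-list (no wildcard)."""
    return origin in ALLOWED_ORIGINS


def requires_token(path: str) -> bool:
    """True when the request path falls under a protected prefix."""
    return path.startswith(PROTECTED_PREFIXES)


def token_valid(presented: Optional[str], expected: str) -> bool:
    """Compare the X-Token header to the secret in constant time."""
    if not presented:
        return False
    return hmac.compare_digest(presented.encode(), expected.encode())


print(requires_token("/predict/today"), requires_token("/livescores/"))
```

In a FastAPI app the allow-list would be the value passed as allow_origins to CORSMiddleware in place of ["*"], and the token check would sit in a dependency shared by the protected routers.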

Metrics (Prometheus): request_count, request_latency_seconds, prediction_count, prediction_timeouts_total, prediction_latency_seconds, inference_latency_seconds, prediction_confidence (per outcome), model_info. Scrape on FastAPI :8000/metrics and Celery ML :9091/metrics.
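Instrumentation coverage can be re-checked mechanically by comparing a scrape of /metrics against the metric names listed above. The sketch below parses Prometheus text exposition with the standard library only; the helper names are assumptions. It matches by prefix because client libraries typically append suffixes such as _total, _bucket, _sum, and _count, so an unrelated metric that shares a prefix could produce a false positive.

```python
# Metric names as listed in this audit.
EXPECTED = (
    "request_count", "request_latency_seconds", "prediction_count",
    "prediction_timeouts_total", "prediction_latency_seconds",
    "inference_latency_seconds", "prediction_confidence", "model_info",
)


def sample_names(exposition: str) -> set[str]:
    """Collect sample names from text-format output, skipping comments."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line is `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition: str, expected=EXPECTED) -> list[str]:
    """Return expected metrics with no matching sample in the scrape."""
    present = sample_names(exposition)
    return [
        name for name in expected
        if not any(s == name or s.startswith(name + "_") for s in present)
    ]


# Tiny hand-written scrape for illustration, not real output.
demo = 'request_count_total{method="GET"} 5.0\nmodel_info{version="1"} 1.0\n'
print(missing_metrics(demo))
```

Running this against the text returned by both scrape targets (FastAPI :8000/metrics and Celery ML :9091/metrics) would show which process exposes which metric.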

Grafana: dev-only docker/grafana/ files exist (untracked); not deployed in K8s → still "Planned".

Drift / monitoring: src/monitoring/ empty. docker/Dockerfile.evidently untracked, no service wired in. Evidently integration not implemented.
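Until an Evidently service is wired in, the kind of check that src/monitoring/ lacks can be illustrated with a Population Stability Index (PSI) over prediction confidences. This is a stand-in technique, not Evidently's API and not code from the repository; the function names, the bin count, and the epsilon floor are all assumptions, and inputs are assumed to lie in [0, 1].

```python
import math
from typing import Sequence


def histogram(values: Sequence[float], bins: int = 10) -> list[float]:
    """Share of values per equal-width bin on [0, 1]."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = [0] * bins
    for v in values:
        # Clamp so 1.0 lands in the last bin and stray negatives in the first.
        idx = max(0, min(int(v * bins), bins - 1))
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    score = 0.0
    for r, c in zip(histogram(reference, bins), histogram(current, bins)):
        r, c = max(r, eps), max(c, eps)  # floor empty bins to avoid log(0)
        score += (c - r) * math.log(c / r)
    return score


# Synthetic data for illustration only.
reference = [i / 100 for i in range(100)]        # spread over [0, 1)
shifted = [0.8 + i / 500 for i in range(100)]    # concentrated near the top
print(psi(reference, reference), psi(reference, shifted) > 0.25)
```

A common rule of thumb treats PSI above roughly 0.25 as a significant shift, though any alert threshold would need tuning on the project's own data.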


Risk register (re-confirmed)

| ID | Severity | Description | Status |
|---|---|---|---|
| OPS-01 | P1 | All components replicas=1, no HA | Open (= R8) |
| OPS-02 | P1 | /predict/* and /monitoring/* unauthenticated | Open (= SRV-01) |
| OPS-03 | P1 | CORS allow_origins=["*"] | Open |
| OPS-04 | P1 | Grafana not deployed in K8s; dashboards missing | Open |
| OPS-05 | P1 | src/monitoring/ empty; no drift detection | Open (= R7) |
| OPS-06 | P2 | FastAPI HPA template present but disabled | Open |
| OPS-07 | P2 | Celery worker liveness probe needs verification | Open |
| OPS-08 | P2 | storageClassName: manual (HostPath) PVCs not portable | Open |

Summary

| Aspect | Status |
|---|---|
| Resource limits + probes for FastAPI | ✅ |
| Prometheus instrumentation coverage | ✅ |
| Secrets management (SOPS+age, pydantic-settings) | ✅ |
| Logging (no print) | ✅ |
| HA / replicas | ❌ (OPS-01) |
| Auth on inference / monitoring endpoints | ❌ (OPS-02) |
| Tight CORS | ❌ (OPS-03) |
| Grafana dashboards in production | ❌ (OPS-04) |
| Drift detection | ❌ (OPS-05) |

Recommendation: The dev-only docker/grafana/ and docker/Dockerfile.evidently artifacts are not yet promoted to K8s. Once promoted (and committed), re-run this audit to update OPS-04 and OPS-05.

See baseline §1–§3 for code-level detail.