Monitoring Overview

Status: Partially implemented — Prometheus metrics operational; Grafana, AlertManager, and Evidently not yet deployed.

Current Observability (Implemented)

The inference service currently exports 8 Prometheus metrics via GET /metrics:

Signal	Metric or endpoint
HTTP request rate and latency	`soccer_requests_total`, `soccer_request_duration_seconds`
HTTP error rate (4xx/5xx)	`soccer_errors_total`
Prediction volume	`soccer_predictions_total`
Celery queue depth	`soccer_celery_queue_length` (also via `GET /monitoring/celery/queues`)
Celery active workers	`soccer_celery_workers_active` (also via `GET /monitoring/celery/workers`)
Model load state	`soccer_model_loaded`
Model version	`soccer_model_version`
Service liveness	`GET /healthcheck/`

Metrics are scraped from GET /metrics (Prometheus exposition format). Full details: Metrics reference · Coverage matrix
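As a rough illustration of reading that output by hand, the sketch below parses exposition-format text and sums the samples of each metric. The sample payload, its label names, and its values are invented for the example and are not taken from the service; only the metric names come from the table above. The parser is deliberately minimal and assumes label values contain no `}` character.

```python
import re

# Hypothetical sample of what GET /metrics might return. The label names and
# values are assumptions made up for this example.
SAMPLE = """\
# HELP soccer_requests_total Total HTTP requests
# TYPE soccer_requests_total counter
soccer_requests_total{method="GET",endpoint="/predict"} 120
soccer_requests_total{method="POST",endpoint="/predict"} 80
soccer_errors_total{status="500"} 4
soccer_model_loaded 1
soccer_celery_queue_length{queue="default"} 3
"""

# metric name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def sum_by_metric(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, _labels, value = match.groups()
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


totals = sum_by_metric(SAMPLE)
# Lifetime error ratio from the two counters (not a per-interval rate).
error_ratio = totals["soccer_errors_total"] / totals["soccer_requests_total"]
model_loaded = totals["soccer_model_loaded"] == 1.0
```

Summing across labels is only meaningful for counters and simple gauges. If `soccer_request_duration_seconds` is a histogram, its `_bucket`, `_sum`, and `_count` series would need separate handling.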

There are no active dashboards, alert channels, or drift-detection integrations today. All operational monitoring is manual: operators query GET /metrics, GET /healthcheck/, and GET /monitoring/celery/* directly.
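Until dashboards exist, the manual check could be scripted along these lines. This is a minimal sketch using only the standard library. The endpoint paths come from this page; the base URL, the helper names, and the choice to treat any status other than 200 as failing are assumptions. It reports HTTP status only, since the response bodies of these endpoints are not documented here.

```python
import urllib.error
import urllib.request

# Paths listed on this page; the host and port are supplied by the caller.
PATHS = [
    "/healthcheck/",
    "/metrics",
    "/monitoring/celery/queues",
    "/monitoring/celery/workers",
]


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code worth reporting.
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def check_endpoints(base_url, get_status=http_status):
    """Map each monitoring path to the status returned by get_status."""
    base = base_url.rstrip("/")
    return {path: get_status(base + path) for path in PATHS}


def failing(results):
    """Paths whose status is anything other than 200 (assumed success code)."""
    return sorted(path for path, status in results.items() if status != 200)
```

Calling `failing(check_endpoints("http://localhost:8000"))` would list the endpoints that did not answer with 200; the address is a placeholder for wherever the service actually runs. Passing a different `get_status` callable makes the check easy to exercise without a live service.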

Roadmap

Phase 2 — Next (Current Priority)

[ ] Grafana dashboard (latency, throughput, error rate, queue depth)
[ ] Evidently data drift detection on prediction logs (offline batch report first)
[ ] Alerting rule for soccer_model_loaded == 0, evaluated by Prometheus and routed through AlertManager
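The model-loaded alert in the list above might be drafted as follows. In a typical setup the rule is evaluated by Prometheus and AlertManager handles routing, so this sketch builds a Prometheus rule group as a Python dict and serialises it to JSON, which YAML parsers generally accept. Only the expression `soccer_model_loaded == 0` comes from this page; the group name, alert name, `for` duration, severity label, and summary text are placeholder assumptions. The output should be validated with `promtool check rules` before use.

```python
import json


def model_not_loaded_rule(for_duration="1m", severity="critical"):
    """Build a Prometheus rule group for the model-not-loaded condition."""
    return {
        "groups": [
            {
                "name": "soccer-inference",  # hypothetical group name
                "rules": [
                    {
                        "alert": "SoccerModelNotLoaded",  # hypothetical alert name
                        "expr": "soccer_model_loaded == 0",
                        # Condition must hold this long before the alert fires.
                        "for": for_duration,
                        "labels": {"severity": severity},
                        "annotations": {
                            "summary": "Inference service reports no model loaded",
                        },
                    }
                ],
            }
        ]
    }


rule_file_text = json.dumps(model_not_loaded_rule(), indent=2)
print(rule_file_text)
```

A non-zero `for` duration avoids firing during a brief restart while the model is still loading; the right value depends on how long a normal load takes.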

Phase 3 — Planned

[ ] Drift metrics exported to Prometheus
[ ] PostgreSQL query latency via pg_exporter
[ ] Log aggregation (ELK or Loki)
[ ] On-call escalation policy
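To make the drift-metrics item more concrete, here is one possible shape for it. The sketch computes a Population Stability Index (PSI) between a reference sample and a current sample, then formats it as a single exposition-format gauge line. PSI is used here as a simple stand-in and is not necessarily the statistic Evidently would report. The metric name `soccer_feature_drift_psi`, the `feature` label, and the bin edges are all hypothetical.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population Stability Index of two samples over shared bin edges.

    Values outside [edges[0], edges[-1]] are ignored. Empty bins are
    floored at eps so the logarithm stays defined.
    """
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(counts)):
                is_last = i == len(counts) - 1
                # Bins are half-open, except the last one includes its upper edge.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        total = sum(counts)
        return [max(c / total, eps) if total else eps for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def as_gauge_line(feature, value):
    """Format one sample in Prometheus exposition format (hypothetical metric name)."""
    return f'soccer_feature_drift_psi{{feature="{feature}"}} {value:.6f}'
```

Identical distributions give a PSI of 0, and the value grows as the current sample moves away from the reference. Any alert threshold on it would need to be chosen from observed data.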