Monitoring Overview
Status: Partially implemented. Prometheus metrics are operational; Grafana, AlertManager, and Evidently are not yet deployed.
Current Observability (Implemented)
The inference service currently exports 8 Prometheus metrics via GET /metrics; the last table row is a liveness endpoint rather than a metric:
| Signal | Metric or endpoint |
|---|---|
| HTTP request rate and latency | soccer_requests_total, soccer_request_duration_seconds |
| HTTP error rate (4xx/5xx) | soccer_errors_total |
| Prediction volume | soccer_predictions_total |
| Celery queue depth | soccer_celery_queue_length (also via GET /monitoring/celery/queues) |
| Celery active workers | soccer_celery_workers_active (also via GET /monitoring/celery/workers) |
| Model load state | soccer_model_loaded |
| Model version | soccer_model_version |
| Service liveness | GET /healthcheck/ |
Metrics are scraped from GET /metrics (Prometheus exposition format).
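Because soccer_requests_total and soccer_errors_total are counters, the request rate and error rate in the table come from the difference between two readings rather than from a single value. The following is a minimal sketch of that arithmetic, not the service's own code. It assumes each counter has already been summed across its label sets and that soccer_errors_total counts a subset of the requests in soccer_requests_total; the readings are hypothetical.

```python
def counter_increase(previous: float, current: float) -> float:
    """Increase of a cumulative counter between two readings."""
    # A lower current value means the process restarted and the counter
    # began again from zero, so the whole current value counts as new.
    return current if current < previous else current - previous


def request_rate(previous: dict, current: dict, interval_seconds: float) -> float:
    """Requests per second between two scrapes taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    increase = counter_increase(
        previous["soccer_requests_total"], current["soccer_requests_total"]
    )
    return increase / interval_seconds


def error_ratio(previous: dict, current: dict) -> float:
    """Fraction of requests in the interval that were counted as errors."""
    requests = counter_increase(
        previous["soccer_requests_total"], current["soccer_requests_total"]
    )
    errors = counter_increase(
        previous["soccer_errors_total"], current["soccer_errors_total"]
    )
    return errors / requests if requests else 0.0


# Hypothetical readings taken 60 seconds apart.
earlier = {"soccer_requests_total": 1200.0, "soccer_errors_total": 30.0}
later = {"soccer_requests_total": 1500.0, "soccer_errors_total": 36.0}

print(request_rate(earlier, later, 60.0))  # 5.0 requests per second
print(error_ratio(earlier, later))         # 0.02, i.e. 2% of requests
```

This mirrors what a Prometheus rate() query does over a time window, reduced to two samples.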
Full details: Metrics reference · Coverage matrix
There are no active dashboards, alert channels, or drift detection integrations today.
All operational monitoring is manual, via GET /metrics, GET /healthcheck/, and GET /monitoring/celery/*.
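A manual check can be partly scripted. The sketch below parses a /metrics response body with the standard library and summarises three of the gauges listed above. It is illustrative only: the sample payload and the queue label are hypothetical, values are summed across label sets, and treating a positive soccer_model_loaded as "loaded" is an assumption.

```python
import re
from collections import defaultdict

# Matches "name value" or "name{labels} value", with an optional timestamp.
SAMPLE_LINE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+(\S+)(?:\s+-?\d+)?$"
)


def parse_metrics(text):
    """Sum sample values by metric name, ignoring labels and comments."""
    totals = defaultdict(float)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE comment
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        name, value = match.groups()
        try:
            totals[name] += float(value)
        except ValueError:
            continue  # value was not a number
    return dict(totals)


def manual_check(text):
    """Summarise the gauges an operator would otherwise read by eye."""
    metrics = parse_metrics(text)
    return {
        # Assumption: any positive value means the model is loaded.
        "model_loaded": metrics.get("soccer_model_loaded", 0.0) > 0,
        "queue_length": metrics.get("soccer_celery_queue_length"),
        "workers_active": metrics.get("soccer_celery_workers_active"),
    }


# Hypothetical response body; the "queue" label is an assumption.
SAMPLE_BODY = """\
# TYPE soccer_model_loaded gauge
soccer_model_loaded 1
soccer_celery_queue_length{queue="default"} 3
soccer_celery_queue_length{queue="priority"} 1
soccer_celery_workers_active 2
"""

summary = manual_check(SAMPLE_BODY)
print(summary)
```

In practice the text would come from an HTTP GET against the service's /metrics endpoint (for example with urllib.request.urlopen) instead of SAMPLE_BODY.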
Roadmap
Phase 2: Next (Current Priority)
- [ ] Grafana dashboard (latency, throughput, error rate, queue depth)
- [ ] Evidently data drift detection on prediction logs (offline batch report first)
- [ ] AlertManager rule: soccer_model_loaded == 0
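The roadmap gives only the expression for the planned alert. One detail worth settling before deployment is how long the condition must hold, since the gauge may briefly read 0 while the service starts. The sketch below shows that idea in plain Python, loosely analogous to the for clause of a Prometheus alerting rule. The function name and the three-sample window are assumptions, not part of the planned rule.

```python
from typing import Sequence


def model_unloaded_alert(samples: Sequence[float], required_consecutive: int = 3) -> bool:
    """Return True when the newest readings of soccer_model_loaded are all 0.

    samples holds readings ordered oldest to newest. The alert fires only
    when the last required_consecutive readings are all zero, so a single
    zero during startup does not trigger it.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(samples) < required_consecutive:
        return False  # not enough history to decide yet
    return all(value == 0 for value in samples[-required_consecutive:])


print(model_unloaded_alert([1, 0, 0, 0]))  # True: three zeros in a row
print(model_unloaded_alert([0, 0, 1]))     # False: the model is loaded again
print(model_unloaded_alert([0, 0]))        # False: too few samples
```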
Phase 3: Planned
- [ ] Drift metrics exported to Prometheus
- [ ] PostgreSQL query latency via pg_exporter
- [ ] Log aggregation (ELK or Loki)
- [ ] On-call escalation policy