Monitoring Overview

Status: Partially implemented — Prometheus metrics operational; Grafana, AlertManager, and Evidently not yet deployed.


Current Observability (Implemented)

The inference service currently exports 8 Prometheus metrics via GET /metrics:

| Signal | Metric or endpoint |
| --- | --- |
| HTTP request rate and latency | soccer_requests_total, soccer_request_duration_seconds |
| HTTP error rate (4xx/5xx) | soccer_errors_total |
| Prediction volume | soccer_predictions_total |
| Celery queue depth | soccer_celery_queue_length (also via GET /monitoring/celery/queues) |
| Celery active workers | soccer_celery_workers_active (also via GET /monitoring/celery/workers) |
| Model load state | soccer_model_loaded |
| Model version | soccer_model_version |
| Service liveness | GET /healthcheck/ |

Metrics are scraped from GET /metrics (Prometheus exposition format). Full details: Metrics reference · Coverage matrix

No dashboards, alert channels, or drift detection integrations are active today. All operational monitoring is manual: operators query GET /metrics, GET /healthcheck/, and the /monitoring/celery/* endpoints directly.
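Until dashboards exist, a manual check of GET /metrics can be scripted. The following is a minimal sketch, not the project's tooling: the base URL http://localhost:8000 and the helper names parse_metrics and fetch_metrics are assumptions made for illustration. It also assumes the endpoint returns standard Prometheus exposition text. Values are summed per metric name, a simplification that suits counters and single-series gauges but not histogram buckets.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with the real service address


def parse_metrics(text: str) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus exposition text."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        if "{" in line:
            # Labelled series: name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            # Unlabelled series: name value [timestamp]
            name, *fields = line.split()
        if not fields:
            continue
        totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def fetch_metrics(base_url: str = BASE_URL) -> dict[str, float]:
    """Fetch GET /metrics and return the parsed totals."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling fetch_metrics() and reading keys such as soccer_model_loaded or soccer_celery_queue_length from the result gives a quick snapshot without opening the raw exposition text.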


Roadmap

Phase 2 — Next (Current Priority)

  • [ ] Grafana dashboard (latency, throughput, error rate, queue depth)
  • [ ] Evidently data drift detection on prediction logs (offline batch report first)
  • [ ] AlertManager rule: soccer_model_loaded == 0
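The planned alert condition, soccer_model_loaded == 0, is a plain threshold check. As a rough illustration of its logic only, the sketch below evaluates it over a sequence of scraped values in Python. The function name model_unloaded_alert and the hold_scrapes parameter are hypothetical. Requiring several consecutive zero readings imitates the "for" clause of a Prometheus alerting rule and is an assumption, not something the roadmap specifies.

```python
from collections.abc import Iterable


def model_unloaded_alert(samples: Iterable[float], hold_scrapes: int = 3) -> bool:
    """Return True once soccer_model_loaded was 0 for hold_scrapes consecutive scrapes."""
    consecutive = 0
    for value in samples:
        # Count consecutive zero readings; any non-zero reading resets the streak.
        consecutive = consecutive + 1 if value == 0 else 0
        if consecutive >= hold_scrapes:
            return True
    return False
```

With the default of three scrapes, a single zero reading during a model reload would not fire, while a sustained outage would.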

Phase 3 — Planned

  • [ ] Drift metrics exported to Prometheus
  • [ ] PostgreSQL query latency via pg_exporter
  • [ ] Log aggregation (ELK or Loki)
  • [ ] On-call escalation policy