Current Monitoring Coverage

This page is the source of truth for what observability is actually in place today.


Coverage matrix

| Layer | Tool | Status | What it covers |
|---|---|---|---|
| API request rate | Prometheus middleware | ✅ Implemented | soccer_requests_total counter |
| API latency | Prometheus histogram | ✅ Implemented | soccer_request_duration_seconds |
| API error rate (4xx/5xx) | Prometheus counter | ✅ Implemented | soccer_errors_total |
| Celery queue depth | GET /monitoring/celery/queues | ✅ Implemented | Per-queue message count |
| Celery active workers | GET /monitoring/celery/workers | ✅ Implemented | Worker ping status |
| Task status polling | GET /monitoring/task_status/{id} | ✅ Implemented | Celery result backend |
| Model version info | GET /predict/model/info | ✅ Implemented | Loaded model name + version |
| Prediction counters | Prometheus counter | ✅ Implemented | soccer_predictions_total |
| Service liveness | GET /healthcheck/ | ✅ Implemented | Memory + status |
| Grafana dashboards | Grafana | 📋 Planned | Prometheus data available; dashboards not deployed |
| PostgreSQL metrics | pg_exporter | 📋 Planned | Not yet configured |
| Feature drift (Evidently) | Evidently | 📋 Planned | Architecture designed; not integrated |
| Prediction drift (Evidently) | Evidently | 📋 Planned | Requires ground truth feedback loop |
| Model performance monitoring | MLflow + Evidently | 📋 Planned | Ground truth lag ~90 min after match |
| Alerting rules (AlertManager) | AlertManager | 📋 Planned | Runbooks written; rules not deployed |
| Log aggregation | ELK / Loki | 📋 Planned | stdout today |
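
Until dashboards exist, the implemented HTTP endpoints in the matrix can be spot-checked with a small script. The following is a minimal sketch, not part of the service: it assumes each endpoint returns a JSON body (the response shapes are not documented on this page), and the `probe` and `http_get` helpers are hypothetical names introduced for illustration. `GET /monitoring/task_status/{id}` is left out because it needs a real task id.

```python
import json
import urllib.request
from typing import Callable, Dict

# Paths copied from the coverage matrix. Assumption: each one answers with JSON.
ENDPOINTS = [
    "/healthcheck/",
    "/monitoring/celery/queues",
    "/monitoring/celery/workers",
    "/predict/model/info",
]


def http_get(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the body as text (raises on HTTP or network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def probe(base_url: str, fetch: Callable[[str], str] = http_get) -> Dict[str, bool]:
    """Return {path: ok}, where ok means the endpoint answered with parseable JSON."""
    results: Dict[str, bool] = {}
    for path in ENDPOINTS:
        try:
            json.loads(fetch(base_url.rstrip("/") + path))
            results[path] = True
        except (OSError, ValueError):
            # OSError covers connection and HTTP errors; ValueError covers bad JSON.
            results[path] = False
    return results
```

Calling `probe("http://localhost:8000")` (the host and port are an assumption) returns a dictionary with one boolean per endpoint. The `fetch` parameter exists so the check can be exercised without a running service.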

What Prometheus currently exports

Eight metrics are exposed at GET /metrics:

| Metric name | Type | Description |
|---|---|---|
| soccer_requests_total | Counter | Total HTTP requests by endpoint and status |
| soccer_request_duration_seconds | Histogram | Request latency by endpoint |
| soccer_errors_total | Counter | Total 4xx/5xx errors |
| soccer_predictions_total | Counter | Total predictions served |
| soccer_celery_queue_length | Gauge | Messages in each Celery queue |
| soccer_celery_workers_active | Gauge | Active Celery worker count |
| soccer_model_loaded | Gauge | 1 if model is loaded, 0 otherwise |
| soccer_model_version | Gauge (label) | Currently loaded model version |
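
To keep this list honest against the running service, the body of `GET /metrics` can be compared with the names above. The sketch below is illustrative only: it assumes the endpoint serves the standard Prometheus text exposition format, and the function names are hypothetical. Histogram series are matched through their `_bucket`, `_sum` and `_count` suffixes.

```python
import re

# Metric names copied from the table above.
EXPECTED_METRICS = {
    "soccer_requests_total",
    "soccer_request_duration_seconds",
    "soccer_errors_total",
    "soccer_predictions_total",
    "soccer_celery_queue_length",
    "soccer_celery_workers_active",
    "soccer_model_loaded",
    "soccer_model_version",
}

# A sample line starts with the metric name, optionally followed by {labels}.
_NAME_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")
# Histograms are exposed as <name>_bucket, <name>_sum and <name>_count series.
_HISTOGRAM_SUFFIXES = ("_bucket", "_sum", "_count")


def exported_metric_names(exposition_text: str) -> set:
    """Collect metric names from sample lines, skipping # HELP / # TYPE comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _NAME_RE.match(line)
        if not match:
            continue
        name = match.group(1)
        names.add(name)
        # Also record the base name so histogram series count as their parent metric.
        for suffix in _HISTOGRAM_SUFFIXES:
            if name.endswith(suffix):
                names.add(name[: -len(suffix)])
    return names


def missing_metrics(exposition_text: str) -> set:
    """Return the expected metric names that do not appear in the scrape."""
    return EXPECTED_METRICS - exported_metric_names(exposition_text)
```

An empty result from `missing_metrics` means all eight names appeared in that scrape. A labelled metric with no observations yet may legitimately be absent, so a non-empty result is a prompt to investigate rather than proof of a regression.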

Gaps and planned work

Priority 1 (next sprint):
- Deploy Grafana with a soccer-api dashboard (data is already being collected).
- Configure AlertManager rule for soccer_model_loaded == 0.
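
Until the AlertManager rule is deployed, the condition `soccer_model_loaded == 0` can be checked by hand against a scrape. This is a rough stand-in written for illustration, not the planned rule: it assumes the text exposition format, reads only the first matching sample, and uses hypothetical helper names.

```python
from typing import Optional


def metric_value(exposition_text: str, name: str) -> Optional[float]:
    """Return the value of the first sample named `name`, or None if it is absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line and "}" in line:
            # Labelled sample: name{labels} value [timestamp]
            series_name = line.split("{", 1)[0]
            rest = line.rsplit("}", 1)[1]
        else:
            # Unlabelled sample: name value [timestamp]
            series_name, _, rest = line.partition(" ")
        if series_name == name:
            fields = rest.split()
            if fields:
                return float(fields[0])
    return None


def model_down(exposition_text: str) -> bool:
    """True only when soccer_model_loaded is present and equal to 0."""
    return metric_value(exposition_text, "soccer_model_loaded") == 0.0
```

Like the PromQL comparison, `model_down` returns False when the series is missing altogether, so a service that stops exporting the gauge would need a separate check.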

Priority 2:
- Integrate Evidently for feature drift detection (offline batch report first).
- Add pg_exporter sidecar for PostgreSQL query latency.
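
To make the idea of an offline feature drift check concrete, the sketch below computes a Population Stability Index (PSI) for one numeric feature. This is a swapped-in stand-in using only the standard library; it is not Evidently's API and not the designed architecture. The bin count and the `1e-6` floor are arbitrary assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `current` against `reference` for one feature.

    Bin edges are equal-width over the reference range; values outside that
    range fall into the first or last bin. Both inputs must be non-empty.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps empty bins from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

Identical samples give a PSI of 0, and the value grows as the current batch moves away from the reference. Any alerting threshold would have to be chosen per feature; none is defined on this page.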

Priority 3:
- Real-time ground truth feedback loop (match result arrives ~90 min after kick-off).
- Automated retraining trigger on confirmed drift signal.