Current Monitoring Coverage
This page is the source of truth for what observability is actually in place today.
Coverage matrix
| Layer | Tool | Status | What it covers |
|---|---|---|---|
| API request rate | Prometheus middleware | ✅ Implemented | soccer_requests_total counter |
| API latency | Prometheus histogram | ✅ Implemented | soccer_request_duration_seconds |
| API error rate (4xx/5xx) | Prometheus counter | ✅ Implemented | soccer_errors_total |
| Celery queue depth | `GET /monitoring/celery/queues` | ✅ Implemented | Per-queue message count |
| Celery active workers | `GET /monitoring/celery/workers` | ✅ Implemented | Worker ping status |
| Task status polling | `GET /monitoring/task_status/{id}` | ✅ Implemented | Celery result backend |
| Model version info | `GET /predict/model/info` | ✅ Implemented | Loaded model name + version |
| Prediction counters | Prometheus counter | ✅ Implemented | soccer_predictions_total |
| Service liveness | `GET /healthcheck/` | ✅ Implemented | Memory + status |
| Grafana dashboards | Grafana | 📋 Planned | Prometheus data available; dashboards not deployed |
| PostgreSQL metrics | pg_exporter | 📋 Planned | Not yet configured |
| Feature drift (Evidently) | Evidently | 📋 Planned | Architecture designed; not integrated |
| Prediction drift (Evidently) | Evidently | 📋 Planned | Requires ground truth feedback loop |
| Model performance monitoring | MLflow + Evidently | 📋 Planned | Ground truth arrives ~90 min after kick-off |
| Alerting rules (AlertManager) | AlertManager | 📋 Planned | Runbooks written; rules not deployed |
| Log aggregation | ELK / Loki | 📋 Planned | stdout today |
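
Because Grafana dashboards are not deployed yet, the implemented counters have to be read directly. As a minimal sketch, the error ratio over an interval can be derived from two scrapes of `soccer_requests_total` and `soccer_errors_total`. The function name and the dict shape are hypothetical, and the sketch assumes every error is also counted as a request.

```python
from typing import Optional


def error_ratio(prev: dict, curr: dict) -> Optional[float]:
    """Return errors/requests between two scrapes, or None if not computable.

    `prev` and `curr` are hypothetical snapshots holding the summed values of
    soccer_requests_total ("requests") and soccer_errors_total ("errors").
    """
    delta_requests = curr["requests"] - prev["requests"]
    delta_errors = curr["errors"] - prev["errors"]
    # A negative delta means a counter reset (for example a process restart);
    # a zero request delta means there was no traffic in the interval.
    if delta_requests <= 0 or delta_errors < 0:
        return None
    return delta_errors / delta_requests


# Usage with made-up numbers: 6 errors over 200 requests.
ratio = error_ratio({"requests": 1000, "errors": 10},
                    {"requests": 1200, "errors": 16})
```
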
What Prometheus currently exports
8 metrics are scraped from GET /metrics:
| Metric name | Type | Description |
|---|---|---|
| `soccer_requests_total` | Counter | Total HTTP requests by endpoint and status |
| `soccer_request_duration_seconds` | Histogram | Request latency by endpoint |
| `soccer_errors_total` | Counter | Total 4xx/5xx errors |
| `soccer_predictions_total` | Counter | Total predictions served |
| `soccer_celery_queue_length` | Gauge | Messages in each Celery queue |
| `soccer_celery_workers_active` | Gauge | Active Celery worker count |
| `soccer_model_loaded` | Gauge | 1 if model is loaded, 0 otherwise |
| `soccer_model_version` | Gauge (label) | Currently loaded model version |
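
One way to keep this table honest is to compare it against the text returned by `GET /metrics`. The sketch below parses Prometheus text exposition output with the standard library only and reports which of the eight names are absent. The helper names and the sample payload are illustrative assumptions, not output captured from the service.

```python
import re

# The eight metric names listed in the table above.
EXPECTED_METRICS = {
    "soccer_requests_total",
    "soccer_request_duration_seconds",
    "soccer_errors_total",
    "soccer_predictions_total",
    "soccer_celery_queue_length",
    "soccer_celery_workers_active",
    "soccer_model_loaded",
    "soccer_model_version",
}

_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def sample_names(exposition_text: str) -> set:
    """Collect the metric name of every sample line, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text: str) -> set:
    """Return expected metric names that have no sample in the payload."""
    seen = sample_names(exposition_text)
    missing = set()
    for expected in EXPECTED_METRICS:
        # Histograms are exposed as <name>_bucket, <name>_sum and <name>_count,
        # so a prefix match followed by "_" also counts as present.
        if not any(n == expected or n.startswith(expected + "_") for n in seen):
            missing.add(expected)
    return missing


# Made-up payload for illustration; label names and values are assumptions.
SAMPLE = """\
# HELP soccer_requests_total Total HTTP requests
# TYPE soccer_requests_total counter
soccer_requests_total{endpoint="/predict",status="200"} 12
soccer_request_duration_seconds_bucket{endpoint="/predict",le="0.1"} 9
soccer_request_duration_seconds_sum{endpoint="/predict"} 1.7
soccer_request_duration_seconds_count{endpoint="/predict"} 12
soccer_model_loaded 1
"""
```

Calling `missing_metrics(SAMPLE)` returns the names that the made-up payload does not contain; against the real endpoint an empty set would confirm the table.
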
Gaps and planned work
Priority 1 (next sprint):
- Deploy Grafana with a soccer-api dashboard (data is already being collected).
- Configure AlertManager rule for soccer_model_loaded == 0.
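
The deployed rule would be written as a Prometheus alerting rule rather than Python. Purely to illustrate the intended condition, the sketch below fires only when `soccer_model_loaded` has stayed at 0 for a hold period, which avoids alerting on a brief reload. The function name and the 120-second default are assumptions, not values taken from the runbooks.

```python
from typing import Iterable, Tuple


def model_unloaded_alert(samples: Iterable[Tuple[float, float]],
                         hold_seconds: float = 120.0) -> bool:
    """Return True if the gauge has been 0 continuously for hold_seconds.

    `samples` is a time-ordered sequence of (unix_timestamp, value) pairs for
    soccer_model_loaded. `hold_seconds` is a hypothetical grace period.
    """
    zero_since = None
    last_ts = None
    for ts, value in samples:
        last_ts = ts
        if value == 0:
            # Remember when the current run of zeros started.
            if zero_since is None:
                zero_since = ts
        else:
            # Any non-zero sample resets the run.
            zero_since = None
    if zero_since is None or last_ts is None:
        return False
    return last_ts - zero_since >= hold_seconds


# Usage: the model drops at t=60 and is still down at t=180.
firing = model_unloaded_alert([(0, 1), (60, 0), (120, 0), (180, 0)])
```
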
Priority 2:
- Integrate Evidently for feature drift detection (offline batch report first).
- Add pg_exporter sidecar for PostgreSQL query latency.
Priority 3:
- Real-time ground truth feedback loop (match result arrives ~90 min after kick-off).
- Automated retraining trigger on confirmed drift signal.
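
Until Evidently is integrated, the idea behind an offline feature drift report can be illustrated with a Population Stability Index (PSI) over one numeric feature. This is a swapped-in stand-in technique, not Evidently's API, and the function name, bin count, and sample data are all assumptions.

```python
import math
from bisect import bisect_right
from typing import Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that range
    fall into the first or last bin. `eps` avoids log(0) for empty bins.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference feature is constant; PSI is undefined")
    # Interior bin edges only: bins - 1 cut points give `bins` buckets.
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Made-up data: a training-time feature and a shifted serving-time copy.
training = [float(v) for v in range(100)]
serving = [v + 50.0 for v in training]
drift_score = psi(training, serving)
```

A score of 0 means the binned distributions match, and larger values mean a larger shift. Any threshold for the retraining trigger would need to be calibrated against the project's own data.
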