Monitoring Overview

Status: In Progress (Prometheus implemented; Grafana + Evidently pending)

Monitoring ensures that the Time2Bet system remains:

- reliable,
- observable,
- and safe to operate in production.

Monitoring covers three layers:

1. Service & infrastructure health,
2. Data integrity and drift,
3. Model behavior and prediction stability.

Implementation Roadmap

Phase 1 ✅ Complete

[x] Prometheus GET /metrics endpoint in FastAPI (src/app/main.py)
[x] Request latency + throughput via _PrometheusMiddleware
[x] Service health via GET /healthcheck/
[x] Prediction counters: prediction_requests_total{source}, prediction_timeouts_total
[x] ML worker metrics: inference_duration_seconds, prediction_confidence, model_info
[x] Celery queue stats: GET /monitoring/celery/queues, /celery/workers
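
As a rough illustration of how the counters and histogram listed above could be declared and updated, here is a minimal sketch using the `prometheus_client` library. It is not the code in `src/app/main.py`: the metric names come from the checklist, but the help strings, the `record_prediction` helper, and the `source` values `"api"` and `"batch"` are assumptions made for the example.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps this sketch isolated from the process-wide default.
registry = CollectorRegistry()

# Metric names are taken from the Phase 1 checklist; help strings are assumed.
PREDICTION_REQUESTS = Counter(
    "prediction_requests_total",
    "Prediction requests, partitioned by source (help text assumed)",
    ["source"],
    registry=registry,
)
PREDICTION_TIMEOUTS = Counter(
    "prediction_timeouts_total",
    "Prediction requests that timed out (help text assumed)",
    registry=registry,
)
INFERENCE_DURATION = Histogram(
    "inference_duration_seconds",
    "Model inference time in seconds (help text assumed)",
    registry=registry,
)


def record_prediction(source: str, duration_s: float, timed_out: bool = False) -> None:
    """Hypothetical helper: update the metrics for one prediction."""
    PREDICTION_REQUESTS.labels(source=source).inc()
    if timed_out:
        PREDICTION_TIMEOUTS.inc()
    else:
        INFERENCE_DURATION.observe(duration_s)


# "api" and "batch" are made-up label values for illustration.
record_prediction("api", 0.12)
record_prediction("batch", 0.0, timed_out=True)

# generate_latest renders the text exposition format that a metrics scrape returns.
exposition = generate_latest(registry).decode("utf-8")
print(exposition)
```

In a FastAPI service the same exposition text would typically be returned from the `/metrics` route, with the request-level metrics updated from middleware rather than from a helper like this one.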

Phase 2 📋 Planned (Current Priority)

[ ] Grafana dashboard (latency, throughput, error rate, queue depth)
[ ] Evidently data drift detection on prediction logs
[ ] Drift metrics exported to Prometheus
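
To make the idea of a drift metric concrete, the sketch below computes a Population Stability Index (PSI) between a reference sample and a current sample of one numeric feature. This is a dependency-free stand-in, not Evidently's API and not the planned implementation. The function name, the equal-width binning, and the `eps` floor for empty bins are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic data: a uniform reference sample and a copy shifted to the upper half.
reference_sample = [i / 100 for i in range(100)]
shifted_sample = [0.5 + i / 200 for i in range(100)]

psi_same = population_stability_index(reference_sample, reference_sample)
psi_shifted = population_stability_index(reference_sample, shifted_sample)
print(psi_same, psi_shifted)
```

Identical samples give a PSI of zero, and the value grows as the current distribution moves away from the reference. A score like this could be computed per feature over the prediction logs and published as a gauge once the export step is in place.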

Phase 3 📋 Planned

[ ] Alerting rules (Prometheus Alertmanager)
[ ] Incident runbooks deployed
[ ] On-call escalation policy