Monitoring Overview
Status: In Progress (Prometheus implemented; Grafana + Evidently pending)
Monitoring ensures that the Time2Bet system remains:

- reliable,
- observable,
- and safe to operate in production.

Monitoring covers three layers:

1. Service and infrastructure health,
2. Data integrity and drift,
3. Model behavior and prediction stability.
Implementation Roadmap
Phase 1 ✅ Complete
- [x] Prometheus `GET /metrics` endpoint in FastAPI (`src/app/main.py`)
- [x] Request latency + throughput via `_PrometheusMiddleware`
- [x] Service health via `GET /healthcheck/`
- [x] Prediction counters: `prediction_requests_total{source}`, `prediction_timeouts_total`
- [x] ML worker metrics: `inference_duration_seconds`, `prediction_confidence`, `model_info`
- [x] Celery queue stats: `GET /monitoring/celery/queues`, `/celery/workers`
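
To illustrate how a middleware can capture request latency and throughput, here is a minimal sketch. It is not the project's `_PrometheusMiddleware`: the class name `TimingMiddleware`, the metric name `http_request_duration_seconds`, and the `path` label are assumptions made for this example. Plain dictionaries stand in for a metrics library such as `prometheus_client`, so the sketch runs with the standard library only.

```python
import asyncio
import time
from collections import defaultdict


class TimingMiddleware:
    """Pure-ASGI middleware recording request count and total latency per path.

    Hypothetical stand-in for a Prometheus middleware; all names are illustrative.
    """

    def __init__(self, app):
        self.app = app
        self.request_count = defaultdict(int)   # path -> number of requests
        self.latency_sum = defaultdict(float)   # path -> total seconds spent

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through without timing it.
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Record the request even if the wrapped app raised.
            path = scope["path"]
            self.request_count[path] += 1
            self.latency_sum[path] += time.perf_counter() - start

    def render(self):
        """Return the recorded values in the Prometheus text exposition format."""
        lines = ["# TYPE http_request_duration_seconds summary"]
        for path in sorted(self.request_count):
            label = f'{{path="{path}"}}'
            lines.append(
                f"http_request_duration_seconds_sum{label} {self.latency_sum[path]:.6f}"
            )
            lines.append(
                f"http_request_duration_seconds_count{label} {self.request_count[path]}"
            )
        return "\n".join(lines) + "\n"


async def demo_app(scope, receive, send):
    """Tiny ASGI app that always answers 200 with the body b"ok"."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def _noop_receive():
    return {"type": "http.request", "body": b"", "more_body": False}


async def _discard_send(message):
    pass


# Drive the middleware twice without a server, then render the metrics text.
middleware = TimingMiddleware(demo_app)
scope = {"type": "http", "method": "GET", "path": "/healthcheck/"}
for _ in range(2):
    asyncio.run(middleware(scope, _noop_receive, _discard_send))
metrics_text = middleware.render()
print(metrics_text)
```

With a summary that exposes only `_sum` and `_count`, throughput can be derived from the rate of `_count`, and mean latency from the rate of `_sum` divided by the rate of `_count`.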
Phase 2 📋 Planned (Current Priority)
- [ ] Grafana dashboard (latency, throughput, error rate, queue depth)
- [ ] Evidently data drift detection on prediction logs
- [ ] Drift metrics exported to Prometheus
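
As a rough illustration of what a drift metric exported to Prometheus could look like, the sketch below computes a Population Stability Index (PSI) between a reference sample and a current sample, then formats it as a gauge line. This is a simple standard-library substitute and does not use Evidently; the function names, the metric name `prediction_drift_psi`, the `feature` label, and the sample data are all assumptions made for this example.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of the reference sample; values in
    the current sample that fall outside that range are clamped to the edge bins.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant samples

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def render_gauge(feature, value):
    """Format one drift value as a Prometheus text-format sample line."""
    return f'prediction_drift_psi{{feature="{feature}"}} {value:.4f}'


# Synthetic data: the current sample is shifted toward the upper half of the range.
reference_sample = [i / 100 for i in range(100)]
shifted_sample = [0.5 + i / 200 for i in range(100)]

psi_same = psi(reference_sample, reference_sample)
psi_shifted = psi(reference_sample, shifted_sample)
print(render_gauge("feature_a", psi_shifted))
```

Identical samples give a PSI of zero, and the value grows as the two distributions diverge. A commonly quoted rule of thumb treats values above roughly 0.2 as a significant shift, but any alert threshold would need tuning against the project's own prediction logs.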
Phase 3 📋 Planned
- [ ] Alerting rules (Prometheus Alertmanager)
- [ ] Incident runbooks deployed
- [ ] On-call escalation policy