
Monitoring Overview

Status: In Progress (Prometheus implemented; Grafana + Evidently pending)

Monitoring ensures that the Time2Bet system remains reliable, observable, and safe to operate in production.

Monitoring covers three layers:

  1. Service and infrastructure health
  2. Data integrity and drift
  3. Model behavior and prediction stability


Implementation Roadmap

Phase 1 ✅ Complete

  • [x] Prometheus GET /metrics endpoint in FastAPI (src/app/main.py)
  • [x] Request latency + throughput via _PrometheusMiddleware
  • [x] Service health via GET /healthcheck/
  • [x] Prediction counters: prediction_requests_total{source}, prediction_timeouts_total
  • [x] ML worker metrics: inference_duration_seconds, prediction_confidence, model_info
  • [x] Celery queue stats: GET /monitoring/celery/queues, /celery/workers
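To make the Phase 1 counters concrete, the sketch below shows roughly what a scrape of GET /metrics could return for prediction_requests_total{source} and prediction_timeouts_total. It is a dependency-free illustration of the Prometheus text exposition format, not the code in src/app/main.py; a real service would normally rely on a Prometheus client library. The Counter class, the render_metrics helper, the help strings, and the source label values ("cache", "model") are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class Counter:
    """Hypothetical, minimal stand-in for a labelled Prometheus counter."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Dict[LabelSet, float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        # Counters are monotonic: they may only go up.
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[tuple(sorted(labels.items()))] += amount

    def render(self) -> str:
        # Text exposition format: HELP line, TYPE line, then one sample per label set.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for labels, value in sorted(self._values.items()):
            label_str = ",".join(f'{key}="{val}"' for key, val in labels)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


def render_metrics(*counters: Counter) -> str:
    """Concatenate all counters into one scrape response body."""
    return "".join(counter.render() for counter in counters)


if __name__ == "__main__":
    # Metric names come from the roadmap; help text and label values are assumed.
    requests_total = Counter("prediction_requests_total", "Prediction requests by source.")
    timeouts_total = Counter("prediction_timeouts_total", "Predictions that timed out.")

    requests_total.inc(source="cache")
    requests_total.inc(source="model")
    requests_total.inc(source="model")
    timeouts_total.inc()

    print(render_metrics(requests_total, timeouts_total))
```

Keeping source as a label rather than defining one counter per source lets a dashboard aggregate across sources or break them out without changing the instrumentation.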

Phase 2 📋 Planned (Current Priority)

  • [ ] Grafana dashboard (latency, throughput, error rate, queue depth)
  • [ ] Evidently data drift detection on prediction logs
  • [ ] Drift metrics exported to Prometheus
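As a rough illustration of what a drift metric on prediction logs might look like, the sketch below computes a Population Stability Index (PSI) between a reference window and a current window of a numeric value such as prediction_confidence. This is a generic stand-in, not Evidently and not the planned implementation; the function names, bin edges, and sample values are assumptions. The resulting number is the kind of value that could later be exported to Prometheus as a gauge.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin [edges[i], edges[i+1]); the last bin also includes its upper edge.

    Values outside the edges are not assigned to any bin.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= value < edges[i + 1] or (is_last and value == edges[-1]):
                counts[i] += 1
                break
    return [count / len(values) for count in counts]


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    ref_fracs = histogram_fractions(reference, edges)
    cur_fracs = histogram_fractions(current, edges)
    psi = 0.0
    for ref, cur in zip(ref_fracs, cur_fracs):
        # Floor empty bins at eps so the logarithm stays defined.
        ref = max(ref, eps)
        cur = max(cur, eps)
        psi += (cur - ref) * math.log(cur / ref)
    return psi


if __name__ == "__main__":
    # Hypothetical confidence values; real inputs would come from prediction logs.
    edges = [0.0, 0.25, 0.5, 0.75, 1.0]
    reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    current = [0.8, 0.85, 0.9, 0.95, 0.9]

    print(population_stability_index(reference, reference, edges))
    print(population_stability_index(reference, current, edges))
```

Identical windows give a PSI of zero, and the value grows as the current distribution moves away from the reference. Any alert threshold would need to be tuned against the system's own data.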

Phase 3 📋 Planned

  • [ ] Alerting rules (Prometheus Alertmanager)
  • [ ] Incident runbooks deployed
  • [ ] On-call escalation policy