
Alerting Strategy

Status: 📋 Planned. Alertmanager is not deployed; no active alert rules or notification channels exist today.

Alert rules and runbooks have been designed, but nothing is configured yet in Prometheus or Alertmanager.


Intended alert rules

Service health

| Condition | Intended severity |
|---|---|
| soccer_model_loaded == 0 | P1: model not available |
| Sustained soccer_request_duration_seconds p99 > 500ms | P2: latency SLO breach |
| Elevated soccer_errors_total rate (5xx) | P2: error spike |
| Service unavailable (/healthcheck/ failing) | P1 |
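
Since none of these rules exist yet, the following is only a sketch of how the service health conditions might be expressed as Prometheus-style rule definitions. The alert names, the `for` durations, the error-rate threshold, the `status` label, and the `probe_success` job name are all assumptions made for illustration; only the metric names, the 500 ms p99 limit, and the severities come from the table above. It also assumes `soccer_request_duration_seconds` is exported as a histogram.

```python
import json

# Hypothetical rule definitions mirroring the service health table.
# Durations and the error-rate threshold are placeholders, not agreed values.
SERVICE_HEALTH_RULES = [
    {
        "alert": "SoccerModelNotLoaded",  # hypothetical alert name
        "expr": "soccer_model_loaded == 0",
        "for": "1m",
        "labels": {"severity": "P1"},
    },
    {
        "alert": "SoccerLatencyP99High",  # hypothetical alert name
        # Assumes the metric is a histogram with the usual _bucket series.
        "expr": (
            "histogram_quantile(0.99, sum by (le) "
            "(rate(soccer_request_duration_seconds_bucket[5m]))) > 0.5"
        ),
        "for": "10m",
        "labels": {"severity": "P2"},
    },
    {
        "alert": "SoccerErrorRateHigh",  # hypothetical alert name
        # Assumes a `status` label on soccer_errors_total; 0.1/s is a placeholder.
        "expr": 'sum(rate(soccer_errors_total{status=~"5.."}[5m])) > 0.1',
        "for": "5m",
        "labels": {"severity": "P2"},
    },
    {
        "alert": "SoccerHealthcheckFailing",  # hypothetical alert name
        # Assumes a blackbox-style probe of /healthcheck/ under this job name.
        "expr": 'probe_success{job="soccer-healthcheck"} == 0',
        "for": "2m",
        "labels": {"severity": "P1"},
    },
]


def build_rule_group(name, rules):
    """Wrap rules in the groups/rules layout used by Prometheus rule files."""
    return {"groups": [{"name": name, "rules": rules}]}


if __name__ == "__main__":
    # Printed as JSON for inspection only.
    print(json.dumps(build_rule_group("service-health", SERVICE_HEALTH_RULES), indent=2))
```

Keeping the severity in a label is one possible way to let routing decisions depend on it later.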

Async pipeline

| Condition | Intended severity |
|---|---|
| soccer_celery_queue_length growing without bound | P2 |
| soccer_celery_workers_active == 0 | P1: no workers processing |
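
"Growing without bound" is not yet defined precisely. One possible reading, sketched below, is that the queue length trends upward across a whole observation window rather than spiking briefly. The function names, the least-squares trend test, and the minimum sample count are assumptions for illustration, not an agreed detection method.

```python
from typing import Sequence


def queue_is_growing(samples: Sequence[float], min_samples: int = 5) -> bool:
    """Return True if evenly spaced queue-length samples trend upward.

    Uses the least-squares slope over the window and also requires the
    last sample to exceed the first, so a short spike does not qualify.
    """
    n = len(samples)
    if n < max(min_samples, 2):
        return False  # not enough data to judge a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var
    return slope > 0 and samples[-1] > samples[0]


def no_workers(active_workers: float) -> bool:
    """Mirror the P1 condition soccer_celery_workers_active == 0."""
    return active_workers == 0
```

For example, `queue_is_growing([1, 2, 4, 7, 11])` is true, while a flat or oscillating series such as `[0, 10, 0, 10, 0]` is not.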

Data & model (pending Evidently integration)

  • Severe feature distribution drift detected.
  • Abnormal prediction distribution.

Alert routing (intended)

  • P1/P2: notify operator; link to relevant runbook.
  • P3/P4: log only.

Alerts will always reference the relevant runbook in Runbooks.
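
The intended routing can be sketched as a small function. This is a minimal illustration under stated assumptions: `route_alert`, `Routing`, and the action names `"notify"` and `"log"` are hypothetical, and the runbook value is whatever reference the Runbooks section ends up providing.

```python
from dataclasses import dataclass

# Severity groups taken from the routing list above.
NOTIFY_SEVERITIES = {"P1", "P2"}
LOG_ONLY_SEVERITIES = {"P3", "P4"}


@dataclass(frozen=True)
class Routing:
    action: str   # "notify" (operator is paged) or "log" (recorded only)
    runbook: str  # runbook reference carried with every alert


def route_alert(severity: str, runbook: str) -> Routing:
    """Decide how an alert is handled; every alert must name a runbook."""
    if not runbook:
        raise ValueError("every alert must reference a runbook")
    if severity in NOTIFY_SEVERITIES:
        return Routing(action="notify", runbook=runbook)
    if severity in LOG_ONLY_SEVERITIES:
        return Routing(action="log", runbook=runbook)
    raise ValueError(f"unknown severity: {severity!r}")
```

Rejecting alerts without a runbook reference at this point is one way to enforce the rule that alerts always link to a runbook.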