Alerting Strategy
Status: 📋 Planned. Alertmanager is not deployed, and no alert rules or notification channels are active today.
The alert rules and runbooks below are designed but not yet configured in Prometheus or Alertmanager.
Intended alert rules
Service health
| Condition | Intended severity |
|---|---|
| `soccer_model_loaded == 0` | P1: model not available |
| Sustained `soccer_request_duration_seconds` p99 > 500 ms | P2: latency SLO breach |
| Elevated `soccer_errors_total` rate (5xx) | P2: error spike |
| Service unavailable (`/healthcheck/` failing) | P1 |
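As a rough illustration, the first three rows above might translate into a Prometheus rule file like the sketch below. Nothing here is deployed configuration: the alert names, the `for` windows, and the error-rate threshold are assumptions, and the latency rule assumes `soccer_request_duration_seconds` is a histogram that exposes `_bucket` series. The `/healthcheck/` row is left out because it needs an external probe that this page does not describe.

```yaml
# Sketch only: alert names, thresholds and `for` windows are assumptions.
groups:
  - name: soccer-service-health
    rules:
      - alert: SoccerModelNotLoaded
        expr: soccer_model_loaded == 0
        for: 2m            # assumed grace period for restarts
        labels:
          severity: P1
        annotations:
          summary: Model not available

      - alert: SoccerLatencyP99High
        # Assumes the metric is a histogram with _bucket series.
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(soccer_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 10m           # assumed meaning of "sustained"
        labels:
          severity: P2
        annotations:
          summary: p99 latency above 500 ms

      - alert: SoccerErrorRateHigh
        # Placeholder threshold in errors per second. Restricting this to
        # 5xx depends on label names that are not given on this page.
        expr: sum(rate(soccer_errors_total[5m])) > 1
        for: 5m
        labels:
          severity: P2
        annotations:
          summary: Error spike
```

Note that a comparison such as `soccer_model_loaded == 0` returns nothing if the metric disappears entirely, so a real rule set would probably also want an `absent()` check for that case.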
Async pipeline
| Condition | Intended severity |
|---|---|
| `soccer_celery_queue_length` growing without bound | P2 |
| `soccer_celery_workers_active == 0` | P1: no workers processing |
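One way to express these two conditions as Prometheus rules is sketched below. It assumes both metrics are gauges and interprets "growing without bound" as a positive trend held for 30 minutes; the alert names, windows, and that interpretation are all assumptions rather than agreed values.

```yaml
# Sketch only: alert names and time windows are assumptions.
groups:
  - name: soccer-async-pipeline
    rules:
      - alert: SoccerCeleryQueueGrowing
        # deriv() fits a per-second slope to a gauge over the window;
        # a slope that stays positive means the backlog is not draining.
        expr: deriv(soccer_celery_queue_length[15m]) > 0
        for: 30m
        labels:
          severity: P2
        annotations:
          summary: Celery queue length keeps growing

      - alert: SoccerCeleryNoWorkers
        expr: soccer_celery_workers_active == 0
        for: 2m
        labels:
          severity: P1
        annotations:
          summary: No Celery workers processing
```

A slope test is used instead of a fixed queue-length threshold because the page gives no acceptable backlog size; if one is agreed later, a plain `soccer_celery_queue_length > N` rule may be easier to reason about.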
Data & model (pending Evidently integration)
- Severe feature distribution drift detected.
- Abnormal prediction distribution.
Alert routing (intended)
- P1/P2: notify the operator and link to the relevant runbook.
- P3/P4: log only.
Alerts will always reference the relevant runbook in Runbooks.
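The intended routing could look roughly like the following Alertmanager configuration once Alertmanager is deployed. This is a sketch under assumptions: it presumes each alert rule carries a `severity` label with values `P1` to `P4`, the receiver names are made up, and the webhook URL is a hypothetical placeholder for whatever notification channel is eventually chosen.

```yaml
# Sketch only: receiver names and the webhook URL are hypothetical.
route:
  receiver: log-only            # default: P3/P4 and anything unmatched
  routes:
    - matchers:
        - severity =~ "P1|P2"
      receiver: operator

receivers:
  - name: operator
    webhook_configs:
      - url: http://notifier.example.internal/alerts   # hypothetical endpoint
  - name: log-only              # no integrations, so nothing is sent
```

Alertmanager has no dedicated "log only" action; a receiver without integrations simply sends no notification, so those alerts would remain visible only in the Prometheus and Alertmanager UIs. The runbook link itself would live on each alert rule, for example as a `runbook_url` annotation, which is a common convention rather than a required field.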