Incident Response & Playbooks
How to respond to production incidents affecting the SoccerPredictAI system.
There are no active alert channels or dashboards today. Detection is manual: an operator checks `GET /metrics`, `GET /healthcheck/`, and `GET /monitoring/celery/*` by hand.
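
Because detection is manual, it can help to run the endpoint checks in one pass. The following is a minimal sketch, not part of the SoccerPredictAI codebase: the helper names `http_status` and `sweep`, the base URL passed in by the caller, and the rule that only HTTP 200 counts as healthy are all assumptions. The wildcard `/monitoring/celery/*` is represented by `/monitoring/celery/queues`, the one concrete path this page names.

```python
import urllib.error
import urllib.request

# Endpoints named on this page. /monitoring/celery/queues stands in for the
# /monitoring/celery/* wildcard (assumption: it is the most useful one to poll).
ENDPOINTS = ["/healthcheck/", "/metrics", "/monitoring/celery/queues"]


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def sweep(base_url, get_status=http_status):
    """Check every endpoint and return {path: bool}.

    Assumption: a path is treated as healthy only when it returns HTTP 200.
    get_status is injectable so the function can be exercised without a server.
    """
    results = {}
    for path in ENDPOINTS:
        status = get_status(base_url.rstrip("/") + path)
        results[path] = status == 200
    return results
```

Calling `sweep("http://localhost:8000")` (a hypothetical address) returns a dictionary with one boolean per path, so a `False` entry is the cue to start the response process.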
Severity Levels
| Level | Definition | Response Time |
|---|---|---|
| P1 | API down, no predictions served | Immediate (< 15 min) |
| P2 | Degraded latency or partial failure | < 1 hour |
| P3 | Data pipeline failure, no model impact yet | < 4 hours |
| P4 | Non-critical warning, cosmetic issues | Next business day |
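
To make the response times concrete, the table can be turned into a small deadline calculation. This is a hedged sketch: the names `RESPONSE_WINDOWS`, `next_business_day`, and `respond_by` are hypothetical, and reading "next business day" as the end of the next Monday-to-Friday day with no holiday calendar is an assumption.

```python
from datetime import datetime, timedelta

# Response windows taken from the severity table. P4 has no fixed duration
# ("next business day"), so it is handled separately below.
RESPONSE_WINDOWS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
}


def next_business_day(ts):
    """Return 23:59:59 on the next Monday-Friday day after ts.

    Assumption: weekends are skipped, public holidays are not considered.
    """
    day = ts + timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day.replace(hour=23, minute=59, second=59, microsecond=0)


def respond_by(level, detected_at):
    """Return the latest time by which a response should have started."""
    if level == "P4":
        return next_business_day(detected_at)
    return detected_at + RESPONSE_WINDOWS[level]
```

For example, a P1 detected at 10:00 should be picked up by 10:15, while a P4 detected on a Friday rolls over to the following Monday.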
Current Response Process
- Detect: Check manually via `GET /healthcheck/`, `GET /metrics`, or `GET /monitoring/celery/queues`. No automated notifications today.
- Diagnose: Check Prometheus metrics, `kubectl logs`, Celery queue depth, and model load state (`soccer_model_loaded`).
- Mitigate: Apply the matching runbook action (restart the pod, roll back the model, clear the queue).
- Resolve: Confirm that metrics return to baseline.
- Document: Record what happened, what was done, and what changed.
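
For the Diagnose step, the model load state can be read from the text returned by `GET /metrics`. The sketch below parses the Prometheus text exposition format with the standard library only. The helper names `read_gauge` and `model_is_loaded` are hypothetical, and treating a value of 1 as "loaded" is an assumption about how `soccer_model_loaded` is exported.

```python
def read_gauge(metrics_text, name):
    """Return the first sample value for metric `name`, or None if absent."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        if "}" in line:
            # Labeled sample: name{label="value"} 1.0 [timestamp]
            head, _, tail = line.rpartition("}")
            metric = head.split("{", 1)[0]
        else:
            # Unlabeled sample: name 1.0 [timestamp]
            metric, _, tail = line.partition(" ")
        fields = tail.split()
        if metric.strip() == name and fields:
            try:
                return float(fields[0])
            except ValueError:
                return None  # malformed sample value
    return None


def model_is_loaded(metrics_text):
    """Assumption: soccer_model_loaded is a gauge where 1 means loaded."""
    return read_gauge(metrics_text, "soccer_model_loaded") == 1.0
```

Passing the body of a `/metrics` response to `model_is_loaded` gives a quick yes or no before moving on to `kubectl logs` and queue depth. A missing metric is reported as not loaded, which errs on the side of investigating.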
Quick Pointers
- API down → Deployment Recovery Runbook
- Bad model predictions → Model Rollback Runbook
- Data staleness → Backfills Runbook
- General failures → Troubleshooting
- On-call Cheat Sheet