Incident Response & Playbooks

How to respond to production incidents affecting the SoccerPredictAI system.

There are no active alert channels or dashboards today. Detection is manual: check GET /metrics, GET /healthcheck/, and GET /monitoring/celery/*.

Severity Levels

Level	Definition	Response Time
P1	API down, no predictions served	Immediate (< 15 min)
P2	Degraded latency or partial failure	< 1 hour
P3	Data pipeline failure, no model impact yet	< 4 hours
P4	Non-critical warning, cosmetic issues	Next business day
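
The table above can also be kept as data, for example to work out a response deadline when an incident is logged. The following is a minimal Python sketch, not part of the SoccerPredictAI codebase: the names `Severity`, `SEVERITIES`, and `response_deadline` are hypothetical, and P4's "next business day" is left as `None` rather than guessing at a business calendar.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass(frozen=True)
class Severity:
    level: str
    definition: str
    # None means "next business day"; no calendar rule is assumed here.
    respond_within: Optional[timedelta]


# Values copied from the severity table; the structure itself is illustrative.
SEVERITIES: Dict[str, Severity] = {
    "P1": Severity("P1", "API down, no predictions served", timedelta(minutes=15)),
    "P2": Severity("P2", "Degraded latency or partial failure", timedelta(hours=1)),
    "P3": Severity("P3", "Data pipeline failure, no model impact yet", timedelta(hours=4)),
    "P4": Severity("P4", "Non-critical warning, cosmetic issues", None),
}


def response_deadline(level: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None for P4."""
    severity = SEVERITIES[level]
    if severity.respond_within is None:
        return None
    return detected_at + severity.respond_within
```

For instance, a P1 detected at 12:00 would yield a 12:15 deadline under these assumptions.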

Current Response Process

1. Detect: Check GET /healthcheck/, GET /metrics, or GET /monitoring/celery/queues by hand. There are no automated notifications today.
2. Diagnose: Check Prometheus metrics, kubectl logs, Celery queue depth, and the model load state (soccer_model_loaded).
3. Mitigate: Apply the matching runbook action (restart the pod, roll back the model, or clear the queue).
4. Resolve: Confirm that metrics have returned to baseline.
5. Document: Record what happened, what was done, and what changed.
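
Because detection is manual, the first two steps above could be wrapped in a small script. The sketch below is illustrative only and rests on several assumptions: `BASE_URL` is a placeholder, `/metrics` is assumed to return the Prometheus text exposition format, and `soccer_model_loaded` is assumed to be a gauge where `1` means the model is loaded. The helper names (`parse_metric`, `http_get`, `detect`) are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple

BASE_URL = "http://localhost:8000"  # assumption: replace with the real API host


def parse_metric(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name` in Prometheus text format, or None."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{label="value"} 1.0 [timestamp]
            head, _, tail = line.rpartition("}")
            metric_name = head.split("{", 1)[0]
            fields = tail.split()
        else:
            # Unlabelled sample: name 1.0 [timestamp]
            parts = line.split()
            metric_name, fields = parts[0], parts[1:]
        if metric_name == name and fields:
            return float(fields[0])
    return None


def http_get(path: str, timeout: float = 5.0) -> Tuple[int, str]:
    """GET BASE_URL + path; return (status, body). Status 0 means unreachable."""
    try:
        with urllib.request.urlopen(BASE_URL + path, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        return exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        return 0, str(exc)


def detect() -> dict:
    """Run the manual Detect checks and report API and model state."""
    health_status, _ = http_get("/healthcheck/")
    metrics_status, metrics_body = http_get("/metrics")
    loaded = None
    if metrics_status == 200:
        loaded = parse_metric(metrics_body, "soccer_model_loaded")
    return {
        "api_up": health_status == 200,
        # Assumption: a gauge value of 1 means the model is loaded.
        "model_loaded": loaded == 1.0,
    }


if __name__ == "__main__":
    print(detect())
```

A result with `api_up` false would point toward a P1 and the Deployment Recovery Runbook; `model_loaded` false with the API up would suggest checking model load state before serving predictions. Celery queue depth is not covered here because the response shape of GET /monitoring/celery/queues is not specified in this page.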

Quick Pointers

API down → Deployment Recovery Runbook
Bad model predictions → Model Rollback Runbook
Data staleness → Backfills Runbook
General failures → Troubleshooting
On-call Cheat Sheet
