
Incident Response & Playbooks

How to respond to production incidents affecting the SoccerPredictAI system.

There are currently no active alert channels or dashboards. Detection is manual, via GET /metrics, GET /healthcheck/, and GET /monitoring/celery/*.
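Since detection is manual, the three checks can be bundled into one small script. The following is a minimal sketch, not part of the SoccerPredictAI codebase: the base URL is a placeholder, the function names are hypothetical, and treating HTTP 200 as "healthy" is an assumption about how these endpoints behave.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# Paths named in this document; /monitoring/celery/queues is one of the
# /monitoring/celery/* endpoints.
ENDPOINTS = ("/healthcheck/", "/metrics", "/monitoring/celery/queues")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def check_endpoints(
    base_url: str, fetch: Callable[[str], int] = http_status
) -> Dict[str, bool]:
    """Map each endpoint path to True if it answered with HTTP 200 (assumed healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in ENDPOINTS}


if __name__ == "__main__":
    # Placeholder URL; replace with the real API address.
    for path, ok in check_endpoints("http://localhost:8000").items():
        print(f"{'OK  ' if ok else 'FAIL'} {path}")
```

The `fetch` parameter exists only so the check can be exercised without a live service; in normal use the default `http_status` is sufficient.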


Severity Levels

| Level | Definition | Response Time |
|-------|------------|---------------|
| P1 | API down, no predictions served | Immediate (< 15 min) |
| P2 | Degraded latency or partial failure | < 1 hour |
| P3 | Data pipeline failure, no model impact yet | < 4 hours |
| P4 | Non-critical warning, cosmetic issues | Next business day |
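The response times above can be turned into concrete deadlines when an incident is logged. This is an illustrative sketch only: `response_deadline` is a hypothetical helper, and the reading of "next business day" as the end of the next Monday-to-Friday day, with holidays ignored, is an assumption rather than stated policy.

```python
from datetime import datetime, timedelta

# Fixed response windows taken from the severity table.
RESPONSE_WINDOWS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
}


def response_deadline(level: str, detected_at: datetime) -> datetime:
    """Return the latest time by which a response should have started."""
    if level in RESPONSE_WINDOWS:
        return detected_at + RESPONSE_WINDOWS[level]
    if level == "P4":
        # Assumption: "next business day" = end of the next Mon-Fri day,
        # public holidays not considered.
        day = detected_at + timedelta(days=1)
        while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            day += timedelta(days=1)
        return day.replace(hour=23, minute=59, second=59, microsecond=0)
    raise ValueError(f"unknown severity level: {level!r}")
```

For example, a P3 detected at 10:00 is due by 14:00 the same day, while a P4 detected on a Friday is due by the end of the following Monday.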

Current Response Process

  1. Detect: Check manually via GET /healthcheck/, GET /metrics, or GET /monitoring/celery/queues. There are no automated notifications today.
  2. Diagnose: Check Prometheus metrics, kubectl logs, Celery queue depth, and model load state (soccer_model_loaded).
  3. Mitigate: Apply the relevant runbook action (restart the pod, roll back the model, or clear the queue).
  4. Resolve: Confirm that metrics have returned to baseline.
  5. Document: Record what happened, what was done, and what changed.
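For the Diagnose step, the model load state can be read directly from the text returned by GET /metrics. The sketch below parses the Prometheus text exposition format by hand. It assumes that `soccer_model_loaded` is exposed as a gauge and that a value of 1 means "loaded"; both points should be confirmed against the service code. The helper names are hypothetical.

```python
from typing import Optional


def read_gauge(metrics_text: str, name: str) -> Optional[float]:
    """Return the first sample value for `name` in Prometheus text format, or None."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        head, _, rest = line.partition("{")
        if rest:
            # Sample with labels: name{label="x"} value [timestamp]
            metric, value_part = head, rest.rpartition("}")[2]
        else:
            # Sample without labels: name value [timestamp]
            metric, _, value_part = line.partition(" ")
        fields = value_part.split()
        if metric.strip() == name and fields:
            return float(fields[0])
    return None


def model_is_loaded(metrics_text: str) -> bool:
    """Assumption: soccer_model_loaded == 1 means the model is loaded."""
    return read_gauge(metrics_text, "soccer_model_loaded") == 1.0
```

A missing metric is treated the same as an unloaded model, which is the safer default when deciding whether to continue to the Mitigate step.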

Quick Pointers