Deployment Recovery
Steps to recover the serving layer after a failed deployment or crash.
Work in progress
This page is a placeholder. It will be refined after the first production incident.
Symptoms of a Failed Deployment
- kubectl rollout status deployment/soccer-api reports deadline exceeded
- /ready returns 503 or is unreachable
- Prometheus alert: SoccerAPIDown or HighErrorRate
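The readiness symptom can be checked from a terminal. The following is a minimal sketch, not part of the original runbook: the function names `classify_ready` and `probe_ready` are hypothetical, and it assumes `/ready` is served on the same host as the smoke-test URL in Step 3, which this page does not confirm. Override `READY_URL` if the endpoint lives elsewhere.

```shell
#!/bin/sh
# Hypothetical helpers for checking the readiness symptom.

# Map an HTTP status code (as printed by curl's %{http_code}) to a verdict.
# curl prints 000 when it received no HTTP response at all.
classify_ready() {
  case "$1" in
    2[0-9][0-9]) echo "ready" ;;
    000|"")      echo "unreachable" ;;
    *)           echo "not ready (HTTP $1)" ;;
  esac
}

# Probe the readiness endpoint with a 5-second limit.
# The default URL is an assumption; set READY_URL to the real endpoint.
probe_ready() {
  url="${READY_URL:-http://api.time2bet.ru/ready}"
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")
  classify_ready "$code"
}
```

Calling `probe_ready` prints one of the three verdicts. Anything other than "ready" matches the second symptom in the list.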
Step 1 — Check Pod Status
kubectl get pods -n soccer -l app=soccer-api
kubectl describe pod <pod-name> -n soccer
kubectl logs <pod-name> -n soccer --previous
Common causes:
- CrashLoopBackOff → application error (check logs)
- ImagePullBackOff → wrong image tag or registry credentials
- OOMKilled → memory limit too low
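With several pods, it can help to list the status reasons in one pass instead of describing each pod. The sketch below is illustrative only: `likely_cause` and `pod_reasons` are hypothetical names, and the JSONPath expression looks only at the first container of each pod, which may not suit multi-container pods.

```shell
#!/bin/sh
# Hypothetical helper: translate a pod status reason into the likely cause
# from the "Common causes" list.
likely_cause() {
  case "$1" in
    CrashLoopBackOff) echo "application error (check logs)" ;;
    ImagePullBackOff) echo "wrong image tag or registry credentials" ;;
    OOMKilled)        echo "memory limit too low" ;;
    *)                echo "unknown reason: $1" ;;
  esac
}

# Print one line per pod: name, current waiting reason, last termination reason.
# Assumption: only the first container of each pod is inspected.
pod_reasons() {
  kubectl get pods -n soccer -l app=soccer-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.waiting.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
}
```

CrashLoopBackOff and ImagePullBackOff normally appear as waiting reasons, while OOMKilled appears as a termination reason, so both columns are worth reading. For example, `likely_cause OOMKilled` prints the matching entry from the list.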
Step 2 — Roll Back Helm Release
helm history soccer-api -n soccer
helm rollback soccer-api <revision> -n soccer
kubectl rollout status deployment/soccer-api -n soccer
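The rollback and the follow-up rollout check can be wrapped so that a mistyped revision never reaches Helm. This is a sketch under assumptions: the names `run` and `rollback_release`, the `DRY_RUN` switch, and both timeout values are illustrative and not taken from this runbook.

```shell
#!/bin/sh
# Hypothetical wrapper around the Step 2 commands.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Roll back to the given revision, then wait for the deployment to settle.
rollback_release() {
  revision="$1"
  # Accept only a positive integer without leading zeros.
  case "$revision" in
    ''|*[!0-9]*|0*) echo "usage: rollback_release <revision>" >&2; return 2 ;;
  esac
  # Timeouts are assumed values; tune them to the service's startup time.
  run helm rollback soccer-api "$revision" -n soccer --wait --timeout 5m &&
    run kubectl rollout status deployment/soccer-api -n soccer --timeout=120s
}
```

Running `DRY_RUN=1 rollback_release 3` prints the two commands without executing them, which is a cheap way to review the revision number picked from `helm history` before committing to it.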
Step 3 — Verify Recovery
Run a smoke test:
curl -X POST http://api.time2bet.ru/v1/predict \
-H "Content-Type: application/json" \
-d '{"home_team":"Arsenal","away_team":"Chelsea","match_date":"2025-05-10","league":"premier_league"}'
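The curl command above prints the response but exits successfully even when the API answers with an error status. A minimal sketch of a stricter check follows. The function name `smoke_test` and the overridable `CURL` variable are hypothetical, and because this page does not describe the response body, only the HTTP status is checked.

```shell
#!/bin/sh
# Hypothetical wrapper: send the smoke-test request and fail unless the
# API answers with an HTTP 2xx status.
# CURL may be overridden (for example with a stub); it defaults to curl.
smoke_test() {
  code=$("${CURL:-curl}" -s -o /dev/null -w '%{http_code}' \
    -X POST http://api.time2bet.ru/v1/predict \
    -H "Content-Type: application/json" \
    -d '{"home_team":"Arsenal","away_team":"Chelsea","match_date":"2025-05-10","league":"premier_league"}')
  case "$code" in
    2[0-9][0-9]) echo "smoke test passed (HTTP $code)" ;;
    *)           echo "smoke test failed (HTTP ${code:-none})" >&2; return 1 ;;
  esac
}
```

Because `smoke_test` returns a non-zero exit code on failure, it can gate later steps, for example `smoke_test && echo "recovery verified"`.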
Step 4 — Post-Incident
- Document the failure in Incidents
- Open a GitLab issue with root cause and fix
- Update this runbook if steps were missing