
Deployment Recovery

Steps to recover the serving layer after a failed deployment or crash.

Work in progress

This page is a placeholder and will be refined after the first production incident.

Symptoms of a Failed Deployment

kubectl rollout status deployment/soccer-api -n soccer reports that the progress deadline was exceeded
/ready returns 503 or is unreachable
Prometheus alert: SoccerAPIDown or HighErrorRate
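
The /ready symptom can be checked from a terminal with a single probe. The following is a minimal sketch, not an established part of this runbook: the function names and the READY_URL variable are hypothetical, and the default URL is taken from the verification step later on this page. It relies on curl printing 000 for the status code when no HTTP response was received.

```shell
# Map an HTTP status code from the readiness endpoint to a short verdict.
# curl prints 000 when it received no HTTP response at all
# (DNS failure, refused connection, timeout).
classify_ready_code() {
  case "$1" in
    200) echo "ready" ;;
    503) echo "not-ready" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected:$1" ;;
  esac
}

# Probe the endpoint once and print the verdict.
# READY_URL is an assumed override; the default comes from this runbook.
probe_ready() {
  url="${READY_URL:-http://api.time2bet.ru/ready}"
  # curl exits non-zero when unreachable but still prints 000 via -w.
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url") || true
  classify_ready_code "$code"
}
```

Calling probe_ready prints one of ready, not-ready, unreachable, or unexpected:<code>, which maps directly onto the second symptom in the list.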

Step 1: Check Pod Status

kubectl get pods -n soccer -l app=soccer-api
kubectl describe pod <pod-name> -n soccer
kubectl logs <pod-name> -n soccer --previous

Common causes:

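CrashLoopBackOff → the container keeps crashing on start, usually an application error (check the --previous logs)
ImagePullBackOff → wrong image tag or missing registry credentials
OOMKilled → the container exceeded its memory limit (limit too low, or a memory leak)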
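
To see these reasons for all pods at once instead of describing each pod, something like the sketch below may help. The function names are hypothetical, only the first container of each pod is inspected (an assumption that fits single-container pods), and ErrImagePull is included because it is the state Kubernetes reports just before ImagePullBackOff.

```shell
# Translate a container state reason into the likely cause from the list above.
explain_reason() {
  case "$1" in
    CrashLoopBackOff)              echo "application error: check logs with --previous" ;;
    ImagePullBackOff|ErrImagePull) echo "wrong image tag or registry credentials" ;;
    OOMKilled)                     echo "memory limit too low" ;;
    "")                            echo "no failure reason reported" ;;
    *)                             echo "unlisted reason: $1" ;;
  esac
}

# Print "<pod>  <waiting reason>  <last terminated reason>" for every matching pod.
# Missing fields are printed as empty strings by kubectl's jsonpath output.
# Simplification: only containerStatuses[0] is inspected.
list_pod_reasons() {
  kubectl get pods -n soccer -l app=soccer-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.waiting.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
}
```

CrashLoopBackOff and ImagePullBackOff appear as the waiting reason, while OOMKilled appears as the last terminated reason, so both columns are worth reading before passing a value to explain_reason.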

Step 2: Roll Back Helm Release

helm history soccer-api -n soccer
helm rollback soccer-api <revision> -n soccer
kubectl rollout status deployment/soccer-api -n soccer
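
The rollback and the rollout check can be chained so that the second only runs if the first succeeds. This is a sketch under assumptions: the function name is hypothetical, and the 5m and 120s timeouts are illustrative values, not settings taken from this project. When no revision is given, helm rollback returns to the previous release.

```shell
# Roll back the Helm release, then wait for the Deployment to settle.
# Usage: rollback_and_wait [revision]
# Timeouts below are illustrative assumptions; tune them for your cluster.
rollback_and_wait() {
  revision="${1:-}"
  if [ -n "$revision" ]; then
    helm rollback soccer-api "$revision" -n soccer --wait --timeout 5m || return 1
  else
    # No revision given: helm rolls back to the previous release.
    helm rollback soccer-api -n soccer --wait --timeout 5m || return 1
  fi
  # Returns non-zero if the rollout does not complete within the timeout.
  kubectl rollout status deployment/soccer-api -n soccer --timeout=120s
}
```

Pick the revision from the helm history output first, for example rollback_and_wait 3, and treat a non-zero exit status as a sign that the rollback itself needs investigation.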

Step 3: Verify Recovery

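# Print only the HTTP status code; plain curl prints the response body instead.
curl -s -o /dev/null -w "%{http_code}\n" http://api.time2bet.ru/health   # expect 200
curl -s -o /dev/null -w "%{http_code}\n" http://api.time2bet.ru/ready    # expect 200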
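
Right after a rollback the pods may need a little time before both endpoints answer 200, so a single request can give a false alarm. A minimal polling sketch follows; the function name, the default of 10 attempts, and the 3-second delay are assumptions chosen for illustration.

```shell
# Poll a URL until it returns HTTP 200 or the attempts run out.
# Usage: wait_for_200 <url> [attempts] [delay_seconds]
wait_for_200() {
  url="$1"
  attempts="${2:-10}"   # assumed default
  delay="${3:-3}"       # assumed default, in seconds
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -w prints only the status code; 000 means no HTTP response was received.
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url") || true
    if [ "$code" = "200" ]; then
      echo "OK $url (attempt $i)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "FAILED $url (last status: $code)" >&2
  return 1
}
```

For example, `wait_for_200 http://api.time2bet.ru/health && wait_for_200 http://api.time2bet.ru/ready` succeeds only once both endpoints respond with 200.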

Run a smoke test:

curl -X POST http://api.time2bet.ru/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"home_team":"Arsenal","away_team":"Chelsea","match_date":"2025-05-10","league":"premier_league"}'

Step 4: Post-Incident

Document the failure in Incidents
Open a GitLab issue with root cause and fix
Update this runbook if steps were missing