Deployment Recovery

Steps to recover the serving layer after a failed deployment or crash.

Work in progress

This page is a placeholder and will be refined after the first production incident.

Symptoms of a Failed Deployment

  • kubectl rollout status deployment/soccer-api -n soccer reports that the deployment exceeded its progress deadline
  • /ready returns 503 or is unreachable
  • Prometheus alert: SoccerAPIDown or HighErrorRate

Step 1 — Check Pod Status

kubectl get pods -n soccer -l app=soccer-api
kubectl describe pod <pod-name> -n soccer
kubectl logs <pod-name> -n soccer --previous

Common causes:

  • CrashLoopBackOff → application error (check logs)
  • ImagePullBackOff → wrong image tag or registry credentials
  • OOMKilled → memory limit too low
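
The mapping above can be scripted for quick triage. The following is a minimal sketch, not an established tool: diagnose_reason, pod_reasons and diagnose_pod are hypothetical helper names, and the jsonpath query assumes the reason appears either in the container's current waiting state or in its last terminated state (where OOMKilled normally shows up).

```shell
# Map a container state reason to the likely cause from the list above.
# ErrImagePull is an addition: it is the same pull failure before back-off starts.
diagnose_reason() {
  case "$1" in
    CrashLoopBackOff) echo "application error (check logs)" ;;
    ImagePullBackOff|ErrImagePull) echo "wrong image tag or registry credentials" ;;
    OOMKilled) echo "memory limit too low" ;;
    *) echo "unknown (inspect kubectl describe output)" ;;
  esac
}

# Print one reason per line for every container in the pod.
# $1 = pod name, $2 = namespace (defaults to soccer, as used in this runbook).
pod_reasons() {
  kubectl get pod "$1" -n "${2:-soccer}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.state.waiting.reason}{"\n"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Combine both helpers: print "<reason>: <likely cause>" for each non-empty reason.
diagnose_pod() {
  pod_reasons "$@" | while read -r reason; do
    if [ -n "$reason" ]; then
      echo "$reason: $(diagnose_reason "$reason")"
    fi
  done
}
```

Usage: diagnose_pod <pod-name> soccer. A healthy pod prints nothing, because neither field is set.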

Step 2 — Roll Back Helm Release

helm history soccer-api -n soccer
helm rollback soccer-api <revision> -n soccer
kubectl rollout status deployment/soccer-api -n soccer
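
The three commands above can be wrapped into one function so that a failed step stops the sequence. This is a sketch under assumptions: rollback_and_wait is a hypothetical name, the Deployment is assumed to share the release name (as it does in this runbook), and the timeouts are illustrative values. When no revision is passed, Helm rolls back to the previous revision.

```shell
# rollback_and_wait [release] [namespace] [revision]
# Defaults match the names used in this runbook.
rollback_and_wait() {
  local release="${1:-soccer-api}"
  local namespace="${2:-soccer}"

  # Show recent revisions for the incident record.
  helm history "$release" -n "$namespace" --max 5 || return 1

  # ${3:+"$3"} expands to the revision only if one was given;
  # with no revision, helm rollback targets the previous release.
  helm rollback "$release" ${3:+"$3"} -n "$namespace" --wait --timeout 5m || return 1

  # Confirm the Deployment finished rolling out (assumed to be named after the release).
  kubectl rollout status "deployment/$release" -n "$namespace" --timeout=120s
}
```

Usage: rollback_and_wait soccer-api soccer <revision>, or rollback_and_wait with no arguments to return to the previous revision.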

Step 3 — Verify Recovery

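# Print only the HTTP status code; plain curl shows the body, not the status.
curl -s -o /dev/null -w "%{http_code}\n" http://api.time2bet.ru/health   # expect 200
curl -s -o /dev/null -w "%{http_code}\n" http://api.time2bet.ru/ready    # expect 200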

Run a smoke test against the prediction endpoint (--fail makes curl exit non-zero on an HTTP error response):

curl --fail -X POST http://api.time2bet.ru/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"home_team":"Arsenal","away_team":"Chelsea","match_date":"2025-05-10","league":"premier_league"}'
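
Pods can take a few seconds to become ready after a rollback, so a single probe can fail spuriously. The sketch below polls an endpoint until it returns 200. http_status and wait_for_200 are hypothetical helpers, and the attempt count, delay and 5-second timeout are illustrative defaults rather than tuned values.

```shell
# Print only the HTTP status code for a URL (curl prints 000 if it cannot connect).
http_status() {
  curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"
}

# wait_for_200 <url> [attempts] [delay-seconds]
# Returns 0 as soon as the URL answers 200, 1 if every attempt fails.
wait_for_200() {
  local url="$1"
  local attempts="${2:-10}"
  local delay="${3:-3}"
  local i=1
  local code
  while [ "$i" -le "$attempts" ]; do
    code="$(http_status "$url")"
    if [ "$code" = "200" ]; then
      echo "OK $url"
      return 0
    fi
    echo "attempt $i/$attempts: $url returned ${code:-no response}" >&2
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Usage: wait_for_200 http://api.time2bet.ru/health && wait_for_200 http://api.time2bet.ru/ready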

Step 4 — Post-Incident

  1. Document the failure in Incidents
  2. Open a GitLab issue with root cause and fix
  3. Update this runbook if steps were missing