| Failure mode | Detection | Impact | Recovery | Mitigation | Status |
|---|---|---|---|---|---|
| WhoScored source unavailable | Celery task fails; Airflow DAG marked failed | No new data ingested; existing predictions unaffected | Retry via Airflow backfill once the source recovers; check for layout changes | Retry policy on Celery task; Airflow alerting (📋 planned) | ✅ Retry logic |
| Scraper broken (layout change) | Celery task fails with parse error; data gap in PostgreSQL | New data not ingested; model serves stale predictions | Update scraper selectors; trigger manual backfill | Great Expectations detects schema shifts in raw data; Airflow failure visible in UI | ✅ GE validation |
| Selenoid host unreachable | Celery task fails on browser session init; error logged | Scraping completely blocked until resolved | Verify the Selenoid host is running; restart the Selenoid service manually | Selenoid is operator-managed; health-check script in runbooks | 🚧 Manual only |
| Bad raw data schema | Great Expectations `validate_raw` stage fails | DVC pipeline blocked at validation gate; no downstream processing | Investigate the data-source change; update the GE suite if intentional; re-run the pipeline | GE suite enforced as a blocking DVC stage gate | ✅ Implemented |
| Raw export failed (PostgreSQL → MinIO) | DVC stage `load_data_from_sources` fails; no new Parquet in MinIO | Pipeline cannot proceed; model not retrained on new data | Check DB connectivity and MinIO credentials; re-run `dvc repro load_data_from_sources` | MinIO health probe in K8s; DVC stage output check | ✅ DVC stage |
| DVC stage failure (any stage) | `dvc repro` exits non-zero; stage shown as failed | Pipeline halted at the failed stage; no artifact update | Inspect stage logs; fix the root cause; re-run `dvc repro` | DVC stage-level output locking; GE gates block invalid data | ✅ DVC |
| Model regression (metric drop) | MLflow metric comparison in `classification_models` stage | Degraded prediction quality; may be promoted if the gate is bypassed | Roll back the MLflow registry alias to the previous champion; re-examine training data | Manual promotion review (gate is currently manual) | 🚧 Manual gate |
| MLflow server unavailable | FastAPI fails to resolve `model_uri`; worker startup fails | New worker processes cannot load the model; in-flight workers unaffected until restart | Restart the MLflow pod; workers reload the model on the next request | K8s liveness probe on MLflow pod; model already loaded in running workers | 🚧 No auto-recovery |
| FastAPI pod crash | K8s readiness/liveness probe fails; Ingress stops routing | API endpoints unreachable | K8s restarts the pod automatically via restart policy | K8s deployment restart policy; `/healthcheck/` liveness probe | ✅ K8s probes |
| RabbitMQ unavailable | FastAPI cannot enqueue tasks; Celery workers lose their connection | All inference (sync + async) fails with 500/503; batch ops blocked | Restart the RabbitMQ pod; workers reconnect automatically on restart | K8s liveness probe; Celery auto-reconnect on broker recovery | 🚧 Single broker |
| Redis unavailable | Cache GET/SET fails; logged as an error in the worker | Cache-miss path for all requests; no data loss; inference still works (slower) | Restart the Redis pod; the cache warms automatically on subsequent requests | K8s liveness probe; FastAPI handles Redis exceptions gracefully | ✅ Degraded mode |
| Stale cache (wrong model version served) | No automated detection today | Predictions from the previous champion model are served for the TTL duration after a new model is promoted. Impact is bounded: stale entries expire automatically; predictions are probabilistic and advisory; no data corruption or system failure results. The exposure window equals the configured TTL. | Flush the Redis cache manually (`redis-cli FLUSHDB`) immediately after model promotion if tight consistency is required, and trigger re-inference for active matches; otherwise stale entries expire on their own. | 📋 Planned: invalidate or namespace cache keys on a model-promotion event. Today this is an accepted eventual-consistency window, not an incident. | 📋 Planned |
| Secrets injection failure (SOPS/K8s) | Pod fails to start; env vars missing; logged in pod events | Services that depend on the missing secret are fully unavailable | Re-run SOPS decryption in CI; re-apply the K8s secret manifest; restart affected pods | CI pipeline validates secret decryption before deploy; K8s secret version tracking | ✅ CI validation |
| Bad deployment config (Helm values error) | Helm install/upgrade fails in CI; pods in CrashLoopBackOff | New deployment does not roll out; previous version keeps running | Roll back the Helm release (`helm rollback`); fix values; redeploy | CI runs `helm lint` and `helm template` before apply; canary deployment not yet configured | ✅ Helm lint in CI |