# Failure Modes

This page documents known failure modes for each major system component, how they are detected, what impact they have, how to recover, and what preventive controls exist.

Use this page as the starting point for incident response before consulting individual runbooks.


## Failure Mode Table

Status legend: ✅ implemented · 🚧 partial or manual · 📋 planned.

| Failure | Detection | Impact | Recovery | Preventive control | Status |
| --- | --- | --- | --- | --- | --- |
| WhoScored source unavailable | Celery task fails; Airflow DAG marked failed | No new data ingested; existing predictions unaffected | Retry via Airflow backfill once the source recovers; check for layout changes | Retry policy on Celery task; Airflow alerting (📋 not yet configured) | ✅ Retry logic |
| Scraper broken (layout change) | Celery task fails with parse error; data gap in PostgreSQL | New data not ingested; model serves stale predictions | Update scraper selectors; trigger manual backfill | Great Expectations detects schema shifts in raw data; Airflow failure visible in UI | ✅ GE validation |
| Selenoid host unreachable | Celery task fails on browser session init; error logged | Scraping fully blocked until resolved | Verify the Selenoid host is running; restart the Selenoid service manually | Selenoid is operator-managed; health check script in runbooks | 🚧 Manual only |
| Bad raw data schema | Great Expectations `validate_raw` stage fails | DVC pipeline blocked at the validation gate; no downstream processing | Investigate the data source change; update the GE suite if the change is intentional; re-run the pipeline | GE suite enforced as a blocking DVC stage gate | ✅ Implemented |
| Raw export failed (PostgreSQL → MinIO) | DVC stage `load_data_from_sources` fails; no new Parquet in MinIO | Pipeline cannot proceed; model not retrained on new data | Check DB connectivity and MinIO credentials; re-run `dvc repro load_data_from_sources` (see the DVC sketch after this table) | MinIO health probe in K8s; DVC stage output check | ✅ DVC stage |
| DVC stage failure (any stage) | `dvc repro` exits non-zero; stage shown as failed | Pipeline halted at the failed stage; no artifact update | Inspect stage logs; fix the root cause; re-run `dvc repro` | DVC stage-level output locking; GE gates block invalid data | ✅ DVC |
| Model regression (metric drop) | MLflow metric comparison in `classification_models` stage | Degraded prediction quality; may be promoted if the gate is bypassed | Roll back the MLflow registry alias to the previous champion (see the alias sketch after this table); re-examine training data | Manual promotion review (the gate is currently manual) | 🚧 Manual gate |
| MLflow server unavailable | FastAPI fails to resolve `model_uri`; worker startup fails | New worker processes cannot load the model; in-flight workers unaffected until restart | Restart the MLflow pod; workers reload the model on the next request | K8s liveness probe on MLflow pod; model already loaded in running workers | 🚧 No auto-recovery |
| FastAPI pod crash | K8s readiness/liveness probe fails; Ingress stops routing | API endpoints unreachable | K8s restarts the pod automatically via its restart policy | K8s deployment restart policy; `/healthcheck/` liveness probe | ✅ K8s probes |
| RabbitMQ unavailable | FastAPI cannot enqueue tasks; Celery workers lose their connection | All inference (sync and async) fails with 500/503; batch operations blocked | Restart the RabbitMQ pod; workers reconnect automatically on restart | K8s liveness probe; Celery auto-reconnects on broker recovery | 🚧 Single broker |
| Redis unavailable | Cache GET/SET fails; logged as an error in the worker | Cache-miss path for all requests; no data loss; inference still works (slower) | Restart the Redis pod; the cache warms automatically on subsequent requests | K8s liveness probe; FastAPI handles Redis exceptions gracefully | ✅ Degraded mode |
| Stale cache (wrong model version served) | No automated detection today | Predictions from the previous champion are served for up to the cache TTL after a new model is promoted. Impact is bounded: stale entries expire on their own, predictions are probabilistic and advisory, and no data corruption or system failure results. | Flush the Redis cache manually (`redis-cli FLUSHDB`) right after model promotion if tight consistency is required (see the flush sketch after this table); trigger re-inference for active matches. Otherwise let entries expire. | 📋 Planned: invalidate or namespace cache keys on the model promotion event. Today this is an accepted eventual-consistency window, not an incident. | 📋 Planned |
| Secrets injection failure (SOPS/K8s) | Pod fails to start; env vars missing; logged in pod events | Services that depend on the missing secret are fully unavailable | Re-run SOPS decryption in CI; re-apply the K8s secret manifest; restart affected pods | CI pipeline validates secret decryption before deploy; K8s secret version tracking | ✅ CI validation |
| Bad deployment config (Helm values error) | Helm install/upgrade fails in CI; pods in CrashLoopBackOff | New deployment does not roll out; the previous version keeps running | Roll back the Helm release (`helm rollback`, see the sketch after this table); fix the values; redeploy | CI runs `helm lint` and `helm template` before apply; canary deployment not yet configured | ✅ Helm lint in CI |
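
If the raw export or any other DVC stage fails, the recovery path is the same: find the failed stage, fix the root cause, and re-run from there. A minimal sketch using the `load_data_from_sources` stage named above; the commands are standard DVC CLI, run from the pipeline repository root.

```bash
# Show which stages are changed, failed, or out of date before re-running.
dvc status

# Re-run only the failed export stage once DB and MinIO access are restored.
dvc repro load_data_from_sources

# Or resume the whole pipeline; unchanged stages are skipped automatically.
dvc repro
```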
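
For a model regression that slipped past the manual gate, the rollback is an alias move in the MLflow registry. A hedged sketch, assuming the alias API available in MLflow ≥ 2.3; the tracking URI, the model name `match-predictor`, the alias `champion`, and the version number are placeholders to replace with values from your registry.

```bash
# Placeholder URI; point this at the in-cluster MLflow service.
export MLFLOW_TRACKING_URI=http://mlflow:5000

python - <<'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Re-point the serving alias at the last known-good model version.
# Name, alias, and version are hypothetical placeholders.
client.set_registered_model_alias(
    name="match-predictor",
    alias="champion",
    version="3",
)
EOF
```

Depending on when workers resolve `model_uri`, a restart of the serving pods may be needed before the restored champion is actually served.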
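
For the stale-cache row, the manual flush is a one-liner. The namespace and pod name below are assumptions about this deployment; `FLUSHDB` clears only the currently selected Redis database, which is sufficient when the cache has its own DB.

```bash
# Flush the prediction cache right after promoting a new model.
# Namespace and pod name are placeholders; adjust to the deployment.
kubectl exec -n default redis-0 -- redis-cli FLUSHDB
```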
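
For a bad deployment, the rollback flow is standard Helm. The release and deployment names are placeholders.

```bash
# List revisions and find the last healthy one.
helm history football-predictor

# Roll back; with no revision argument Helm returns to the previous release.
helm rollback football-predictor

# Confirm the pods come back up. Deployment name is a placeholder.
kubectl rollout status deployment/football-predictor
```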

## Failure Severity Classification

| Severity | Criteria | Examples |
| --- | --- | --- |
| P1 — Full outage | API completely unreachable, or all inference fails | RabbitMQ down; all FastAPI pods crashed |
| P2 — Degraded serving | API responds but inference is slow or partially failing | Redis down (cache misses); MLflow unavailable (new workers cannot start) |
| P3 — Data pipeline blocked | New data not ingested; the existing model continues to serve | Selenoid unreachable; scraper broken; raw export failed |
| P4 — Offline pipeline only | Training cannot produce a new model; serving unaffected | DVC stage failure; model regression |
| P5 — Visibility only | Metrics or dashboards unavailable; no functional impact | Prometheus scrape failure; Grafana pod down |
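
In practice, triage starts by separating serving failures (P1/P2) from pipeline failures (P3/P4). A quick first-look sketch, assuming `kubectl` access; the API hostname is a placeholder.

```bash
# Pod health across the cluster: CrashLoopBackOff or Pending pods in the
# serving path point at P1/P2; pipeline pods point at P3/P4.
kubectl get pods --all-namespaces

# Probe the API directly; /healthcheck/ is the liveness endpoint from the
# failure mode table. Hostname is a placeholder.
curl -fsS https://api.example.com/healthcheck/
```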

## Known Systemic Risks

| Risk | Root cause | Mitigation |
| --- | --- | --- |
| Single-node K8s cluster | A VPS failure means a full outage | Documented recovery runbook; Helm charts portable to managed K8s |
| External Selenoid host | Runs outside K8s; no K8s health probes | Manual monitoring; scraping failures surfaced via Airflow |
| No automated model promotion gate | A regression may go undetected if not manually reviewed | Mandatory manual review before the alias update; metrics logged in MLflow |
| No alerting rules (Prometheus → alertmanager) | Failures are not detected proactively | 📋 Planned: alertmanager plus alerting rules (see the sketch after this table) |
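The missing alerting rules are why most failures above are detected reactively. As a sketch of the planned work only (nothing here exists yet): a first Prometheus rule could page on a down FastAPI target and reuse the severity labels from the classification table. The job label, duration, and file name are placeholders.

```bash
# Sketch of a planned alerting rule; job label and thresholds are placeholders.
cat <<'EOF' > fastapi-alerts.yaml
groups:
  - name: serving
    rules:
      - alert: FastAPITargetDown
        expr: up{job="fastapi"} == 0
        for: 2m
        labels:
          severity: P1
        annotations:
          summary: "FastAPI target down; API likely unreachable (P1)"
EOF

# Validate the rule file before loading it into Prometheus.
promtool check rules fastapi-alerts.yaml
```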

- Runtime View — what the system does under normal conditions
- Deployment View — physical topology and single-node limits
- Security — failure modes specific to secrets and access
- Runbooks — step-by-step recovery procedures