# Failure Modes

This page documents known failure modes for each major system component, how they are detected, what impact they have, how to recover, and what preventive controls exist.

Use this page as the starting point for incident response before consulting individual runbooks.


## Failure Mode Table

Status legend: ✅ implemented · 🚧 partial or manual · 📋 planned.

| Failure | Detection | Impact | Recovery | Preventive control | Status |
| --- | --- | --- | --- | --- | --- |
| WhoScored source unavailable | Celery task fails; Airflow DAG marked failed | No new data ingested; existing predictions unaffected | Retry via Airflow backfill once the source recovers; check for layout changes | Retry policy on Celery task; Airflow alerting (📋 not yet configured) | ✅ Retry logic |
| Scraper broken (layout change) | Celery task fails with parse error; data gap in PostgreSQL | New data not ingested; model serves stale predictions | Update scraper selectors; trigger manual backfill | Great Expectations detects schema shifts in raw data; Airflow failure visible in UI | ✅ GE validation |
| Selenoid host unreachable | Celery task fails on browser session init; error logged | Scraping fully blocked until resolved | Verify the Selenoid host is running; restart the Selenoid service manually | Selenoid is operator-managed; health check script in runbooks | 🚧 Manual only |
| Bad raw data schema | Great Expectations `validate_raw` stage fails | DVC pipeline blocked at the validation gate; no downstream processing | Investigate the data source change; update the GE suite if the change is intentional; re-run the pipeline | GE suite enforced as a blocking DVC stage gate | ✅ Implemented |
| Raw export failed (PostgreSQL → MinIO) | DVC stage `load_data_from_sources` fails; no new Parquet in MinIO | Pipeline cannot proceed; model not retrained on new data | Check DB connectivity and MinIO credentials; re-run `dvc repro load_data_from_sources` (see the DVC sketch after this table) | MinIO health probe in K8s; DVC stage output check | ✅ DVC stage |
| DVC stage failure (any stage) | `dvc repro` exits non-zero; stage shown as failed | Pipeline halted at the failed stage; no artifact update | Inspect stage logs; fix the root cause; re-run `dvc repro` | DVC stage-level output locking; GE gates block invalid data | ✅ DVC |
| Model regression (metric drop) | MLflow metric comparison in `classification_models` stage | Degraded prediction quality; may be promoted if the gate is bypassed | Roll back the MLflow registry alias to the previous champion (see the alias sketch after this table); re-examine training data | Manual promotion review (the gate is currently manual) | 🚧 Manual gate |
| MLflow server unavailable | FastAPI fails to resolve `model_uri`; worker startup fails | New worker processes cannot load the model; in-flight workers unaffected until restart | Restart the MLflow pod; workers reload the model on the next request | K8s liveness probe on MLflow pod; model already loaded in running workers | 🚧 No auto-recovery |
| FastAPI pod crash | K8s readiness/liveness probe fails; Ingress stops routing | API endpoints unreachable | K8s restarts the pod automatically via its restart policy | K8s deployment restart policy; `/healthcheck/` liveness probe | ✅ K8s probes |
| RabbitMQ unavailable | FastAPI cannot enqueue tasks; Celery workers lose their connection | All inference (sync and async) fails with 500/503; batch operations blocked | Restart the RabbitMQ pod; workers reconnect automatically on restart | K8s liveness probe; Celery auto-reconnects on broker recovery | 🚧 Single broker |
| Redis unavailable | Cache GET/SET fails; logged as an error in the worker | Cache-miss path for all requests; no data loss; inference still works (slower) | Restart the Redis pod; the cache warms automatically on subsequent requests | K8s liveness probe; FastAPI handles Redis exceptions gracefully | ✅ Degraded mode |
| Stale cache (wrong model version served) | No automated detection today | Predictions from the previous champion are served for up to the cache TTL after a new model is promoted. Impact is bounded: stale entries expire on their own, predictions are probabilistic and advisory, and no data corruption or system failure results. | Flush the Redis cache manually (`redis-cli FLUSHDB`) right after model promotion if tight consistency is required (see the flush sketch after this table); trigger re-inference for active matches. Otherwise let entries expire. | 📋 Planned: invalidate or namespace cache keys on the model promotion event. Today this is an accepted eventual-consistency window, not an incident. | 📋 Planned |
| Secrets injection failure (SOPS/K8s) | Pod fails to start; env vars missing; logged in pod events | Services that depend on the missing secret are fully unavailable | Re-run SOPS decryption in CI; re-apply the K8s secret manifest; restart affected pods | CI pipeline validates secret decryption before deploy; K8s secret version tracking | ✅ CI validation |
| Bad deployment config (Helm values error) | Helm install/upgrade fails in CI; pods in CrashLoopBackOff | New deployment does not roll out; the previous version keeps running | Roll back the Helm release (`helm rollback`, see the sketch after this table); fix the values; redeploy | CI runs `helm lint` and `helm template` before apply; canary deployment not yet configured | ✅ Helm lint in CI |
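
If the raw export or any other DVC stage fails, the recovery path is the same: find the failed stage, fix the root cause, and re-run from there. A minimal sketch using the `load_data_from_sources` stage named above; the commands are standard DVC CLI, run from the pipeline repository root.

```bash
# Show which stages are changed, failed, or out of date before re-running.
dvc status

# Re-run only the failed export stage once DB and MinIO access are restored.
dvc repro load_data_from_sources

# Or resume the whole pipeline; unchanged stages are skipped automatically.
dvc repro
```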
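
For a model regression that slipped past the manual gate, the rollback is an alias move in the MLflow registry. A hedged sketch, assuming the alias API available in MLflow ≥ 2.3; the tracking URI, the model name `match-predictor`, the alias `champion`, and the version number are placeholders to replace with values from your registry.

```bash
# Placeholder URI; point this at the in-cluster MLflow service.
export MLFLOW_TRACKING_URI=http://mlflow:5000

python - <<'EOF'
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Re-point the serving alias at the last known-good model version.
# Name, alias, and version are hypothetical placeholders.
client.set_registered_model_alias(
    name="match-predictor",
    alias="champion",
    version="3",
)
EOF
```

Depending on when workers resolve `model_uri`, a restart of the serving pods may be needed before the restored champion is actually served.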
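
For the stale-cache row, the manual flush is a one-liner. The namespace and pod name below are assumptions about this deployment; `FLUSHDB` clears only the currently selected Redis database, which is sufficient when the cache has its own DB.

```bash
# Flush the prediction cache right after promoting a new model.
# Namespace and pod name are placeholders; adjust to the deployment.
kubectl exec -n default redis-0 -- redis-cli FLUSHDB
```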
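
For a bad deployment, the rollback flow is standard Helm. The release and deployment names are placeholders.

```bash
# List revisions and find the last healthy one.
helm history football-predictor

# Roll back; with no revision argument Helm returns to the previous release.
helm rollback football-predictor

# Confirm the pods come back up. Deployment name is a placeholder.
kubectl rollout status deployment/football-predictor
```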

## Failure Severity Classification

| Severity | Criteria | Examples |
| --- | --- | --- |
| P1 — Full outage | API completely unreachable, or all inference fails | RabbitMQ down; all FastAPI pods crashed |
| P2 — Degraded serving | API responds but inference is slow or partially failing | Redis down (cache misses); MLflow unavailable (new workers cannot start) |
| P3 — Data pipeline blocked | New data not ingested; the existing model continues to serve | Selenoid unreachable; scraper broken; raw export failed |
| P4 — Offline pipeline only | Training cannot produce a new model; serving unaffected | DVC stage failure; model regression |
| P5 — Visibility only | Metrics or dashboards unavailable; no functional impact | Prometheus scrape failure; Grafana pod down |
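
In practice, triage starts by separating serving failures (P1/P2) from pipeline failures (P3/P4). A quick first-look sketch, assuming `kubectl` access; the API hostname is a placeholder.

```bash
# Pod health across the cluster: CrashLoopBackOff or Pending pods in the
# serving path point at P1/P2; pipeline pods point at P3/P4.
kubectl get pods --all-namespaces

# Probe the API directly; /healthcheck/ is the liveness endpoint from the
# failure mode table. Hostname is a placeholder.
curl -fsS https://api.example.com/healthcheck/
```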

## Known Systemic Risks

| Risk | Root cause | Mitigation |
| --- | --- | --- |
| Single-node K8s cluster | A VPS failure means a full outage | Documented recovery runbook; Helm charts portable to managed K8s |
| External Selenoid host | Runs outside K8s; no K8s health probes | Manual monitoring; scraping failures surfaced via Airflow |
| No automated model promotion gate | A regression may go undetected if not manually reviewed | Mandatory manual review before the alias update; metrics logged in MLflow |
| No alerting rules (Prometheus → alertmanager) | Failures are not detected proactively | 📋 Planned: alertmanager plus alerting rules (see the sketch after this table) |
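The missing alerting rules are why most failures above are detected reactively. As a sketch of the planned work only (nothing here exists yet): a first Prometheus rule could page on a down FastAPI target and reuse the severity labels from the classification table. The job label, duration, and file name are placeholders.

```bash
# Sketch of a planned alerting rule; job label and thresholds are placeholders.
cat <<'EOF' > fastapi-alerts.yaml
groups:
  - name: serving
    rules:
      - alert: FastAPITargetDown
        expr: up{job="fastapi"} == 0
        for: 2m
        labels:
          severity: P1
        annotations:
          summary: "FastAPI target down; API likely unreachable (P1)"
EOF

# Validate the rule file before loading it into Prometheus.
promtool check rules fastapi-alerts.yaml
```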

- Runtime View — what the system does under normal conditions
- Deployment View — physical topology and single-node limits
- Security — failure modes specific to secrets and access
- Runbooks — step-by-step recovery procedures