# Data Layer Failure Modes
This page documents failure scenarios scoped to the data layer: scraping, ingestion, export, validation, and storage. System-wide failure modes (serving, model registry, infrastructure) are in Architecture: Failure Modes.
## Failure mode table
| Failure | Detection | Impact | Recovery | Status |
|---|---|---|---|---|
| WhoScored source unavailable | Celery task fails; Airflow DAG marked failed | No new data ingested; existing versioned datasets unaffected; model continues to serve stale predictions | Wait for source recovery; trigger Airflow backfill once available | ✅ Retry logic on Celery task |
| Scraper broken (layout change) | Celery task fails with parse error; data gap in PostgreSQL | New data not ingested; freshness breach accumulates | Update scraper selectors; trigger manual backfill for the affected range | ✅ GE validate_raw detects downstream schema impact |
| Selenoid host unreachable | Celery task fails on browser session init | Scraping completely blocked | Verify and restart Selenoid on the operator-managed host; no K8s automation is available | 🚧 Manual only — Selenoid is outside K8s |
| Raw export failed (PostgreSQL → MinIO) | DVC stage load_data_from_sources exits non-zero; no new parquet in MinIO | Pipeline blocked at export; model not updated from new data | Check PostgreSQL connectivity and MinIO credentials; re-run dvc repro load_data_from_sources | ✅ DVC stage gate |
| Schema drift (raw data) | GE validate_raw suite fails; dvc repro stops | No downstream processing; pipeline blocked at validation gate | Investigate WhoScored source change; update GE suite if the change is intentional; re-run pipeline | ✅ Implemented |
| Validation failure (interim or features) | GE validate_finished, validate_future, or validate_features fails | Pipeline blocked at that stage; training cannot proceed | Inspect validation report; fix preprocessing or feature logic; re-run | ✅ Implemented |
| MinIO unavailable | DVC pull/push fails; load_data_from_sources cannot write output | Pipeline cannot read or write any versioned dataset | Check MinIO pod status in K8s (ds namespace); restart pod; re-run dvc repro | 🚧 K8s liveness probe; no automated recovery |
| Freshness breach (stale dataset) | Most recent match in finished.parquet is older than expected | Model predictions based on outdated match history; prediction quality degrades silently | Trigger Airflow ingestion run; run dvc repro to produce an updated snapshot | 📋 No automated alert — manual inspection only |
| Erroneous backfill / bad replay | GE validation fails on backfilled data; or metrics regress after retraining | New dataset version is invalid or worse than the previous one; model may degrade if promoted | Revert to the pre-backfill Git commit; dvc pull restores the prior dataset; re-promote the previous MLflow model | ✅ Rollback via Git + DVC |
| Data leakage via temporal join | Temporal split unit tests in tests/unit/ catch it at test time | Inflated training metrics; model fails in production | Fix join condition in preprocessing or feature code; re-run full pipeline | ✅ Tested with hypothesis and unit tests |
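To illustrate the temporal-leakage check in the last row, here is a minimal sketch of the kind of assertion such a unit test might make. The row shape and names (`match_date`, `feature_cutoff`, `assert_no_temporal_leakage`) are assumptions for illustration, not the project's actual schema or test helpers:

```python
from datetime import date

def assert_no_temporal_leakage(rows):
    """rows: iterable of (match_date, feature_cutoff) pairs.

    Features must be computed strictly before the match they describe;
    a cutoff on or after the match date means future information leaked
    into the training row via the temporal join.
    """
    leaks = [(m, c) for m, c in rows if c >= m]
    if leaks:
        raise AssertionError(f"{len(leaks)} row(s) use future information: {leaks}")

# A same-day feature cutoff already counts as leakage under this rule.
assert_no_temporal_leakage([(date(2024, 5, 12), date(2024, 5, 11))])  # passes
```

A property-based tool such as hypothesis can then generate arbitrary date pairs to probe the join logic, which is consistent with the status noted in the table.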
## Failure severity in the data layer
| Severity | Criteria |
|---|---|
| P3 — Data pipeline blocked | Scraping fails, export fails, or a validation gate blocks; serving continues on the existing model, unaffected |
| P4 — Freshness degradation | No new data; existing model serves; quality degrades over time |
| P4 — Offline pipeline blocked | DVC stage failure during retraining; serving unaffected |
Data layer failures do not directly cause serving outages. The serving layer uses the last successfully registered model artifact and the last successfully exported batch inference features. The impact is prediction staleness, not downtime.
## What is NOT a data layer failure
The following are out of scope for this page:
- FastAPI or Celery failures affecting serving — see Architecture: Failure Modes
- MLflow unavailability affecting model loading — see Architecture: Failure Modes
- Redis unavailability — not a data layer component
## Known limitations
- No automated freshness alerting. Stale data is detected manually.
- No cached HTML fallback. If a scrape fails, there is no archive to replay from.
- Selenoid is operator-managed outside K8s; no K8s health probe covers it.
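Given the first limitation, the manual freshness inspection can at least be scripted. A minimal sketch, assuming the caller has already read the newest match date out of finished.parquet; the 7-day window and the helper name `is_stale` are illustrative assumptions, not the project's freshness policy:

```python
from datetime import date, timedelta

def is_stale(latest_match: date, today: date, max_age_days: int = 7) -> bool:
    """True when the newest match in the snapshot is older than the
    allowed freshness window, i.e. an ingestion run is overdue."""
    return (today - latest_match) > timedelta(days=max_age_days)

# Snapshot 19 days behind the current date -> stale; trigger an Airflow run.
print(is_stale(date(2024, 5, 1), date(2024, 5, 20)))  # True
```

Wrapping a check like this in a scheduled job would close the alerting gap, but as the table notes, that automation does not exist yet.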
## Related
- Architecture: Failure Modes — full system failure table
- Runbook: Troubleshooting — step-by-step recovery
- Runbook: Backfills — reprocessing after scraper/schema fixes
- Data Contracts — validation gate behavior
- Backfills & Freshness — freshness policy and recovery process