Data Layer Failure Modes

This page documents failure scenarios scoped to the data layer: scraping, ingestion, export, validation, and storage. System-wide failure modes (serving, model registry, infrastructure) are in Architecture: Failure Modes.


Failure mode table

| Failure | Detection | Impact | Recovery | Status |
| --- | --- | --- | --- | --- |
| WhoScored source unavailable | Celery task fails; Airflow DAG marked failed | No new data ingested; existing versioned datasets unaffected; model continues to serve stale predictions | Wait for source recovery; trigger an Airflow backfill once available | ✅ Retry logic on Celery task |
| Scraper broken (layout change) | Celery task fails with a parse error; data gap in PostgreSQL | New data not ingested; freshness breach accumulates | Update scraper selectors; trigger a manual backfill for the affected range | ✅ GE `validate_raw` detects downstream schema impact |
| Selenoid host unreachable | Celery task fails on browser session init | Scraping completely blocked | Verify and restart Selenoid on the operator-managed host; no K8s automation available | 🚧 Manual only (Selenoid is outside K8s) |
| Raw export failed (PostgreSQL → MinIO) | DVC stage `load_data_from_sources` exits non-zero; no new Parquet in MinIO | Pipeline blocked at export; model not updated from new data | Check PostgreSQL connectivity and MinIO credentials; re-run `dvc repro load_data_from_sources` | ✅ DVC stage gate |
| Schema drift (raw data) | GE `validate_raw` suite fails; `dvc repro` stops | No downstream processing; pipeline blocked at the validation gate | Investigate the WhoScored source change; update the GE suite if the change is intentional; re-run the pipeline | ✅ Implemented |
| Validation failure (interim or features) | GE `validate_finished`, `validate_future`, or `validate_features` fails | Pipeline blocked at that stage; training cannot proceed | Inspect the validation report; fix preprocessing or feature logic; re-run | ✅ Implemented |
| MinIO unavailable | DVC pull/push fails; `load_data_from_sources` cannot write output | Pipeline cannot read or write any versioned dataset | Check MinIO pod status in K8s (`ds` namespace); restart the pod; re-run `dvc repro` | 🚧 K8s liveness probe; no automated recovery |
| Freshness breach (stale dataset) | Most recent match in `finished.parquet` is older than expected | Model predictions based on outdated match history; prediction quality degrades silently | Trigger an Airflow ingestion run; run `dvc repro` to produce an updated snapshot | 📋 No automated alert; manual inspection only |
| Erroneous backfill / bad replay | GE validation fails on backfilled data, or metrics regress after retraining | New dataset version is invalid or worse than the previous one; model may degrade if promoted | Revert to the pre-backfill Git commit; `dvc pull` restores the prior dataset; re-promote the previous MLflow model | ✅ Rollback via Git + DVC |
| Data leakage via temporal join | Temporal split unit tests in `tests/unit/` catch it at test time | Inflated training metrics; model fails in production | Fix the join condition in preprocessing or feature code; re-run the full pipeline | ✅ Tested with `hypothesis` and unit tests |
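
The "Retry logic on Celery task" status in the first row covers transient source outages only; a parse error from a layout change is deliberately not retried. The sketch below shows roughly what task-level retries of this kind look like. The task name, broker URL, exception types, and retry limits are illustrative assumptions rather than the project's actual configuration, and the real scraper drives a browser session through Selenoid rather than plain HTTP.

```python
# Hedged sketch only: task name, broker URL, exceptions, and retry limits are assumptions.
import requests
from celery import Celery
from requests.exceptions import ConnectionError, Timeout

app = Celery("scraping", broker="redis://localhost:6379/0")


@app.task(
    bind=True,
    autoretry_for=(ConnectionError, Timeout),  # retry only transient network failures
    retry_backoff=True,                        # exponential backoff between attempts
    retry_jitter=True,                         # randomize delays between retries
    retry_kwargs={"max_retries": 5},           # after the last failure the Airflow DAG sees a failed task
)
def scrape_fixtures(self, url: str) -> str:
    """Fetch one fixtures page. A parse error is not retried: a layout
    change needs a scraper fix and a manual backfill, not another attempt."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text
```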

Failure severity in the data layer

| Severity | Criteria |
| --- | --- |
| P3 — Data pipeline blocked | Scraping fails, export fails, or a validation gate blocks the pipeline; serving is unaffected and continues on the existing model |
| P4 — Freshness degradation | No new data; the existing model keeps serving; prediction quality degrades over time |
| P4 — Offline pipeline blocked | DVC stage failure during retraining; serving unaffected |

Data layer failures do not directly cause serving outages. The serving layer uses the last successfully registered model artifact and the last successfully exported batch inference features. The impact is prediction staleness, not downtime.
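
As an illustration of that decoupling, the serving process resolves whatever model version was registered last, independent of the data pipeline's current state. The sketch below assumes an MLflow registry, a hypothetical model name, and the classic stage-based workflow; it is not the project's actual serving code.

```python
# Hedged sketch: the model name and stage are assumptions, not the project's registry entries.
import mlflow

MODEL_URI = "models:/match_outcome_model/Production"  # hypothetical registered model


def load_current_model():
    """Load the last successfully registered model version.

    A blocked data pipeline does not change this artifact, so serving keeps
    working; only the freshness of the features behind the predictions suffers.
    """
    return mlflow.pyfunc.load_model(MODEL_URI)
```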


What is NOT a data layer failure

The following failures are system-wide rather than data-layer and are documented in Architecture: Failure Modes:

  • Serving failures
  • Model registry failures
  • Infrastructure failures


Known limitations

  • No automated freshness alerting. Stale data is detected manually (a sketch of such a manual check follows this list).
  • No cached HTML fallback. If a scrape fails, there is no archive to replay from.
  • Selenoid is operator-managed outside K8s; no K8s health probe covers it.
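
Given the first limitation above, freshness has to be checked by hand. The sketch below shows one way such a manual check could look; the dataset path, date column, and seven-day budget are assumptions and would need to match the real `finished.parquet` schema.

```python
# Hedged sketch of a manual freshness check. Path, column name, and threshold
# are assumptions, not values taken from the project.
import pandas as pd

DATASET_PATH = "data/raw/finished.parquet"  # hypothetical local path after `dvc pull`
DATE_COLUMN = "match_date"                  # hypothetical column name
MAX_AGE_DAYS = 7                            # hypothetical freshness budget


def check_freshness() -> bool:
    """Return True if the newest match is recent enough; print a warning otherwise."""
    df = pd.read_parquet(DATASET_PATH)
    latest = pd.to_datetime(df[DATE_COLUMN]).max()
    age_days = (pd.Timestamp.now() - latest).days  # assumes timezone-naive timestamps
    if age_days > MAX_AGE_DAYS:
        print(f"STALE: newest match is {age_days} days old; "
              "trigger an Airflow ingestion run and `dvc repro`")
        return False
    print(f"OK: newest match is {age_days} days old")
    return True


if __name__ == "__main__":
    check_freshness()
```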