# Data Layer Failure Modes
This page documents failure scenarios scoped to the data layer: scraping, ingestion, export, validation, and storage. System-wide failure modes (serving, model registry, infrastructure) are in Architecture: Failure Modes.
## Failure mode table
| Failure | Detection | Impact | Recovery | Status |
|---|---|---|---|---|
| WhoScored source unavailable | Celery task fails; Airflow DAG marked failed | No new data ingested; existing versioned datasets unaffected; model continues to serve stale predictions | Wait for source recovery; trigger Airflow backfill once available | ✅ Retry logic on Celery task |
| Scraper broken (layout change) | Celery task fails with parse error; data gap in PostgreSQL | New data not ingested; freshness breach accumulates | Update scraper selectors; trigger manual backfill for the affected range | ✅ GE validate_raw detects downstream schema impact |
| Selenoid host unreachable | Celery task fails on browser session init | Scraping completely blocked | Verify and restart Selenoid on the operator-managed host; no K8s automation is available | 🚧 Manual only — Selenoid is outside K8s |
| Raw export failed (PostgreSQL → MinIO) | DVC stage load_data_from_sources exits non-zero; no new parquet in MinIO | Pipeline blocked at export; model not updated from new data | Check PostgreSQL connectivity and MinIO credentials; re-run dvc repro load_data_from_sources | ✅ DVC stage gate |
| Schema drift (raw data) | GE validate_raw suite fails; dvc repro stops | No downstream processing; pipeline blocked at validation gate | Investigate WhoScored source change; update GE suite if the change is intentional; re-run pipeline | ✅ Implemented |
| Validation failure (interim or features) | GE validate_finished, validate_future, or validate_features fails | Pipeline blocked at that stage; training cannot proceed | Inspect validation report; fix preprocessing or feature logic; re-run | ✅ Implemented |
| MinIO unavailable | DVC pull/push fails; load_data_from_sources cannot write output | Pipeline cannot read or write any versioned dataset | Check MinIO pod status in K8s (ds namespace); restart pod; re-run dvc repro | 🚧 K8s liveness probe; no automated recovery |
| Freshness breach (stale dataset) | Most recent match in finished.parquet is older than expected | Model predictions based on outdated match history; prediction quality degrades silently | Trigger Airflow ingestion run; run dvc repro to produce an updated snapshot | 📋 No automated alert — manual inspection only |
| Erroneous backfill / bad replay | GE validation fails on backfilled data; or metrics regress after retraining | New dataset version is invalid or worse than the previous one; model may degrade if promoted | Revert to the pre-backfill Git commit; dvc pull restores the prior dataset; re-promote the previous MLflow model | ✅ Rollback via Git + DVC |
| Data leakage via temporal join | Temporal split unit tests in tests/unit/ catch it at test time | Inflated training metrics; model fails in production | Fix join condition in preprocessing or feature code; re-run full pipeline | ✅ Tested with hypothesis and unit tests |
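To illustrate the temporal-leakage check in the last row, here is a minimal sketch of the kind of assertion such a unit test might make. The row shape and names (`match_date`, `feature_cutoff`, `assert_no_temporal_leakage`) are assumptions for illustration, not the project's actual schema or test helpers:

```python
from datetime import date

def assert_no_temporal_leakage(rows):
    """rows: iterable of (match_date, feature_cutoff) pairs.

    Features must be computed strictly before the match they describe;
    a cutoff on or after the match date means future information leaked
    into the training row via the temporal join.
    """
    leaks = [(m, c) for m, c in rows if c >= m]
    if leaks:
        raise AssertionError(f"{len(leaks)} row(s) use future information: {leaks}")

# A same-day feature cutoff already counts as leakage under this rule.
assert_no_temporal_leakage([(date(2024, 5, 12), date(2024, 5, 11))])  # passes
```

A property-based tool such as hypothesis can then generate arbitrary date pairs to probe the join logic, which is consistent with the status noted in the table.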
## Failure severity in the data layer
| Severity | Criteria |
|---|---|
| P3 — Data pipeline blocked | Scraping fails, export fails, or a validation gate blocks; serving continues on the existing model, unaffected |
| P4 — Freshness degradation | No new data; existing model serves; quality degrades over time |
| P4 — Offline pipeline blocked | DVC stage failure during retraining; serving unaffected |
Data layer failures do not directly cause serving outages. The serving layer uses the last successfully registered model artifact and the last successfully exported batch inference features. The impact is prediction staleness, not downtime.
## What is NOT a data layer failure
The following are out of scope for this page:
- FastAPI or Celery failures affecting serving — see Architecture: Failure Modes
- MLflow unavailability affecting model loading — see Architecture: Failure Modes
- Redis unavailability — not a data layer component
## Known limitations
- No automated freshness alerting. Stale data is detected manually.
- No cached HTML fallback. If a scrape fails, there is no archive to replay from.
- Selenoid is operator-managed outside K8s; no K8s health probe covers it.
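Given the first limitation, the manual freshness inspection can at least be scripted. A minimal sketch, assuming the caller has already read the newest match date out of finished.parquet; the 7-day window and the helper name `is_stale` are illustrative assumptions, not the project's freshness policy:

```python
from datetime import date, timedelta

def is_stale(latest_match: date, today: date, max_age_days: int = 7) -> bool:
    """True when the newest match in the snapshot is older than the
    allowed freshness window, i.e. an ingestion run is overdue."""
    return (today - latest_match) > timedelta(days=max_age_days)

# Snapshot 19 days behind the current date -> stale; trigger an Airflow run.
print(is_stale(date(2024, 5, 1), date(2024, 5, 20)))  # True
```

Wrapping a check like this in a scheduled job would close the alerting gap, but as the table notes, that automation does not exist yet.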
## Related
- Architecture: Failure Modes — full system failure table
- Runbook: Troubleshooting — step-by-step recovery
- Runbook: Backfills — reprocessing after scraper/schema fixes
- Data Contracts — validation gate behavior
- Backfills & Freshness — freshness policy and recovery process