Data Contracts & Quality Gates¶
Purpose¶
Given an unstable upstream source, data validity cannot be assumed. Data contracts formalize what "valid data" means at each stage boundary. A failed contract stops the pipeline. No downstream stage runs on data that has not passed its gate.
Implementation¶
Data contracts are implemented with Great Expectations and executed as blocking DVC stages.
Suite definitions live in src/data_quality/ — one module per dataset.
| DVC stage | Suite module | Dataset validated |
|---|---|---|
| `validate_raw` | `src/data_quality/raw.py` | `data/raw/match_raw.parquet` |
| `validate_finished` | `src/data_quality/finished.py` | `data/interim/finished.parquet` |
| `validate_future` | `src/data_quality/future.py` | `data/interim/future.parquet` |
| `validate_features` | `src/data_quality/features.py` | `data/features/features.parquet` |
Status: ✅ Implemented — all four suites are active DVC stage gates.
What each suite checks¶
validate_raw¶
Validates `match_raw.parquet` before any preprocessing:
- Required columns present: `id`, `homeTeamId`, `awayTeamId`, `startTimeUtc`, `regionId`, `tournamentId`, `seasonId`, `stageId`, `sex`, `status`
- Not-null check on: `id`, `homeTeamId`, `awayTeamId`, `startTimeUtc`, `status`
- `startTimeUtc` in plausible range: 1998-01-01 to 2026-12-31
- `status` values from the known API code set
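The real suite is a Great Expectations module; as a minimal pandas sketch of the same three check families (column presence, not-null, date range), it might look like the following. The `check_raw` name is illustrative, and the status-code check is omitted because the API code set is project-specific:

```python
import pandas as pd

REQUIRED = ["id", "homeTeamId", "awayTeamId", "startTimeUtc", "regionId",
            "tournamentId", "seasonId", "stageId", "sex", "status"]
NOT_NULL = ["id", "homeTeamId", "awayTeamId", "startTimeUtc", "status"]

def check_raw(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations; an empty list means the gate passes."""
    errors = []
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        errors.append(f"missing columns: {missing}")
        return errors  # remaining checks assume the schema is present
    for col in NOT_NULL:
        if df[col].isna().any():
            errors.append(f"nulls in {col}")
    # Plausible-range check: unparseable timestamps coerce to NaT and fail too.
    ts = pd.to_datetime(df["startTimeUtc"], errors="coerce", utc=True)
    lo = pd.Timestamp("1998-01-01", tz="UTC")
    hi = pd.Timestamp("2026-12-31", tz="UTC")
    if not ts.between(lo, hi).all():
        errors.append("startTimeUtc out of plausible range")
    return errors
```

Returning a violation list rather than raising on the first failure lets the gate report every broken expectation in one run.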
validate_finished / validate_future¶
Validates the output of the preprocessing stage:
- Schema integrity after column stripping and type casting
- `outcome_1x2` ∈ {0, 1, 2} for finished matches
- No future-match rows in the finished set (temporal constraint)
- Score columns within plausible clipped range
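The outcome-domain and temporal checks can be sketched in pandas as follows (a hedged illustration, not the GE suite itself; the `check_finished` name and the `now` cutoff parameter are assumptions):

```python
import pandas as pd

def check_finished(df: pd.DataFrame, now: pd.Timestamp) -> list[str]:
    """Return contract violations for the finished-match set."""
    errors = []
    # Outcome labels must come from the closed 1X2 domain.
    if not df["outcome_1x2"].isin([0, 1, 2]).all():
        errors.append("outcome_1x2 outside {0, 1, 2}")
    # Temporal constraint: nothing in the finished set may start in the future.
    future = pd.to_datetime(df["startTimeUtc"], utc=True) > now
    if future.any():
        errors.append(f"{int(future.sum())} future-match rows in finished set")
    return errors
```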
validate_features¶
Validates the engineered feature matrix:
- Rate columns (`win_mean`, `draw_mean`, `loss_mean`) bounded within [0.0, 1.0] with `mostly=0.99` to tolerate cold-start windows
- Coverage columns in [0.0, 1.0] (never > 1.0)
- Goals rolling averages non-negative
- H2H columns allow high null rates (many team pairs have no head-to-head history)
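The `mostly=0.99` semantics can be expressed compactly: the check passes as long as at least 99% of non-null values lie in bounds. A pandas sketch (the `check_rate_bounds` helper is hypothetical, mirroring Great Expectations' `mostly` tolerance):

```python
import pandas as pd

def check_rate_bounds(s: pd.Series, mostly: float = 0.99) -> bool:
    """Pass when at least `mostly` of the non-null values lie in [0.0, 1.0]."""
    vals = s.dropna()
    if vals.empty:  # an all-null column is a coverage issue, not a bounds issue
        return True
    return bool(vals.between(0.0, 1.0).mean() >= mostly)
```

The fraction is computed over non-null values only, which is what lets H2H-style sparse columns coexist with strict bounds on the values that are present.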
Blocking vs non-blocking checks¶
All checks currently implemented are blocking: any expectation failure causes the DVC stage to exit non-zero, which stops the pipeline.
Distribution drift checks (statistical) are not yet implemented; Evidently integration for drift monitoring is planned. (📋 Planned — see Status)
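The blocking behavior falls out of how DVC treats stage commands: a non-zero exit aborts `dvc repro`. A sketch of the corresponding `dvc.yaml` stage (paths from the table above; the `cmd` invocation style is an assumption):

```yaml
stages:
  validate_raw:
    cmd: python -m src.data_quality.raw   # exits non-zero on any expectation failure
    deps:
      - src/data_quality/raw.py
      - data/raw/match_raw.parquet
```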
Contract as code¶
GE suite modules in `src/data_quality/` are:
- versioned in Git alongside pipeline code,
- deterministic pure functions (no IO, no side effects),
- invoked by DVC stages defined in `dvc.yaml`.
Any change to a contract must be code-reviewed like any other pipeline change. The GE suite is the authoritative definition of "valid data" for its stage.
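The pure-function constraint implies a two-layer shape: the contract itself takes a DataFrame and returns violations, while a thin wrapper owns all IO and translates the result into an exit code for DVC. A hypothetical sketch (the `score_home` column and both function names are illustrative):

```python
import sys
import pandas as pd

def run_suite(df: pd.DataFrame) -> list[str]:
    """Pure contract: DataFrame in, list of violations out. No IO, no side effects."""
    errors = []
    if df["score_home"].lt(0).any():  # hypothetical check, for illustration
        errors.append("negative values in score_home")
    return errors

def main(path: str) -> int:
    """Thin wrapper that owns all IO; DVC only sees the returned exit code."""
    violations = run_suite(pd.read_parquet(path))
    for v in violations:
        print(v, file=sys.stderr)
    return 1 if violations else 0
```

Keeping the contract pure makes it trivially unit-testable with in-memory frames, independent of the pipeline.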
Schema drift consequence¶
When WhoScored changes its output structure:
- Scraping produces records that fail the `validate_raw` suite.
- `dvc repro` stops at the `validate_raw` stage.
- No downstream processing, feature engineering, or training runs.
- The operator reviews the schema change and either updates the GE suite (if the change is intentional) or fixes the scraper.
This is the designed failure mode. The pipeline is intentionally conservative: uncertain data does not reach the model.
Contract ownership and change review¶
- Contracts live in `src/data_quality/` and are reviewed as code changes.
- A contract change that widens or removes a check must be justified.
- A contract change that tightens or adds a check is always safe to merge.
- Breaking schema changes (column renames, type changes) require updating both the contract suite and the downstream preprocessing code before `dvc repro`.
Related¶
- Schemas & Lineage — what each dataset contains
- Raw Export — where `validate_raw` is applied
- Architecture: Data & ML Flow — gate positions in pipeline
- Failure Modes — what happens when a gate fails