Data Contracts & Quality Gates

Purpose

Given an unstable upstream source, data validity cannot be assumed. Data contracts formalize what "valid data" means at each stage boundary. A failed contract stops the pipeline. No downstream stage runs on data that has not passed its gate.


Implementation

Data contracts are implemented with Great Expectations and executed as blocking DVC stages. Suite definitions live in src/data_quality/ — one module per dataset.

DVC stage           Suite module                    Dataset validated
validate_raw        src/data_quality/raw.py         data/raw/match_raw.parquet
validate_finished   src/data_quality/finished.py    data/interim/finished.parquet
validate_future     src/data_quality/future.py      data/interim/future.parquet
validate_features   src/data_quality/features.py    data/features/features.parquet

Status: ✅ Implemented — all four suites are active DVC stage gates.


What each suite checks

validate_raw

Validates match_raw.parquet before any preprocessing:

  • Required columns present: id, homeTeamId, awayTeamId, startTimeUtc, regionId, tournamentId, seasonId, stageId, sex, status
  • Not-null check on: id, homeTeamId, awayTeamId, startTimeUtc, status
  • startTimeUtc in plausible range: 1998-01-01 to 2026-12-31
  • status values from the known API code set
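The checks above can be sketched in plain Python. This is a hypothetical stdlib stand-in for the real Great Expectations suite in src/data_quality/raw.py; the status code set here is illustrative only (the real API code set is not documented on this page):

```python
from datetime import datetime, timezone

REQUIRED = {"id", "homeTeamId", "awayTeamId", "startTimeUtc", "regionId",
            "tournamentId", "seasonId", "stageId", "sex", "status"}
NOT_NULL = {"id", "homeTeamId", "awayTeamId", "startTimeUtc", "status"}
KNOWN_STATUS = {0, 1, 6}  # placeholder values, NOT the real API code set
MIN_START = datetime(1998, 1, 1, tzinfo=timezone.utc)
MAX_START = datetime(2026, 12, 31, tzinfo=timezone.utc)

def check_raw_record(rec: dict) -> list[str]:
    """Return the list of violated expectations for one raw match record."""
    failures = [f"missing column: {c}" for c in REQUIRED - rec.keys()]
    failures += [f"null value: {c}" for c in NOT_NULL & rec.keys()
                 if rec[c] is None]
    ts = rec.get("startTimeUtc")
    if isinstance(ts, datetime) and not (MIN_START <= ts <= MAX_START):
        failures.append("startTimeUtc out of plausible range")
    if "status" in rec and rec["status"] not in KNOWN_STATUS:
        failures.append("unknown status code")
    return failures
```

An empty return means the record passes the gate; any non-empty list would make the real stage fail.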

validate_finished / validate_future

These two suites validate, respectively, the finished-match and future-match outputs of the preprocessing stage:

  • Schema integrity after column stripping and type casting
  • outcome_1x2 ∈ {0, 1, 2} for finished matches
  • No future-match rows in the finished set (temporal constraint)
  • Score columns within plausible clipped range
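A minimal sketch of the finished-set checks, again as a stdlib stand-in rather than the real GE suite. The score column names (home_score, away_score) and the clipping bound MAX_GOALS are assumptions for illustration; only outcome_1x2 and the temporal constraint come from this page:

```python
from datetime import datetime, timezone

MAX_GOALS = 20  # assumed clipping bound, not taken from the real suite

def check_finished_record(rec: dict, now: datetime) -> list[str]:
    """Return the list of violated expectations for one finished-match row."""
    failures = []
    if rec.get("outcome_1x2") not in {0, 1, 2}:
        failures.append("outcome_1x2 not in {0, 1, 2}")
    ts = rec.get("startTimeUtc")
    if ts is not None and ts > now:
        failures.append("future match in finished set")
    for col in ("home_score", "away_score"):  # hypothetical column names
        v = rec.get(col)
        if v is not None and not (0 <= v <= MAX_GOALS):
            failures.append(f"{col} outside plausible range")
    return failures
```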

validate_features

Validates the engineered feature matrix:

  • Rate columns (win_mean, draw_mean, loss_mean) bounded within [0.0, 1.0] with mostly=0.99 to tolerate cold-start windows
  • Coverage columns in [0.0, 1.0] (never > 1.0)
  • Goals rolling averages non-negative
  • H2H columns allow high null rates (many team pairs have no head-to-head history)
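The mostly=0.99 tolerance follows Great Expectations' "mostly" semantics: the expectation passes when at least that fraction of non-null values satisfies the bound. A small stand-alone sketch of that semantics (not the GE implementation itself):

```python
def mostly_within(values, lo, hi, mostly=0.99):
    """Pass if at least `mostly` of the non-null values fall inside [lo, hi].

    Nulls are ignored, which is also why high-null H2H columns can still
    pass their own (looser) expectations.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True
    in_bounds = sum(lo <= v <= hi for v in non_null)
    return in_bounds / len(non_null) >= mostly
```

With mostly=0.99, one out-of-range value in a hundred (a cold-start window) still passes; coverage columns would use mostly=1.0, since they may never exceed 1.0.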

Blocking vs non-blocking checks

All checks currently implemented are blocking: any expectation failure causes the DVC stage to exit non-zero, which stops the pipeline.
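The blocking behavior reduces to the stage's exit code. A minimal sketch, where check_failures stands in for whatever a suite entrypoint returns:

```python
import sys

def run_gate(failures: list[str]) -> int:
    """Turn a list of contract failures into a process exit code.

    DVC treats any non-zero exit as stage failure, which halts `dvc repro`
    before downstream stages run.
    """
    if failures:
        for f in failures:
            print(f"CONTRACT FAILURE: {f}", file=sys.stderr)
        return 1
    return 0
```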

Statistical distribution-drift checks are not yet implemented; Evidently integration is the planned mechanism for drift monitoring. (📋 Planned — see Status)


Contract as code

GE suite modules in src/data_quality/ are:

  • versioned in Git alongside pipeline code,
  • deterministic and side-effect-free (the suite definitions themselves perform no IO),
  • invoked by DVC stages defined in dvc.yaml.

Any change to a contract must be code-reviewed like any other pipeline change. The GE suite is the authoritative definition of "valid data" for its stage.
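One way to read this structure as code: the contract module exposes a pure function over the data, and IO stays at the DVC stage boundary. The names below (validate, main, the JSON loading) are illustrative, not the real entrypoints:

```python
import json

def validate(records: list[dict]) -> list[str]:
    """Pure contract: the same input always yields the same failure list."""
    return [f"record {i}: missing id"
            for i, r in enumerate(records) if r.get("id") is None]

def main(path: str) -> int:
    # IO happens only in the stage entrypoint; the contract stays pure,
    # so a contract change is reviewable as an ordinary code diff.
    with open(path) as fh:
        records = json.load(fh)
    return 1 if validate(records) else 0
```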


Schema drift consequence

When WhoScored changes its output structure:

  1. Scraping produces records that fail the validate_raw suite.
  2. dvc repro stops at the validate_raw stage.
  3. No downstream processing, feature engineering, or training runs.
  4. The operator reviews the schema change and either updates the GE suite (if the change is intentional) or fixes the scraper.

This is the designed failure mode. The pipeline is intentionally conservative: uncertain data does not reach the model.


Contract ownership and change review

  • Contracts live in src/data_quality/ and are reviewed as code changes.
  • A contract change that widens or removes a check must be justified.
  • A contract change that tightens or adds a check is always safe to merge.
  • Breaking schema changes (column renames, type changes) require updating both the contract suite and the downstream preprocessing code before dvc repro.