Canonical Datasets & Lineage

This page documents the datasets that exist in this system, their role in the pipeline, and how schema evolution is managed. It is not a generic data dictionary — it describes what is actually materialized and versioned in data/.


Dataset inventory

data/raw/match.parquet and data/raw/match_raw.parquet

Stage: Raw export (DVC: load_data_from_sources)
Source: PostgreSQL canonical tables
Update cadence: On dvc repro after new ingestion
Consumers: validate_raw, then preprocessing

These are the raw match records as scraped from WhoScored.com, normalized and written to PostgreSQL, then exported as point-in-time snapshots. match.parquet contains the processed view; match_raw.parquet contains the unprocessed raw form.

Key columns retained after preprocessing:

Column        Type           Description
id            int32          Match identifier (WhoScored ID)
startTimeUtc  datetime[UTC]  Match start time, UTC
tournamentId  int16          Tournament identifier
stageId       int16          Stage within tournament
regionId      int16          Geographic region
seasonId      int16          Season identifier
homeTeamId    int32          Home team identifier
awayTeamId    int32          Away team identifier
homeScore     int8           Goals scored by home team
awayScore     int8           Goals scored by away team
status        int            Match status (6 = finished, 1 = scheduled)
sex           int8           Competition sex category

Columns stripped at preprocessing: metadata names (tournamentName, regionName, etc.), operational fields (elapsed, period, incidents), and WhoScored-internal fields.
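The column pruning at the preprocessing boundary can be sketched as follows. This is a minimal illustration: the KEEP list below is a hypothetical fragment, not the project's actual column list, which lives in the preprocessing stage code.

```python
import pandas as pd

# Illustrative fragment only; the authoritative column list lives in the
# preprocessing stage, not here.
KEEP = ["id", "startTimeUtc", "tournamentId", "homeTeamId",
        "awayTeamId", "homeScore", "awayScore", "status"]

raw = pd.DataFrame({
    "id": [101],
    "startTimeUtc": pd.to_datetime(["2024-05-01T18:00:00Z"]),
    "tournamentId": [2],
    "homeTeamId": [10],
    "awayTeamId": [11],
    "homeScore": [2],
    "awayScore": [1],
    "status": [6],
    "tournamentName": ["Premier League"],  # metadata name, stripped
    "elapsed": ["FT"],                     # operational field, stripped
})

# Keep only the retained columns and narrow score dtypes as in the table above.
match = raw[KEEP].astype({"homeScore": "int8", "awayScore": "int8"})
```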


data/interim/finished.parquet

Stage: Preprocessing (DVC: preprocessing)
Upstream: data/raw/match.parquet
Consumers: validate_finished, feature_engineering

Completed matches (status=6). Contains all raw columns plus derived target variables:

Column       Type  Description
outcome_1x2  int8  0 = home win, 1 = draw, 2 = away win (primary classification target)
sumScore     int8  Total goals (home + away)
diffScore    int8  Goal difference (home − away)

Score outliers are clipped at the 99.99th percentile per params.yaml → preprocessing.score_outlier_pct. Records are sorted ascending by startTimeUtc.
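The target derivation and clipping described above can be sketched like this. It is a minimal illustration; the exact casting order and the point at which clipping is applied are assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd

finished = pd.DataFrame({
    "startTimeUtc": pd.to_datetime(["2024-05-02", "2024-05-01"], utc=True),
    "homeScore": [3, 1],
    "awayScore": [1, 1],
})

# Outlier clipping on a raw score series: values above the configured
# percentile (params.yaml -> preprocessing.score_outlier_pct) are capped.
scores = pd.Series([1, 2, 0, 1, 100])
capped = scores.clip(upper=scores.quantile(0.9999))

# Derived targets, matching the table above.
finished["sumScore"] = (finished["homeScore"] + finished["awayScore"]).astype("int8")
finished["diffScore"] = (finished["homeScore"] - finished["awayScore"]).astype("int8")
finished["outcome_1x2"] = np.select(
    [finished["diffScore"] > 0, finished["diffScore"] == 0],
    [0, 1],      # 0 = home win, 1 = draw
    default=2,   # 2 = away win
).astype("int8")

# Records are sorted ascending by startTimeUtc.
finished = finished.sort_values("startTimeUtc").reset_index(drop=True)
```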


data/interim/future.parquet

Stage: Preprocessing (DVC: preprocessing)
Upstream: data/raw/match.parquet
Consumers: validate_future, feature_engineering, batch_inference

Scheduled matches (status=1) with startTimeUtc after the last finished match. Contains no score or target columns. Used as the prediction universe for batch inference.
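The selection rule reduces to a two-part filter; a minimal sketch:

```python
import pandas as pd

matches = pd.DataFrame({
    "status": [6, 6, 1, 1],
    "startTimeUtc": pd.to_datetime(
        ["2024-05-01", "2024-05-03", "2024-05-02", "2024-05-05"], utc=True
    ),
})

# The prediction universe: scheduled matches strictly after the last
# finished match.
last_finished = matches.loc[matches["status"] == 6, "startTimeUtc"].max()
future = matches[(matches["status"] == 1) & (matches["startTimeUtc"] > last_finished)]
```

Note that the scheduled match on 2024-05-02 is excluded: it precedes the last finished match (2024-05-03) and therefore falls outside the prediction universe.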


data/features/features.parquet

Stage: Feature engineering (DVC: feature_engineering)
Upstream: data/interim/finished.parquet, data/interim/future.parquet
Consumers: validate_features, train, tune, calibrate

Match-level feature matrix. Each row is a match; each column is a team-level statistic computed over rolling windows. Features include:

  • rolling mean of goals_for, goals_against, wins, draws, losses per team across configurable window sizes (params.yaml → features.window_sizes),
  • ELO ratings per team scoped to tournament,
  • home/away perspective columns with differential (diff_ prefix).
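A rolling team-form feature of the kind listed above can be sketched as follows. Whether the production code in src/features/ shifts by one match to exclude the current row is an assumption here, though it is the standard way to avoid target leakage:

```python
import pandas as pd

team_matches = pd.DataFrame({
    "teamId": [10, 10, 10, 11, 11],
    "startTimeUtc": pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-02", "2024-01-09"],
        utc=True,
    ),
    "goals_for": [2, 0, 3, 1, 1],
})

window = 2  # one of params.yaml -> features.window_sizes
team_matches = team_matches.sort_values(["teamId", "startTimeUtc"])

# shift(1) so each match's feature uses only matches played before it.
team_matches["goals_for_mean_2"] = (
    team_matches.groupby("teamId")["goals_for"]
    .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
)
```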

The feature code in src/features/ is the same code used at inference time. There is no separate inference feature implementation.


data/features/features_meta.parquet

Stage: Feature engineering (DVC: feature_engineering)
Consumers: batch_inference, serving layer

Metadata for feature rows: match identifiers, team IDs, dates. Used to join predictions back to match context at serving time.
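The join back to match context might look like this. The column names and the positional alignment between meta rows and prediction rows are assumptions for illustration; the actual meta columns are whatever feature_engineering writes.

```python
import pandas as pd

# Stand-in for features_meta.parquet rows (assumed to align positionally
# with the prediction rows produced from features.parquet).
meta = pd.DataFrame({
    "id": [101, 102],
    "homeTeamId": [10, 12],
    "awayTeamId": [11, 13],
})

# Stand-in for batch-inference output, one row per feature row.
preds = pd.DataFrame({
    "p_home": [0.5, 0.3],
    "p_draw": [0.3, 0.3],
    "p_away": [0.2, 0.4],
})

# Attach match context to each prediction for serving.
served = pd.concat([meta.reset_index(drop=True), preds.reset_index(drop=True)], axis=1)
```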


Data lineage

Full traceability path:

WhoScored.com HTML
    ↓  (Airflow + Selenoid + celery-worker-api)
PostgreSQL canonical tables
    ↓  (DVC: load_data_from_sources)
data/raw/match.parquet          ← DVC content hash, Git .dvc pointer
    ↓  (DVC: validate_raw → preprocessing)
data/interim/finished.parquet
data/interim/future.parquet     ← DVC content hash, Git .dvc pointer
    ↓  (DVC: feature_engineering)
data/features/features.parquet  ← DVC content hash, Git .dvc pointer
    ↓  (DVC: train / tune / calibrate)
MLflow experiment run + model artifact

A given Git commit, combined with dvc checkout, restores the exact dataset state that produced any registered MLflow experiment.


Metadata lookups

data/metadata/ contains JSON id→name mappings exported at preprocessing:

  • tournamentId.json, regionId.json, seasonId.json, stageId.json
  • homeTeamId.json, awayTeamId.json

These are used by the serving layer to return human-readable names in prediction responses. They are regenerated on each preprocessing run and are not DVC-tracked independently.
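Resolving an ID to a name at serving time reduces to a JSON dictionary lookup. This sketch writes a one-entry mapping to a temporary directory rather than reading the real data/metadata/ files (JSON object keys are always strings, hence the str() cast on the ID):

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for data/metadata/tournamentId.json.
    path = Path(tmp) / "tournamentId.json"
    path.write_text(json.dumps({"2": "Premier League"}))

    lookup = json.loads(path.read_text())
    name = lookup.get(str(2), "unknown")
```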


Schema evolution rules

Change type                                   Required action
Adding a nullable column to raw export        Update the GE validate_raw suite; no breaking change
Removing or renaming a column                 Breaking change: bump the schema version; update downstream stages and GE suites
Adding a derived column to interim/features   Update GE suites; check feature code compatibility at inference
Changing a column type                        Treat as breaking: full pipeline re-run required

All changes to the DVC-tracked parquet files, schema changes included, surface in Git as content-hash changes in the .dvc pointer files. Breaking changes must update the relevant GE suite in src/data_quality/ before dvc repro is re-run.
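The kind of drift those rules guard against can be illustrated with a minimal schema check. Plain pandas is used here as a stand-in for the actual Great Expectations suites in src/data_quality/; the EXPECTED mapping is a hypothetical fragment:

```python
import pandas as pd

# Illustrative fragment; the real expectations live in the GE suites.
EXPECTED = {"id": "int32", "status": "int64"}

def schema_problems(df: pd.DataFrame, expected: dict) -> list:
    """Report missing columns and dtype drift against an expected schema."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"type drift on {col}: {df[col].dtype} != {dtype}")
    return problems

ok = pd.DataFrame({"id": pd.array([101], dtype="int32"), "status": [6]})
renamed = ok.rename(columns={"id": "matchId"})  # a breaking change
```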