Canonical Datasets & Lineage

This page documents the datasets that exist in this system, their role in the pipeline, and how schema evolution is managed. It is not a generic data dictionary — it describes what is actually materialized and versioned in data/.


Dataset inventory

data/raw/match.parquet and data/raw/match_raw.parquet

Stage: Raw export (DVC: load_data_from_sources)
Source: PostgreSQL canonical tables
Update cadence: On dvc repro after new ingestion
Consumers: validate_raw, then preprocessing

These are the raw match records as scraped from WhoScored.com, normalized and written to PostgreSQL, then exported as point-in-time snapshots. match.parquet contains the processed view; match_raw.parquet contains the unprocessed raw form.

Key columns retained after preprocessing:

Column        Type           Description
id            int32          Match identifier (WhoScored ID)
startTimeUtc  datetime[UTC]  Match start time, UTC
tournamentId  int16          Tournament identifier
stageId       int16          Stage within tournament
regionId      int16          Geographic region
seasonId      int16          Season identifier
homeTeamId    int32          Home team identifier
awayTeamId    int32          Away team identifier
homeScore     int8           Goals scored by home team
awayScore     int8           Goals scored by away team
status        int            Match status (6 = finished, 1 = scheduled)
sex           int8           Competition sex category

Columns stripped at preprocessing: metadata names (tournamentName, regionName, etc.), operational fields (elapsed, period, incidents), and WhoScored-internal fields.
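The column pruning at the preprocessing boundary can be sketched as follows. This is a minimal illustration: the KEEP list below is a hypothetical fragment, not the project's actual column list, which lives in the preprocessing stage code.

```python
import pandas as pd

# Illustrative fragment only; the authoritative column list lives in the
# preprocessing stage, not here.
KEEP = ["id", "startTimeUtc", "tournamentId", "homeTeamId",
        "awayTeamId", "homeScore", "awayScore", "status"]

raw = pd.DataFrame({
    "id": [101],
    "startTimeUtc": pd.to_datetime(["2024-05-01T18:00:00Z"]),
    "tournamentId": [2],
    "homeTeamId": [10],
    "awayTeamId": [11],
    "homeScore": [2],
    "awayScore": [1],
    "status": [6],
    "tournamentName": ["Premier League"],  # metadata name, stripped
    "elapsed": ["FT"],                     # operational field, stripped
})

# Keep only the retained columns and narrow score dtypes as in the table above.
match = raw[KEEP].astype({"homeScore": "int8", "awayScore": "int8"})
```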


data/interim/finished.parquet

Stage: Preprocessing (DVC: preprocessing)
Upstream: data/raw/match.parquet
Consumers: validate_finished, feature_engineering

Completed matches (status=6). Contains all raw columns plus derived target variables:

Column       Type  Description
outcome_1x2  int8  0 = home win, 1 = draw, 2 = away win (primary classification target)
sumScore     int8  Total goals (home + away)
diffScore    int8  Goal difference (home − away)

Score outliers are clipped at the 99.99th percentile per params.yaml → preprocessing.score_outlier_pct. Records are sorted ascending by startTimeUtc.
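The target derivation and clipping described above can be sketched like this. It is a minimal illustration; the exact casting order and the point at which clipping is applied are assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd

finished = pd.DataFrame({
    "startTimeUtc": pd.to_datetime(["2024-05-02", "2024-05-01"], utc=True),
    "homeScore": [3, 1],
    "awayScore": [1, 1],
})

# Outlier clipping on a raw score series: values above the configured
# percentile (params.yaml -> preprocessing.score_outlier_pct) are capped.
scores = pd.Series([1, 2, 0, 1, 100])
capped = scores.clip(upper=scores.quantile(0.9999))

# Derived targets, matching the table above.
finished["sumScore"] = (finished["homeScore"] + finished["awayScore"]).astype("int8")
finished["diffScore"] = (finished["homeScore"] - finished["awayScore"]).astype("int8")
finished["outcome_1x2"] = np.select(
    [finished["diffScore"] > 0, finished["diffScore"] == 0],
    [0, 1],      # 0 = home win, 1 = draw
    default=2,   # 2 = away win
).astype("int8")

# Records are sorted ascending by startTimeUtc.
finished = finished.sort_values("startTimeUtc").reset_index(drop=True)
```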


data/interim/future.parquet

Stage: Preprocessing (DVC: preprocessing)
Upstream: data/raw/match.parquet
Consumers: validate_future, feature_engineering, batch_inference

Scheduled matches (status=1) with startTimeUtc after the last finished match. Contains no score or target columns. Used as the prediction universe for batch inference.
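The selection rule reduces to a two-part filter; a minimal sketch:

```python
import pandas as pd

matches = pd.DataFrame({
    "status": [6, 6, 1, 1],
    "startTimeUtc": pd.to_datetime(
        ["2024-05-01", "2024-05-03", "2024-05-02", "2024-05-05"], utc=True
    ),
})

# The prediction universe: scheduled matches strictly after the last
# finished match.
last_finished = matches.loc[matches["status"] == 6, "startTimeUtc"].max()
future = matches[(matches["status"] == 1) & (matches["startTimeUtc"] > last_finished)]
```

Note that the scheduled match on 2024-05-02 is excluded: it precedes the last finished match (2024-05-03) and therefore falls outside the prediction universe.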


data/features/features.parquet

Stage: Feature engineering (DVC: feature_engineering)
Upstream: data/interim/finished.parquet, data/interim/future.parquet
Consumers: validate_features, train, tune, calibrate

Match-level feature matrix. Each row is a match; each column is a team-level statistic computed over rolling windows. Features include:

  • rolling mean of goals_for, goals_against, wins, draws, losses per team across configurable window sizes (params.yaml → features.window_sizes),
  • ELO ratings per team scoped to tournament,
  • home/away perspective columns with differential (diff_ prefix).
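A rolling team-form feature of the kind listed above can be sketched as follows. Whether the production code in src/features/ shifts by one match to exclude the current row is an assumption here, though it is the standard way to avoid target leakage:

```python
import pandas as pd

team_matches = pd.DataFrame({
    "teamId": [10, 10, 10, 11, 11],
    "startTimeUtc": pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-02", "2024-01-09"],
        utc=True,
    ),
    "goals_for": [2, 0, 3, 1, 1],
})

window = 2  # one of params.yaml -> features.window_sizes
team_matches = team_matches.sort_values(["teamId", "startTimeUtc"])

# shift(1) so each match's feature uses only matches played before it.
team_matches["goals_for_mean_2"] = (
    team_matches.groupby("teamId")["goals_for"]
    .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
)
```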

The feature code in src/features/ is the same code used at inference time. There is no separate inference feature implementation.


data/features/features_meta.parquet

Stage: Feature engineering (DVC: feature_engineering)
Consumers: batch_inference, serving layer

Metadata for feature rows: match identifiers, team IDs, dates. Used to join predictions back to match context at serving time.
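The join back to match context might look like this. The column names and the positional alignment between meta rows and prediction rows are assumptions for illustration; the actual meta columns are whatever feature_engineering writes.

```python
import pandas as pd

# Stand-in for features_meta.parquet rows (assumed to align positionally
# with the prediction rows produced from features.parquet).
meta = pd.DataFrame({
    "id": [101, 102],
    "homeTeamId": [10, 12],
    "awayTeamId": [11, 13],
})

# Stand-in for batch-inference output, one row per feature row.
preds = pd.DataFrame({
    "p_home": [0.5, 0.3],
    "p_draw": [0.3, 0.3],
    "p_away": [0.2, 0.4],
})

# Attach match context to each prediction for serving.
served = pd.concat([meta.reset_index(drop=True), preds.reset_index(drop=True)], axis=1)
```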


Data lineage

Full traceability path:

WhoScored.com HTML
    ↓  (Airflow + Selenoid + celery-worker-api)
PostgreSQL canonical tables
    ↓  (DVC: load_data_from_sources)
data/raw/match.parquet          ← DVC content hash, Git .dvc pointer
    ↓  (DVC: validate_raw → preprocessing)
data/interim/finished.parquet
data/interim/future.parquet     ← DVC content hash, Git .dvc pointer
    ↓  (DVC: feature_engineering)
data/features/features.parquet  ← DVC content hash, Git .dvc pointer
    ↓  (DVC: train / tune / calibrate)
MLflow experiment run + model artifact

A given Git commit, combined with dvc checkout, restores the exact dataset state that produced any registered MLflow experiment.


Metadata lookups

data/metadata/ contains JSON id→name mappings exported at preprocessing:

  • tournamentId.json, regionId.json, seasonId.json, stageId.json
  • homeTeamId.json, awayTeamId.json

These are used by the serving layer to return human-readable names in prediction responses. They are regenerated on each preprocessing run and are not DVC-tracked independently.
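Resolving an ID to a name at serving time reduces to a JSON dictionary lookup. This sketch writes a one-entry mapping to a temporary directory rather than reading the real data/metadata/ files (JSON object keys are always strings, hence the str() cast on the ID):

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for data/metadata/tournamentId.json.
    path = Path(tmp) / "tournamentId.json"
    path.write_text(json.dumps({"2": "Premier League"}))

    lookup = json.loads(path.read_text())
    name = lookup.get(str(2), "unknown")
```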


Schema evolution rules

Change type                                   Required action
Adding a nullable column to raw export        Update the GE validate_raw suite; no breaking change
Removing or renaming a column                 Breaking change: bump the schema version; update downstream stages and GE suites
Adding a derived column to interim/features   Update GE suites; check feature code compatibility at inference
Changing a column type                        Treat as breaking: full pipeline re-run required

All changes to the DVC-tracked parquet files, schema changes included, surface in Git as content-hash changes in the .dvc pointer files. Breaking changes must update the relevant GE suite in src/data_quality/ before dvc repro is re-run.
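The kind of drift those rules guard against can be illustrated with a minimal schema check. Plain pandas is used here as a stand-in for the actual Great Expectations suites in src/data_quality/; the EXPECTED mapping is a hypothetical fragment:

```python
import pandas as pd

# Illustrative fragment; the real expectations live in the GE suites.
EXPECTED = {"id": "int32", "status": "int64"}

def schema_problems(df: pd.DataFrame, expected: dict) -> list:
    """Report missing columns and dtype drift against an expected schema."""
    problems = []
    for col, dtype in expected.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"type drift on {col}: {df[col].dtype} != {dtype}")
    return problems

ok = pd.DataFrame({"id": pd.array([101], dtype="int32"), "status": [6]})
renamed = ok.rename(columns={"id": "matchId"})  # a breaking change
```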