# Canonical Datasets & Lineage
This page documents the datasets that exist in this system, their role in the pipeline,
and how schema evolution is managed. It is not a generic data dictionary — it describes
what is actually materialized and versioned in data/.
## Dataset inventory

### `data/raw/match.parquet` and `data/raw/match_raw.parquet`

- **Stage:** Raw export (DVC: `load_data_from_sources`)
- **Source:** PostgreSQL canonical tables
- **Update cadence:** on `dvc repro` after new ingestion
- **Consumers:** `validate_raw`, then preprocessing
These are the raw match records as scraped from WhoScored.com, normalized and written to PostgreSQL,
then exported as point-in-time snapshots. match.parquet contains the processed view;
match_raw.parquet contains the unprocessed raw form.
Key columns retained after preprocessing:
| Column | Type | Description |
|---|---|---|
| `id` | int32 | Match identifier (WhoScored ID) |
| `startTimeUtc` | datetime[UTC] | Match start time, UTC |
| `tournamentId` | int16 | Tournament identifier |
| `stageId` | int16 | Stage within tournament |
| `regionId` | int16 | Geographic region |
| `seasonId` | int16 | Season identifier |
| `homeTeamId` | int32 | Home team identifier |
| `awayTeamId` | int32 | Away team identifier |
| `homeScore` | int8 | Goals scored by home team |
| `awayScore` | int8 | Goals scored by away team |
| `status` | int | Match status (6 = finished, 1 = scheduled) |
| `sex` | int8 | Competition sex category |
Columns stripped at preprocessing: metadata names (tournamentName, regionName, etc.),
operational fields (elapsed, period, incidents), and WhoScored-internal fields.
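The pruning step can be sketched as follows. This is a minimal illustration with pandas; the function name and column list here are taken from the table above, not from the actual preprocessing module:

```python
import pandas as pd

# Retained columns, per the table above; everything else
# (metadata names, operational fields, internal fields) is dropped.
RETAINED = [
    "id", "startTimeUtc", "tournamentId", "stageId", "regionId", "seasonId",
    "homeTeamId", "awayTeamId", "homeScore", "awayScore", "status", "sex",
]

def strip_to_retained(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the retained columns that are present in the frame."""
    return df[[c for c in RETAINED if c in df.columns]].copy()
```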
### `data/interim/finished.parquet`

- **Stage:** Preprocessing (DVC: `preprocessing`)
- **Upstream:** `data/raw/match.parquet`
- **Consumers:** `validate_finished`, `feature_engineering`
Completed matches (status=6). Contains all raw columns plus derived target variables:
| Column | Type | Description |
|---|---|---|
| `outcome_1x2` | int8 | 0 = home win, 1 = draw, 2 = away win (primary classification target) |
| `sumScore` | int8 | Total goals (home + away) |
| `diffScore` | int8 | Goal difference (home − away) |
Score outliers are clipped at the 99.99th percentile, per `params.yaml` → `preprocessing.score_outlier_pct`.
Records are sorted ascending by `startTimeUtc`.
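The derivation of the target columns can be sketched as below. This is a hedged illustration: the function name, the rounding of the percentile cap, and applying the clip to `sumScore` specifically are assumptions, not the pipeline's exact code:

```python
import numpy as np
import pandas as pd

def derive_targets(match: pd.DataFrame, score_outlier_pct: float = 99.99) -> pd.DataFrame:
    """Keep finished matches, derive targets, clip outliers, sort by kickoff."""
    out = match[match["status"] == 6].copy()
    # 0 = home win, 1 = draw, 2 = away win
    out["outcome_1x2"] = np.select(
        [out["homeScore"] > out["awayScore"],
         out["homeScore"] == out["awayScore"]],
        [0, 1], default=2,
    ).astype("int8")
    out["sumScore"] = (out["homeScore"] + out["awayScore"]).astype("int8")
    out["diffScore"] = (out["homeScore"] - out["awayScore"]).astype("int8")
    # Clip extreme totals at the configured percentile
    cap = round(float(np.percentile(out["sumScore"], score_outlier_pct)))
    out["sumScore"] = out["sumScore"].clip(upper=cap).astype("int8")
    return out.sort_values("startTimeUtc").reset_index(drop=True)
```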
### `data/interim/future.parquet`

- **Stage:** Preprocessing (DVC: `preprocessing`)
- **Upstream:** `data/raw/match.parquet`
- **Consumers:** `validate_future`, `feature_engineering`, `batch_inference`
Scheduled matches (status=1) with startTimeUtc after the last finished match.
Contains no score or target columns. Used as the prediction universe for batch inference.
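The selection rule above can be expressed as a short sketch (the function name is hypothetical; it only illustrates the filter described in this section):

```python
import pandas as pd

def select_future(match: pd.DataFrame) -> pd.DataFrame:
    """Scheduled matches (status=1) that start after the last finished match."""
    last_finished = match.loc[match["status"] == 6, "startTimeUtc"].max()
    future = match[(match["status"] == 1) & (match["startTimeUtc"] > last_finished)]
    # The prediction universe carries no score or target columns
    return future.drop(columns=["homeScore", "awayScore"], errors="ignore").copy()
```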
### `data/features/features.parquet`

- **Stage:** Feature engineering (DVC: `feature_engineering`)
- **Upstream:** `data/interim/finished.parquet`, `data/interim/future.parquet`
- **Consumers:** `validate_features`, `train`, `tune`, `calibrate`
Match-level feature matrix. Each row is a match; each column is a team-level statistic computed over rolling windows. Features include:
- rolling means of `goals_for`, `goals_against`, `wins`, `draws`, `losses` per team across configurable window sizes (`params.yaml` → `features.window_sizes`),
- ELO ratings per team, scoped to tournament,
- home/away perspective columns with differentials (`diff_` prefix).
The feature code in src/features/ is the same code used at inference time.
There is no separate inference feature implementation.
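A point-in-time-safe rolling mean of the kind listed above can be sketched as follows. This assumes a hypothetical long-format frame (one row per team per match, already sorted by `startTimeUtc`); the real implementation lives in `src/features/` and may differ:

```python
import pandas as pd

def rolling_team_mean(long_df: pd.DataFrame, col: str, window: int) -> pd.Series:
    """Rolling mean of `col` over each team's previous `window` matches.

    shift(1) excludes the current match, so the feature never sees
    the outcome it is meant to predict.
    """
    return (
        long_df.groupby("teamId")[col]
               .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    )
```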
### `data/features/features_meta.parquet`

- **Stage:** Feature engineering (DVC: `feature_engineering`)
- **Consumers:** `batch_inference`, serving layer
Metadata for feature rows: match identifiers, team IDs, dates. Used to join predictions back to match context at serving time.
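The join back to match context can be sketched as a one-liner (the function name and the `id` join key follow the raw schema above; the serving layer's actual join may differ):

```python
import pandas as pd

def attach_context(preds: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Join predictions back to match context on the match identifier.

    validate='one_to_one' raises if either side has duplicate ids.
    """
    return preds.merge(meta, on="id", how="left", validate="one_to_one")
```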
## Data lineage
Full traceability path:
```text
WhoScored.com HTML
  ↓ (Airflow + Selenoid + celery-worker-api)
PostgreSQL canonical tables
  ↓ (DVC: load_data_from_sources)
data/raw/match.parquet              ← DVC content hash, Git .dvc pointer
  ↓ (DVC: validate_raw → preprocessing)
data/interim/finished.parquet
data/interim/future.parquet         ← DVC content hash, Git .dvc pointer
  ↓ (DVC: feature_engineering)
data/features/features.parquet      ← DVC content hash, Git .dvc pointer
  ↓ (DVC: train / tune / calibrate)
MLflow experiment run + model artifact
```
A given Git commit, combined with `dvc checkout`, restores the exact dataset state
that produced any registered MLflow run.
## Metadata lookups
`data/metadata/` contains JSON id→name mappings exported at preprocessing:

- `tournamentId.json`, `regionId.json`, `seasonId.json`, `stageId.json`
- `homeTeamId.json`, `awayTeamId.json`
These are used by the serving layer to return human-readable names in prediction responses. They are regenerated on each preprocessing run and are not DVC-tracked independently.
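Loading one of these mappings can be sketched as below. The assumption here (JSON objects keyed by stringified ids) follows standard JSON serialization; the helper name is hypothetical:

```python
import json
from pathlib import Path

def load_lookup(metadata_dir: str, column: str) -> dict:
    """Load one id→name mapping. JSON object keys arrive as strings,
    so re-key the mapping by integer id for lookups against parquet columns."""
    raw = json.loads(Path(metadata_dir, f"{column}.json").read_text())
    return {int(k): v for k, v in raw.items()}
```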
## Schema evolution rules
| Change type | Required action |
|---|---|
| Adding a nullable column to raw export | Update GE validate_raw suite; no breaking change |
| Removing or renaming a column | Breaking change — bump schema version; update downstream stages and GE suites |
| Adding a derived column to interim/features | Update GE suites; check feature code compatibility at inference |
| Changing a column type | Treat as breaking — full pipeline re-run required |
All schema changes to the DVC-tracked parquet files are visible through the Git diff on .dvc
pointer files. Breaking changes must update the relevant GE suite in src/data_quality/
before dvc repro is re-run.
## Related
- Data Contracts — GE suites applied at each stage
- Dataset Versioning — how DVC tracks these files
- Raw Export — where the parquet snapshots originate
- Architecture: Data & ML Flow — stage-by-stage breakdown