Validation Strategy & Leakage Prevention

Purpose

Document the temporal validation discipline, leakage prevention rules, and how they are enforced in code and tested. This is one of the most critical design constraints in the system.


Why temporal validation is mandatory

Football match data is strictly ordered in time. Using random cross-validation is a systematic error: a random split allows future matches to appear in the training set, producing inflated metrics that do not reflect real performance. This is not theoretical — it is a structural bias that makes models appear better than they are.

Rule: train on the past, evaluate on the future. Nothing else is valid.


Why random CV fails for this task

Consider a random 80/20 split on matches from 2018–2024:

  • The training set contains matches from January 2023.
  • The test set contains matches from December 2022.
  • The model is therefore trained on matches played after some test matches, and rolling form features in training rows can encode the outcomes of those "future" matches.
  • Result: inflated accuracy; the model is fit to temporal artifacts that do not exist at prediction time.

Random CV also fails because team form, league competitiveness, and tactical patterns drift across seasons. A model that scores well on a random holdout may fail on next season's data.
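The failure mode is easy to demonstrate. The sketch below builds a synthetic weekly match calendar and applies a random 80/20 split; all values are made up for illustration:

```python
# Tiny demonstration: a random 80/20 split of date-ordered matches almost
# always puts some training matches AFTER the earliest test match.
import datetime as dt
import random

random.seed(0)  # deterministic for the example
dates = [dt.date(2018, 1, 1) + dt.timedelta(weeks=i) for i in range(300)]
random.shuffle(dates)
train, test = dates[:240], dates[240:]  # "random 80/20 split"

# True: the model would be trained on matches played after test matches
leaks = max(train) > min(test)
```

A temporal split makes `leaks` structurally impossible, which is the point of the holdout design described below.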


Validation approach

Temporal holdout split

Time ──────────────────────────────────────────────────────►
│  Training region            │  Holdout (test set)        │
│  matches before test_start  │  matches from test_start   │
│─────────────────────────────│────────────────────────────│
                        temporal.test_start
                        (defined in params.yaml)
  • Train: all matches before temporal.test_start
  • Test (holdout): all matches from temporal.test_start onward

The test set is touched once, at final evaluation only. No hyperparameter decisions are made using the test set.
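A minimal sketch of this split, assuming a pandas DataFrame with a startTimeUtc column; the helper and the sample rows are illustrative, not the project's split_data implementation:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, test_start: str):
    """Train on matches strictly before test_start, test on the rest."""
    cutoff = pd.Timestamp(test_start)
    train = df[df["startTimeUtc"] < cutoff]
    test = df[df["startTimeUtc"] >= cutoff]
    return train, test

matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "startTimeUtc": pd.to_datetime(
        ["2023-05-01", "2023-11-20", "2024-01-01", "2024-03-15"]
    ),
})
train, test = temporal_split(matches, "2024-01-01")
# train holds matches 1-2; test holds matches 3-4
```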

Walk-forward CV folds

For hyperparameter tuning, walk-forward cross-validation is used. Folds are generated between temporal.folds_start_year and temporal.folds_end_year, where each fold's validation window is one year ahead of the training window:

Fold N: Train [folds_start_year … year N] → Val [year N+1]

All fold boundaries are startTimeUtc-based. No match ever appears in both a train and validation window within the same fold.
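The fold scheme can be sketched as a small generator. This follows the Train [folds_start_year … year N] → Val [year N+1] rule stated above; the function itself is illustrative, not the project's fold-generation code:

```python
def walk_forward_folds(folds_start_year: int, folds_end_year: int):
    """Yield (train_years, val_year) pairs; folds_end_year is exclusive."""
    for val_year in range(folds_start_year + 1, folds_end_year):
        train_years = list(range(folds_start_year, val_year))
        yield train_years, val_year

folds = list(walk_forward_folds(2016, 2020))
# [([2016], 2017), ([2016, 2017], 2018), ([2016, 2017, 2018], 2019)]
```

Every validation year sits strictly after its training years, so no fold can train on its own future.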


Leakage prevention rules

The following rules are enforced in code and verified by tests:

  • No features derived from post-match information: feature code has no access to outcome_1x2 during computation.
  • Rolling aggregations respect the prediction cutoff: shift(1) is applied before rolling, so match N's window contains only matches N−1, N−2, …
  • ELO ratings are pre-match only: the ELO values attached to a match are computed before that match's rating update step.
  • The split is time-based, not random: the split_data DVC stage uses temporal.test_start from params.yaml.
  • Walk-forward CV folds are non-leaking: future data never appears in the training window of any fold.
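The shift(1)-before-rolling rule above can be shown on a toy series (values are made up; the real feature code lives in src/features/stats_matches.py):

```python
import pandas as pd

goals = pd.Series([2, 0, 3, 1], name="goals_scored")
# shift(1) first, then roll: row N aggregates strictly earlier rows only
form = goals.shift(1).rolling(window=2, min_periods=1).mean()
# form: [NaN, 2.0, 1.0, 1.5] — row 0 has no history; row 3 averages rows 1-2
```

Rolling first and shifting afterwards would produce the same values here, but applying shift(1) first makes the "no same-match data" guarantee explicit at the point where the window is built.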

Leakage is treated as a critical bug, not a metric degradation.
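The pre-match ELO rule can be illustrated with a toy rating step. The K-factor, update formula, and function names here are assumptions for illustration, not the project's src/features/elo.py:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard logistic ELO expectation for player/team A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def attach_and_update(ratings, home, away, home_score, k=20.0):
    # 1) snapshot PRE-match ratings — these are the feature values
    pre_home, pre_away = ratings[home], ratings[away]
    # 2) only afterwards apply the post-match rating update
    exp_home = expected_score(pre_home, pre_away)
    ratings[home] = pre_home + k * (home_score - exp_home)
    ratings[away] = pre_away + k * ((1 - home_score) - (1 - exp_home))
    return pre_home, pre_away  # features never see the updated values

ratings = {"A": 1500.0, "B": 1500.0}
features = attach_and_update(ratings, "A", "B", home_score=1.0)
# features == (1500.0, 1500.0); ratings now reflect the result
```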


How leakage prevention is tested

Property-based tests using hypothesis verify that rolling features for match N use only data from matches before N. These run as part of pytest tests/property/ and are required to pass in CI.
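A hedged sketch of such a property, in the spirit of tests/property/ (the test name and window size are illustrative): corrupting match N and everything after it must not change the feature value at N.

```python
import pandas as pd
from hypothesis import given, strategies as st

@given(st.lists(st.integers(min_value=0, max_value=100), min_size=1, max_size=15))
def test_rolling_feature_ignores_future(values):
    s = pd.Series(values, dtype="float64")
    feature = s.shift(1).rolling(window=3, min_periods=1).mean()
    for n in range(len(s)):
        corrupted = s.copy()
        corrupted.iloc[n:] = 1e9  # overwrite match n and everything after
        redo = corrupted.shift(1).rolling(window=3, min_periods=1).mean()
        a, b = feature.iloc[n], redo.iloc[n]
        # identical up to position n => the feature uses only strictly-past data
        assert (pd.isna(a) and pd.isna(b)) or a == b
```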


Split artifacts

Splits are materialised as DVC-tracked Parquet files under data/splits/:

data/splits/
├── train_ids.parquet     # row IDs assigned to training
├── test_ids.parquet      # row IDs assigned to holdout test
└── folds.parquet         # walk-forward CV fold boundaries (start/end timestamps)

The joined feature dataset is at data/processed/dataset.parquet.

Split parameters are defined in params.yaml under the temporal: key:

temporal:
  test_start: "2024-01-01"       # first date of the holdout set
  folds_start_year: 2016         # start of the fold-generation range
  folds_end_year: 2024           # exclusive upper bound for fold generation

Changing test_start or fold boundaries triggers all downstream DVC stages automatically.
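This works because the split stage declares the temporal parameters as DVC params. A dvc.yaml sketch of such wiring is shown below; the stage command and dependency layout are assumptions, only the paths and parameter names come from this document:

```yaml
stages:
  split_data:
    cmd: python -m src.split_data        # assumed entry point
    params:
      - temporal.test_start
      - temporal.folds_start_year
      - temporal.folds_end_year
    deps:
      - data/processed/dataset.parquet
    outs:
      - data/splits/train_ids.parquet
      - data/splits/test_ids.parquet
      - data/splits/folds.parquet
```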


Metric reporting

Metrics are reported:

  • On the test set only, for final model comparison.
  • With a class-level breakdown (precision/recall per outcome), not as a single average.

No single-number "average accuracy" is reported without context — this hides the effect of class imbalance on evaluation.

Per-competition breakdown is planned but not yet implemented.


Implementation status

  • Temporal train/test split: ✅ Implemented (split_data DVC stage)
  • Walk-forward CV folds: ✅ Implemented (folds.parquet)
  • shift(1) in rolling features: ✅ Implemented (src/features/stats_matches.py)
  • ELO pre-match gate: ✅ Implemented (src/features/elo.py)
  • Property tests for leakage: ✅ Implemented (tests/property/)
  • Split params in params.yaml: ✅ Implemented
  • DVC-tracked split artifacts: ✅ Implemented
  • Per-competition metric breakdown: 📋 Planned