Validation Strategy & Leakage Prevention

Purpose

Document the temporal validation discipline, leakage prevention rules, and how they are enforced in code and tested. This is one of the most critical design constraints in the system.


Why temporal validation is mandatory

Football match data is strictly ordered in time. Using random cross-validation is a systematic error: a random split allows future matches to appear in the training set, producing inflated metrics that do not reflect real performance. This is not theoretical — it is a structural bias that makes models appear better than they are.

Rule: train on the past, evaluate on the future. Nothing else is valid.


Why random CV fails for this task

Consider a random 80/20 split on matches from 2018–2024:

  • The training set contains matches from January 2023.
  • The test set contains matches from December 2022.
  • The model is therefore trained on matches played after some test matches, and rolling form features in training rows can encode the outcomes of those "future" matches.
  • Result: inflated accuracy; the model is fit to temporal artifacts that do not exist at prediction time.

Random CV also fails because team form, league competitiveness, and tactical patterns drift across seasons. A model that scores well on a random holdout may fail on next season's data.
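The failure mode is easy to demonstrate. The sketch below builds a synthetic weekly match calendar and applies a random 80/20 split; all values are made up for illustration:

```python
# Tiny demonstration: a random 80/20 split of date-ordered matches almost
# always puts some training matches AFTER the earliest test match.
import datetime as dt
import random

random.seed(0)  # deterministic for the example
dates = [dt.date(2018, 1, 1) + dt.timedelta(weeks=i) for i in range(300)]
random.shuffle(dates)
train, test = dates[:240], dates[240:]  # "random 80/20 split"

# True: the model would be trained on matches played after test matches
leaks = max(train) > min(test)
```

A temporal split makes `leaks` structurally impossible, which is the point of the holdout design described below.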


Validation approach

Temporal holdout split

Time ──────────────────────────────────────────────────────►
│  Training region            │  Holdout (test set)        │
│  matches before test_start  │  matches from test_start   │
│─────────────────────────────│────────────────────────────│
                        temporal.test_start
                        (defined in params.yaml)
  • Train: all matches before temporal.test_start
  • Test (holdout): all matches from temporal.test_start onward

The test set is touched once, at final evaluation only. No hyperparameter decisions are made using the test set.
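A minimal sketch of this split, assuming a pandas DataFrame with a startTimeUtc column; the helper and the sample rows are illustrative, not the project's split_data implementation:

```python
import pandas as pd

def temporal_split(df: pd.DataFrame, test_start: str):
    """Train on matches strictly before test_start, test on the rest."""
    cutoff = pd.Timestamp(test_start)
    train = df[df["startTimeUtc"] < cutoff]
    test = df[df["startTimeUtc"] >= cutoff]
    return train, test

matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "startTimeUtc": pd.to_datetime(
        ["2023-05-01", "2023-11-20", "2024-01-01", "2024-03-15"]
    ),
})
train, test = temporal_split(matches, "2024-01-01")
# train holds matches 1-2; test holds matches 3-4
```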

Walk-forward CV folds

For hyperparameter tuning, walk-forward cross-validation is used. Folds are generated between temporal.folds_start_year and temporal.folds_end_year, where each fold's validation window is one year ahead of the training window:

Fold N: Train [folds_start_year … year N] → Val [year N+1]

All fold boundaries are startTimeUtc-based. No match ever appears in both a train and validation window within the same fold.
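The fold scheme can be sketched as a small generator. This follows the Train [folds_start_year … year N] → Val [year N+1] rule stated above; the function itself is illustrative, not the project's fold-generation code:

```python
def walk_forward_folds(folds_start_year: int, folds_end_year: int):
    """Yield (train_years, val_year) pairs; folds_end_year is exclusive."""
    for val_year in range(folds_start_year + 1, folds_end_year):
        train_years = list(range(folds_start_year, val_year))
        yield train_years, val_year

folds = list(walk_forward_folds(2016, 2020))
# [([2016], 2017), ([2016, 2017], 2018), ([2016, 2017, 2018], 2019)]
```

Every validation year sits strictly after its training years, so no fold can train on its own future.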


Leakage prevention rules

The following rules are enforced in code and verified by tests:

  • No features derived from post-match information: feature code has no access to outcome_1x2 during computation.
  • Rolling aggregations respect the prediction cutoff: shift(1) is applied before rolling, so match N's window contains only matches N−1, N−2, …
  • ELO ratings are pre-match only: the ELO values attached to a match are computed before that match's rating update step.
  • The split is time-based, not random: the split_data DVC stage uses temporal.test_start from params.yaml.
  • Walk-forward CV folds are non-leaking: future data never appears in the training window of any fold.
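The shift(1)-before-rolling rule above can be shown on a toy series (values are made up; the real feature code lives in src/features/stats_matches.py):

```python
import pandas as pd

goals = pd.Series([2, 0, 3, 1], name="goals_scored")
# shift(1) first, then roll: row N aggregates strictly earlier rows only
form = goals.shift(1).rolling(window=2, min_periods=1).mean()
# form: [NaN, 2.0, 1.0, 1.5] — row 0 has no history; row 3 averages rows 1-2
```

Rolling first and shifting afterwards would produce the same values here, but applying shift(1) first makes the "no same-match data" guarantee explicit at the point where the window is built.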

Leakage is treated as a critical bug, not a metric degradation.
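The pre-match ELO rule can be illustrated with a toy rating step. The K-factor, update formula, and function names here are assumptions for illustration, not the project's src/features/elo.py:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard logistic ELO expectation for player/team A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def attach_and_update(ratings, home, away, home_score, k=20.0):
    # 1) snapshot PRE-match ratings — these are the feature values
    pre_home, pre_away = ratings[home], ratings[away]
    # 2) only afterwards apply the post-match rating update
    exp_home = expected_score(pre_home, pre_away)
    ratings[home] = pre_home + k * (home_score - exp_home)
    ratings[away] = pre_away + k * ((1 - home_score) - (1 - exp_home))
    return pre_home, pre_away  # features never see the updated values

ratings = {"A": 1500.0, "B": 1500.0}
features = attach_and_update(ratings, "A", "B", home_score=1.0)
# features == (1500.0, 1500.0); ratings now reflect the result
```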


How leakage prevention is tested

Property-based tests using hypothesis verify that rolling features for match N use only data from matches before N. These run as part of pytest tests/property/ and are required to pass in CI.
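A hedged sketch of such a property, in the spirit of tests/property/ (the test name and window size are illustrative): corrupting match N and everything after it must not change the feature value at N.

```python
import pandas as pd
from hypothesis import given, strategies as st

@given(st.lists(st.integers(min_value=0, max_value=100), min_size=1, max_size=15))
def test_rolling_feature_ignores_future(values):
    s = pd.Series(values, dtype="float64")
    feature = s.shift(1).rolling(window=3, min_periods=1).mean()
    for n in range(len(s)):
        corrupted = s.copy()
        corrupted.iloc[n:] = 1e9  # overwrite match n and everything after
        redo = corrupted.shift(1).rolling(window=3, min_periods=1).mean()
        a, b = feature.iloc[n], redo.iloc[n]
        # identical up to position n => the feature uses only strictly-past data
        assert (pd.isna(a) and pd.isna(b)) or a == b
```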


Split artifacts

Splits are materialised as DVC-tracked Parquet files under data/splits/:

data/splits/
├── train_ids.parquet     # row IDs assigned to training
├── test_ids.parquet      # row IDs assigned to holdout test
└── folds.parquet         # walk-forward CV fold boundaries (start/end timestamps)

The joined feature dataset is at data/processed/dataset.parquet.

Split parameters are defined in params.yaml under the temporal: key:

temporal:
  test_start: "2024-01-01"       # first date of the holdout set
  folds_start_year: 2016         # start of the fold-generation range
  folds_end_year: 2024           # exclusive upper bound for fold generation

Changing test_start or fold boundaries triggers all downstream DVC stages automatically.
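This works because the split stage declares the temporal parameters as DVC params. A dvc.yaml sketch of such wiring is shown below; the stage command and dependency layout are assumptions, only the paths and parameter names come from this document:

```yaml
stages:
  split_data:
    cmd: python -m src.split_data        # assumed entry point
    params:
      - temporal.test_start
      - temporal.folds_start_year
      - temporal.folds_end_year
    deps:
      - data/processed/dataset.parquet
    outs:
      - data/splits/train_ids.parquet
      - data/splits/test_ids.parquet
      - data/splits/folds.parquet
```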


Metric reporting

Metrics are reported:

  • On the test set only, for final model comparison.
  • With a class-level breakdown (precision/recall per outcome), not as a single average.

No single-number "average accuracy" is reported without context — this hides the effect of class imbalance on evaluation.

Per-competition breakdown is planned but not yet implemented.


Implementation status

  • Temporal train/test split: ✅ Implemented (split_data DVC stage)
  • Walk-forward CV folds: ✅ Implemented (folds.parquet)
  • shift(1) in rolling features: ✅ Implemented (src/features/stats_matches.py)
  • ELO pre-match gate: ✅ Implemented (src/features/elo.py)
  • Property tests for leakage: ✅ Implemented (tests/property/)
  • Split params in params.yaml: ✅ Implemented
  • DVC-tracked split artifacts: ✅ Implemented
  • Per-competition metric breakdown: 📋 Planned