Validation Strategy & Leakage Prevention¶
Purpose¶
Document the temporal validation discipline, leakage prevention rules, and how they are enforced in code and tested. This is one of the most critical design constraints in the system.
Why temporal validation is mandatory¶
Football match data is strictly ordered in time. Using random cross-validation is a systematic error: a random split allows future matches to appear in the training set, producing inflated metrics that do not reflect real performance. This is not theoretical — it is a structural bias that makes models appear better than they are.
Rule: train on the past, evaluate on the future. Nothing else is valid.
Why random CV fails for this task¶
Consider a random 80/20 split on matches from 2018–2024:
- The training set contains matches from January 2023.
- The test set contains matches from December 2022, so the model is trained on games played after the ones it is evaluated on.
- Rolling form features can therefore encode information from a future match.
- Result: inflated accuracy; the model is fit to temporal artifacts rather than genuine skill.
Random CV also fails because team form, league competitiveness, and tactical patterns drift across seasons. A model that scores well on a random holdout may fail on next season's data.
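A toy sketch of the failure mode, using a fixed shuffle in place of a seeded random split so the outcome is reproducible (column names are illustrative, not the project's schema):

```python
import pandas as pd

# Ten chronologically ordered matches.
matches = pd.DataFrame({
    "match_id": range(10),
    "start_time": pd.date_range("2022-10-02", periods=10, freq="W"),
})

# A "random" 70/30 split: shuffled index, fixed here for reproducibility.
shuffled = [3, 8, 1, 6, 0, 9, 4, 2, 7, 5]
train = matches.iloc[shuffled[:7]]
test = matches.iloc[shuffled[7:]]

# Training now contains match 9 (the latest fixture) while the test set
# holds matches 2, 7 and 5 -- earlier fixtures the model is "predicting"
# after having already seen their future.
assert train["start_time"].max() > test["start_time"].min()
```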
Validation approach¶
Temporal holdout split¶
Time ──────────────────────────────────────────────────────►
│ Training region │ Holdout (test set) │
│ matches before test_start │ matches from test_start │
│─────────────────────────────│────────────────────────────│
▲
temporal.test_start
(defined in params.yaml)
- Train: all matches before temporal.test_start
- Test (holdout): all matches from temporal.test_start onward
The test set is touched once, at final evaluation only. No hyperparameter decisions are made using the test set.
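A minimal sketch of the holdout rule, assuming a startTimeUtc column as used for the fold boundaries below (the actual split_data implementation may differ):

```python
import pandas as pd

test_start = pd.Timestamp("2024-01-01")  # temporal.test_start from params.yaml

matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "startTimeUtc": pd.to_datetime(
        ["2023-05-01", "2023-12-30", "2024-01-01", "2024-03-15"]
    ),
})

# Strictly time-based: everything before test_start trains,
# everything from test_start onward is held out.
train = matches[matches["startTimeUtc"] < test_start]
test = matches[matches["startTimeUtc"] >= test_start]

print(train["match_id"].tolist())  # [1, 2]
print(test["match_id"].tolist())   # [3, 4]
```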
Walk-forward CV folds¶
For hyperparameter tuning, walk-forward cross-validation is used.
Folds are generated between temporal.folds_start_year and temporal.folds_end_year,
where each fold's validation window is one year ahead of its training window.
All fold boundaries are startTimeUtc-based. No match ever appears in both the train and
validation windows of the same fold.
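A hedged sketch of the fold generator; the boundary convention (train on years strictly before the validation year) is assumed rather than copied from the repository:

```python
from dataclasses import dataclass

@dataclass
class Fold:
    train_end: int  # exclusive: train on matches from years < train_end
    val_year: int   # validate on matches from this single year

def walk_forward_folds(start_year: int, end_year: int) -> list[Fold]:
    """One fold per validation year in [start_year, end_year):
    the training window always ends where the validation year begins,
    so no future match can enter a training window."""
    return [Fold(train_end=y, val_year=y) for y in range(start_year, end_year)]

folds = walk_forward_folds(2016, 2019)
print(folds[0])  # Fold(train_end=2016, val_year=2016)
```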
Leakage prevention rules¶
The following rules are enforced in code and verified by tests:
| Rule | How enforced |
|---|---|
| No features derived from post-match information | Feature code has no access to outcome_1x2 during computation |
| Rolling aggregations respect the prediction cutoff | shift(1) applied before rolling — match N's window contains only matches N−1, N−2, … |
| ELO ratings are pre-match only | ELO values attached are computed before the match update step |
| Split is time-based, not random | split_data DVC stage uses temporal.test_start from params.yaml |
| Walk-forward CV folds are non-leaking | Future data never in training window of any fold |
Leakage is treated as a critical bug, not a metric degradation.
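The shift(1)-before-rolling rule from the table can be seen in a minimal sketch (toy data; the project's actual feature code lives in src/features/stats_matches.py):

```python
import pandas as pd

goals = pd.Series([2, 0, 3, 1], name="goals_scored")  # chronological order

# Leaky: the window for match N includes match N itself.
leaky = goals.rolling(2, min_periods=1).mean()

# Safe: shift(1) first, so match N's window holds only matches N-1, N-2, ...
safe = goals.shift(1).rolling(2, min_periods=1).mean()

print(safe.tolist())  # [nan, 2.0, 1.0, 1.5]
```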
How leakage prevention is tested¶
Property-based tests using hypothesis verify that rolling features for match N
use only data from matches before N. These run as part of pytest tests/property/
and are required to pass in CI.
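A property test in this spirit might look like the following sketch; function and test names are illustrative, not the repository's actual tests:

```python
import math
import pandas as pd
from hypothesis import given, strategies as st

def rolling_form(goals: pd.Series, window: int = 3) -> pd.Series:
    """Pre-match rolling mean: shift(1) keeps match N out of its own window."""
    return goals.shift(1).rolling(window, min_periods=1).mean()

@given(st.lists(st.integers(min_value=0, max_value=9), min_size=1, max_size=20))
def test_match_n_ignores_future(goal_list):
    goals = pd.Series(goal_list, dtype=float)
    feature = rolling_form(goals)
    for n in range(len(goals)):
        # Overwriting every match from N onward must not change
        # match N's feature value.
        mutated = goals.copy()
        mutated.iloc[n:] = 99.0
        f2 = rolling_form(mutated)
        assert math.isnan(feature.iloc[n]) or feature.iloc[n] == f2.iloc[n]
```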
Split artifacts¶
Splits are materialised as DVC-tracked Parquet files under data/splits/:
data/splits/
├── train_ids.parquet # row IDs assigned to training
├── test_ids.parquet # row IDs assigned to holdout test
└── folds.parquet # walk-forward CV fold boundaries (start/end timestamps)
The joined feature dataset is at data/processed/dataset.parquet.
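Consuming the artifacts is then a membership join against the dataset. A sketch with in-memory stand-ins (the ID column name row_id is an assumption; in practice the frames would come from pd.read_parquet on the paths above):

```python
import pandas as pd

# Stand-ins for dataset.parquet and the DVC-tracked split artifacts.
dataset = pd.DataFrame({
    "row_id": [10, 11, 12, 13],
    "feature": [0.1, 0.4, 0.2, 0.9],
})
train_ids = pd.DataFrame({"row_id": [10, 11, 12]})
test_ids = pd.DataFrame({"row_id": [13]})

# Select rows by ID membership rather than re-deriving the split,
# so every stage uses exactly the same train/test assignment.
train = dataset[dataset["row_id"].isin(train_ids["row_id"])]
test = dataset[dataset["row_id"].isin(test_ids["row_id"])]

print(len(train), len(test))  # 3 1
```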
Split parameters are defined in params.yaml under the temporal: key:
temporal:
  test_start: "2024-01-01"   # first date of the holdout set
  folds_start_year: 2016     # first year used as a validation fold
  folds_end_year: 2024       # exclusive upper bound for fold generation
Changing test_start or fold boundaries triggers all downstream DVC stages automatically.
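This propagation works because the split stage declares the temporal keys as DVC params. A hedged dvc.yaml fragment, roughly in this shape (the stage command and output path are assumptions):

```yaml
stages:
  split_data:
    cmd: python -m src.split_data   # assumed entry point
    params:
      - temporal.test_start         # read from params.yaml by default
      - temporal.folds_start_year
      - temporal.folds_end_year
    outs:
      - data/splits
```

Because the keys are listed under params, editing any of them in params.yaml invalidates this stage and everything downstream of data/splits.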
Metric reporting¶
Metrics are reported:
- On the test set only, for final model comparison.
- With a class-level breakdown (precision/recall per outcome), not as a single average.
No single-number "average accuracy" is reported without context — this hides the effect of class imbalance on evaluation.
Per-competition breakdown is planned but not yet implemented.
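As a sketch of the class-level breakdown, scikit-learn's classification_report produces per-outcome precision and recall (the H/D/A labels here are illustrative, not necessarily the project's encoding):

```python
from sklearn.metrics import classification_report

# Toy 1X2 outcomes: home win (H), draw (D), away win (A).
y_true = ["H", "D", "A", "H", "H", "A"]
y_pred = ["H", "H", "A", "H", "D", "A"]

# Per-class precision/recall instead of a single averaged number,
# which would hide how the model handles the minority "draw" class.
print(classification_report(y_true, y_pred, labels=["H", "D", "A"],
                            zero_division=0))
```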
Implementation status¶
| Aspect | Status |
|---|---|
| Temporal train/test split | ✅ Implemented — split_data DVC stage |
| Walk-forward CV folds | ✅ Implemented — folds.parquet |
| shift(1) in rolling features | ✅ Implemented — src/features/stats_matches.py |
| ELO pre-match gate | ✅ Implemented — src/features/elo.py |
| Property tests for leakage | ✅ Implemented — tests/property/ |
| Split params in params.yaml | ✅ Implemented |
| DVC-tracked split artifacts | ✅ Implemented |
| Per-competition metric breakdown | 📋 Planned |