Validation Strategy & Leakage Prevention¶
Why validation is critical¶
In time-dependent domains like sports, random splits lead to severe data leakage and inflated metrics.
Validation approach¶
The system uses time-aware validation:
- train on historical seasons,
- validate on future matches,
- optionally use rolling or expanding windows.
Dataset splits¶
Typical splits include: - train set: past seasons, - validation set: recent completed matches, - test set: held-out future period.
Splits are: - materialized as datasets, - versioned with DVC, - reused across experiments.
Leakage prevention rules¶
- no features derived from post-match information,
- no aggregation windows crossing prediction timestamp,
- no target-derived features.
Violations are treated as critical bugs.
Metric reporting¶
Metrics are reported: - per split, - per competition where applicable, - with confidence intervals where possible.
This avoids single-number overfitting.