Skip to content

Validation Strategy & Leakage Prevention

Why validation is critical

In time-dependent domains like sports, random splits lead to severe data leakage and inflated metrics.


Validation approach

The system uses time-aware validation:

  • train on historical seasons,
  • validate on future matches,
  • optionally use rolling or expanding windows.

Dataset splits

Typical splits include: - train set: past seasons, - validation set: recent completed matches, - test set: held-out future period.

Splits are: - materialized as datasets, - versioned with DVC, - reused across experiments.


Leakage prevention rules

  • no features derived from post-match information,
  • no aggregation windows crossing prediction timestamp,
  • no target-derived features.

Violations are treated as critical bugs.


Metric reporting

Metrics are reported: - per split, - per competition where applicable, - with confidence intervals where possible.

This avoids single-number overfitting.