Problem Formulation & Targets
Purpose
Define exactly what the model predicts, how targets are constructed, and why ML is justified over simpler approaches. Success criteria are defined in Baseline & Success Metrics.
Objective
Predict the outcome of a football match before it starts, using historical match statistics and contextual data only. The prediction produces calibrated class probabilities that are comparable to bookmaker implied probabilities.
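For the comparison with bookmaker prices, the quoted odds first have to be turned into probabilities that sum to 1. The sketch below assumes decimal odds and removes the bookmaker margin by proportional normalisation; the source does not specify either choice, and the function name `implied_probabilities` and the example odds are hypothetical.

```python
def implied_probabilities(home_odds: float, draw_odds: float, away_odds: float) -> list[float]:
    """Convert decimal 1X2 odds into probabilities that sum to 1.

    Assumption: decimal odds, margin removed by proportional normalisation.
    """
    # Raw inverse odds; their sum exceeds 1 when a bookmaker margin is included.
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)
    # Rescale so the three outcomes form a proper probability distribution.
    return [p / overround for p in raw]


# Illustrative odds only; order matches the class labels (home, draw, away).
probs = implied_probabilities(2.10, 3.40, 3.60)
```

The resulting list is ordered like the model's class probabilities, so the two can be compared class by class.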
ML task definition
Primary task: 3-class classification.
| Class | Label | Meaning |
|---|---|---|
| 0 | Home Win | Home team wins |
| 1 | Draw | Match ends level |
| 2 | Away Win | Away team wins |
Classes are imbalanced: Home Win is the most frequent class (~45%) and Draw the least frequent (~25%). Imbalance is addressed via class weights in training rather than oversampling, which would risk leakage across the temporal boundary.
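One common way to derive class weights is the inverse-frequency ("balanced") heuristic, which is the same formula scikit-learn uses for `class_weight="balanced"`. The sketch below assumes that scheme; the source only states that class weights are used, not how they are computed. The label counts are illustrative, with the Away Win share taken as the remainder of the approximate figures above.

```python
from collections import Counter


def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Illustrative distribution: 45 home wins, 25 draws, 30 away wins (assumed remainder).
labels = [0] * 45 + [1] * 25 + [2] * 30
weights = balanced_class_weights(labels)
```

Rarer classes receive larger weights, so Draw is weighted most heavily here. No rows are duplicated, so the temporal ordering of the data is left untouched.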
Secondary task (experimental): Goal difference regression for calibration analysis. Not part of the primary serving path.
Why ML, not a lookup table
Match outcomes depend on time-varying factors that static rules cannot capture:
- Team form evolves across a season and changes after player transfers.
- League competitiveness and tactical patterns shift year to year.
- The relationship between features and outcomes is non-linear and interaction-heavy.
- Historical base rates vary by league tier and stage of season.
A static rule-based system cannot adapt to these dynamics within a season.
Target construction
Targets are derived from the final match result (homeScore, awayScore in the raw data)
and encoded into outcome_1x2 (0 / 1 / 2) during preprocessing.
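A minimal sketch of this encoding, following the class table above. The function name `encode_outcome_1x2` is hypothetical; only the field names homeScore and awayScore and the target name outcome_1x2 come from the source.

```python
def encode_outcome_1x2(home_score: int, away_score: int) -> int:
    """Map a final score to the 3-class target: 0 = Home Win, 1 = Draw, 2 = Away Win."""
    if home_score > away_score:
        return 0
    if home_score == away_score:
        return 1
    return 2


# Example raw records using the field names from the raw data.
raw_matches = [
    {"homeScore": 2, "awayScore": 1},
    {"homeScore": 0, "awayScore": 0},
    {"homeScore": 1, "awayScore": 3},
]
targets = [encode_outcome_1x2(m["homeScore"], m["awayScore"]) for m in raw_matches]
```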
Key constraints:
- The target is computed only from the final score, which becomes available after the match ends.
- No information from within the match enters the feature set.
- Features are computed with a strict pre-match cutoff: rolling statistics use shift(1),
ensuring match N's features never include match N's result.
Target construction is leakage-free by design. See Validation for how this is enforced and tested.
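The shift(1) cutoff can be illustrated with pandas. This is a sketch under assumptions: the column names (`team`, `match_date`, `goals_scored`, `goals_avg_last3`), the window of 3 matches, and the toy data are all hypothetical and are not taken from the project's pipeline.

```python
import pandas as pd

# Toy per-team match log, already in chronological order within each team.
matches = pd.DataFrame(
    {
        "team": ["A", "A", "A", "A", "B", "B", "B"],
        "match_date": pd.to_datetime(
            ["2024-01-06", "2024-01-13", "2024-01-20", "2024-01-27",
             "2024-01-06", "2024-01-13", "2024-01-20"]
        ),
        "goals_scored": [2, 0, 3, 1, 1, 1, 4],
    }
).sort_values(["team", "match_date"])

# shift(1) moves each value down one row within the team, so the rolling
# window for match N only sees matches N-1, N-2, ... and never match N itself.
matches["goals_avg_last3"] = matches.groupby("team")["goals_scored"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)
```

Each team's first match has no history, so its feature is missing (NaN). Without the shift, the rolling mean for match N would include match N's own goals.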
Train vs. inference asymmetry
| At inference time | At training time |
|---|---|
| Match has not yet been played | Match result is known |
| Features are pre-match statistics | Target is the actual outcome |
| No future data is available | Pipeline enforces temporal split |
This asymmetry is the fundamental reason temporal validation is mandatory. See Validation Strategy.
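A temporal split in its simplest form uses a single date cutoff, as sketched below. The single-cutoff scheme, the `match_date` key, and the function name `temporal_split` are assumptions made for illustration; the scheme the project actually uses is the one described in Validation Strategy.

```python
from datetime import date


def temporal_split(rows: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Train on matches strictly before the cutoff, validate on matches on or after it."""
    ordered = sorted(rows, key=lambda r: r["match_date"])
    train = [r for r in ordered if r["match_date"] < cutoff]
    valid = [r for r in ordered if r["match_date"] >= cutoff]
    return train, valid


rows = [
    {"match_date": date(2024, 2, 10), "outcome_1x2": 1},
    {"match_date": date(2023, 11, 4), "outcome_1x2": 0},
    {"match_date": date(2024, 1, 1), "outcome_1x2": 2},
    {"match_date": date(2023, 12, 16), "outcome_1x2": 0},
]
train, valid = temporal_split(rows, cutoff=date(2024, 1, 1))
```

Unlike a random split, every validation match is later than every training match, which mirrors the inference-time situation where no future data exists.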
Business constraints → ML design
| Business requirement | ML implication |
|---|---|
| Prediction before match starts | Strict pre-match feature cutoff; no live/in-play data |
| Works across leagues and seasons | Generalisation evaluated across competition types |
| Low inference latency | Tabular model; no heavy preprocessing at inference |
| Calibrated probabilities | Log-loss as primary metric; expected calibration error (ECE) evaluated at promotion gate |
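The two calibration-related metrics in the last row can be sketched in plain Python. The ECE variant shown is the top-label form with equal-width confidence bins; the binning scheme, the number of bins, and both function names are assumptions, since the source does not say which ECE definition the promotion gate uses.

```python
import math


def multiclass_log_loss(y_true: list[int], probs: list[list[float]], eps: float = 1e-15) -> float:
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when the true class received zero probability.
        total += -math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


def expected_calibration_error(y_true: list[int], probs: list[list[float]], n_bins: int = 10) -> float:
    """Top-label ECE: sample-weighted mean of |accuracy - confidence| per confidence bin."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, probs):
        confidence = max(p)
        predicted = p.index(confidence)
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, predicted == label))
    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Illustrative predictions for four matches (home, draw, away probabilities).
y_true = [0, 1, 2, 0]
probs = [[0.6, 0.25, 0.15], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5], [0.5, 0.3, 0.2]]
loss = multiclass_log_loss(y_true, probs)
ece = expected_calibration_error(y_true, probs)
```

Log-loss rewards putting probability mass on the true class, while ECE checks whether stated confidence matches observed accuracy, so a model can score well on one and poorly on the other.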
Out of scope
- In-play prediction — requires real-time data stream; not in scope.
- Betting strategy optimisation — the system predicts outcomes, not optimal wagers.
- Financial modelling — no expected value or Kelly criterion computations.
- Player-level features — injury, transfer, and individual player data; planned future improvement.