Problem Formulation & Targets

Purpose

Define exactly what the model predicts, how targets are constructed, and why ML is justified over simpler approaches. Success criteria are defined in Baseline & Success Metrics.


Objective

Predict the outcome of a football match before it starts, using historical match statistics and contextual data only. The model outputs calibrated class probabilities that can be compared directly with bookmaker implied probabilities.
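
To make the comparison with bookmaker implied probabilities concrete, the sketch below converts a set of odds into probabilities that sum to 1. It assumes decimal odds and proportional removal of the bookmaker margin; neither is stated in this document, and the function name and example odds are hypothetical.

```python
def implied_probabilities(home_odds: float, draw_odds: float, away_odds: float) -> tuple[float, float, float]:
    """Convert decimal odds (assumed format) to implied probabilities summing to 1."""
    # Raw inverse odds usually sum to more than 1 because of the bookmaker margin.
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    overround = sum(raw)
    # Proportional normalisation (an assumption): scale each term by the same factor.
    return tuple(p / overround for p in raw)


# Hypothetical odds for a single match, in class order 0 / 1 / 2.
probs = implied_probabilities(2.10, 3.40, 3.60)
```

The resulting triple follows the same class order as the model output (Home Win, Draw, Away Win), so the two can be compared per class.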


ML task definition

Primary task: 3-class classification.

| Class | Label    | Meaning          |
|-------|----------|------------------|
| 0     | Home Win | Home team wins   |
| 1     | Draw     | Match ends level |
| 2     | Away Win | Away team wins   |

Classes are imbalanced: Home Win is most frequent (~45%), Draw least (~25%). Imbalance is addressed via class weights in training — not oversampling, which would risk leakage across the temporal boundary.
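
As an illustration of class weighting, the sketch below computes inverse-frequency weights with the formula n_samples / (n_classes * class_count), which is the same heuristic scikit-learn uses for "balanced" class weights. This document does not specify the weighting scheme actually used, so treat the formula, the function name, and the toy label distribution as assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy training labels roughly matching the shares quoted above:
# 45 home wins (0), 25 draws (1), 30 away wins (2).
train_labels = [0] * 45 + [1] * 25 + [2] * 30
weights = balanced_class_weights(train_labels)
```

Computing the weights from training-fold labels only keeps the weighting on the correct side of the temporal boundary. The rarest class (Draw) receives the largest weight.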

Secondary task (experimental): Goal difference regression for calibration analysis. Not part of the primary serving path.


Why ML, not a lookup table

Match outcomes depend on time-varying factors that static rules cannot capture:

  • Team form evolves across a season and changes after player transfers.
  • League competitiveness and tactical patterns shift year to year.
  • The relationship between features and outcomes is non-linear and interaction-heavy.
  • Historical base rates vary by league tier and stage of season.

A static rule-based system cannot adapt to these dynamics within a season.


Target construction

Targets are derived from the final match result (homeScore, awayScore in the raw data) and encoded into outcome_1x2 (0 / 1 / 2) during preprocessing.
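
A minimal sketch of this encoding is shown below. The class mapping follows the table in the ML task definition; the function name is hypothetical and the real preprocessing step may be vectorised or structured differently.

```python
def encode_outcome_1x2(home_score: int, away_score: int) -> int:
    """Map a final score to the outcome_1x2 class: 0 = Home Win, 1 = Draw, 2 = Away Win."""
    if home_score > away_score:
        return 0
    if home_score == away_score:
        return 1
    return 2


# Example final scores given as (homeScore, awayScore).
examples = [(2, 1), (1, 1), (0, 3)]
encoded = [encode_outcome_1x2(h, a) for h, a in examples]
```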

Key constraints:

  • Target is computed from data after the match ends.
  • No information from within the match enters the feature set.
  • Features are computed with a strict pre-match cutoff: rolling statistics use shift(1), ensuring match N's features never include match N's result.
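
The shift(1) cutoff can be sketched with pandas as follows. The column names (team, date, goals_scored, goals_avg_last3), the window of 3, and the toy data are assumptions made for illustration; only the use of shift(1) before the rolling aggregate comes from this document.

```python
import pandas as pd

# Toy per-team match log; names and values are illustrative only.
matches = pd.DataFrame({
    "team": ["A", "A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime([
        "2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22",
        "2024-01-02", "2024-01-09", "2024-01-16",
    ]),
    "goals_scored": [2, 0, 3, 1, 1, 1, 4],
})

# Rolling features only make sense in chronological order within each team.
matches = matches.sort_values(["team", "date"]).reset_index(drop=True)

# shift(1) moves every value down one row within the team, so the rolling
# window for match N covers matches N-3..N-1 and never match N itself.
matches["goals_avg_last3"] = (
    matches.groupby("team")["goals_scored"]
    .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)
```

Each team's first match has no history, so its feature is NaN rather than a value derived from its own result. Dropping the shift(1) call would silently include match N's goals in match N's feature.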

Target construction is leakage-free by design. See Validation for how this is enforced and tested.


Train vs. inference asymmetry

| At inference time                 | At training time                |
|-----------------------------------|---------------------------------|
| Match has not yet been played     | Match result is known           |
| Features are pre-match statistics | Target is the actual outcome    |
| No future data is available       | Pipeline enforces temporal split |

This asymmetry is the fundamental reason temporal validation is mandatory. See Validation Strategy.
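
A temporal split can be sketched as a simple date cutoff: every training match precedes every evaluation match. The record layout, the cutoff date, and the function name are hypothetical; the actual scheme is the one described in Validation Strategy.

```python
from datetime import date


def temporal_split(rows, cutoff):
    """Split match records so that all training rows are strictly before the cutoff."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


# Toy match records; only the date matters for the split.
rows = [
    {"match_id": 1, "date": date(2023, 9, 2)},
    {"match_id": 2, "date": date(2023, 12, 16)},
    {"match_id": 3, "date": date(2024, 2, 3)},
    {"match_id": 4, "date": date(2024, 4, 20)},
]
train, test = temporal_split(rows, cutoff=date(2024, 1, 1))
```

Unlike a random split, this mirrors inference: the model is evaluated only on matches played after everything it was trained on.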


Business constraints → ML design

| Business requirement             | ML implication                                         |
|----------------------------------|--------------------------------------------------------|
| Prediction before match starts   | Strict pre-match feature cutoff; no live/in-play data  |
| Works across leagues and seasons | Generalisation evaluated across competition types      |
| Low inference latency            | Tabular model; no heavy preprocessing at inference     |
| Calibrated probabilities         | Log-loss primary metric; ECE evaluated at promotion gate |
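
For reference, the two calibration-related metrics named in the table can be sketched in plain Python. The ECE variant shown is top-label ECE with 10 equal-width confidence bins, which is an assumption; this document does not say which variant or bin count the promotion gate uses. Function names and the toy predictions are illustrative.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when the true class received probability 0.
        total += -math.log(min(max(p[label], eps), 1.0))
    return total / len(y_true)


def expected_calibration_error(y_true, probs, n_bins=10):
    """Top-label ECE: weighted mean of |accuracy - confidence| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, probs):
        confidence = max(p)
        predicted = p.index(confidence)
        # Confidence 1.0 would index past the end, so clamp to the last bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, predicted == label))
    n = len(y_true)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(hit for _, hit in bucket) / len(bucket)
        ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece


# Toy predictions in class order 0 / 1 / 2 (Home Win, Draw, Away Win).
y_true = [0, 1, 2, 0]
probs = [
    [0.60, 0.25, 0.15],
    [0.30, 0.40, 0.30],
    [0.20, 0.30, 0.50],
    [0.30, 0.30, 0.40],
]
ll = log_loss(y_true, probs)
ece = expected_calibration_error(y_true, probs)
```

Log-loss penalises confident wrong predictions heavily, while ECE measures how far stated confidence drifts from observed accuracy, so the two complement each other.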

Out of scope

  • In-play prediction — requires real-time data stream; not in scope.
  • Betting strategy optimisation — the system predicts outcomes, not optimal wagers.
  • Financial modelling — no expected value or Kelly criterion computations.
  • Player-level features — injury, transfer, and individual player data; planned future improvement.