Problem Formulation & Targets

Purpose

Define exactly what the model predicts, how targets are constructed, and why ML is justified over simpler approaches. Success criteria are defined in Baseline & Success Metrics.

Objective

Predict the outcome of a football match before it starts, using only historical match statistics and contextual data. The model outputs calibrated class probabilities that can be compared directly with bookmaker implied probabilities.
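For readers unfamiliar with implied probabilities, the comparison can be sketched as follows. This is a minimal illustration, assuming decimal 1X2 odds and simple proportional removal of the bookmaker margin; other margin-removal methods exist, and the function name and the odds values are hypothetical, not taken from this project.

```python
def implied_probabilities(home_odds: float, draw_odds: float, away_odds: float):
    """Convert decimal 1X2 odds into probabilities that sum to 1.

    Assumption: the bookmaker margin (overround) is removed by
    proportional normalisation.
    """
    odds = (home_odds, draw_odds, away_odds)
    if min(odds) <= 1.0:
        raise ValueError("decimal odds must be greater than 1.0")
    raw = [1.0 / o for o in odds]   # raw inverse odds, sum exceeds 1 when there is a margin
    overround = sum(raw)
    return tuple(p / overround for p in raw)


# Hypothetical odds for a single match: home, draw, away.
probs = implied_probabilities(2.10, 3.40, 3.60)
```

The resulting tuple is ordered like the model's classes (home, draw, away), so it can be scored with the same metrics as the model output.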

ML task definition

Primary task: 3-class classification.

Class	Label	Meaning
0	Home Win	Home team wins
1	Draw	Match ends level
2	Away Win	Away team wins

Classes are imbalanced: Home Win is the most frequent class (~45%) and Draw the least frequent (~25%). Imbalance is addressed with class weights during training rather than oversampling, which would risk leaking samples across the temporal boundary.
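One common way to derive class weights is the "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn uses for class_weight="balanced". The sketch below is an assumption about how the weights could be computed, not a description of the project's training code; the label counts are illustrative and the 30% Away Win share is simply the remainder of the two figures quoted above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in sorted(counts.items())}


# Illustrative label distribution: 45% Home Win, 25% Draw, 30% Away Win.
labels = [0] * 45 + [1] * 25 + [2] * 30
weights = balanced_class_weights(labels)
```

Rarer classes receive larger weights, so Draw is weighted most heavily and Home Win least.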

Secondary task (experimental): goal-difference regression, used for calibration analysis only. It is not part of the primary serving path.

Why ML, not a lookup table

Match outcomes depend on time-varying factors that static rules cannot capture:

Team form evolves across a season and changes after player transfers.
League competitiveness and tactical patterns shift year to year.
The relationship between features and outcomes is non-linear and interaction-heavy.
Historical base rates vary by league tier and stage of season.

A static rule-based system cannot adapt to these dynamics within a season.

Target construction

Targets are derived from the final match result (homeScore, awayScore in the raw data) and encoded into outcome_1x2 (0 = Home Win, 1 = Draw, 2 = Away Win) during preprocessing.
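The encoding can be sketched as a small function. The column names homeScore and awayScore and the target name outcome_1x2 come from the text above; the function name is hypothetical and the actual preprocessing code may differ.

```python
def encode_outcome_1x2(home_score: int, away_score: int) -> int:
    """Map a final score to the 3-class target: 0 home win, 1 draw, 2 away win."""
    if home_score > away_score:
        return 0
    if home_score == away_score:
        return 1
    return 2


# Hypothetical raw rows as (homeScore, awayScore) pairs.
results = [(2, 1), (0, 0), (1, 3)]
targets = [encode_outcome_1x2(h, a) for h, a in results]
```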

Key constraints:

- Target is computed from data after the match ends.
- No information from within the match enters the feature set.
- Features are computed with a strict pre-match cutoff: rolling statistics use shift(1), ensuring match N's features never include match N's result.

Target construction is leakage-free by design. See Validation for how this is enforced and tested.
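The shift(1) pattern mentioned in the constraints can be illustrated with pandas. This is a sketch under assumptions: the frame layout, the column names team, date and points, the feature name form_last3 and the window of three matches are all hypothetical and chosen only to show the mechanism.

```python
import pandas as pd

# Hypothetical per-team match log; points earned in each match.
matches = pd.DataFrame(
    {
        "team": ["A", "A", "A", "A", "B", "B", "B"],
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-22",
             "2024-01-02", "2024-01-09", "2024-01-16"]
        ),
        "points": [3, 0, 1, 3, 1, 1, 0],
    }
).sort_values(["team", "date"], ignore_index=True)

# shift(1) moves every value down one row within each team, so the rolling
# window for match N only sees matches N-1, N-2, ... and never match N itself.
matches["form_last3"] = matches.groupby("team")["points"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)
```

Each team's first match has no history, so its feature is NaN; how such rows are imputed or dropped is a separate design decision not covered here.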

Train vs. inference asymmetry

At inference time	At training time
Match has not yet been played	Match result is known
Features are pre-match statistics	Target is the actual outcome
No future data is available	Pipeline enforces temporal split

This asymmetry is the fundamental reason temporal validation is mandatory. See Validation Strategy.
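A temporal split in its simplest form places every match before a cutoff date in the training set and every later match in the validation set. The sketch below assumes each match is a dict with a date field; the function name, field names and cutoff are hypothetical, and the project's actual validation scheme may be more elaborate.

```python
from datetime import date


def temporal_split(matches, cutoff):
    """Split matches so that training data strictly precedes validation data."""
    train = [m for m in matches if m["date"] < cutoff]
    valid = [m for m in matches if m["date"] >= cutoff]
    return train, valid


# Hypothetical matches, deliberately out of chronological order.
matches = [
    {"id": 3, "date": date(2024, 2, 10)},
    {"id": 1, "date": date(2023, 11, 4)},
    {"id": 2, "date": date(2023, 12, 16)},
    {"id": 4, "date": date(2024, 3, 2)},
]
train, valid = temporal_split(matches, cutoff=date(2024, 1, 1))
```

Unlike a random split, this guarantees the model is never evaluated on matches that precede its training data.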

Business constraints → ML design

Business requirement	ML implication
Prediction before match starts	Strict pre-match feature cutoff; no live/in-play data
Works across leagues and seasons	Generalisation evaluated across competition types
Low inference latency	Tabular model; no heavy preprocessing at inference
Calibrated probabilities	Log-loss as primary metric; expected calibration error (ECE) evaluated at the promotion gate
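The two calibration-related metrics in the last row can be sketched in plain Python. The ECE shown here is the common top-label variant with equal-width confidence bins; the number of bins and the choice of variant are assumptions, since this page does not specify them.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total += -math.log(min(max(p[y], eps), 1.0))  # clip to avoid log(0)
    return total / len(y_true)


def expected_calibration_error(y_true, probs, n_bins=10):
    """Top-label ECE: weighted mean gap between accuracy and confidence per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        conf = max(p)                 # confidence of the predicted class
        pred = p.index(conf)          # predicted class (first maximum)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# Two hypothetical predictions, both 80% confident in Home Win; one is correct.
y_true = [0, 1]
probs = [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1]]
ece = expected_calibration_error(y_true, probs)
```

In this toy case the model is 80% confident but only 50% accurate, so the ECE reflects a gap of about 0.3.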

Out of scope

In-play prediction — requires real-time data stream; not in scope.
Betting strategy optimisation — the system predicts outcomes, not optimal wagers.
Financial modelling — no expected value or Kelly criterion computations.
Player-level features — injury, transfer, and individual player data; planned future improvement.