Baseline & Success Metrics

Purpose

Define the baseline hierarchy, the primary benchmark, and the evaluation gates used to decide whether a model is worth registering and promoting.


Why the bookmaker is the right benchmark

Bookmaker odds encode the collective prediction of a liquid market with access to the same public information the model has. If the model cannot match bookmaker log-loss, it has no practical predictive value beyond publicly available market prices.

This is the honest external benchmark. Beating naive baselines proves the model functions. Matching or beating the bookmaker proves it has learned something real.

Note: bookmaker odds are used only as an external evaluation baseline. They are not used as input features — this maintains clean separation between prediction and market data.


Baseline hierarchy

Tier 1 — lower bound: uniform random

Assign probability 1/3 to each class.

  • Log-Loss ≈ log(3) ≈ 1.099
  • Accuracy ≈ 33%

Any trained model that cannot beat this is broken.

Tier 2 — sanity check: hard class prior

Always predict class 0 (Home Win). A literal probability of 1.0 would make log-loss unbounded on any non-home result, so probabilities are clipped when scoring; the figure below assumes such clipping.

  • Log-Loss ≈ 1.05
  • Accuracy ≈ 45%

Beating this confirms the model is not trivially collapsing onto the majority class.

Tier 3 — soft prior: historical class frequencies

Assign class probabilities proportional to historical base rates.

  • Log-Loss ≈ 1.01
  • Accuracy ≈ 45%

Better calibrated than the hard prior, but it still requires no learned signal.
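
Tiers 1 and 3 have closed forms, which makes them easy to sanity-check. A minimal sketch using hypothetical base rates (the real values come from the training data, so the output will not exactly match the estimates above):

import numpy as np

# Hypothetical 1X2 base rates; the project's actual rates come from data.
base_rates = np.array([0.45, 0.27, 0.28])  # P(home), P(draw), P(away)

# Tier 1: the uniform prediction scores -log(1/3) on every match.
tier1 = -np.log(1 / 3)

# Tier 3: predicting the base rates scores the entropy of the outcome
# distribution, i.e. the expected value of -log p(true class).
tier3 = -(base_rates * np.log(base_rates)).sum()

print(f"Tier 1 (uniform):    {tier1:.3f}")  # 1.099
print(f"Tier 3 (soft prior): {tier3:.3f}")  # depends on the chosen rates

# Tier 2 has no finite value at p = 1.0: any non-home result contributes
# -log(0). Its quoted figure therefore assumes clipped probabilities.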

Tier 4 — primary benchmark: bookmaker implied probabilities

Convert 1X2 decimal odds to implied probabilities by normalising out the bookmaker margin:

def odds_to_probs(home_odds, draw_odds, away_odds):
    """Convert 1X2 decimal odds to margin-free implied probabilities."""
    # Inverse decimal odds sum to slightly more than 1; the excess is
    # the bookmaker margin (overround).
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    total = sum(raw)
    # Proportional normalisation strips the margin so probabilities sum to 1.
    return [p / total for p in raw]
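
A quick usage check with illustrative odds of 2.10 / 3.30 / 3.80, whose raw inverses sum to roughly 1.042 (a ~4% margin):

probs = odds_to_probs(2.10, 3.30, 3.80)
print(probs)       # ≈ [0.457, 0.291, 0.252]
print(sum(probs))  # 1.0, the margin has been normalised out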

Implementation status: ✅ Implemented

  • Path A — Pari live odds: src/data/odds_pari.py (get_football_odds_snapshot, save_daily_snapshot). Pydantic validation via src/app/config/validate_bets.PariEvents. Mean vig ~1.087 (factors 921/922/923). Daily collection: airflow/dags/etl_odds_01.py → data/raw/odds_pari/date=YYYY-MM-DD/snapshot.parquet.
  • Path B — football-data.co.uk historical odds: src/data/odds_fdco.py; DVC stage load_odds_fdco → data/raw/odds_fdco.parquet. Leagues/seasons in params.yaml (odds_fdco). Team name join: src/data/odds_join.py.
  • ROI wiring: src/pipelines/error_analysis.py accepts --odds-path data/raw/odds_fdco.parquet; passes vig-stripped probabilities as reference_proba to compute_flat_stake_roi().
  • See Data Sources — Bookmaker odds for full scraper and schema details.

Approximate benchmark values (project-specific estimates; see evidence pages for current run values):

  • Log-Loss ≈ 0.97
  • Accuracy ≈ 53%

These figures are approximate and dataset-dependent. Do not treat them as universal constants.


Success criteria

A model is considered meaningful if it passes all of the following on the held-out test set:

Metric                    Threshold   Rationale
Log-Loss                  ≤ 0.97      Must match or beat the bookmaker benchmark
ECE (calibration error)   ≤ 0.05      Miscalibrated probabilities are not production-ready
Brier Score               ≤ 0.22      Secondary calibration check

Accuracy is tracked for context but is not a gate — it is misleading under class imbalance.

Log-loss alone is not sufficient. A model that achieves good log-loss but fails ECE is not production-ready.
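
For reference, a minimal sketch of one common ECE variant (top-label confidence, equal-width bins); the project's exact binning scheme may differ:

import numpy as np

def expected_calibration_error(y_true, probs, n_bins=10):
    # Confidence and correctness of the predicted (top) class per sample.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Gap between accuracy and confidence, weighted by bin mass.
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece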


Model promotion gate

A model is promoted from Challenger to Champion in the MLflow registry only if:

  1. Log-loss on the held-out test set ≤ current champion's log-loss.
  2. ECE ≤ 0.05.
  3. Temporal split audit passes — no leakage detected.
  4. The model runs without error through the full inference path.

The held-out test set is touched once, at final evaluation. No hyperparameter decisions are made using the test set. See Model Registry & Promotion Rules.
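
A sketch of how gates 1 and 2 could be checked against the registry, assuming the champion is tracked as a registered-model alias. The model name and metric keys here are illustrative placeholders, not the project's actual identifiers:

from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "match_outcome"  # hypothetical registered-model name

def passes_metric_gates(challenger_version: str) -> bool:
    # Pull metrics from the runs behind each registered version.
    champion = client.get_model_version_by_alias(MODEL_NAME, "champion")
    champ = client.get_run(champion.run_id).data.metrics
    challenger = client.get_model_version(MODEL_NAME, challenger_version)
    chall = client.get_run(challenger.run_id).data.metrics
    # Gates 1-2 only; the leakage audit and inference smoke test
    # (gates 3-4) run as separate pipeline steps.
    return (
        chall["test_log_loss"] <= champ["test_log_loss"]
        and chall["test_ece"] <= 0.05
    )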


What counts as meaningful improvement

A ≥ 1% relative improvement in log-loss (e.g., 0.97 → 0.96) is considered meaningful. Smaller differences may be noise and require statistical significance testing before a promotion decision.

All comparisons are made on the same held-out test set. Any comparison using the validation set is considered suspect.
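
One concrete form such a test can take, sketched below, is a paired bootstrap over per-match log-loss differences on the shared test set; this is illustrative, not the project's mandated procedure:

import numpy as np

def paired_bootstrap_ci(loss_candidate, loss_champion, n_boot=10_000, seed=0):
    # Per-match log-losses for both models, aligned on the same test set.
    diff = np.asarray(loss_candidate) - np.asarray(loss_champion)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    # 95% CI for the mean difference; negative values favour the candidate.
    return np.percentile(diff[idx].mean(axis=1), [2.5, 97.5])

If the whole interval sits below zero, the improvement is unlikely to be noise; if it straddles zero, the difference should not drive a promotion decision.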


Current model performance

See MLflow Evidence for the latest run metrics.