Baseline & Success Metrics

Purpose

Define the baseline hierarchy, the primary benchmark, and the evaluation gates used to decide whether a model is worth registering and promoting.


Why the bookmaker is the right benchmark

Bookmaker odds encode the collective prediction of a liquid market with access to the same public information the model has. If the model cannot match bookmaker log-loss, it has no practical predictive value beyond publicly available market prices.

This is the honest external benchmark. Beating naive baselines proves the model functions. Matching or beating the bookmaker proves it has learned something real.

Note: bookmaker odds are used only as an external evaluation baseline. They are not used as input features — this maintains clean separation between prediction and market data.


Baseline hierarchy

Tier 1 — lower bound: uniform random

Assign probability 1/3 to each class.

  • Log-Loss ≈ log(3) ≈ 1.099
  • Accuracy ≈ 33%

Any trained model that cannot beat this is broken.

Tier 2 — sanity check: hard class prior

Always predict class 0 (Home Win). A literal probability of 1.0 would make log-loss unbounded on any non-home result, so probabilities are clipped when scoring; the figure below assumes such clipping.

  • Log-Loss ≈ 1.05
  • Accuracy ≈ 45%

Beating this confirms the model is not trivially collapsing onto the majority class.

Tier 3 — soft prior: historical class frequencies

Assign class probabilities proportional to historical base rates.

  • Log-Loss ≈ 1.01
  • Accuracy ≈ 45%

Better calibrated than the hard prior, but it still requires no learned signal.
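
Tiers 1 and 3 have closed forms, which makes them easy to sanity-check. A minimal sketch using hypothetical base rates (the real values come from the training data, so the output will not exactly match the estimates above):

import numpy as np

# Hypothetical 1X2 base rates; the project's actual rates come from data.
base_rates = np.array([0.45, 0.27, 0.28])  # P(home), P(draw), P(away)

# Tier 1: the uniform prediction scores -log(1/3) on every match.
tier1 = -np.log(1 / 3)

# Tier 3: predicting the base rates scores the entropy of the outcome
# distribution, i.e. the expected value of -log p(true class).
tier3 = -(base_rates * np.log(base_rates)).sum()

print(f"Tier 1 (uniform):    {tier1:.3f}")  # 1.099
print(f"Tier 3 (soft prior): {tier3:.3f}")  # depends on the chosen rates

# Tier 2 has no finite value at p = 1.0: any non-home result contributes
# -log(0). Its quoted figure therefore assumes clipped probabilities.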

Tier 4 — primary benchmark: bookmaker implied probabilities

Convert 1X2 decimal odds to implied probabilities by normalising out the bookmaker margin:

def odds_to_probs(home_odds, draw_odds, away_odds):
    """Convert 1X2 decimal odds to margin-free implied probabilities."""
    # Inverse decimal odds sum to slightly more than 1; the excess is
    # the bookmaker margin (overround).
    raw = [1 / home_odds, 1 / draw_odds, 1 / away_odds]
    total = sum(raw)
    # Proportional normalisation strips the margin so probabilities sum to 1.
    return [p / total for p in raw]
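
A quick usage check with illustrative odds of 2.10 / 3.30 / 3.80, whose raw inverses sum to roughly 1.042 (a ~4% margin):

probs = odds_to_probs(2.10, 3.30, 3.80)
print(probs)       # ≈ [0.457, 0.291, 0.252]
print(sum(probs))  # 1.0, the margin has been normalised out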

Implementation status: ✅ Implemented

  • Path A — Pari live odds: src/data/odds_pari.py (get_football_odds_snapshot, save_daily_snapshot). Pydantic validation via src/app/config/validate_bets.PariEvents. Mean vig ~1.087 (factors 921/922/923). Daily collection: airflow/dags/etl_odds_01.py → data/raw/odds_pari/date=YYYY-MM-DD/snapshot.parquet.
  • Path B — football-data.co.uk historical odds: src/data/odds_fdco.py; DVC stage load_odds_fdco → data/raw/odds_fdco.parquet. Leagues/seasons in params.yaml (odds_fdco). Team name join: src/data/odds_join.py.
  • ROI wiring: src/pipelines/error_analysis.py accepts --odds-path data/raw/odds_fdco.parquet; passes vig-stripped probabilities as reference_proba to compute_flat_stake_roi().
  • See Data Sources — Bookmaker odds for full scraper and schema details.

Approximate benchmark values (project-specific estimates; see evidence pages for current run values):

  • Log-Loss ≈ 0.97
  • Accuracy ≈ 53%

These figures are approximate and dataset-dependent. Do not treat them as universal constants.


Success criteria

A model is considered meaningful if it passes all of the following on the held-out test set:

Metric                    Threshold   Rationale
Log-Loss                  ≤ 0.97      Must match or beat the bookmaker benchmark
ECE (calibration error)   ≤ 0.05      Miscalibrated probabilities are not production-ready
Brier Score               ≤ 0.22      Secondary calibration check

Accuracy is tracked for context but is not a gate — it is misleading under class imbalance.

Log-loss alone is not sufficient. A model that achieves good log-loss but fails ECE is not production-ready.
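
For reference, a minimal sketch of one common ECE variant (top-label confidence, equal-width bins); the project's exact binning scheme may differ:

import numpy as np

def expected_calibration_error(y_true, probs, n_bins=10):
    # Confidence and correctness of the predicted (top) class per sample.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # Gap between accuracy and confidence, weighted by bin mass.
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece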


Model promotion gate

A model is promoted from Challenger to Champion in the MLflow registry only if:

  1. Log-loss on the held-out test set ≤ current champion's log-loss.
  2. ECE ≤ 0.05.
  3. Temporal split audit passes — no leakage detected.
  4. The model runs without error through the full inference path.

The held-out test set is touched once, at final evaluation. No hyperparameter decisions are made using the test set. See Model Registry & Promotion Rules.
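
A sketch of how gates 1 and 2 could be checked against the registry, assuming the champion is tracked as a registered-model alias. The model name and metric keys here are illustrative placeholders, not the project's actual identifiers:

from mlflow.tracking import MlflowClient

client = MlflowClient()
MODEL_NAME = "match_outcome"  # hypothetical registered-model name

def passes_metric_gates(challenger_version: str) -> bool:
    # Pull metrics from the runs behind each registered version.
    champion = client.get_model_version_by_alias(MODEL_NAME, "champion")
    champ = client.get_run(champion.run_id).data.metrics
    challenger = client.get_model_version(MODEL_NAME, challenger_version)
    chall = client.get_run(challenger.run_id).data.metrics
    # Gates 1-2 only; the leakage audit and inference smoke test
    # (gates 3-4) run as separate pipeline steps.
    return (
        chall["test_log_loss"] <= champ["test_log_loss"]
        and chall["test_ece"] <= 0.05
    )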


What counts as meaningful improvement

A ≥ 1% relative improvement in log-loss (e.g., 0.97 → 0.96) is considered meaningful. Smaller differences may be noise and require statistical significance testing before a promotion decision.

All comparisons are made on the same held-out test set. Any comparison using the validation set is considered suspect.
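
One concrete form such a test can take, sketched below, is a paired bootstrap over per-match log-loss differences on the shared test set; this is illustrative, not the project's mandated procedure:

import numpy as np

def paired_bootstrap_ci(loss_candidate, loss_champion, n_boot=10_000, seed=0):
    # Per-match log-losses for both models, aligned on the same test set.
    diff = np.asarray(loss_candidate) - np.asarray(loss_champion)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    # 95% CI for the mean difference; negative values favour the candidate.
    return np.percentile(diff[idx].mean(axis=1), [2.5, 97.5])

If the whole interval sits below zero, the improvement is unlikely to be noise; if it straddles zero, the difference should not drive a promotion decision.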


Current model performance

See MLflow Evidence for the latest run metrics.