Feature Engineering

ELO ratings, rolling match statistics and feature inventory

Author

Dima Ivanov

Published

May 11, 2026

Show code

import sys
from pathlib import Path

project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.display import display, HTML, Markdown

from src.app.config import settings

import yaml
with open(project_root / "params.yaml") as _f:
    PARAMS = yaml.safe_load(_f)

Show code

for _p in [
    settings.data_features_path / "features.parquet",
    settings.data_features_path / "features_meta.parquet",
    project_root / "data" / "interim" / "finished.parquet",
]:
    if not _p.exists():
        raise FileNotFoundError(f"{_p} not found. Run `dvc repro feature_engineering` first.")

Note

This report documents the offline feature engineering pipeline for football match prediction. All features are computed from data/interim/finished.parquet such that:

all features are available before kick-off (no data leakage), and
feature computation is deterministic and stateless given the historical window.

Two feature groups are produced: ELO ratings (long-run team quality, updated after every match) and rolling match statistics (short-term form over configurable windows).

1. Feature Inventory

Show code

df_features = pd.read_parquet(settings.data_features_path / "features.parquet")
df_meta = pd.read_parquet(settings.data_features_path / "features_meta.parquet")

# Load finished matches and join with features for ELO calibration,
# signal analysis, and temporal drift sections.
# df_features is indexed by match id (set_index("id") in to_match_level).
_df_finished = pd.read_parquet(project_root / "data" / "interim" / "finished.parquet")
_join_cols = [c for c in ["startTimeUtc", "homeTeamId", "awayTeamId", "outcome_1x2"]
              if c in _df_finished.columns]
_df_full = df_features.join(
    _df_finished[["id"] + _join_cols].set_index("id"),
    how="left",
)

Show code

import subprocess
import yaml as _yaml

_git_hash = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], cwd=project_root
).decode().strip()

_dvc_md5 = "—"
_dvc_lock_path = project_root / "dvc.lock"
if _dvc_lock_path.exists():
    _dvc_lock = _yaml.safe_load(_dvc_lock_path.read_text())
    for _stage in _dvc_lock.get("stages", {}).values():
        for _out in _stage.get("outs", []):
            if _out.get("path") == "data/features/features.parquet":
                _dvc_md5 = _out.get("md5", "—")
                break

display(Markdown(
    f"- **Git commit:** `{_git_hash}`  \n"
    f"- **features.parquet DVC MD5:** `{_dvc_md5}`  \n"
    f"- **Total features:** {len(df_features.columns):,}  \n"
    f"- **Rows:** {len(df_features):,}"
))
del _dvc_md5, _dvc_lock_path

Git commit: a4f939a
features.parquet DVC MD5: 0c7e5562aac44e4925a374c6c461c5e1
Total features: 456
Rows: 966,140

Caution

Reproducibility note. The figures and statistics in this report are tied to a specific snapshot of the engineered features. The Git commit above identifies the pipeline code; the DVC MD5 above is the content-addressable hash of features.parquet as recorded in dvc.lock. To reproduce this report exactly, run dvc repro feature_engineering from the same commit before rendering.

Show code

# ── Pipeline configuration summary ───────────────────────────────────────────
_stats_cols   = PARAMS["features"]["stats_cols"]
_window_sizes = PARAMS["features"]["window_sizes"]
_elo_params   = PARAMS["features"]["elo"]
_rolling_scopes = ["all", "season", "tournament", "ha", "h2h"]

# Expected feature count breakdown
_n_stats     = len(_stats_cols)
_n_windows   = len(_window_sizes)
_n_scopes    = len(_rolling_scopes)
_n_roll_mean = _n_stats * _n_windows * _n_scopes * 3   # ×3 sides: home / away / diff
_n_coverage  = _n_windows * _n_scopes * 2              # home + away only (no diff coverage)
_n_elo       = 3 if _elo_params.get("include") else 0  # home_elo_pre / away_elo_pre / diff_elo_pre
_n_rest      = 3                                        # home_rest_days / away_rest_days / diff_rest_days
_n_expected  = _n_roll_mean + _n_coverage + _n_elo + _n_rest

display(Markdown("**Pipeline configuration (`params.yaml → features`)**"))
display(
    pd.DataFrame([
        {"Parameter": "stats_cols",      "Value": str(_stats_cols),     "N": _n_stats},
        {"Parameter": "window_sizes",    "Value": str(_window_sizes),   "N": _n_windows},
        {"Parameter": "rolling scopes",  "Value": str(_rolling_scopes), "N": _n_scopes},
        {"Parameter": "ELO (include)",   "Value": str(_elo_params.get("include")), "N": _n_elo},
        {"Parameter": "rest_days",       "Value": "home / away / diff",  "N": _n_rest},
    ]).set_index("Parameter")
    .style.set_caption("Feature engineering params")
)

_n_actual = len(df_features.columns)
display(Markdown(
    f"**Expected feature count:**  \n"
    f"- Rolling mean : {_n_stats} stats × {_n_windows} windows × {_n_scopes} scopes × 3 sides = **{_n_roll_mean}**  \n"
    f"- Coverage     : {_n_windows} windows × {_n_scopes} scopes × 2 sides = **{_n_coverage}**  \n"
    f"- ELO          : **{_n_elo}**  \n"
    f"- rest\_days   : **{_n_rest}**  \n"
    f"- **Expected ≈ {_n_expected}** | **Actual: {_n_actual}**"
    + (f"  \n⚠️ Δ = {_n_actual - _n_expected} (difference may reflect excluded scopes or extra coverage cols)"
       if abs(_n_actual - _n_expected) > 5 else "  \n✅ Count matches expected total.")
))

Pipeline configuration (params.yaml → features)

Table 1: Feature engineering params

	Value	N
Parameter
stats_cols	['win', 'draw', 'loss', 'goals_for', 'goals_against']	5
window_sizes	[1, 2, 3, 5, 10]	5
rolling scopes	['all', 'season', 'tournament', 'ha', 'h2h']	5
ELO (include)	True	3
rest_days	home / away / diff	3

Expected feature count:
- Rolling mean : 5 stats × 5 windows × 5 scopes × 3 sides = 375
- Coverage : 5 windows × 5 scopes × 2 sides = 50
- ELO : 3
- rest_days : 3
- Expected ≈ 431 | Actual: 456
⚠️ Δ = 25 (difference may reflect excluded scopes or extra coverage cols)

Show code

# features_meta.parquet columns: name, side, scope, metric, agg, window.
_name_col = "name" if "name" in df_meta.columns else (
    "feature_name" if "feature_name" in df_meta.columns else None
)

if "feature_group" not in df_meta.columns:
    df_meta = df_meta.copy()
    if "scope" in df_meta.columns:
        df_meta["feature_group"] = df_meta["scope"]
        # ELO features share scope with rolling stats — override by name.
        if _name_col:
            df_meta.loc[df_meta[_name_col].str.contains("elo", case=False, na=False), "feature_group"] = "elo"
        elif "metric" in df_meta.columns:
            df_meta.loc[df_meta["metric"].str.contains("elo", case=False, na=False), "feature_group"] = "elo"
        # rest_days: scope is "all" in meta but represents fatigue signal — show as its own group.
        if _name_col:
            df_meta.loc[df_meta[_name_col].str.contains("rest_days", case=False, na=False), "feature_group"] = "rest"
    else:
        def _infer_group(name: str) -> str:
            if "elo" in name: return "elo"
            if "h2h" in name: return "h2h"
            if "ha" in name: return "ha"
            if "season" in name: return "season"
            if "tournament" in name: return "tournament"
            if "rest" in name: return "rest"
            return "all"
        df_meta = df_meta.copy()
        _col = _name_col or df_meta.columns[0]
        df_meta["feature_group"] = df_meta[_col].apply(_infer_group)

# ── Scope legend ──────────────────────────────────────────────────────────────
_scope_descriptions = {
    "elo":        "ELO rating — long-run team strength, scoped per tournament, updated after every match",
    "all":        "Rolling stats over **all** historical matches (cross-tournament, cross-season)",
    "season":     "Rolling stats over **current season** only — captures in-season form changes",
    "tournament": "Rolling stats within the **current tournament** (league/cup) — removes cross-competition noise",
    "ha":         "**Home/Away** split — stats computed separately for home and away contexts",
    "h2h":        "**Head-to-Head** — rolling stats from prior meetings between the same two teams",
    "rest":       "**Rest days** — days since the team's last match (proxy for fatigue / congestion)",
}

_legend_rows = [
    {"Scope": k, "Description": v}
    for k, v in _scope_descriptions.items()
    if k in df_meta["feature_group"].unique()
]
display(Markdown("**Feature scope legend**"))
display(
    pd.DataFrame(_legend_rows)
    .set_index("Scope")
    .style.set_caption("Scope definitions")
)

# ── Pivot table: rows = scope, columns = side (home / away / diff) ───────────
_agg_col = _name_col if _name_col else df_meta.columns[0]
if "side" in df_meta.columns:
    _pivot = (
        df_meta.groupby(["feature_group", "side"])[_agg_col]
        .count()
        .unstack(fill_value=0)
    )
    # Canonical column order
    _side_order = [s for s in ["home", "away", "diff"] if s in _pivot.columns]
    _pivot = _pivot[_side_order]
    _pivot["total"] = _pivot.sum(axis=1)
    _pivot = _pivot.sort_values("total", ascending=False)
    _group_summary = _pivot[["total"]].rename(columns={"total": "n_features"})

    display(Markdown(f"**Total features: {len(df_features.columns):,}**"))
    display(
        _pivot.style
        .set_caption("Feature inventory: rows = scope, columns = side")
        .format("{:,}")
        .background_gradient(cmap="Blues", subset=_side_order)
    )
else:
    _group_summary = (
        df_meta.groupby("feature_group")
        .agg(n_features=(_agg_col, "count"))
        .sort_values("n_features", ascending=False)
    )
    display(Markdown(f"**Total features: {len(df_features.columns):,}**"))
    display(_group_summary.style.set_caption("Feature groups").bar(color="#5fba7d"))

Feature scope legend

Table 2: Scope definitions

	Description
Scope
elo	ELO rating — long-run team strength, scoped per tournament, updated after every match
all	Rolling stats over all historical matches (cross-tournament, cross-season)
season	Rolling stats over current season only — captures in-season form changes
tournament	Rolling stats within the current tournament (league/cup) — removes cross-competition noise
ha	Home/Away split — stats computed separately for home and away contexts
h2h	Head-to-Head — rolling stats from prior meetings between the same two teams
rest	Rest days — days since the team's last match (proxy for fatigue / congestion)

Total features: 456

Table 3: Feature inventory: rows = scope, columns = side

side	home	away	diff	total
feature_group
all	30	30	30	90
h2h	30	30	30	90
ha	30	30	30	90
tournament	30	30	30	90
season	30	30	30	90
elo	1	1	1	3
rest	1	1	1	3

Show code

# Explicitly surface rest_days columns — they are named outside the rolling-window
# convention and are easy to miss in the aggregated pivot above.
_rest_cols = [c for c in df_features.columns if "rest_days" in c]
if _rest_cols:
    _rest_desc = df_features[_rest_cols].describe().T
    _rest_desc["null_%"] = (df_features[_rest_cols].isnull().mean() * 100).round(2).values
    display(Markdown(
        f"**rest\_days features** (`{', '.join(_rest_cols)}`):  \n"
        "Days since each team's previous match. NaN for a team's debut (no prior match). "
        "`diff_rest_days = home_rest_days − away_rest_days` (positive → home team had more rest)."
    ))
    display(
        _rest_desc[["count", "mean", "std", "min", "50%", "max", "null_%"]]
        .style
        .format("{:.2f}")
        .background_gradient(subset=["null_%"], cmap="Reds")
        .set_caption("rest_days feature summary statistics")
    )
else:
    display(Markdown("⚠️ No `rest_days` columns found in `features.parquet`. Check `add_rest_days()` in `stats_matches.py`."))

rest_days features (home_rest_days, away_rest_days, diff_rest_days):
Days since each team’s previous match. NaN for a team’s debut (no prior match). diff_rest_days = home_rest_days − away_rest_days (positive → home team had more rest).

Table 4: rest_days feature summary statistics

	count	mean	std	min	50%	max	null_%
home_rest_days	959727.00	18.19	129.52	0.00	6.00	8968.00	0.66
away_rest_days	961340.00	15.55	102.94	0.00	6.00	8081.00	0.50
diff_rest_days	956465.00	2.75	142.98	-8032.00	0.00	8965.00	1.00

Show code

_fig, _ax = plt.subplots(figsize=(7, 5))
_group_summary["n_features"].plot(
    kind="bar", ax=_ax, color=sns.color_palette("muted", len(_group_summary))
)
_ax.set_xlabel("Scope")
_ax.set_ylabel("# features")
_ax.set_title("Feature inventory by scope (all sides combined)")
_ax.tick_params(axis="x", rotation=30)
for _bar in _ax.patches:
    _ax.text(_bar.get_x() + _bar.get_width() / 2, _bar.get_height() + 0.3,
             str(int(_bar.get_height())), ha="center", fontsize=9)
plt.tight_layout()
plt.show()

2. Feature Completeness

Show code

_null_pct = (df_features.isnull().mean() * 100).sort_values(ascending=False)
_null_pct = _null_pct[_null_pct > 0]

if not _null_pct.empty:
    _high_null = _null_pct[_null_pct > 5]
    if not _high_null.empty:
        display(HTML(
            '<div style="background:#fff3cd;border:1px solid #ffc107;padding:10px;border-radius:4px;">'
            f'<b>⚠️ {len(_high_null)} features have &gt;5% nulls:</b> '
            + ", ".join(f"<code>{c}</code> ({v:.1f}%)" for c, v in _high_null.items())
            + "</div>"
        ))

    _fig, _ax = plt.subplots(figsize=(10, max(4, len(_null_pct) * 0.3)))
    _colors = ["#e53935" if v > 5 else "#fb8c00" if v > 1 else "#43a047"
               for v in _null_pct.values]
    _ax.barh(_null_pct.index, _null_pct.values, color=_colors)
    _ax.set_xlabel("Null %")
    _ax.set_title("Feature completeness (null % per feature)")
    _ax.invert_yaxis()
    _ax.axvline(5, color="red", lw=1.2, linestyle="--", label="5% threshold")
    _ax.legend()
    plt.tight_layout()
    plt.show()
else:
    display(Markdown("✅ All features are complete — no nulls detected."))

(a) Null % per feature (features with any nulls, sorted descending)

2.1 Cold-Start Profile

Note

Rolling features produce NaN for a team’s first few matches in any grouping context: scope all → cold for a team’s debut; scope season → cold at each season start; scope h2h → cold until the pair has met at least once. This plot quantifies the cold-start null rate per scope × window and explains the GE mostly tolerances configured in src/data_quality/features.py.

Show code

_cold_rows = []
for _col in df_features.columns:
    if "_mean_w" not in _col:
        continue
    if _col.startswith("diff_"):
        continue
    try:
        _, _w_str = _col.rsplit("_w", 1)
        _w = int(_w_str)
    except ValueError:
        continue
    _null_rate = float(df_features[_col].isnull().mean() * 100)
    for _sc in ["h2h", "ha", "tournament", "season", "all"]:
        if f"_{_sc}_" in _col:
            _cold_rows.append({"scope": _sc, "window": _w, "null_pct": _null_rate})
            break

if _cold_rows:
    _cold_df = (
        pd.DataFrame(_cold_rows)
        .groupby(["scope", "window"])["null_pct"]
        .mean()
        .reset_index()
    )
    _cold_pivot = _cold_df.pivot(index="scope", columns="window", values="null_pct")
    _scope_order = [s for s in ["all", "season", "tournament", "ha", "h2h"] if s in _cold_pivot.index]
    _cold_pivot = _cold_pivot.loc[_scope_order]

    _fig, _axes = plt.subplots(1, 2, figsize=(14, 5))

    sns.heatmap(
        _cold_pivot,
        annot=True, fmt=".1f", cmap="YlOrRd",
        linewidths=0.5, ax=_axes[0],
        cbar_kws={"label": "Null %"},
    )
    _axes[0].set_title("Cold-start null rate (%) by scope × window")
    _axes[0].set_xlabel("Window size")
    _axes[0].set_ylabel("Scope")

    for _sc in _scope_order:
        _row = _cold_pivot.loc[_sc]
        _axes[1].plot(_row.index, _row.values, marker="o", label=_sc)
    _axes[1].set_xlabel("Window size")
    _axes[1].set_ylabel("Mean null %")
    _axes[1].set_title("Null rate by window size (per scope)")
    _axes[1].axhline(10, color="red", lw=1.0, linestyle="--", label="10% GE tolerance")
    _axes[1].legend()
    plt.tight_layout()
    plt.show()

    display(Markdown(
        "**Interpretation:** `h2h` scope has the highest null rate — many team pairs have never "
        "met within the historical window. `all` scope shows the lowest null rate; "
        "short windows (w=1, 2) produce the most cold-start NaNs within any scope."
    ))
else:
    display(Markdown("⚠️ No rolling mean columns found for cold-start analysis."))

(a) Mean null rate (%) per scope × window size (home + away sides averaged)

Interpretation: h2h scope has the highest null rate — many team pairs have never met within the historical window. all scope shows the lowest null rate; short windows (w=1, 2) produce the most cold-start NaNs within any scope.

(b)

Figure 3

3. ELO Ratings

Show code

_elo_params = PARAMS["features"]["elo"]
_elo_df = pd.DataFrame([
    {"Parameter": "include", "Value": _elo_params.get("include", "—")},
    {"Parameter": "k_factor", "Value": _elo_params.get("k_factor", "—")},
    {"Parameter": "initial_rating", "Value": _elo_params.get("initial_rating", "—")},
    {"Parameter": "home_advantage", "Value": _elo_params.get("home_advantage", "—")},
    {"Parameter": "group_col", "Value": _elo_params.get("group_col", "—")},
])
display(_elo_df.set_index("Parameter").style.set_caption("ELO configuration (params.yaml)"))

Table 5: ELO configuration (params.yaml)

	Value
Parameter
include	True
k_factor	32.000000
initial_rating	1500.000000
home_advantage	50.000000
group_col	tournamentId

Warning

ELO scope reset. Ratings are maintained per (group_col, teamId) pair — by default per (tournamentId, teamId). A team entering a new tournament starts from initial_rating regardless of its true strength. This is a deliberate trade-off: cross-tournament transfer would require a separate normalisation step and introduces additional hyperparameters. The consequence is that the first few matches of any team in a new tournament carry less discriminative ELO signal, compounding the cold-start effect described in Section 2.1.

Show code

_elo_home = "home_elo_pre"
_elo_away = "away_elo_pre"
_elo_diff = "diff_elo_pre"
_initial_rating = float(_elo_params.get("initial_rating", 1500))

_fig, _axes = plt.subplots(1, 3, figsize=(15, 5))

# Home vs Away KDE
for _col, _label, _color in [(_elo_home, "Home ELO", "#2196F3"), (_elo_away, "Away ELO", "#4CAF50")]:
    if _col in df_features.columns:
        df_features[_col].dropna().plot.kde(ax=_axes[0], label=_label, color=_color)
_axes[0].set_xlabel("ELO rating")
_axes[0].set_title("Home vs Away ELO (KDE)")
_axes[0].legend()
_axes[0].axvline(_initial_rating, color="red", linestyle="--", lw=0.8, label="Initial rating")

# Diff histogram
if _elo_diff in df_features.columns:
    _axes[1].hist(df_features[_elo_diff].dropna(), bins=60, color="#FF9800", edgecolor="white", alpha=0.85)
    _axes[1].axvline(0, color="red", lw=1.5, linestyle="--", label="0 (equal teams)")
    _axes[1].set_xlabel("ELO diff (home − away)")
    _axes[1].set_ylabel("# matches")
    _axes[1].set_title("ELO difference distribution")
    _axes[1].legend()
else:
    _axes[1].set_visible(False)

# ELO delta over time — deviation from initial_rating (top-5 teams by home match count).
# Using _df_full because homeTeamId / startTimeUtc are not in df_features after to_match_level.
if _elo_home in _df_full.columns and "homeTeamId" in _df_full.columns and "startTimeUtc" in _df_full.columns:
    _top_teams = _df_full["homeTeamId"].value_counts().head(5).index
    _df_elo_time = _df_full[_df_full["homeTeamId"].isin(_top_teams)].copy()
    _df_elo_time["_dt"] = pd.to_datetime(_df_elo_time["startTimeUtc"])
    for _tid in _top_teams:
        _sub = _df_elo_time[_df_elo_time["homeTeamId"] == _tid].sort_values("_dt")
        _axes[2].plot(_sub["_dt"], _sub[_elo_home] - _initial_rating, alpha=0.75, label=str(_tid))
    _axes[2].axhline(0, color="gray", lw=0.8, linestyle="--", label="Δ=0 (initial)")
    _axes[2].set_title("Home ELO Δ over time (top-5 teams by home frequency)")
    _axes[2].set_xlabel("Date")
    _axes[2].set_ylabel("ELO delta (rating − initial)")
    _axes[2].legend(title="homeTeamId", fontsize=7)
else:
    _axes[2].set_visible(False)

plt.tight_layout()
plt.show()

Figure 4: Pre-match ELO rating distributions

3.1 ELO Calibration

Note

ELO expected score is defined as E[score] = P(home win) + 0.5 × P(draw), matching the ELO update formula in src/features/elo.py. A calibrated feature should show E[score] rising monotonically with diff_elo_pre and tracking the theoretical sigmoid 1 / (1 + 10^(−diff/400)). Two curves are shown: without home advantage (HA=0) and with the configured home_advantage parameter to confirm the parameter is effective. Perfect calibration means the bars align with the HA=50 curve, not the HA=0 curve.

Show code

if _elo_diff in _df_full.columns and "outcome_1x2" in _df_full.columns:
    _df_calib = _df_full[[_elo_diff, "outcome_1x2"]].dropna().copy()
    _df_calib["home_win"] = (_df_calib["outcome_1x2"] == 0).astype(float)
    _df_calib["draw"]     = (_df_calib["outcome_1x2"] == 1).astype(float)
    # ELO expected score includes half-credit for draws, matching the update formula
    _df_calib["exp_score"] = _df_calib["home_win"] + 0.5 * _df_calib["draw"]
    _df_calib["bucket"] = pd.cut(_df_calib[_elo_diff], bins=12)
    _calib_grp = (
        _df_calib.groupby("bucket", observed=True)
        .agg(exp_score=("exp_score", "mean"), count=("exp_score", "size"))
        .reset_index()
    )
    _calib_grp["bucket_mid"] = _calib_grp["bucket"].apply(lambda x: x.mid).astype(float)
    _z = 1.96
    _calib_grp["ci"] = _z * np.sqrt(
        _calib_grp["exp_score"] * (1 - _calib_grp["exp_score"]) / _calib_grp["count"].clip(1)
    )
    _calib_ha = float(_elo_params.get("home_advantage", 50.0))
    _diff_range = np.linspace(_df_calib[_elo_diff].min(), _df_calib[_elo_diff].max(), 200)
    _elo_curve_no_ha  = 1.0 / (1.0 + 10.0 ** (-_diff_range / 400.0))
    _elo_curve_with_ha = 1.0 / (1.0 + 10.0 ** (-(_diff_range + _calib_ha) / 400.0))

    _fig, _ax = plt.subplots(figsize=(9, 5))
    _bar_w = float(_calib_grp["bucket_mid"].diff().median()) * 0.8
    _ax.bar(
        _calib_grp["bucket_mid"], _calib_grp["exp_score"], width=_bar_w,
        alpha=0.6, color="steelblue", label="Empirical E[score] = P(win) + 0.5·P(draw)",
        yerr=_calib_grp["ci"], capsize=3, error_kw={"elinewidth": 1},
    )
    _ax.plot(_diff_range, _elo_curve_no_ha,   color="gray",   lw=1.5, linestyle="--",
             label="Theoretical (HA = 0)")
    _ax.plot(_diff_range, _elo_curve_with_ha, color="crimson", lw=2,
             label=f"Theoretical (HA = {_calib_ha:.0f})")
    _ax.axhline(
        _df_calib["exp_score"].mean(), color="steelblue", lw=1, linestyle=":",
        label=f"Overall E[score] = {_df_calib['exp_score'].mean():.3f}"
    )
    _ax.set_xlabel("ELO diff (home − away)")
    _ax.set_ylabel("E[score] = P(home win) + 0.5·P(draw)")
    _ax.set_title("ELO calibration — empirical expected score vs theoretical")
    _ax.legend(fontsize=8)
    plt.tight_layout()
    plt.show()

    _mono_corr = float(_calib_grp["bucket_mid"].corr(_calib_grp["exp_score"]))
    from sklearn.metrics import mean_absolute_error as _mae
    _mae_no_ha = float(np.abs(_calib_grp["exp_score"] - _elo_curve_no_ha[
        [np.argmin(np.abs(_diff_range - m)) for m in _calib_grp["bucket_mid"]]
    ]).mean())
    _mae_with_ha = float(np.abs(_calib_grp["exp_score"] - _elo_curve_with_ha[
        [np.argmin(np.abs(_diff_range - m)) for m in _calib_grp["bucket_mid"]]
    ]).mean())
    display(Markdown(
        f"**Pearson r** (bucket midpoint vs empirical E[score]): `{_mono_corr:.3f}`  \n"
        f"**MAE** vs theoretical (HA=0): `{_mae_no_ha:.4f}`  \n"
        f"**MAE** vs theoretical (HA={_calib_ha:.0f}): `{_mae_with_ha:.4f}`  \n\n"
        f"Lower MAE for HA={_calib_ha:.0f} confirms the configured `home_advantage` parameter "
        "improves calibration fit."
    ))
else:
    display(Markdown("⚠️ ELO features or `outcome_1x2` not available — calibration chart skipped."))

(a) ELO calibration: empirical expected score (P(win) + 0.5·P(draw)) vs ELO differential

3.2 Empirical Home Advantage

Note

For matches where both teams are roughly equal (|ELO diff| < 50), the theoretical ELO expected score at diff=0 is exactly 1 / (1 + 10^(−HA/400)). The implied home advantage is derived by inverting this formula using the empirical expected score E[score] = P(home win) + 0.5 × P(draw) — not P(home win) alone, since ELO assigns half-credit to draws in its update rule.

Show code

if _elo_diff in _df_full.columns and "outcome_1x2" in _df_full.columns:
    _ha_val = float(_elo_params.get("home_advantage", 0.0))
    _neutral_df = _df_full.loc[_df_full[_elo_diff].abs() < 50, "outcome_1x2"].dropna()
    _outcome_labels = {0: "Home win", 1: "Draw", 2: "Away win"}
    _vc = _neutral_df.value_counts(normalize=True).reindex([0, 1, 2], fill_value=0)
    _p_hw   = float(_vc[0])
    _p_draw = float(_vc[1])
    _p_aw   = float(_vc[2])
    # ELO expected score = P(win) + 0.5*P(draw)  (matches the update rule in elo.py)
    _emp_exp_score = _p_hw + 0.5 * _p_draw
    _empirical = _vc.copy()
    _empirical.index = [_outcome_labels[i] for i in _empirical.index]

    _fig, _axes = plt.subplots(1, 2, figsize=(12, 5))

    _empirical.plot(kind="bar", ax=_axes[0],
                    color=["#2196F3", "#9E9E9E", "#F44336"], edgecolor="white")
    _axes[0].set_title(f"Outcomes for |ELO diff| < 50  (n={len(_neutral_df):,})")
    _axes[0].set_ylabel("Proportion")
    _axes[0].tick_params(axis="x", rotation=15)
    for _bar in _axes[0].patches:
        _axes[0].text(_bar.get_x() + _bar.get_width() / 2, _bar.get_height() + 0.005,
                      f"{_bar.get_height():.2%}", ha="center", fontsize=9)

    # Right panel: E[score] vs home_advantage parameter sweep
    _ha_range = np.linspace(0, 150, 200)
    _exp_theory = 1.0 / (1.0 + 10.0 ** (-_ha_range / 400.0))
    _axes[1].plot(_ha_range, _exp_theory, color="steelblue", lw=2,
                  label="Theoretical E[score] = 1/(1+10^(−HA/400))")
    _axes[1].axhline(_emp_exp_score, color="crimson", lw=1.5, linestyle="--",
                     label=f"Empirical E[score] = {_emp_exp_score:.3f}")
    _axes[1].axvline(_ha_val, color="orange", lw=1.5, linestyle="--",
                     label=f"Configured home_advantage = {_ha_val}")
    _axes[1].set_xlabel("home_advantage parameter")
    _axes[1].set_ylabel("E[score] = P(win) + 0.5·P(draw)")
    _axes[1].set_title("Empirical E[score] vs theoretical (|ELO diff| < 50)")
    _axes[1].legend(fontsize=8)
    plt.tight_layout()
    plt.show()

    # Correct implied HA: invert E[score] = 1/(1+10^(−HA/400))
    _implied_ha = -400.0 * np.log10(max(1e-9, 1.0 / _emp_exp_score - 1.0))
    _diff_pct = abs(_implied_ha - _ha_val) / max(1.0, abs(_ha_val)) * 100
    _status = "✅" if _diff_pct < 20 else "⚠️"
    display(Markdown(
        f"**Empirical E[score]** for |ELO diff| < 50: "
        f"`{_p_hw:.3f}` (win) + 0.5 × `{_p_draw:.3f}` (draw) = **`{_emp_exp_score:.3f}`**  \n"
        f"**Implied home_advantage** (inverted ELO formula): **`{_implied_ha:.1f}`** rating points  \n"
        f"**Configured home_advantage:** `{_ha_val}`  \n"
        f"{_status} Difference: `{abs(_implied_ha - _ha_val):.1f}` pts ({_diff_pct:.0f}%)  \n\n"
        "> **Note:** `P(home win) = 0.446` appears low because draws (~27%) are shared "
        "between home and away. The correct calibration quantity is "
        "`E[score] = P(win) + 0.5·P(draw) ≈ 0.58`, which implies an "
        f"effective home advantage of **{_implied_ha:.1f} rating points** — "
        f"consistent with the configured value of {_ha_val}."
    ))
else:
    display(Markdown("⚠️ ELO features or `outcome_1x2` not available — home advantage analysis skipped."))

(a) Outcome distribution for near-equal matches (|ELO diff| < 50) vs theoretical

4. Rolling Statistics

Show code

_windows = PARAMS["features"]["window_sizes"]
_stats_cols = PARAMS["features"]["stats_cols"]
_reference_stat = _stats_cols[0] if _stats_cols else "win"
display(Markdown(f"""
**Window sizes:** {_windows}
**Statistics computed:** {_stats_cols}
**Reference stat for window analysis:** `{_reference_stat}`
"""))

Window sizes: [1, 2, 3, 5, 10] Statistics computed: [‘win’, ‘draw’, ‘loss’, ‘goals_for’, ‘goals_against’] Reference stat for window analysis: win

Show code

_rolling_cols = [c for c in df_features.columns if any(f"_w{w}" in c for w in _windows)]
if _rolling_cols:
    _var_table = df_features[_rolling_cols].agg(["mean", "std", "min", "max"]).T
    _var_table = _var_table.sort_values("std", ascending=False).head(40)

    _fig, _ax = plt.subplots(figsize=(10, max(5, len(_var_table) * 0.3)))
    _ax.barh(_var_table.index, _var_table["std"], color="steelblue")
    _ax.set_xlabel("Standard deviation")
    _ax.set_title("Rolling feature variance (std) — top 40")
    _ax.invert_yaxis()
    plt.tight_layout()
    plt.show()

    display(
        _var_table.style
        .format("{:.4f}")
        .background_gradient(subset=["std"], cmap="Blues")
        .set_caption("Rolling feature statistics (top-40 by std)")
    )

(a) Feature variance — rolling stats columns sorted by std (top 40)

	mean	std	min	max
diff_h2h_goals_for_mean_w1	-0.1699	1.7781	-13.0000	13.0000
diff_h2h_goals_against_mean_w1	0.1699	1.7781	-13.0000	13.0000
diff_all_goals_for_mean_w1	-0.1667	1.7553	-13.0000	13.0000
diff_tournament_goals_for_mean_w1	-0.1910	1.7454	-13.0000	13.0000
diff_ha_goals_for_mean_w1	0.3231	1.7442	-13.0000	11.0000
diff_all_goals_against_mean_w1	0.1749	1.7424	-13.0000	13.0000
diff_season_goals_for_mean_w1	-0.2038	1.7372	-13.0000	13.0000
diff_tournament_goals_against_mean_w1	0.1941	1.7276	-13.0000	13.0000
diff_ha_goals_against_mean_w1	-0.3249	1.7246	-11.0000	13.0000
diff_season_goals_against_mean_w1	0.2043	1.7079	-13.0000	13.0000
diff_h2h_goals_against_mean_w2	0.1010	1.4148	-13.0000	13.0000
diff_h2h_goals_for_mean_w2	-0.1010	1.4148	-13.0000	13.0000
home_ha_goals_for_mean_w1	1.5050	1.3117	0.0000	11.0000
away_ha_goals_against_mean_w1	1.4937	1.3033	0.0000	11.0000
away_season_goals_for_mean_w1	1.4598	1.3020	0.0000	13.0000
away_tournament_goals_for_mean_w1	1.4400	1.2979	0.0000	13.0000
diff_season_goals_for_mean_w2	-0.0545	1.2946	-13.0000	11.5000
away_all_goals_for_mean_w1	1.4264	1.2915	0.0000	13.0000
diff_h2h_goals_against_mean_w3	0.0840	1.2873	-13.0000	13.0000
diff_h2h_goals_for_mean_w3	-0.0840	1.2873	-13.0000	13.0000
home_all_goals_against_mean_w1	1.4207	1.2852	0.0000	13.0000
home_tournament_goals_against_mean_w1	1.4241	1.2823	0.0000	13.0000
diff_ha_goals_for_mean_w2	0.3230	1.2743	-13.0000	11.0000
diff_tournament_goals_for_mean_w2	-0.0480	1.2694	-11.5000	11.5000
diff_all_goals_for_mean_w2	-0.0494	1.2687	-13.0000	11.5000
diff_all_goals_against_mean_w2	0.0525	1.2652	-12.0000	13.0000
diff_season_goals_against_mean_w2	0.0522	1.2648	-12.5000	12.5000
diff_ha_goals_against_mean_w2	-0.3249	1.2636	-11.0000	12.5000
home_season_goals_against_mean_w1	1.3965	1.2627	0.0000	13.0000
diff_tournament_goals_against_mean_w2	0.0494	1.2620	-12.0000	12.5000
home_h2h_goals_against_mean_w1	1.4071	1.2619	0.0000	13.0000
away_h2h_goals_for_mean_w1	1.4071	1.2619	0.0000	13.0000
home_all_goals_for_mean_w1	1.2598	1.2271	0.0000	13.0000
home_tournament_goals_for_mean_w1	1.2505	1.2194	0.0000	13.0000
home_season_goals_for_mean_w1	1.2571	1.2194	0.0000	13.0000
away_all_goals_against_mean_w1	1.2464	1.2076	0.0000	13.0000
away_tournament_goals_against_mean_w1	1.2325	1.2000	0.0000	13.0000
away_ha_goals_for_mean_w1	1.1825	1.1895	0.0000	13.0000
diff_h2h_goals_against_mean_w5	0.0708	1.1894	-13.0000	13.0000
diff_h2h_goals_for_mean_w5	-0.0708	1.1894	-13.0000	13.0000

Show code

if _rolling_cols:
    _corr_cols = _rolling_cols[:30]
    _corr = df_features[_corr_cols].corr()
    _fig, _ax = plt.subplots(figsize=(12, 11))
    sns.heatmap(_corr, cmap="coolwarm", center=0, linewidths=0.2, ax=_ax,
                annot=False, square=True, cbar_kws={"shrink": 0.8})
    _ax.set_title("Rolling features — Pearson correlation (first 30 features)")
    plt.tight_layout()
    plt.show()

Figure 8: Correlation heatmap of rolling stats features (first 30)

4.1 Window Redundancy Analysis

Note

If two window sizes produce near-identical features (Pearson r > 0.95), one adds little information beyond the other. This motivates feature selection and directly informs the final window set in params.yaml.

Show code

_win_ref_cols = {
    _w: f"home_all_{_reference_stat}_mean_w{_w}"
    for _w in _windows
    if f"home_all_{_reference_stat}_mean_w{_w}" in df_features.columns
}
if len(_win_ref_cols) > 1:
    _win_df = df_features[[c for c in _win_ref_cols.values()]].copy()
    _win_df.columns = [f"w{w}" for w in _win_ref_cols]
    _win_corr = _win_df.corr()

    _fig, _axes = plt.subplots(1, 2, figsize=(13, 5))
    sns.heatmap(
        _win_corr, annot=True, fmt=".2f", cmap="Blues",
        vmin=0, vmax=1, linewidths=0.5, ax=_axes[0],
        cbar_kws={"label": "Pearson r"},
    )
    _axes[0].set_title(f"Cross-window correlation — home_all_{_reference_stat}")

    _w_min = min(_win_ref_cols)
    _w_max = max(_win_ref_cols)
    _axes[1].scatter(_win_df[f"w{_w_min}"], _win_df[f"w{_w_max}"],
                     alpha=0.15, s=6, color="steelblue")
    _r = float(_win_corr.loc[f"w{_w_min}", f"w{_w_max}"])
    _axes[1].set_xlabel(f"w{_w_min}")
    _axes[1].set_ylabel(f"w{_w_max}")
    _axes[1].set_title(f"w{_w_min} vs w{_w_max}  (r = {_r:.3f})")
    plt.tight_layout()
    plt.show()

    _high_corr = [
        (f"w{r}", f"w{c}", float(_win_corr.loc[f"w{r}", f"w{c}"]))
        for r in _win_ref_cols for c in _win_ref_cols
        if r < c and _win_corr.loc[f"w{r}", f"w{c}"] > 0.95
    ]
    if _high_corr:
        display(Markdown(
            "⚠️ **Highly correlated window pairs (r > 0.95):** "
            + "; ".join(f"{a}↔{b} (r={rv:.3f})" for a, b, rv in _high_corr)
            + " — consider removing the redundant window from training."
        ))
    else:
        display(Markdown("✅ No window pairs exceed r = 0.95 — all window sizes add distinct signal."))
else:
    display(Markdown("⚠️ Not enough window columns found for redundancy analysis."))

(a) Cross-window Pearson correlation for the reference stat (home_all)

4.2 Temporal Drift

Note

Structural changes in football (tactical trends, rule changes, squad inflation) can cause rolling statistics to drift over time. A stable distribution validates the temporal split (test_start: 2024-01-01); pronounced drift motivates ongoing monitoring in production.

Show code

if "startTimeUtc" in _df_full.columns:
    _df_drift = _df_full.copy()
    _df_drift["year"] = pd.to_datetime(_df_drift["startTimeUtc"]).dt.year
    _drift_candidates = [
        f"home_all_{_reference_stat}_mean_w{max(_windows)}",
        f"home_all_goals_for_mean_w{max(_windows)}",
        f"diff_all_{_reference_stat}_mean_w{max(_windows)}",
    ]
    _drift_cols = [c for c in _drift_candidates if c in _df_drift.columns]
    if _drift_cols:
        _drift_stats = _df_drift.groupby("year")[_drift_cols].agg(["mean", "std"])

        _n_cols = len(_drift_cols)
        _fig, _axes = plt.subplots(1, _n_cols, figsize=(5 * _n_cols, 5))
        if _n_cols == 1:
            _axes = [_axes]
        for _i, _col in enumerate(_drift_cols):
            _means = _drift_stats[(_col, "mean")]
            _stds = _drift_stats[(_col, "std")]
            _axes[_i].plot(_means.index, _means.values, marker="o", color="steelblue", lw=2)
            _axes[_i].fill_between(
                _means.index,
                _means.values - _stds.values,
                _means.values + _stds.values,
                alpha=0.2, color="steelblue", label="±1 std",
            )
            _axes[_i].axvline(
                pd.to_datetime(PARAMS["temporal"]["test_start"]).year - 0.5,
                color="red", lw=1.5, linestyle="--", label="test split"
            )
            _axes[_i].set_title(_col, fontsize=9)
            _axes[_i].set_xlabel("Year")
            _axes[_i].set_ylabel("Mean value")
            _axes[_i].legend(fontsize=7)
        plt.suptitle(f"Temporal drift — window w{max(_windows)} features", fontsize=11)
        plt.tight_layout()
        plt.show()
    else:
        display(Markdown("⚠️ Target drift columns not found — check stats_cols and window_sizes in params.yaml."))
else:
    display(Markdown("⚠️ `startTimeUtc` not available in joined dataset — temporal drift analysis skipped."))

Figure 10: Year-over-year distribution of key rolling features (mean ± 1 std)

5. Predictive Signal Preview

Note

This section estimates which features carry the most signal for predicting match outcome using Mutual Information and Spearman rank correlation against P(home win). Analysis is restricted to the training split (data/splits/train_ids.parquet) to avoid any information bleed from the held-out test set.

Show code

from sklearn.feature_selection import mutual_info_classif
from scipy.stats import spearmanr as _spearmanr

_train_ids_path = project_root / "data" / "splits" / "train_ids.parquet"
_signal_available = _train_ids_path.exists() and "outcome_1x2" in _df_full.columns

if _signal_available:
    _train_ids = pd.read_parquet(_train_ids_path)
    _id_col = "id" if "id" in _train_ids.columns else _train_ids.columns[0]
    _train_id_set = set(_train_ids[_id_col].values)
    _df_train = _df_full[_df_full.index.isin(_train_id_set)].copy()
    _df_train["home_win"] = (_df_train["outcome_1x2"] == 0).astype(int)

    _signal_cols = [c for c in df_features.columns if c.startswith("diff_")]
    _signal_cols = [c for c in _signal_cols if c in _df_train.columns]
    _df_signal = _df_train[_signal_cols + ["home_win"]].dropna()

    if len(_df_signal) > 100 and _signal_cols:
        _X = _df_signal[_signal_cols].values
        _y = _df_signal["home_win"].values
        _mi = mutual_info_classif(_X, _y, discrete_features=False, random_state=42)
        _spearman_r = np.array([
            abs(float(_spearmanr(_df_signal[c], _y).statistic))
            for c in _signal_cols
        ])
        _sig_df = (
            pd.DataFrame({"feature": _signal_cols, "mutual_info": _mi, "spearman_abs": _spearman_r})
            .sort_values("mutual_info", ascending=False)
            .head(20)
        )

        _fig, _axes = plt.subplots(1, 2, figsize=(14, 7))
        for _i, (_metric, _title, _color) in enumerate([
            ("mutual_info", "Mutual Information (nats)", "#2196F3"),
            ("spearman_abs", "|Spearman ρ| with home win", "#4CAF50"),
        ]):
            _plot_df = _sig_df.sort_values(_metric, ascending=True).tail(20)
            _axes[_i].barh(_plot_df["feature"], _plot_df[_metric], color=_color, alpha=0.85)
            _axes[_i].set_xlabel(_title)
            _axes[_i].set_title(f"Top 20 features — {_title}")
            _axes[_i].tick_params(axis="y", labelsize=8)
        plt.tight_layout()
        plt.show()

        _group_mi = {}
        for _sc in ["elo", "h2h", "ha", "tournament", "season", "all"]:
            _in_scope = [c for c in _sig_df["feature"]
                         if ("elo" in c and _sc == "elo") or f"_{_sc}_" in c]
            if _in_scope:
                _group_mi[_sc] = float(_sig_df[_sig_df["feature"].isin(_in_scope)]["mutual_info"].mean())
        if _group_mi:
            _group_mi_s = pd.Series(_group_mi).sort_values(ascending=False)
            display(Markdown("**Mean MI by feature group (top-20 features, train split):**"))
            display(_group_mi_s.to_frame("mean_mi").style.format("{:.4f}").bar(color="#2196F3"))
    else:
        display(Markdown("⚠️ Not enough training samples after dropna — signal analysis skipped."))
else:
    display(Markdown(
        "⚠️ Training split or `outcome_1x2` not available — "
        "signal analysis skipped. Run `dvc repro split_data` first."
    ))

(a) Top features by Mutual Information and |Spearman ρ| with home win (train split)

	mean_mi
elo	0.0401
h2h	0.0364
tournament	0.0236
ha	0.0231
all	0.0220
season	0.0214

6. Conclusions

Show code

_n_total = len(df_features.columns)
_elo_included = PARAMS["features"]["elo"].get("include", False)
_n_elo = (
    int(_group_summary.loc["elo", "n_features"]) if "elo" in _group_summary.index
    else sum(1 for c in df_features.columns if "elo" in c)
)
_n_h2h = (
    int(_group_summary.loc["h2h", "n_features"]) if "h2h" in _group_summary.index
    else sum(1 for c in df_features.columns if "_h2h_" in c)
)
_n_rest = (
    int(_group_summary.loc["rest", "n_features"]) if "rest" in _group_summary.index
    else sum(1 for c in df_features.columns if "rest" in c)
)
_rolling_scopes = {"all", "season", "tournament", "ha"}
_n_rolling = int(_group_summary[_group_summary.index.isin(_rolling_scopes)]["n_features"].sum())
_n_null_features = int((df_features.isnull().mean() > 0).sum())
_n_high_null = int((df_features.isnull().mean() > 0.05).sum())

_lines = [
    f"1. **{_n_total} features** total: {_n_elo} ELO-derived, {_n_rolling} rolling-stat "
    f"(all/season/tournament/ha scopes), {_n_h2h} H2H, {_n_rest} rest-days.",
    f"2. **ELO ratings {'included' if _elo_included else 'excluded'}** "
    f"(k={PARAMS['features']['elo'].get('k_factor', '?')}, "
    f"home_advantage={PARAMS['features']['elo'].get('home_advantage', '?')}). "
    f"Ratings are scoped per `{PARAMS['features']['elo'].get('group_col', '?')}` to prevent "
    f"cross-tournament leakage. ELO resets to `initial_rating` when a team enters a new tournament.",
    f"3. **Rolling windows used**: {PARAMS['features']['window_sizes']} — shorter windows capture recent form; "
    f"longer windows capture structural team quality.",
    f"4. **Feature completeness**: {_n_null_features} features have any nulls; {_n_high_null} exceed the 5% threshold "
    f"{'(requires review before training)' if _n_high_null > 0 else '(all within acceptable range)'}. "
    f"H2H features are expected to have high null rates (many team pairs never meet).",
    "5. **No future leakage**: all features use only pre-match data. "
    "The `batch_inference` pipeline applies identical feature computation to upcoming matches.",
]
print("\n".join(_lines))

456 features total: 3 ELO-derived, 360 rolling-stat (all/season/tournament/ha scopes), 90 H2H, 3 rest-days.
ELO ratings included (k=32.0, home_advantage=50.0). Ratings are scoped per tournamentId to prevent cross-tournament leakage. ELO resets to initial_rating when a team enters a new tournament.
Rolling windows used: [1, 2, 3, 5, 10] — shorter windows capture recent form; longer windows capture structural team quality.
Feature completeness: 378 features have any nulls; 150 exceed the 5% threshold (requires review before training). H2H features are expected to have high null rates (many team pairs never meet).
No future leakage: all features use only pre-match data. The batch_inference pipeline applies identical feature computation to upcoming matches.

This report is generated from DVC-versioned artifacts and re-renders automatically after dvc repro feature_engineering.