Feature Engineering & Offline/Online Parity

Purpose

Document the implemented feature families, how leakage is prevented at the feature level, how feature logic is shared between offline training and online serving, and what feature types are excluded and why.


Design constraints

Every feature in this system must satisfy all of the following before it can be used in training:

  1. Pre-match only — computable from data available before kick-off.
  2. Point-in-time correct — no information from the current match (or any future match) enters the feature.
  3. Deterministic — same input data + same params.yaml → same feature values.
  4. Parity-safe — the same logic path is used at training time and at inference time.

A feature that violates any of these constraints is excluded. This is not a best-practice recommendation — it is a hard requirement enforced by tests.
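One way such a test could look is sketched below. It is not taken from the project's test suite: `prior_mean` and `leaks_current_or_future` are hypothetical helpers. The idea is to perturb match N and assert that no feature at position N or earlier changes.

```python
def prior_mean(values, window):
    """Mean of up to `window` values strictly before each position (None without history)."""
    feats = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]
        feats.append(sum(past) / len(past) if past else None)
    return feats


def leaks_current_or_future(feature_fn, values, window):
    """Return True if changing match i alters any feature at position <= i."""
    baseline = feature_fn(values, window)
    for i in range(len(values)):
        perturbed = list(values)
        perturbed[i] += 1000  # large change to the outcome of match i only
        if feature_fn(perturbed, window)[: i + 1] != baseline[: i + 1]:
            return True
    return False


def leaky_mean(values, window):
    """Deliberately wrong variant: the window includes the current match."""
    return [
        sum(values[max(0, i - window + 1): i + 1]) / len(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]
```

A check of this kind is independent of the feature's internals, so the same helper can be pointed at any feature function with a compatible signature.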


Implemented feature families

1. Rolling match statistics

Source: src/features/stats_matches.py → build_team_match_table + add_rolling_features

Each match produces per-team rolling aggregates of recent performance. The pipeline first reshapes match-level data into a long team-match table, then computes rolling windows per team.

Metrics aggregated:

| Metric | Meaning |
|---|---|
| win | Win flag (1/0) |
| draw | Draw flag (1/0) |
| loss | Loss flag (1/0) |
| goals_for | Goals scored |
| goals_against | Goals conceded |

Window sizes (configurable via params.yaml → features.window_sizes): [1, 2, 3, 5, 10]

For each metric and window, three columns are produced at match level: home_{metric}_w{window}, away_{metric}_w{window}, diff_{metric}_w{window}

A coverage column ({prefix}_coverage_w{window}) tracks how many matches contributed to each rolling value — important for teams with short history.

Leakage guard: shift(1) is applied to the team-match series before the rolling window. Match N's feature uses only matches N−1, N−2, … — never match N itself.
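The shift-then-roll pattern can be sketched with pandas as follows. This is a minimal illustration, not the body of add_rolling_features: the column names `team` and `start_time`, the function name, and the use of a mean as the aggregate are all assumptions.

```python
import pandas as pd


def add_rolling_mean(team_matches: pd.DataFrame, metric: str, window: int) -> pd.DataFrame:
    """Append a leakage-safe rolling mean and a coverage count per team.

    Assumes a long team-match table with hypothetical columns
    "team" and "start_time" plus the metric column.
    """
    out = team_matches.sort_values(["team", "start_time"]).copy()

    # shift(1) within each team removes the current match from its own window.
    prior = out.groupby("team")[metric].shift(1)

    out[f"{metric}_w{window}"] = prior.groupby(out["team"]).transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )
    # Coverage: how many earlier matches actually fed the window.
    out[f"coverage_w{window}"] = (
        prior.notna().astype(int).groupby(out["team"]).transform(
            lambda s: s.rolling(window, min_periods=1).sum()
        )
    )
    return out
```

With this ordering, a team's first match has a missing rolling value and a coverage of 0, which is the situation the coverage column is meant to expose.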

The model is trained on diff features by default (classification.side = "diff"), giving the difference between home and away rolling statistics.


2. ELO ratings

Source: src/features/elo.py → compute_elo_ratings

ELO ratings capture relative team strength within a tournament. Ratings are:

  - Computed per tournamentId: each competition maintains independent state.
  - Updated after each match; the value attached to a row is the pre-match rating.
  - Scoped to the team's history in that tournament only (no cross-competition bleed).

Columns produced per match:

| Column | Meaning |
|---|---|
| home_elo_pre | Home team ELO before this match |
| away_elo_pre | Away team ELO before this match |
| diff_elo_pre | home_elo_pre − away_elo_pre |

ELO configuration (from params.yaml → features.elo):

  - k_factor: 32.0 (update step size)
  - initial_rating: 1500.0 (rating for teams with no prior history)
  - home_advantage: 50.0 (additive bonus in the expected-score calculation)

Teams with no history in a tournament receive the initial_rating. The home advantage factor is applied in the expected-score formula, not as a feature column.
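A minimal sketch of the update loop described above, assuming the standard logistic Elo formula with a 400-point scale (the scale is not stated in this document). The function name and the match keys `home`, `away`, and `home_result` are hypothetical; only `tournamentId` and the output column names come from the text.

```python
from collections import defaultdict


def compute_elo_sketch(matches, k_factor=32.0, initial_rating=1500.0, home_advantage=50.0):
    """Attach pre-match Elo ratings to chronologically ordered matches.

    Each match is a dict with "tournamentId", "home", "away" and
    "home_result" (1.0 home win, 0.5 draw, 0.0 home loss).
    """
    # Ratings are keyed by (tournament, team), so competitions never share state.
    ratings = defaultdict(lambda: initial_rating)
    rows = []
    for match in matches:
        home_key = (match["tournamentId"], match["home"])
        away_key = (match["tournamentId"], match["away"])
        home_pre, away_pre = ratings[home_key], ratings[away_key]

        # Record the ratings BEFORE the update: these are the feature values.
        rows.append({
            **match,
            "home_elo_pre": home_pre,
            "away_elo_pre": away_pre,
            "diff_elo_pre": home_pre - away_pre,
        })

        # Home advantage enters only the expected score, not the stored rating.
        exponent = (away_pre - home_pre - home_advantage) / 400.0
        expected_home = 1.0 / (1.0 + 10.0 ** exponent)
        delta = k_factor * (match["home_result"] - expected_home)
        ratings[home_key] = home_pre + delta
        ratings[away_key] = away_pre - delta
    return rows
```

Because the row is appended before the ratings are updated, the result of a match can never influence its own `*_elo_pre` columns.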


3. Rest days

Source: src/features/stats_matches.py → add_rest_days

Days elapsed since each team's previous match, computed from startTimeUtc. Captures fixture congestion and recovery time.

Columns produced: home_rest_days, away_rest_days, diff_rest_days
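The calculation can be sketched in plain Python as below. The function name and the `home` / `away` keys are hypothetical. Returning None for a team's first recorded match and using fractional days are assumptions, since the document does not specify either.

```python
from datetime import datetime


def _days_since(previous, kickoff):
    """Days between two datetimes, or None when there is no previous match."""
    if previous is None:
        return None
    return (kickoff - previous).total_seconds() / 86400.0


def add_rest_days_sketch(matches):
    """Attach home/away/diff rest days using only earlier kick-off times."""
    last_played = {}  # team -> kick-off time of that team's most recent match
    out = []
    for match in sorted(matches, key=lambda m: m["startTimeUtc"]):
        kickoff = match["startTimeUtc"]
        home_rest = _days_since(last_played.get(match["home"]), kickoff)
        away_rest = _days_since(last_played.get(match["away"]), kickoff)
        diff = None if home_rest is None or away_rest is None else home_rest - away_rest
        out.append({
            **match,
            "home_rest_days": home_rest,
            "away_rest_days": away_rest,
            "diff_rest_days": diff,
        })
        # Update state only after the row is built, so the match never sees itself.
        last_played[match["home"]] = kickoff
        last_played[match["away"]] = kickoff
    return out
```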


4. Head-to-head (H2H) statistics

Source: src/features/stats_matches.py → add_h2h_features

Rolling historical statistics between the two specific teams playing each other, regardless of venue. Uses the same shift(1) + rolling approach as general stats.

Columns are prefixed h2h_. The ablation study (full_no_h2h vs full) quantifies the contribution of H2H features.
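As an illustration of venue-independent pairing, the sketch below keys history on the unordered pair of teams and computes the feature before recording the current result, which plays the same role as shift(1). The function name, the `winner` key, and the exact column names are hypothetical; only the h2h_ prefix comes from this document.

```python
from collections import defaultdict, deque


def add_h2h_win_rate_sketch(matches, window=5):
    """Attach the home team's win rate over the last `window` meetings of the pair.

    Each match is a dict with "home", "away" and "winner"
    (a team name, or None for a draw), in chronological order.
    """
    # frozenset ignores order, so A-vs-B and B-vs-A share one history.
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for match in matches:
        past = history[frozenset((match["home"], match["away"]))]
        if past:
            rate = sum(1 for winner in past if winner == match["home"]) / len(past)
        else:
            rate = None  # the pair has never met before
        out.append({
            **match,
            f"h2h_home_win_w{window}": rate,
            f"h2h_coverage_w{window}": len(past),
        })
        # Record the result only after the feature row has been built.
        past.append(match["winner"])
    return out
```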


5. Categorical context

Source: params.yaml → classification.cat_cols

Currently the only categorical feature is sex (men's vs. women's competition). It is passed to the model as a categorical column.


Feature selection: diff side

The model is trained on the differential features (home − away) for all stats and ELO. This reduces dimensionality while preserving the relative strength signal that drives outcomes. The side parameter (classification.side: "diff") controls this selection.
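A minimal sketch of prefix-based side selection, assuming columns follow the home_ / away_ / diff_ naming described earlier. The function name is hypothetical, and how the real pipeline treats h2h_ columns under this setting is not specified here, so they are simply left out.

```python
def select_side_columns(columns, side="diff", cat_cols=("sex",)):
    """Keep columns for the chosen side plus the categorical context columns."""
    prefix = f"{side}_"
    return [c for c in columns if c.startswith(prefix) or c in cat_cols]


all_columns = ["home_win_w5", "away_win_w5", "diff_win_w5", "diff_elo_pre", "sex"]
selected = select_side_columns(all_columns)
```

With the default side, `selected` keeps a third of the per-metric columns while retaining the home-minus-away signal.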


Excluded feature types

| Feature type | Reason for exclusion |
|---|---|
| In-match events (goals scored, cards) | Post-kickoff data; strict pre-match cutoff violated |
| Player-level stats | Not available in current data source; planned future improvement |
| Bookmaker odds as input features | Clean separation of prediction from market data |
| Weather / pitch conditions | Not available in current source |
| Live standings / table position at match time | Requires careful point-in-time join; not yet implemented safely |

Offline/online parity

At training time: features are computed by the DVC feature_engineering stage and stored in data/features/features.parquet. These are the features the model is trained on.

At inference time (batch): the DVC batch_inference stage runs the same feature code (src/features/stats_matches.py, src/features/elo.py) on upcoming matches to produce data/predictions/match_features.parquet. The serving layer reads from this artifact.

Parity is maintained because:

  - The same source modules are used in both paths.
  - No ad-hoc transformations are applied at inference.
  - Feature column names and dtypes are recorded in features_meta.parquet and validated.
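The validation step could look roughly like the following sketch, which compares a recorded schema against the schema of an inference-time feature table. The function name and the plain-dict representation of a schema (column name to dtype string) are assumptions made for illustration.

```python
def check_feature_parity(expected, actual):
    """Compare two {column_name: dtype_string} schemas and list every mismatch."""
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(
                f"dtype mismatch for {name}: expected {dtype}, got {actual[name]}"
            )
    # Columns that exist at inference but were never seen in training.
    for name in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected column: {name}")
    return problems
```

An empty list means the two paths agree; a batch job could refuse to publish predictions whenever the list is non-empty.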

If parity cannot be guaranteed for a feature, that feature is excluded from the model. This is the governing rule for all serving-path decisions.


Feature metadata

data/features/features_meta.parquet records each feature's name, type (numeric/categorical), and origin family. The training pipeline reads features_meta.parquet to determine X_cols, num_cols, and cat_cols — no hardcoded column lists in model code.
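Deriving the column lists from metadata might look like the sketch below. The metadata is shown as a list of dicts with hypothetical keys `name`, `type`, and `family`; the actual column names inside features_meta.parquet are not given in this document.

```python
def split_feature_columns(meta_rows):
    """Derive (X_cols, num_cols, cat_cols) from feature metadata rows."""
    known_types = ("numeric", "categorical")
    unknown = [row["name"] for row in meta_rows if row["type"] not in known_types]
    if unknown:
        # Fail loudly rather than silently dropping a feature.
        raise ValueError(f"unknown feature type for: {unknown}")
    num_cols = [row["name"] for row in meta_rows if row["type"] == "numeric"]
    cat_cols = [row["name"] for row in meta_rows if row["type"] == "categorical"]
    return num_cols + cat_cols, num_cols, cat_cols


meta = [
    {"name": "diff_win_w5", "type": "numeric", "family": "rolling"},
    {"name": "diff_elo_pre", "type": "numeric", "family": "elo"},
    {"name": "sex", "type": "categorical", "family": "context"},
]
X_cols, num_cols, cat_cols = split_feature_columns(meta)
```

Keeping the lists derived rather than hardcoded means that adding a feature family only requires the metadata to be regenerated.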


Implementation status

| Feature family | Status |
|---|---|
| Rolling stats (win/draw/loss/goals) | ✅ Implemented |
| ELO ratings per tournament | ✅ Implemented |
| Rest days | ✅ Implemented |
| H2H rolling statistics | ✅ Implemented |
| Categorical context (sex) | ✅ Implemented |
| Feature metadata contract | ✅ Implemented |
| Player-level features | 📋 Planned |
| Live standings join | 📋 Planned |