Reproducibility note. The figures and statistics in this report are tied to a specific snapshot of the raw data. The Git commit above identifies the pipeline code; the DVC MD5 above is the content-addressable hash of match_raw.parquet as recorded in dvc.lock — it is stable across file copies and touch operations. To reproduce this report exactly, run dvc repro from the same commit before rendering.
status=6 (Finished) rows form the training dataset; status=1 (Upcoming) rows after the last finished match form the inference-ready future split. Together these two groups account for the vast majority of the dataset. All other statuses — Postponed, Cancelled, In progress, Unknown — are discarded at the preprocessing stage and do not enter the model pipeline.
Seasonal gap. June–July show a consistent drop in match count across all years, reflecting the European summer break. For rolling-window features (e.g. 5-match form) computed on historical data, matches immediately after the break are computed over a stale window spanning the previous season. This is expected and acceptable: the feature engineering stage does not reset rolling accumulators at season boundaries, so the last pre-break games remain the most recent context available at inference time.
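The non-resetting behaviour can be sketched with a rolling mean over a per-team points column. This is a toy frame with assumed `teamId`/`points` columns, not the actual feature-engineering code; the real stage operates on finished.parquet:

```python
import pandas as pd

# Toy per-team match log spanning a summer break; "points" (3/1/0) is an
# assumed stand-in for whatever result metric the form feature uses.
log = pd.DataFrame({
    "teamId": [7] * 8,
    "startTimeUtc": pd.to_datetime([
        "2024-04-20", "2024-04-27", "2024-05-04", "2024-05-11",
        "2024-05-18",                              # last pre-break matches
        "2024-08-10", "2024-08-17", "2024-08-24",  # new season
    ], utc=True),
    "points": [3, 1, 0, 3, 3, 0, 1, 3],
})

# Rolling accumulators are NOT reset at the season boundary: shift(1) keeps
# the window strictly pre-match, and the first post-break game still sees
# the five most recent (previous-season) results.
log = log.sort_values("startTimeUtc")
log["form5"] = (
    log.groupby("teamId")["points"]
    .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)
print(log[["startTimeUtc", "points", "form5"]])
```

The first post-break row's `form5` is computed entirely from the previous season's final five matches, which is exactly the "stale window" described above.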
Figure 4: Goal count distributions — raw finished matches
Matches with in-match statistics
A subset of finished matches has rich live match center (MC) data collected during the game: per-minute team stats (shots, possession, passes, aerials, tackles, dribbles), referees, lineups, venue, and score timeline. This data is stored across dedicated tables — matches_live_header, matches_live_info, matches_live_{home,away}_stats, etc.
Show code
_df_mc = df_match_raw[
    (df_match_raw["status"] == 6)
    & (df_match_raw["period"] != 0)
    & (df_match_raw["hasPreview"])
]
_total = len(df_match_raw[df_match_raw["status"] == 6])
display(Markdown(
    f"**{len(_df_mc):,}** of **{_total:,}** finished matches "
    f"({len(_df_mc)/_total*100:.1f}%) have in-match statistics available."
))
del _df_mc
Table 4
65,391 of 966,140 finished matches (6.8%) have in-match statistics available.
Show code
_df_finished_raw = df_match_raw[df_match_raw["status"] == 6].copy()
_df_finished_raw["_year"] = pd.to_datetime(_df_finished_raw["startTimeUtc"]).dt.year
_df_finished_raw["_has_mc"] = (
    (_df_finished_raw["period"] != 0) & (_df_finished_raw["hasPreview"])
)
_mc_map = {True: "With MC stats", False: "Without MC stats"}
_by_year_mc = (
    _df_finished_raw.groupby(["_year", "_has_mc"])
    .size()
    .reset_index(name="count")
)
_fig, _ax = plt.subplots(figsize=(9, 5))
for _has_mc, _grp in _by_year_mc.groupby("_has_mc"):
    _ax.bar(
        _grp["_year"].astype(str),
        _grp["count"],
        label=_mc_map[_has_mc],
        alpha=0.85,
        color="#4CAF50" if _has_mc else "#90CAF9",
    )
_ax.set_xlabel("Year")
_ax.set_ylabel("# matches")
_ax.set_title("Finished matches per year — with vs without in-match statistics")
_ax.tick_params(axis="x", rotation=45)
_ax.legend(title="MC stats")
plt.tight_layout()
plt.show()
del _df_finished_raw
Figure 5: Finished matches per year — with vs without in-match statistics
Note
MC data availability. In-match statistics are available only from a certain year onward, reflecting the point at which live data collection was integrated into the pipeline. Matches before that cutoff have hasPreview = FALSE or period = 0 and are treated as without MC. This structural gap is visible in the chart above as a step-change in the “With MC stats” series. Any future feature set built on MC data will be constrained to the post-cutoff subset, reducing training set size relative to the full finished split.
To browse in-match statistics manually:

1. Enable the “Only with MC stats” checkbox in the filters panel.
2. Click any match link — this opens the match detail page.
3. Select the “Match Centre” tab to view live stats: timeline, possession, shots, passes, and more.
Important
Out of scope for v1. In-match statistics are not used as features in the current pipeline. All models rely solely on pre-match information to avoid data leakage. Live stats are a natural candidate for future feature iterations (e.g. half-time retraining, live odds adjustment).
2. Data Preparation Pipeline
This section documents the full data preparation phase: from raw ingestion and exploratory profiling through preprocessing and artifact output. The implementation lives in src/data/preprocess.py and is orchestrated as a DVC stage (preprocess).
Implementation — preprocess_and_split()
The following steps are applied inside preprocess_and_split() in the order they execute:
Drop irrelevant columns — removes display/UI fields (stageName, regionName, isOpta, sort-order keys, etc.) and all live/post-match columns (elapsed, scoreChangedAt, period, incidents, etc.) — approximately 40 columns in total.
Downcast ID types — casts identifier columns to compact integer types (tournamentId, regionId, stageId, seasonId → int16; id, homeTeamId, awayTeamId → int32; sex → int8) to reduce memory footprint.
Parse and sort by time — converts startTimeUtc to UTC-aware datetime, then sorts all rows ascending by kickoff time.
Split by status — selects status=6 rows as finished and status=1 rows whose kickoff is strictly after the last finished match as future. All other statuses (postponed, cancelled, in-progress, unknown) are discarded. The status column is dropped after partitioning.
Downcast scores — casts homeScore and awayScore to int8 on the finished split after null removal.
Compute classification target — derives outcome_1x2 (0 = home win, 1 = draw, 2 = away win) from raw scores before clipping, so the label is unaffected by the clip operation.
Clip outlier scores — for each of homeScore / awayScore independently, computes the params.yaml → preprocessing.score_outlier_pct quantile (default 0.9999 ≈ 99.99-th percentile) on the finished split and clips values above that threshold. Percentile is preferred over IQR (whose fence falls inside the normal-score range for Poisson-concentrated data) and Z-score (assumes normality). At 0.9999 the threshold is ~10–12 goals, clipping only technical results/data errors (~0.01% of matches) while keeping all legitimate high-scoring games. Matches are kept (not dropped) so team history remains continuous for rolling stats and ELO.
Compute regression targets — derives sumScore = homeScore + awayScore and diffScore = homeScore − awayScore (after clipping, so derived targets are consistent with the clipped scores).
Drop intermediate columns — removes the temporary binary flags used during target derivation (homeWin, awayWin, draw) and disciplinary columns (homeYellowCards, awayYellowCards, homeRedCards, awayRedCards) that are excluded from the v1 feature set.
Drop score columns from future — removes homeScore, awayScore, and all extra-time / penalty score fields from the future split to prevent any target leakage.
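The steps above can be condensed into a sketch. This is a simplified stand-in for `preprocess_and_split()`, not the actual implementation in src/data/preprocess.py: column dropping and ID downcasting are omitted, and the clip threshold is rounded up here so in-range integer scores are untouched (an assumption about the real behaviour):

```python
import numpy as np
import pandas as pd

def preprocess_and_split_sketch(df, score_outlier_pct=0.9999):
    """Simplified stand-in for the documented steps (drop/downcast omitted)."""
    df = df.copy()
    # Parse and sort by time.
    df["startTimeUtc"] = pd.to_datetime(df["startTimeUtc"], utc=True)
    df = df.sort_values("startTimeUtc")

    # Split by status: future = upcoming, strictly after the last finished match.
    finished = df[df["status"] == 6].dropna(subset=["homeScore", "awayScore"]).copy()
    last_kickoff = finished["startTimeUtc"].max()
    future = df[(df["status"] == 1) & (df["startTimeUtc"] > last_kickoff)].copy()

    # Classification target from raw, pre-clip scores.
    finished["outcome_1x2"] = np.select(
        [finished["homeScore"] > finished["awayScore"],
         finished["homeScore"] == finished["awayScore"]],
        [0, 1],
        default=2,
    )

    # Clip outlier scores at the configured quantile; rows are kept, not dropped.
    for col in ("homeScore", "awayScore"):
        threshold = np.ceil(finished[col].quantile(score_outlier_pct))
        finished[col] = finished[col].clip(upper=threshold).astype("int8")

    # Regression targets from the clipped scores.
    finished["sumScore"] = finished["homeScore"] + finished["awayScore"]
    finished["diffScore"] = finished["homeScore"] - finished["awayScore"]

    # No score or target columns ever reach the future split.
    future = future.drop(columns=["homeScore", "awayScore"], errors="ignore")
    return finished.drop(columns="status"), future.drop(columns="status")
```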
Note
All steps use only pre-match information or post-match results that are applied to the finished split only. The future split never receives target or score columns.
Raw data — columns selected for preprocessing
The table below shows only the columns retained after dropping display/UI and live-match fields. homeScore and awayScore are null for future matches and will be separated in the next step.
Table 7: Column types, null counts and cardinality
| column | dtype | null_count | null_% | unique | min | max |
|---|---|---|---|---|---|---|
| startTimeUtc | datetime64[ns, UTC] | 0 | 0.00% | 574 | 2026-04-16 12:00:00+00:00 | 2026-05-01 21:30:00+00:00 |
| id | int32 | 0 | 0.00% | 3186 | 1901243 | 1978419 |
| sex | int8 | 0 | 0.00% | 2 | 1 | 2 |
| regionId | int16 | 0 | 0.00% | 90 | 3 | 265 |
| tournamentId | int16 | 0 | 0.00% | 195 | 1 | 783 |
| seasonId | int16 | 0 | 0.00% | 195 | 10720 | 11075 |
| stageId | int16 | 0 | 0.00% | 261 | 24478 | 25346 |
| homeTeamId | int32 | 0 | 0.00% | 2515 | 1 | 32635 |
| awayTeamId | int32 | 0 | 0.00% | 2510 | 1 | 32633 |
Warning
Cold-start risk. Teams that appear in the future split but have no history in the finished split will receive zero rolling statistics and the initial ELO rating (params.yaml → features.elo.initial_rating). This is expected behaviour and is handled at the feature engineering stage with explicit fallback defaults.
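A minimal sketch of the fallback behaviour, assuming a plain dict-based rating store and a hypothetical `get_rating` helper (the value 1500.0 stands in for features.elo.initial_rating, whose actual default lives in params.yaml; the real logic is in the feature engineering stage):

```python
# Hypothetical helper — not the project's actual API. Unseen team IDs get
# the configured initial ELO rating instead of raising a KeyError.
INITIAL_RATING = 1500.0  # assumed stand-in for features.elo.initial_rating

def get_rating(ratings: dict, team_id: int,
               initial_rating: float = INITIAL_RATING) -> float:
    """Return a team's current rating, falling back to the initial rating
    for cold-start teams with no history in the finished split."""
    return ratings.get(team_id, initial_rating)

ratings = {101: 1632.4, 202: 1488.9}
print(get_rating(ratings, 101))    # known team
print(get_rating(ratings, 32635))  # cold-start team -> initial rating
```

Rolling statistics use the same pattern: a lookup with an explicit default (e.g. zero) rather than an error, so the future split can always be scored.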
Show code
_known_teams = set(df_finished["homeTeamId"]) | set(df_finished["awayTeamId"])
_future_home = set(df_future["homeTeamId"])
_future_away = set(df_future["awayTeamId"])
_cold_home = _future_home - _known_teams
_cold_away = _future_away - _known_teams
_cold_any = (_future_home | _future_away) - _known_teams
display(Markdown(
    f"**Cold-start teams (no history in finished split):** "
    f"{len(_cold_home)} appearing as home, "
    f"{len(_cold_away)} appearing as away "
    f"({len(_cold_any)} unique teams total)"
))
del _known_teams, _future_home, _future_away, _cold_home, _cold_away, _cold_any
Table 8
Cold-start teams (no history in finished split): 9 appearing as home, 2 appearing as away (11 unique teams total)
3. Dataset definition and targets
Matches — Finished vs Future (column availability + target/label roles)
| Column | Finished | Future | Role |
|---|---|---|---|
| startTimeUtc | ✓ | ✓ | Feature (time) |
| id | ✓ | ✓ | Feature (match identity) |
| sex | ✓ | ✓ | Feature (competition) |
| regionId | ✓ | ✓ | Feature (geography) |
| tournamentId | ✓ | ✓ | Feature (competition) |
| seasonId | ✓ | ✓ | Feature (season) |
| stageId | ✓ | ✓ | Feature (competition) |
| homeTeamId | ✓ | ✓ | Feature (entity) |
| awayTeamId | ✓ | ✓ | Feature (entity) |
| outcome_1x2 | ✓ | — | Classification target (3-class label) |
| homeScore | ✓ | — | Regression target |
| awayScore | ✓ | — | Regression target |
| sumScore | ✓ | — | Regression target (derived) |
| diffScore | ✓ | — | Regression target (derived) |
Important
v1 scope: This report focuses on the classification target (outcome_1x2: Home win / Draw / Away win). Regression targets (homeScore, awayScore, sumScore, diffScore) are out of scope for v1 and will be analysed in a future iteration.
Class imbalance. Home win is consistently the most frequent outcome; draws are the minority class. The imbalance is moderate (roughly 1.7–2× between majority and minority) and does not require resampling — tree-based models handle it natively. Probability calibration is applied at the final training stage to correct systematic bias in predicted probabilities (see params.yaml → final_train.calibration).
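As an illustration of the calibration step — not the project's actual code or its configured method, which is set in params.yaml → final_train.calibration — scikit-learn's `CalibratedClassifierCV` wraps a base classifier and recalibrates its predicted probabilities on held-out folds:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy 3-class problem standing in for outcome_1x2 (0 / 1 / 2).
X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrap the base model; probabilities are recalibrated on held-out folds,
# correcting systematic over/under-confidence without resampling classes.
base = RandomForestClassifier(n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
calibrated.fit(X_tr, y_tr)

proba = calibrated.predict_proba(X_te)  # shape (n_test, 3); rows sum to 1
```

Isotonic regression is shown here for concreteness; sigmoid (Platt) scaling is the usual alternative when the calibration set is small.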
Temporal stability of outcome_1x2
A drift-check: if the home-win rate shifts substantially over time, the temporal split strategy must account for it and calibration must be re-validated on recent data.
Figure 7: outcome_1x2 proportions per year — temporal stability
Note
Distribution is stable. Home-win, draw, and away-win rates remain within a narrow band across all years, with no sustained structural shift. This confirms that a single-split temporal train/test strategy is appropriate — the label distribution the model trains on is representative of the distribution it will be evaluated and deployed against. Any year-to-year variation visible in the chart is within expected sampling noise for the per-year match counts. Calibration should nonetheless be validated on the most recent data slice, as even small shifts compound in probability output.
Population heterogeneity — by tournament
The global class proportions above are an aggregate that masks substantial variation across competitions. This plot examines whether outcome distributions are homogeneous across the most-played tournaments, directly motivating params.yaml → classification.groupby_cols: ["regionId", "sex"] as stratification axes for prior estimation and probability calibration.
Show code
_TOP_N = 20
_OUTCOME_LABELS = {0: "Home win", 1: "Draw", 2: "Away win"}
_OUTCOME_COLORS = ["#2196F3", "#FF9800", "#4CAF50"]

# Recover tournament names from raw (dropped during preprocessing)
_meta = df_match_raw[["id", "tournamentName"]].drop_duplicates("id").set_index("id")
_df_h = df_finished.join(_meta, how="left")
_top_t = _df_h["tournamentName"].value_counts().head(_TOP_N).index
_df_top = _df_h[_df_h["tournamentName"].isin(_top_t)].copy()
_rates_t = (
    _df_top.groupby(["tournamentName", "outcome_1x2"])
    .size()
    .unstack(fill_value=0)
    .div(_df_top.groupby("tournamentName").size(), axis=0)
    .sort_values(0, ascending=False)  # sort by home-win rate descending
)
_rates_t.columns = [_OUTCOME_LABELS[c] for c in _rates_t.columns]
_fig, _ax = plt.subplots(figsize=(9, max(5, len(_rates_t) * 0.42)))
_rates_t.plot(kind="barh", stacked=True, color=_OUTCOME_COLORS, ax=_ax, width=0.75)
_ax.invert_yaxis()
_ax.set_xlabel("Proportion")
_ax.set_title(f"outcome_1x2 distribution — top {_TOP_N} tournaments")
_ax.legend(title="Outcome", bbox_to_anchor=(1.01, 1), loc="upper left")
for _bar_container, _col in zip(_ax.containers, _rates_t.columns):
    _ax.bar_label(_bar_container, fmt="%.2f", label_type="center", fontsize=7, color="white")
plt.tight_layout()
plt.show()
del _meta, _df_h, _df_top, _rates_t
Figure 8: outcome_1x2 proportions — top-20 tournaments by match count
Note
Observed heterogeneity. Home-win rates vary across tournaments (typically 38–52%), confirming that a single global prior is insufficient. regionId and sex are used as stratification axes because tournament-level granularity would introduce sparse strata for lower-division leagues. This is the direct justification for groupby_cols: ["regionId", "sex"] in calibration.
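Group-wise prior estimation along these axes can be sketched in a few lines (toy data below; the real computation would run over df_finished with the same groupby):

```python
import pandas as pd

# Toy finished split with the documented column names.
df = pd.DataFrame({
    "regionId":    [1, 1, 1, 2, 2, 2],
    "sex":         [1, 1, 1, 1, 1, 1],
    "outcome_1x2": [0, 0, 1, 2, 2, 0],
})

# Per-(regionId, sex) class priors: one row per stratum, one column per
# outcome class, rows summing to 1. Missing classes get probability 0.
priors = (
    df.groupby(["regionId", "sex"])["outcome_1x2"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)
print(priors)
```

Coarser strata (region and sex rather than tournament) keep every group large enough for stable estimates, which is the trade-off the note above describes.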
Regression targets
Four regression targets are derived from the final, clipped scores. homeScore and awayScore are the primary outputs; sumScore (total goals) and diffScore (home minus away) are derived aggregates that encode different aspects of the result and may be predicted independently or used as auxiliary signals.
Show code
_cols_and_colors = [
    ("homeScore", "#2196F3"),
    ("awayScore", "#4CAF50"),
    ("sumScore", "#FF9800"),
    ("diffScore", "#9C27B0"),
]
_fig, _axes = plt.subplots(4, 1, figsize=(9, 16), constrained_layout=True)
_fig.suptitle("Regression target distributions", fontsize=14)
for _ax, (_col, _color) in zip(_axes, _cols_and_colors):
    if _col not in df_finished.columns:
        _ax.set_visible(False)
        continue
    _props_r = df_finished[_col].value_counts(normalize=True).sort_index()
    # Use numeric x-axis to preserve correct order for columns with negative values (diffScore)
    _x_vals = _props_r.index.tolist()
    _ax.bar(range(len(_x_vals)), _props_r.values, color=_color, width=0.9)
    _ax.set_xticks(range(len(_x_vals)))
    _ax.set_xticklabels([str(v) for v in _x_vals], fontsize=8)
    _ax.set_ylabel("Proportion")
    _ax.set_xlabel(_col)
    _ax.bar_label(_ax.containers[0], fmt="%.3f", fontsize=9, padding=2)
plt.show()
Figure 9: Score distributions — finished matches
Note
Distribution shapes. homeScore and awayScore are right-skewed and concentrated at 0–3 goals, consistent with a Poisson-like process — motivating Poisson regression or count-based models for v2. sumScore inherits the same right skew. diffScore is the only target that takes negative values (away win margin), making it incompatible with Poisson regression without transformation; a Skellam distribution (difference of two Poissons) would be the natural parametric choice. These characteristics are out of scope for v1 and are documented here for the v2 feature iteration.
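The Skellam suggestion can be made concrete with `scipy.stats.skellam`. The rates below are illustrative, not fitted values — in a real v2 model, λ_home and λ_away would come from per-match Poisson regressions:

```python
from scipy.stats import skellam

# If homeScore ~ Poisson(lam_home) and awayScore ~ Poisson(lam_away) are
# independent, diffScore = homeScore - awayScore ~ Skellam(lam_home, lam_away),
# a distribution over all integers (negative values = away win margins).
lam_home, lam_away = 1.5, 1.1  # illustrative rates, not fitted values

p_draw = skellam.pmf(0, lam_home, lam_away)      # P(diffScore == 0)
p_home_win = skellam.sf(0, lam_home, lam_away)   # P(diffScore >= 1)
p_away_win = skellam.cdf(-1, lam_home, lam_away) # P(diffScore <= -1)
print(f"home {p_home_win:.3f}  draw {p_draw:.3f}  away {p_away_win:.3f}")
```

This also shows how a Skellam model of diffScore would yield 1X2 probabilities directly, linking the regression and classification targets.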
4. Data Quality Checks
Data quality is validated automatically by three DVC stages using Great Expectations:
| DVC stage | Input | Suite | Report |
|---|---|---|---|
| validate_raw | data/raw/match_raw.parquet | raw_match_suite | data/evaluation/ge_raw.json |
| validate_finished | data/interim/finished.parquet | finished_suite | data/evaluation/ge_finished.json |
| validate_future | data/interim/future.parquet | future_match_suite | data/evaluation/ge_future.json |
Each stage exits with code 1 on any expectation failure — the DVC pipeline will not proceed further. validate_future also acts as an anti-leakage gate: it asserts (via exact_match=True) that score and target columns are absent from the future split.
validate_raw — raw data
Checks: required columns present (exact_match=False); row count ≥ 1; id unique; id, homeTeamId, awayTeamId, startTimeUtc, status non-null; startTimeUtc in range 1998-01-01 → 2026-12-31.
validate_future — future split
Checks: exactly the 9 identity columns and nothing else (exact_match=True — anti-leakage gate verifying score and target columns are absent); row count ≥ 1; all columns non-null; id unique.
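Stripped of the Great Expectations machinery, the exact_match=True gate reduces to a set-equality check over column names. A dependency-light sketch — `EXPECTED_FUTURE_COLUMNS` mirrors the 9 identity columns documented above, and the helper name is hypothetical:

```python
import sys

import pandas as pd

EXPECTED_FUTURE_COLUMNS = {
    "startTimeUtc", "id", "sex", "regionId", "tournamentId",
    "seasonId", "stageId", "homeTeamId", "awayTeamId",
}

def check_future_columns(df: pd.DataFrame) -> bool:
    """exact_match=True semantics: the column set must EQUAL the expected
    set -- any extra column (e.g. a leaked homeScore) fails the gate."""
    return set(df.columns) == EXPECTED_FUTURE_COLUMNS

# In validate_future the equivalent check exits with code 1 on failure,
# which stops the DVC pipeline before training can see leaked columns.
df_future = pd.DataFrame(columns=sorted(EXPECTED_FUTURE_COLUMNS))
if not check_future_columns(df_future):
    sys.exit(1)
```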
The table below consolidates all phases covered in this report — from raw data loading to quality-gated artifact output — so the scope of each step and its verification gate are visible in one place.
| Phase | Scope | Artifact / Quality gate |
|---|---|---|
| Data loading | match_raw.parquet ingested from MinIO | df_match_raw |
| EDA | Status distribution, regions/tournaments, seasonality, gender, score distributions, MC coverage | Sections 1–2 of this report |
| Missing values | Null share visualised per column; mandatory key columns enforced | GE ExpectColumnValuesToNotBeNull (validate_raw) |
| Duplicates | id uniqueness enforced at ingestion and after preprocessing | GE ExpectColumnValuesToBeUnique (validate_raw, validate_finished, validate_future) |
| Column selection | UI/display and live-match columns dropped (≈40); see Implementation above | finished.parquet, future.parquet |
| Type downcasting | IDs → int16 / int32; scores → int8 | Memory-efficient parquet schema |
| Temporal sort & split | Sort ascending by UTC kickoff; partition status=6 (finished) vs status=1 after last finished (future) | finished.parquet, future.parquet |
| Outlier clipping | Scores clipped at score_outlier_pct quantile (~0.01% of matches; forfeits and data errors only); threshold visualised on score charts above | GE ExpectColumnValuesToBeBetween (validate_finished) |