SoccerPredictAI


EDA & Preprocessing

Football 1X2 — raw data exploration, preprocessing and target analysis

Author

Dima Ivanov

Published

May 11, 2026

Show code
import sys
from pathlib import Path

project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from IPython.display import display, Markdown

from src.app.config import settings

import yaml
with open(project_root / "params.yaml") as _f:
    PARAMS = yaml.safe_load(_f)
Show code
df_match_raw = pd.read_parquet(settings.data_raw_path / "match_raw.parquet")
df_finished = pd.read_parquet(settings.data_interim_path / "finished.parquet")
df_future = pd.read_parquet(settings.data_interim_path / "future.parquet")
Show code
import subprocess
import yaml as _yaml
_git_hash = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], cwd=project_root
).decode().strip()

# Read DVC MD5 from dvc.lock — content-addressable hash stable across
# cp / rsync / touch operations, unlike mtime.
_dvc_md5 = "—"
_dvc_lock_path = project_root / "dvc.lock"
if _dvc_lock_path.exists():
    _dvc_lock = _yaml.safe_load(_dvc_lock_path.read_text())
    # Take the first matching output; the original nested-loop `break`
    # only exited the inner loop and kept scanning remaining stages.
    _dvc_md5 = next(
        (
            _out.get("md5", "—")
            for _stage in _dvc_lock.get("stages", {}).values()
            for _out in _stage.get("outs", [])
            if _out.get("path") == "data/raw/match_raw.parquet"
        ),
        "—",
    )

_future_start = pd.to_datetime(df_future["startTimeUtc"]).min().date()
_future_end = pd.to_datetime(df_future["startTimeUtc"]).max().date()
display(Markdown(
    f"- **Git commit:** `{_git_hash}`  \n"
    f"- **match_raw.parquet DVC MD5:** `{_dvc_md5}`  \n"
    f"- **finished.parquet rows:** {len(df_finished):,}  \n"
    f"- **future.parquet rows:** {len(df_future):,} (`{_future_start}` → `{_future_end}`)"
))
del _future_start, _future_end, _dvc_md5, _dvc_lock_path
  • Git commit: a4f939a
  • match_raw.parquet DVC MD5: 60735b9fb3553c811941dd441c8aceb9
  • finished.parquet rows: 966,140
  • future.parquet rows: 3,186 (2026-04-16 → 2026-05-01)
Caution

Reproducibility note. The figures and statistics in this report are tied to a specific snapshot of the raw data. The Git commit above identifies the pipeline code; the DVC MD5 above is the content-addressable hash of match_raw.parquet as recorded in dvc.lock — it is stable across file copies and touch operations. To reproduce this report exactly, run dvc repro from the same commit before rendering.

1. Raw Data Overview

Show code
display(Markdown(
f"""* **Shape:** {df_match_raw.shape[0]:,} rows × {df_match_raw.shape[1]} columns
* **Unique matches:** {df_match_raw['id'].nunique():,}
* **Date range:** {df_match_raw['startTimeUtc'].min().date()} → {df_match_raw['startTimeUtc'].max().date()}
"""))
Table 1
  • Shape: 988,189 rows × 57 columns
  • Unique matches: 988,189
  • Date range: 1998-06-30 → 2026-05-01
Show code
_null_counts = df_match_raw.isnull().sum()
_dtype_table = pd.DataFrame({
    "dtype": df_match_raw.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(df_match_raw) * 100).round(2),
    "unique": df_match_raw.nunique(),
}).sort_values("null_%", ascending=False)

display(
    _dtype_table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _null_counts
del _dtype_table
Table 2: Column types, null counts and cardinality
  dtype null_count null_% unique
matchArgs object 988189 100.00% 0
matchHeader object 988189 100.00% 0
aggregateWinnerField float64 987346 99.91% 2
homePenaltyScore float64 980482 99.22% 18
awayPenaltyScore float64 980483 99.22% 19
awayExtratimeScore float64 978953 99.07% 12
homeExtratimeScore float64 978915 99.06% 11
extraResultField float64 975099 98.68% 2
winnerField float64 258588 26.17% 2
lastScorer float64 257016 26.01% 2
scoreChangedAt datetime64[ns] 253444 25.65% 716160
firstHalfEndedAtUtc datetime64[ns] 88797 8.99% 804531
secondHalfStartedAtUtc datetime64[ns] 37560 3.80% 721667
startedAtUtc datetime64[ns] 31961 3.23% 702055
tournamentSortOrder float64 26561 2.69% 58
awayScore float64 20640 2.09% 28
homeScore float64 20640 2.09% 23
regionCode object 880 0.09% 101
tournamentName object 728 0.07% 313
awayTeamCountryCode object 449 0.05% 213
homeTeamCountryCode object 417 0.04% 214
sex int64 0 0.00% 2
tournamentId int64 0 0.00% 468
stageName object 0 0.00% 2136
stageId int64 0 0.00% 12438
homeTeamCountryName object 0 0.00% 215
homeTeamName object 0 0.00% 10272
homeYellowCards int64 0 0.00% 11
homeRedCards int64 0 0.00% 7
homeTeamId int64 0 0.00% 10428
id int64 0 0.00% 988189
status int64 0 0.00% 8
startTime datetime64[ns] 0 0.00% 247929
isOpta bool 0 0.00% 2
regionName object 0 0.00% 102
navigationDisplayMode int64 0 0.00% 3
regionId int64 0 0.00% 102
seasonName object 0 0.00% 68
seasonId int64 0 0.00% 6121
stageSortOrder int64 0 0.00% 590
awayTeamId int64 0 0.00% 9240
isTopMatch bool 0 0.00% 2
elapsed object 0 0.00% 13
hasIncidentsSummary bool 0 0.00% 2
hasPreview bool 0 0.00% 2
awayYellowCards int64 0 0.00% 12
awayTeamName object 0 0.00% 9068
awayRedCards int64 0 0.00% 7
awayTeamCountryName object 0 0.00% 214
period int64 0 0.00% 7
startTimeUtc datetime64[ns] 0 0.00% 247934
isStreamAvailable bool 0 0.00% 1
matchIsOpta bool 0 0.00% 2
isLineupConfirmed bool 0 0.00% 2
commentCount int64 0 0.00% 265
bets int64 0 0.00% 2
incidents int64 0 0.00% 27

Column descriptions for the raw matches dataframe

Competition / geography

Column Type Description
tournamentId int Tournament/league identifier.
tournamentName string Tournament/league name.
tournamentSortOrder float Provider sort order for tournaments (UI/display).
stageId int Stage identifier (e.g., league, playoffs).
stageName string Stage name.
stageSortOrder int Provider sort order for stages (UI/display).
seasonId int Season identifier.
seasonName string Season label (e.g., "2002", "2023/2024").
regionId int Region identifier (country/region in provider taxonomy).
regionName string Region name (e.g., "USA", "Sweden").
regionCode string Region short code (e.g., "us", "se").
sex int Competition gender category (1 = male, 2 = female).

Match identity, status, and time

Column Type Description
id int Match identifier.
status* int Match status code.
startTime datetime Scheduled kickoff time (provider/local representation).
startTimeUtc datetime Scheduled kickoff time in UTC.
navigationDisplayMode int Provider UI/navigation display mode.
isOpta bool Indicates provider data sourced from Opta.
matchIsOpta bool Match-level Opta flag (often redundant with isOpta).

Home team

Column Type Description
homeTeamId int Home team identifier.
homeTeamName string Home team name.
homeTeamCountryCode string Home team country code.
homeTeamCountryName string Home team country name.
homeYellowCards* int Home team yellow cards.
homeRedCards* int Home team red cards.

Away team

Column Type Description
awayTeamId int Away team identifier.
awayTeamName string Away team name.
awayTeamCountryCode string Away team country code.
awayTeamCountryName string Away team country name.
awayYellowCards* int Away team yellow cards.
awayRedCards* int Away team red cards.

Scores (final / extra time / penalties)

Column Type Description
homeScore* float Home goals (regular time / final main score).
awayScore* float Away goals (regular time / final main score).
homeExtratimeScore* float Home goals in extra time.
awayExtratimeScore* float Away goals in extra time.
homePenaltyScore* float Home goals in penalty shootout.
awayPenaltyScore* float Away goals in penalty shootout.

Result / winner (provider fields)

Column Type Description
winnerField* float Provider “winner” code/flag.
aggregateWinnerField* float Winner over two legs (aggregate).
extraResultField* float Provider extra result code (AET/PEN/awarded/etc.).
period* int Provider period/state code (FT/AET/PEN/etc.).

Live / content availability & timeline

Column Type Description
hasIncidentsSummary* bool Whether an incidents/events summary is available.
hasPreview bool Whether a match preview is available.
scoreChangedAt* datetime Timestamp of the last score update.
elapsed* string Match clock or textual state.
lastScorer* float Provider ID of the last scorer.
isTopMatch bool Provider “top match” flag (promoted/high-interest).
commentCount* int Number of comments in provider UI.
isLineupConfirmed bool Whether starting lineups are confirmed.
isStreamAvailable bool Whether a stream is available in provider UI.
startedAtUtc* datetime Actual kickoff timestamp in UTC (when match really started).
firstHalfEndedAtUtc* datetime End of 1st half timestamp in UTC.
secondHalfStartedAtUtc* datetime Start of 2nd half timestamp in UTC.

Misc / nested blocks (often provider-specific)

Column Type Description
incidents* int Incidents/events indicator or count.
bets int Betting/offers indicator or count.
matchArgs object Placeholder for nested match arguments payload.
matchHeader object Placeholder for nested match header payload.
  • Columns marked with * are available only during or after match completion.

Match status codes

Code Meaning Notes
0 Unknown (Not played) Unclear status from source.
1 Upcoming Scheduled upcoming match.
2 Postponed Postponed (per source).
3 In progress Started but not finished yet.
4 Unknown (Not played) Unclear status from source.
5 Unknown Unclear status from source.
6 Finished Match completed (final result available).
7 Cancelled Match cancelled.
Show code
_STATUS_LABELS = {
    0: "Unknown (Not played)",
    1: "Upcoming",
    2: "Postponed",
    3: "In progress",
    4: "Unknown (Not played)",
    5: "Unknown",
    6: "Finished",
    7: "Cancelled",
    61: "Custom: finished-like (no MC)",
    62: "Custom: finished-like (error)",
    63: "Custom: finished-like (empty MC)",
}
_status_counts = df_match_raw["status"].value_counts().sort_index().rename("count")
_status_df = _status_counts.to_frame()
_status_df.index.name = "code"
_status_df["meaning"] = _status_df.index.map(_STATUS_LABELS).fillna("—")
_status_df["share_%"] = (_status_df["count"] / _status_df["count"].sum() * 100).round(2)

display(
    _status_df[["meaning", "count", "share_%"]]
    .style
    .format({"share_%": "{:.2f}%"})
    .set_caption("Match status distribution")
)

del _status_counts
del _status_df
Table 3: Match status distribution
  meaning count share_%
code      
0 Unknown (Not played) 418 0.04%
1 Upcoming 3611 0.37%
2 Postponed 204 0.02%
3 In progress 1123 0.11%
4 Unknown (Not played) 2 0.00%
5 Unknown 284 0.03%
6 Finished 966140 97.77%
7 Cancelled 16407 1.66%
Note

status=6 (Finished) rows form the training dataset; status=1 (Upcoming) rows after the last finished match form the inference-ready future split. Together these two groups account for the vast majority of the dataset. All other statuses — Postponed, Cancelled, In progress, Unknown — are discarded at the preprocessing stage and do not enter the model pipeline.
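The split rule described here can be sketched on a toy frame (illustrative data only; the production logic lives in preprocess_and_split() in src/data/preprocess.py):

```python
import pandas as pd

# Toy frame mimicking the raw schema: status 6 = Finished, 1 = Upcoming,
# 7 = Cancelled (discarded).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "status": [6, 6, 1, 7, 1],
    "startTimeUtc": pd.to_datetime([
        "2026-04-10", "2026-04-15", "2026-04-20", "2026-04-21", "2026-04-25",
    ]),
})

# Finished split: training data.
finished = df[df["status"] == 6]
last_finished = finished["startTimeUtc"].max()

# Future split: upcoming matches strictly after the last finished kickoff.
future = df[(df["status"] == 1) & (df["startTimeUtc"] > last_finished)]

# All other statuses (postponed, cancelled, in progress, unknown) are dropped.
```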

Note

Live match examples are available at http://time2bet.ru/.

Use Status → Choose options to filter by status code, then click any match to inspect its details.

Regions and tournaments

Show code
_fig, _axes = plt.subplots(2, 1, figsize=(9, 10))

_col_name_map = {"tournamentId": "tournamentName", "regionId": "regionName"}

for _ax, _col, _title in zip(
    _axes,
    ["regionId", "tournamentId"],
    ["Matches by region (top 20)", "Matches by tournament (top 20)"],
):
    _name_col = _col_name_map[_col]

    _vc = (
        df_match_raw.groupby([_col, _name_col])
        .size()
        .reset_index(name="count")
        .sort_values("count", ascending=False)
        .head(20)
    )
    _labels = _vc[_name_col].astype(str)

    _ax.barh(_labels, _vc["count"], color="steelblue")
    _ax.invert_yaxis()
    _ax.set_xlabel("# matches")
    _ax.set_title(_title)
    for _bar, _val in zip(_ax.patches, _vc["count"]):
        _ax.text(_bar.get_width() * 1.01, _bar.get_y() + _bar.get_height() / 2,
                 f"{_val:,}", va="center", fontsize=7)

plt.tight_layout()
plt.show()
Figure 1: Top-20 distributions: regionId and tournamentId

Matches per month and seasonality

Show code
_df_t = df_finished.copy()
_df_t["_dt"] = pd.to_datetime(_df_t["startTimeUtc"])
_df_t["_year"] = _df_t["_dt"].dt.year
_df_t["_month"] = _df_t["_dt"].dt.month

_heat = _df_t.groupby(["_year", "_month"]).size().unstack(fill_value=0)
_heat.columns = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"][:len(_heat.columns)]

_fig, _ax = plt.subplots(figsize=(9, max(4, len(_heat) * 0.35)))
sns.heatmap(_heat, annot=True, fmt="d", cmap="YlOrRd", linewidths=0.3, ax=_ax)
_ax.set_title("Matches per month — seasonality heatmap")
_ax.set_xlabel("Month")
_ax.set_ylabel("Year")
plt.tight_layout()
plt.show()

del _df_t
Figure 2: Monthly seasonality heatmap (year × month → match count)
Note

Seasonal gap. June–July show a consistent drop in match count across all years, reflecting the European summer break. For rolling-window features (e.g. 5-match form) computed on historical data, values for matches immediately after the break are computed over a stale window spanning the previous season. This is expected and acceptable: the feature engineering stage does not reset rolling accumulators at season boundaries, so the last pre-break games remain the most recent context available at inference time.
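A minimal sketch of this stale-window behaviour, using a hypothetical single-team goal series (the real features are built in the feature engineering stage, not here):

```python
import pandas as pd

# One team's goals across a season boundary; no reset at the summer break.
goals = pd.Series(
    [2, 1, 3, 0, 2, 1],
    index=pd.to_datetime([
        "2025-05-01", "2025-05-10", "2025-05-20",   # last pre-break games
        "2025-08-15", "2025-08-22", "2025-08-29",   # new season
    ]),
)

# 3-match rolling mean, shifted so each value uses only prior matches.
form = goals.shift(1).rolling(3, min_periods=1).mean()

# The first post-break match (2025-08-15) is scored from the pre-break window
# [2, 1, 3] -> 2.0: stale, but still the most recent context available.
```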

Matches by gender

Show code
_df_gender = df_finished.copy()
_df_gender["_year"] = pd.to_datetime(_df_gender["startTimeUtc"]).dt.year

_by_year_sex = (
    _df_gender.groupby(["_year", "sex"]).size().reset_index(name="count")
    if "sex" in _df_gender.columns else
    _df_gender.groupby("_year").size().reset_index(name="count")
)

_fig, _ax = plt.subplots(figsize=(9, 5))
if "sex" in _by_year_sex.columns:
    _sex_map = {1: "Male", 2: "Female"}
    for _s, _grp in _by_year_sex.groupby("sex"):
        _ax.bar(_grp["_year"].astype(str), _grp["count"], label=_sex_map.get(_s, str(_s)), alpha=0.85)
    _ax.legend(title="Gender")
else:
    _ax.bar(_by_year_sex["_year"].astype(str), _by_year_sex["count"], color="steelblue")

_ax.set_xlabel("Year")
_ax.set_ylabel("# matches")
_ax.set_title("Finished matches per year")
_ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()

del _df_gender
Figure 3: Matches per year by competition gender

Score

Show code
_df_scores = df_match_raw[["homeScore", "awayScore"]].dropna()

_fig, _axes = plt.subplots(2, 1, figsize=(9, 10), sharey=False)
_fig.suptitle("Goal count distributions — raw finished matches", fontsize=13)

for _ax, (_col, _label, _color) in zip(
    _axes,
    [("homeScore", "Home goals", "#2196F3"), ("awayScore", "Away goals", "#4CAF50")],
):
    _counts = _df_scores[_col].value_counts().sort_index()
    _ax.bar(_counts.index.astype(int).astype(str), _counts.values, color=_color, width=0.85)
    _ax.set_xlabel("Goals scored")
    _ax.set_ylabel("# matches")
    _ax.set_title(_label)
    for _bar, _val in zip(_ax.patches, _counts.values):
        if _val > 0:
            _ax.text(
                _bar.get_x() + _bar.get_width() / 2,
                _bar.get_height() + _counts.max() * 0.01,
                f"{_val:,}", ha="center", va="bottom", fontsize=7,
            )
    # Mark the outlier clip threshold (computed on the finished split, consistent with preprocess.py)
    _upper = int(df_finished[_col].quantile(PARAMS["preprocessing"]["score_outlier_pct"]))
    _x_labels = [str(int(x)) for x in sorted(_counts.index.astype(int).unique())]
    if str(_upper) in _x_labels:
        _ax.axvline(x=_x_labels.index(str(_upper)), color="crimson",
                    linestyle="--", linewidth=1.5, label=f"clip threshold = {_upper}")
        _ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

del _df_scores
Figure 4: Goal count distributions — raw finished matches

Matches with in-match statistics

A subset of finished matches has rich live match center (MC) data collected during the game: per-minute team stats (shots, possession, passes, aerials, tackles, dribbles), referees, lineups, venue, and score timeline. This data is stored across dedicated tables — matches_live_header, matches_live_info, matches_live_{home,away}_stats, etc.

Show code
_df_mc = df_match_raw[
    (df_match_raw["status"] == 6)
    & (df_match_raw["period"] != 0)
    & (df_match_raw["hasPreview"])
]
_total = len(df_match_raw[df_match_raw["status"] == 6])
display(Markdown(
    f"**{len(_df_mc):,}** of **{_total:,}** finished matches "
    f"({len(_df_mc)/_total*100:.1f}%) have in-match statistics available."
))

del _df_mc
Table 4

65,391 of 966,140 finished matches (6.8%) have in-match statistics available.

Show code
_df_finished_raw = df_match_raw[df_match_raw["status"] == 6].copy()
_df_finished_raw["_year"] = pd.to_datetime(_df_finished_raw["startTimeUtc"]).dt.year
_df_finished_raw["_has_mc"] = (
    (_df_finished_raw["period"] != 0) & (_df_finished_raw["hasPreview"])
)
_mc_map = {True: "With MC stats", False: "Without MC stats"}
_by_year_mc = (
    _df_finished_raw.groupby(["_year", "_has_mc"])
    .size()
    .reset_index(name="count")
)

_fig, _ax = plt.subplots(figsize=(9, 5))
for _has_mc, _grp in _by_year_mc.groupby("_has_mc"):
    _ax.bar(
        _grp["_year"].astype(str), _grp["count"],
        label=_mc_map[_has_mc], alpha=0.85,
        color="#4CAF50" if _has_mc else "#90CAF9",
    )
_ax.set_xlabel("Year")
_ax.set_ylabel("# matches")
_ax.set_title("Finished matches per year — with vs without in-match statistics")
_ax.tick_params(axis="x", rotation=45)
_ax.legend(title="MC stats")
plt.tight_layout()
plt.show()

del _df_finished_raw
Figure 5: Finished matches per year — with vs without in-match statistics
Note

MC data availability. In-match statistics are available only from a certain year onward, reflecting the point at which live data collection was integrated into the pipeline. Matches before that cutoff have hasPreview = False or period = 0 and are treated as without MC. This structural gap is visible in the chart above as a step-change in the “With MC stats” series. Any future feature set built on MC data will be constrained to the post-cutoff subset, reducing training set size relative to the full finished split.

Note

To explore matches with in-match statistics on http://time2bet.ru/:

  1. Enable the “Only with MC stats” checkbox in the filters panel.
  2. Click any match link — this opens the match detail page.
  3. Select the “Match Centre” tab to view live stats: timeline, possession, shots, passes, and more.
Important

Out of scope for v1. In-match statistics are not used as features in the current pipeline. All models rely solely on pre-match information to avoid data leakage. Live stats are a natural candidate for future feature iterations (e.g. half-time retraining, live odds adjustment).

2. Data Preparation Pipeline

This section documents the full data preparation phase: from raw ingestion and exploratory profiling through preprocessing and artifact output. The implementation lives in src/data/preprocess.py and is orchestrated as a DVC stage (preprocess).

Implementation — preprocess_and_split()

The following steps are applied inside preprocess_and_split() in the order they execute:

  1. Drop irrelevant columns — removes display/UI fields (stageName, regionName, isOpta, sort-order keys, etc.) and all live/post-match columns (elapsed, scoreChangedAt, period, incidents, etc.) — 46 of the 57 raw columns in total.
  2. Downcast ID types — casts identifier columns to compact integer types (tournamentId, regionId, stageId, seasonId → int16; id, homeTeamId, awayTeamId → int32; sex → int8) to reduce memory footprint.
  3. Parse and sort by time — converts startTimeUtc to UTC-aware datetime, then sorts all rows ascending by kickoff time.
  4. Split by status — selects status=6 rows as finished and status=1 rows whose kickoff is strictly after the last finished match as future. All other statuses (postponed, cancelled, in-progress, unknown) are discarded. The status column is dropped after partitioning.
  5. Downcast scores — casts homeScore and awayScore to int8 on the finished split after null removal.
  6. Compute classification target — derives outcome_1x2 (0 = home win, 1 = draw, 2 = away win) from raw scores before clipping, so the label is unaffected by the clip operation.
  7. Clip outlier scores — for each of homeScore / awayScore independently, computes the params.yaml → preprocessing.score_outlier_pct quantile (default 0.9999, i.e. the 99.99th percentile) on the finished split and clips values above that threshold. Percentile is preferred over IQR (whose fence falls inside the normal-score range for Poisson-concentrated data) and Z-score (which assumes normality). At 0.9999 the threshold is ~10–12 goals, clipping only technical results/data errors (~0.01% of matches) while keeping all legitimate high-scoring games. Matches are kept (not dropped) so team history remains continuous for rolling stats and ELO.
  8. Compute regression targets — derives sumScore = homeScore + awayScore and diffScore = homeScore − awayScore (after clipping, so derived targets are consistent with the clipped scores).
  9. Drop intermediate columns — removes the temporary binary flags used during target derivation (homeWin, awayWin, draw) and disciplinary columns (homeYellowCards, awayYellowCards, homeRedCards, awayRedCards) that are excluded from the v1 feature set.
  10. Drop score columns from future — removes homeScore, awayScore, and all extra-time / penalty score fields from the future split to prevent any target leakage.
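Steps 6–8 can be sketched in isolation. This is a simplified toy version, not the implementation in preprocess_and_split(): the toy quantile is 0.75 purely so the four-row example actually clips something, whereas production uses preprocessing.score_outlier_pct = 0.9999.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"homeScore": [3, 0, 1, 15], "awayScore": [1, 0, 2, 0]})

# Step 6 — classification target from raw scores, BEFORE clipping:
# 0 = home win, 1 = draw, 2 = away win.
df["outcome_1x2"] = np.select(
    [df["homeScore"] > df["awayScore"], df["homeScore"] == df["awayScore"]],
    [0, 1],
    default=2,
)

# Step 7 — clip each score column independently at a high quantile.
for col in ("homeScore", "awayScore"):
    upper = df[col].quantile(0.75)  # production: 0.9999
    df[col] = df[col].clip(upper=upper)

# Step 8 — regression targets AFTER clipping, so they stay consistent
# with the clipped scores.
df["sumScore"] = df["homeScore"] + df["awayScore"]
df["diffScore"] = df["homeScore"] - df["awayScore"]
```

Note the ordering: the 1X2 label is derived first, so a clipped 15–0 result still counts as a home win even though its score is capped.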
Note

All steps use only pre-match information or post-match results that are applied to the finished split only. The future split never receives target or score columns.

Raw data — columns selected for preprocessing

The table below shows only the columns retained after dropping display/UI and live-match fields. homeScore and awayScore are null for future matches and will be separated in the next step.

Show code
_columns_for_preprocessing = [
    "startTimeUtc",
    "id",
    "sex",
    "regionId",
    "tournamentId",
    "seasonId",
    "stageId",
    "homeTeamId",
    "awayTeamId",
    "homeScore",
    "awayScore",
]
_df = df_match_raw[_columns_for_preprocessing]
_null_counts = _df.isnull().sum()
_table = pd.DataFrame({
    "dtype": _df.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(_df) * 100).round(2),
    "unique": _df.nunique(),
    "min": _df.min(),
    "max": _df.max(),
}).loc[_columns_for_preprocessing]

display(
    _table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _df
del _null_counts
del _table
Table 5: Column types, null counts and cardinality
  dtype null_count null_% unique min max
startTimeUtc datetime64[ns] 0 0.00% 247934 1998-06-30 19:00:00 2026-05-01 21:30:00
id int64 0 0.00% 988189 1 1978419
sex int64 0 0.00% 2 1 2
regionId int64 0 0.00% 102 3 265
tournamentId int64 0 0.00% 468 1 783
seasonId int64 0 0.00% 6121 1 11075
stageId int64 0 0.00% 12438 1 25346
homeTeamId int64 0 0.00% 10428 1 32635
awayTeamId int64 0 0.00% 9240 1 32633
homeScore float64 20640 2.09% 23 0.000000 101.000000
awayScore float64 20640 2.09% 28 0.000000 95.000000

Finished split — schema after preprocessing

finished.parquet contains only status=6 matches. Derived targets (sumScore, diffScore, outcome_1x2) are appended.

Show code
_columns = [
    "startTimeUtc",
    "id",
    "sex",
    "regionId",
    "tournamentId",
    "seasonId",
    "stageId",
    "homeTeamId",
    "awayTeamId",
    "homeScore",
    "awayScore",
    "sumScore",
    "diffScore",
    "outcome_1x2",
]
_df = df_finished
_null_counts = _df.isnull().sum()
_table = pd.DataFrame({
    "dtype": _df.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(_df) * 100).round(2),
    "unique": _df.nunique(),
    "min": _df.min(),
    "max": _df.max(),
}).loc[_columns]

display(
    _table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _df
del _null_counts
del _table
Table 6: Column types, null counts and cardinality
  dtype null_count null_% unique min max
startTimeUtc datetime64[ns, UTC] 0 0.00% 245557 1998-06-30 19:00:00+00:00 2026-04-16 02:00:00+00:00
id int32 0 0.00% 966140 1 1978288
sex int8 0 0.00% 2 1 2
regionId int16 0 0.00% 102 3 265
tournamentId int16 0 0.00% 461 1 783
seasonId int16 0 0.00% 6091 1 11071
stageId int16 0 0.00% 12379 1 25334
homeTeamId int32 0 0.00% 10335 1 32621
awayTeamId int32 0 0.00% 9146 1 32621
homeScore int8 0 0.00% 12 0 11
awayScore int8 0 0.00% 14 0 13
sumScore int8 0 0.00% 20 0 24
diffScore int8 0 0.00% 25 -13 11
outcome_1x2 int8 0 0.00% 3 0 2

Future split — schema after preprocessing

future.parquet contains only status=1 matches with kickoff strictly after the last finished match. Score and target columns are absent by design.

Show code
_columns = [
    "startTimeUtc",
    "id",
    "sex",
    "regionId",
    "tournamentId",
    "seasonId",
    "stageId",
    "homeTeamId",
    "awayTeamId",
]
_df = df_future
_null_counts = _df.isnull().sum()
_table = pd.DataFrame({
    "dtype": _df.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(_df) * 100).round(2),
    "unique": _df.nunique(),
    "min": _df.min(),
    "max": _df.max(),
}).loc[_columns]

display(
    _table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _df
del _null_counts
del _table
Table 7: Column types, null counts and cardinality
  dtype null_count null_% unique min max
startTimeUtc datetime64[ns, UTC] 0 0.00% 574 2026-04-16 12:00:00+00:00 2026-05-01 21:30:00+00:00
id int32 0 0.00% 3186 1901243 1978419
sex int8 0 0.00% 2 1 2
regionId int16 0 0.00% 90 3 265
tournamentId int16 0 0.00% 195 1 783
seasonId int16 0 0.00% 195 10720 11075
stageId int16 0 0.00% 261 24478 25346
homeTeamId int32 0 0.00% 2515 1 32635
awayTeamId int32 0 0.00% 2510 1 32633
Warning

Cold-start risk. Teams that appear in the future split but have no history in the finished split will receive zero rolling statistics and the initial ELO rating (params.yaml → features.elo.initial_rating). This is expected behaviour and is handled at the feature engineering stage with explicit fallback defaults.
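A minimal sketch of the fallback pattern, with hypothetical in-memory stores and a stand-in constant for params.yaml → features.elo.initial_rating (the actual defaults live in the feature engineering stage):

```python
# Hypothetical per-team stores built from the finished split.
elo_ratings = {101: 1650.0, 202: 1480.0}   # teamId -> current ELO
rolling_form = {101: 1.8}                   # teamId -> rolling goal mean

INITIAL_RATING = 1500.0  # stand-in for features.elo.initial_rating

def team_features(team_id: int) -> dict:
    """Look up team features with explicit cold-start defaults."""
    return {
        # Unseen teams get the initial rating rather than a missing value.
        "elo": elo_ratings.get(team_id, INITIAL_RATING),
        # Zero rolling stats for teams with no finished-split history.
        "form": rolling_form.get(team_id, 0.0),
    }
```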

Show code
_known_home = set(df_finished["homeTeamId"]) | set(df_finished["awayTeamId"])
_future_home = set(df_future["homeTeamId"])
_future_away = set(df_future["awayTeamId"])
_cold_home = _future_home - _known_home
_cold_away = _future_away - _known_home
_cold_any = (_future_home | _future_away) - _known_home
display(Markdown(
    f"**Cold-start teams (no history in finished split):** "
    f"{len(_cold_home)} appearing as home, "
    f"{len(_cold_away)} appearing as away "
    f"({len(_cold_any)} unique teams total)"
))
del _known_home, _future_home, _future_away, _cold_home, _cold_away, _cold_any
Table 8

Cold-start teams (no history in finished split): 9 appearing as home, 2 appearing as away (11 unique teams total)

3. Dataset definition and targets

Matches — Finished vs Future (column availability + target/label roles)

Column Finished Future Role
startTimeUtc ✓ ✓ Feature (time)
id ✓ ✓ Feature (match identity)
sex ✓ ✓ Feature (competition)
regionId ✓ ✓ Feature (geography)
tournamentId ✓ ✓ Feature (competition)
seasonId ✓ ✓ Feature (season)
stageId ✓ ✓ Feature (competition)
homeTeamId ✓ ✓ Feature (entity)
awayTeamId ✓ ✓ Feature (entity)
outcome_1x2 ✓ — Classification target (3-class label)
homeScore ✓ — Regression target
awayScore ✓ — Regression target
sumScore ✓ — Regression target (derived)
diffScore ✓ — Regression target (derived)
Important

v1 scope: This report focuses on the classification target (outcome_1x2: Home win / Draw / Away win). Regression targets (homeScore, awayScore, sumScore, diffScore) are out of scope for v1 and will be analysed in a future iteration.

Classification target

Show code
_outcome_labels = {0: "Home win", 1: "Draw", 2: "Away win"}
_props = df_finished["outcome_1x2"].value_counts(normalize=True).sort_index()
_props.index = [_outcome_labels.get(i, str(i)) for i in _props.index]

_fig, _ax = plt.subplots(figsize=(9, 5))
_ax.bar(_props.index, _props.values, color=["#2196F3", "#FF9800", "#4CAF50"])
_ax.set_ylabel("Proportion")
_ax.set_title("Overall class proportions")
for _bar, _val in zip(_ax.patches, _props.values):
    _ax.text(_bar.get_x() + _bar.get_width() / 2, _bar.get_height() + 0.002,
             f"{_val:.3f}", ha="center", fontsize=10)
plt.tight_layout()
plt.show()

display(
    _props.rename("proportion").to_frame()
    .assign(**{"count": df_finished["outcome_1x2"].value_counts().sort_index().values})
    .style.format({"proportion": "{:.4f}"})
    .set_caption("Classification target counts")
)
outcome_1x2 class distribution
  proportion count
Home win 0.4482 432984
Draw 0.2528 244265
Away win 0.2990 288891
Figure 6: Classification target counts
Note

Class imbalance. Home win is consistently the most frequent outcome; draws are the minority class. The imbalance is moderate (roughly 1.7–2× between majority and minority) and does not require resampling — tree-based models handle it natively. Probability calibration is applied at the final training stage to correct systematic bias in predicted probabilities (see params.yaml → final_train.calibration).
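As an illustration of what the calibration step looks like, here is scikit-learn's CalibratedClassifierCV on synthetic 3-class data with roughly the same class weights. This is a generic sketch — the project's actual calibration method and stage are whatever params.yaml → final_train.calibration configures, not necessarily this exact setup:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 3-class 1X2 problem (~45/25/30 class mix).
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=6, n_classes=3,
    weights=[0.45, 0.25, 0.30], random_state=0,
)

# Cross-validated isotonic calibration on top of a tree-based model:
# raw scores are remapped so predicted probabilities match observed rates.
model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="isotonic", cv=3,
).fit(X, y)

proba = model.predict_proba(X[:5])  # calibrated probability vectors
```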

Temporal stability of outcome_1x2

A drift check: if the home-win rate shifts substantially over time, the temporal split strategy must account for it and calibration must be re-validated on recent data.

Show code
_OUTCOME_LABELS = {0: "Home win", 1: "Draw", 2: "Away win"}
_OUTCOME_COLORS = {0: "#2196F3", 1: "#FF9800", 2: "#4CAF50"}

_df_t = df_finished.copy()
_df_t["_year"] = pd.to_datetime(_df_t["startTimeUtc"]).dt.year

_year_totals = _df_t.groupby("_year").size()
_rates_yr = (
    _df_t.groupby(["_year", "outcome_1x2"])
    .size()
    .unstack(fill_value=0)
    .div(_year_totals, axis=0)
)

_fig, _ax = plt.subplots(figsize=(9, 5))
for _outcome, _label in _OUTCOME_LABELS.items():
    if _outcome in _rates_yr.columns:
        _ax.plot(
            _rates_yr.index.astype(str), _rates_yr[_outcome],
            marker="o", label=_label, color=_OUTCOME_COLORS[_outcome], linewidth=2,
        )

_ax.set_xlabel("Year")
_ax.set_ylabel("Proportion")
_ax.set_title("outcome_1x2 proportions per year — temporal stability")
_ax.legend(title="Outcome")
_ax.tick_params(axis="x", rotation=45)
_ax.set_ylim(0, 0.7)
plt.tight_layout()
plt.show()

del _df_t, _year_totals, _rates_yr
Figure 7: outcome_1x2 proportions per year — temporal stability
Note

Distribution is stable. Home-win, draw, and away-win rates remain within a narrow band across all years, with no sustained structural shift. This confirms that a single-split temporal train/test strategy is appropriate — the label distribution the model trains on is representative of the distribution it will be evaluated and deployed against. Any year-to-year variation visible in the chart is within expected sampling noise for the per-year match counts. Calibration should nonetheless be validated on the most recent data slice, as even small shifts compound in probability output.

Population heterogeneity — by tournament

The global class proportions above are an aggregate that masks substantial variation across competitions. This plot examines whether outcome distributions are homogeneous across the most-played tournaments, directly motivating params.yaml → classification.groupby_cols: ["regionId", "sex"] as stratification axes for prior estimation and probability calibration.

Show code
_TOP_N = 20
_OUTCOME_LABELS = {0: "Home win", 1: "Draw", 2: "Away win"}
_OUTCOME_COLORS = ["#2196F3", "#FF9800", "#4CAF50"]

# Recover tournament names from raw (dropped during preprocessing)
_meta = df_match_raw[["id", "tournamentName"]].drop_duplicates("id").set_index("id")
# on="id" matches the finished split's "id" column against _meta's index;
# a bare join would incorrectly align on the positional index.
_df_h = df_finished.join(_meta, on="id", how="left")

_top_t = _df_h["tournamentName"].value_counts().head(_TOP_N).index
_df_top = _df_h[_df_h["tournamentName"].isin(_top_t)].copy()

_rates_t = (
    _df_top.groupby(["tournamentName", "outcome_1x2"])
    .size()
    .unstack(fill_value=0)
    .div(_df_top.groupby("tournamentName").size(), axis=0)
    .sort_values(0, ascending=False)   # sort by home-win rate descending
)
_rates_t.columns = [_OUTCOME_LABELS[c] for c in _rates_t.columns]

_fig, _ax = plt.subplots(figsize=(9, max(5, len(_rates_t) * 0.42)))
_rates_t.plot(kind="barh", stacked=True, color=_OUTCOME_COLORS, ax=_ax, width=0.75)
_ax.invert_yaxis()
_ax.set_xlabel("Proportion")
_ax.set_title(f"outcome_1x2 distribution — top {_TOP_N} tournaments")
_ax.legend(title="Outcome", bbox_to_anchor=(1.01, 1), loc="upper left")
for _bar_container in _ax.containers:
    _ax.bar_label(_bar_container, fmt="%.2f", label_type="center", fontsize=7, color="white")
plt.tight_layout()
plt.show()

del _meta, _df_h, _df_top, _rates_t
Figure 8: outcome_1x2 proportions — top-20 tournaments by match count
Note

Observed heterogeneity. Home-win rates vary across tournaments (typically 38–52%), confirming that a single global prior is insufficient. regionId and sex are used as stratification axes because tournament-level granularity would introduce sparse strata for lower-division leagues. This is the direct justification for groupby_cols: ["regionId", "sex"] in calibration.
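The stratified-prior idea with a fallback for sparse strata can be sketched as follows. This is a minimal illustration, assuming the (regionId, sex) grouping from params.yaml; the toy frame and the min_n threshold are hypothetical, not pipeline values:

```python
# Hedged sketch: per-(regionId, sex) outcome priors with a global fallback
# for strata below a minimum sample size. Data and min_n are illustrative.
import pandas as pd

df = pd.DataFrame({
    "regionId": [1, 1, 1, 1, 2, 2, 2, 3],
    "sex":      ["M"] * 8,
    "outcome_1x2": [0, 0, 1, 2, 0, 2, 2, 1],
})

min_n = 3  # strata smaller than this fall back to the global prior
global_prior = df["outcome_1x2"].value_counts(normalize=True).sort_index()

def stratum_prior(group: pd.DataFrame) -> pd.Series:
    """Empirical outcome proportions, or the global prior if the stratum is sparse."""
    if len(group) < min_n:
        return global_prior
    return (
        group["outcome_1x2"]
        .value_counts(normalize=True)
        .reindex(global_prior.index, fill_value=0.0)
    )

priors = df.groupby(["regionId", "sex"])[["outcome_1x2"]].apply(stratum_prior)
print(priors)
```

Each row of `priors` sums to 1; the (3, "M") stratum has a single match and therefore inherits the global prior, which is exactly the sparse-strata problem that tournament-level grouping would make pervasive.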

Regression targets

Four regression targets are derived from the final, clipped scores. homeScore and awayScore are the primary outputs; sumScore (total goals) and diffScore (home minus away) are derived aggregates that encode different aspects of the result and may be predicted independently or used as auxiliary signals.

Show code
_cols_and_colors = [("homeScore", "#2196F3"), ("awayScore", "#4CAF50"),
                    ("sumScore", "#FF9800"), ("diffScore", "#9C27B0")]
_fig, _axes = plt.subplots(4, 1, figsize=(9, 16), constrained_layout=True)
_fig.suptitle("Regression target distributions", fontsize=14)

for _ax, (_col, _color) in zip(_axes, _cols_and_colors):
    if _col not in df_finished.columns:
        _ax.set_visible(False)
        continue
    _props_r = df_finished[_col].value_counts(normalize=True).sort_index()
    # Use numeric x-axis to preserve correct order for columns with negative values (diffScore)
    _x_vals = _props_r.index.tolist()
    _ax.bar(range(len(_x_vals)), _props_r.values, color=_color, width=0.9)
    _ax.set_xticks(range(len(_x_vals)))
    _ax.set_xticklabels([str(v) for v in _x_vals], fontsize=8)
    _ax.set_ylabel("Proportion")
    _ax.set_xlabel(_col)
    _ax.bar_label(_ax.containers[0], fmt="%.3f", fontsize=9, padding=2)
Figure 9: Score distributions — finished matches
Note

Distribution shapes. homeScore and awayScore are right-skewed and concentrated at 0–3 goals, consistent with a Poisson-like process — motivating Poisson regression or count-based models for v2. sumScore inherits the same right skew. diffScore is the only target that takes negative values (away win margin), making it incompatible with Poisson regression without transformation; a Skellam distribution (difference of two Poissons) would be the natural parametric choice. These characteristics are out of scope for v1 and are documented here for the v2 feature iteration.
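The Skellam observation can be illustrated directly. A hedged sketch, assuming home and away goals are independent Poisson variables; the mean-goal parameters below are illustrative league-typical values, not fitted estimates:

```python
# Hedged sketch: diffScore under a Skellam model. If home and away goals
# are Poisson(mu_home) and Poisson(mu_away), their difference is Skellam-
# distributed. The mu values are illustrative, not fitted to the data.
from scipy.stats import skellam

mu_home, mu_away = 1.5, 1.1   # assumed mean goals per side

# P(diffScore = k) for a few margins
for k in (-2, -1, 0, 1, 2):
    print(k, round(skellam.pmf(k, mu_home, mu_away), 3))

# Implied 1X2 probabilities from the same model
p_draw = skellam.pmf(0, mu_home, mu_away)
p_home = 1 - skellam.cdf(0, mu_home, mu_away)   # P(diff > 0)
p_away = skellam.cdf(-1, mu_home, mu_away)      # P(diff < 0)
print(round(p_home + p_draw + p_away, 6))
```

The same parametric form also yields 1X2 probabilities for free, which is one reason a Skellam or bivariate-Poisson head is an attractive v2 candidate.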

4. Data Quality Checks

Data quality is validated automatically by three DVC stages using Great Expectations:

DVC stage Input Suite Report
validate_raw data/raw/match_raw.parquet raw_match_suite data/evaluation/ge_raw.json
validate_finished data/interim/finished.parquet finished_suite data/evaluation/ge_finished.json
validate_future data/interim/future.parquet future_match_suite data/evaluation/ge_future.json

Each stage exits with code 1 on any expectation failure — the DVC pipeline will not proceed further. validate_future also acts as an anti-leakage gate: it asserts (via exact_match=True) that score and target columns are absent from the future split.
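The gate logic can be sketched without the Great Expectations machinery. The following is a minimal pandas stand-in illustrating what the exact-column-set (anti-leakage) check enforces; it mirrors the column names from this report but is not the actual suite:

```python
# Hedged sketch: a pandas stand-in for the validate_future gate logic.
# Illustrative only; the real check is a Great Expectations suite.
import pandas as pd

EXPECTED_FUTURE_COLS = {
    "id", "startTimeUtc", "sex", "regionId", "tournamentId",
    "seasonId", "stageId", "homeTeamId", "awayTeamId",
}
LEAKY_COLS = {"homeScore", "awayScore", "sumScore", "diffScore", "outcome_1x2"}

def validate_future_frame(df: pd.DataFrame) -> list:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if set(df.columns) != EXPECTED_FUTURE_COLS:       # exact_match=True analogue
        failures.append(f"column set mismatch: {set(df.columns) ^ EXPECTED_FUTURE_COLS}")
    if LEAKY_COLS & set(df.columns):                  # explicit anti-leakage check
        failures.append(f"leaked target columns: {LEAKY_COLS & set(df.columns)}")
    if len(df) < 1:
        failures.append("empty frame")
    if "id" in df.columns and df["id"].duplicated().any():
        failures.append("duplicate ids")
    return failures

# In the pipeline, a non-empty failure list would translate to sys.exit(1),
# halting the downstream DVC stages.
```

Note that the exact-column-set check alone already catches leaked targets; the explicit `LEAKY_COLS` check is shown to make the anti-leakage intent readable in the failure message.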

validate_raw — raw data

Checks: required columns present (exact_match=False); row count ≥ 1; id unique; id, homeTeamId, awayTeamId, startTimeUtc, status non-null; startTimeUtc in range 1998-01-01 → 2026-12-31.

Show code
import json

_GE_RAW_PATH = project_root / "data/evaluation/ge_raw.json"

if _GE_RAW_PATH.exists():
    _raw_report = json.loads(_GE_RAW_PATH.read_text())
    _raw_results = _raw_report.get("results", [])
    _raw_success = _raw_report.get("success", None)

    _rows = []
    for _r in _raw_results:
        _etype = _r["expectation_config"]["type"]
        _kwargs = _r["expectation_config"].get("kwargs", {})
        _col    = _kwargs.get("column", "—")
        _ok     = _r["success"]
        _result = _r.get("result", {})
        _detail = ""
        if not _ok:
            _detail = str(_result)
        _rows.append({"Expectation": _etype, "Column": _col,
                      "Status": "✅ PASS" if _ok else "❌ FAIL", "Detail": _detail})

    _df_ge = pd.DataFrame(_rows)
    _n_pass = (_df_ge["Status"] == "✅ PASS").sum()
    _n_fail = (_df_ge["Status"] == "❌ FAIL").sum()

    _suite_status = "✅ PASS" if _raw_success else "❌ FAIL"
    display(Markdown(f"**Suite result: {_suite_status}** — {_n_pass} passed, {_n_fail} failed"))
    display(
        _df_ge.style
        .apply(lambda x: ["background-color: #fdd" if v == "❌ FAIL" else "" for v in x], subset=["Status"])
        .set_caption("Great Expectations — raw_match_suite")
    )
else:
    display(Markdown("::: {.callout-warning}\n`data/evaluation/ge_raw.json` not found. Run `dvc repro validate_raw` first.\n:::"))
Table 9: Great Expectations — raw_match_suite

Suite result: ✅ PASS — 9 passed, 0 failed

  Expectation Column Status Detail
0 expect_table_columns_to_match_set — ✅ PASS
1 expect_table_row_count_to_be_between — ✅ PASS
2 expect_column_values_to_not_be_null id ✅ PASS
3 expect_column_values_to_be_unique id ✅ PASS
4 expect_column_values_to_not_be_null homeTeamId ✅ PASS
5 expect_column_values_to_not_be_null awayTeamId ✅ PASS
6 expect_column_values_to_not_be_null startTimeUtc ✅ PASS
7 expect_column_values_to_be_between startTimeUtc ✅ PASS
8 expect_column_values_to_not_be_null status ✅ PASS

validate_finished — preprocessed finished split

Checks: required columns present; row count ≥ 1; all columns non-null; id unique; outcome_1x2 ∈ {0, 1, 2}; homeScore and awayScore in range [0, 15].

Show code
_GE_FINISHED_PATH = project_root / "data/evaluation/ge_finished.json"

if _GE_FINISHED_PATH.exists():
    _int_report = json.loads(_GE_FINISHED_PATH.read_text())
    _int_results = _int_report.get("results", [])
    _int_success = _int_report.get("success", None)

    _rows = []
    for _r in _int_results:
        _etype = _r["expectation_config"]["type"]
        _kwargs = _r["expectation_config"].get("kwargs", {})
        _col    = _kwargs.get("column", "—")
        _ok     = _r["success"]
        _result = _r.get("result", {})
        _detail = ""
        if not _ok:
            _detail = str(_result)
        _rows.append({"Expectation": _etype, "Column": _col,
                      "Status": "✅ PASS" if _ok else "❌ FAIL", "Detail": _detail})

    _df_ge = pd.DataFrame(_rows)
    _n_pass = (_df_ge["Status"] == "✅ PASS").sum()
    _n_fail = (_df_ge["Status"] == "❌ FAIL").sum()

    _suite_status = "✅ PASS" if _int_success else "❌ FAIL"
    display(Markdown(f"**Suite result: {_suite_status}** — {_n_pass} passed, {_n_fail} failed"))
    display(
        _df_ge.style
        .apply(lambda x: ["background-color: #fdd" if v == "❌ FAIL" else "" for v in x], subset=["Status"])
        .set_caption("Great Expectations — finished_suite")
    )
else:
    display(Markdown("::: {.callout-warning}\n`data/evaluation/ge_finished.json` not found. Run `dvc repro validate_finished` first.\n:::"))
Table 10: Great Expectations — finished_suite

Suite result: ✅ PASS — 18 passed, 0 failed

  Expectation Column Status Detail
0 expect_table_columns_to_match_set — ✅ PASS
1 expect_table_row_count_to_be_between — ✅ PASS
2 expect_column_values_to_not_be_null id ✅ PASS
3 expect_column_values_to_be_unique id ✅ PASS
4 expect_column_values_to_not_be_null homeTeamId ✅ PASS
5 expect_column_values_to_not_be_null awayTeamId ✅ PASS
6 expect_column_values_to_not_be_null startTimeUtc ✅ PASS
7 expect_column_values_to_not_be_null regionId ✅ PASS
8 expect_column_values_to_not_be_null tournamentId ✅ PASS
9 expect_column_values_to_not_be_null seasonId ✅ PASS
10 expect_column_values_to_not_be_null homeScore ✅ PASS
11 expect_column_values_to_be_between homeScore ✅ PASS
12 expect_column_values_to_not_be_null awayScore ✅ PASS
13 expect_column_values_to_be_between awayScore ✅ PASS
14 expect_column_values_to_not_be_null sumScore ✅ PASS
15 expect_column_values_to_not_be_null diffScore ✅ PASS
16 expect_column_values_to_not_be_null outcome_1x2 ✅ PASS
17 expect_column_values_to_be_in_set outcome_1x2 ✅ PASS

validate_future — upcoming matches

Checks: exactly the 9 identity columns and nothing else (exact_match=True — anti-leakage gate verifying score and target columns are absent); row count ≥ 1; all columns non-null; id unique.

Show code
_GE_FUTURE_PATH = project_root / "data/evaluation/ge_future.json"

if _GE_FUTURE_PATH.exists():
    _fut_report = json.loads(_GE_FUTURE_PATH.read_text())
    _fut_results = _fut_report.get("results", [])
    _fut_success = _fut_report.get("success", None)

    _rows = []
    for _r in _fut_results:
        _etype = _r["expectation_config"]["type"]
        _kwargs = _r["expectation_config"].get("kwargs", {})
        _col    = _kwargs.get("column", "—")
        _ok     = _r["success"]
        _result = _r.get("result", {})
        _detail = ""
        if not _ok:
            _detail = str(_result)
        _rows.append({"Expectation": _etype, "Column": _col,
                      "Status": "✅ PASS" if _ok else "❌ FAIL", "Detail": _detail})

    _df_ge = pd.DataFrame(_rows)
    _n_pass = (_df_ge["Status"] == "✅ PASS").sum()
    _n_fail = (_df_ge["Status"] == "❌ FAIL").sum()

    _suite_status = "✅ PASS" if _fut_success else "❌ FAIL"
    display(Markdown(f"**Suite result: {_suite_status}** — {_n_pass} passed, {_n_fail} failed"))
    display(
        _df_ge.style
        .apply(lambda x: ["background-color: #fdd" if v == "❌ FAIL" else "" for v in x], subset=["Status"])
        .set_caption("Great Expectations — future_match_suite")
    )
else:
    display(Markdown("::: {.callout-warning}\n`data/evaluation/ge_future.json` not found. Run `dvc repro validate_future` first.\n:::"))
Table 11: Great Expectations — future_match_suite

Suite result: ✅ PASS — 12 passed, 0 failed

  Expectation Column Status Detail
0 expect_table_columns_to_match_set — ✅ PASS
1 expect_table_row_count_to_be_between — ✅ PASS
2 expect_column_values_to_not_be_null id ✅ PASS
3 expect_column_values_to_be_unique id ✅ PASS
4 expect_column_values_to_not_be_null startTimeUtc ✅ PASS
5 expect_column_values_to_not_be_null sex ✅ PASS
6 expect_column_values_to_not_be_null regionId ✅ PASS
7 expect_column_values_to_not_be_null tournamentId ✅ PASS
8 expect_column_values_to_not_be_null seasonId ✅ PASS
9 expect_column_values_to_not_be_null stageId ✅ PASS
10 expect_column_values_to_not_be_null homeTeamId ✅ PASS
11 expect_column_values_to_not_be_null awayTeamId ✅ PASS

5. EDA and Preprocessing Summary

The table below consolidates all phases covered in this report — from raw data loading to quality-gated artifact output — so the scope of each step and its verification gate are visible in one place.

Phase Scope Artifact / Quality gate
Data loading match_raw.parquet ingested from MinIO df_match_raw
EDA Status distribution, regions/tournaments, seasonality, gender, score distributions, MC coverage Sections 1–2 of this report
Missing values Null share visualised per column; mandatory key columns enforced GE ExpectColumnValuesToNotBeNull (validate_raw)
Duplicates id uniqueness enforced at ingestion and after preprocessing GE ExpectColumnValuesToBeUnique (validate_raw, validate_finished, validate_future)
Column selection UI/display and live-match columns dropped (≈40); see Implementation above finished.parquet, future.parquet
Type downcasting IDs → int16 / int32; scores → int8 Memory-efficient parquet schema
Temporal sort & split Sort ascending by UTC kickoff; partition status=6 (finished) vs status=1 after last finished (future) finished.parquet, future.parquet
Outlier clipping Scores clipped at score_outlier_pct quantile (~0.01% of matches; forfeits and data errors only); threshold visualised on score charts above GE ExpectColumnValuesToBeBetween (validate_finished)
Target derivation outcome_1x2 (3-class classification); homeScore, awayScore, sumScore, diffScore (regression) finished.parquet
Anti-leakage gate Score and target columns verified absent from future split GE ExpectTableColumnsToMatchSet(exact_match=True) (validate_future)
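The type-downcasting step in the table above can be sketched as follows. The frame and dtype choices are illustrative, assuming the value ranges stated earlier (scores clipped to [0, 15] fit int8; IDs fit int16/int32):

```python
# Hedged sketch of the dtype downcasting step from the summary table.
# The raw frame is illustrative; dtypes assume the documented value ranges.
import pandas as pd

raw = pd.DataFrame({
    "regionId": [1, 2, 3],
    "homeTeamId": [100_000, 200_000, 300_000],
    "homeScore": [2, 0, 5],
})

downcast_map = {"regionId": "int16", "homeTeamId": "int32", "homeScore": "int8"}
compact = raw.astype(downcast_map)

print(compact.dtypes.to_dict())
print(f"{raw.memory_usage(deep=True).sum()} -> {compact.memory_usage(deep=True).sum()} bytes")
```

The explicit `downcast_map` is safer than `pd.to_numeric(..., downcast=...)` here because the target widths are fixed by the documented schema rather than inferred from whatever values a given batch happens to contain.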