SoccerPredictAI


EDA & Preprocessing

Football 1X2 — raw data exploration, preprocessing and target analysis

Author

Dima Ivanov

Published

May 11, 2026

Show code
import sys
from pathlib import Path

project_root = Path().resolve().parent.parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from IPython.display import display, Markdown

from src.app.config import settings

import yaml
with open(project_root / "params.yaml") as _f:
    PARAMS = yaml.safe_load(_f)
Show code
df_match_raw = pd.read_parquet(settings.data_raw_path / "match_raw.parquet")
df_finished = pd.read_parquet(settings.data_interim_path / "finished.parquet")
df_future = pd.read_parquet(settings.data_interim_path / "future.parquet")
Show code
import subprocess
import yaml as _yaml
_git_hash = subprocess.check_output(
    ["git", "rev-parse", "--short", "HEAD"], cwd=project_root
).decode().strip()

# Read DVC MD5 from dvc.lock — content-addressable hash stable across
# cp / rsync / touch operations, unlike mtime.
_dvc_md5 = "—"
_dvc_lock_path = project_root / "dvc.lock"
if _dvc_lock_path.exists():
    _dvc_lock = _yaml.safe_load(_dvc_lock_path.read_text())
    # Take the first matching output; the original nested-loop `break`
    # only exited the inner loop and kept scanning remaining stages.
    _dvc_md5 = next(
        (
            _out.get("md5", "—")
            for _stage in _dvc_lock.get("stages", {}).values()
            for _out in _stage.get("outs", [])
            if _out.get("path") == "data/raw/match_raw.parquet"
        ),
        "—",
    )

_future_start = pd.to_datetime(df_future["startTimeUtc"]).min().date()
_future_end = pd.to_datetime(df_future["startTimeUtc"]).max().date()
display(Markdown(
    f"- **Git commit:** `{_git_hash}`  \n"
    f"- **match_raw.parquet DVC MD5:** `{_dvc_md5}`  \n"
    f"- **finished.parquet rows:** {len(df_finished):,}  \n"
    f"- **future.parquet rows:** {len(df_future):,} (`{_future_start}` → `{_future_end}`)"
))
del _future_start, _future_end, _dvc_md5, _dvc_lock_path
  • Git commit: a4f939a
  • match_raw.parquet DVC MD5: 60735b9fb3553c811941dd441c8aceb9
  • finished.parquet rows: 966,140
  • future.parquet rows: 3,186 (2026-04-16 → 2026-05-01)
Caution

Reproducibility note. The figures and statistics in this report are tied to a specific snapshot of the raw data. The Git commit above identifies the pipeline code; the DVC MD5 above is the content-addressable hash of match_raw.parquet as recorded in dvc.lock — it is stable across file copies and touch operations. To reproduce this report exactly, run dvc repro from the same commit before rendering.

1. Raw Data Overview

Show code
display(Markdown(
f"""* **Shape:** {df_match_raw.shape[0]:,} rows × {df_match_raw.shape[1]} columns
* **Unique matches:** {df_match_raw['id'].nunique():,}
* **Date range:** {df_match_raw['startTimeUtc'].min().date()} → {df_match_raw['startTimeUtc'].max().date()}
"""))
Table 1
  • Shape: 988,189 rows × 57 columns
  • Unique matches: 988,189
  • Date range: 1998-06-30 → 2026-05-01
Show code
_null_counts = df_match_raw.isnull().sum()
_dtype_table = pd.DataFrame({
    "dtype": df_match_raw.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(df_match_raw) * 100).round(2),
    "unique": df_match_raw.nunique(),
}).sort_values("null_%", ascending=False)

display(
    _dtype_table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _null_counts
del _dtype_table
Table 2: Column types, null counts and cardinality
  dtype null_count null_% unique
matchArgs object 988189 100.00% 0
matchHeader object 988189 100.00% 0
aggregateWinnerField float64 987346 99.91% 2
homePenaltyScore float64 980482 99.22% 18
awayPenaltyScore float64 980483 99.22% 19
awayExtratimeScore float64 978953 99.07% 12
homeExtratimeScore float64 978915 99.06% 11
extraResultField float64 975099 98.68% 2
winnerField float64 258588 26.17% 2
lastScorer float64 257016 26.01% 2
scoreChangedAt datetime64[ns] 253444 25.65% 716160
firstHalfEndedAtUtc datetime64[ns] 88797 8.99% 804531
secondHalfStartedAtUtc datetime64[ns] 37560 3.80% 721667
startedAtUtc datetime64[ns] 31961 3.23% 702055
tournamentSortOrder float64 26561 2.69% 58
awayScore float64 20640 2.09% 28
homeScore float64 20640 2.09% 23
regionCode object 880 0.09% 101
tournamentName object 728 0.07% 313
awayTeamCountryCode object 449 0.05% 213
homeTeamCountryCode object 417 0.04% 214
sex int64 0 0.00% 2
tournamentId int64 0 0.00% 468
stageName object 0 0.00% 2136
stageId int64 0 0.00% 12438
homeTeamCountryName object 0 0.00% 215
homeTeamName object 0 0.00% 10272
homeYellowCards int64 0 0.00% 11
homeRedCards int64 0 0.00% 7
homeTeamId int64 0 0.00% 10428
id int64 0 0.00% 988189
status int64 0 0.00% 8
startTime datetime64[ns] 0 0.00% 247929
isOpta bool 0 0.00% 2
regionName object 0 0.00% 102
navigationDisplayMode int64 0 0.00% 3
regionId int64 0 0.00% 102
seasonName object 0 0.00% 68
seasonId int64 0 0.00% 6121
stageSortOrder int64 0 0.00% 590
awayTeamId int64 0 0.00% 9240
isTopMatch bool 0 0.00% 2
elapsed object 0 0.00% 13
hasIncidentsSummary bool 0 0.00% 2
hasPreview bool 0 0.00% 2
awayYellowCards int64 0 0.00% 12
awayTeamName object 0 0.00% 9068
awayRedCards int64 0 0.00% 7
awayTeamCountryName object 0 0.00% 214
period int64 0 0.00% 7
startTimeUtc datetime64[ns] 0 0.00% 247934
isStreamAvailable bool 0 0.00% 1
matchIsOpta bool 0 0.00% 2
isLineupConfirmed bool 0 0.00% 2
commentCount int64 0 0.00% 265
bets int64 0 0.00% 2
incidents int64 0 0.00% 27

Column descriptions for the raw matches dataframe

Competition / geography

Column Type Description
tournamentId int Tournament/league identifier.
tournamentName string Tournament/league name.
tournamentSortOrder float Provider sort order for tournaments (UI/display).
stageId int Stage identifier (e.g., league, playoffs).
stageName string Stage name.
stageSortOrder int Provider sort order for stages (UI/display).
seasonId int Season identifier.
seasonName string Season label (e.g., "2002", "2023/2024").
regionId int Region identifier (country/region in provider taxonomy).
regionName string Region name (e.g., "USA", "Sweden").
regionCode string Region short code (e.g., "us", "se").
sex int Competition gender category (1 = male, 2 = female).

Match identity, status, and time

Column Type Description
id int Match identifier.
status* int Match status code.
startTime datetime Scheduled kickoff time (provider/local representation).
startTimeUtc datetime Scheduled kickoff time in UTC.
navigationDisplayMode int Provider UI/navigation display mode.
isOpta bool Indicates provider data sourced from Opta.
matchIsOpta bool Match-level Opta flag (often redundant with isOpta).

Home team

Column Type Description
homeTeamId int Home team identifier.
homeTeamName string Home team name.
homeTeamCountryCode string Home team country code.
homeTeamCountryName string Home team country name.
homeYellowCards* int Home team yellow cards.
homeRedCards* int Home team red cards.

Away team

Column Type Description
awayTeamId int Away team identifier.
awayTeamName string Away team name.
awayTeamCountryCode string Away team country code.
awayTeamCountryName string Away team country name.
awayYellowCards* int Away team yellow cards.
awayRedCards* int Away team red cards.

Scores (final / extra time / penalties)

Column Type Description
homeScore* float Home goals (regular time / final main score).
awayScore* float Away goals (regular time / final main score).
homeExtratimeScore* float Home goals in extra time.
awayExtratimeScore* float Away goals in extra time.
homePenaltyScore* float Home goals in penalty shootout.
awayPenaltyScore* float Away goals in penalty shootout.

Result / winner (provider fields)

Column Type Description
winnerField* float Provider “winner” code/flag.
aggregateWinnerField* float Winner over two legs (aggregate).
extraResultField* float Provider extra result code (AET/PEN/awarded/etc.).
period* int Provider period/state code (FT/AET/PEN/etc.).

Live / content availability & timeline

Column Type Description
hasIncidentsSummary* bool Whether an incidents/events summary is available.
hasPreview bool Whether a match preview is available.
scoreChangedAt* datetime Timestamp of the last score update.
elapsed* string Match clock or textual state.
lastScorer* float Provider ID of the last scorer.
isTopMatch bool Provider “top match” flag (promoted/high-interest).
commentCount* int Number of comments in provider UI.
isLineupConfirmed bool Whether starting lineups are confirmed.
isStreamAvailable bool Whether a stream is available in provider UI.
startedAtUtc* datetime Actual kickoff timestamp in UTC (when match really started).
firstHalfEndedAtUtc* datetime End of 1st half timestamp in UTC.
secondHalfStartedAtUtc* datetime Start of 2nd half timestamp in UTC.

Misc / nested blocks (often provider-specific)

Column Type Description
incidents* int Incidents/events indicator or count.
bets int Betting/offers indicator or count.
matchArgs object Placeholder for nested match arguments payload.
matchHeader object Placeholder for nested match header payload.
  • Columns marked with * are available only during or after match completion.

Match status codes

Code Meaning Notes
0 Unknown (Not played) Unclear status from source.
1 Upcoming Scheduled upcoming match.
2 Postponed Postponed (per source).
3 In progress Started but not finished yet.
4 Unknown (Not played) Unclear status from source.
5 Unknown Unclear status from source.
6 Finished Match completed (final result available).
7 Cancelled Match cancelled.
Show code
_STATUS_LABELS = {
    0: "Unknown (Not played)",
    1: "Upcoming",
    2: "Postponed",
    3: "In progress",
    4: "Unknown (Not played)",
    5: "Unknown",
    6: "Finished",
    7: "Cancelled",
    61: "Custom: finished-like (no MC)",
    62: "Custom: finished-like (error)",
    63: "Custom: finished-like (empty MC)",
}
_status_counts = df_match_raw["status"].value_counts().sort_index().rename("count")
_status_df = _status_counts.to_frame()
_status_df.index.name = "code"
_status_df["meaning"] = _status_df.index.map(_STATUS_LABELS).fillna("—")
_status_df["share_%"] = (_status_df["count"] / _status_df["count"].sum() * 100).round(2)

display(
    _status_df[["meaning", "count", "share_%"]]
    .style
    .format({"share_%": "{:.2f}%"})
    .set_caption("Match status distribution")
)

del _status_counts
del _status_df
Table 3: Match status distribution
  meaning count share_%
code      
0 Unknown (Not played) 418 0.04%
1 Upcoming 3611 0.37%
2 Postponed 204 0.02%
3 In progress 1123 0.11%
4 Unknown (Not played) 2 0.00%
5 Unknown 284 0.03%
6 Finished 966140 97.77%
7 Cancelled 16407 1.66%
Note

status=6 (Finished) rows form the training dataset; status=1 (Upcoming) rows after the last finished match form the inference-ready future split. Together these two groups account for the vast majority of the dataset. All other statuses — Postponed, Cancelled, In progress, Unknown — are discarded at the preprocessing stage and do not enter the model pipeline.
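The split rule described here can be sketched on a toy frame (illustrative data only; the production logic lives in preprocess_and_split() in src/data/preprocess.py):

```python
import pandas as pd

# Toy frame mimicking the raw schema: status 6 = Finished, 1 = Upcoming,
# 7 = Cancelled (discarded).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "status": [6, 6, 1, 7, 1],
    "startTimeUtc": pd.to_datetime([
        "2026-04-10", "2026-04-15", "2026-04-20", "2026-04-21", "2026-04-25",
    ]),
})

# Finished split: training data.
finished = df[df["status"] == 6]
last_finished = finished["startTimeUtc"].max()

# Future split: upcoming matches strictly after the last finished kickoff.
future = df[(df["status"] == 1) & (df["startTimeUtc"] > last_finished)]

# All other statuses (postponed, cancelled, in progress, unknown) are dropped.
```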

Note

Live match examples are available at http://time2bet.ru/.

Use Status → Choose options to filter by status code, then click any match to inspect its details.

Regions and tournaments

Show code
_fig, _axes = plt.subplots(2, 1, figsize=(9, 10))

_col_name_map = {"tournamentId": "tournamentName", "regionId": "regionName"}

for _ax, _col, _title in zip(
    _axes,
    ["regionId", "tournamentId"],
    ["Matches by region (top 20)", "Matches by tournament (top 20)"],
):
    _name_col = _col_name_map[_col]

    _vc = (
        df_match_raw.groupby([_col, _name_col])
        .size()
        .reset_index(name="count")
        .sort_values("count", ascending=False)
        .head(20)
    )
    _labels = _vc[_name_col].astype(str)

    _ax.barh(_labels, _vc["count"], color="steelblue")
    _ax.invert_yaxis()
    _ax.set_xlabel("# matches")
    _ax.set_title(_title)
    for _bar, _val in zip(_ax.patches, _vc["count"]):
        _ax.text(_bar.get_width() * 1.01, _bar.get_y() + _bar.get_height() / 2,
                 f"{_val:,}", va="center", fontsize=7)

plt.tight_layout()
plt.show()
Figure 1: Top-20 distributions: regionId and tournamentId

Matches per month and seasonality

Show code
_df_t = df_finished.copy()
_df_t["_dt"] = pd.to_datetime(_df_t["startTimeUtc"])
_df_t["_year"] = _df_t["_dt"].dt.year
_df_t["_month"] = _df_t["_dt"].dt.month

_heat = _df_t.groupby(["_year", "_month"]).size().unstack(fill_value=0)
_heat.columns = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"][:len(_heat.columns)]

_fig, _ax = plt.subplots(figsize=(9, max(4, len(_heat) * 0.35)))
sns.heatmap(_heat, annot=True, fmt="d", cmap="YlOrRd", linewidths=0.3, ax=_ax)
_ax.set_title("Matches per month — seasonality heatmap")
_ax.set_xlabel("Month")
_ax.set_ylabel("Year")
plt.tight_layout()
plt.show()

del _df_t
Figure 2: Monthly seasonality heatmap (year × month → match count)
Note

Seasonal gap. June–July show a consistent drop in match count across all years, reflecting the European summer break. For rolling-window features (e.g. 5-match form) computed on historical data, values for matches immediately after the break are computed over a stale window spanning the previous season. This is expected and acceptable: the feature engineering stage does not reset rolling accumulators at season boundaries, so the last pre-break games remain the most recent context available at inference time.
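A minimal sketch of this stale-window behaviour, using a hypothetical single-team goal series (the real features are built in the feature engineering stage, not here):

```python
import pandas as pd

# One team's goals across a season boundary; no reset at the summer break.
goals = pd.Series(
    [2, 1, 3, 0, 2, 1],
    index=pd.to_datetime([
        "2025-05-01", "2025-05-10", "2025-05-20",   # last pre-break games
        "2025-08-15", "2025-08-22", "2025-08-29",   # new season
    ]),
)

# 3-match rolling mean, shifted so each value uses only prior matches.
form = goals.shift(1).rolling(3, min_periods=1).mean()

# The first post-break match (2025-08-15) is scored from the pre-break window
# [2, 1, 3] -> 2.0: stale, but still the most recent context available.
```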

Matches by gender

Show code
_df_gender = df_finished.copy()
_df_gender["_year"] = pd.to_datetime(_df_gender["startTimeUtc"]).dt.year

_by_year_sex = (
    _df_gender.groupby(["_year", "sex"]).size().reset_index(name="count")
    if "sex" in _df_gender.columns else
    _df_gender.groupby("_year").size().reset_index(name="count")
)

_fig, _ax = plt.subplots(figsize=(9, 5))
if "sex" in _by_year_sex.columns:
    _sex_map = {1: "Male", 2: "Female"}
    for _s, _grp in _by_year_sex.groupby("sex"):
        _ax.bar(_grp["_year"].astype(str), _grp["count"], label=_sex_map.get(_s, str(_s)), alpha=0.85)
    _ax.legend(title="Gender")
else:
    _ax.bar(_by_year_sex["_year"].astype(str), _by_year_sex["count"], color="steelblue")

_ax.set_xlabel("Year")
_ax.set_ylabel("# matches")
_ax.set_title("Finished matches per year")
_ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.show()

del _df_gender
Figure 3: Matches per year by competition gender

Score

Show code
_df_scores = df_match_raw[["homeScore", "awayScore"]].dropna()

_fig, _axes = plt.subplots(2, 1, figsize=(9, 10), sharey=False)
_fig.suptitle("Goal count distributions — raw finished matches", fontsize=13)

for _ax, (_col, _label, _color) in zip(
    _axes,
    [("homeScore", "Home goals", "#2196F3"), ("awayScore", "Away goals", "#4CAF50")],
):
    _counts = _df_scores[_col].value_counts().sort_index()
    _ax.bar(_counts.index.astype(int).astype(str), _counts.values, color=_color, width=0.85)
    _ax.set_xlabel("Goals scored")
    _ax.set_ylabel("# matches")
    _ax.set_title(_label)
    for _bar, _val in zip(_ax.patches, _counts.values):
        if _val > 0:
            _ax.text(
                _bar.get_x() + _bar.get_width() / 2,
                _bar.get_height() + _counts.max() * 0.01,
                f"{_val:,}", ha="center", va="bottom", fontsize=7,
            )
    # Mark the outlier clip threshold (computed on the finished split, consistent with preprocess.py)
    _upper = int(df_finished[_col].quantile(PARAMS["preprocessing"]["score_outlier_pct"]))
    _x_labels = [str(int(x)) for x in sorted(_counts.index.astype(int).unique())]
    if str(_upper) in _x_labels:
        _ax.axvline(x=_x_labels.index(str(_upper)), color="crimson",
                    linestyle="--", linewidth=1.5, label=f"clip threshold = {_upper}")
        _ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

del _df_scores
Figure 4: Goal count distributions — raw finished matches

Matches with in-match statistics

A subset of finished matches has rich live match center (MC) data collected during the game: per-minute team stats (shots, possession, passes, aerials, tackles, dribbles), referees, lineups, venue, and score timeline. This data is stored across dedicated tables — matches_live_header, matches_live_info, matches_live_{home,away}_stats, etc.

Show code
_df_mc = df_match_raw[
    (df_match_raw["status"] == 6)
    & (df_match_raw["period"] != 0)
    & (df_match_raw["hasPreview"])
]
_total = len(df_match_raw[df_match_raw["status"] == 6])
display(Markdown(
    f"**{len(_df_mc):,}** of **{_total:,}** finished matches "
    f"({len(_df_mc)/_total*100:.1f}%) have in-match statistics available."
))

del _df_mc
Table 4

65,391 of 966,140 finished matches (6.8%) have in-match statistics available.

Show code
_df_finished_raw = df_match_raw[df_match_raw["status"] == 6].copy()
_df_finished_raw["_year"] = pd.to_datetime(_df_finished_raw["startTimeUtc"]).dt.year
_df_finished_raw["_has_mc"] = (
    (_df_finished_raw["period"] != 0) & (_df_finished_raw["hasPreview"])
)
_mc_map = {True: "With MC stats", False: "Without MC stats"}
_by_year_mc = (
    _df_finished_raw.groupby(["_year", "_has_mc"])
    .size()
    .reset_index(name="count")
)

_fig, _ax = plt.subplots(figsize=(9, 5))
for _has_mc, _grp in _by_year_mc.groupby("_has_mc"):
    _ax.bar(
        _grp["_year"].astype(str), _grp["count"],
        label=_mc_map[_has_mc], alpha=0.85,
        color="#4CAF50" if _has_mc else "#90CAF9",
    )
_ax.set_xlabel("Year")
_ax.set_ylabel("# matches")
_ax.set_title("Finished matches per year — with vs without in-match statistics")
_ax.tick_params(axis="x", rotation=45)
_ax.legend(title="MC stats")
plt.tight_layout()
plt.show()

del _df_finished_raw
Figure 5: Finished matches per year — with vs without in-match statistics
Note

MC data availability. In-match statistics are available only from a certain year onward, reflecting the point at which live data collection was integrated into the pipeline. Matches before that cutoff have hasPreview = False or period = 0 and are treated as without MC. This structural gap is visible in the chart above as a step-change in the “With MC stats” series. Any future feature set built on MC data will be constrained to the post-cutoff subset, reducing training set size relative to the full finished split.

Note

To explore matches with in-match statistics on http://time2bet.ru/:

  1. Enable the “Only with MC stats” checkbox in the filters panel.
  2. Click any match link — this opens the match detail page.
  3. Select the “Match Centre” tab to view live stats: timeline, possession, shots, passes, and more.
Important

Out of scope for v1. In-match statistics are not used as features in the current pipeline. All models rely solely on pre-match information to avoid data leakage. Live stats are a natural candidate for future feature iterations (e.g. half-time retraining, live odds adjustment).

2. Data Preparation Pipeline

This section documents the full data preparation phase: from raw ingestion and exploratory profiling through preprocessing and artifact output. The implementation lives in src/data/preprocess.py and is orchestrated as a DVC stage (preprocess).

Implementation — preprocess_and_split()

The following steps are applied inside preprocess_and_split() in the order they execute:

  1. Drop irrelevant columns — removes display/UI fields (stageName, regionName, isOpta, sort-order keys, etc.) and all live/post-match columns (elapsed, scoreChangedAt, period, incidents, etc.) — 46 of the 57 raw columns in total.
  2. Downcast ID types — casts identifier columns to compact integer types (tournamentId, regionId, stageId, seasonId → int16; id, homeTeamId, awayTeamId → int32; sex → int8) to reduce memory footprint.
  3. Parse and sort by time — converts startTimeUtc to UTC-aware datetime, then sorts all rows ascending by kickoff time.
  4. Split by status — selects status=6 rows as finished and status=1 rows whose kickoff is strictly after the last finished match as future. All other statuses (postponed, cancelled, in-progress, unknown) are discarded. The status column is dropped after partitioning.
  5. Downcast scores — casts homeScore and awayScore to int8 on the finished split after null removal.
  6. Compute classification target — derives outcome_1x2 (0 = home win, 1 = draw, 2 = away win) from raw scores before clipping, so the label is unaffected by the clip operation.
  7. Clip outlier scores — for each of homeScore / awayScore independently, computes the params.yaml → preprocessing.score_outlier_pct quantile (default 0.9999, i.e. the 99.99th percentile) on the finished split and clips values above that threshold. Percentile is preferred over IQR (whose fence falls inside the normal-score range for Poisson-concentrated data) and Z-score (which assumes normality). At 0.9999 the threshold is ~10–12 goals, clipping only technical results/data errors (~0.01% of matches) while keeping all legitimate high-scoring games. Matches are kept (not dropped) so team history remains continuous for rolling stats and ELO.
  8. Compute regression targets — derives sumScore = homeScore + awayScore and diffScore = homeScore − awayScore (after clipping, so derived targets are consistent with the clipped scores).
  9. Drop intermediate columns — removes the temporary binary flags used during target derivation (homeWin, awayWin, draw) and disciplinary columns (homeYellowCards, awayYellowCards, homeRedCards, awayRedCards) that are excluded from the v1 feature set.
  10. Drop score columns from future — removes homeScore, awayScore, and all extra-time / penalty score fields from the future split to prevent any target leakage.
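Steps 6–8 can be sketched in isolation. This is a simplified toy version, not the implementation in preprocess_and_split(): the toy quantile is 0.75 purely so the four-row example actually clips something, whereas production uses preprocessing.score_outlier_pct = 0.9999.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"homeScore": [3, 0, 1, 15], "awayScore": [1, 0, 2, 0]})

# Step 6 — classification target from raw scores, BEFORE clipping:
# 0 = home win, 1 = draw, 2 = away win.
df["outcome_1x2"] = np.select(
    [df["homeScore"] > df["awayScore"], df["homeScore"] == df["awayScore"]],
    [0, 1],
    default=2,
)

# Step 7 — clip each score column independently at a high quantile.
for col in ("homeScore", "awayScore"):
    upper = df[col].quantile(0.75)  # production: 0.9999
    df[col] = df[col].clip(upper=upper)

# Step 8 — regression targets AFTER clipping, so they stay consistent
# with the clipped scores.
df["sumScore"] = df["homeScore"] + df["awayScore"]
df["diffScore"] = df["homeScore"] - df["awayScore"]
```

Note the ordering: the 1X2 label is derived first, so a clipped 15–0 result still counts as a home win even though its score is capped.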
Note

All steps use only pre-match information or post-match results that are applied to the finished split only. The future split never receives target or score columns.

Raw data — columns selected for preprocessing

The table below shows only the columns retained after dropping display/UI and live-match fields. homeScore and awayScore are null for future matches and will be separated in the next step.

Show code
_columns_for_preprocessing = [
    "startTimeUtc",
    "id",
    "sex",
    "regionId",
    "tournamentId",
    "seasonId",
    "stageId",
    "homeTeamId",
    "awayTeamId",
    "homeScore",
    "awayScore",
]
_df = df_match_raw[_columns_for_preprocessing]
_null_counts = _df.isnull().sum()
_table = pd.DataFrame({
    "dtype": _df.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(_df) * 100).round(2),
    "unique": _df.nunique(),
    "min": _df.min(),
    "max": _df.max(),
}).loc[_columns_for_preprocessing]

display(
    _table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _df
del _null_counts
del _table
Table 5: Column types, null counts and cardinality
  dtype null_count null_% unique min max
startTimeUtc datetime64[ns] 0 0.00% 247934 1998-06-30 19:00:00 2026-05-01 21:30:00
id int64 0 0.00% 988189 1 1978419
sex int64 0 0.00% 2 1 2
regionId int64 0 0.00% 102 3 265
tournamentId int64 0 0.00% 468 1 783
seasonId int64 0 0.00% 6121 1 11075
stageId int64 0 0.00% 12438 1 25346
homeTeamId int64 0 0.00% 10428 1 32635
awayTeamId int64 0 0.00% 9240 1 32633
homeScore float64 20640 2.09% 23 0.000000 101.000000
awayScore float64 20640 2.09% 28 0.000000 95.000000

Finished split — schema after preprocessing

finished.parquet contains only status=6 matches. Derived targets (sumScore, diffScore, outcome_1x2) are appended.

Show code
_columns = [
    "startTimeUtc",
    "id",
    "sex",
    "regionId",
    "tournamentId",
    "seasonId",
    "stageId",
    "homeTeamId",
    "awayTeamId",
    "homeScore",
    "awayScore",
    "sumScore",
    "diffScore",
    "outcome_1x2",
]
_df = df_finished
_null_counts = _df.isnull().sum()
_table = pd.DataFrame({
    "dtype": _df.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(_df) * 100).round(2),
    "unique": _df.nunique(),
    "min": _df.min(),
    "max": _df.max(),
}).loc[_columns]

display(
    _table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _df
del _null_counts
del _table
Table 6: Column types, null counts and cardinality
  dtype null_count null_% unique min max
startTimeUtc datetime64[ns, UTC] 0 0.00% 245557 1998-06-30 19:00:00+00:00 2026-04-16 02:00:00+00:00
id int32 0 0.00% 966140 1 1978288
sex int8 0 0.00% 2 1 2
regionId int16 0 0.00% 102 3 265
tournamentId int16 0 0.00% 461 1 783
seasonId int16 0 0.00% 6091 1 11071
stageId int16 0 0.00% 12379 1 25334
homeTeamId int32 0 0.00% 10335 1 32621
awayTeamId int32 0 0.00% 9146 1 32621
homeScore int8 0 0.00% 12 0 11
awayScore int8 0 0.00% 14 0 13
sumScore int8 0 0.00% 20 0 24
diffScore int8 0 0.00% 25 -13 11
outcome_1x2 int8 0 0.00% 3 0 2

Future split — schema after preprocessing

future.parquet contains only status=1 matches with kickoff strictly after the last finished match. Score and target columns are absent by design.

Show code
_columns = [
    "startTimeUtc",
    "id",
    "sex",
    "regionId",
    "tournamentId",
    "seasonId",
    "stageId",
    "homeTeamId",
    "awayTeamId",
]
_df = df_future
_null_counts = _df.isnull().sum()
_table = pd.DataFrame({
    "dtype": _df.dtypes.astype(str),
    "null_count": _null_counts,
    "null_%": (_null_counts / len(_df) * 100).round(2),
    "unique": _df.nunique(),
    "min": _df.min(),
    "max": _df.max(),
}).loc[_columns]

display(
    _table.style
    .background_gradient(subset=["null_%"], cmap="Reds")
    .format({"null_%": "{:.2f}%"})
    .set_caption("Column types, null counts and cardinality")
)

del _df
del _null_counts
del _table
Table 7: Column types, null counts and cardinality
  dtype null_count null_% unique min max
startTimeUtc datetime64[ns, UTC] 0 0.00% 574 2026-04-16 12:00:00+00:00 2026-05-01 21:30:00+00:00
id int32 0 0.00% 3186 1901243 1978419
sex int8 0 0.00% 2 1 2
regionId int16 0 0.00% 90 3 265
tournamentId int16 0 0.00% 195 1 783
seasonId int16 0 0.00% 195 10720 11075
stageId int16 0 0.00% 261 24478 25346
homeTeamId int32 0 0.00% 2515 1 32635
awayTeamId int32 0 0.00% 2510 1 32633
Warning

Cold-start risk. Teams that appear in the future split but have no history in the finished split will receive zero rolling statistics and the initial ELO rating (params.yaml → features.elo.initial_rating). This is expected behaviour and is handled at the feature engineering stage with explicit fallback defaults.
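A minimal sketch of the fallback pattern, with hypothetical in-memory stores and a stand-in constant for params.yaml → features.elo.initial_rating (the actual defaults live in the feature engineering stage):

```python
# Hypothetical per-team stores built from the finished split.
elo_ratings = {101: 1650.0, 202: 1480.0}   # teamId -> current ELO
rolling_form = {101: 1.8}                   # teamId -> rolling goal mean

INITIAL_RATING = 1500.0  # stand-in for features.elo.initial_rating

def team_features(team_id: int) -> dict:
    """Look up team features with explicit cold-start defaults."""
    return {
        # Unseen teams get the initial rating rather than a missing value.
        "elo": elo_ratings.get(team_id, INITIAL_RATING),
        # Zero rolling stats for teams with no finished-split history.
        "form": rolling_form.get(team_id, 0.0),
    }
```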

Show code
_known_home = set(df_finished["homeTeamId"]) | set(df_finished["awayTeamId"])
_future_home = set(df_future["homeTeamId"])
_future_away = set(df_future["awayTeamId"])
_cold_home = _future_home - _known_home
_cold_away = _future_away - _known_home
_cold_any = (_future_home | _future_away) - _known_home
display(Markdown(
    f"**Cold-start teams (no history in finished split):** "
    f"{len(_cold_home)} appearing as home, "
    f"{len(_cold_away)} appearing as away "
    f"({len(_cold_any)} unique teams total)"
))
del _known_home, _future_home, _future_away, _cold_home, _cold_away, _cold_any
Table 8

Cold-start teams (no history in finished split): 9 appearing as home, 2 appearing as away (11 unique teams total)

3. Dataset definition and targets

Matches — Finished vs Future (column availability + target/label roles)

Column Finished Future Role
startTimeUtc ✓ ✓ Feature (time)
id ✓ ✓ Feature (match identity)
sex ✓ ✓ Feature (competition)
regionId ✓ ✓ Feature (geography)
tournamentId ✓ ✓ Feature (competition)
seasonId ✓ ✓ Feature (season)
stageId ✓ ✓ Feature (competition)
homeTeamId ✓ ✓ Feature (entity)
awayTeamId ✓ ✓ Feature (entity)
outcome_1x2 ✓ — Classification target (3-class label)
homeScore ✓ — Regression target
awayScore ✓ — Regression target
sumScore ✓ — Regression target (derived)
diffScore ✓ — Regression target (derived)
Important

v1 scope: This report focuses on the classification target (outcome_1x2: Home win / Draw / Away win). Regression targets (homeScore, awayScore, sumScore, diffScore) are out of scope for v1 and will be analysed in a future iteration.

Classification target

Show code
_outcome_labels = {0: "Home win", 1: "Draw", 2: "Away win"}
_props = df_finished["outcome_1x2"].value_counts(normalize=True).sort_index()
_props.index = [_outcome_labels.get(i, str(i)) for i in _props.index]

_fig, _ax = plt.subplots(figsize=(9, 5))
_ax.bar(_props.index, _props.values, color=["#2196F3", "#FF9800", "#4CAF50"])
_ax.set_ylabel("Proportion")
_ax.set_title("Overall class proportions")
for _bar, _val in zip(_ax.patches, _props.values):
    _ax.text(_bar.get_x() + _bar.get_width() / 2, _bar.get_height() + 0.002,
             f"{_val:.3f}", ha="center", fontsize=10)
plt.tight_layout()
plt.show()

display(
    _props.rename("proportion").to_frame()
    .assign(**{"count": df_finished["outcome_1x2"].value_counts().sort_index().values})
    .style.format({"proportion": "{:.4f}"})
    .set_caption("Classification target counts")
)
outcome_1x2 class distribution
  proportion count
Home win 0.4482 432984
Draw 0.2528 244265
Away win 0.2990 288891
Figure 6: Classification target counts
Note

Class imbalance. Home win is consistently the most frequent outcome; draws are the minority class. The imbalance is moderate (roughly 1.7–2× between majority and minority) and does not require resampling — tree-based models handle it natively. Probability calibration is applied at the final training stage to correct systematic bias in predicted probabilities (see params.yaml → final_train.calibration).
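As an illustration of what the calibration step looks like, here is scikit-learn's CalibratedClassifierCV on synthetic 3-class data with roughly the same class weights. This is a generic sketch — the project's actual calibration method and stage are whatever params.yaml → final_train.calibration configures, not necessarily this exact setup:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 3-class 1X2 problem (~45/25/30 class mix).
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=6, n_classes=3,
    weights=[0.45, 0.25, 0.30], random_state=0,
)

# Cross-validated isotonic calibration on top of a tree-based model:
# raw scores are remapped so predicted probabilities match observed rates.
model = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="isotonic", cv=3,
).fit(X, y)

proba = model.predict_proba(X[:5])  # calibrated probability vectors
```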

Temporal stability of outcome_1x2

A drift check: if the home-win rate shifts substantially over time, the temporal split strategy must account for it and calibration must be re-validated on recent data.

Show code
_OUTCOME_LABELS = {0: "Home win", 1: "Draw", 2: "Away win"}
_OUTCOME_COLORS = {0: "#2196F3", 1: "#FF9800", 2: "#4CAF50"}

_df_t = df_finished.copy()
_df_t["_year"] = pd.to_datetime(_df_t["startTimeUtc"]).dt.year

_year_totals = _df_t.groupby("_year").size()
_rates_yr = (
    _df_t.groupby(["_year", "outcome_1x2"])
    .size()
    .unstack(fill_value=0)
    .div(_year_totals, axis=0)
)

_fig, _ax = plt.subplots(figsize=(9, 5))
for _outcome, _label in _OUTCOME_LABELS.items():
    if _outcome in _rates_yr.columns:
        _ax.plot(
            _rates_yr.index.astype(str), _rates_yr[_outcome],
            marker="o", label=_label, color=_OUTCOME_COLORS[_outcome], linewidth=2,
        )

_ax.set_xlabel("Year")
_ax.set_ylabel("Proportion")
_ax.set_title("outcome_1x2 proportions per year — temporal stability")
_ax.legend(title="Outcome")
_ax.tick_params(axis="x", rotation=45)
_ax.set_ylim(0, 0.7)
plt.tight_layout()
plt.show()

del _df_t, _year_totals, _rates_yr
Figure 7: outcome_1x2 proportions per year — temporal stability
Note

Distribution is stable. Home-win, draw, and away-win rates remain within a narrow band across all years, with no sustained structural shift. This confirms that a single-split temporal train/test strategy is appropriate — the label distribution the model trains on is representative of the distribution it will be evaluated and deployed against. Any year-to-year variation visible in the chart is within expected sampling noise for the per-year match counts. Calibration should nonetheless be validated on the most recent data slice, as even small shifts compound in probability output.

Population heterogeneity — by tournament

The global class proportions above are an aggregate that masks substantial variation across competitions. This plot examines whether outcome distributions are homogeneous across the most-played tournaments, directly motivating params.yaml → classification.groupby_cols: ["regionId", "sex"] as stratification axes for prior estimation and probability calibration.

Show code
_TOP_N = 20
_OUTCOME_LABELS = {0: "Home win", 1: "Draw", 2: "Away win"}
_OUTCOME_COLORS = ["#2196F3", "#FF9800", "#4CAF50"]

# Recover tournament names from raw (dropped during preprocessing)
_meta = df_match_raw[["id", "tournamentName"]].drop_duplicates("id").set_index("id")
# on="id" matches the finished split's "id" column against _meta's index;
# a bare join would incorrectly align on the positional index.
_df_h = df_finished.join(_meta, on="id", how="left")

_top_t = _df_h["tournamentName"].value_counts().head(_TOP_N).index
_df_top = _df_h[_df_h["tournamentName"].isin(_top_t)].copy()

_rates_t = (
    _df_top.groupby(["tournamentName", "outcome_1x2"])
    .size()
    .unstack(fill_value=0)
    .div(_df_top.groupby("tournamentName").size(), axis=0)
    .sort_values(0, ascending=False)   # sort by home-win rate descending
)
_rates_t.columns = [_OUTCOME_LABELS[c] for c in _rates_t.columns]

_fig, _ax = plt.subplots(figsize=(9, max(5, len(_rates_t) * 0.42)))
_rates_t.plot(kind="barh", stacked=True, color=_OUTCOME_COLORS, ax=_ax, width=0.75)
_ax.invert_yaxis()
_ax.set_xlabel("Proportion")
_ax.set_title(f"outcome_1x2 distribution — top {_TOP_N} tournaments")
_ax.legend(title="Outcome", bbox_to_anchor=(1.01, 1), loc="upper left")
for _bar_container in _ax.containers:
    _ax.bar_label(_bar_container, fmt="%.2f", label_type="center", fontsize=7, color="white")
plt.tight_layout()
plt.show()

del _meta, _df_h, _df_top, _rates_t
Figure 8: outcome_1x2 proportions — top-20 tournaments by match count
Note

Observed heterogeneity. Home-win rates vary across tournaments (typically 38–52%), confirming that a single global prior is insufficient. regionId and sex are used as stratification axes because tournament-level granularity would introduce sparse strata for lower-division leagues. This is the direct justification for groupby_cols: ["regionId", "sex"] in calibration.
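The stratified-prior idea with a fallback for sparse strata can be sketched as follows. This is a minimal illustration, assuming the (regionId, sex) grouping from params.yaml; the toy frame and the min_n threshold are hypothetical, not pipeline values:

```python
# Hedged sketch: per-(regionId, sex) outcome priors with a global fallback
# for strata below a minimum sample size. Data and min_n are illustrative.
import pandas as pd

df = pd.DataFrame({
    "regionId": [1, 1, 1, 1, 2, 2, 2, 3],
    "sex":      ["M"] * 8,
    "outcome_1x2": [0, 0, 1, 2, 0, 2, 2, 1],
})

min_n = 3  # strata smaller than this fall back to the global prior
global_prior = df["outcome_1x2"].value_counts(normalize=True).sort_index()

def stratum_prior(group: pd.DataFrame) -> pd.Series:
    """Empirical outcome proportions, or the global prior if the stratum is sparse."""
    if len(group) < min_n:
        return global_prior
    return (
        group["outcome_1x2"]
        .value_counts(normalize=True)
        .reindex(global_prior.index, fill_value=0.0)
    )

priors = df.groupby(["regionId", "sex"])[["outcome_1x2"]].apply(stratum_prior)
print(priors)
```

Each row of `priors` sums to 1; the (3, "M") stratum has a single match and therefore inherits the global prior, which is exactly the sparse-strata problem that tournament-level grouping would make pervasive.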

Regression targets

Four regression targets are derived from the final, clipped scores. homeScore and awayScore are the primary outputs; sumScore (total goals) and diffScore (home minus away) are derived aggregates that encode different aspects of the result and may be predicted independently or used as auxiliary signals.

Show code
_cols_and_colors = [("homeScore", "#2196F3"), ("awayScore", "#4CAF50"),
                    ("sumScore", "#FF9800"), ("diffScore", "#9C27B0")]
_fig, _axes = plt.subplots(4, 1, figsize=(9, 16), constrained_layout=True)
_fig.suptitle("Regression target distributions", fontsize=14)

for _ax, (_col, _color) in zip(_axes, _cols_and_colors):
    if _col not in df_finished.columns:
        _ax.set_visible(False)
        continue
    _props_r = df_finished[_col].value_counts(normalize=True).sort_index()
    # Use numeric x-axis to preserve correct order for columns with negative values (diffScore)
    _x_vals = _props_r.index.tolist()
    _ax.bar(range(len(_x_vals)), _props_r.values, color=_color, width=0.9)
    _ax.set_xticks(range(len(_x_vals)))
    _ax.set_xticklabels([str(v) for v in _x_vals], fontsize=8)
    _ax.set_ylabel("Proportion")
    _ax.set_xlabel(_col)
    _ax.bar_label(_ax.containers[0], fmt="%.3f", fontsize=9, padding=2)
Figure 9: Score distributions — finished matches
Note

Distribution shapes. homeScore and awayScore are right-skewed and concentrated at 0–3 goals, consistent with a Poisson-like process — motivating Poisson regression or count-based models for v2. sumScore inherits the same right skew. diffScore is the only target that takes negative values (away win margin), making it incompatible with Poisson regression without transformation; a Skellam distribution (difference of two Poissons) would be the natural parametric choice. These characteristics are out of scope for v1 and are documented here for the v2 feature iteration.
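The Skellam observation can be illustrated directly. A hedged sketch, assuming home and away goals are independent Poisson variables; the mean-goal parameters below are illustrative league-typical values, not fitted estimates:

```python
# Hedged sketch: diffScore under a Skellam model. If home and away goals
# are Poisson(mu_home) and Poisson(mu_away), their difference is Skellam-
# distributed. The mu values are illustrative, not fitted to the data.
from scipy.stats import skellam

mu_home, mu_away = 1.5, 1.1   # assumed mean goals per side

# P(diffScore = k) for a few margins
for k in (-2, -1, 0, 1, 2):
    print(k, round(skellam.pmf(k, mu_home, mu_away), 3))

# Implied 1X2 probabilities from the same model
p_draw = skellam.pmf(0, mu_home, mu_away)
p_home = 1 - skellam.cdf(0, mu_home, mu_away)   # P(diff > 0)
p_away = skellam.cdf(-1, mu_home, mu_away)      # P(diff < 0)
print(round(p_home + p_draw + p_away, 6))
```

The same parametric form also yields 1X2 probabilities for free, which is one reason a Skellam or bivariate-Poisson head is an attractive v2 candidate.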

4. Data Quality Checks

Data quality is validated automatically by three DVC stages using Great Expectations:

DVC stage Input Suite Report
validate_raw data/raw/match_raw.parquet raw_match_suite data/evaluation/ge_raw.json
validate_finished data/interim/finished.parquet finished_suite data/evaluation/ge_finished.json
validate_future data/interim/future.parquet future_match_suite data/evaluation/ge_future.json

Each stage exits with code 1 on any expectation failure — the DVC pipeline will not proceed further. validate_future also acts as an anti-leakage gate: it asserts (via exact_match=True) that score and target columns are absent from the future split.
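The gate logic can be sketched without the Great Expectations machinery. The following is a minimal pandas stand-in illustrating what the exact-column-set (anti-leakage) check enforces; it mirrors the column names from this report but is not the actual suite:

```python
# Hedged sketch: a pandas stand-in for the validate_future gate logic.
# Illustrative only; the real check is a Great Expectations suite.
import pandas as pd

EXPECTED_FUTURE_COLS = {
    "id", "startTimeUtc", "sex", "regionId", "tournamentId",
    "seasonId", "stageId", "homeTeamId", "awayTeamId",
}
LEAKY_COLS = {"homeScore", "awayScore", "sumScore", "diffScore", "outcome_1x2"}

def validate_future_frame(df: pd.DataFrame) -> list:
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if set(df.columns) != EXPECTED_FUTURE_COLS:       # exact_match=True analogue
        failures.append(f"column set mismatch: {set(df.columns) ^ EXPECTED_FUTURE_COLS}")
    if LEAKY_COLS & set(df.columns):                  # explicit anti-leakage check
        failures.append(f"leaked target columns: {LEAKY_COLS & set(df.columns)}")
    if len(df) < 1:
        failures.append("empty frame")
    if "id" in df.columns and df["id"].duplicated().any():
        failures.append("duplicate ids")
    return failures

# In the pipeline, a non-empty failure list would translate to sys.exit(1),
# halting the downstream DVC stages.
```

Note that the exact-column-set check alone already catches leaked targets; the explicit `LEAKY_COLS` check is shown to make the anti-leakage intent readable in the failure message.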

validate_raw — raw data

Checks: required columns present (exact_match=False); row count ≥ 1; id unique; id, homeTeamId, awayTeamId, startTimeUtc, status non-null; startTimeUtc in range 1998-01-01 → 2026-12-31.

Show code
import json

_GE_RAW_PATH = project_root / "data/evaluation/ge_raw.json"

if _GE_RAW_PATH.exists():
    _raw_report = json.loads(_GE_RAW_PATH.read_text())
    _raw_results = _raw_report.get("results", [])
    _raw_success = _raw_report.get("success", None)

    _rows = []
    for _r in _raw_results:
        _etype = _r["expectation_config"]["type"]
        _kwargs = _r["expectation_config"].get("kwargs", {})
        _col    = _kwargs.get("column", "—")
        _ok     = _r["success"]
        _result = _r.get("result", {})
        _detail = ""
        if not _ok:
            _detail = str(_result)
        _rows.append({"Expectation": _etype, "Column": _col,
                      "Status": "✅ PASS" if _ok else "❌ FAIL", "Detail": _detail})

    _df_ge = pd.DataFrame(_rows)
    _n_pass = (_df_ge["Status"] == "✅ PASS").sum()
    _n_fail = (_df_ge["Status"] == "❌ FAIL").sum()

    _suite_status = "✅ PASS" if _raw_success else "❌ FAIL"
    display(Markdown(f"**Suite result: {_suite_status}** — {_n_pass} passed, {_n_fail} failed"))
    display(
        _df_ge.style
        .apply(lambda x: ["background-color: #fdd" if v == "❌ FAIL" else "" for v in x], subset=["Status"])
        .set_caption("Great Expectations — raw_match_suite")
    )
else:
    display(Markdown("::: {.callout-warning}\n`data/evaluation/ge_raw.json` not found. Run `dvc repro validate_raw` first.\n:::"))
Table 9: Great Expectations — raw_match_suite

Suite result: ✅ PASS — 9 passed, 0 failed

  Expectation Column Status Detail
0 expect_table_columns_to_match_set — ✅ PASS
1 expect_table_row_count_to_be_between — ✅ PASS
2 expect_column_values_to_not_be_null id ✅ PASS
3 expect_column_values_to_be_unique id ✅ PASS
4 expect_column_values_to_not_be_null homeTeamId ✅ PASS
5 expect_column_values_to_not_be_null awayTeamId ✅ PASS
6 expect_column_values_to_not_be_null startTimeUtc ✅ PASS
7 expect_column_values_to_be_between startTimeUtc ✅ PASS
8 expect_column_values_to_not_be_null status ✅ PASS

validate_finished — preprocessed finished split

Checks: required columns present; row count ≥ 1; all columns non-null; id unique; outcome_1x2 ∈ {0, 1, 2}; homeScore and awayScore in range [0, 15].

Show code
_GE_FINISHED_PATH = project_root / "data/evaluation/ge_finished.json"

if _GE_FINISHED_PATH.exists():
    _int_report = json.loads(_GE_FINISHED_PATH.read_text())
    _int_results = _int_report.get("results", [])
    _int_success = _int_report.get("success", None)

    _rows = []
    for _r in _int_results:
        _etype = _r["expectation_config"]["type"]
        _kwargs = _r["expectation_config"].get("kwargs", {})
        _col    = _kwargs.get("column", "—")
        _ok     = _r["success"]
        _result = _r.get("result", {})
        _detail = ""
        if not _ok:
            _detail = str(_result)
        _rows.append({"Expectation": _etype, "Column": _col,
                      "Status": "✅ PASS" if _ok else "❌ FAIL", "Detail": _detail})

    _df_ge = pd.DataFrame(_rows)
    _n_pass = (_df_ge["Status"] == "✅ PASS").sum()
    _n_fail = (_df_ge["Status"] == "❌ FAIL").sum()

    _suite_status = "✅ PASS" if _int_success else "❌ FAIL"
    display(Markdown(f"**Suite result: {_suite_status}** — {_n_pass} passed, {_n_fail} failed"))
    display(
        _df_ge.style
        .apply(lambda x: ["background-color: #fdd" if v == "❌ FAIL" else "" for v in x], subset=["Status"])
        .set_caption("Great Expectations — finished_suite")
    )
else:
    display(Markdown("::: {.callout-warning}\n`data/evaluation/ge_finished.json` not found. Run `dvc repro validate_finished` first.\n:::"))
Table 10: Great Expectations — finished_suite

Suite result: ✅ PASS — 18 passed, 0 failed

  Expectation Column Status Detail
0 expect_table_columns_to_match_set — ✅ PASS
1 expect_table_row_count_to_be_between — ✅ PASS
2 expect_column_values_to_not_be_null id ✅ PASS
3 expect_column_values_to_be_unique id ✅ PASS
4 expect_column_values_to_not_be_null homeTeamId ✅ PASS
5 expect_column_values_to_not_be_null awayTeamId ✅ PASS
6 expect_column_values_to_not_be_null startTimeUtc ✅ PASS
7 expect_column_values_to_not_be_null regionId ✅ PASS
8 expect_column_values_to_not_be_null tournamentId ✅ PASS
9 expect_column_values_to_not_be_null seasonId ✅ PASS
10 expect_column_values_to_not_be_null homeScore ✅ PASS
11 expect_column_values_to_be_between homeScore ✅ PASS
12 expect_column_values_to_not_be_null awayScore ✅ PASS
13 expect_column_values_to_be_between awayScore ✅ PASS
14 expect_column_values_to_not_be_null sumScore ✅ PASS
15 expect_column_values_to_not_be_null diffScore ✅ PASS
16 expect_column_values_to_not_be_null outcome_1x2 ✅ PASS
17 expect_column_values_to_be_in_set outcome_1x2 ✅ PASS

validate_future — upcoming matches

Checks: exactly the 9 identity columns and nothing else (exact_match=True — anti-leakage gate verifying score and target columns are absent); row count ≥ 1; all columns non-null; id unique.

Show code
_GE_FUTURE_PATH = project_root / "data/evaluation/ge_future.json"

if _GE_FUTURE_PATH.exists():
    _fut_report = json.loads(_GE_FUTURE_PATH.read_text())
    _fut_results = _fut_report.get("results", [])
    _fut_success = _fut_report.get("success", None)

    _rows = []
    for _r in _fut_results:
        _etype = _r["expectation_config"]["type"]
        _kwargs = _r["expectation_config"].get("kwargs", {})
        _col    = _kwargs.get("column", "—")
        _ok     = _r["success"]
        _result = _r.get("result", {})
        _detail = ""
        if not _ok:
            _detail = str(_result)
        _rows.append({"Expectation": _etype, "Column": _col,
                      "Status": "✅ PASS" if _ok else "❌ FAIL", "Detail": _detail})

    _df_ge = pd.DataFrame(_rows)
    _n_pass = (_df_ge["Status"] == "✅ PASS").sum()
    _n_fail = (_df_ge["Status"] == "❌ FAIL").sum()

    _suite_status = "✅ PASS" if _fut_success else "❌ FAIL"
    display(Markdown(f"**Suite result: {_suite_status}** — {_n_pass} passed, {_n_fail} failed"))
    display(
        _df_ge.style
        .apply(lambda x: ["background-color: #fdd" if v == "❌ FAIL" else "" for v in x], subset=["Status"])
        .set_caption("Great Expectations — future_match_suite")
    )
else:
    display(Markdown("::: {.callout-warning}\n`data/evaluation/ge_future.json` not found. Run `dvc repro validate_future` first.\n:::"))
Table 11: Great Expectations — future_match_suite

Suite result: ✅ PASS — 12 passed, 0 failed

  Expectation Column Status Detail
0 expect_table_columns_to_match_set — ✅ PASS
1 expect_table_row_count_to_be_between — ✅ PASS
2 expect_column_values_to_not_be_null id ✅ PASS
3 expect_column_values_to_be_unique id ✅ PASS
4 expect_column_values_to_not_be_null startTimeUtc ✅ PASS
5 expect_column_values_to_not_be_null sex ✅ PASS
6 expect_column_values_to_not_be_null regionId ✅ PASS
7 expect_column_values_to_not_be_null tournamentId ✅ PASS
8 expect_column_values_to_not_be_null seasonId ✅ PASS
9 expect_column_values_to_not_be_null stageId ✅ PASS
10 expect_column_values_to_not_be_null homeTeamId ✅ PASS
11 expect_column_values_to_not_be_null awayTeamId ✅ PASS

5. EDA and Preprocessing Summary

The table below consolidates all phases covered in this report — from raw data loading to quality-gated artifact output — so the scope of each step and its verification gate are visible in one place.

Phase Scope Artifact / Quality gate
Data loading match_raw.parquet ingested from MinIO df_match_raw
EDA Status distribution, regions/tournaments, seasonality, gender, score distributions, MC coverage Sections 1–2 of this report
Missing values Null share visualised per column; mandatory key columns enforced GE ExpectColumnValuesToNotBeNull (validate_raw)
Duplicates id uniqueness enforced at ingestion and after preprocessing GE ExpectColumnValuesToBeUnique (validate_raw, validate_finished, validate_future)
Column selection UI/display and live-match columns dropped (≈40); see Implementation above finished.parquet, future.parquet
Type downcasting IDs → int16 / int32; scores → int8 Memory-efficient parquet schema
Temporal sort & split Sort ascending by UTC kickoff; partition status=6 (finished) vs status=1 after last finished (future) finished.parquet, future.parquet
Outlier clipping Scores clipped at score_outlier_pct quantile (~0.01% of matches; forfeits and data errors only); threshold visualised on score charts above GE ExpectColumnValuesToBeBetween (validate_finished)
Target derivation outcome_1x2 (3-class classification); homeScore, awayScore, sumScore, diffScore (regression) finished.parquet
Anti-leakage gate Score and target columns verified absent from future split GE ExpectTableColumnsToMatchSet(exact_match=True) (validate_future)
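The type-downcasting step in the table above can be sketched as follows. The frame and dtype choices are illustrative, assuming the value ranges stated earlier (scores clipped to [0, 15] fit int8; IDs fit int16/int32):

```python
# Hedged sketch of the dtype downcasting step from the summary table.
# The raw frame is illustrative; dtypes assume the documented value ranges.
import pandas as pd

raw = pd.DataFrame({
    "regionId": [1, 2, 3],
    "homeTeamId": [100_000, 200_000, 300_000],
    "homeScore": [2, 0, 5],
})

downcast_map = {"regionId": "int16", "homeTeamId": "int32", "homeScore": "int8"}
compact = raw.astype(downcast_map)

print(compact.dtypes.to_dict())
print(f"{raw.memory_usage(deep=True).sum()} -> {compact.memory_usage(deep=True).sum()} bytes")
```

The explicit `downcast_map` is safer than `pd.to_numeric(..., downcast=...)` here because the target widths are fixed by the documented schema rather than inferred from whatever values a given batch happens to contain.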