ML Limitations & Justified Improvements

Purpose

Document current ML limitations honestly, grouped by category, and identify future improvements that are concrete and justified — not speculative.

Implementation readiness for all items is tracked in Status and the Architecture Roadmap.


Data limitations

Single data source

Training data comes from WhoScored match statistics only. No player-level data (injuries, transfers, form), no referee assignment, no weather or pitch conditions. These factors are known to influence match outcomes and represent a ceiling on how much the current feature set can explain.

No bookmaker odds as input features

Bookmaker odds are used only as an external evaluation baseline, not as features. This maintains clean separation between prediction and market data. Adding odds as features would risk introducing market-calibrated information that could obscure whether the model has learned anything independently useful.

Historical depth bounded by scraper

Training data is limited to seasons covered by the WhoScored scraper. Competitions or historical periods not scraped are simply absent.


Feature limitations

No player-level or squad composition features

The current feature set is team-level only. Player absences (injury, suspension), transfer activity, and squad rotation have measurable effects on outcomes but are not yet modelled. This is a known gap and a planned improvement.

No live table / league position at prediction time

League table position at the time of prediction requires a careful point-in-time join so that only standings known before kick-off are used, never future standings. A leakage-safe version of this join is not yet implemented; adding it carelessly would be a leakage risk.
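
As a sketch of the intended pattern (not the project's implementation), a point-in-time join can be expressed with pandas.merge_asof. The frame and column names below (standings, snapshot_date, match_date, team) are illustrative assumptions:

```python
import pandas as pd

def join_table_position(matches: pd.DataFrame, standings: pd.DataFrame) -> pd.DataFrame:
    """Attach the most recent league position known *before* kick-off.

    Illustrative schema (not the project's actual tables):
      matches:   match_date, team, ...
      standings: snapshot_date, team, position  (one row per table snapshot)
    """
    # merge_asof requires both frames to be sorted on their join keys.
    matches = matches.sort_values("match_date")
    standings = standings.sort_values("snapshot_date")

    # direction="backward" picks the latest snapshot dated before the match,
    # so standings computed after the match can never leak into the features.
    return pd.merge_asof(
        matches,
        standings,
        left_on="match_date",
        right_on="snapshot_date",
        by="team",
        direction="backward",
        allow_exact_matches=False,  # a snapshot dated on match day may already include the result
    )
```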

H2H features are sparse for infrequently-meeting teams

Head-to-head rolling statistics are reliable only for teams with a sufficient number of historical meetings. Clubs that rarely meet (cross-competition cup ties, newly promoted sides) will have near-zero coverage for H2H features.
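
One way to keep this sparsity from silently degrading feature quality is to emit H2H aggregates only once a pairing has a minimum number of prior meetings, and NaN otherwise. The sketch below assumes one row per match with date, home, away and a value column; the threshold and window are illustrative, not project settings:

```python
import pandas as pd

MIN_MEETINGS = 3  # illustrative threshold, not a project constant

def h2h_rolling_mean(df: pd.DataFrame, value_col: str = "home_goals") -> pd.Series:
    """Rolling head-to-head mean of `value_col` over prior meetings of the same pairing.

    Returns NaN until a pairing has at least MIN_MEETINGS previous matches,
    making the sparse-coverage cases explicit instead of noisy.
    """
    df = df.sort_values("date")
    # Order-independent pairing key so A-vs-B and B-vs-A share the same history.
    pairing = df[["home", "away"]].apply(lambda r: tuple(sorted(r)), axis=1)
    return (
        df.groupby(pairing)[value_col]
        # shift(1) excludes the current match from its own feature (no leakage).
        .transform(lambda s: s.shift(1).rolling(window=10, min_periods=MIN_MEETINGS).mean())
    )
```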


Validation and calibration limitations

No per-competition metric breakdown

Validation metrics are reported on the full held-out test set. A breakdown by competition type (e.g., top-flight vs. lower division, men's vs. women's) is not yet implemented. Different competition types may have systematically different calibration quality.
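
The breakdown itself is straightforward once a competition label is carried through to the held-out frame. A minimal sketch, assuming columns named competition and outcome plus one predicted-probability column per class (all names are illustrative):

```python
import pandas as pd
from sklearn.metrics import log_loss

def metrics_by_competition(test_df: pd.DataFrame, proba_cols: list[str]) -> pd.DataFrame:
    """Per-competition log loss on a held-out frame.

    Assumes `outcome` is integer-coded 0..K-1 and `proba_cols` lists the
    predicted-probability columns in the same class order.
    """
    rows = []
    for comp, grp in test_df.groupby("competition"):
        rows.append({
            "competition": comp,
            "n_matches": len(grp),
            # Pass labels explicitly so small groups missing a class still score correctly.
            "log_loss": log_loss(grp["outcome"], grp[proba_cols],
                                 labels=list(range(len(proba_cols)))),
        })
    return pd.DataFrame(rows).sort_values("log_loss")
```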

Single model per serving path

The serving path exposes one registered model. There is no A/B testing infrastructure or shadow-mode evaluation in place. New model versions can only be compared via offline metrics before promotion.

Post-hoc calibration is optional, not yet default

Calibrated classifier wrapping (CalibratedClassifierCV) is implemented in final_train but is not yet a required step in the pipeline. Whether calibration is applied depends on the calibration_config passed to make_final_train_run.
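
For reference, the scikit-learn wrapping looks roughly like the sketch below. The base estimator, toy data, and calibration settings are stand-ins, not what final_train or make_final_train_run actually configure:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Toy three-class data standing in for the real feature matrix.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative wrapping only; the real estimator choice and calibration_config
# handling live in final_train / make_final_train_run.
calibrated = CalibratedClassifierCV(
    HistGradientBoostingClassifier(),
    method="isotonic",  # or "sigmoid" (Platt scaling)
    cv=5,               # internal folds used to fit the calibrator on out-of-fold predictions
)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test)
```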


Serving limitations

Feature freshness at inference

The serving layer reads pre-computed features from data/predictions/match_features.parquet (populated by the batch_inference DVC stage). If the pipeline has not run since the last matches were played, the features may be stale. There is no feature freshness guarantee or freshness check at request time.
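
A minimal freshness guard could compare the artifact's age against a threshold before serving. The sketch below uses the file's modification time as a cheap proxy; the 24-hour threshold and helper name are assumptions, not current behaviour:

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path

FEATURES_PATH = Path("data/predictions/match_features.parquet")
MAX_AGE = timedelta(hours=24)  # illustrative threshold, not a project setting

def check_feature_freshness(path: Path = FEATURES_PATH, max_age: timedelta = MAX_AGE) -> None:
    """Raise if the pre-computed feature file looks stale.

    A stricter check could read the newest match date from the frame itself
    instead of relying on the file's modification time.
    """
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    age = datetime.now(tz=timezone.utc) - mtime
    if age > max_age:
        raise RuntimeError(
            f"{path} is {age} old (batch_inference has not run recently); "
            "refusing to serve potentially stale features."
        )
```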

Cold-start latency after pod restart

The model is lazy-loaded on first inference request after a worker restart (~1–2 s). Subsequent requests use the cached model. This is a known operational characteristic, not a model defect.
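
The behaviour corresponds to a standard lazy-load-and-cache pattern. The sketch below illustrates that pattern only; the model path, joblib format, and function names are assumptions rather than the serving code itself:

```python
from functools import lru_cache

import joblib  # assumed serialization format; the project may load from a registry instead

MODEL_PATH = "models/final_model.joblib"  # illustrative path

@lru_cache(maxsize=1)
def get_model():
    """Load the model once per worker process; the first call pays the ~1-2 s cost."""
    return joblib.load(MODEL_PATH)

def serve_prediction(features):
    # Every request after the first reuses the cached model object.
    return get_model().predict_proba(features)
```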

No online feature store

Features are file-based Parquet artifacts. For time-sensitive or high-frequency use cases a dedicated online feature store would be needed. This is not required at current scale.


Justified future improvements

These improvements are concrete, causally linked to the limitations above, and roughly ordered by expected impact on model and serving quality.

| Improvement | Limitation addressed | Priority |
| --- | --- | --- |
| Player-level and injury features | Data / feature gap — measurable outcome effect | High |
| Per-competition metric breakdown | Validation gap — calibration may differ by tier | Medium |
| Automated promotion policy | Registry gap — manual gate is operationally fragile | Near-term |
| Live table position feature (safe join) | Feature gap — league position signal | Medium |
| Feature freshness check at inference | Serving reliability gap | Medium |
| A/B testing / shadow serving | Serving gap — no live evaluation of challengers | Medium |
| Calibration as default pipeline step | Calibration gap — ECE currently not enforced | Medium |

Items not listed here (e.g., streaming ingestion, transformer architectures, ensemble stacking) are exploratory or long-term and are tracked in Architecture: Roadmap rather than this page.