ML Limitations & Justified Improvements¶
Purpose¶
Document current ML limitations honestly, grouped by category, and identify future improvements that are concrete and justified — not speculative.
Implementation readiness for all items is tracked in Status and the Architecture Roadmap.
Data limitations¶
Single data source
Training data comes from WhoScored match statistics only: no player-level data (injuries, transfers, form), no referee assignments, and no weather or pitch conditions. These factors are known to influence match outcomes, so their absence puts a ceiling on how much the current feature set can explain.
No bookmaker odds as input features
Bookmaker odds are used only as an external evaluation baseline, not as features. This maintains clean separation between prediction and market data. Adding odds as features would risk introducing market-calibrated information that could obscure whether the model has learned anything independently useful.
Historical depth bounded by scraper
Training data is limited to seasons covered by the WhoScored scraper. Competitions or historical periods not scraped are simply absent.
Feature limitations¶
No player-level or squad composition features
The current feature set is team-level only. Player absences (injury, suspension), transfer activity, and squad rotation have measurable effects on outcomes but are not yet modelled. This is a known gap and a planned improvement.
No live table / league position at prediction time
League table position at the time of prediction requires a careful point-in-time join to avoid using future standings. This join is not yet implemented safely. Adding it incorrectly would be a leakage risk.
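A safe point-in-time join of this kind can be sketched with `pandas.merge_asof`, which only looks backwards in time from each prediction timestamp. The frames and column names below (`kickoff`, `as_of`, `position`) are illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical frames: fixtures to predict, and historical standings snapshots.
matches = pd.DataFrame({
    "team": ["A", "A"],
    "kickoff": pd.to_datetime(["2024-01-10", "2024-03-01"]),
})
standings = pd.DataFrame({
    "team": ["A", "A", "A"],
    "as_of": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-04-01"]),
    "position": [5, 3, 1],
})

# merge_asof with direction="backward" picks the latest standings row
# at or before each kickoff, so future standings can never leak in.
joined = pd.merge_asof(
    matches.sort_values("kickoff"),
    standings.sort_values("as_of"),
    left_on="kickoff",
    right_on="as_of",
    by="team",
    direction="backward",
)
```

Here the January fixture picks up position 5 and the March fixture position 3; the April snapshot (future relative to both kickoffs) is never used.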
H2H features are sparse for infrequently-meeting teams
Head-to-head rolling statistics are reliable only for teams with enough historical meetings. Clubs that rarely meet (cross-league cup ties, newly promoted teams) will have near-zero coverage for H2H features.
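One way to quantify this gap is to count historical meetings per unordered team pair and flag pairs below a minimum threshold. The match log and the threshold here are illustrative, not the project's actual data or cut-off:

```python
import pandas as pd

# Hypothetical match log; pair identity is order-independent.
matches = pd.DataFrame({
    "home": ["A", "B", "A", "C"],
    "away": ["B", "A", "C", "D"],
})

# Canonical pair key so A-vs-B and B-vs-A count as the same fixture.
pair = matches.apply(lambda r: "-".join(sorted((r["home"], r["away"]))), axis=1)
meeting_counts = pair.value_counts()

# Pairs under the threshold would get H2H features masked or imputed
# rather than trusted. The threshold value is an assumption.
MIN_MEETINGS = 2
sparse_pairs = meeting_counts[meeting_counts < MIN_MEETINGS].index.tolist()
```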
Validation and calibration limitations¶
No per-competition metric breakdown
Validation metrics are reported on the full held-out test set. A breakdown by competition type (e.g., top-flight vs. lower division, men's vs. women's) is not yet implemented. Different competition types may have systematically different calibration quality.
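A per-competition breakdown could be as simple as grouping the held-out predictions by a competition label before scoring. The column names and the choice of log loss below are assumptions for illustration, not the project's actual metric code:

```python
import pandas as pd
from sklearn.metrics import log_loss

# Hypothetical held-out predictions with a competition label per match.
df = pd.DataFrame({
    "competition": ["top_flight"] * 4 + ["lower_div"] * 4,
    "y_true":      [1, 0, 1, 1, 0, 0, 1, 0],
    "p_home_win":  [0.7, 0.3, 0.6, 0.8, 0.4, 0.2, 0.5, 0.3],
})

# Score each competition slice separately instead of one aggregate number.
breakdown = (
    df.groupby("competition")[["y_true", "p_home_win"]]
      .apply(lambda g: log_loss(g["y_true"], g["p_home_win"], labels=[0, 1]))
)
```

The same pattern extends to Brier score or ECE per slice, which is what would surface systematically different calibration quality between tiers.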
Single model per serving path
The serving path exposes one registered model. There is no A/B testing infrastructure or shadow-mode evaluation in place. New model versions can only be compared via offline metrics before promotion.
Post-hoc calibration is optional, not yet default
Calibrated classifier wrapping (CalibratedClassifierCV) is implemented in final_train but is not yet a required step in the pipeline. Whether calibration is applied depends on the calibration_config passed to make_final_train_run.
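As a rough sketch of what the calibrated wrapping looks like in scikit-learn terms (the base estimator, method, and cv choices here are illustrative, not the actual final_train configuration):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real pipeline trains on match features.
X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Wrap the base estimator so predict_proba is post-hoc calibrated
# via internal cross-validation (isotonic regression here).
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0),
    method="isotonic",
    cv=3,
)
calibrated.fit(X_tr, y_tr)
proba = calibrated.predict_proba(X_te)
```

Making this wrapping a default rather than config-dependent is the "calibration as default pipeline step" item in the table below.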
Serving-related ML limitations¶
Feature freshness at inference
The serving layer reads pre-computed features from data/predictions/match_features.parquet (populated by the batch_inference DVC stage). If the pipeline has not run since the last matches were played, the features may be stale. There is no feature freshness guarantee or freshness check at request time.
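A minimal freshness check at request time could compare the feature file's modification time against a staleness budget. The 24-hour budget below is an assumed value, not a project setting:

```python
import time
from pathlib import Path

FEATURES_PATH = Path("data/predictions/match_features.parquet")
MAX_AGE_SECONDS = 24 * 3600  # staleness budget; value is an assumption

def features_are_fresh(path: Path = FEATURES_PATH,
                       max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if the feature file exists and was written recently.

    File mtime is only a proxy: a stronger check would compare the
    latest match date inside the Parquet against the fixture list.
    """
    if not path.exists():
        return False
    return (time.time() - path.stat().st_mtime) <= max_age
```

The serving layer could call this per request and either refuse to predict or attach a staleness warning to the response.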
Cold-start latency after pod restart
The model is lazy-loaded on first inference request after a worker restart (~1–2 s). Subsequent requests use the cached model. This is a known operational characteristic, not a model defect.
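The lazy-load-then-cache behaviour can be sketched with `functools.lru_cache`; the loader body below is a stand-in, not the service's actual model-loading code:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_model():
    # Stand-in for the real registry load; in the service the first
    # call after a worker restart pays the full ~1-2 s load cost.
    return {"name": "demo-model"}

def predict(features):
    model = get_model()  # cached, so later requests skip the load
    return {"model": model["name"], "features": features}
```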
No online feature store
Features are file-based Parquet artifacts. For time-sensitive or high-frequency use cases a dedicated online feature store would be needed. This is not required at current scale.
Justified future improvements¶
These improvements are concrete, causally linked to the limitations above, and ordered by impact on model quality.
| Improvement | Limitation addressed | Priority |
|---|---|---|
| Player-level and injury features | Data / feature gap — measurable outcome effect | High |
| Per-competition metric breakdown | Validation gap — calibration may differ by tier | Medium |
| Automated promotion policy | Registry gap — manual gate is operationally fragile | High |
| Live table position feature (safe join) | Feature gap — league position signal | Medium |
| Feature freshness check at inference | Serving reliability gap | Medium |
| A/B testing / shadow serving | Serving gap — no live evaluation of challengers | Medium |
| Calibration as default pipeline step | Calibration gap — ECE currently not enforced | Medium |
Items not listed here (e.g., streaming ingestion, transformer architectures, ensemble stacking) are exploratory or long-term and are tracked in Architecture: Roadmap rather than this page.