# Machine Learning Subsystem
## Purpose
This section documents the ML subsystem of SoccerPredictAI: problem framing, validation discipline, feature logic, training lifecycle, experiment tracking, model contracts, and registry promotion.
It is not a second architecture overview and does not repeat data engineering concerns.
- For system-level design, boundaries, and deployment topology, see Architecture.
- For datasets, lineage, contracts, and the reproducibility boundary, see Data.
- For inference modes, API schemas, and serving behaviour, see Serving.
- For implementation readiness of each component, see Status.
## Scope
The ML subsystem covers everything from versioned feature artifacts to a promoted, serving-ready model version:
- what the model predicts and why this is an ML problem,
- how success is defined and measured,
- why temporal validation is mandatory and how it is enforced,
- how features are designed to prevent leakage and maintain offline/online parity,
- how training is orchestrated reproducibly via DVC,
- how experiments are tracked and traced via MLflow,
- what the model interface contract is,
- how models move from training into serving via the registry,
- what the current limitations are.
## Design principles
| Principle | What it means in practice |
|---|---|
| Reproducibility by default | Same git commit + DVC dataset version + params.yaml → identical results |
| Validation over optimisation | A correct temporal split takes priority over any marginal metric gain |
| Leakage is a critical bug | Any feature that encodes future information invalidates the experiment |
| Explicit contracts | Model input/output schemas are versioned alongside model artifacts |
| Offline/online parity | Feature logic is shared; no ad-hoc transformations at inference |
| Registry as handoff | The only path from training to serving is through the MLflow registry |
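The two validation principles above can be shown with a minimal temporal-split sketch. This is illustrative only: the field names and the `temporal_split` helper are assumptions, not the project's actual `split_data` stage.

```python
from datetime import datetime

def temporal_split(matches, cutoff):
    """Split match records strictly by kickoff date: everything before
    `cutoff` is training data, everything at or after it is test data."""
    ordered = sorted(matches, key=lambda m: m["date"])
    train = [m for m in ordered if m["date"] < cutoff]
    test = [m for m in ordered if m["date"] >= cutoff]
    return train, test

# Toy records with a hypothetical schema; real features live in data/features/.
matches = [
    {"date": datetime(2024, 5, 3), "result": "A"},
    {"date": datetime(2023, 8, 12), "result": "H"},
    {"date": datetime(2024, 1, 20), "result": "D"},
]
train, test = temporal_split(matches, cutoff=datetime(2024, 1, 1))
# Every test match is strictly later than every training match, so no
# feature computed on `train` can encode test-period information.
```

A random shuffle here would score higher on paper and be wrong: it lets the model "see" future form, which is exactly the leakage the table above treats as a critical bug.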
## ML lifecycle
```mermaid
flowchart LR
    A[Versioned Features\ndata/features/] --> B[Temporal Split\nsplit_data]
    B --> C[Tune\ntune_xgb]
    B --> D[Baseline Models\nclassification_models]
    C --> E[Final Train\nfinal_train]
    D --> E
    E --> F[Evaluate on\nheld-out test]
    F --> G[MLflow Run\nlogged + traced]
    G --> H[Registry\nregister_model]
    H --> I[Serving\nFastAPI + Celery]
```
Each step is a DVC stage. Execution order is determined by the DAG in dvc.yaml.
MLflow is the observability and traceability layer across all training stages.
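As a rough illustration, a `dvc.yaml` for two of the stages in the flowchart above could look like the following sketch. The stage names mirror the flowchart; the commands, paths, and `params.yaml` section names are assumptions, not the project's actual pipeline definition.

```yaml
stages:
  split_data:
    cmd: python -m src.stages.split_data   # illustrative module path
    deps:
      - data/features/
    params:
      - split                              # hypothetical params.yaml section
    outs:
      - data/splits/
  final_train:
    cmd: python -m src.stages.final_train  # illustrative module path
    deps:
      - data/splits/
    params:
      - train                              # hypothetical params.yaml section
    outs:
      - models/
```

With a graph like this, `dvc repro` rebuilds only stages whose dependencies or params changed, which is why the git commit + DVC dataset version + `params.yaml` triple from the design principles is enough to reproduce a run.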
## Pages in this section
| Page | Covers |
|---|---|
| Problem | Prediction task, target construction, success definition |
| Baseline | Naive baselines, bookmaker benchmark, promotion gate |
| Validation | Temporal splits, leakage prevention, property tests |
| Features | Implemented feature families, parity rules, excluded types |
| Training Pipeline | DVC orchestration, stages, determinism |
| Tuning | Optuna search, time-aware CV, best-params flow |
| MLflow | Experiment tracking, run structure, lineage |
| Model Contract | Input/output schema, breaking changes, versioning |
| Model Registry | Lifecycle stages, promotion gate, rollback |
| Limitations | Current limitations, justified future improvements |