# Machine Learning Subsystem
## Purpose
This section documents the ML subsystem of SoccerPredictAI: problem framing, validation discipline, feature logic, training lifecycle, experiment tracking, model contracts, and registry promotion.
It is not a second architecture overview and does not repeat data engineering concerns.
- For system-level design, boundaries, and deployment topology, see Architecture.
- For datasets, lineage, contracts, and the reproducibility boundary, see Data.
- For inference modes, API schemas, and serving behaviour, see Serving.
- For implementation readiness of each component, see Status.
## Scope
The ML subsystem covers everything from versioned feature artifacts to a promoted, serving-ready model version:
- what the model predicts and why this is an ML problem,
- how success is defined and measured,
- why temporal validation is mandatory and how it is enforced,
- how features are designed to prevent leakage and maintain offline/online parity,
- how training is orchestrated reproducibly via DVC,
- how experiments are tracked and traced via MLflow,
- what the model interface contract is,
- how models move from training into serving via the registry,
- what the current limitations are.
## Design principles
| Principle | What it means in practice |
|---|---|
| Reproducibility by default | Same git commit + DVC dataset version + params.yaml → identical results |
| Validation over optimisation | A correct temporal split takes priority over any marginal metric gain |
| Leakage is a critical bug | Any feature that encodes future information invalidates the experiment |
| Explicit contracts | Model input/output schemas are versioned alongside model artifacts |
| Offline/online parity | Feature logic is shared; no ad-hoc transformations at inference |
| Registry as handoff | The only path from training to serving is through the MLflow registry |
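The two validation principles above can be shown with a minimal temporal-split sketch. This is illustrative only: the field names and the `temporal_split` helper are assumptions, not the project's actual `split_data` stage.

```python
from datetime import datetime

def temporal_split(matches, cutoff):
    """Split match records strictly by kickoff date: everything before
    `cutoff` is training data, everything at or after it is test data."""
    ordered = sorted(matches, key=lambda m: m["date"])
    train = [m for m in ordered if m["date"] < cutoff]
    test = [m for m in ordered if m["date"] >= cutoff]
    return train, test

# Toy records with a hypothetical schema; real features live in data/features/.
matches = [
    {"date": datetime(2024, 5, 3), "result": "A"},
    {"date": datetime(2023, 8, 12), "result": "H"},
    {"date": datetime(2024, 1, 20), "result": "D"},
]
train, test = temporal_split(matches, cutoff=datetime(2024, 1, 1))
# Every test match is strictly later than every training match, so no
# feature computed on `train` can encode test-period information.
```

A random shuffle here would score higher on paper and be wrong: it lets the model "see" future form, which is exactly the leakage the table above treats as a critical bug.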
## ML lifecycle
```mermaid
flowchart LR
    A[Versioned Features\ndata/features/] --> B[Temporal Split\nsplit_data]
    B --> C[Tune\ntune_xgb]
    B --> D[Baseline Models\nclassification_models]
    C --> E[Final Train\nfinal_train]
    D --> E
    E --> F[Evaluate on\nheld-out test]
    F --> G[MLflow Run\nlogged + traced]
    G --> H[Registry\nregister_model]
    H --> I[Serving\nFastAPI + Celery]
```
Each step is a DVC stage. Execution order is determined by the DAG in dvc.yaml.
MLflow is the observability and traceability layer across all training stages.
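As a rough illustration, a `dvc.yaml` for two of the stages in the flowchart above could look like the following sketch. The stage names mirror the flowchart; the commands, paths, and `params.yaml` section names are assumptions, not the project's actual pipeline definition.

```yaml
stages:
  split_data:
    cmd: python -m src.stages.split_data   # illustrative module path
    deps:
      - data/features/
    params:
      - split                              # hypothetical params.yaml section
    outs:
      - data/splits/
  final_train:
    cmd: python -m src.stages.final_train  # illustrative module path
    deps:
      - data/splits/
    params:
      - train                              # hypothetical params.yaml section
    outs:
      - models/
```

With a graph like this, `dvc repro` rebuilds only stages whose dependencies or params changed, which is why the git commit + DVC dataset version + `params.yaml` triple from the design principles is enough to reproduce a run.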
## Pages in this section
| Page | Covers |
|---|---|
| Problem | Prediction task, target construction, success definition |
| Baseline | Naive baselines, bookmaker benchmark, promotion gate |
| Validation | Temporal splits, leakage prevention, property tests |
| Features | Implemented feature families, parity rules, excluded types |
| Training Pipeline | DVC orchestration, stages, determinism |
| Tuning | Optuna search, time-aware CV, best-params flow |
| MLflow | Experiment tracking, run structure, lineage |
| Model Contract | Input/output schema, breaking changes, versioning |
| Model Registry | Lifecycle stages, promotion gate, rollback |
| Limitations | Current limitations, justified future improvements |