Machine Learning Subsystem

Purpose

This section documents the ML subsystem of SoccerPredictAI: problem framing, validation discipline, feature logic, training lifecycle, experiment tracking, model contracts, and registry promotion.

It is not a second architecture overview and does not repeat data engineering concerns.

  • For system-level design, boundaries, and deployment topology, see Architecture.
  • For datasets, lineage, contracts, and reproducibility boundary, see Data.
  • For inference modes, API schemas, and serving behaviour, see Serving.
  • For implementation readiness of each component, see Status.

Scope

The ML subsystem covers everything from versioned feature artifacts to a promoted, serving-ready model version:

  • what the model predicts and why this is an ML problem,
  • how success is defined and measured,
  • why temporal validation is mandatory and how it is enforced,
  • how features are designed to prevent leakage and maintain offline/online parity,
  • how training is orchestrated reproducibly via DVC,
  • how experiments are tracked and traced via MLflow,
  • what the model interface contract is,
  • how models move from training into serving via the registry,
  • what the current limitations are.
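To make the temporal-validation and leakage points above concrete, here is a minimal sketch of a strict time-based split with a built-in leakage guard. The data and field names are illustrative only, not the project's actual schema; the enforced invariant (the newest training row must predate the oldest test row) is the point.

```python
from datetime import date

# Hypothetical match rows — field names are illustrative, not the real schema.
matches = [
    {"date": date(2023, 8, 1), "home": "A", "away": "B"},
    {"date": date(2023, 9, 1), "home": "C", "away": "D"},
    {"date": date(2024, 1, 15), "home": "A", "away": "C"},
    {"date": date(2024, 2, 1), "home": "B", "away": "D"},
]

def temporal_split(rows, cutoff):
    """Split strictly by time: rows before the cutoff train, rows on or
    after it are held out. No shuffling, ever."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    # Leakage guard: the newest training row must predate the oldest test row.
    assert max(r["date"] for r in train) < min(r["date"] for r in test)
    return train, test

train, test = temporal_split(matches, date(2024, 1, 1))
```

A random split would mix 2024 matches into training and silently leak future form into the features; the assertion turns that class of bug into a hard failure.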

Design principles

| Principle | What it means in practice |
| --- | --- |
| Reproducibility by default | Same git commit + DVC dataset version + `params.yaml` → identical results |
| Validation over optimisation | A correct temporal split takes priority over any marginal metric gain |
| Leakage is a critical bug | Any feature that encodes future information invalidates the experiment |
| Explicit contracts | Model input/output schemas are versioned alongside model artifacts |
| Offline/online parity | Feature logic is shared; no ad-hoc transformations at inference |
| Registry as handoff | The only path from training to serving is through the MLflow registry |
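The offline/online parity principle reduces to one rule: a single feature function is imported by both the batch pipeline and the serving path. A hedged sketch, assuming a hypothetical rolling-goal-difference feature (the name and window are illustrative, not one of the project's actual feature families):

```python
# One feature function, shared by the training pipeline and the serving
# path. `rolling_goal_diff` is a hypothetical example, not a real feature
# from this codebase.

def rolling_goal_diff(goals_for, goals_against, window=5):
    """Mean goal difference over the last `window` completed matches.
    Uses only past matches, so it is safe at both train and predict time."""
    recent_for = goals_for[-window:]
    recent_against = goals_against[-window:]
    n = len(recent_for)
    if n == 0:
        return 0.0
    return sum(f - a for f, a in zip(recent_for, recent_against)) / n

# Offline (batch feature build) and online (request-time) calls share the code,
# so the two paths cannot drift apart:
offline = rolling_goal_diff([2, 1, 0, 3, 1, 2], [0, 1, 1, 2, 0, 2])
online = rolling_goal_diff([2, 1, 0, 3, 1, 2], [0, 1, 1, 2, 0, 2])
assert offline == online
```

Ad-hoc transformations at inference are exactly the duplication this rule forbids.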

ML lifecycle

```mermaid
flowchart LR
    A[Versioned Features\ndata/features/] --> B[Temporal Split\nsplit_data]
    B --> C[Tune\ntune_xgb]
    B --> D[Baseline Models\nclassification_models]
    C --> E[Final Train\nfinal_train]
    D --> E
    E --> F[Evaluate on\nheld-out test]
    F --> G[MLflow Run\nlogged + traced]
    G --> H[Registry\nregister_model]
    H --> I[Serving\nFastAPI + Celery]
```

Each step is a DVC stage. Execution order is determined by the DAG in dvc.yaml. MLflow is the observability and traceability layer across all training stages.
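The "execution order is determined by the DAG" point can be sketched with a topological sort over the stage dependencies shown in the flowchart. Stage names below follow the diagram; they are not guaranteed to match the exact stage ids in `dvc.yaml`:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each key is a stage, each value the set of stages it depends on,
# mirroring the flowchart above (names are illustrative).
deps = {
    "split_data": {"features"},
    "tune_xgb": {"split_data"},
    "classification_models": {"split_data"},
    "final_train": {"tune_xgb", "classification_models"},
    "evaluate": {"final_train"},
    "register_model": {"evaluate"},
}

# static_order() yields stages so every dependency runs before its dependents —
# the same guarantee `dvc repro` gives for the real dvc.yaml DAG.
order = list(TopologicalSorter(deps).static_order())
assert order.index("split_data") < order.index("tune_xgb")
assert order[-1] == "register_model"
```

This is why no stage needs to declare "run me third": adding or removing a dependency in `dvc.yaml` is enough to reorder execution correctly.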


Pages in this section

| Page | Covers |
| --- | --- |
| Problem | Prediction task, target construction, success definition |
| Baseline | Naive baselines, bookmaker benchmark, promotion gate |
| Validation | Temporal splits, leakage prevention, property tests |
| Features | Implemented feature families, parity rules, excluded types |
| Training Pipeline | DVC orchestration, stages, determinism |
| Tuning | Optuna search, time-aware CV, best-params flow |
| MLflow | Experiment tracking, run structure, lineage |
| Model Contract | Input/output schema, breaking changes, versioning |
| Model Registry | Lifecycle stages, promotion gate, rollback |
| Limitations | Current limitations, justified future improvements |