# Training Pipeline (DVC)

## Purpose
Document the ML training lifecycle from versioned feature artifacts to a registered model. Data acquisition and raw data processing are described in Data. End-to-end stage-by-stage flow is in Architecture: Data & ML Flow.
The ML pipeline starts where the Data layer ends: at validated, DVC-tracked feature artifacts.
## ML boundary in the pipeline

The DVC DAG includes both data stages and ML stages. The ML subsystem starts at `split_data`:
```text
[feature_engineering] → [validate_features] → [split_data]
                                                   ↓
                           ┌───────────────────────┤
                           ▼                       ▼
               [classification_models]        [tune_xgb]
                           ↓                       ↓
                     [final_train] ←───────────────┘
                           ↓
                   [register_model]
```
`batch_inference` runs independently of the training path; it reuses the feature code to prepare upcoming-match features for serving.
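In `dvc.yaml`, the `split_data` boundary stage is wired roughly as below. This is a hedged sketch: the module path, the feature artifact filename, and the exact params keys are assumptions, not the project's actual definitions.

```yaml
stages:
  split_data:
    cmd: python -m src.stages.split_data   # module path is illustrative
    deps:
      - data/features/features.parquet     # validated feature artifact (name assumed)
    params:
      - temporal                           # only the params section this stage reads
    outs:
      - data/splits/train_ids.parquet
      - data/splits/test_ids.parquet
      - data/splits/folds.parquet
      - data/processed/dataset.parquet
```

Declaring `params: [temporal]` is what lets `dvc params diff` attribute a change to exactly the stages that consume it.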
## Full pipeline DAG

Run `dvc dag` to verify that the current graph matches the diagram above.
## ML stage responsibilities
| Stage | Role | Key outputs |
|---|---|---|
| `split_data` | Time-based train/test split + CV fold generation | `data/splits/train_ids.parquet`, `test_ids.parquet`, `folds.parquet`; `data/processed/dataset.parquet` |
| `classification_models` | Baseline classifier runs across data fractions; selects best model | `data/models/run_id.json` (best run ID + model URI) |
| `ablation_study` | Feature subset experiments to measure contribution of each family | MLflow runs under `matches_clf` experiment, tagged with `pipeline.variant` |
| `tune_xgb` | Optuna hyperparameter search using walk-forward CV | `data/models/xgb_best_params.json` |
| `final_train` | Full training with best model architecture + tuned params; evaluates once on held-out test set | MLflow run; `data/models/final_run_id.json` |
| `register_model` | Creates or updates MLflow registry entry from `final_run_id.json` | MLflow Model Registry entry at *Staging* |
| `batch_inference` | Pre-computes features for upcoming matches (independent of training path) | `data/predictions/match_features.parquet` |
Data stages (`load_data_from_sources`, `preprocessing`, `feature_engineering`, and GE validation gates) are documented in Data: ETL.
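The time-based split and walk-forward folds handled by `split_data` and consumed by `tune_xgb` can be sketched as follows. This is a minimal illustration under assumed names: the real stage operates on parquet files, and the actual column names and fold logic live in the stage code.

```python
from datetime import date


def time_based_split(matches, test_start):
    """Split match records chronologically: everything dated before
    `test_start` is train, the rest is the held-out test set."""
    train = [m for m in matches if m["date"] < test_start]
    test = [m for m in matches if m["date"] >= test_start]
    return train, test


def walk_forward_folds(start_year, end_year):
    """Expanding-window CV folds: each fold trains on all seasons up to
    year Y and validates on year Y + 1 (mirrors folds_start_year /
    folds_end_year in params.yaml)."""
    return [
        {"train_end_year": year, "valid_year": year + 1}
        for year in range(start_year, end_year)
    ]
```

With `folds_start_year: 2016` and `folds_end_year: 2024` this yields eight folds, the last validating on the 2024 season; the test split after `test_start: "2024-01-01"` is never touched until `final_train`.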
## Configuration

All ML stage behaviour is controlled via `params.yaml`. DVC tracks which params each stage consumes, so `dvc params diff` shows exactly what changed between runs.

Relevant sections for the ML path:
```yaml
temporal:
  test_start: "2024-01-01"
  folds_start_year: 2016
  folds_end_year: 2024

classification:
  target_col: "outcome_1x2"
  experiment_name: "matches_clf"
  fracs_for_train: [0.1, 0.5, 1.0]
  side: "diff"
  cat_cols: ["sex"]

tuning:
  n_trials: 20
  frac: 1.0
```
MLflow logs the full params snapshot for each training run, providing a complete audit trail.
## Reproducibility contract
Given the same:

- git commit,
- DVC dataset version (`dvc pull`),
- `params.yaml`,

`dvc repro` produces identical results. This is not aspirational: it is enforced by DVC content-addressing and explicit random seed management in the training code.
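Explicit seed management usually amounts to one helper called at the top of every training entry point. A sketch, assuming NumPy-based training code (the helper name and default seed are hypothetical):

```python
import os
import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    """Pin every RNG a training run touches so dvc repro is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
```

Frameworks with their own RNG state (e.g. XGBoost's `random_state` parameter) would additionally need the seed passed through their estimator constructors.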
## CI

CI runs a smoke pipeline with reduced data (`fracs_for_train: [0.001, 0.002]`) to verify that the full stage graph executes without error. Contract tests in `tests/contract/` validate that stage inputs and outputs satisfy their schema agreements.
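A contract test for a stage output can be as simple as asserting the artifact's schema. A sketch for `split_data`'s `train_ids.parquet`, assuming pandas; the required column set is a hypothetical schema agreement, not the project's actual contract:

```python
import pandas as pd

# Hypothetical schema agreement for data/splits/train_ids.parquet
REQUIRED_SPLIT_COLUMNS = {"match_id"}


def check_split_schema(df: pd.DataFrame) -> None:
    """Fail fast if split_data's output drops an agreed-upon column."""
    missing = REQUIRED_SPLIT_COLUMNS - set(df.columns)
    assert not missing, f"train_ids.parquet missing columns: {missing}"
```

Running such checks in CI on the smoke-pipeline artifacts catches schema drift before a full training run ever starts.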
## Running the ML pipeline

```bash
# Full pipeline (data + ML)
dvc repro

# ML stages only (assumes feature artifacts are current)
dvc repro classification_models tune_xgb final_train register_model

# Check what changed since last run
dvc status
dvc params diff
```