# Training Pipeline (DVC)

## Purpose
Document the ML training lifecycle from versioned feature artifacts to a registered model. Data acquisition and raw data processing are described in Data. End-to-end stage-by-stage flow is in Architecture: Data & ML Flow.
The ML pipeline starts where the Data layer ends: at validated, DVC-tracked feature artifacts.
## ML boundary in the pipeline

The DVC DAG includes both data stages and ML stages. The ML subsystem starts at `split_data`:
```text
[feature_engineering] → [validate_features] → [split_data]
                                                   ↓
                           ┌───────────────────────┤
                           ▼                       ▼
               [classification_models]        [tune_xgb]
                           ↓                       ↓
                     [final_train] ←───────────────┘
                           ↓
                   [register_model]
```
`batch_inference` runs independently of the training path; it reuses the feature code to prepare upcoming-match features for serving.
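In `dvc.yaml`, the `split_data` boundary stage is wired roughly as below. This is a hedged sketch: the module path, the feature artifact filename, and the exact params keys are assumptions, not the project's actual definitions.

```yaml
stages:
  split_data:
    cmd: python -m src.stages.split_data   # module path is illustrative
    deps:
      - data/features/features.parquet     # validated feature artifact (name assumed)
    params:
      - temporal                           # only the params section this stage reads
    outs:
      - data/splits/train_ids.parquet
      - data/splits/test_ids.parquet
      - data/splits/folds.parquet
      - data/processed/dataset.parquet
```

Declaring `params: [temporal]` is what lets `dvc params diff` attribute a change to exactly the stages that consume it.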
## Full pipeline DAG

Run `dvc dag` to verify that the current graph matches the diagram above.
## ML stage responsibilities
| Stage | Role | Key outputs |
|---|---|---|
| `split_data` | Time-based train/test split + CV fold generation | `data/splits/train_ids.parquet`, `test_ids.parquet`, `folds.parquet`; `data/processed/dataset.parquet` |
| `classification_models` | Baseline classifier runs across data fractions; selects best model | `data/models/run_id.json` (best run ID + model URI) |
| `ablation_study` | Feature subset experiments to measure contribution of each family | MLflow runs under `matches_clf` experiment, tagged with `pipeline.variant` |
| `tune_xgb` | Optuna hyperparameter search using walk-forward CV | `data/models/xgb_best_params.json` |
| `final_train` | Full training with best model architecture + tuned params; evaluates once on held-out test set | MLflow run; `data/models/final_run_id.json` |
| `register_model` | Creates or updates MLflow registry entry from `final_run_id.json` | MLflow Model Registry entry at *Staging* |
| `batch_inference` | Pre-computes features for upcoming matches (independent of training path) | `data/predictions/match_features.parquet` |
Data stages (`load_data_from_sources`, `preprocessing`, `feature_engineering`, and GE validation gates) are documented in Data: ETL.
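The time-based split and walk-forward folds handled by `split_data` and consumed by `tune_xgb` can be sketched as follows. This is a minimal illustration under assumed names: the real stage operates on parquet files, and the actual column names and fold logic live in the stage code.

```python
from datetime import date


def time_based_split(matches, test_start):
    """Split match records chronologically: everything dated before
    `test_start` is train, the rest is the held-out test set."""
    train = [m for m in matches if m["date"] < test_start]
    test = [m for m in matches if m["date"] >= test_start]
    return train, test


def walk_forward_folds(start_year, end_year):
    """Expanding-window CV folds: each fold trains on all seasons up to
    year Y and validates on year Y + 1 (mirrors folds_start_year /
    folds_end_year in params.yaml)."""
    return [
        {"train_end_year": year, "valid_year": year + 1}
        for year in range(start_year, end_year)
    ]
```

With `folds_start_year: 2016` and `folds_end_year: 2024` this yields eight folds, the last validating on the 2024 season; the test split after `test_start: "2024-01-01"` is never touched until `final_train`.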
## Configuration

All ML stage behaviour is controlled via `params.yaml`. DVC tracks which params each stage consumes, so `dvc params diff` shows exactly what changed between runs.

Relevant sections for the ML path:
```yaml
temporal:
  test_start: "2024-01-01"
  folds_start_year: 2016
  folds_end_year: 2024

classification:
  target_col: "outcome_1x2"
  experiment_name: "matches_clf"
  fracs_for_train: [0.1, 0.5, 1.0]
  side: "diff"
  cat_cols: ["sex"]

tuning:
  n_trials: 20
  frac: 1.0
```
MLflow logs the full params snapshot for each training run, providing a complete audit trail.
## Reproducibility contract
Given the same:

- git commit,
- DVC dataset version (`dvc pull`),
- `params.yaml`,

`dvc repro` produces identical results. This is not aspirational: it is enforced by DVC content-addressing and explicit random seed management in the training code.
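Explicit seed management usually amounts to one helper called at the top of every training entry point. A sketch, assuming NumPy-based training code (the helper name and default seed are hypothetical):

```python
import os
import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    """Pin every RNG a training run touches so dvc repro is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
```

Frameworks with their own RNG state (e.g. XGBoost's `random_state` parameter) would additionally need the seed passed through their estimator constructors.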
## CI

CI runs a smoke pipeline with reduced data (`fracs_for_train: [0.001, 0.002]`) to verify that the full stage graph executes without error. Contract tests in `tests/contract/` validate that stage inputs and outputs satisfy their schema agreements.
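A contract test for a stage output can be as simple as asserting the artifact's schema. A sketch for `split_data`'s `train_ids.parquet`, assuming pandas; the required column set is a hypothetical schema agreement, not the project's actual contract:

```python
import pandas as pd

# Hypothetical schema agreement for data/splits/train_ids.parquet
REQUIRED_SPLIT_COLUMNS = {"match_id"}


def check_split_schema(df: pd.DataFrame) -> None:
    """Fail fast if split_data's output drops an agreed-upon column."""
    missing = REQUIRED_SPLIT_COLUMNS - set(df.columns)
    assert not missing, f"train_ids.parquet missing columns: {missing}"
```

Running such checks in CI on the smoke-pipeline artifacts catches schema drift before a full training run ever starts.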
## Running the ML pipeline

```bash
# Full pipeline (data + ML)
dvc repro

# ML stages only (assumes feature artifacts are current)
dvc repro classification_models tune_xgb final_train register_model

# Check what changed since last run
dvc status
dvc params diff
```