Training Pipeline (DVC)

Purpose

Document the ML training lifecycle from versioned feature artifacts to a registered model. Data acquisition and raw data processing are described in Data. End-to-end stage-by-stage flow is in Architecture: Data & ML Flow.

The ML pipeline starts where the Data layer ends: at validated, DVC-tracked feature artifacts.


ML boundary in the pipeline

The DVC DAG includes both data stages and ML stages. The ML subsystem starts at split_data:

[feature_engineering] → [validate_features] → [split_data]
                              ┌─────────────────────┤
                              ▼                     ▼
                    [classification_models]     [tune_xgb]
                              ↓                     ↓
                          [final_train] ←───────────┘
                              ↓
                       [register_model]

batch_inference runs independently of the training path: it depends on preprocessing and feature_engineering rather than on any training stage, and reuses the feature code to prepare upcoming-match features for serving.


Full pipeline DAG

flowchart TD
    A([load_data_from_sources]) --> B[validate_raw]
    A --> C[export_metadata]
    A --> D[preprocessing]
    D --> E[validate_interim]
    D --> F[feature_engineering]
    D --> G[split_data]
    F --> H[validate_features]
    F --> G
    G --> I[classification_models]
    G --> J[ablation_study]
    G --> K[tune_xgb]
    I --> L[final_train]
    K --> L
    F --> M[batch_inference]
    D --> M
    L --> N[register_model]
    classDef validate fill:#e8f4e8,stroke:#4a7c4a
    classDef ml fill:#e8e8f4,stroke:#4a4a7c
    classDef infra fill:#f4f0e8,stroke:#7c6a4a
    classDef independent fill:#f4e8e8,stroke:#7c4a4a
    class B,E,H validate
    class I,J,K,L,N ml
    class A,C,D infra
    class M independent

Run dvc dag to verify the current graph matches this diagram.


ML stage responsibilities

| Stage | Role | Key outputs |
|---|---|---|
| split_data | Time-based train/test split + CV fold generation | data/splits/train_ids.parquet, test_ids.parquet, folds.parquet; data/processed/dataset.parquet |
| classification_models | Baseline classifier runs across data fractions; selects best model | data/models/run_id.json (best run ID + model URI) |
| ablation_study | Feature subset experiments to measure contribution of each family | MLflow runs under matches_clf experiment, tagged with pipeline.variant |
| tune_xgb | Optuna hyperparameter search using walk-forward CV | data/models/xgb_best_params.json |
| final_train | Full training with best model architecture + tuned params; evaluates once on held-out test | MLflow run; data/models/final_run_id.json |
| register_model | Creates or updates MLflow registry entry from final_run_id.json | MLflow Model Registry entry at Staging |
| batch_inference | Pre-computes features for upcoming matches (independent of training path) | data/predictions/match_features.parquet |

Data stages (load_data_from_sources, preprocessing, feature_engineering, and GE validation gates) are documented in Data: ETL.
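
The time-based split and walk-forward folds used by split_data and tune_xgb can be pictured with a small sketch. This is an illustration only, not the project's code: the function names, the expanding-window scheme, and the one-validation-year-per-fold layout are all assumptions.

```python
from datetime import date


def time_split(match_dates, test_start):
    """Split record ids by date: anything on or after test_start goes to test.

    match_dates is a hypothetical mapping of record id -> match date.
    """
    train_ids = [mid for mid, d in match_dates.items() if d < test_start]
    test_ids = [mid for mid, d in match_dates.items() if d >= test_start]
    return train_ids, test_ids


def walk_forward_folds(first_year, last_year):
    """Build expanding-window folds (assumed scheme).

    Each fold validates on one calendar year and trains on every earlier
    year back to first_year, so no fold ever trains on future data.
    """
    folds = []
    for val_year in range(first_year + 1, last_year + 1):
        train_years = list(range(first_year, val_year))
        folds.append((train_years, val_year))
    return folds
```

For example, walk_forward_folds(2016, 2018) yields two folds: train on 2016 and validate on 2017, then train on 2016 and 2017 and validate on 2018. Whether the real stage treats its year bounds as inclusive is not stated here, so check the stage code before relying on this layout.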


Configuration

All ML stage behaviour is controlled via params.yaml. DVC tracks which params each stage consumes, so dvc params diff shows exactly what changed between runs.

Relevant sections for the ML path:

temporal:
  test_start: "2024-01-01"
  folds_start_year: 2016
  folds_end_year: 2024

classification:
  target_col: "outcome_1x2"
  experiment_name: "matches_clf"
  fracs_for_train: [0.1, 0.5, 1.0]
  side: "diff"
  cat_cols: ["sex"]

tuning:
  n_trials: 20
  frac: 1.0

MLflow logs the full params snapshot for each training run, providing a complete audit trail.
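
A stage can fail fast on bad configuration by validating its params section before doing any work. The following is a minimal sketch under stated assumptions: it takes an already-parsed dictionary (for instance the temporal section of params.yaml loaded by a YAML parser), and the class and function names are hypothetical, not part of the project.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TemporalParams:
    """Typed view of the `temporal` section (hypothetical helper)."""

    test_start: date
    folds_start_year: int
    folds_end_year: int


def parse_temporal(section):
    """Convert the raw `temporal` mapping into typed, sanity-checked values."""
    params = TemporalParams(
        test_start=date.fromisoformat(section["test_start"]),
        folds_start_year=int(section["folds_start_year"]),
        folds_end_year=int(section["folds_end_year"]),
    )
    if params.folds_start_year > params.folds_end_year:
        raise ValueError("folds_start_year must not exceed folds_end_year")
    return params


def validate_fractions(fracs):
    """Check that every training fraction lies in the interval (0, 1]."""
    for frac in fracs:
        if not 0 < frac <= 1:
            raise ValueError(f"training fraction out of range: {frac}")
    return list(fracs)
```

Called with the values shown above, parse_temporal returns a TemporalParams whose test_start is a real date object, and validate_fractions accepts both the default fractions and much smaller smoke-test fractions.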


Reproducibility contract

Given the same:

- git commit,
- DVC dataset version (dvc pull),
- params.yaml,

dvc repro produces identical results. This is not aspirational — it is enforced by DVC content-addressing and explicit random seed management in training code.
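
The two mechanisms named above can be sketched in a few lines. This is an illustration of the ideas only: content_hash shows why identical bytes always map to the same identifier (it is not DVC's actual implementation), and seed_everything is a hypothetical helper, not the project's seeding code.

```python
import hashlib
import os
import random


def content_hash(data):
    """Return a hex digest of the bytes; same content gives the same digest."""
    return hashlib.md5(data).hexdigest()


def seed_everything(seed):
    """Seed the random number generators a training run typically touches."""
    random.seed(seed)
    # Affects only subprocesses started later; the running interpreter's
    # hash seed was fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional in this sketch
    np.random.seed(seed)
```

Calling seed_everything with the same seed before two runs makes their random draws match. Model libraries usually need their own seed parameter passed explicitly as well, which this sketch does not cover.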


CI

CI runs a smoke pipeline with reduced data (fracs_for_train: [0.001, 0.002]) to verify the full stage graph executes without error. Contract tests in tests/contract/ validate that stage inputs and outputs satisfy their schema agreements.
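
A contract test of this kind might look like the sketch below. It is not taken from tests/contract/: the key names run_id and model_uri are assumptions about the JSON layout of data/models/run_id.json, and contract_violations is a hypothetical helper.

```python
import json

# Hypothetical contract for the classification_models output file.
# The key names are assumptions, not the project's actual schema.
RUN_ID_CONTRACT = {"run_id": str, "model_uri": str}


def contract_violations(payload, contract):
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = []
    for key, expected_type in contract.items():
        if key not in payload:
            problems.append(f"missing key: {key}")
        elif not isinstance(payload[key], expected_type):
            actual = type(payload[key]).__name__
            problems.append(f"{key}: expected {expected_type.__name__}, got {actual}")
    return problems


def check_run_id_document(text):
    """Parse a JSON document and check it against the assumed contract."""
    return contract_violations(json.loads(text), RUN_ID_CONTRACT)
```

Returning a list of problems rather than raising on the first one lets a test report every broken field in a single failure message.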


Running the ML pipeline

# Full pipeline (data + ML)
dvc repro

# ML stages only (assumes feature artifacts are current)
dvc repro classification_models tune_xgb final_train register_model

# Check what changed since last run
dvc status
dvc params diff