
Training Pipeline (DVC)

Pipeline orchestration

All offline ML steps are orchestrated via DVC pipelines.

This ensures:

- explicit stage dependencies,
- reproducible execution,
- automatic re-runs on change.


Pipeline stages

All stages are defined in dvc.yaml. Execution order is determined by DVC's DAG.

| Stage | Command | Outputs |
|-------|---------|---------|
| load_data_from_sources | cli-load-data-from-source | data/raw/match.parquet, match_raw.parquet |
| validate_raw | cli-validate-raw | data/evaluation/ge_raw.json |
| export_metadata | cli-export-metadata | metadata export |
| preprocessing | cli-preprocessing | data/interim/finished.parquet, future.parquet |
| validate_interim | cli-validate-interim | data/evaluation/ge_interim.json |
| feature_engineering | cli-feature-engineering | data/features/features.parquet, features_meta.parquet |
| validate_features | cli-validate-features | data/evaluation/ge_features.json |
| split_data | cli-time-based-split | data/processed/dataset.parquet, train/test/fold IDs |
| batch_inference | cli-batch-inference | data/predictions/match_features.parquet |
| classification_models | cli-classification-models | data/models/run_id.json |
| register_model | cli-register-model | MLflow Model Registry entry |

Run all stages: dvc repro
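As an illustration, a single stage in dvc.yaml could be declared as follows. The stage name, command, and outputs come from the table above; the exact `deps` and `params` entries are assumptions, not the repository's actual definitions:

```yaml
stages:
  preprocessing:
    cmd: cli-preprocessing
    deps:
      - data/raw/match.parquet      # assumed input for this stage
    params:
      - features                    # assumed parameter group
    outs:
      - data/interim/finished.parquet
      - data/interim/future.parquet
```

DVC derives the DAG from these `deps`/`outs` declarations, so `dvc repro` re-executes a stage only when one of its inputs, parameters, or the command itself has changed.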


Configuration management

Pipeline behavior is controlled via params.yaml with DVC parameter tracking:

```yaml
# params.yaml
features:
  stats_cols: ["win", "draw", "loss", "goals_for", "goals_against"]
  window_sizes: [3]

temporal:
  test_start: "2024-01-01"
  folds_start_year: 2016
  folds_end_year: 2024

classification:
  target_col: "outcome_1x2"
  experiment_name: "matches_clf"
  ...
```

DVC stages declare which params they consume via the `params:` key in dvc.yaml, so `dvc params diff` shows exactly which parameters changed between runs. MLflow logs the full parameter snapshot for each training run.
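For logging, a nested params.yaml snapshot is typically flattened into the flat key/value form `mlflow.log_params` expects. A minimal sketch, assuming a helper named `flatten_params` (illustrative, not the repo's actual code):

```python
def flatten_params(params: dict, prefix: str = "") -> dict:
    """Flatten nested params.yaml content into dot-separated keys,
    the flat str -> value shape that mlflow.log_params accepts."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # recurse into nested sections like `features:` or `temporal:`
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat

params = {
    "features": {"window_sizes": [3]},
    "temporal": {"test_start": "2024-01-01"},
}
print(flatten_params(params))
# {'features.window_sizes': [3], 'temporal.test_start': '2024-01-01'}
```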

Reproducibility contract: given the same git commit + DVC dataset version + params.yaml, dvc repro produces identical results.


Determinism guarantees

Given:

- the git commit,
- the DVC dataset version (dvc pull),
- params.yaml (tracked by DVC),

the pipeline produces identical results.
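One way to spot-check this guarantee is to hash pipeline outputs across two runs; identical inputs should yield identical digests. A sketch (the helper name and compared paths are illustrative):

```python
import hashlib
from pathlib import Path

def artifact_digest(path: str) -> str:
    """SHA-256 digest of an artifact file. Under the determinism
    guarantee, repeated `dvc repro` runs from the same commit,
    dataset version, and params.yaml produce identical digests."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# e.g. compare a stage output across two independent runs:
# assert artifact_digest("run_a/features.parquet") == artifact_digest("run_b/features.parquet")
```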


CI usage

CI runs:

- smoke pipelines on reduced datasets (fracs_for_train: [0.0001] in dev),
- contract validation (tests/contract/),
- basic training sanity checks via pytest.
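To give a flavor of such a sanity check, here is a minimal sketch of a contract-style assertion on a prediction row. The column names (`p_home`, `p_draw`, `p_away`) are hypothetical, not the repository's actual schema:

```python
def check_prediction_row(row: dict) -> None:
    """Illustrative contract check: 1x2 outcome probabilities must be
    valid probabilities and sum to ~1 for every predicted match."""
    probs = [row["p_home"], row["p_draw"], row["p_away"]]
    assert all(0.0 <= p <= 1.0 for p in probs), "probability out of range"
    assert abs(sum(probs) - 1.0) < 1e-6, "probabilities must sum to 1"

# passes silently for a well-formed row
check_prediction_row({"p_home": 0.5, "p_draw": 0.3, "p_away": 0.2})
```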