Training Pipeline (DVC)
Pipeline orchestration
All offline ML steps are orchestrated via DVC pipelines.
This ensures:
- explicit stage dependencies,
- reproducible execution,
- automatic re-runs on change.
Pipeline stages
All stages are defined in dvc.yaml. Execution order is determined by DVC's DAG.
| Stage | Command | Outputs |
|---|---|---|
| load_data_from_sources | cli-load-data-from-source | data/raw/match.parquet, match_raw.parquet |
| validate_raw | cli-validate-raw | data/evaluation/ge_raw.json |
| export_metadata | cli-export-metadata | metadata export |
| preprocessing | cli-preprocessing | data/interim/finished.parquet, future.parquet |
| validate_interim | cli-validate-interim | data/evaluation/ge_interim.json |
| feature_engineering | cli-feature-engineering | data/features/features.parquet, features_meta.parquet |
| validate_features | cli-validate-features | data/evaluation/ge_features.json |
| split_data | cli-time-based-split | data/processed/dataset.parquet, train/test/fold IDs |
| batch_inference | cli-batch-inference | data/predictions/match_features.parquet |
| classification_models | cli-classification-models | data/models/run_id.json |
| register_model | cli-register-model | MLflow Model Registry entry |
Run all stages: dvc repro
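The way a DAG fixes the execution order, and decides which stages re-run after a change, can be illustrated with a small topological sort. This is only a sketch: the dependency edges below are assumptions inferred from the stage names and outputs in the table (the real ones are declared in dvc.yaml), only a subset of stages is shown, and Python's `graphlib` stands in for DVC's own graph resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: stage -> stages it depends on.
# These edges are guesses based on the outputs listed in the table;
# the authoritative ones are the deps: entries in dvc.yaml.
STAGE_DEPS = {
    "load_data_from_sources": set(),
    "validate_raw": {"load_data_from_sources"},
    "preprocessing": {"load_data_from_sources"},
    "validate_interim": {"preprocessing"},
    "feature_engineering": {"preprocessing"},
    "validate_features": {"feature_engineering"},
    "split_data": {"feature_engineering"},
    "classification_models": {"split_data"},
    "register_model": {"classification_models"},
}


def execution_order(deps):
    """Return one valid run order: every stage comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(deps, changed):
    """Return the changed stage plus every stage that transitively depends on it."""
    stale = {changed}
    # Walking in topological order guarantees a stage's dependencies
    # have already been classified when the stage itself is visited.
    for stage in execution_order(deps):
        if deps.get(stage, set()) & stale:
            stale.add(stage)
    return stale


if __name__ == "__main__":
    print(execution_order(STAGE_DEPS))
    print(sorted(downstream_of(STAGE_DEPS, "split_data")))
```

Under these assumed edges, a change to split_data would invalidate only split_data, classification_models and register_model, while the upstream stages keep their cached outputs.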
Configuration management
Pipeline behavior is controlled via params.yaml with DVC parameter tracking:
# params.yaml
features:
  stats_cols: ["win", "draw", "loss", "goals_for", "goals_against"]
  window_sizes: [3]
temporal:
  test_start: "2024-01-01"
  folds_start_year: 2016
  folds_end_year: 2024
classification:
  target_col: "outcome_1x2"
  experiment_name: "matches_clf"
...
DVC stages declare which params they consume via the params: key in dvc.yaml.
This means dvc params diff shows exactly which parameter changed between runs.
MLflow logs the full params snapshot for each training run.
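To make the idea of a parameter diff concrete, here is a minimal sketch that flattens two nested parameter dictionaries into dotted keys and reports the entries that differ. It is not DVC's implementation, only an illustration of the kind of comparison dvc params diff reports. The helper names `flatten` and `params_diff` are hypothetical, and the changed `window_sizes` value is invented for the example.

```python
def flatten(params, prefix=""):
    """Flatten nested params into dotted keys, e.g. temporal.test_start."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def params_diff(old, new):
    """Map each changed dotted key to (old, new); None marks a missing key."""
    old_flat, new_flat = flatten(old), flatten(new)
    return {
        key: (old_flat.get(key), new_flat.get(key))
        for key in sorted(old_flat.keys() | new_flat.keys())
        if old_flat.get(key) != new_flat.get(key)
    }


# Keys follow the params.yaml excerpt; the second window size is illustrative.
before = {"features": {"window_sizes": [3]}, "temporal": {"test_start": "2024-01-01"}}
after = {"features": {"window_sizes": [3, 5]}, "temporal": {"test_start": "2024-01-01"}}

print(params_diff(before, after))
# {'features.window_sizes': ([3], [3, 5])}
```

Logging the flattened form alongside each training run is one way to make two runs comparable key by key.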
Reproducibility contract: given the same git commit + DVC dataset version + params.yaml,
dvc repro produces identical results.
Determinism guarantees
Given:
- git commit,
- DVC dataset version (dvc pull),
- params.yaml (tracked by DVC),
the pipeline produces identical results.
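One way to make these three inputs checkable is to collapse them into a single fingerprint that can be stored next to each run. The sketch below is an assumption for illustration, not part of the project: `run_fingerprint` is a hypothetical helper, and the commit and dataset version strings are placeholders.

```python
import hashlib
import json


def run_fingerprint(git_commit, dataset_version, params):
    """Hash the three inputs that define a run into one hex digest."""
    payload = json.dumps(
        {
            "git_commit": git_commit,
            "dataset_version": dataset_version,
            "params": params,
        },
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # fixed separators keep the encoding stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder identifiers; real values would come from git and DVC.
    params = {"temporal": {"test_start": "2024-01-01"}, "features": {"window_sizes": [3]}}
    print(run_fingerprint("<git-commit>", "<dataset-version>", params))
```

Two runs with equal fingerprints started from the same declared inputs, so any difference in their outputs points at a source of nondeterminism outside those inputs.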
CI usage
CI runs:
- smoke pipelines on reduced datasets (fracs_for_train: [0.0001] in dev),
- contract validation (tests/contract/),
- basic training sanity checks via pytest.
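A smoke run on a reduced dataset depends on a subsampling step that is itself repeatable. The following sketch shows one way to draw a small, seeded fraction of rows. How `fracs_for_train` is actually applied in the project is not documented here, so the `subsample` helper, its `seed` argument and the at-least-one-row rule are all assumptions.

```python
import random


def subsample(rows, frac, seed=0):
    """Keep roughly `frac` of the rows (at least one) using a fixed seed."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    rows = list(rows)
    if not rows:
        return []
    # Never return an empty sample for non-empty input, so downstream
    # stages in a smoke run always have something to process.
    k = max(1, round(len(rows) * frac))
    return random.Random(seed).sample(rows, k)


def test_subsample_is_deterministic():
    """pytest-style sanity check: same seed, same subset."""
    rows = list(range(100_000))
    assert subsample(rows, 0.0001, seed=42) == subsample(rows, 0.0001, seed=42)
```

A fixed seed keeps CI failures reproducible locally, which matters more for a smoke test than statistical representativeness.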