
Quickstart — Reproducible Golden Path

This page shows how to reproduce the ML training pipeline locally from a clean checkout.

What this proves: dvc repro from any clean checkout gives deterministic results — same model, same metrics, tracked in MLflow.

Not covered here: live API demo → Demo Guide; full local environment setup → Local Dev Runbook.


Prerequisites

  • Python 3.13
  • pdm (dependency management)
  • git
  • dvc
  • Access to DVC remote storage (read-only for demo)

1. Clone the repository

git clone <repository-url>
cd soccer

2. Install dependencies

Dependencies are managed via PDM with environment-specific groups.

# Install all dependencies
pdm install

# OR: Create conda environment
make env-install

This installs:

  • Data access and storage utilities
  • ML libraries (scikit-learn, XGBoost, MLflow)
  • Pipeline orchestration tools (DVC)
  • Development utilities (ruff, pytest)
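If you do not need the development tools, PDM can install a subset of these groups. A minimal sketch (`--prod` is a standard PDM flag; which packages it skips depends on this repo's pyproject.toml group layout):

```shell
# Install only runtime (non-dev) dependencies
pdm install --prod

# Sanity-check that the environment resolves
pdm run python -V
```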

3. Pull versioned datasets

All datasets are versioned using DVC.

dvc pull

This restores:

  • Raw parquet files (data/raw/)
  • Processed datasets (data/interim/)
  • Feature tables (data/features/)
  • Train/test splits (data/splits/)

What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
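To see what "content-addressed" means in practice, a file's md5 determines exactly where it lives in the cache. A sketch (the hash below is a made-up example; the `files/md5` layout is the DVC 3.x cache scheme and varies by DVC version):

```shell
# The first two hex chars of the md5 become a directory,
# the remaining 30 become the file name inside it.
hash="3863d0e317dee0a55c4e59d2ec0eef33"   # example value, not a real dataset hash
echo ".dvc/cache/files/md5/${hash:0:2}/${hash:2}"
# → .dvc/cache/files/md5/38/63d0e317dee0a55c4e59d2ec0eef33
```

Because the path is derived purely from content, two checkouts that reference the same hash are guaranteed to pull byte-identical data.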


4. Run the ML pipeline

The full ML workflow is orchestrated as a DVC pipeline.

# Run full pipeline
dvc repro

# OR: Force re-execution of all stages
dvc repro --force

Pipeline stages (see dvc.yaml):

  1. load_data_from_sources - Fetch raw match data
  2. export_metadata - Extract metadata
  3. preprocessing - Clean and filter data
  4. feature_engineering - Compute time-windowed statistics
  5. split_data - Create train/test splits + CV folds
  6. classification_baseline - Train baseline models
  7. classification_models - Train candidate models

Execution characteristics:

  • Deterministic: Same input → same output
  • Cached: Only re-runs changed stages
  • Traceable: All outputs tracked in dvc.lock

Pipeline execution depends on:

  • Data versions (DVC tracked)
  • Code versions (Git tracked)
  • Configuration (params.yaml)
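These three inputs come together in dvc.yaml, where each stage declares its command, dependencies, parameters, and outputs. A hypothetical sketch (the stage name matches this pipeline, but the script path, file names, and params key are illustrative assumptions, not copied from the repo):

```yaml
stages:
  feature_engineering:
    cmd: python scripts/feature_engineering.py   # script path is an assumption
    deps:
      - data/interim/matches.parquet             # illustrative file name
      - scripts/feature_engineering.py
    params:
      - feature_engineering.window_days          # hypothetical params.yaml key
    outs:
      - data/features/
```

When any declared dependency or parameter changes, DVC re-runs that stage and everything downstream; untouched stages are served from cache.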

5. Inspect experiment results

Start the MLflow UI:

mlflow ui --port 5001

Open browser: http://localhost:5001

What to inspect:

  • Experiments: Browse all training runs
  • Parameters: Hyperparameters logged automatically
  • Metrics: Accuracy, precision, recall, F1
  • Artifacts: Confusion matrices, model files, plots
  • Runs comparison: Compare multiple models side-by-side

Example workflow:

  1. Navigate to "Experiments" tab
  2. Click on matches_clf experiment
  3. Select multiple runs
  4. Click "Compare"
  5. View metric differences and charts
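The same inspection can be scripted from the MLflow CLI instead of the UI. A hedged sketch (assumes a local tracking store; the experiment id must be read from the first command's output, `1` is a placeholder):

```shell
# List experiments and note the id of matches_clf
mlflow experiments search

# List the runs of one experiment (replace 1 with the actual id)
mlflow runs list --experiment-id 1
```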

6. Verify reproducibility

# Check DVC status (should be clean after dvc repro)
dvc status

# View pipeline DAG
dvc dag

Key insight: Any checkout of the same git commit + dvc repro produces identical outputs.
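One way to check this claim yourself is to hash dvc.lock before and after a second run. A sketch (`lock_md5` is our helper name, not a DVC command; assumes GNU coreutils):

```shell
# Hash helper: first field of md5sum output
lock_md5() { md5sum "$1" | cut -d' ' -f1; }

before=$(lock_md5 dvc.lock)
dvc repro                      # should be fully cached: no stage re-runs
after=$(lock_md5 dvc.lock)

[ "$before" = "$after" ] && echo "dvc.lock unchanged: run is reproducible"
```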


What this demonstrates

| Step | What it proves |
| --- | --- |
| dvc pull | Content-addressed versioning — any dataset version restorable by hash |
| dvc repro | Deterministic pipeline — same input + code → same output |
| mlflow ui | Full experiment traceability — parameters, metrics, and artifacts logged automatically |
| dvc dag | Explicit dependency tracking — stages and their inputs/outputs are declared |

Currently supported in this path

| Step | Status |
| --- | --- |
| dvc pull — restore versioned datasets | ✅ Operational |
| dvc repro — run full ML pipeline | ✅ Operational |
| mlflow ui — inspect experiments | ✅ Operational |
| pytest tests/ — run test suite (316 tests; make test recommended) | ✅ Operational |
| Grafana dashboard | 📋 Not yet deployed |

Where to go next

For the live API demo, see the Demo Guide; for full local environment setup, see the Local Dev Runbook.

Troubleshooting

DVC remote access

# Confirm which remote is configured, then try an explicit fetch
dvc remote list
dvc fetch --remote <remote_name>
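If the pull fails with authorization errors against the MinIO remote, the endpoint and credentials can be set explicitly. A hedged sketch (remote name, endpoint, and key values are placeholders; `--local` keeps secrets out of git):

```shell
dvc remote modify <remote_name> endpointurl http://localhost:9000
dvc remote modify --local <remote_name> access_key_id <access_key>
dvc remote modify --local <remote_name> secret_access_key <secret_key>
```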

MLflow connection

# Check which tracking store the UI reads from
echo $MLFLOW_TRACKING_URI

# Fall back to an explicit local sqlite store
mlflow ui --backend-store-uri sqlite:///mlflow.db

Pipeline errors

# Last resort: clear the local cache and force a full re-run
# (cached data will be re-downloaded from the remote)
rm -rf .dvc/cache
rm -f dvc.lock
dvc repro --force

View detailed logs

dvc repro --verbose


Comparison with Production

| Component | Local (This Guide) | Production | Status |
| --- | --- | --- | --- |
| Data Pipeline | ✅ DVC versioned | ✅ Airflow scheduled | Working |
| Feature Engineering | ✅ Reproducible | ✅ Same code | Working |
| Model Training | ✅ DVC + MLflow | ✅ DVC + MLflow | Working |
| Inference API | ✅ POST /predict implemented | 🚧 Infrastructure ready | Working |
| Monitoring | 📋 Planned | 📋 Planned | Not Started |

Demo for Interviews

Quick demo (5 minutes):

  1. Show architecture diagram
  2. Run dvc repro
  3. Open MLflow UI
  4. Explain separation of concerns

