
Quickstart — Reproducible Golden Path

This page shows how to reproduce the ML training pipeline locally from a clean checkout.

What this proves: dvc repro from any clean checkout gives deterministic results — same model, same metrics, tracked in MLflow.

Not covered here: live API demo → Demo Guide; full local environment setup → Local Dev Runbook.


Prerequisites

  • Python 3.13
  • pdm (dependency management)
  • git
  • dvc
  • Access to DVC remote storage (read-only for demo)

1. Clone the repository

git clone <repository-url>
cd soccer

2. Install dependencies

Dependencies are managed via PDM with environment-specific groups.

# Install all dependencies
pdm install

# OR: Create conda environment
make env-install

This installs:

  • Data access and storage utilities
  • ML libraries (scikit-learn, XGBoost, MLflow)
  • Pipeline orchestration tools (DVC)
  • Development utilities (ruff, pytest)
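If you do not need the development tools, PDM can install a subset of these groups. A minimal sketch (`--prod` is a standard PDM flag; which packages it skips depends on this repo's pyproject.toml group layout):

```shell
# Install only runtime (non-dev) dependencies
pdm install --prod

# Sanity-check that the environment resolves
pdm run python -V
```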

3. Pull versioned datasets

All datasets are versioned using DVC.

dvc pull

This restores:

  • Raw parquet files (data/raw/)
  • Processed datasets (data/interim/)
  • Feature tables (data/features/)
  • Train/test splits (data/splits/)

What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
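To see what "content-addressed" means in practice, a file's md5 determines exactly where it lives in the cache. A sketch (the hash below is a made-up example; the `files/md5` layout is the DVC 3.x cache scheme and varies by DVC version):

```shell
# The first two hex chars of the md5 become a directory,
# the remaining 30 become the file name inside it.
hash="3863d0e317dee0a55c4e59d2ec0eef33"   # example value, not a real dataset hash
echo ".dvc/cache/files/md5/${hash:0:2}/${hash:2}"
# → .dvc/cache/files/md5/38/63d0e317dee0a55c4e59d2ec0eef33
```

Because the path is derived purely from content, two checkouts that reference the same hash are guaranteed to pull byte-identical data.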


4. Run the ML pipeline

The full ML workflow is orchestrated as a DVC pipeline.

# Run full pipeline
dvc repro

# OR: Force re-execution of all stages
dvc repro --force

Pipeline stages (see dvc.yaml):

  1. load_data_from_sources - Fetch raw match data
  2. export_metadata - Extract metadata
  3. preprocessing - Clean and filter data
  4. feature_engineering - Compute time-windowed statistics
  5. split_data - Create train/test splits + CV folds
  6. classification_baseline - Train baseline models
  7. classification_models - Train candidate models

Execution characteristics:

  • Deterministic: Same input → same output
  • Cached: Only re-runs changed stages
  • Traceable: All outputs tracked in dvc.lock

Pipeline execution depends on:

  • Data versions (DVC tracked)
  • Code versions (Git tracked)
  • Configuration (params.yaml)
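These three inputs come together in dvc.yaml, where each stage declares its command, dependencies, parameters, and outputs. A hypothetical sketch (the stage name matches this pipeline, but the script path, file names, and params key are illustrative assumptions, not copied from the repo):

```yaml
stages:
  feature_engineering:
    cmd: python scripts/feature_engineering.py   # script path is an assumption
    deps:
      - data/interim/matches.parquet             # illustrative file name
      - scripts/feature_engineering.py
    params:
      - feature_engineering.window_days          # hypothetical params.yaml key
    outs:
      - data/features/
```

When any declared dependency or parameter changes, DVC re-runs that stage and everything downstream; untouched stages are served from cache.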

5. Inspect experiment results

Start the MLflow UI:

mlflow ui --port 5001

Open browser: http://localhost:5001

What to inspect:

  • Experiments: Browse all training runs
  • Parameters: Hyperparameters logged automatically
  • Metrics: Accuracy, precision, recall, F1
  • Artifacts: Confusion matrices, model files, plots
  • Runs comparison: Compare multiple models side-by-side

Example workflow:

  1. Navigate to "Experiments" tab
  2. Click on matches_clf experiment
  3. Select multiple runs
  4. Click "Compare"
  5. View metric differences and charts
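The same inspection can be scripted from the MLflow CLI instead of the UI. A hedged sketch (assumes a local tracking store; the experiment id must be read from the first command's output, `1` is a placeholder):

```shell
# List experiments and note the id of matches_clf
mlflow experiments search

# List the runs of one experiment (replace 1 with the actual id)
mlflow runs list --experiment-id 1
```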

6. Verify reproducibility

# Check DVC status (should be clean after dvc repro)
dvc status

# View pipeline DAG
dvc dag

Key insight: Any checkout of the same git commit + dvc repro produces identical outputs.
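One way to check this claim yourself is to hash dvc.lock before and after a second run. A sketch (`lock_md5` is our helper name, not a DVC command; assumes GNU coreutils):

```shell
# Hash helper: first field of md5sum output
lock_md5() { md5sum "$1" | cut -d' ' -f1; }

before=$(lock_md5 dvc.lock)
dvc repro                      # should be fully cached: no stage re-runs
after=$(lock_md5 dvc.lock)

[ "$before" = "$after" ] && echo "dvc.lock unchanged: run is reproducible"
```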


What this demonstrates

| Step | What it proves |
| --- | --- |
| dvc pull | Content-addressed versioning — any dataset version restorable by hash |
| dvc repro | Deterministic pipeline — same input + code → same output |
| mlflow ui | Full experiment traceability — parameters, metrics, and artifacts logged automatically |
| dvc dag | Explicit dependency tracking — stages and their inputs/outputs are declared |

Currently supported in this path

| Step | Status |
| --- | --- |
| dvc pull — restore versioned datasets | ✅ Operational |
| dvc repro — run full ML pipeline | ✅ Operational |
| mlflow ui — inspect experiments | ✅ Operational |
| pytest tests/ — run test suite (316 tests; make test recommended) | ✅ Operational |
| Grafana dashboard | 📋 Not yet deployed |

Where to go next

For the live API demo, see the Demo Guide; for full local environment setup, see the Local Dev Runbook.

Troubleshooting

DVC remote access

# Confirm which remote is configured, then try an explicit fetch
dvc remote list
dvc fetch --remote <remote_name>
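If the pull fails with authorization errors against the MinIO remote, the endpoint and credentials can be set explicitly. A hedged sketch (remote name, endpoint, and key values are placeholders; `--local` keeps secrets out of git):

```shell
dvc remote modify <remote_name> endpointurl http://localhost:9000
dvc remote modify --local <remote_name> access_key_id <access_key>
dvc remote modify --local <remote_name> secret_access_key <secret_key>
```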

MLflow connection

# Check which tracking store the UI reads from
echo $MLFLOW_TRACKING_URI

# Fall back to an explicit local sqlite store
mlflow ui --backend-store-uri sqlite:///mlflow.db

Pipeline errors

# Last resort: clear the local cache and force a full re-run
# (cached data will be re-downloaded from the remote)
rm -rf .dvc/cache
rm -f dvc.lock
dvc repro --force

View detailed logs

dvc repro --verbose


Comparison with Production

| Component | Local (This Guide) | Production | Status |
| --- | --- | --- | --- |
| Data Pipeline | ✅ DVC versioned | ✅ Airflow scheduled | Working |
| Feature Engineering | ✅ Reproducible | ✅ Same code | Working |
| Model Training | ✅ DVC + MLflow | ✅ DVC + MLflow | Working |
| Inference API | ✅ POST /predict implemented | 🚧 Infrastructure ready | Working |
| Monitoring | 📋 Planned | 📋 Planned | Not Started |

Demo for Interviews

Quick demo (5 minutes):

  1. Show architecture diagram
  2. Run dvc repro
  3. Open MLflow UI
  4. Explain separation of concerns

