Quickstart — Reproducible Golden Path¶
This page shows how to reproduce the ML training pipeline locally from a clean checkout.
What this proves: dvc repro from any clean checkout gives deterministic results —
same model, same metrics, tracked in MLflow.
Not covered here: live API demo → Demo Guide; full local environment setup → Local Dev Runbook.
Prerequisites¶
- Python 3.13
- `pdm` (dependency management)
- `git`
- `dvc`
- Access to DVC remote storage (read-only for demo)
1. Clone the repository¶
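A minimal sketch of the clone step; the URL and directory name are placeholders, substitute the actual repository:

```bash
# Clone the project and move into the working directory (placeholder URL)
git clone <repository-url>
cd <repository-name>
```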
2. Install dependencies¶
Dependencies are managed via PDM with environment-specific groups.
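The commands below are a sketch of the standard PDM workflow; the `dev` group name is an assumption and should match the groups defined in `pyproject.toml`:

```bash
# Install the project with its development dependency group
# (replace "dev" with the group names defined in pyproject.toml)
pdm install -G dev
```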
This installs:
- Data access and storage utilities
- ML libraries (scikit-learn, XGBoost, MLflow)
- Pipeline orchestration tools (DVC)
- Development utilities (ruff, pytest)
3. Pull versioned datasets¶
All datasets are versioned using DVC.
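From the repository root, pull the tracked data from the configured remote:

```bash
# Download all DVC-tracked files from the remote into the local cache and workspace
dvc pull
```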
This restores:
- Raw parquet files (`data/raw/`)
- Processed datasets (`data/interim/`)
- Feature tables (`data/features/`)
- Train/test splits (`data/splits/`)
What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
4. Run the ML pipeline¶
The full ML pipeline is orchestrated via DVC pipelines.
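From the repository root, reproduce the whole pipeline with a single command:

```bash
# Run every stage declared in dvc.yaml; unchanged stages are served from cache
dvc repro
```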
Pipeline stages (see dvc.yaml):
- `load_data_from_sources`: Fetch raw match data
- `export_metadata`: Extract metadata
- `preprocessing`: Clean and filter data
- `feature_engineering`: Compute time-windowed statistics
- `split_data`: Create train/test splits + CV folds
- `classification_baseline`: Train baseline models
- `classification_models`: Train candidate models
Execution characteristics:
- Deterministic: Same input → same output
- Cached: Only re-runs changed stages
- Traceable: All outputs tracked in `dvc.lock`
Pipeline execution depends on:
- Data versions (DVC tracked)
- Code versions (Git tracked)
- Configuration (`params.yaml`)
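To see how the stages connect, print the dependency graph that DVC derives from `dvc.yaml`:

```bash
# Render the stage dependency graph in the terminal
dvc dag
```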
5. Inspect experiment results¶
Start the MLflow UI:
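A sketch, assuming MLflow runs locally against the tracking data produced by the pipeline; adjust host and port to match your setup:

```bash
# Launch the MLflow tracking UI on the port used in this guide
mlflow ui --port 5001
```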
Open browser: http://localhost:5001
What to inspect:
- Experiments: Browse all training runs
- Parameters: Hyperparameters logged automatically
- Metrics: Accuracy, precision, recall, F1
- Artifacts: Confusion matrices, model files, plots
- Runs comparison: Compare multiple models side-by-side
Example workflow:
- Navigate to "Experiments" tab
- Click on the `matches_clf` experiment
- Select multiple runs
- Click "Compare"
- View metric differences and charts
6. Verify reproducibility¶
Key insight: any checkout of the same Git commit, followed by `dvc repro`, produces identical outputs.
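One way to check this (a minimal sketch, assuming a clean working tree):

```bash
# Re-run the pipeline; every stage should be reported as unchanged
dvc repro

# dvc.lock should be unmodified if all outputs are byte-identical
git diff --exit-code dvc.lock

# Confirm workspace, cache, and pipeline are in sync
dvc status
```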
What this demonstrates¶
| Step | What it proves |
|---|---|
| `dvc pull` | Content-addressed versioning — any dataset version restorable by hash |
| `dvc repro` | Deterministic pipeline — same input + code → same output |
| `mlflow ui` | Full experiment traceability — parameters, metrics, and artifacts logged automatically |
| `dvc dag` | Explicit dependency tracking — stages and their inputs/outputs are declared |
Currently supported in this path¶
| Step | Status |
|---|---|
| `dvc pull` — restore versioned datasets | ✅ Operational |
| `dvc repro` — run full ML pipeline | ✅ Operational |
| `mlflow ui` — inspect experiments | ✅ Operational |
| `pytest tests/` — run test suite (316 tests, `make test` recommended) | ✅ Operational |
| Grafana dashboard | 📋 Not yet deployed |
Where to go next¶
- Demo Guide — live API walkthrough and interview script
- Architecture Overview — system design and C4 diagrams
- Implementation Status — full component readiness matrix
Troubleshooting¶
DVC remote access¶
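If `dvc pull` fails, the following commands usually narrow the problem down; credential setup depends on how your remote (MinIO S3 in this project) is configured:

```bash
# List configured remotes and confirm the expected default is set
dvc remote list

# Compare the local cache against the remote to see what is missing
dvc status -c
```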
MLflow connection¶
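If the UI does not load, confirm the tracking server is reachable and that clients point at it; the URI below matches the port used in this guide and is otherwise an assumption:

```bash
# Verify something is listening on the expected port
curl -I http://localhost:5001

# Point MLflow clients at the same tracking server
export MLFLOW_TRACKING_URI=http://localhost:5001
```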
Pipeline errors¶
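For a failing stage, re-running just that stage often isolates the issue; the stage name below is one example from the list above:

```bash
# Show which stages DVC considers changed or missing outputs
dvc status

# Force a single stage to re-run even if DVC considers it up to date
dvc repro --force feature_engineering
```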
View detailed logs¶
```bash
dvc repro --verbose
```
Comparison with Production¶
| Component | Local (This Guide) | Production | Status |
|---|---|---|---|
| Data Pipeline | ✅ DVC versioned | ✅ Airflow scheduled | Working |
| Feature Engineering | ✅ Reproducible | ✅ Same code | Working |
| Model Training | ✅ DVC + MLflow | ✅ DVC + MLflow | Working |
| Inference API | ✅ POST /predict implemented | 🚧 Infrastructure ready | Working |
| Monitoring | 📋 Planned | 📋 Planned | Not Started |
Demo for Interviews¶
Quick demo (5 minutes):
1. Show architecture diagram
2. Run dvc repro
3. Open MLflow UI
4. Explain separation of concerns