
Quickstart — Reproducible Golden Path

This page demonstrates how to reproduce the ML training pipeline locally using the same tools and principles that are used in production.

The goal is to prove that the training system is:

  • deterministic,
  • version-controlled,
  • and runnable outside of the live environment.


Prerequisites

You will need:

  • Python 3.13
  • pdm (for dependency management)
  • git (for version control)
  • dvc (for data versioning)
  • Access to DVC remote storage (read-only for demo)

Optional (for full local environment):

  • mamba or conda (for environment management)
  • docker (for containerized deployment)
  • kubectl / helm (for K8s deployment)


1. Clone the repository

git clone <repository-url>
cd soccer

2. Install dependencies

Dependencies are managed via PDM with environment-specific groups.

# Install all dependencies
pdm install

# OR: Create conda environment
make env-install

This installs:

  • Data access and storage utilities
  • ML libraries (scikit-learn, XGBoost, MLflow)
  • Pipeline orchestration tools (DVC)
  • Development utilities (ruff, pytest)
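
PDM reads these groups from pyproject.toml. A hypothetical excerpt showing the shape of such a configuration (group names and package pins here are illustrative, not the repository's actual file):

```toml
[project]
name = "soccer"
requires-python = ">=3.13"
dependencies = ["scikit-learn", "xgboost", "mlflow", "dvc"]

# PEP 735 dependency groups, installable with `pdm install -G dev`
[dependency-groups]
dev = ["ruff", "pytest"]
```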

3. Pull versioned datasets

All datasets are versioned using DVC.

dvc pull

This restores:

  • Raw parquet files (data/raw/)
  • Processed datasets (data/interim/)
  • Feature tables (data/features/)
  • Train/test splits (data/splits/)

What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
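
Content addressing can be illustrated with a short stdlib sketch. The cache layout below is a simplification for illustration, not DVC's exact on-disk format:

```python
import hashlib

def cache_path(content: bytes) -> str:
    """Mimic DVC-style content addressing: the file's hash *is* its identity.

    DVC shards its cache by a prefix of the digest; the exact layout here
    is illustrative only.
    """
    digest = hashlib.md5(content).hexdigest()
    return f".dvc/cache/{digest[:2]}/{digest[2:]}"

# Identical bytes always map to the same cache entry,
# so unchanged data is never re-downloaded.
a = cache_path(b"match_id,home_goals\n1,2\n")
b = cache_path(b"match_id,home_goals\n1,2\n")
assert a == b
```

Because the path is derived from the content, two datasets are "the same version" exactly when their bytes are identical.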


4. Run the ML pipeline

The full ML pipeline is orchestrated via DVC pipelines.

# Run full pipeline
dvc repro

# OR: Force re-run all stages
rm -f dvc.lock
dvc repro

Pipeline stages (see dvc.yaml):

  1. load_data_from_sources - Fetch raw match data
  2. export_metadata - Extract metadata
  3. preprocessing - Clean and filter data
  4. feature_engineering - Compute time-windowed statistics
  5. split_data - Create train/test splits + CV folds
  6. classification_baseline - Train baseline models
  7. classification_models - Train candidate models

Execution characteristics:

  • Deterministic: Same input → same output
  • Cached: Only re-runs changed stages
  • Traceable: All outputs tracked in dvc.lock

Pipeline execution depends on:

  • Data versions (DVC tracked)
  • Code versions (Git tracked)
  • Configuration (params.yaml)
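
Determinism hinges on fixing every source of randomness via configuration. A minimal stdlib sketch of a seeded split in that spirit (the function and parameter names are hypothetical, not the actual split_data code):

```python
import random

def split_ids(match_ids: list[int], test_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Deterministic train/test split: the same seed always yields the same shuffle."""
    rng = random.Random(seed)   # isolated RNG, independent of global state
    ids = sorted(match_ids)     # canonical order before shuffling
    rng.shuffle(ids)
    cut = int(len(ids) * (1 - test_fraction))
    return ids[:cut], ids[cut:]

# Re-running with the same inputs and seed reproduces the split exactly.
first = split_ids(list(range(100)), test_fraction=0.2, seed=42)
second = split_ids(list(range(100)), test_fraction=0.2, seed=42)
assert first == second
```

With the seed stored in params.yaml, DVC treats it as a tracked input: change it and the stage re-runs; keep it and the split is byte-identical.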

5. Inspect experiment results

Start the MLflow UI:

mlflow ui --port 5001

Open browser: http://localhost:5001

What to inspect:

  • Experiments: Browse all training runs
  • Parameters: Hyperparameters logged automatically
  • Metrics: Accuracy, precision, recall, F1
  • Artifacts: Confusion matrices, model files, plots
  • Runs comparison: Compare multiple models side-by-side

Example workflow:

  1. Navigate to "Experiments" tab
  2. Click on matches_clf experiment
  3. Select multiple runs
  4. Click "Compare"
  5. View metric differences and charts


6. Verify reproducibility

# Check DVC status (should be clean)
dvc status

# View pipeline DAG
dvc dag

# Inspect dataset hash
cat data/processed/dataset.parquet.dvc

Key insight: Any team member running dvc repro with the same git commit will produce identical results.
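
A .dvc pointer file is a small YAML stub recording the content hash rather than the data itself. A stdlib sketch of pulling the digest out of one (the stub contents below are made up for illustration):

```python
POINTER = """\
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 1048576
  path: dataset.parquet
"""

def read_md5(pointer_text: str) -> str:
    """Extract the md5 field without a YAML dependency (fine for this flat stub)."""
    for line in pointer_text.splitlines():
        line = line.strip().lstrip("- ")
        if line.startswith("md5:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no md5 field found")

assert read_md5(POINTER) == "1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d"
```

Comparing this digest across machines is the concrete check behind the key insight above: same commit, same hashes, same data.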


7. Explore the codebase

Key directories:

# Feature engineering (pure functions)
cat src/features/stats_matches.py

# Model training logic
cat src/models/classification.py

# DVC pipeline entrypoints
cat src/pipelines/classification.py

# Pipeline definition
cat dvc.yaml

Design principle: Clear separation between data access, pure transformations, and orchestration.
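
"Pure functions" here means feature code takes data in and returns features out, with no I/O or hidden state. A hypothetical time-windowed statistic in that style (the function name and inputs are illustrative, not the actual src/features code):

```python
def rolling_goal_avg(goals: list[int], window: int) -> list[float]:
    """Average goals over the previous `window` matches (excluding the current one).

    Pure: no I/O, no globals, so it is trivially testable and reproducible.
    """
    out: list[float] = []
    for i in range(len(goals)):
        prev = goals[max(0, i - window):i]
        out.append(sum(prev) / len(prev) if prev else 0.0)
    return out

assert rolling_goal_avg([2, 0, 3, 1], window=2) == [0.0, 2.0, 1.0, 1.5]
```

Keeping transformations pure is what lets DVC cache them safely: the output depends only on tracked inputs.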


What This Demonstrates

By completing the Golden Path, you have verified:

✅ Reproducible pipelines: DVC ensures deterministic execution
✅ Versioned data: Content-addressed storage via DVC
✅ Experiment tracking: MLflow logs all runs automatically
✅ Orchestration: DVC pipelines manage dependencies
✅ Separation of concerns: Data, features, models isolated

🚧 Not Yet Included (in development):

  • Live inference endpoint (POST /predict)
  • Model serving API integration
  • Real-time monitoring dashboards


Inference Status

What Works ✅

# Health check endpoint
uvicorn src.app.main:app --reload

# In another terminal
curl http://localhost:8000/healthcheck/

Response: Service health status with memory usage
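
The health payload can be sketched with the stdlib alone (field names are hypothetical; the real endpoint's schema lives in src/app):

```python
import resource
import sys

def healthcheck() -> dict:
    """Report liveness plus peak memory usage of the current process."""
    # ru_maxrss is kilobytes on Linux (bytes on macOS)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "status": "ok",
        "python": sys.version.split()[0],
        "peak_memory_kb": peak_kb,
    }

payload = healthcheck()
assert payload["status"] == "ok" and payload["peak_memory_kb"] > 0
```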

What's In Progress 🚧

# Prediction endpoint (returns 501 Not Implemented)
curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{"match_id": 123}'

Status: Endpoint structure exists, model loading not wired up yet.

Tracking: See Serving Layer docs for integration progress.
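
The 501 behavior is easy to mirror as a plain function: the handler refuses requests until a model is actually wired up (names below are illustrative, not the real src/app code):

```python
from typing import Any

MODEL: Any = None  # remains None until serving integration lands

def predict_handler(body: dict) -> tuple[int, dict]:
    """Return (status_code, response_body) in the spirit of the real endpoint."""
    if MODEL is None:
        return 501, {"detail": "Not Implemented: model loading pending"}
    return 200, {"match_id": body["match_id"], "prediction": MODEL.predict(body)}

status, resp = predict_handler({"match_id": 123})
assert status == 501
```

This stub-first pattern lets the API contract and clients be built before the model artifact is available.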


Next Steps

Explore Documentation

Local Development

# Run linting
ruff check src/

# Format code
ruff format src/

# Run tests (when available)
pytest tests/ -v

Infrastructure

# Build Docker images
docker build -f docker/Dockerfile.api -t soccer-api .

# View K8s manifests
ls k8s/manifests/
ls k8s/helm/

Troubleshooting

DVC Remote Access Issues

# Check DVC remote config
dvc remote list
cat .dvc/config

# Test connection
dvc fetch --remote <remote_name>

MLflow Connection Issues

# Check MLflow tracking URI
echo $MLFLOW_TRACKING_URI

# Start local MLflow server
mlflow ui --backend-store-uri sqlite:///mlflow.db

Pipeline Errors

# Clean and force re-run
rm -rf .dvc/cache
rm -f dvc.lock
dvc repro --force

# View detailed logs
dvc repro --verbose

Comparison with Production

Component           | Local (This Guide)   | Production              | Status
--------------------|----------------------|-------------------------|------------
Data Pipeline       | ✅ DVC versioned      | ✅ Airflow scheduled     | Working
Feature Engineering | ✅ Reproducible       | ✅ Same code             | Working
Model Training      | ✅ DVC + MLflow       | ✅ DVC + MLflow          | Working
Inference API       | 🚧 Returns 501       | 🚧 Infrastructure ready  | In Progress
Monitoring          | 📋 Planned           | 📋 Planned              | Not Started

Demo for Interviews

For a structured walkthrough suitable for demonstrating in interviews, see DEMO.md.

Quick demo (5 minutes):

  1. Show architecture diagram
  2. Run dvc repro
  3. Open MLflow UI
  4. Explain separation of concerns


References