
Quickstart — Reproducible Golden Path

This page demonstrates how to reproduce the ML training pipeline locally using the same tools and principles that are used in production.

The goal is to prove that the training system is:

  • deterministic,
  • version-controlled,
  • and runnable outside of the live environment.


Prerequisites

You will need:

  • Python 3.13
  • pdm (for dependency management)
  • git (for version control)
  • dvc (for data versioning)
  • Access to DVC remote storage (read-only for demo)

Optional (for full local environment):

  • mamba or conda (for environment management)
  • docker (for containerized deployment)
  • kubectl / helm (for K8s deployment)


1. Clone the repository

git clone <repository-url>
cd soccer

2. Install dependencies

Dependencies are managed via PDM with environment-specific groups.

# Install all dependencies
pdm install

# OR: Create conda environment
make env-install

This installs:

  • Data access and storage utilities
  • ML libraries (scikit-learn, XGBoost, MLflow)
  • Pipeline orchestration tools (DVC)
  • Development utilities (ruff, pytest)
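
PDM reads these groups from pyproject.toml. A hypothetical excerpt showing the shape of such a configuration (group names and package pins here are illustrative, not the repository's actual file):

```toml
[project]
name = "soccer"
requires-python = ">=3.13"
dependencies = ["scikit-learn", "xgboost", "mlflow", "dvc"]

# PEP 735 dependency groups, installable with `pdm install -G dev`
[dependency-groups]
dev = ["ruff", "pytest"]
```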

3. Pull versioned datasets

All datasets are versioned using DVC.

dvc pull

This restores:

  • Raw parquet files (data/raw/)
  • Processed datasets (data/interim/)
  • Feature tables (data/features/)
  • Train/test splits (data/splits/)

What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
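
Content addressing can be illustrated with a short stdlib sketch. The cache layout below is a simplification for illustration, not DVC's exact on-disk format:

```python
import hashlib

def cache_path(content: bytes) -> str:
    """Mimic DVC-style content addressing: the file's hash *is* its identity.

    DVC shards its cache by a prefix of the digest; the exact layout here
    is illustrative only.
    """
    digest = hashlib.md5(content).hexdigest()
    return f".dvc/cache/{digest[:2]}/{digest[2:]}"

# Identical bytes always map to the same cache entry,
# so unchanged data is never re-downloaded.
a = cache_path(b"match_id,home_goals\n1,2\n")
b = cache_path(b"match_id,home_goals\n1,2\n")
assert a == b
```

Because the path is derived from the content, two datasets are "the same version" exactly when their bytes are identical.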


4. Run the ML pipeline

The full ML pipeline is orchestrated via DVC pipelines.

# Run full pipeline
dvc repro

# OR: Force re-run all stages
rm -f dvc.lock
dvc repro

Pipeline stages (see dvc.yaml):

  1. load_data_from_sources - Fetch raw match data
  2. export_metadata - Extract metadata
  3. preprocessing - Clean and filter data
  4. feature_engineering - Compute time-windowed statistics
  5. split_data - Create train/test splits + CV folds
  6. classification_baseline - Train baseline models
  7. classification_models - Train candidate models

Execution characteristics:

  • Deterministic: Same input → same output
  • Cached: Only re-runs changed stages
  • Traceable: All outputs tracked in dvc.lock

Pipeline execution depends on:

  • Data versions (DVC tracked)
  • Code versions (Git tracked)
  • Configuration (params.yaml)
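
Determinism hinges on fixing every source of randomness via configuration. A minimal stdlib sketch of a seeded split in that spirit (the function and parameter names are hypothetical, not the actual split_data code):

```python
import random

def split_ids(match_ids: list[int], test_fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Deterministic train/test split: the same seed always yields the same shuffle."""
    rng = random.Random(seed)   # isolated RNG, independent of global state
    ids = sorted(match_ids)     # canonical order before shuffling
    rng.shuffle(ids)
    cut = int(len(ids) * (1 - test_fraction))
    return ids[:cut], ids[cut:]

# Re-running with the same inputs and seed reproduces the split exactly.
first = split_ids(list(range(100)), test_fraction=0.2, seed=42)
second = split_ids(list(range(100)), test_fraction=0.2, seed=42)
assert first == second
```

With the seed stored in params.yaml, DVC treats it as a tracked input: change it and the stage re-runs; keep it and the split is byte-identical.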

5. Inspect experiment results

Start the MLflow UI:

mlflow ui --port 5001

Open browser: http://localhost:5001

What to inspect:

  • Experiments: Browse all training runs
  • Parameters: Hyperparameters logged automatically
  • Metrics: Accuracy, precision, recall, F1
  • Artifacts: Confusion matrices, model files, plots
  • Runs comparison: Compare multiple models side-by-side

Example workflow:

  1. Navigate to "Experiments" tab
  2. Click on matches_clf experiment
  3. Select multiple runs
  4. Click "Compare"
  5. View metric differences and charts


6. Verify reproducibility

# Check DVC status (should be clean)
dvc status

# View pipeline DAG
dvc dag

# Inspect dataset hash
cat data/processed/dataset.parquet.dvc

Key insight: Any team member running dvc repro with the same git commit will produce identical results.
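
A .dvc pointer file is a small YAML stub recording the content hash rather than the data itself. A stdlib sketch of pulling the digest out of one (the stub contents below are made up for illustration):

```python
POINTER = """\
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 1048576
  path: dataset.parquet
"""

def read_md5(pointer_text: str) -> str:
    """Extract the md5 field without a YAML dependency (fine for this flat stub)."""
    for line in pointer_text.splitlines():
        line = line.strip().lstrip("- ")
        if line.startswith("md5:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("no md5 field found")

assert read_md5(POINTER) == "1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d"
```

Comparing this digest across machines is the concrete check behind the key insight above: same commit, same hashes, same data.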


7. Explore the codebase

Key directories:

# Feature engineering (pure functions)
cat src/features/stats_matches.py

# Model training logic
cat src/models/classification.py

# DVC pipeline entrypoints
cat src/pipelines/classification.py

# Pipeline definition
cat dvc.yaml

Design principle: Clear separation between data access, pure transformations, and orchestration.
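
"Pure functions" here means feature code takes data in and returns features out, with no I/O or hidden state. A hypothetical time-windowed statistic in that style (the function name and inputs are illustrative, not the actual src/features code):

```python
def rolling_goal_avg(goals: list[int], window: int) -> list[float]:
    """Average goals over the previous `window` matches (excluding the current one).

    Pure: no I/O, no globals, so it is trivially testable and reproducible.
    """
    out: list[float] = []
    for i in range(len(goals)):
        prev = goals[max(0, i - window):i]
        out.append(sum(prev) / len(prev) if prev else 0.0)
    return out

assert rolling_goal_avg([2, 0, 3, 1], window=2) == [0.0, 2.0, 1.0, 1.5]
```

Keeping transformations pure is what lets DVC cache them safely: the output depends only on tracked inputs.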


What This Demonstrates

By completing the Golden Path, you have verified:

✅ Reproducible pipelines: DVC ensures deterministic execution
✅ Versioned data: Content-addressed storage via DVC
✅ Experiment tracking: MLflow logs all runs automatically
✅ Orchestration: DVC pipelines manage dependencies
✅ Separation of concerns: Data, features, models isolated

🚧 Not Yet Included (in development):

  • Live inference endpoint (POST /predict)
  • Model serving API integration
  • Real-time monitoring dashboards


Inference Status

What Works ✅

# Health check endpoint
uvicorn src.app.main:app --reload

# In another terminal
curl http://localhost:8000/healthcheck/

Response: Service health status with memory usage
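
The health payload can be sketched with the stdlib alone (field names are hypothetical; the real endpoint's schema lives in src/app):

```python
import resource
import sys

def healthcheck() -> dict:
    """Report liveness plus peak memory usage of the current process."""
    # ru_maxrss is kilobytes on Linux (bytes on macOS)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return {
        "status": "ok",
        "python": sys.version.split()[0],
        "peak_memory_kb": peak_kb,
    }

payload = healthcheck()
assert payload["status"] == "ok" and payload["peak_memory_kb"] > 0
```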

What's In Progress 🚧

# Prediction endpoint (returns 501 Not Implemented)
curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{"match_id": 123}'

Status: Endpoint structure exists, model loading not wired up yet.

Tracking: See Serving Layer docs for integration progress.
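
The 501 behavior is easy to mirror as a plain function: the handler refuses requests until a model is actually wired up (names below are illustrative, not the real src/app code):

```python
from typing import Any

MODEL: Any = None  # remains None until serving integration lands

def predict_handler(body: dict) -> tuple[int, dict]:
    """Return (status_code, response_body) in the spirit of the real endpoint."""
    if MODEL is None:
        return 501, {"detail": "Not Implemented: model loading pending"}
    return 200, {"match_id": body["match_id"], "prediction": MODEL.predict(body)}

status, resp = predict_handler({"match_id": 123})
assert status == 501
```

This stub-first pattern lets the API contract and clients be built before the model artifact is available.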


Next Steps

Explore Documentation

Local Development

# Run linting
ruff check src/

# Format code
ruff format src/

# Run tests (when available)
pytest tests/ -v

Infrastructure

# Build Docker images
docker build -f docker/Dockerfile.api -t soccer-api .

# View K8s manifests
ls k8s/manifests/
ls k8s/helm/

Troubleshooting

DVC Remote Access Issues

# Check DVC remote config
dvc remote list
cat .dvc/config

# Test connection
dvc fetch --remote <remote_name>

MLflow Connection Issues

# Check MLflow tracking URI
echo $MLFLOW_TRACKING_URI

# Start local MLflow server
mlflow ui --backend-store-uri sqlite:///mlflow.db

Pipeline Errors

# Clean and force re-run
rm -rf .dvc/cache
rm -f dvc.lock
dvc repro --force

# View detailed logs
dvc repro --verbose

Comparison with Production

Component           | Local (This Guide)   | Production              | Status
--------------------|----------------------|-------------------------|------------
Data Pipeline       | ✅ DVC versioned      | ✅ Airflow scheduled     | Working
Feature Engineering | ✅ Reproducible       | ✅ Same code             | Working
Model Training      | ✅ DVC + MLflow       | ✅ DVC + MLflow          | Working
Inference API       | 🚧 Returns 501       | 🚧 Infrastructure ready  | In Progress
Monitoring          | 📋 Planned           | 📋 Planned              | Not Started

Demo for Interviews

For a structured walkthrough suitable for demonstrating in interviews, see DEMO.md.

Quick demo (5 minutes):

  1. Show architecture diagram
  2. Run dvc repro
  3. Open MLflow UI
  4. Explain separation of concerns


References