Quickstart — Reproducible Golden Path¶
This page demonstrates how to reproduce the ML training pipeline locally, using the same tools and principles as production.
The goal is to prove that the training system is:

- deterministic,
- version-controlled,
- and runnable outside of the live environment.
Prerequisites¶
You will need:
- Python 3.13
- `pdm` (for dependency management)
- `git` (for version control)
- `dvc` (for data versioning)
- Access to DVC remote storage (read-only for demo)
Optional (for full local environment):
- mamba or conda (for environment management)
- docker (for containerized deployment)
- kubectl / helm (for K8s deployment)
1. Clone the repository¶
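Assuming a standard Git workflow (the repository URL and directory name below are placeholders — substitute your own):

```bash
git clone <repository-url>
cd <repository-name>
```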
2. Install dependencies¶
Dependencies are managed via PDM with environment-specific groups.
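A minimal sketch of the install step. Exact group names depend on the project's `pyproject.toml`, so the `-G dev` flag below is an assumption:

```bash
pdm install          # default dependency groups
pdm install -G dev   # development utilities, if split into a separate group
```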
This installs:
- Data access and storage utilities
- ML libraries (scikit-learn, XGBoost, MLflow)
- Pipeline orchestration tools (DVC)
- Development utilities (ruff, pytest)
3. Pull versioned datasets¶
All datasets are versioned using DVC.
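Assuming the DVC remote is already configured in `.dvc/config` with read-only credentials, the datasets are restored with:

```bash
dvc pull
```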
This restores:
- Raw parquet files (`data/raw/`)
- Processed datasets (`data/interim/`)
- Feature tables (`data/features/`)
- Train/test splits (`data/splits/`)
What happens: DVC downloads data files from remote storage (MinIO S3) using content-addressed hashes.
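Content addressing can be illustrated with a short, hypothetical sketch (not DVC's actual implementation): a file's MD5 digest becomes its cache key, so identical content always maps to the same storage path, regardless of filename.

```python
import hashlib
import os

def cache_path(data: bytes, cache_dir: str = ".dvc/cache/files/md5") -> str:
    """Derive a content-addressed cache path from file bytes (illustrative only)."""
    digest = hashlib.md5(data).hexdigest()
    # DVC-style layout: first two hex characters form a directory shard
    return os.path.join(cache_dir, digest[:2], digest[2:])

# Identical content -> identical path, so unchanged files are never re-downloaded
p1 = cache_path(b"match_id,home,away\n1,2,0\n")
p2 = cache_path(b"match_id,home,away\n1,2,0\n")
assert p1 == p2
```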
4. Run the ML pipeline¶
The full ML pipeline is orchestrated via DVC pipelines.
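The whole DAG (or only its out-of-date stages) is executed with:

```bash
dvc repro
```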
Pipeline stages (see `dvc.yaml`):

- `load_data_from_sources` - Fetch raw match data
- `export_metadata` - Extract metadata
- `preprocessing` - Clean and filter data
- `feature_engineering` - Compute time-windowed statistics
- `split_data` - Create train/test splits + CV folds
- `classification_baseline` - Train baseline models
- `classification_models` - Train candidate models
Execution characteristics:
- Deterministic: Same input → same output
- Cached: Only re-runs changed stages
- Traceable: All outputs tracked in `dvc.lock`
Pipeline execution depends on:
- Data versions (DVC tracked)
- Code versions (Git tracked)
- Configuration (`params.yaml`)
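A hypothetical stage definition illustrating how these three dependency types appear in a DVC pipeline (the stage name matches the list above, but the command arguments, file paths, and parameter key are illustrative, not copied from the project's actual `dvc.yaml`):

```yaml
stages:
  feature_engineering:
    cmd: python -m src.pipelines.classification --stage features
    deps:
      - src/features/stats_matches.py   # code version (Git tracked)
      - data/interim/matches.parquet    # data version (DVC tracked)
    params:
      - features.window_sizes           # configuration (params.yaml)
    outs:
      - data/features/
```

If any of these inputs change, `dvc repro` re-runs the stage; otherwise the cached output is reused.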
5. Inspect experiment results¶
Start the MLflow UI:
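A minimal sketch, assuming runs were logged to a local SQLite backend store (the store URI is an assumption; the port matches the URL below):

```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5001
```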
Open browser: http://localhost:5001
What to inspect:
- Experiments: Browse all training runs
- Parameters: Hyperparameters logged automatically
- Metrics: Accuracy, precision, recall, F1
- Artifacts: Confusion matrices, model files, plots
- Runs comparison: Compare multiple models side-by-side
Example workflow:
1. Navigate to "Experiments" tab
2. Click on the `matches_clf` experiment
3. Select multiple runs
4. Click "Compare"
5. View metric differences and charts
6. Verify reproducibility¶
```bash
# Check DVC status (should be clean)
dvc status

# View pipeline DAG
dvc dag

# Inspect dataset hash
cat data/processed/dataset.parquet.dvc
```
Key insight: any team member running `dvc repro` at the same Git commit will produce identical results.
7. Explore the codebase¶
Key directories:¶
```bash
# Feature engineering (pure functions)
cat src/features/stats_matches.py

# Model training logic
cat src/models/classification.py

# DVC pipeline entrypoints
cat src/pipelines/classification.py

# Pipeline definition
cat dvc.yaml
```
Design principle: Clear separation between data access, pure transformations, and orchestration.
What This Demonstrates¶
By completing the Golden Path, you have verified:
- ✅ Reproducible pipelines: DVC ensures deterministic execution
- ✅ Versioned data: Content-addressed storage via DVC
- ✅ Experiment tracking: MLflow logs all runs automatically
- ✅ Orchestration: DVC pipelines manage dependencies
- ✅ Separation of concerns: Data, features, models isolated
🚧 Not Yet Included (in development):
- Live inference endpoint (POST /predict)
- Model serving API integration
- Real-time monitoring dashboards
Inference Status¶
What Works ✅¶
```bash
# Health check endpoint
uvicorn src.app.main:app --reload

# In another terminal
curl http://localhost:8000/healthcheck/
```
Response: Service health status with memory usage
What's In Progress 🚧¶
```bash
# Prediction endpoint (returns 501 Not Implemented)
curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{"match_id": 123}'
```
Status: Endpoint structure exists, model loading not wired up yet.
Tracking: See Serving Layer docs for integration progress.
Next Steps¶
Explore Documentation¶
- Implementation Status - Current state of all components
- Architecture - C4 diagrams and system design
- ADRs - Architectural decisions and rationale
- ML Pipeline - Training, features, validation details
Local Development¶
```bash
# Run linting
ruff check src/

# Format code
ruff format src/

# Run tests (when available)
pytest tests/ -v
```
Infrastructure¶
```bash
# Build Docker images
docker build -f docker/Dockerfile.api -t soccer-api .

# View K8s manifests
ls k8s/manifests/
ls k8s/helm/
```
Troubleshooting¶
DVC Remote Access Issues¶
```bash
# Check DVC remote config
dvc remote list
cat .dvc/config

# Test connection
dvc fetch --remote <remote_name>
```
MLflow Connection Issues¶
```bash
# Check MLflow tracking URI
echo $MLFLOW_TRACKING_URI

# Start local MLflow server
mlflow ui --backend-store-uri sqlite:///mlflow.db
```
Pipeline Errors¶
```bash
# Clean and force re-run
# WARNING: deletes the local DVC cache; data must be re-pulled afterwards
rm -rf .dvc/cache
rm -f dvc.lock
dvc repro --force

# View detailed logs
dvc repro --verbose
```
Comparison with Production¶
| Component | Local (This Guide) | Production | Status |
|---|---|---|---|
| Data Pipeline | ✅ DVC versioned | ✅ Airflow scheduled | Working |
| Feature Engineering | ✅ Reproducible | ✅ Same code | Working |
| Model Training | ✅ DVC + MLflow | ✅ DVC + MLflow | Working |
| Inference API | 🚧 Returns 501 Not Implemented | 🚧 Infrastructure ready | In Progress |
| Monitoring | 📋 Planned | 📋 Planned | Not Started |
Demo for Interviews¶
For a structured walkthrough suitable for demonstrating in interviews, see DEMO.md.
Quick demo (5 minutes):
1. Show architecture diagram
2. Run `dvc repro`
3. Open MLflow UI
4. Explain separation of concerns