# Environments & Dependency Strategy
SoccerPredictAI uses a layered dependency approach to maximize reproducibility and eliminate "works on my machine" issues across dev, CI, training, and production.
## Dependency Layering Strategy
- **Layer 1 — System + Python runtime**: conda / mamba (`environment.yml`) → exported to `requirements-mamba-base.txt`
- **Layer 2 — Python application dependencies**: PDM groups `api` / `ml` / `dev` / `prod` → exported per group to `requirements-pdm-*.txt`
- **Layer 3 — Final pinned artifacts**: merged into `requirements-*.txt` → used for deterministic Docker builds
**Why this design?**

- conda handles system-level and compiled library dependencies reliably.
- PDM provides modern dependency resolution and group-based separation (`api`/`ml`/`dev`/`prod`).
- Exporting to pinned `requirements-*.txt` files keeps Docker images reproducible and auditable without requiring conda in the container build chain.
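The Layer 3 merge step can be sketched as follows. This is an illustrative helper, not the project's actual merge script: it combines the conda-derived base pins with a PDM group export, letting the PDM pins win on name conflicts.

```python
# Illustrative sketch of the Layer 3 merge (assumed behavior, not the
# project's real script): PDM group pins override conda base pins.

def parse_pins(lines):
    """Map package name -> pinned requirement line, skipping comments."""
    pins = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name = line.split("==")[0].strip().lower()
        pins[name] = line
    return pins

def merge_pins(base_lines, group_lines):
    """Merge base pins with group pins; group pins take precedence."""
    merged = parse_pins(base_lines)
    merged.update(parse_pins(group_lines))
    return sorted(merged.values())

base = ["numpy==2.1.0", "redis==5.0.4"]
ml_group = ["numpy==2.2.1", "scikit-learn==1.5.2"]
print(merge_pins(base, ml_group))
# → ['numpy==2.2.1', 'redis==5.0.4', 'scikit-learn==1.5.2']
```

The key design point is that the group export, resolved from `pdm.lock`, is authoritative for anything PDM manages; the conda freeze only contributes packages PDM does not know about.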
## Environment Matrix
| Environment | Purpose | Dependency anchor | Python version | Activation |
|---|---|---|---|---|
| Local development | Code authoring, debugging, test runs | conda env + `pdm install --dev` | 3.13 (from `environment.yml`) | `conda activate soccer` |
| CI (GitLab) | Lint, test, build, deploy | `pdm install` from `pdm.lock` | 3.13 (pinned in CI image) | CI runner environment |
| Offline ML training | `dvc repro`, experiment runs | `requirements-ml.txt` (pinned) | 3.13 | conda or Docker container |
| Deployed runtime (API) | FastAPI + Celery workers serving predictions | `requirements-prod.txt` (pinned) | 3.13 | K8s pod from Docker image |
| Docs / reporting | MkDocs build, Quarto reports | `requirements-dev.txt` subset | 3.13 | Local dev env |
## Reproducibility Anchors
Every deployed model and dataset is traceable to four anchors:
| Anchor | What it pins |
|---|---|
| git commit | Code version |
| `pdm.lock` | All Python dependency versions |
| DVC content hash | Exact dataset version used for training |
| MLflow run ID | All training parameters, metrics, and artifacts |
A deployment is fully reproducible when all four anchors are recorded.
Deployment manifests in `k8s/` reference the Docker image tag, which maps to a specific git commit and `pdm.lock`.
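A deployment record built around the four anchors might look like the sketch below. The field names are assumptions for illustration, not the project's actual schema; the point is that reproducibility is a property of having all four values recorded.

```python
# Hypothetical record of the four reproducibility anchors; field names
# are illustrative, not the project's real deployment metadata schema.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DeploymentRecord:
    git_commit: str      # code version
    pdm_lock_hash: str   # all Python dependency versions
    dvc_data_hash: str   # exact dataset version used for training
    mlflow_run_id: str   # training parameters, metrics, artifacts

    def is_reproducible(self) -> bool:
        """Fully reproducible only when all four anchors are recorded."""
        return all(asdict(self).values())

record = DeploymentRecord("a1b2c3d", "sha256:...", "md5:...", "run-42")
assert record.is_reproducible()
```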
## Dependency Groups (PDM)
| Group | Contents | Used by |
|---|---|---|
| `api` | FastAPI, Pydantic, Celery, Redis client | API Docker image |
| `ml` | scikit-learn, XGBoost, Optuna, MLflow, DVC | Training pipeline Docker image / local |
| `dev` | pytest, hypothesis, ruff, pre-commit, mypy | CI + local development |
| `prod` | Combined `api` + `ml` for production deployment | Production Docker image |
## How to Rebuild Pinned Requirements
Rebuilding the pinned artifacts regenerates:

- PDM exports per group (`requirements-pdm-*.txt`)
- Base pip freeze from the conda env (`requirements-mamba-base.txt`)
- Merged final `requirements-*.txt` files for Docker builds
Rebuild these whenever `pdm.lock` or `environment.yml` changes, and before building new Docker images.
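The regeneration flow might be scripted along these lines. This is a sketch, not the project's real rebuild script: the `pdm export` flags should be verified against `pdm export --help`, and the final merge step is only indicated.

```python
# Sketch of the rebuild flow (assumed commands and file names; verify
# pdm export flags against your PDM version before relying on this).
import subprocess

GROUPS = ["api", "ml", "dev", "prod"]

def export_commands(groups=GROUPS):
    """Build one `pdm export` command per dependency group."""
    return [
        ["pdm", "export", "--group", g, "--output", f"requirements-pdm-{g}.txt"]
        for g in groups
    ]

def regenerate(run=subprocess.run):
    # 1. Per-group PDM exports from pdm.lock.
    for cmd in export_commands():
        run(cmd, check=True)
    # 2. Base freeze from the active conda/mamba env.
    run(["python", "-m", "pip", "freeze"], check=True)
    # 3. The merge into the final requirements-*.txt files would follow here.

print(export_commands(["ml"])[0])
```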
## Operational Note
The system treats `pdm.lock` and DVC content hashes as the primary reproducibility anchors.
All production deployments should be traceable to:
- git commit
- dataset version (DVC hash)
- model version (MLflow run ID + registered version)
- dependency lock (`pdm.lock`)
No deployment should be performed from an environment where any of these anchors is unresolved.
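A pre-deployment gate enforcing this rule could look like the following. The function and key names are hypothetical; the behavior to preserve is that deployment fails fast when any anchor is missing or empty.

```python
# Illustrative pre-deployment gate (names are assumptions): refuse to
# proceed when any reproducibility anchor is missing or unresolved.
def assert_anchors_resolved(anchors: dict) -> None:
    """Raise if any required anchor is absent or empty."""
    required = ("git_commit", "dvc_data_hash", "mlflow_run_id", "pdm_lock")
    missing = [k for k in required if not anchors.get(k)]
    if missing:
        raise RuntimeError(f"Deployment blocked; unresolved anchors: {missing}")

# Passes silently when every anchor is recorded.
assert_anchors_resolved({
    "git_commit": "a1b2c3d",
    "dvc_data_hash": "md5:...",
    "mlflow_run_id": "run-42",
    "pdm_lock": "sha256:...",
})
```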
## Related
- Architecture Principles — Reproducibility First
- Deployment View — how Docker images are deployed
- Trade-offs — DVC + MinIO
- Security — how secrets are injected at runtime (not baked into images)