Architecture Principles

This page documents the guiding principles behind the SoccerPredictAI architecture. Each principle is a deliberate design choice with direct consequences for how the system is built and operated.


1. Reproducibility First

Statement: Any historical model must be reproducible from a clean checkout.

Anchors: pdm.lock (dependencies) + DVC content-addressed storage (data + model artifacts) + MLflow run metadata (training parameters + metrics).

Consequences:
- All randomness is explicitly seeded.
- No path, parameter, or version is hardcoded in code; all come from params.yaml or Hydra configs.
- dvc repro from any git commit and matching dataset version must yield the same artifacts.
- Docker images pin dependency hashes, not floating ranges.
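
The seeding and configuration rules above can be illustrated with a minimal sketch. It is not the project's pipeline code: the `PARAMS` dictionary stands in for values that would be read from params.yaml or Hydra, and `run_stage` is a hypothetical toy stage. It shows only that an explicitly seeded stage, driven entirely by parameters, produces the same content hash on every run.

```python
import hashlib
import json
import random

# Hypothetical parameters; in the project these would come from params.yaml or Hydra.
PARAMS = {"seed": 42, "n_samples": 5}


def run_stage(params: dict) -> str:
    """Run a toy 'training' stage and return a content hash of its artifact."""
    rng = random.Random(params["seed"])  # explicit, local seed; no hidden global state
    artifact = [rng.random() for _ in range(params["n_samples"])]
    # Hash parameters and artifact together, similar in spirit to content addressing.
    payload = json.dumps({"params": params, "artifact": artifact}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = run_stage(PARAMS)
second = run_stage(PARAMS)
print(first == second)  # True
```

Changing any parameter, including the seed, changes the hash, which is the property that lets a content-addressed store detect that an artifact must be rebuilt.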

Where this appears: Environments, Data & ML Flow, Trade-offs — DVC + MinIO


2. Explicit Contracts at Boundaries

Statement: Every boundary between subsystems has a formal, validated contract. No contract = no boundary.

Contracts in this system:
- data/raw → Great Expectations suite (validate_raw)
- data/interim → Great Expectations suite (validate_finished, validate_future)
- data/features → Great Expectations suite (validate_features)
- Training output → MLflow pyfunc model signature (input + output schema)
- API → Pydantic PredictRequest / PredictResponse schemas

Consequences:
- Broken data fails fast at the validation gate, not silently downstream.
- Model serving rejects malformed inputs at schema validation, not at inference time.
- Contract files are versioned alongside code.
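
As a rough illustration of rejecting malformed input at the boundary, the sketch below uses a standard-library dataclass as a stand-in for the Pydantic schema, so that it runs without third-party packages. The field names (`home_team`, `away_team`), the `predict` function, and the returned probabilities are all hypothetical; the real PredictRequest / PredictResponse schemas are not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictRequest:
    # Hypothetical fields; the real schema may differ.
    home_team: str
    away_team: str

    def __post_init__(self) -> None:
        # Validation runs at construction time, before any downstream code sees the data.
        for name in ("home_team", "away_team"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if self.home_team == self.away_team:
            raise ValueError("home_team and away_team must differ")


def predict(payload: dict) -> dict:
    """Validate at the boundary, then call a placeholder model."""
    request = PredictRequest(**payload)  # raises before any inference happens
    # Placeholder output; a real model would be loaded from the registry.
    return {
        "match": f"{request.home_team} vs {request.away_team}",
        "home_win": 0.4,
        "draw": 0.3,
        "away_win": 0.3,
    }
```

Calling `predict({"home_team": "team_a", "away_team": ""})` raises `ValueError` during request construction, so the placeholder model is never reached.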

Where this appears: Component View, Data & ML Flow


3. Architecture over Implementation Details

Statement: Architecture documentation describes structure, decisions, and constraints — not algorithm configurations, model hyperparameters, or performance thresholds. Implementation specifics belong in implementation-level docs (docs/ml/, docs/reference/), not architectural views.

Design distinction:

| Concept | Meaning | Examples in this system |
| --- | --- | --- |
| Architectural invariant | A property that must hold regardless of how implementation evolves | All models loaded via MLflow Registry; feature logic shared between offline and online paths |
| Implementation optimization | A decision that improves performance but is replaceable | Redis caching strategy; specific calibration method; exact window sizes |

This distinction keeps architecture stable across implementation changes and prevents docs from degrading into snapshots of the current configuration.

Consequences:
- Architecture docs remain useful across code refactors and parameter tuning.
- Reviewers can distinguish what is structurally binding from what is operationally tunable.
- Status labels (Implemented, Partial, Planned) apply to architectural elements, not to config values.

Where this appears: This distinction is applied throughout all architecture pages.


4. Separate Offline and Online Concerns

Statement: The offline training pipeline (DVC) and the online serving path (FastAPI + Celery) are independent execution environments that share contracts and logic, but not runtime infrastructure.

Consequences:
- DVC stages never import or call FastAPI/Celery code.
- Serving code never triggers DVC stages.
- Shared logic (feature engineering functions) lives in src/features/ and is imported by both, but the execution paths are separate.
- Model promotion is the explicit handoff point between offline and online.
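
The idea of shared logic with separate execution paths can be sketched as follows. This is an illustration only: `recent_form`, `build_training_rows`, and `serve_feature` are hypothetical names, and the data is made up. The point is that one pure function is called from both a batch path and a per-request path, so the two paths cannot drift apart in how the feature is computed.

```python
from typing import Sequence


def recent_form(results: Sequence[int], window: int = 3) -> float:
    """Shared feature logic: mean points over the last `window` matches."""
    recent = list(results)[-window:]
    return sum(recent) / len(recent) if recent else 0.0


def build_training_rows(history: dict[str, list[int]]) -> dict[str, float]:
    """Offline path: batch over all teams, as a pipeline stage might."""
    return {team: recent_form(points) for team, points in history.items()}


def serve_feature(team: str, history: dict[str, list[int]]) -> float:
    """Online path: single lookup, as a request handler or async task might."""
    return recent_form(history.get(team, []))


# Made-up points per match for two placeholder teams.
history = {"team_a": [3, 1, 0, 3], "team_b": [0, 0, 1]}
offline = build_training_rows(history)
online = serve_feature("team_a", history)
print(offline["team_a"] == online)  # True
```

Neither caller imports the other; both depend only on the shared function, which mirrors the rule that DVC stages and serving code never call each other.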

Where this appears: Data & ML Flow, Container View, Runtime View


5. Prefer Operational Clarity Over Platform Sprawl

Statement: Choose the right tool per job. Do not introduce infrastructure whose complexity outweighs its benefit at current scale.

Consequences:
- Airflow for scheduling (calendar-driven); DVC for ML pipelines (artifact-driven). Not both for the same job.
- Celery + RabbitMQ for async tasks; Kafka is not justified at current throughput.
- MinIO provides S3-compatible storage on-prem; no dependency on AWS.
- Monitoring via Prometheus + Grafana (standard stack), not a proprietary SaaS.

Where this appears: Trade-offs, Deployment View


6. Single Source of Truth per Responsibility

Statement: Each category of data has exactly one authoritative store. No duplication of authority.

| Responsibility | Authoritative store |
| --- | --- |
| Structured scraped data | PostgreSQL (namespace: ds) |
| Raw and interim datasets | MinIO (via DVC) |
| Model artifacts and runs | MLflow Registry |
| Live prediction/feature cache | Redis (namespace: soccer-api) |
| Configuration and parameters | params.yaml / Hydra configs |
| Secret values | SOPS-encrypted files in git |

Consequences:
- No reconciliation problems across stores for the same data type.
- Debugging always starts from the same place.

Where this appears: System Boundary, Container View


7. Documentation-First Architecture

Statement: Architectural decisions are documented before or alongside implementation, not retrospectively.

Mechanism:
- ADRs in docs/adr/ capture the decision context, alternatives, and consequences.
- This architecture section documents the intended design, with explicit status labels.
- status.md is the authoritative implementation status tracker.

Consequences:
- Intent and implementation can diverge; the docs distinguish them honestly.
- Reviewers can trace why a decision was made without reading the commit history.

Where this appears: Implementation Status, Trade-offs, ADR Index


8. Honest Current-State Labeling

Statement: Every architecturally significant element in documentation carries an explicit status label.

Labels used throughout this documentation:

| Label | Meaning |
| --- | --- |
| ✅ Implemented | Exists in code, deployed or tested |
| 🚧 Partially implemented | Core exists; some parts missing or manual |
| 📋 Planned | Architecturally designed; not yet built |

Consequences:
- Documentation describes the actual state of the system, so it is usable for code review and interviews rather than being merely aspirational.
- No component is described as production-ready unless it actually is.

Where this appears: Implementation Status, all architecture pages


9. Security by Design

Statement: Secrets are never plaintext in code, config files, Docker images, or CI logs. Access is bounded by namespace isolation and least-privilege service accounts.

Mechanisms:
- SOPS + age encryption for all credentials committed to git.
- K8s namespace isolation: ds, soccer-api, monitoring, ingress-nginx are separate trust zones.
- No plaintext .env files in the repository.
- CI decrypts secrets only in scoped, ephemeral steps.
- Kubernetes secrets are namespace-scoped.

Consequences:
- Secret rotation requires re-encrypting SOPS files and redeploying.
- Operational complexity is slightly higher than plaintext .env, but the security posture is correct.
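
One way the "no plaintext .env files" rule could be enforced is a small repository check run in CI. The sketch below is an assumption, not a description of the project's actual tooling: the function name `find_plaintext_env_files` and the filename pattern (`.env` and `.env.*`) are hypothetical, and a real check would likely need an allowlist for files such as templates or encrypted variants.

```python
import os
from pathlib import Path


def find_plaintext_env_files(repo_root: Path) -> list[Path]:
    """Return .env-style files under repo_root, skipping the .git directory."""
    hits: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(repo_root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            # Assumed naming convention: ".env" or ".env.<something>".
            if name == ".env" or name.startswith(".env."):
                hits.append(Path(dirpath) / name)
    return sorted(hits)
```

A CI step could call `find_plaintext_env_files(Path("."))` and fail the build when the returned list is non-empty, which turns the rule into a gate instead of a convention.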

Where this appears: Security, Environments, Trade-offs — SOPS + age