Data Layer

This section documents the offline data subsystem that feeds reproducible ML in SoccerPredictAI. It is not a second architecture overview. For system boundaries, component responsibilities, and end-to-end flow, see Architecture.


Purpose

The data layer exists to bridge an unstable, uncontrolled external source (WhoScored.com) and a reproducible, contract-governed ML pipeline (DVC). Its core functions are to:

  1. Acquire raw match data from an unreliable external source.
  2. Canonicalize that data into PostgreSQL.
  3. Export immutable parquet snapshots to MinIO.
  4. Version those snapshots so ML experiments are reproducible and traceable.
  5. Gate downstream ML with validated data contracts.

Every boundary in this layer exists to enforce those contracts. The data layer does not train models, serve predictions, or manage infrastructure.
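
A minimal sketch of the export boundary (steps 2–4), assuming a pandas + SQLAlchemy stack; the connection string, table name, and output path are illustrative, not the project's real ones:

```python
# Hypothetical export step: read canonical rows from PostgreSQL and
# write an immutable parquet snapshot for DVC to track.
import pandas as pd
from sqlalchemy import create_engine

# Illustrative DSN; real connection settings live elsewhere.
engine = create_engine("postgresql://user:pass@localhost:5432/soccer")

# Hypothetical canonical table; the real schema may differ.
df = pd.read_sql("SELECT * FROM matches", engine)

# The snapshot is never mutated after this write; a re-export produces
# a file that DVC records as a new dataset version.
df.to_parquet("data/raw/matches.parquet", index=False)
```

Tracking the file (for example with dvc add data/raw/matches.parquet, or as a declared stage output) records its content hash, which is what makes the snapshot immutable in practice: any later change shows up as a new version, never as a silent overwrite.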


Design principles

These are not aspirations — they are active constraints that shape how each stage is built.

| Principle | Constraint it enforces |
| --- | --- |
| Scraping is isolated from ML reproducibility | ML pipelines never read from live PostgreSQL; they read only from versioned parquet snapshots |
| PostgreSQL is canonical, not the training source | Canonical status is about correctness and dedup, not ML input; the raw export stage creates the reproducibility boundary |
| Raw snapshots are immutable | Once a parquet file is exported and DVC-tracked, it is never mutated; new exports create new DVC versions |
| Contracts are explicit and blocking | Great Expectations suites are enforced as DVC stage gates; a failed check stops the pipeline |
| Lineage is full-stack | Every training run can be traced back through: MLflow run → DVC dataset version → parquet snapshot → PostgreSQL scraping run |
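
One way to realize the lineage principle is to stamp each MLflow run with the exact data version it consumed. The sketch below assumes the snapshot path and the tag/parameter names shown, which are hypothetical; the project's actual conventions may differ:

```python
# Sketch: tie an MLflow run back to the parquet snapshot it trained on.
import hashlib
import subprocess

import mlflow

def file_md5(path: str) -> str:
    """Content hash of the snapshot actually read by this run."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with mlflow.start_run():
    # The git commit pins dvc.lock, i.e. the DVC dataset version.
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    mlflow.set_tag("git_commit", commit)
    mlflow.log_param("raw_snapshot_md5", file_md5("data/raw/matches.parquet"))
    # ... training proceeds here ...
```

With both the commit and the content hash attached, a run in the MLflow UI can be traced back to the exact snapshot it consumed.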

Data lifecycle

```mermaid
flowchart LR
    A[WhoScored.com] -->|Airflow + Selenoid| B[(PostgreSQL)]
    B -->|DVC: load_data_from_sources| C[data/raw/*.parquet]
    C -->|DVC: validate_raw| D{GE gate}
    D -->|pass| E[data/interim/]
    D -->|fail| X[Pipeline blocked]
    E -->|DVC: feature_engineering| F[data/features/]
    F -->|DVC: train / tune / calibrate| G[MLflow Registry]
```

Each arrow is an explicit contract. No stage reads from a previous stage's live output — every handoff is through a versioned, validated artifact.
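
As an illustration of a blocking gate, the sketch below shows a validate_raw-style stage script: DVC treats a non-zero exit status from a stage command as a failure and halts everything downstream. Plain pandas checks stand in for the real Great Expectations suite here, and the paths and column names are hypothetical:

```python
# Sketch of a blocking validation gate run as a DVC stage command.
# Real contracts are Great Expectations suites; these pandas checks
# only illustrate the fail-closed behavior. Names are hypothetical.
import sys

import pandas as pd

df = pd.read_parquet("data/raw/matches.parquet")

failures = []
if df["match_id"].isna().any():
    failures.append("match_id contains nulls")
if df["match_id"].duplicated().any():
    failures.append("match_id is not unique")

if failures:
    # A non-zero exit makes DVC mark the stage failed and stop the run.
    print("validation failed: " + "; ".join(failures), file=sys.stderr)
    sys.exit(1)

# Only validated data crosses the boundary into data/interim/.
df.to_parquet("data/interim/matches.parquet", index=False)
```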


Scope

This section covers the offline data layer only:

  • what the upstream source is and how it is treated,
  • how ingestion and canonicalization work,
  • what the raw export boundary is and why it matters,
  • what datasets exist and how they evolve,
  • how contracts are defined and enforced,
  • how versions are tracked and experiments stay reproducible,
  • how freshness and backfills are managed,
  • what fails and how.

Online feature access at inference time is described in Serving. ML pipeline stages are described in ML Training Pipeline.