Data Layer

This section documents the offline data subsystem that feeds reproducible ML in SoccerPredictAI. It is not a second architecture overview. For system boundaries, component responsibilities, and end-to-end flow, see Architecture.


Purpose

The data layer exists to bridge an unstable, uncontrolled external source (WhoScored.com) and a reproducible, contract-governed ML pipeline (DVC). Its core functions are to:

  1. Acquire raw match data from an unreliable external source.
  2. Canonicalize that data into PostgreSQL.
  3. Export immutable parquet snapshots to MinIO.
  4. Version those snapshots so ML experiments are reproducible and traceable.
  5. Gate downstream ML with validated data contracts.

Every boundary in this layer exists to enforce those contracts. The data layer does not train models, serve predictions, or manage infrastructure.
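
A minimal sketch of the export boundary (steps 2–4), assuming a pandas + SQLAlchemy stack; the connection string, table name, and output path are illustrative, not the project's real ones:

```python
# Hypothetical export step: read canonical rows from PostgreSQL and
# write an immutable parquet snapshot for DVC to track.
import pandas as pd
from sqlalchemy import create_engine

# Illustrative DSN; real connection settings live elsewhere.
engine = create_engine("postgresql://user:pass@localhost:5432/soccer")

# Hypothetical canonical table; the real schema may differ.
df = pd.read_sql("SELECT * FROM matches", engine)

# The snapshot is never mutated after this write; a re-export produces
# a file that DVC records as a new dataset version.
df.to_parquet("data/raw/matches.parquet", index=False)
```

Tracking the file (for example with dvc add data/raw/matches.parquet, or as a declared stage output) records its content hash, which is what makes the snapshot immutable in practice: any later change shows up as a new version, never as a silent overwrite.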


Design principles

These are not aspirations — they are active constraints that shape how each stage is built.

| Principle | Constraint it enforces |
| --- | --- |
| Scraping is isolated from ML reproducibility | ML pipelines never read from live PostgreSQL; they read only from versioned parquet snapshots |
| PostgreSQL is canonical, not the training source | Canonical status is about correctness and dedup, not ML input; the raw export stage creates the reproducibility boundary |
| Raw snapshots are immutable | Once a parquet file is exported and DVC-tracked, it is never mutated; new exports create new DVC versions |
| Contracts are explicit and blocking | Great Expectations suites are enforced as DVC stage gates; a failed check stops the pipeline |
| Lineage is full-stack | Every training run can be traced back through: MLflow run → DVC dataset version → parquet snapshot → PostgreSQL scraping run |
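
One way to realize the lineage principle is to stamp each MLflow run with the exact data version it consumed. The sketch below assumes the snapshot path and the tag/parameter names shown, which are hypothetical; the project's actual conventions may differ:

```python
# Sketch: tie an MLflow run back to the parquet snapshot it trained on.
import hashlib
import subprocess

import mlflow

def file_md5(path: str) -> str:
    """Content hash of the snapshot actually read by this run."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with mlflow.start_run():
    # The git commit pins dvc.lock, i.e. the DVC dataset version.
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    mlflow.set_tag("git_commit", commit)
    mlflow.log_param("raw_snapshot_md5", file_md5("data/raw/matches.parquet"))
    # ... training proceeds here ...
```

With both the commit and the content hash attached, a run in the MLflow UI can be traced back to the exact snapshot it consumed.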

Data lifecycle

```mermaid
flowchart LR
    A[WhoScored.com] -->|Airflow + Selenoid| B[(PostgreSQL)]
    B -->|DVC: load_data_from_sources| C[data/raw/*.parquet]
    C -->|DVC: validate_raw| D{GE gate}
    D -->|pass| E[data/interim/]
    D -->|fail| X[Pipeline blocked]
    E -->|DVC: feature_engineering| F[data/features/]
    F -->|DVC: train / tune / calibrate| G[MLflow Registry]
```

Each arrow is an explicit contract. No stage reads from a previous stage's live output — every handoff is through a versioned, validated artifact.
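
As an illustration of a blocking gate, the sketch below shows a validate_raw-style stage script: DVC treats a non-zero exit status from a stage command as a failure and halts everything downstream. Plain pandas checks stand in for the real Great Expectations suite here, and the paths and column names are hypothetical:

```python
# Sketch of a blocking validation gate run as a DVC stage command.
# Real contracts are Great Expectations suites; these pandas checks
# only illustrate the fail-closed behavior. Names are hypothetical.
import sys

import pandas as pd

df = pd.read_parquet("data/raw/matches.parquet")

failures = []
if df["match_id"].isna().any():
    failures.append("match_id contains nulls")
if df["match_id"].duplicated().any():
    failures.append("match_id is not unique")

if failures:
    # A non-zero exit makes DVC mark the stage failed and stop the run.
    print("validation failed: " + "; ".join(failures), file=sys.stderr)
    sys.exit(1)

# Only validated data crosses the boundary into data/interim/.
df.to_parquet("data/interim/matches.parquet", index=False)
```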


Scope

This section covers the offline data layer only:

  • what the upstream source is and how it is treated,
  • how ingestion and canonicalization work,
  • what the raw export boundary is and why it matters,
  • what datasets exist and how they evolve,
  • how contracts are defined and enforced,
  • how versions are tracked and experiments stay reproducible,
  • how freshness and backfills are managed,
  • what fails and how.

Online feature access at inference time is described in Serving. ML pipeline stages are described in ML Training Pipeline.