Data Layer¶
This section documents the offline data subsystem that feeds reproducible ML in SoccerPredictAI. It is not a second architecture overview. For system boundaries, component responsibilities, and end-to-end flow, see Architecture.
Purpose¶
The data layer bridges an unstable, uncontrolled external source (WhoScored.com) and a reproducible, contract-governed ML pipeline (DVC). Its core responsibilities are to:
- Acquire raw match data from an unreliable external source.
- Canonicalize that data into PostgreSQL.
- Export immutable parquet snapshots to MinIO.
- Version those snapshots so ML experiments are reproducible and traceable.
- Gate downstream ML with validated data contracts.
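The export-and-version step can be sketched in a few lines. This is a dependency-free stand-in, not the real implementation: the actual pipeline writes parquet to MinIO and tracks it with DVC, while here rows are serialized to JSON lines and versioned by content hash so the sketch runs anywhere.

```python
import hashlib
import json
from pathlib import Path

def export_snapshot(rows: list[dict], out_dir: Path, name: str) -> dict:
    """Write an immutable snapshot and record its content hash.

    Hypothetical stand-in for the parquet-to-MinIO export: rows are
    serialized deterministically so identical data always hashes the same.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    payload = "\n".join(json.dumps(r, sort_keys=True) for r in rows).encode()
    digest = hashlib.sha256(payload).hexdigest()
    # Content-addressed filename: re-exporting identical data is a no-op,
    # and changed data produces a new file. Existing snapshots are never
    # mutated -- the same immutability rule the real DVC-tracked exports follow.
    path = out_dir / f"{name}.{digest[:12]}.jsonl"
    if not path.exists():
        path.write_bytes(payload)
    return {"path": str(path), "sha256": digest}
```

Content addressing is what makes "new exports create new versions" automatic: a changed snapshot cannot silently overwrite an old one.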
Every boundary in this layer exists to enforce those contracts. The data layer does not train models, serve predictions, or manage infrastructure.
Design principles¶
These are not aspirations — they are active constraints that shape how each stage is built.
| Principle | Constraint it enforces |
|---|---|
| Scraping is isolated from ML reproducibility | ML pipelines never read from live PostgreSQL; they read only from versioned parquet snapshots |
| PostgreSQL is canonical, not the training source | Canonical status is about correctness and dedup, not ML input; the raw export stage creates the reproducibility boundary |
| Raw snapshots are immutable | Once a parquet file is exported and DVC-tracked, it is never mutated; new exports create new DVC versions |
| Contracts are explicit and blocking | Great Expectations suites are enforced as DVC stage gates; a failed check stops the pipeline |
| Lineage is full-stack | Every training run can be traced back through: MLflow run → DVC dataset version → parquet snapshot → PostgreSQL scraping run |
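The lineage chain in the last row can be sketched as a simple lookup walk. All record shapes and field names below are illustrative, not the real MLflow or DVC metadata schemas; in the real system these mappings live in MLflow, the DVC cache, and PostgreSQL respectively.

```python
def trace_lineage(mlflow_run_id: str,
                  mlflow_runs: dict,
                  dvc_versions: dict,
                  scraping_runs: dict) -> list[str]:
    """Walk MLflow run -> DVC dataset version -> parquet snapshot -> scraping run.

    Each dict is a hypothetical metadata store keyed by identifier.
    A KeyError anywhere means the chain is broken and the run is
    not fully traceable.
    """
    run = mlflow_runs[mlflow_run_id]
    version = dvc_versions[run["dvc_rev"]]
    snapshot = version["snapshot"]
    scrape = scraping_runs[snapshot["scraping_run_id"]]
    return [mlflow_run_id, run["dvc_rev"], snapshot["path"], scrape["id"]]
```

The point of the exercise: because every handoff records the identifier of the artifact it consumed, the walk is a pure lookup with no heuristics.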
Data lifecycle¶
Each arrow is an explicit contract. No stage reads from a previous stage's live output — every handoff is through a versioned, validated artifact.
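A blocking validation gate of this kind can be sketched as follows. This is a dependency-free stand-in for a Great Expectations suite; the check names and row shapes are illustrative, but the failure behavior mirrors a DVC stage gate, where a nonzero exit stops everything downstream.

```python
from typing import Callable

Check = Callable[[list[dict]], bool]

def run_contract_gate(rows: list[dict], checks: dict[str, Check]) -> None:
    """Run every named check against the snapshot; raise on any failure.

    Raising propagates as a nonzero process exit, so downstream
    stages never run on unvalidated data.
    """
    failed = [name for name, check in checks.items() if not check(rows)]
    if failed:
        raise RuntimeError(f"contract gate failed: {failed}")

# Illustrative checks over a match snapshot.
checks = {
    "non_empty": lambda rows: len(rows) > 0,
    "match_id_present": lambda rows: all("match_id" in r for r in rows),
    "goals_non_negative": lambda rows: all(r.get("goals", 0) >= 0 for r in rows),
}
```

Collecting all failures before raising (rather than stopping at the first) keeps the failure report actionable, the same reason validation suites report every failed expectation at once.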
Scope¶
This section covers the offline data layer only:
- what the upstream source is and how it is treated,
- how ingestion and canonicalization work,
- what the raw export boundary is and why it matters,
- what datasets exist and how they evolve,
- how contracts are defined and enforced,
- how versions are tracked and experiments stay reproducible,
- how freshness and backfills are managed,
- what fails and how.
Online feature access at inference time is described in Serving. ML pipeline stages are described in ML Training Pipeline.
Related¶
- Architecture: Data & ML Flow — end-to-end stage-by-stage breakdown
- Architecture: System Boundary — what is inside and outside the system
- ML Training Pipeline — how features flow into model training
- Runbook: Backfills — safe historical reprocessing
- Status — implementation readiness