Data Overview
The data layer is the foundation of the Time2Bet MLOps system.
This section documents:
- data sources and ingestion mechanisms,
- data schemas and lineage,
- dataset versioning and reproducibility,
- data quality controls and contracts.
The primary goal of the data architecture is to ensure that all downstream ML artifacts are traceable, reproducible, and auditable.
Design principles
- Single source of truth: PostgreSQL is the authoritative store for scraped and normalized data.
- Immutable raw snapshots: Raw datasets are exported as parquet snapshots and never mutated.
- Explicit contracts: Data quality expectations are formalized and enforced.
- Versioned datasets: All datasets used for training are versioned with DVC.
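To make the "explicit contracts" principle concrete, here is a minimal sketch of what a formalized, enforceable expectation could look like. It is not the project's actual contract implementation: the names `ColumnRule` and `validate_rows`, and the example columns `match_id`, `home_team`, and `home_goals`, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ColumnRule:
    """One contract rule (hypothetical): expected type and nullability of a column."""
    name: str
    dtype: type
    nullable: bool = False


def validate_rows(rows: list[dict[str, Any]], rules: list[ColumnRule]) -> list[str]:
    """Return human-readable violations; an empty list means the contract holds."""
    violations: list[str] = []
    for i, row in enumerate(rows):
        for rule in rules:
            if rule.name not in row:
                violations.append(f"row {i}: missing column '{rule.name}'")
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: null in non-nullable column '{rule.name}'")
                continue
            if not isinstance(value, rule.dtype):
                violations.append(
                    f"row {i}: column '{rule.name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations


# Hypothetical contract for a table of matches.
MATCH_CONTRACT = [
    ColumnRule("match_id", int),
    ColumnRule("home_team", str),
    ColumnRule("home_goals", int, nullable=True),
]

if __name__ == "__main__":
    sample = [{"match_id": 1, "home_team": "Arsenal", "home_goals": None}]
    problems = validate_rows(sample, MATCH_CONTRACT)
    print("contract ok" if not problems else "\n".join(problems))
```

Returning a list of violations instead of raising on the first one lets a pipeline step report every problem in a batch before deciding whether to fail the run.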
Data lifecycle (high level)
flowchart LR
A[WhoScored.com] --> B[Scraping]
B --> C[(PostgreSQL)]
C --> D[Raw Parquet Export]
D --> E[MinIO / S3]
E --> F[DVC Versioned Data]
F --> G[ML Pipelines]
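The "Raw Parquet Export" step in the flow above is where the immutability rule has to be enforced. The following is a minimal sketch of one way to do that at write time, using only the Python standard library. It assumes a content-addressed file name (SHA-256 of the payload), which is an illustrative choice and not the documented Time2Bet layout; `write_snapshot` and the `matches` dataset name are hypothetical. In a real pipeline the payload would be the serialized parquet bytes and the destination would be the MinIO / S3 bucket instead of a local directory.

```python
import hashlib
from pathlib import Path


def write_snapshot(payload: bytes, root: Path, dataset: str) -> Path:
    """Store a raw snapshot under a name derived from its content (hypothetical helper).

    Identical content always maps to the same path, so re-exporting is a no-op,
    and an existing snapshot is never overwritten.
    """
    digest = hashlib.sha256(payload).hexdigest()
    target = root / dataset / f"{digest}.parquet"
    if target.exists():
        # Same bytes were already exported; keep the existing file untouched.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" creates the file exclusively and raises FileExistsError
    # if another writer created it in the meantime.
    with open(target, "xb") as fh:
        fh.write(payload)
    return target


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        first = write_snapshot(b"example parquet bytes", Path(tmp), "matches")
        second = write_snapshot(b"example parquet bytes", Path(tmp), "matches")
        print(first == second)
```

Because the path is a function of the content, a changed export always lands in a new file, which keeps earlier snapshots available for the DVC-tracked versions that reference them.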
Scope
This section focuses on offline data pipelines. Online feature access and inference-time data handling are described in the ML and Serving sections.