Data Overview

The data layer is the foundation of the Time2Bet MLOps system.

This section documents:

data sources and ingestion mechanisms,
data schemas and lineage,
dataset versioning and reproducibility,
data quality controls and contracts.

The primary goal of the data architecture is to ensure that all downstream ML artifacts are traceable, reproducible, and auditable.

Design principles

Single source of truth: PostgreSQL is the authoritative store for scraped and normalized data.
Immutable raw snapshots: Raw datasets are exported as Parquet snapshots and never mutated.
Explicit contracts: Data quality expectations are formalized and enforced.
Versioned datasets: All datasets used for training are versioned with DVC.
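
To make the "explicit contracts" principle concrete, here is a minimal sketch of what a formalized expectation could look like. It is not the project's actual contract format: the `ColumnRule` class, the `MATCH_CONTRACT` tuple, the `validate_rows` function, and the column names are all hypothetical and chosen only for illustration. It uses the standard library alone and checks that each row has the expected columns, types, and nullability.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ColumnRule:
    """Expectation for a single column (hypothetical contract format)."""
    name: str
    dtype: type
    nullable: bool = False


# Hypothetical contract for a table of matches; column names are illustrative only.
MATCH_CONTRACT = (
    ColumnRule("match_id", int),
    ColumnRule("home_team", str),
    ColumnRule("away_team", str),
    ColumnRule("home_goals", int, nullable=True),
)


def validate_rows(rows: list[dict[str, Any]],
                  contract: tuple[ColumnRule, ...]) -> list[str]:
    """Return human-readable violations; an empty list means the rows pass."""
    violations = []
    for i, row in enumerate(rows):
        for rule in contract:
            if rule.name not in row:
                violations.append(f"row {i}: missing column '{rule.name}'")
                continue
            value = row[rule.name]
            if value is None:
                if not rule.nullable:
                    violations.append(f"row {i}: '{rule.name}' must not be null")
            elif not isinstance(value, rule.dtype):
                violations.append(
                    f"row {i}: '{rule.name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
    return violations
```

A pipeline step could call `validate_rows` on a batch before export and fail the run when the returned list is non-empty, which is one way to turn an expectation into something enforced rather than merely documented.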

Data lifecycle (high level)

flowchart LR
    A[WhoScored.com] --> B[Scraping]
    B --> C[(PostgreSQL)]
    C --> D[Raw Parquet Export]
    D --> E[MinIO / S3]
    E --> F[DVC Versioned Data]
    F --> G[ML Pipelines]
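
The export stage in this flow relies on snapshots never being mutated. The sketch below shows one way to get that property, assuming a write-once, content-addressed layout on a local filesystem. It is an illustration only, not the system's export code: `write_snapshot`, the directory layout, and the `.parquet` file naming are hypothetical, and the real pipeline targets MinIO / S3 rather than local files.

```python
import hashlib
from pathlib import Path


def write_snapshot(payload: bytes, root: Path, dataset: str) -> Path:
    """Store payload under a name derived from its SHA-256 digest.

    `payload` stands in for the bytes of an exported Parquet file.
    An existing snapshot is never overwritten.
    """
    digest = hashlib.sha256(payload).hexdigest()
    target = root / dataset / f"{digest}.parquet"
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        # Mode "xb" raises FileExistsError if the file is already there,
        # so a stored snapshot can never be modified by a later export.
        with open(target, "xb") as fh:
            fh.write(payload)
    except FileExistsError:
        # Same digest means same content; the existing file is kept as-is.
        pass
    return target
```

Because the file name is derived from the content, exporting identical data twice yields the same path, while any change in the data produces a new file next to the old one. That makes it straightforward to pin a training run to an exact snapshot.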

Scope

This section focuses on offline data pipelines. Online feature access and inference-time data handling are described in the ML and Serving sections.