Data Overview

The data layer is the foundation of the Time2Bet MLOps system.

This section documents:

  • data sources and ingestion mechanisms,
  • data schemas and lineage,
  • dataset versioning and reproducibility,
  • data quality controls and contracts.

The primary goal of the data architecture is to ensure that all downstream ML artifacts are traceable, reproducible, and auditable.


Design principles

  • Single source of truth: PostgreSQL is the authoritative store for scraped and normalized data.

  • Immutable raw snapshots: Raw datasets are exported as Parquet snapshots and never mutated.

  • Explicit contracts: Data quality expectations are formalized and enforced.

  • Versioned datasets: All datasets used for training are versioned with DVC.
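The "immutable raw snapshots" principle can be sketched without the real Postgres-to-Parquet export. The table name, directory layout, and JSON-lines format below are illustrative assumptions (the actual pipeline writes Parquet); the point is the convention: every export produces a fresh timestamped file, and existing snapshots are never overwritten.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path
from tempfile import mkdtemp

def write_snapshot(rows, table: str, root: Path) -> Path:
    """Write rows as a new timestamped snapshot and mark it read-only.

    Illustrative stand-in for the Parquet export: each call creates a
    new file; an existing snapshot is never mutated or overwritten.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    path = root / table / f"{stamp}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("x", encoding="utf-8") as f:  # "x" fails if the file exists
        for row in rows:
            f.write(json.dumps(row, sort_keys=True) + "\n")
    os.chmod(path, 0o444)  # read-only on disk as a second line of defense
    return path

root = Path(mkdtemp())
rows = [{"match_id": 1, "home_goals": 2}, {"match_id": 2, "home_goals": 0}]
p1 = write_snapshot(rows, "matches_raw", root)
p2 = write_snapshot(rows, "matches_raw", root)  # a re-export is a new file
```

Because downstream ML artifacts reference a specific snapshot file, a re-export never silently changes what a past training run saw.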
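The "explicit contracts" principle amounts to a validation step that runs before a snapshot is accepted. The column names and rules below are hypothetical, not the project's actual contract; they only show the shape of such a check.

```python
from typing import Any

# Hypothetical contract for a scraped-matches table: required columns,
# expected types, and simple range rules. The real contract would live
# in configuration and be enforced in the ingestion pipeline.
CONTRACT = {
    "match_id": {"required": True, "type": int},
    "home_goals": {"required": True, "type": int, "min": 0},
    "away_goals": {"required": True, "type": int, "min": 0},
}

def check_contract(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable violations; an empty list means the data passes."""
    violations = []
    for i, row in enumerate(rows):
        for col, rule in CONTRACT.items():
            if col not in row or row[col] is None:
                if rule.get("required"):
                    violations.append(f"row {i}: missing required column {col!r}")
                continue
            value = row[col]
            if not isinstance(value, rule["type"]):
                violations.append(f"row {i}: {col!r} has type {type(value).__name__}")
            elif "min" in rule and value < rule["min"]:
                violations.append(f"row {i}: {col!r}={value} below minimum {rule['min']}")
    return violations

good = [{"match_id": 1, "home_goals": 2, "away_goals": 1}]
bad = [{"match_id": 2, "home_goals": -1}]
```

A non-empty result would block the snapshot from being published to downstream pipelines rather than letting bad data propagate.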


Data lifecycle (high level)

```mermaid
flowchart LR
    A[WhoScored.com] --> B[Scraping]
    B --> C[(PostgreSQL)]
    C --> D[Raw Parquet Export]
    D --> E[MinIO / S3]
    E --> F[DVC Versioned Data]
    F --> G[ML Pipelines]
```
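The last hop of the lifecycle, DVC versioning, identifies each snapshot by content rather than by name: the file's MD5 digest is its version, so identical bytes always resolve to the same version. The sketch below imitates that idea with a two-character-prefix cache directory; the layout is an illustration of the concept, not a guarantee about DVC's actual on-disk format.

```python
import hashlib
import shutil
from pathlib import Path
from tempfile import mkdtemp

def cache_snapshot(path: Path, cache_root: Path) -> str:
    """Store a file in a content-addressed cache and return its digest.

    Mirrors the idea behind DVC's cache: the MD5 digest of the bytes
    identifies the version, so re-adding identical data is a no-op.
    """
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    target = cache_root / digest[:2] / digest[2:]  # two-char prefix dirs
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return digest

work = Path(mkdtemp())
snap = work / "matches_raw.parquet"
snap.write_bytes(b"pretend parquet bytes")
cache = work / "cache"
d1 = cache_snapshot(snap, cache)
d2 = cache_snapshot(snap, cache)  # identical content, identical version
```

This is what makes training runs reproducible: a pipeline pins a digest, and that digest can only ever resolve to the exact bytes it was computed from.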

Scope

This section focuses on offline data pipelines. Online feature access and inference-time data handling are described in the ML and Serving sections.