Raw Parquet Export (PostgreSQL → MinIO → DVC)

Purpose

The raw export is the reproducibility boundary of the data layer. It is the explicit handoff point between live canonical storage (PostgreSQL) and immutable ML inputs (DVC-versioned parquet).

Everything upstream of this boundary is ingestion. Everything downstream is reproducible ML.


Why this boundary exists

PostgreSQL is correct and canonical, but it is not suitable as a direct ML training source:

  • it is a live, mutable store: records may be corrected and the schema may evolve,
  • querying it directly from the ML pipeline couples training to ingestion infrastructure,
  • re-running a training experiment months later against a live DB may not reproduce the same data.

The raw export solves this by creating a point-in-time, immutable snapshot that is content-addressed by DVC and stored in MinIO. Any experiment can be reproduced from that specific snapshot regardless of subsequent changes to PostgreSQL.
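
In practice, DVC treats MinIO as an S3-compatible remote. A minimal sketch of that wiring, assuming an illustrative bucket name and endpoint (the project's real values will differ):

  # Register MinIO as the default DVC remote (bucket and endpoint are illustrative)
  dvc remote add -d minio s3://dvc-storage/raw
  dvc remote modify minio endpointurl http://minio:9000

  # DVC picks up credentials from the standard AWS environment variables
  export AWS_ACCESS_KEY_ID=...
  export AWS_SECRET_ACCESS_KEY=...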

Status: ✅ Implemented — DVC stage load_data_from_sources


Export workflow

DVC stage: load_data_from_sources
  Input:  PostgreSQL canonical tables
  Output: data/raw/*.parquet  (DVC-tracked, stored in MinIO)
  Trigger: dvc repro (manual or CI)
  1. The DVC stage queries PostgreSQL and writes parquet files to data/raw/.
  2. DVC content-addresses the outputs and pushes them to MinIO.
  3. Git tracks the .dvc pointer files; MinIO stores the actual data.
  4. Any environment with dvc pull can retrieve the exact snapshot used for a given Git commit.
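
In dvc.yaml terms, the stage looks roughly like the sketch below. The script path is an assumption, and always_changed: true is one way to tell DVC that the stage reads from a live database rather than from a file dependency:

  stages:
    load_data_from_sources:
      cmd: python scripts/load_data_from_sources.py  # script path is illustrative
      always_changed: true   # no file deps; the source is a live database
      outs:
        - data/raw           # content-addressed by DVC, pushed to MinIO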

The export stage runs on dvc repro, not automatically after an Airflow ingestion run. This is intentional: export is an operator action, not a side effect of ingestion.
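
For reference, the command behind such a stage can be a small script along the lines of the sketch below, assuming pandas and SQLAlchemy; the table list and DSN variable are illustrative, not the project's actual values:

  import os

  import pandas as pd
  from sqlalchemy import create_engine

  # Illustrative values; the real table list and DSN live in project config
  TABLES = ["matches", "players"]
  ENGINE = create_engine(os.environ["POSTGRES_DSN"])

  def export_raw(out_dir: str = "data/raw") -> None:
      """Snapshot each canonical table to a parquet file for DVC to track."""
      os.makedirs(out_dir, exist_ok=True)
      for table in TABLES:
          df = pd.read_sql_table(table, ENGINE)
          df.to_parquet(f"{out_dir}/{table}.parquet", index=False)

  if __name__ == "__main__":
      export_raw()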


Properties of raw datasets

  • Immutable: once DVC-tracked, a snapshot is never mutated; new exports create new content hashes.
  • Content-addressed: DVC stores a content hash of each file (MD5 by default); identical input produces an identical output hash.
  • Time-partitioned: files are organized by competition / time range where applicable.
  • Schema-documented: the schema for each raw file is defined in Schemas.
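
What Git actually stores is only a small pointer. For stage outputs the hashes land in dvc.lock (for files added with dvc add, in a .dvc file); an illustrative excerpt, with made-up hash and size values:

  # dvc.lock excerpt (hash and size values are made up)
  outs:
    - path: data/raw/matches.parquet
      md5: 3f2a6c0d9e8b7a1c5d4e6f7a8b9c0d1e
      size: 48211937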

Anti-patterns this boundary prevents

  • Training directly from PostgreSQL: non-reproducible; couples ML to the live DB.
  • Mutable raw datasets: break traceability; past experiments cannot be reproduced.
  • Ad-hoc local data dumps: not tracked by DVC; not accessible in CI or shared environments.
  • Scraping output written directly to parquet: scraping is not deterministic; the export must come from the canonical store.

Connection to downstream stages

After a successful export, dvc repro automatically chains to validate_raw:

load_data_from_sources → validate_raw → preprocessing → feature_engineering → ...

The validate_raw gate enforces data contracts before any downstream stage can proceed. A failed validation stops the pipeline. See Contracts.
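
The ordering is expressed through matching outs and deps: validate_raw declares data/raw as a dependency, so dvc repro cannot run it before the export has produced that output. A sketch, with the validation command as an assumption:

  stages:
    load_data_from_sources:
      cmd: python scripts/load_data_from_sources.py
      outs:
        - data/raw
    validate_raw:
      cmd: python scripts/validate_raw.py  # command is illustrative
      deps:
        - data/raw   # matches the export's outs, so DVC orders the stages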