Raw Parquet Export (PostgreSQL → MinIO → DVC)

Purpose

The raw export is the reproducibility boundary of the data layer. It is the explicit handoff point between live canonical storage (PostgreSQL) and immutable ML inputs (DVC-versioned parquet).

Everything upstream of this boundary is ingestion. Everything downstream is reproducible ML.


Why this boundary exists

PostgreSQL is correct and canonical, but it is not suitable as a direct ML training source:

  • it is a live, mutable store: records may be corrected and the schema may evolve,
  • querying it directly from the ML pipeline couples training to ingestion infrastructure,
  • re-running a training experiment months later against a live DB may not reproduce the same data.

The raw export solves this by creating a point-in-time, immutable snapshot that is content-addressed by DVC and stored in MinIO. Any experiment can be reproduced from that specific snapshot regardless of subsequent changes to PostgreSQL.
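
In practice, DVC treats MinIO as an S3-compatible remote. A minimal sketch of that wiring, assuming an illustrative bucket name and endpoint (the project's real values will differ):

  # Register MinIO as the default DVC remote (bucket and endpoint are illustrative)
  dvc remote add -d minio s3://dvc-storage/raw
  dvc remote modify minio endpointurl http://minio:9000

  # DVC picks up credentials from the standard AWS environment variables
  export AWS_ACCESS_KEY_ID=...
  export AWS_SECRET_ACCESS_KEY=...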

Status: ✅ Implemented — DVC stage load_data_from_sources


Export workflow

DVC stage: load_data_from_sources
  Input:  PostgreSQL canonical tables
  Output: data/raw/*.parquet  (DVC-tracked, stored in MinIO)
  Trigger: dvc repro (manual or CI)
  1. The DVC stage queries PostgreSQL and writes parquet files to data/raw/.
  2. DVC content-addresses the outputs and pushes them to MinIO.
  3. Git tracks the .dvc pointer files; MinIO stores the actual data.
  4. Any environment with dvc pull can retrieve the exact snapshot used for a given Git commit.
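
In dvc.yaml terms, the stage looks roughly like the sketch below. The script path is an assumption, and always_changed: true is one way to tell DVC that the stage reads from a live database rather than from a file dependency:

  stages:
    load_data_from_sources:
      cmd: python scripts/load_data_from_sources.py  # script path is illustrative
      always_changed: true   # no file deps; the source is a live database
      outs:
        - data/raw           # content-addressed by DVC, pushed to MinIO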

The export stage runs on dvc repro, not automatically after an Airflow ingestion run. This is intentional: export is an operator action, not a side effect of ingestion.
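
For reference, the command behind such a stage can be a small script along the lines of the sketch below, assuming pandas and SQLAlchemy; the table list and DSN variable are illustrative, not the project's actual values:

  import os

  import pandas as pd
  from sqlalchemy import create_engine

  # Illustrative values; the real table list and DSN live in project config
  TABLES = ["matches", "players"]
  ENGINE = create_engine(os.environ["POSTGRES_DSN"])

  def export_raw(out_dir: str = "data/raw") -> None:
      """Snapshot each canonical table to a parquet file for DVC to track."""
      os.makedirs(out_dir, exist_ok=True)
      for table in TABLES:
          df = pd.read_sql_table(table, ENGINE)
          df.to_parquet(f"{out_dir}/{table}.parquet", index=False)

  if __name__ == "__main__":
      export_raw()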


Properties of raw datasets

  • Immutable: once DVC-tracked, a snapshot is never mutated; new exports create new content hashes.
  • Content-addressed: DVC stores a content hash of each file (MD5 by default); identical input produces an identical output hash.
  • Time-partitioned: files are organized by competition / time range where applicable.
  • Schema-documented: the schema for each raw file is defined in Schemas.
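
What Git actually stores is only a small pointer. For stage outputs the hashes land in dvc.lock (for files added with dvc add, in a .dvc file); an illustrative excerpt, with made-up hash and size values:

  # dvc.lock excerpt (hash and size values are made up)
  outs:
    - path: data/raw/matches.parquet
      md5: 3f2a6c0d9e8b7a1c5d4e6f7a8b9c0d1e
      size: 48211937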

Anti-patterns this boundary prevents

  • Training directly from PostgreSQL: non-reproducible; couples ML to the live DB.
  • Mutable raw datasets: break traceability; past experiments cannot be reproduced.
  • Ad-hoc local data dumps: not tracked by DVC; not accessible in CI or shared environments.
  • Scraping output written directly to parquet: scraping is not deterministic; the export must come from the canonical store.

Connection to downstream stages

After a successful export, dvc repro automatically chains to validate_raw:

load_data_from_sources → validate_raw → preprocessing → feature_engineering → ...

The validate_raw gate enforces data contracts before any downstream stage can proceed. A failed validation stops the pipeline. See Contracts.
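
The ordering is expressed through matching outs and deps: validate_raw declares data/raw as a dependency, so dvc repro cannot run it before the export has produced that output. A sketch, with the validation command as an assumption:

  stages:
    load_data_from_sources:
      cmd: python scripts/load_data_from_sources.py
      outs:
        - data/raw
    validate_raw:
      cmd: python scripts/validate_raw.py  # command is illustrative
      deps:
        - data/raw   # matches the export's outs, so DVC orders the stages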