Raw Parquet Export (PostgreSQL → MinIO → DVC)¶
Purpose¶
The raw export is the reproducibility boundary of the data layer. It is the explicit handoff point between live canonical storage (PostgreSQL) and immutable ML inputs (DVC-versioned parquet).
Everything upstream of this boundary is ingestion. Everything downstream is reproducible ML.
Why this boundary exists¶
PostgreSQL is correct and canonical, but it is not suitable as a direct ML training source:
- it is a live, mutable store: records may be corrected and the schema may evolve,
- querying it directly from the ML pipeline couples training to ingestion infrastructure,
- re-running a training experiment months later against a live database may not reproduce the same data.
The raw export solves this by creating a point-in-time, immutable snapshot that is content-addressed by DVC and stored in MinIO. Any experiment can be reproduced from that specific snapshot regardless of subsequent changes to PostgreSQL.
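From DVC's point of view, the MinIO side is plain S3-compatible storage. A minimal sketch of the remote setup, assuming an illustrative bucket name and endpoint rather than this project's actual values:

```bash
# Illustrative remote configuration; bucket and endpoint are assumptions.
dvc remote add -d minio s3://dvc-storage
dvc remote modify minio endpointurl http://minio:9000
dvc remote modify --local minio access_key_id "$MINIO_ACCESS_KEY"
dvc remote modify --local minio secret_access_key "$MINIO_SECRET_KEY"
```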
Status: ✅ Implemented — DVC stage `load_data_from_sources`
Export workflow¶
- **DVC stage:** `load_data_from_sources`
- **Input:** PostgreSQL canonical tables
- **Output:** `data/raw/*.parquet` (DVC-tracked, stored in MinIO)
- **Trigger:** `dvc repro` (manual or CI)
- The DVC stage queries PostgreSQL and writes parquet files to `data/raw/`.
- DVC content-addresses the outputs and pushes them to MinIO.
- Git tracks the `.dvc` pointer files; MinIO stores the actual data.
- Any environment with `dvc pull` can retrieve the exact snapshot used for a given Git commit.
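A minimal sketch of how this stage might be declared in `dvc.yaml`; the script path is an assumption, not the project's actual definition:

```yaml
# Hypothetical dvc.yaml excerpt; the script path is illustrative.
stages:
  load_data_from_sources:
    cmd: python src/data/load_data_from_sources.py
    deps:
      - src/data/load_data_from_sources.py
    outs:
      - data/raw    # parquet snapshot; content-addressed and pushed to MinIO
```

Note that the live PostgreSQL tables cannot be listed under `deps`, since DVC tracks files rather than databases. This is consistent with the export being an explicit action rather than a stage DVC can invalidate automatically.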
The export stage runs on `dvc repro`, not automatically after an Airflow ingestion run. This is intentional: export is an operator action, not a side effect of ingestion.
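The operator flow is then the standard DVC sequence (stage name as above; no project-specific flags assumed):

```bash
dvc repro load_data_from_sources   # query PostgreSQL, write data/raw/*.parquet
dvc push                           # upload the content-addressed files to MinIO
git add dvc.lock
git commit -m "New raw snapshot"   # pin the snapshot hashes to this commit
```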
Properties of raw datasets¶
| Property | Detail |
|---|---|
| Immutable | Once DVC-tracked, a snapshot is never mutated; new exports create new content hashes |
| Content-addressed | DVC stores a hash of each file's contents (MD5 by default); identical input → identical output hash (illustrated below) |
| Time-partitioned | Files are organized by competition / time range where applicable |
| Schema-documented | Schema for each raw file is defined in Schemas |
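The Immutable and Content-addressed rows are easiest to see in what Git actually tracks. A hypothetical pointer file, with hash, size, and file name illustrative only (for outputs declared in a `dvc.yaml` stage, the same hashes live in `dvc.lock`):

```yaml
# data/raw/matches.parquet.dvc (hypothetical)
outs:
- md5: 3f2c9a0e8b1d4c5f6a7b8c9d0e1f2a3b
  size: 52428800
  path: matches.parquet
```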
Anti-patterns this boundary prevents¶
| Anti-pattern | Why avoided |
|---|---|
| Training directly from PostgreSQL | Non-reproducible; couples ML to live DB |
| Mutable raw datasets | Breaks traceability; past experiments cannot be reproduced |
| Ad-hoc local data dumps | Not tracked by DVC; not accessible in CI or shared environments |
| Scraping output written directly to parquet | Scraping is not deterministic; the export must come from the canonical store |
Connection to downstream stages¶
After a successful export, `dvc repro` automatically chains to the `validate_raw` stage.
The `validate_raw` gate enforces data contracts before any downstream stage can proceed.
A failed validation stops the pipeline. See Contracts.
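The chaining is dependency-driven rather than scheduled: `validate_raw` declares the export output as a dependency, so `dvc repro` runs it whenever the snapshot changes. A hedged sketch, with the validation command and report path assumed:

```yaml
# Hypothetical dvc.yaml excerpt; command and report path are illustrative.
stages:
  validate_raw:
    cmd: python src/data/validate_raw.py
    deps:
      - data/raw                       # the export output; this edge drives the chaining
    outs:
      - reports/validation/raw.json    # validation report
```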
Related¶
- ETL & Ingestion — what populates PostgreSQL before export
- Data Contracts — the `validate_raw` gate applied to these outputs
- Dataset Versioning — how DVC tracks these snapshots
- Architecture: Data & ML Flow — Stage 2