Raw Parquet Export (Postgres → MinIO → DVC)

Motivation

ML pipelines should operate on:

  • immutable inputs,
  • snapshot-based datasets,
  • storage-efficient formats.

For these reasons, raw data is exported as Parquet snapshots rather than read live from the database.


Export workflow

  1. Airflow extracts data from PostgreSQL.
  2. Data is written as Parquet files to MinIO (S3-compatible object storage).
  3. DVC tracks exported datasets as versioned artifacts.
  4. Local and CI environments retrieve data via dvc pull.

Properties of raw datasets

  • Immutable once exported.
  • Partitioned by time or competition where applicable.
  • Schema documented and versioned.

Benefits

  • Clear boundary between ingestion and ML.
  • Reproducible downstream experiments.
  • Efficient storage and transfer.

Anti-patterns avoided

  • Training directly from PostgreSQL.
  • Mutable raw datasets.
  • Ad-hoc local data dumps.