# Raw Parquet Export (Postgres → MinIO → DVC)
## Motivation
ML pipelines should operate on:

- immutable inputs,
- snapshot-based datasets,
- storage-efficient columnar formats.

For these reasons, raw data is exported as Parquet snapshots.
## Export workflow
1. Airflow extracts data from PostgreSQL.
2. The data is written as Parquet files to MinIO (S3-compatible object storage).
3. DVC tracks the exported datasets as versioned artifacts.
4. Local and CI environments retrieve data via `dvc pull`.
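The MinIO/DVC wiring in steps 2–4 can be sketched as a `.dvc/config` fragment. The remote name, bucket, and endpoint below are illustrative assumptions, not this project's actual values:

```ini
[core]
    remote = minio
['remote "minio"']
    # Hypothetical bucket holding the raw Parquet snapshots.
    url = s3://raw-data
    # MinIO endpoint; DVC speaks the S3 API against it.
    endpointurl = http://minio:9000
```

A remote like this is typically created with `dvc remote add -d minio s3://raw-data` followed by `dvc remote modify minio endpointurl http://minio:9000`; after that, `dvc push` uploads tracked snapshots to MinIO and `dvc pull` restores them locally or in CI.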
## Properties of raw datasets
- Immutable once exported.
- Partitioned by time or competition where applicable.
- Schema documented and versioned.
## Benefits
- Clear boundary between ingestion and ML.
- Reproducible downstream experiments.
- Efficient storage and transfer.
## Anti-patterns avoided
- Training models directly against the live PostgreSQL database.
- Mutating raw datasets after export.
- Ad-hoc local data dumps.