Raw Parquet Export (Postgres → MinIO → DVC)

Motivation

ML pipelines should operate on:

  • immutable inputs,
  • snapshot-based datasets,
  • storage-efficient formats.

For these reasons, raw data is exported as Parquet snapshots rather than read live from the database.


Export workflow

  1. Airflow extracts data from PostgreSQL.
  2. Data is written as Parquet files to MinIO (S3-compatible object storage).
  3. DVC tracks exported datasets as versioned artifacts.
  4. Local and CI environments retrieve data via dvc pull.

Properties of raw datasets

  • Immutable once exported.
  • Partitioned by time or competition where applicable.
  • Schema documented and versioned.

Benefits

  • Clear boundary between ingestion and ML.
  • Reproducible downstream experiments.
  • Efficient storage and transfer.

Anti-patterns avoided

  • Training directly from PostgreSQL.
  • Mutable raw datasets.
  • Ad-hoc local data dumps.