
Dataset Versioning & Reproducibility

Purpose

This page explains how dataset versioning works in this system, what guarantees it provides, and how it connects to experiment reproducibility. It is specific to the tools and workflow used here — not a general DVC tutorial.


What is versioned and by what tool

| Artifact | Tool | Where stored | What is tracked |
|---|---|---|---|
| Raw parquet snapshots (data/raw/) | DVC | MinIO (S3-compatible) | Content hash → Git .dvc pointer |
| Interim datasets (data/interim/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Feature matrix (data/features/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Pipeline stage definitions | Git | Repository | dvc.yaml stage graph |
| Parameters (params.yaml) | Git | Repository | Input to DVC stages; tracked as dep |
| ML experiment runs | MLflow | MLflow Tracking Server | Metrics, params, artifacts per run |
| Registered models | MLflow Registry | MLflow + MinIO | Model artifacts under versioned alias |

DVC manages data. Git manages code, config, and DVC pointer files. MLflow manages experiment runs and model lifecycle.


How DVC versioning works in this pipeline

Every parquet file tracked by DVC has a corresponding .dvc pointer file in Git. This pointer records the content hash and the MinIO remote path.
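
A pointer file is a small YAML document. A hypothetical data/raw/events.parquet.dvc might look like this (file name and hash are invented for illustration; the field layout is standard DVC):

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 48211337
  path: events.parquet
```

Git versions this tiny file; the parquet bytes themselves live in MinIO under a path derived from the md5 hash.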

When a DVC stage runs:

  1. DVC computes content hashes of the stage's declared dependencies, parameters, and outputs.
  2. If every hash matches what is recorded in the Git-tracked lock file (dvc.lock, alongside the .dvc pointers), the stage is skipped.
  3. If any hash differs or is missing, the stage re-runs and the recorded hashes are updated.
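
The skip decision is, at its core, a pure function of content hashes. A minimal sketch of that idea (this is an illustration, not DVC's actual implementation — DVC also handles directories, parameter keys, and remote state):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Content hash of a file, analogous to the md5 DVC records for a tracked file."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_is_up_to_date(tracked: list[Path], recorded: dict[str, str]) -> bool:
    """A stage is skipped only when every tracked file exists and its
    current content hash matches the hash recorded at the last run."""
    return all(
        p.exists() and file_md5(p) == recorded.get(str(p))
        for p in tracked
    )
```

Because the check depends only on file contents, renaming a file back or re-downloading identical bytes still counts as "up to date" — which is exactly why a Git commit plus the content store pins the dataset state.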

This means:

  • a given Git commit uniquely identifies the dataset state for that pipeline run,
  • dvc pull (or dvc checkout, once the local cache is populated) restores the exact files to data/ from MinIO,
  • re-running dvc repro at the same commit produces the same outputs (the pipeline is deterministic).

Reproducibility semantics

Given:

  • a Git commit hash,
  • access to the MinIO remote,
  • the same params.yaml at that commit,

any past experiment can be reproduced:

git checkout <commit>
dvc pull
dvc repro

This is the reproducibility guarantee. It holds because:

  • all pipeline inputs (parquet files, params) are content-addressed,
  • all feature and preprocessing logic is deterministic (pure functions, no random state in data stages),
  • all randomness in model training is seeded via params.yaml.
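
For instance, a params.yaml at a given commit might pin the training seed along with the other stage inputs (keys and values here are illustrative — the actual parameter names depend on this pipeline's dvc.yaml):

```yaml
train:
  seed: 42
  n_estimators: 300
features:
  window_days: 30
```

Because params.yaml is a declared DVC dependency, changing the seed changes the parameter hash, forcing the affected stages to re-run rather than silently reusing stale outputs.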

How DVC and MLflow connect

Each MLflow experiment run records:

  • the DVC dataset version (via run tags/params, where the pipeline logs them),
  • training parameters from params.yaml,
  • evaluation metrics.
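
One lightweight way to record the dataset version on a run is to read the content hash out of the pointer file and attach it as a run tag. A sketch, assuming the pointer lives at data/features.dvc and the tag name dataset_md5 (both are illustrative, not fixed by this system):

```python
import re
from pathlib import Path

def dataset_version(dvc_pointer: Path) -> str:
    """Extract the md5 content hash from a DVC pointer file.
    Pointer files are small YAML docs containing an 'md5: <hash>' line."""
    m = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})", dvc_pointer.read_text(), re.M)
    if m is None:
        raise ValueError(f"no md5 hash found in {dvc_pointer}")
    return m.group(1)

# Inside the training script, with an active MLflow run, e.g.:
#   mlflow.set_tag("dataset_md5", dataset_version(Path("data/features.dvc")))
```

With that tag in place, an MLflow run can be matched to the exact content-addressed parquet it consumed without walking back through Git first.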

The model artifact registered in the MLflow Registry was produced by a specific training run, which consumed a specific DVC-versioned dataset. Following the chain:

MLflow model version
    → MLflow run ID
    → params.yaml commit
    → DVC .dvc pointer files at that commit
    → MinIO content-addressed parquet

What is NOT versioned by DVC

  • PostgreSQL data — PostgreSQL is the live canonical store; its history is in database WAL, not DVC. The raw parquet export is the version boundary.
  • GE (Great Expectations) validation reports — HTML artifacts are produced per run but not tracked as versioned datasets.
  • Metadata JSON files (data/metadata/) — regenerated on each preprocessing run; not independently versioned.

Restore semantics

To restore a specific dataset version:

# restore to a specific git commit
git checkout <commit>
dvc pull               # download exact data matching that commit from MinIO

# verify pipeline state
dvc status             # should report "Data and pipelines are up to date"

To reproduce the full pipeline from that point:

dvc repro

CI interaction

CI runs dvc pull to fetch the tracked dataset versions matching the commit under test. Contract validation (validate_raw, etc.) runs in CI as part of dvc repro. Full retraining does not run in CI by default — only smoke pipeline stages.
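
The CI steps described above could be sketched roughly as follows, assuming a GitHub-Actions-style runner; the job name, stage names (validate_raw, smoke), and credential wiring are illustrative, not taken from this repository's actual config:

```yaml
ci-smoke:
  steps:
    - run: dvc pull                  # fetch the dataset versions pinned by this commit
    - run: dvc repro validate_raw    # contract validation stage and its dependencies
    - run: dvc repro smoke           # smoke pipeline stages only; no full retraining
```

Note that dvc repro with a stage name runs only that stage and its upstream dependencies, which is what keeps the CI run cheap relative to a full retrain.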

(🚧 Partial — CI runs contract and smoke checks; full retraining is a manual operator action)