
Dataset Versioning & Reproducibility

Purpose

This page explains how dataset versioning works in this system, what guarantees it provides, and how it connects to experiment reproducibility. It is specific to the tools and workflow used here — not a general DVC tutorial.


What is versioned and by what tool

| Artifact | Tool | Where stored | What is tracked |
|---|---|---|---|
| Raw parquet snapshots (data/raw/) | DVC | MinIO (S3-compatible) | Content hash → Git .dvc pointer |
| Interim datasets (data/interim/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Feature matrix (data/features/) | DVC | MinIO | Content hash → Git .dvc pointer |
| Pipeline stage definitions | Git | Repository | dvc.yaml stage graph |
| Parameters (params.yaml) | Git | Repository | Input to DVC stages; tracked as dep |
| ML experiment runs | MLflow | MLflow Tracking Server | Metrics, params, artifacts per run |
| Registered models | MLflow Registry | MLflow + MinIO | Model artifacts under versioned alias |

DVC manages data. Git manages code, config, and DVC pointer files. MLflow manages experiment runs and model lifecycle.


How DVC versioning works in this pipeline

Every parquet file tracked by DVC has a corresponding .dvc pointer file in Git. This pointer records the content hash and the MinIO remote path.
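
A pointer file is a small YAML document. A hypothetical data/raw/events.parquet.dvc might look like this (file name and hash are invented for illustration; the field layout is standard DVC):

```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 48211337
  path: events.parquet
```

Git versions this tiny file; the parquet bytes themselves live in MinIO under a path derived from the md5 hash.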

When a DVC stage runs:

  1. DVC computes content hashes of the stage's declared dependencies, parameters, and outputs.
  2. If every hash matches what is recorded in the Git-tracked lock file (dvc.lock, alongside the .dvc pointers), the stage is skipped.
  3. If any hash differs or is missing, the stage re-runs and the recorded hashes are updated.
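
The skip decision is, at its core, a pure function of content hashes. A minimal sketch of that idea (this is an illustration, not DVC's actual implementation — DVC also handles directories, parameter keys, and remote state):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Content hash of a file, analogous to the md5 DVC records for a tracked file."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_is_up_to_date(tracked: list[Path], recorded: dict[str, str]) -> bool:
    """A stage is skipped only when every tracked file exists and its
    current content hash matches the hash recorded at the last run."""
    return all(
        p.exists() and file_md5(p) == recorded.get(str(p))
        for p in tracked
    )
```

Because the check depends only on file contents, renaming a file back or re-downloading identical bytes still counts as "up to date" — which is exactly why a Git commit plus the content store pins the dataset state.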

This means:

  • a given Git commit uniquely identifies the dataset state for that pipeline run,
  • dvc pull (or dvc checkout, once the local cache is populated) restores the exact files to data/ from MinIO,
  • re-running dvc repro at the same commit produces the same outputs (the pipeline is deterministic).

Reproducibility semantics

Given:

  • a Git commit hash,
  • access to the MinIO remote,
  • the same params.yaml at that commit,

any past experiment can be reproduced:

git checkout <commit>
dvc pull
dvc repro

This is the reproducibility guarantee. It holds because:

  • all pipeline inputs (parquet files, params) are content-addressed,
  • all feature and preprocessing logic is deterministic (pure functions, no random state in data stages),
  • all randomness in model training is seeded via params.yaml.
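
For instance, a params.yaml at a given commit might pin the training seed along with the other stage inputs (keys and values here are illustrative — the actual parameter names depend on this pipeline's dvc.yaml):

```yaml
train:
  seed: 42
  n_estimators: 300
features:
  window_days: 30
```

Because params.yaml is a declared DVC dependency, changing the seed changes the parameter hash, forcing the affected stages to re-run rather than silently reusing stale outputs.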

How DVC and MLflow connect

Each MLflow experiment run records:

  • the DVC dataset version (via run tags/params, where the pipeline logs them),
  • training parameters from params.yaml,
  • evaluation metrics.
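
One lightweight way to record the dataset version on a run is to read the content hash out of the pointer file and attach it as a run tag. A sketch, assuming the pointer lives at data/features.dvc and the tag name dataset_md5 (both are illustrative, not fixed by this system):

```python
import re
from pathlib import Path

def dataset_version(dvc_pointer: Path) -> str:
    """Extract the md5 content hash from a DVC pointer file.
    Pointer files are small YAML docs containing an 'md5: <hash>' line."""
    m = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]{32})", dvc_pointer.read_text(), re.M)
    if m is None:
        raise ValueError(f"no md5 hash found in {dvc_pointer}")
    return m.group(1)

# Inside the training script, with an active MLflow run, e.g.:
#   mlflow.set_tag("dataset_md5", dataset_version(Path("data/features.dvc")))
```

With that tag in place, an MLflow run can be matched to the exact content-addressed parquet it consumed without walking back through Git first.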

The model artifact registered in the MLflow Registry was produced by a specific training run, which consumed a specific DVC-versioned dataset. Following the chain:

MLflow model version
    → MLflow run ID
    → params.yaml commit
    → DVC .dvc pointer files at that commit
    → MinIO content-addressed parquet

What is NOT versioned by DVC

  • PostgreSQL data — PostgreSQL is the live canonical store; its history is in database WAL, not DVC. The raw parquet export is the version boundary.
  • GE (Great Expectations) validation reports — HTML artifacts are produced per run but not tracked as versioned datasets.
  • Metadata JSON files (data/metadata/) — regenerated on each preprocessing run; not independently versioned.

Restore semantics

To restore a specific dataset version:

# restore to a specific git commit
git checkout <commit>
dvc pull               # download exact data matching that commit from MinIO

# verify pipeline state
dvc status             # should report "Data and pipelines are up to date"

To reproduce the full pipeline from that point:

dvc repro

CI interaction

CI runs dvc pull to fetch the tracked dataset versions matching the commit under test. Contract validation (validate_raw, etc.) runs in CI as part of dvc repro. Full retraining does not run in CI by default — only smoke pipeline stages.
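
The CI steps described above could be sketched roughly as follows, assuming a GitHub-Actions-style runner; the job name, stage names (validate_raw, smoke), and credential wiring are illustrative, not taken from this repository's actual config:

```yaml
ci-smoke:
  steps:
    - run: dvc pull                  # fetch the dataset versions pinned by this commit
    - run: dvc repro validate_raw    # contract validation stage and its dependencies
    - run: dvc repro smoke           # smoke pipeline stages only; no full retraining
```

Note that dvc repro with a stage name runs only that stage and its upstream dependencies, which is what keeps the CI run cheap relative to a full retrain.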

(🚧 Partial — CI runs contract and smoke checks; full retraining is a manual operator action)