# Dataset Versioning & Reproducibility

## Purpose
This page explains how dataset versioning works in this system, what guarantees it provides, and how it connects to experiment reproducibility. It is specific to the tools and workflow used here — not a general DVC tutorial.
## What is versioned and by what tool
| Artifact | Tool | Where stored | What is tracked |
|---|---|---|---|
| Raw parquet snapshots (`data/raw/`) | DVC | MinIO (S3-compatible) | Content hash → Git `.dvc` pointer |
| Interim datasets (`data/interim/`) | DVC | MinIO | Content hash → Git `.dvc` pointer |
| Feature matrix (`data/features/`) | DVC | MinIO | Content hash → Git `.dvc` pointer |
| Pipeline stage definitions | Git | Repository | `dvc.yaml` stage graph |
| Parameters (`params.yaml`) | Git | Repository | Input to DVC stages; tracked as dep |
| ML experiment runs | MLflow | MLflow Tracking Server | Metrics, params, artifacts per run |
| Registered models | MLflow Registry | MLflow + MinIO | Model artifacts under versioned alias |
DVC manages data. Git manages code, config, and DVC pointer files. MLflow manages experiment runs and model lifecycle.
## How DVC versioning works in this pipeline
Every parquet file tracked by DVC has a corresponding `.dvc` pointer file in Git. The pointer records the file's content hash, which also determines the object's path in the MinIO remote.
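A pointer file is a small YAML document. A minimal sketch (the hash, size, and filename below are illustrative, not real values):

```yaml
# data/raw/events.dvc: illustrative pointer produced by `dvc add data/raw/events.parquet`
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
  size: 48211337
  path: events.parquet
```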
When a DVC stage runs:

- It computes hashes of its declared dependencies and outputs.
- If the hashes match what is recorded in the Git-tracked `.dvc` file, the stage is skipped.
- If a hash differs (or is missing), the stage re-runs and the `.dvc` file is updated.
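The stage graph itself is declared in `dvc.yaml`. A minimal sketch of one stage; the command, stage name, and file paths are assumptions for illustration:

```yaml
# dvc.yaml: one illustrative stage; names and paths are assumed, not actual
stages:
  build_features:
    cmd: python src/features/build.py
    deps:
      - data/interim/clean.parquet
      - src/features/build.py
    params:
      - features              # a section of params.yaml tracked as a dependency
    outs:
      - data/features/matrix.parquet
```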
This means:

- a given Git commit uniquely identifies the dataset state for that pipeline run,
- `dvc checkout` restores the exact files to `data/` (from the local cache, which `dvc pull` populates from MinIO),
- re-running `dvc repro` on the same commit produces the same outputs (the pipeline is deterministic).
## Reproducibility semantics
Given:

- a Git commit hash,
- access to the MinIO remote,
- the same `params.yaml` at that commit,

any past experiment can be reproduced:
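```bash
git checkout <commit>   # pins code, params.yaml, and the .dvc pointers
dvc pull                # fetches the matching data from MinIO
dvc repro               # re-runs the pipeline; unchanged stages are skipped
```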
This is the reproducibility guarantee. It holds because:

- all pipeline inputs (parquet files, params) are content-addressed,
- all feature and preprocessing logic is deterministic (pure functions, no random state in data stages),
- all randomness in model training is seeded via `params.yaml`.
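For instance, the seed lives alongside the other training parameters; the key names below are illustrative, not the actual schema:

```yaml
# params.yaml: illustrative excerpt; real keys may differ
train:
  seed: 42            # consumed by the training stage and declared as a DVC param
  n_estimators: 200
```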
## How DVC and MLflow connect
Each MLflow experiment run records:

- the DVC dataset version (via logged tags/params, if the pipeline logs them),
- training parameters from `params.yaml`,
- evaluation metrics.
The model artifact registered in the MLflow Registry was produced by a specific training run, which consumed a specific DVC-versioned dataset. Following the chain:

```
MLflow model version
  → MLflow run ID
  → params.yaml commit
  → DVC .dvc pointer files at that commit
  → MinIO content-addressed parquet
```
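Walking that chain by hand might look like this; the run ID placeholder and the pointer-file path are illustrative:

```bash
mlflow runs describe --run-id <run_id>   # inspect the run's params and tags
git checkout <commit-from-run>           # restore code, params.yaml, and pointers
dvc pull data/features.dvc               # fetch exactly the features that run used
```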
## What is NOT versioned by DVC
- PostgreSQL data — PostgreSQL is the live canonical store; its history lives in the database WAL, not in DVC. The raw parquet export is the version boundary.
- GE (Great Expectations) validation reports — HTML artifacts are produced per run but not tracked as versioned datasets.
- Metadata JSON files (`data/metadata/`) — regenerated on each preprocessing run; not independently versioned.
## Restore semantics
To restore a specific dataset version:
```bash
# restore to a specific git commit
git checkout <commit>
dvc pull     # download exact data matching that commit from MinIO

# verify pipeline state
dvc status   # should report "Data and pipelines are up to date"
```
To reproduce the full pipeline from that point:
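```bash
dvc repro    # re-runs any stage whose inputs changed; after a clean restore, all stages are skipped
```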
## CI interaction
CI runs `dvc pull` to fetch the tracked dataset versions matching the commit under test. Contract validation (`validate_raw`, etc.) runs in CI as part of `dvc repro`. Full retraining does not run in CI by default — only smoke pipeline stages.
(🚧 Partial — CI runs contract and smoke checks; full retraining is a manual operator action)
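A sketch of the data-related CI steps, assuming the `validate_raw` stage name from above (the exact invocation depends on the CI runner):

```bash
dvc pull                  # fetch the dataset versions pinned by the commit under test
dvc repro validate_raw    # run contract validation plus its upstream stages only
```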
## Related
- Raw Export — where the DVC-tracked parquet files originate
- Schemas & Lineage — what each versioned file contains
- Data Contracts — validation applied to versioned datasets
- Architecture: Data & ML Flow — DVC stage graph
- Status — current CI and pipeline status