Dataset Versioning & Reproducibility (DVC)
Role of DVC
DVC provides:
- dataset versioning,
- data lineage,
- reproducible pipeline execution.
It acts as the bridge between data engineering and ML engineering.
Versioning semantics
Each dataset version is tied to:
- a specific snapshot in object storage,
- a git commit,
- downstream ML artifacts.
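As a minimal sketch of the three-way link described above, the record below ties one dataset version to its snapshot, its git commit, and the artifacts built from it. The class name `DatasetVersion`, its fields, and the example values are hypothetical and only illustrate the relationship; DVC itself stores this information in pointer files and lock files, not in a Python object.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical record linking one dataset version to its three anchors."""

    git_commit: str                 # commit that pins the dataset pointer
    snapshot_uri: str               # immutable snapshot location in object storage
    artifacts: Tuple[str, ...] = () # downstream ML artifacts built from this version

    def with_artifact(self, artifact: str) -> "DatasetVersion":
        # Return a new record; the original stays unchanged (frozen dataclass).
        return DatasetVersion(
            self.git_commit, self.snapshot_uri, self.artifacts + (artifact,)
        )


# Example values are placeholders, not real identifiers.
version = DatasetVersion(
    git_commit="abc1234",
    snapshot_uri="s3://example-bucket/datasets/train",
)
trained = version.with_artifact("models/classifier.pkl")
```

Keeping the record immutable mirrors the idea that a dataset version, once published, should not change; registering a new artifact produces a new record instead of mutating the old one.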
Reproducibility guarantee
Given:
- git commit hash,
- DVC-tracked dataset version,
- configuration snapshot,
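any experiment can be re-run against exactly the same code, data, and configuration. Bit-for-bit identical results additionally require fixed random seeds and a pinned software environment.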
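One way to make the three inputs above checkable is to reduce them to a single fingerprint that is stored next to each experiment's results. The sketch below is an illustration under stated assumptions: the function name `experiment_fingerprint` and the shape of `config` are hypothetical, and the fingerprint only identifies the inputs; it does not by itself make a training run deterministic.

```python
import hashlib
import json


def experiment_fingerprint(git_commit: str, dataset_version: str, config: dict) -> str:
    """Return a stable SHA-256 digest of the inputs that define an experiment."""
    payload = json.dumps(
        {
            "git_commit": git_commit,
            "dataset_version": dataset_version,
            "config": config,
        },
        sort_keys=True,          # key order must not change the digest
        separators=(",", ":"),   # fixed separators keep the encoding canonical
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
fingerprint = experiment_fingerprint(
    git_commit="abc1234",
    dataset_version="v3",
    config={"learning_rate": 0.01, "seed": 42},
)
```

Two runs with the same fingerprint started from the same declared inputs, so a difference in their results points to something outside those inputs, such as an unpinned dependency or an unseeded random number generator.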
CI integration
CI pipelines can:
- pull specific dataset versions,
- run smoke pipelines,
- validate contracts without full retraining.
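The CI steps listed above could be sketched as follows. The helper names (`ci_plan`, `run_plan`, `missing_columns`), the stage and target names, and the idea of a contract as a set of required columns are assumptions made for illustration; the only external commands used are `git checkout`, `dvc pull`, and `dvc repro`, each called with a single target.

```python
import subprocess
from typing import Dict, Iterable, List, Set


def ci_plan(git_rev: str, dataset_target: str, smoke_stage: str) -> List[List[str]]:
    """Build the commands for a smoke run against one pinned dataset version."""
    return [
        ["git", "checkout", git_rev],     # selects the pointer files for that version
        ["dvc", "pull", dataset_target],  # fetches only the requested dataset
        ["dvc", "repro", smoke_stage],    # runs the lightweight smoke stage
    ]


def run_plan(plan: List[List[str]]) -> None:
    """Execute each command in order and stop on the first failure."""
    for argv in plan:
        subprocess.run(argv, check=True)


def missing_columns(rows: Iterable[Dict[str, object]], required: Set[str]) -> Set[str]:
    """Return required columns that are absent from at least one row."""
    missing: Set[str] = set()
    for row in rows:
        missing |= required - set(row)
    return missing


# Hypothetical target and stage names; run_plan(plan) would execute them in CI.
plan = ci_plan("abc1234", "data/train.dvc", "smoke")
```

Checking a contract on a small sample of rows, as `missing_columns` does, lets the pipeline fail fast on schema drift without paying for a full retraining run.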