Skip to content

Dataset Versioning & Reproducibility (DVC)

Role of DVC

DVC provides:

  • dataset versioning,
  • data lineage,
  • reproducible pipeline execution.

It acts as the bridge between data engineering and ML engineering.


Versioning semantics

Each dataset version is tied to:

  • a specific snapshot in object storage,
  • a git commit,
  • downstream ML artifacts.

Reproducibility guarantee

Given:

  • git commit hash,
  • DVC version,
  • configuration snapshot,

any experiment can be reproduced deterministically.


CI integration

CI pipelines can:

  • pull specific dataset versions,
  • run smoke pipelines,
  • validate contracts without full retraining.