ADR-0002 — Data Versioning Strategy

Status

Accepted

Context

Machine learning experiments depend heavily on dataset versions. Raw data is produced continuously via scraping and ETL workflows.

We need:

  • reproducible experiments,
  • traceability between models and datasets,
  • efficient handling of large files.

Decision

We use DVC for dataset versioning with MinIO (S3-compatible) as remote storage.
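A minimal sketch of wiring DVC to an S3-compatible MinIO remote; the remote name, bucket, and endpoint URL are illustrative, and credentials are assumed to come from environment variables:

```shell
# Register the MinIO bucket as the default DVC remote
# (bucket name "datasets" and endpoint host are placeholders).
dvc remote add -d minio s3://datasets
dvc remote modify minio endpointurl https://minio.example.internal:9000

# Keep credentials out of the committed config with --local
# (written to .dvc/config.local, which is gitignored).
dvc remote modify --local minio access_key_id "$MINIO_ACCESS_KEY"
dvc remote modify --local minio secret_access_key "$MINIO_SECRET_KEY"
```

The committed `.dvc/config` then carries only the remote name, bucket, and endpoint, so every clone resolves the same storage without sharing secrets.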

Workflow:

  • Airflow exports raw parquet snapshots to MinIO.
  • DVC tracks dataset metadata and versions.
  • Local and CI environments retrieve datasets via dvc pull.
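The produce/consume cycle above can be sketched with standard DVC commands; the dataset path is illustrative:

```shell
# Producer side, after Airflow lands a new snapshot:
dvc add data/raw/snapshot.parquet          # writes snapshot.parquet.dvc metadata
git add data/raw/snapshot.parquet.dvc data/raw/.gitignore
git commit -m "data: new raw snapshot"
dvc push                                   # uploads file content to the remote

# Consumer side (local dev or CI):
git pull                                   # gets the updated .dvc metadata
dvc pull                                   # fetches exactly the version it pins
```

Git versions only the small `.dvc` metadata files; the parquet content itself lives in remote storage, which is what keeps large files efficient while preserving experiment-level traceability.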

Alternatives Considered

  • Git LFS: rejected due to limited metadata and pipeline integration.
  • Pure S3 versioning: rejected due to lack of experiment-level traceability.
  • LakeFS: rejected due to operational overhead.

Consequences

Positive

  • Explicit linkage between code, data, and experiments.
  • Full dataset lineage and reproducibility.
  • Scales to large datasets.

Negative

  • Requires DVC tooling familiarity.
  • Initial setup complexity compared to ad-hoc storage.

Rollback / Change Strategy

If dataset scale or collaboration needs outgrow the current setup, the DVC remote can be swapped for another backend without changing pipeline semantics: the `.dvc` metadata files and pipeline definitions stay the same, and only the remote configuration changes.
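A sketch of such a swap, assuming an existing remote named `minio` and an illustrative replacement bucket:

```shell
# Repoint the existing remote name at a new backend;
# tracked .dvc files and pipeline stages are untouched.
dvc remote modify minio url s3://new-bucket
dvc remote modify minio endpointurl https://s3.new-provider.example

# Re-populate the new backend from the local DVC cache.
dvc push
```

Because pipelines reference the remote by name rather than by URL, consumers only need the updated config before their next `dvc pull`.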

References

  • DVC documentation