ADR-0002 — Data Versioning Strategy

Status

Accepted

Context

Machine learning experiments depend heavily on dataset versions. Raw data is produced continuously via scraping and ETL workflows.

We need:

  • reproducible experiments,
  • traceability between models and datasets,
  • efficient handling of large files.

Decision

We use DVC for dataset versioning with MinIO (S3-compatible) as remote storage.
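A minimal sketch of wiring DVC to an S3-compatible MinIO remote; the remote name, bucket, and endpoint URL are illustrative, and credentials are assumed to come from environment variables:

```shell
# Register the MinIO bucket as the default DVC remote
# (bucket name "datasets" and endpoint host are placeholders).
dvc remote add -d minio s3://datasets
dvc remote modify minio endpointurl https://minio.example.internal:9000

# Keep credentials out of the committed config with --local
# (written to .dvc/config.local, which is gitignored).
dvc remote modify --local minio access_key_id "$MINIO_ACCESS_KEY"
dvc remote modify --local minio secret_access_key "$MINIO_SECRET_KEY"
```

The committed `.dvc/config` then carries only the remote name, bucket, and endpoint, so every clone resolves the same storage without sharing secrets.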

Workflow:

  • Airflow exports raw parquet snapshots to MinIO.
  • DVC tracks dataset metadata and versions.
  • Local and CI environments retrieve datasets via dvc pull.
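The produce/consume cycle above can be sketched with standard DVC commands; the dataset path is illustrative:

```shell
# Producer side, after Airflow lands a new snapshot:
dvc add data/raw/snapshot.parquet          # writes snapshot.parquet.dvc metadata
git add data/raw/snapshot.parquet.dvc data/raw/.gitignore
git commit -m "data: new raw snapshot"
dvc push                                   # uploads file content to the remote

# Consumer side (local dev or CI):
git pull                                   # gets the updated .dvc metadata
dvc pull                                   # fetches exactly the version it pins
```

Git versions only the small `.dvc` metadata files; the parquet content itself lives in remote storage, which is what keeps large files efficient while preserving experiment-level traceability.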

Alternatives Considered

  • Git LFS: rejected due to limited metadata and pipeline integration.
  • Pure S3 versioning: rejected due to lack of experiment-level traceability.
  • LakeFS: rejected due to operational overhead.

Consequences

Positive

  • Explicit linkage between code, data, and experiments.
  • Full dataset lineage and reproducibility.
  • Scales to large datasets.

Negative

  • Requires DVC tooling familiarity.
  • Initial setup complexity compared to ad-hoc storage.

Rollback / Change Strategy

If dataset scale or collaboration needs outgrow the current setup, the DVC remote can be swapped for another backend without changing pipeline semantics: the `.dvc` metadata files and pipeline definitions stay the same, and only the remote configuration changes.
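A sketch of such a swap, assuming an existing remote named `minio` and an illustrative replacement bucket:

```shell
# Repoint the existing remote name at a new backend;
# tracked .dvc files and pipeline stages are untouched.
dvc remote modify minio url s3://new-bucket
dvc remote modify minio endpointurl https://s3.new-provider.example

# Re-populate the new backend from the local DVC cache.
dvc push
```

Because pipelines reference the remote by name rather than by URL, consumers only need the updated config before their next `dvc pull`.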

References

  • DVC documentation