# ADR-0002 — Data Versioning Strategy

## Status

Accepted

## Context
Machine learning experiments depend heavily on dataset versions. Raw data is produced continuously via scraping and ETL workflows.
We need:

- reproducible experiments,
- traceability between models and datasets,
- efficient handling of large files.
## Decision
We use DVC for dataset versioning with MinIO (S3-compatible) as remote storage.
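A minimal remote configuration could look like the following sketch; the remote name `minio`, the bucket path, and the endpoint URL are placeholder values, not settings taken from this ADR:

```shell
# Register the MinIO bucket as the default DVC remote
# (remote name, bucket, and path are illustrative).
dvc remote add -d minio s3://datasets/dvc-store

# Point DVC at the S3-compatible MinIO endpoint instead of AWS S3.
dvc remote modify minio endpointurl http://minio.internal:9000

# Keep credentials out of the committed config via --local.
dvc remote modify --local minio access_key_id "$MINIO_ACCESS_KEY"
dvc remote modify --local minio secret_access_key "$MINIO_SECRET_KEY"
```

The `--local` options land in `.dvc/config.local`, which DVC gitignores by default, so secrets never enter version control.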
Workflow:
- Airflow exports raw parquet snapshots to MinIO.
- DVC tracks dataset metadata and versions.
- Local and CI environments retrieve datasets via `dvc pull`.
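In day-to-day use, the workflow above reduces to a handful of commands; the file paths below are illustrative examples, not paths defined by this ADR:

```shell
# Track a new snapshot; DVC writes a small .dvc metadata file
# and leaves the large parquet file out of git.
dvc add data/raw/snapshot.parquet

# Version the metadata (not the data) alongside the code.
git add data/raw/snapshot.parquet.dvc data/raw/.gitignore
git commit -m "Track new raw snapshot"

# Upload the file content to the MinIO remote.
dvc push

# In CI or on a fresh checkout, materialize the exact tracked version.
dvc pull
```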
## Alternatives Considered
- Git LFS: rejected due to limited metadata and pipeline integration.
- Pure S3 versioning: rejected due to lack of experiment-level traceability.
- LakeFS: rejected due to operational overhead.
## Consequences

### Positive
- Explicit linkage between code, data, and experiment versions.
- Full dataset lineage and reproducibility.
- Scales to large datasets.
### Negative
- Requires DVC tooling familiarity.
- Initial setup complexity compared to ad-hoc storage.
## Rollback / Change Strategy
If dataset scale or collaboration needs grow significantly, the DVC remote can be replaced with another backend without changing pipeline semantics, since pipelines reference datasets through DVC metadata rather than through storage URLs.
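Such a migration can be sketched with standard DVC remote commands; the remote names and target bucket below are hypothetical:

```shell
# Fetch all tracked data from the current remote.
dvc pull -r minio

# Register the replacement backend (name and URL are examples).
dvc remote add newstore s3://new-bucket/dvc-store
dvc remote default newstore

# Re-upload content; the .dvc files and pipelines stay unchanged.
dvc push -r newstore
```

Because dataset identity lives in the committed `.dvc` metadata, switching the default remote changes only where content is stored, not which versions experiments resolve to.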
## References
- DVC documentation