ADR-0001: Pipeline Orchestration Strategy
Status
Accepted
Context
The system requires orchestration for multiple types of workflows:
- continuous data ingestion and scraping,
- reproducible offline ML pipelines,
- scheduled and event-driven jobs.
Different tools offer overlapping functionality but serve different purposes.
Decision
We use multiple orchestrators with clearly separated responsibilities:
- Airflow for external, scheduled ETL:
    - scraping data sources,
    - loading into PostgreSQL,
    - exporting raw datasets.
- DVC pipelines for offline ML workflows:
    - preprocessing,
    - feature engineering,
    - training,
    - evaluation.

DVC is treated as the source of truth for ML reproducibility.
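
The split above can be written down as data and checked mechanically, for example in CI. The following is a minimal sketch under stated assumptions: the step names (`scrape_sources`, `train`, and so on) and the helpers `find_overlaps` and `owner_of` are hypothetical labels invented for illustration and do not come from an existing DAG or `dvc.yaml`.

```python
from __future__ import annotations

# Hypothetical ownership map mirroring the decision above.
# Step names are illustrative labels, not identifiers from a real pipeline.
OWNERSHIP: dict[str, list[str]] = {
    "airflow": ["scrape_sources", "load_postgres", "export_raw_datasets"],
    "dvc": ["preprocess", "feature_engineering", "train", "evaluate"],
}


def find_overlaps(ownership: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return every step claimed by more than one orchestrator, with its owners."""
    owners_by_step: dict[str, list[str]] = {}
    for orchestrator, steps in ownership.items():
        for step in steps:
            owners_by_step.setdefault(step, []).append(orchestrator)
    return {step: owners for step, owners in owners_by_step.items() if len(owners) > 1}


def owner_of(step: str, ownership: dict[str, list[str]] = OWNERSHIP) -> str:
    """Look up the single orchestrator responsible for a step."""
    overlaps = find_overlaps(ownership)
    if step in overlaps:
        raise ValueError(f"{step!r} is claimed by {overlaps[step]}")
    for orchestrator, steps in ownership.items():
        if step in steps:
            return orchestrator
    raise KeyError(f"no orchestrator owns {step!r}")
```

With this map, `owner_of("train")` resolves to `"dvc"`, and a step listed under both tools is reported by `find_overlaps` instead of being silently scheduled twice.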
Alternatives Considered
- Airflow-only: rejected due to weak experiment reproducibility and local developer experience.
- Snakemake: rejected to avoid overlapping orchestration semantics with DVC.
- Kubeflow Pipelines: rejected because its operational overhead is too high for the project's scope.
Consequences
Positive
- Clear separation between data engineering and ML experimentation.
- Deterministic ML pipelines tied to data and code versions.
- Easy local and CI execution via `dvc repro`.
Negative
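
One way to keep local and CI runs identical is to route both through the same thin wrapper around `dvc repro`. This is a sketch, not the project's actual tooling: `build_repro_command` and `run_repro` are hypothetical helper names, and it assumes the `dvc` CLI is on `PATH` and the working directory is a DVC repository.

```python
from __future__ import annotations

import subprocess
from typing import Sequence


def build_repro_command(targets: Sequence[str] = (), dry_run: bool = False) -> list[str]:
    """Build the argv for `dvc repro`, optionally restricted to the given stages."""
    command = ["dvc", "repro"]
    if dry_run:
        # --dry prints the commands DVC would run without executing them.
        command.append("--dry")
    command.extend(targets)
    return command


def run_repro(targets: Sequence[str] = (), dry_run: bool = False) -> int:
    """Run `dvc repro` in the current repository and return its exit code."""
    completed = subprocess.run(build_repro_command(targets, dry_run), check=False)
    return completed.returncode
```

A CI job could call `run_repro()` and fail the build on a non-zero return value, while a developer might call `run_repro(["train"], dry_run=True)` to preview a single stage (the stage name `train` is illustrative).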
- Requires discipline to avoid orchestration overlap.
- Two orchestration tools increase conceptual surface area.
Rollback / Change Strategy
If pipeline complexity increases significantly, offline ML orchestration could be migrated to a workflow engine (e.g., Kubeflow or Prefect) via a new ADR.
References
- DVC documentation
- Airflow documentation