ADR-0001: Pipeline Orchestration Strategy

Status

Accepted

Context

The system requires orchestration for multiple types of workflows:

- continuous data ingestion and scraping,
- reproducible offline ML pipelines,
- scheduled and event-driven jobs.

Different tools offer overlapping functionality but serve different purposes.

Decision

We use multiple orchestrators with clearly separated responsibilities:

- Airflow for external, scheduled ETL:
  - scraping data sources,
  - loading into PostgreSQL,
  - exporting raw datasets.
- DVC pipelines for offline ML workflows:
  - preprocessing,
  - feature engineering,
  - training,
  - evaluation.

DVC is treated as the source of truth for ML reproducibility.
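
As a rough illustration of the offline ML side, the four stages can be viewed as a dependency graph in which each stage consumes the outputs of an earlier one. The sketch below mirrors the stages/cmd/deps/outs layout of a dvc.yaml file as a Python dictionary and derives a valid execution order with the standard library. All script names and paths are hypothetical, and the ordering logic is a simplified stand-in for what a pipeline tool would compute, not DVC's own implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical stage layout mirroring the stages/cmd/deps/outs keys of a dvc.yaml file.
# Every script name and path here is an illustrative assumption.
STAGES = {
    "preprocess": {
        "cmd": "python src/preprocess.py",
        "deps": ["data/raw", "src/preprocess.py"],
        "outs": ["data/processed"],
    },
    "featurize": {
        "cmd": "python src/featurize.py",
        "deps": ["data/processed", "src/featurize.py"],
        "outs": ["data/features"],
    },
    "train": {
        "cmd": "python src/train.py",
        "deps": ["data/features", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "cmd": "python src/evaluate.py",
        "deps": ["models/model.pkl", "data/features", "src/evaluate.py"],
        "outs": ["reports/metrics.json"],
    },
}


def execution_order(stages):
    """Order stages so each runs after the stages that produce its dependencies."""
    # Map every declared output to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the upstream stages whose outputs it depends on.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

In this layout, data/raw has no producing stage: it stands for the raw dataset exported by the Airflow ETL, which is where the two orchestrators hand off to each other.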

Alternatives Considered

Airflow-only: rejected because of weak experiment reproducibility and a poor local developer experience.
Snakemake: rejected to avoid overlapping orchestration semantics with DVC.
Kubeflow Pipelines: rejected because its operational overhead is too high for the project's scope.

Consequences

Positive

Clear separation between data engineering and ML experimentation.
Deterministic ML pipelines tied to data and code versions.
Easy local and CI execution via dvc repro.
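
The idea behind pipelines tied to data and code versions can be sketched as content hashing: a stage is rerun only when the hash of one of its dependencies differs from the hash recorded after the last successful run. The example below is a simplified, standard-library illustration of that principle. The lock file name and format, the function names, and the choice of SHA-256 are assumptions made for the sketch and do not reproduce DVC's internals.

```python
import hashlib
import json
from pathlib import Path


def current_hashes(deps):
    """Map each dependency path to the SHA-256 digest of its contents."""
    return {
        str(dep): hashlib.sha256(Path(dep).read_bytes()).hexdigest()
        for dep in deps
    }


def stage_is_stale(deps, lock_path):
    """Return True if no lock exists or any dependency hash has changed."""
    lock = Path(lock_path)
    if not lock.exists():
        return True
    return json.loads(lock.read_text()) != current_hashes(deps)


def record_lock(deps, lock_path):
    """Store the current dependency hashes after a successful stage run."""
    Path(lock_path).write_text(
        json.dumps(current_hashes(deps), indent=2, sort_keys=True)
    )
```

A runner built on this would call stage_is_stale before executing a stage and record_lock afterwards, so unchanged code and data lead to no recomputation.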

Negative

Requires discipline to avoid orchestration overlap.
Two orchestration tools increase conceptual surface area.

Rollback / Change Strategy

If pipeline complexity increases significantly, offline ML orchestration could be migrated to a dedicated workflow engine (e.g., Kubeflow Pipelines or Prefect), with the change recorded in a new ADR.

References

DVC documentation
Airflow documentation