ADR-0001 — Pipeline Orchestration Strategy

Status

Accepted

Context

The system requires orchestration for multiple types of workflows:

  • continuous data ingestion and scraping,
  • reproducible offline ML pipelines,
  • scheduled and event-driven jobs.

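The candidate tools overlap in functionality but target different needs: schedulers such as Airflow focus on recurring, externally facing jobs, while DVC focuses on versioned, reproducible pipelines.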

Decision

We use multiple orchestrators with clearly separated responsibilities:

  • Airflow for external, scheduled ETL:
      • scraping data sources,
      • loading into PostgreSQL,
      • exporting raw datasets.

  • DVC pipelines for offline ML workflows:
      • preprocessing,
      • feature engineering,
      • training,
      • evaluation.

DVC is treated as the source of truth for ML reproducibility.
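
As a rough illustration of the DVC side of this split, the sketch below lays out the four offline stages in the shape of the `stages` mapping of a dvc.yaml file and derives a run order by linking each stage's outputs to the dependencies of later stages. It is a minimal sketch, not the project's pipeline definition: every script name and data path is a hypothetical placeholder, and the ordering function only imitates how a dependency graph can be resolved, it is not DVC's own implementation.

```python
from graphlib import TopologicalSorter

# Hypothetical stage layout mirroring the `stages` mapping of a dvc.yaml file.
# Script names and data paths are illustrative, not taken from the project.
STAGES = {
    "preprocess": {
        "cmd": "python src/preprocess.py",
        "deps": ["data/raw", "src/preprocess.py"],
        "outs": ["data/processed"],
    },
    "features": {
        "cmd": "python src/features.py",
        "deps": ["data/processed", "src/features.py"],
        "outs": ["data/features"],
    },
    "train": {
        "cmd": "python src/train.py",
        "deps": ["data/features", "src/train.py"],
        "outs": ["models/model.pkl"],
    },
    "evaluate": {
        "cmd": "python src/evaluate.py",
        "deps": ["models/model.pkl", "data/features", "src/evaluate.py"],
        "outs": ["reports/metrics.json"],
    },
}


def stage_order(stages):
    """Return stage names in an order that respects output -> dependency links."""
    # Map every declared output to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # A stage depends on whichever stages produce its declared dependencies.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(stage_order(STAGES))
```

In this layout `data/raw` has no producing stage, which matches the boundary described above: the raw dataset is assumed to be exported by the Airflow ETL and consumed by DVC as an external input.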

Alternatives Considered

  • Airflow-only: rejected because it offers weak experiment reproducibility and a poor local developer experience.
  • Snakemake: rejected to avoid overlapping orchestration semantics with DVC.
  • Kubeflow Pipelines: rejected because its operational overhead is too high for the project's scope.

Consequences

Positive

  • Clear separation between data engineering and ML experimentation.
  • Deterministic ML pipelines tied to data and code versions.
  • Easy local and CI execution via dvc repro.
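
One way the shared local and CI entry point might look is a thin wrapper around `dvc repro`, sketched below. The helper names `build_repro_command` and `run_repro` are hypothetical and not part of the project; the sketch assumes the `dvc` executable is on the PATH and uses only the `--force` and `--dry` options of `dvc repro`.

```python
import subprocess


def build_repro_command(targets=(), force=False, dry_run=False):
    """Assemble a `dvc repro` command line as an argument list."""
    cmd = ["dvc", "repro"]
    if force:
        cmd.append("--force")  # re-run stages even if nothing changed
    if dry_run:
        cmd.append("--dry")  # only report what would run
    cmd.extend(targets)  # optional stage names; none means the whole pipeline
    return cmd


def run_repro(targets=(), force=False, dry_run=False):
    """Run `dvc repro`; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_repro_command(targets, force, dry_run), check=True)


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch runs without DVC.
    print(" ".join(build_repro_command(["evaluate"], dry_run=True)))
```

Calling the same wrapper from a developer shell and from a CI job keeps both environments on one code path, which is the property this consequence relies on.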

Negative

  • Requires discipline to avoid orchestration overlap.
  • Two orchestration tools increase conceptual surface area.

Rollback / Change Strategy

If pipeline complexity increases significantly, offline ML orchestration could be migrated to a workflow engine (e.g., Kubeflow or Prefect) via a new ADR.

References

  • DVC documentation
  • Airflow documentation