# Pipelines Reference

This page documents how to run the core workflows of the system locally and in CI.

Time2Bet uses:

- Airflow for external ingestion/ETL
- DVC pipelines for reproducible offline ML workflows


## DVC pipeline entrypoints

### Pull versioned data

```bash
dvc pull
```

### Reproduce the full ML pipeline

```bash
dvc repro
```

### Re-run a specific stage (example)

```bash
dvc repro <stage_name>
```

### Show pipeline graph

```bash
dvc dag
```

### Show pipeline status

```bash
dvc status
```
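The targets that `dvc repro` rebuilds are declared as stages in `dvc.yaml`. As a rough sketch only (the stage names, scripts, and paths below are hypothetical, not taken from this repo), a minimal two-stage pipeline could look like:

```yaml
# Hypothetical dvc.yaml sketch -- stage names, commands, and paths
# are illustrative only.
stages:
  prepare:
    cmd: python src/prepare.py
    deps:
      - data/raw
      - src/prepare.py
    outs:
      - data/processed
  train:
    cmd: python src/train.py
    deps:
      - data/processed
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```

Given such a file, `dvc repro train` re-runs the `train` stage plus any upstream stages whose dependencies changed.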

## Common workflows

### Full offline training cycle

1. `dvc pull`
2. `dvc repro`
3. Inspect MLflow runs
4. Start the API and call `/predict`
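Steps 1-2 above can be chained in a small script. This is a sketch, not project tooling: the `run` helper and `DRY_RUN` flag are inventions for illustration, and the MLflow/API steps remain manual.

```bash
#!/usr/bin/env bash
# Sketch of the full offline training cycle. DRY_RUN=1 (the default
# here) only prints the commands, so the script is safe to read and
# run outside a real DVC workspace.
set -euo pipefail
: "${DRY_RUN:=1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

run dvc pull    # 1. fetch versioned data
run dvc repro   # 2. rebuild stale pipeline stages
# 3-4 are manual: inspect the MLflow runs, then start the API and
#     exercise /predict with a sample payload.
```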

### Smoke run (CI-friendly)

- Use a reduced dataset or subset target.
- Run `dvc repro` and ensure that:
    - the pipeline completes,
    - basic metric sanity checks pass,
    - artifacts are logged to MLflow.
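The metric sanity checks can be as simple as asserting that each metric lands in a loose band. A hypothetical helper (the metric name and bounds below are made up for illustration):

```bash
#!/usr/bin/env bash
# Hypothetical smoke-run sanity check: fail if a metric falls
# outside a loose [min, max] band. Names and bounds are examples.
set -euo pipefail

check_metric() {
  local name=$1 value=$2 min=$3 max=$4
  if awk -v v="$value" -v lo="$min" -v hi="$max" \
      'BEGIN { exit !(v + 0 >= lo + 0 && v + 0 <= hi + 0) }'; then
    echo "ok: $name=$value"
  else
    echo "sanity check failed: $name=$value"
    return 1
  fi
}

# e.g. accuracy should land between chance level and 1.0
check_metric accuracy 0.55 0.34 1.0
```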

## Airflow workflows (operational)

Airflow is responsible for:

- scraping WhoScored.com,
- loading normalized data into PostgreSQL,
- exporting raw Parquet snapshots to MinIO.

Airflow jobs are not expected to be reproducible in the strict ML sense, but their outputs are materialized and versioned downstream.
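Operationally, these DAGs are usually triggered from the Airflow UI, but the CLI works too. A sketch, assuming a DAG id of `whoscored_ingest` (the real id may differ); the guard lets the helper degrade gracefully where the Airflow CLI is absent:

```bash
#!/usr/bin/env bash
# Hypothetical trigger helper; "whoscored_ingest" is an assumed
# DAG id, not taken from the repo.
set -euo pipefail

trigger_dag() {
  local dag_id=$1
  if ! command -v airflow >/dev/null 2>&1; then
    echo "airflow CLI not available; would trigger: $dag_id"
    return 0
  fi
  airflow dags trigger "$dag_id"
}

trigger_dag whoscored_ingest
```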


## Make targets (developer ergonomics)

If Make targets are provided, they should map to:

- environment setup
- requirements export
- docs build
- encryption/decryption operations

Example:

```bash
make docs-build
make export
make encrypt
```
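A sketch of what such a Makefile could look like. The recipe bodies (venv, mkdocs, uv, sops) are assumptions about tooling, not taken from the repo; only the target names echo the examples above:

```make
# Hypothetical Makefile sketch -- recipes are illustrative only.
.PHONY: setup docs-build export encrypt

setup:          ## create the dev environment
	python -m venv .venv && .venv/bin/pip install -r requirements.txt

docs-build:     ## build the documentation site
	mkdocs build

export:         ## export pinned requirements
	uv export --format requirements-txt > requirements.txt

encrypt:        ## encrypt secrets before committing
	sops --encrypt secrets.yaml > secrets.enc.yaml
```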

See also:

- Data → ETL / Raw Export / Versioning
- ML → Training Pipeline
- CI/CD → Testing Strategy