
# Pipelines Reference

This page documents how to run the core workflows of the system locally and in CI.

Time2Bet uses:

- Airflow for external ingestion/ETL
- DVC pipelines for reproducible offline ML workflows


## DVC pipeline entrypoints

### Pull versioned data

```bash
dvc pull
```

### Reproduce the full ML pipeline

```bash
dvc repro
```

### Re-run a specific stage (example)

```bash
dvc repro <stage_name>
```

### Show pipeline graph

```bash
dvc dag
```

### Show pipeline status

```bash
dvc status        # cache vs workspace
dvc status -c     # cache vs remote (what is NOT yet pushed)
```

## Daily workflow: save-point pattern

`dvc repro` overwrites artifacts in `data/` in place. To keep changes safely revertible, treat each known-good state as an explicit save-point:

```bash
# 0. Confirm current state is fully saved
dvc status -c                       # must report "in sync"
git status                          # dvc.lock must be committed

# 1. Make code/params changes, then re-run
dvc repro                           # or: dvc repro -s <stage>

# 2a. Result is good -> publish + commit (this is the new save-point)
dvc push
git add dvc.lock && git commit -m "..."
git push                            # ALWAYS: dvc push BEFORE git push

# 2b. Result is bad -> roll back to the previous save-point
git checkout HEAD -- dvc.lock       # restore old artifact pointers
dvc checkout                        # restore data/ from local cache
```

**Rule:** `dvc push` before `git push`. Otherwise other machines / CI will receive pointers to artifacts that do not exist in the remote.
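The publish step (2a) can be condensed into a single function that enforces the ordering. This is a sketch, not a project convention: `publish_savepoint` is a hypothetical helper name, and it assumes `dvc.lock` is the only pointer file that changed.

```shell
# publish_savepoint "<message>": push artifacts first, then the pointers.
# Stops before any git step if `dvc push` fails, so the Git remote never
# sees a dvc.lock whose artifacts are missing from the DVC remote.
publish_savepoint() {
  msg="$1"
  dvc push             || return 1   # artifacts to the DVC remote first
  git add dvc.lock     || return 1
  git commit -m "$msg" || return 1
  git push                           # pointers last
}
```

Running `publish_savepoint "retrain with new features"` then replaces the four-command sequence above.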

## Isolated experiments (no workspace pollution)

For exploratory runs that should NOT touch `data/` until accepted:

```bash
dvc exp run -S data.sample_frac=0.01    # run with overrides
dvc exp show                            # compare with baseline
dvc exp apply <exp-name>                # promote to workspace
dvc exp remove <exp-name>               # discard
```

`dvc exp` keeps each run as a hidden Git ref; the workspace stays clean until `dvc exp apply`.

## Pitfall: detached HEAD after DVC operations

`dvc exp apply`, `dvc exp branch`, and some `dvc checkout` flows emit `Restore HEAD to <hash>` and leave the repo in a detached HEAD state. A `git commit` made in this state is not attached to any branch, so a later `git checkout <branch>` makes the commit appear "lost": it is not shown in `git log <branch>`.

Before every commit, verify:

```bash
git status          # first line MUST be: "On branch <name>"
                    # NOT:  "HEAD detached at <hash>"
```

If detached, attach first:

```bash
git switch <branch>            # return to a branch, OR
git switch -c <new-branch>     # keep current commit on a new branch
```

Recovering a commit that was made while detached:

```bash
git reflog                     # find the commit hash
git branch <name> <hash>       # anchor it to a branch
git merge <name>               # then merge into the target branch
```
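The "verify before every commit" check above can be scripted as a small guard. A sketch only: `ensure_on_branch` is a hypothetical helper, not part of Git or DVC; it relies on `git symbolic-ref -q HEAD`, which exits non-zero exactly when HEAD is detached.

```shell
# ensure_on_branch: succeed only when HEAD is attached to a branch.
ensure_on_branch() {
  if git symbolic-ref -q HEAD >/dev/null; then
    return 0                   # on a branch: safe to commit
  fi
  echo "refusing to commit: HEAD is detached (run: git switch <branch>)" >&2
  return 1
}

# Usage: ensure_on_branch && git commit -m "..."
```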

## Credentials

DVC uses the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `AWS_ENDPOINT_URL_S3` environment variables (defined in `.env`). Load them once per shell:

```bash
set -a; source .env; set +a
```
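Why `set -a` is needed: a plain `source` defines the variables in the current shell but does not export them, so child processes such as `dvc` never see them. A throwaway demonstration (the `/tmp/demo.env` file and the `demo-key` value are made up, not part of the repo):

```shell
# `set -a` marks every variable the sourced file assigns for export,
# so child processes inherit it.
printf 'AWS_ACCESS_KEY_ID=demo-key\n' > /tmp/demo.env
set -a; . /tmp/demo.env; set +a
sh -c 'printf "%s\n" "$AWS_ACCESS_KEY_ID"'   # child process sees the value
```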

## Common workflows

### Full offline training cycle

1. `dvc pull`
2. `dvc repro`
3. Inspect MLflow runs
4. Start the API and exercise `/predict`

### Smoke run (CI-friendly)

- use a reduced dataset or subset target
- run `dvc repro` and ensure:
    - the pipeline completes
    - basic metrics sanity checks pass
    - artifacts are logged to MLflow
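The checklist above can be sketched as one CI function. Assumptions, not project contract: `smoke_run` is a hypothetical name, `dvc metrics show` stands in for "metrics exist", and the MLflow-artifact check is left to the CI job itself.

```shell
# smoke_run: minimal CI smoke test for the offline pipeline.
smoke_run() {
  dvc pull          || return 1   # fetch versioned inputs
  dvc repro         || return 1   # pipeline must complete end to end
  dvc metrics show  || return 1   # basic sanity: metrics were produced
}
```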

## Airflow workflows (operational)

Airflow is responsible for:

- scraping WhoScored.com,
- loading normalized data into PostgreSQL,
- exporting raw Parquet snapshots to MinIO.

Airflow jobs are not expected to be reproducible in the strict ML sense, but their outputs are materialized and versioned downstream.


## Make targets (developer ergonomics)

If Make targets are provided, they should map to:

- environment setup
- requirements export
- docs build
- encryption/decryption operations

Example:

```bash
make docs-build
make requirements
make encrypt
```

## Related

- Data → ETL / Raw Export / Versioning
- ML → Training Pipeline
- CI/CD → Testing Strategy