Pipelines Reference¶
This page documents how to run the core workflows of the system locally and in CI.
Time2Bet uses:
- Airflow for external ingestion/ETL
- DVC pipelines for reproducible offline ML workflows
DVC pipeline entrypoints¶
Pull versioned data¶
Re-run a specific stage (example)¶
Show pipeline graph¶
Show pipeline status¶
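The four entrypoints in this section correspond to standard DVC commands: dvc pull, dvc repro, dvc dag, and dvc status. The sketch below wraps them in one helper. The function name pipeline and its subcommand names are hypothetical and not part of DVC; the -s flag follows the dvc repro -s <stage> form used in the daily workflow on this page.

```shell
# Hypothetical convenience wrapper around the DVC entrypoints.
# Assumes it is called from the repository root with a DVC remote configured.
pipeline() {
  case "${1:-}" in
    pull)   dvc pull ;;                      # fetch versioned data from the remote
    stage)
      if [ -z "${2:-}" ]; then
        echo "usage: pipeline stage <stage-name>" >&2
        return 2
      fi
      dvc repro -s "$2" ;;                   # -s: run only this stage, not its upstream
    graph)  dvc dag ;;                       # print the pipeline graph
    status) dvc status ;;                    # show which stages are out of date
    *)
      echo "usage: pipeline {pull|stage <name>|graph|status}" >&2
      return 2 ;;
  esac
}
```

For example, pipeline stage <stage> re-runs a single stage, and pipeline graph prints the DAG.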
Daily workflow: save-point pattern¶
dvc repro overwrites artifacts in data/ in place. To keep changes safely
revertible, treat each known-good state as an explicit save-point:
# 0. Confirm current state is fully saved
dvc status -c # must report "in sync"
git status # dvc.lock must be committed
# 1. Make code/params changes, then re-run
dvc repro # or: dvc repro -s <stage>
# 2a. Result is good -> publish + commit (this is the new save-point)
dvc push
git add dvc.lock && git commit -m "..."
git push # ALWAYS: dvc push BEFORE git push
# 2b. Result is bad -> roll back to the previous save-point
git checkout HEAD -- dvc.lock # restore old artifact pointers
dvc checkout # restore data/ from local cache
Rule:
dvc push before git push. Otherwise, other machines and CI will receive pointers to artifacts that do not exist in the remote.
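One way to make the ordering hard to get wrong is to chain the publish steps so that git push can only run after dvc push has succeeded. This is a minimal sketch; publish_savepoint is a hypothetical helper name, and it assumes dvc.lock is the only pointer file that needs committing.

```shell
# Hypothetical helper: publish a save-point with dvc push strictly before git push.
publish_savepoint() {
  if [ -z "${1:-}" ]; then
    echo "usage: publish_savepoint <commit message>" >&2
    return 2
  fi
  # Upload artifacts first; stop here if the remote upload fails.
  dvc push || { echo "dvc push failed; git pointers not pushed" >&2; return 1; }
  # Only now commit and publish the pointers to those artifacts.
  git add dvc.lock &&
    git commit -m "$1" &&
    git push
}
```

If dvc push fails, nothing is committed or pushed, so the Git remote never references artifacts that are missing from the DVC remote.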
Isolated experiments (no workspace pollution)¶
For exploratory runs that should NOT touch data/ until accepted:
dvc exp run --temp -S data.sample_frac=0.01 # run outside the workspace, with overrides
dvc exp show # compare with baseline
dvc exp apply <exp-name> # promote to workspace
dvc exp remove <exp-name> # discard
dvc exp keeps each run as a hidden git ref. By default dvc exp run executes in
the workspace; with --temp the run happens in a temporary directory, so the
workspace stays clean until dvc exp apply.
Pitfall: detached HEAD after DVC operations¶
dvc exp apply, dvc exp branch, and some dvc checkout flows emit
"Restore HEAD to <hash>" and leave the repo in detached HEAD state.
A git commit made in this state is not attached to any branch, so a
later git checkout <branch> makes the commit appear "lost": it is not
shown in git log <branch>.
Before every commit, verify:
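A minimal sketch of such a check, relying on the fact that git symbolic-ref -q HEAD exits non-zero when HEAD is detached. The function name on_branch is hypothetical.

```shell
# Hypothetical pre-commit check: succeed only when HEAD points at a branch.
on_branch() {
  # symbolic-ref prints the branch name, or fails silently (-q) when detached.
  if branch=$(git symbolic-ref -q --short HEAD); then
    echo "on branch: $branch"
  else
    echo "detached HEAD: switch to a branch before committing" >&2
    return 1
  fi
}
```

It can be used as a guard, for example on_branch && git commit -m "...".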
If detached, attach first:
git switch <branch> # return to a branch, OR
git switch -c <new-branch> # keep current commit on a new branch
Recovering a commit that was made while detached:
git reflog # find the commit hash
git branch <name> <hash> # anchor it to a branch
git switch <branch> # check out the target branch first
git merge <name> # then merge the recovered commit into it
Credentials¶
DVC uses the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY /
AWS_ENDPOINT_URL_S3 env vars (defined in .env). Load them once per shell:
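A common way to do this is to source the file with auto-export enabled. The sketch below assumes .env contains plain KEY=value lines that are valid shell syntax; load_env is a hypothetical helper name.

```shell
# Hypothetical helper: export every variable defined in an env file
# into the current shell so that dvc (and other child processes) can see them.
load_env() {
  env_file="${1:-./.env}"
  [ -f "$env_file" ] || { echo "missing env file: $env_file" >&2; return 1; }
  set -a          # auto-export every variable assigned from here on
  . "$env_file"   # read KEY=value lines as shell assignments
  set +a          # stop auto-exporting
}
```

Calling load_env once in a shell makes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_ENDPOINT_URL_S3 available to later dvc pull and dvc push calls in that shell.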
Common workflows¶
Full offline training cycle¶
- dvc pull
- dvc repro
- inspect MLflow runs
- start API and run /predict
Smoke run (CI-friendly)¶
- use a reduced dataset or subset target
- run dvc repro and ensure:
  - pipeline completes
  - basic metrics sanity checks pass
  - artifacts logged to MLflow
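The first two checks can be sketched as a small CI step. This is an illustration only: smoke_run is a hypothetical helper, and the metrics path metrics.json is an assumption, not a file this page confirms. The metrics check here only verifies that a non-empty file was written, and the MLflow check is not covered.

```shell
# Hypothetical CI smoke step: run the pipeline and verify that metrics were written.
smoke_run() {
  metrics_file="${1:-metrics.json}"   # assumed metrics path; adjust to the project
  # Check 1: the pipeline must complete without errors.
  dvc repro || { echo "smoke run: pipeline failed" >&2; return 1; }
  # Check 2 (very basic sanity): a non-empty metrics file exists.
  [ -s "$metrics_file" ] || { echo "smoke run: no metrics in $metrics_file" >&2; return 1; }
  echo "smoke run ok"
}
```

A non-zero return value fails the CI job, so a broken pipeline or missing metrics file is caught before merge.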
Airflow workflows (operational)¶
Airflow is responsible for:
- scraping WhoScored.com,
- loading normalized data into PostgreSQL,
- exporting raw parquet snapshots to MinIO.
Airflow jobs are not expected to be reproducible in the strict ML sense, but their outputs are materialized and versioned downstream.
Make targets (developer ergonomics)¶
If Make targets are provided, they should map to:
- environment setup
- requirements export
- docs build
- encryption/decryption operations
Example:
Related docs¶
- Data → ETL / Raw Export / Versioning
- ML → Training Pipeline
- CI/CD → Testing Strategy