
Backfills & Freshness Policy

Data freshness

Freshness requirements differ by dataset:

| Dataset | Expected update frequency | Notes |
| --- | --- | --- |
| data/raw/match.parquet | Per dvc repro run after new ingestion | Depends on Airflow schedule |
| data/interim/finished.parquet | Same as raw (pipeline re-runs) | Updated only when raw changes |
| data/features/features.parquet | Same as interim | Derived; updated on each pipeline run |

A freshness breach occurs when the most recent match in finished.parquet is older than the expected update window. This is currently detected manually by inspecting data/interim/finished.parquet or checking the Airflow DAG run history.
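
The manual check can be scripted. The snippet below is only a sketch: it assumes the match date is stored in a column named date, which may differ from the actual schema.

# manual freshness check (sketch; assumes a `date` column in finished.parquet)
python -c "import pandas as pd; print(pd.read_parquet('data/interim/finished.parquet')['date'].max())"

If the printed date falls outside the expected update window in the table above, treat it as a freshness breach.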

Automated freshness alerting — not yet implemented. Alertmanager rules are planned. (📋 Planned — see Status)


When backfills are required

A backfill is a controlled reprocessing of historical data. It is required when:

  • Scraping logic changes — the scraper is updated to extract additional fields or fix a parsing bug; historical records in PostgreSQL need to be re-extracted.
  • Historical corrections — WhoScored updates or corrects past match records.
  • Schema evolution — a column is added, renamed, or changed in the canonical PostgreSQL schema; past records need to be re-normalized.
  • Data quality issues detected post-ingestion — GE validation reveals a systematic problem in a past scraping run.

Backfills are not required for routine model retraining. Normal pipeline re-runs use existing DVC-tracked datasets and do not require touching PostgreSQL.
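
For contrast, a routine retraining run needs only the DVC-tracked data and the pipeline itself, roughly:

# routine retraining: no backfill, no PostgreSQL access
dvc pull      # fetch the current DVC-tracked datasets from MinIO
dvc repro     # re-run only the stages whose dependencies changed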


Current operating process

Backfills are an operator-driven manual process. There is no automated backfill scheduling.

1. Identify affected date range and root cause
2. Trigger Airflow backfill DAG for the affected range:
       airflow dags backfill --start-date <start> --end-date <end> <dag_id>
3. Verify the new records are in PostgreSQL by comparing record counts (see the count-query sketch after this list)
4. Run dvc repro load_data_from_sources to produce a new raw parquet snapshot
5. Run dvc repro to execute the full pipeline from raw through training
6. Commit the new DVC content hash to Git (.dvc pointer files)
7. Verify that downstream ML experiments have re-run and that new artifacts are registered in MLflow
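
Step 3 can be done with a simple count query. The sketch below assumes a matches table with a match_date column and a DATABASE_URL connection string; adjust to the actual schema:

# verify backfilled rows landed in PostgreSQL (table/column names are assumptions)
psql "$DATABASE_URL" -c "SELECT count(*) FROM matches WHERE match_date BETWEEN '<start>' AND '<end>';"

Comparing the count before and after the Airflow backfill run confirms the affected range was actually reprocessed.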

Safety rules

These rules are invariants — violating them risks breaking reproducibility:

  • No in-place mutation of existing DVC-tracked datasets. Backfills produce new DVC versions.
  • Downstream ML stages must be explicitly re-run. A new raw snapshot does not automatically trigger retraining. The operator runs dvc repro.
  • Backfill scope must be documented. At minimum: date range, reason, and resulting Git commit.
  • GE contracts must pass on backfilled data. A backfill that introduces schema violations is not complete (see the checkpoint example below).
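
For the last rule, the relevant GE checkpoint can be run against the backfilled snapshot from the command line. The checkpoint name is a placeholder and the command depends on the GE version in use:

# run the Great Expectations checkpoint covering the backfilled dataset
great_expectations checkpoint run <checkpoint_name>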

How backfills produce new dataset versions

A backfill changes the data in PostgreSQL. On the next dvc repro:

  1. load_data_from_sources runs and produces a parquet file with a different content hash.
  2. DVC updates the .dvc pointer file.
  3. Committing that .dvc change to Git creates a new dataset version (see the commit sketch after this list).
  4. All downstream DVC stages re-run because their dependency hash changed.
  5. A new MLflow experiment run is produced with the backfilled dataset.
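
The commit in step 3 is an ordinary Git commit of the updated pointer files, followed by a push of the data to the DVC remote. The exact files depend on whether outputs are tracked in dvc.lock or in individual .dvc files, so treat this as a sketch:

# commit the new dataset version and push the data to the DVC remote (MinIO)
git add dvc.lock data/raw/match.parquet.dvc    # whichever pointer files changed
git commit -m "backfill <start>..<end>: <reason>"
dvc push
git push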

The previous DVC version remains in MinIO and is accessible via the previous Git commit. This is the rollback path if the backfill introduced a problem.


Rollback

If a backfill introduces problems:

# revert to the pre-backfill Git commit
git checkout <pre-backfill-commit>
# restore the matching dataset version from MinIO
dvc pull
# re-run the pipeline against the restored data
dvc repro

The previous dataset version is restored from MinIO. The MLflow model from the previous training run is still in the registry and can be re-promoted.