# Data Backfills & Reprocessing

This runbook documents how to safely reprocess historical data.
## When backfills are required
- scraping logic changes,
- upstream data corrections,
- schema evolution,
- data quality issues detected post-ingestion.
## Backfill principles
- backfills never mutate existing datasets,
- backfills always produce new dataset versions,
- downstream pipelines must be explicitly re-run.
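The immutability principle can be sketched as a version-allocation helper: each backfill writes to a fresh, timestamped directory and never overwrites an existing version. The layout and function name below are illustrative assumptions, not part of any specific tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def new_dataset_version(base_dir: str, dataset: str) -> Path:
    """Return a fresh, timestamped path for a backfilled dataset.

    Hypothetical layout: <base_dir>/<dataset>/v_<UTC timestamp>.
    Existing versions are never mutated; downstream pipelines pin
    the version they read and must be re-run explicitly.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    version_dir = Path(base_dir) / dataset / f"v_{stamp}"
    if version_dir.exists():  # guard: never overwrite an existing version
        raise FileExistsError(f"{version_dir} already exists")
    return version_dir
```

Downstream consumers then reference an explicit version path rather than a mutable "latest" location.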
## Backfill procedure
1. Identify the affected data range.
2. Trigger a controlled Airflow backfill job.
3. Export new raw Parquet snapshots.
4. Track the new datasets via DVC.
5. Re-run ML pipelines using the new dataset versions.
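The steps above can be sketched as a dry-run helper that assembles the CLI invocations without executing them, which makes the plan reviewable before anything runs. The DAG id and dataset path are placeholders; adapt them to your deployment.

```python
def backfill_commands(
    dag_id: str, start: str, end: str, dataset_path: str
) -> list[list[str]]:
    """Assemble (but do not execute) the CLI steps for a controlled backfill.

    `dag_id` and `dataset_path` are illustrative assumptions.
    """
    return [
        # 1) replay the DAG over the affected date range
        ["airflow", "dags", "backfill", dag_id,
         "--start-date", start, "--end-date", end],
        # 2) track the newly exported Parquet snapshots with DVC
        ["dvc", "add", dataset_path],
        # 3) record the new dataset version pointer in git
        ["git", "commit", "-am", f"backfill {dag_id} {start}..{end}"],
    ]
```

Each command list can be passed to `subprocess.run` once the plan has been reviewed.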
## Safety checks
- validate data contracts on backfilled data,
- compare dataset statistics before/after,
- document backfill reason and scope.
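The before/after comparison can be automated with a simple drift check: compute summary statistics on both dataset versions and flag any statistic whose relative change exceeds a tolerance. The function names and the 5% default are assumptions for illustration; a real pipeline would compare per-column statistics from the actual snapshots.

```python
def stats(values: list[float]) -> dict[str, float]:
    """Summary statistics for one numeric column of a dataset."""
    n = len(values)
    return {
        "count": float(n),
        "mean": sum(values) / n,
        "min": min(values),
        "max": max(values),
    }


def drift_report(
    before: list[float], after: list[float], tol: float = 0.05
) -> dict[str, bool]:
    """Flag statistics whose relative change exceeds `tol` (5% by default)."""
    b, a = stats(before), stats(after)
    return {
        key: abs(a[key] - b[key]) > tol * max(abs(b[key]), 1e-9)
        for key in b
    }
```

A flagged statistic does not automatically invalidate the backfill (a corrected dataset may legitimately shift), but every flag should be explained in the documented backfill scope.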
## Rollback
If a backfill introduces issues:

- discard the new dataset version,
- revert to the previous DVC revision,
- investigate the root cause before retrying.
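A rollback reduces to restoring the `.dvc` pointer files from the last known-good git revision and syncing the workspace to match. A minimal sketch, assuming datasets are tracked via `.dvc` files committed to git; note that checking out a whole revision with `git checkout <rev> -- .` also restores non-dataset files, so a real runbook may scope the path to the data directory.

```python
def rollback_commands(git_rev: str) -> list[list[str]]:
    """Commands (not executed) to restore the dataset state tracked
    at an earlier git revision.

    `git_rev` is whichever commit last tracked the good dataset version.
    """
    return [
        ["git", "checkout", git_rev, "--", "."],  # restore .dvc pointer files
        ["dvc", "checkout"],                      # sync data cache to pointers
    ]
```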