# Data Backfills & Reprocessing

This runbook documents how to safely reprocess historical data.
## When backfills are required
- scraping logic changes,
- upstream data corrections,
- schema evolution,
- data quality issues detected post-ingestion.
## Backfill principles
- backfills never mutate existing datasets,
- backfills always produce new dataset versions,
- downstream pipelines must be explicitly re-run.
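The immutability principle can be sketched as a version-allocation helper: each backfill writes to a fresh, timestamped directory and never overwrites an existing version. The layout and function name below are illustrative assumptions, not part of any specific tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def new_dataset_version(base_dir: str, dataset: str) -> Path:
    """Return a fresh, timestamped path for a backfilled dataset.

    Hypothetical layout: <base_dir>/<dataset>/v_<UTC timestamp>.
    Existing versions are never mutated; downstream pipelines pin
    the version they read and must be re-run explicitly.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    version_dir = Path(base_dir) / dataset / f"v_{stamp}"
    if version_dir.exists():  # guard: never overwrite an existing version
        raise FileExistsError(f"{version_dir} already exists")
    return version_dir
```

Downstream consumers then reference an explicit version path rather than a mutable "latest" location.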
## Backfill procedure
1. Identify the affected data range.
2. Trigger a controlled Airflow backfill job.
3. Export new raw Parquet snapshots.
4. Track the new datasets via DVC.
5. Re-run ML pipelines using the new dataset versions.
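The steps above can be sketched as a dry-run helper that assembles the CLI invocations without executing them, which makes the plan reviewable before anything runs. The DAG id and dataset path are placeholders; adapt them to your deployment.

```python
def backfill_commands(
    dag_id: str, start: str, end: str, dataset_path: str
) -> list[list[str]]:
    """Assemble (but do not execute) the CLI steps for a controlled backfill.

    `dag_id` and `dataset_path` are illustrative assumptions.
    """
    return [
        # 1) replay the DAG over the affected date range
        ["airflow", "dags", "backfill", dag_id,
         "--start-date", start, "--end-date", end],
        # 2) track the newly exported Parquet snapshots with DVC
        ["dvc", "add", dataset_path],
        # 3) record the new dataset version pointer in git
        ["git", "commit", "-am", f"backfill {dag_id} {start}..{end}"],
    ]
```

Each command list can be passed to `subprocess.run` once the plan has been reviewed.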
## Safety checks
- validate data contracts on backfilled data,
- compare dataset statistics before/after,
- document backfill reason and scope.
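The before/after comparison can be automated with a simple drift check: compute summary statistics on both dataset versions and flag any statistic whose relative change exceeds a tolerance. The function names and the 5% default are assumptions for illustration; a real pipeline would compare per-column statistics from the actual snapshots.

```python
def stats(values: list[float]) -> dict[str, float]:
    """Summary statistics for one numeric column of a dataset."""
    n = len(values)
    return {
        "count": float(n),
        "mean": sum(values) / n,
        "min": min(values),
        "max": max(values),
    }


def drift_report(
    before: list[float], after: list[float], tol: float = 0.05
) -> dict[str, bool]:
    """Flag statistics whose relative change exceeds `tol` (5% by default)."""
    b, a = stats(before), stats(after)
    return {
        key: abs(a[key] - b[key]) > tol * max(abs(b[key]), 1e-9)
        for key in b
    }
```

A flagged statistic does not automatically invalidate the backfill (a corrected dataset may legitimately shift), but every flag should be explained in the documented backfill scope.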
## Rollback
If a backfill introduces issues:

- discard the new dataset version,
- revert to the previous DVC revision,
- investigate the root cause before retrying.
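A rollback reduces to restoring the `.dvc` pointer files from the last known-good git revision and syncing the workspace to match. A minimal sketch, assuming datasets are tracked via `.dvc` files committed to git; note that checking out a whole revision with `git checkout <rev> -- .` also restores non-dataset files, so a real runbook may scope the path to the data directory.

```python
def rollback_commands(git_rev: str) -> list[list[str]]:
    """Commands (not executed) to restore the dataset state tracked
    at an earlier git revision.

    `git_rev` is whichever commit last tracked the good dataset version.
    """
    return [
        ["git", "checkout", git_rev, "--", "."],  # restore .dvc pointer files
        ["dvc", "checkout"],                      # sync data cache to pointers
    ]
```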