Data Backfills & Reprocessing

This runbook documents how to safely reprocess historical data.


When backfills are required

  • scraping logic changes,
  • upstream data corrections,
  • schema evolution,
  • data quality issues detected post-ingestion.

Backfill principles

  • backfills never mutate existing datasets,
  • backfills always produce new dataset versions,
  • downstream pipelines must be explicitly re-run.
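The no-mutation principle can be made concrete in code: backfilled output goes to a fresh versioned path rather than overwriting an existing one. A minimal sketch, assuming a `data/raw/<dataset>/v<N>` directory layout (the layout itself is an assumption, not part of this runbook):

```python
from pathlib import PurePosixPath

def next_version_path(dataset_root: str, current_version: int) -> str:
    """Return the write path for a new dataset version.

    Backfills write to a fresh versioned directory (e.g. .../v3) so the
    previous version is never mutated in place. Path layout is assumed.
    """
    return str(PurePosixPath(dataset_root) / f"v{current_version + 1}")

# Backfilling a dataset currently at v2 writes to a new v3 directory:
print(next_version_path("data/raw/events", 2))  # data/raw/events/v3
```

Downstream pipelines then opt in to the new version explicitly by pointing at the new path.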

Backfill procedure

  1. Identify affected data range.
  2. Trigger controlled Airflow backfill job.
  3. Export new raw parquet snapshots.
  4. Track new datasets via DVC.
  5. Re-run ML pipelines using new versions.
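Steps 2 and 4 above can be scripted so the exact commands are reviewable before they run. A sketch, assuming the `airflow dags backfill` and `dvc add`/`dvc push` CLIs; the DAG id and snapshot path are hypothetical placeholders:

```python
def backfill_plan(dag_id: str, start: str, end: str, snapshot_dir: str):
    """Build the command sequence for a controlled backfill.

    Returns argv lists (for e.g. subprocess.run) covering step 2 (Airflow
    backfill) and step 4 (track new snapshots via DVC). Dates are ISO
    YYYY-MM-DD strings; dag_id and snapshot_dir are placeholders.
    """
    return [
        # Step 2: controlled Airflow backfill over the affected range.
        ["airflow", "dags", "backfill",
         "--start-date", start, "--end-date", end, dag_id],
        # Step 4: track the new snapshot directory and push to remote storage.
        ["dvc", "add", snapshot_dir],
        ["dvc", "push"],
    ]

plan = backfill_plan("scrape_events", "2024-01-01", "2024-01-07",
                     "data/raw/events/v3")
for cmd in plan:
    print(" ".join(cmd))
```

Printing the plan first (rather than executing immediately) keeps the backfill reviewable, in line with the "controlled" requirement in step 2.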

Safety checks

  • validate data contracts on backfilled data,
  • compare dataset statistics before/after,
  • document backfill reason and scope.
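The before/after statistics comparison can be a simple automated gate. A minimal sketch (the specific statistics and the 10% row-count threshold are illustrative assumptions, not prescribed values):

```python
def dataset_stats(rows):
    """Summary statistics for comparing a dataset before/after a backfill."""
    n = len(rows)
    nulls = sum(1 for r in rows if r is None)
    return {"row_count": n, "null_fraction": nulls / n if n else 0.0}

def stats_diff_ok(before, after, max_row_change=0.10):
    """Pass the check if row count shifted by at most max_row_change (10%)."""
    base = max(before["row_count"], 1)
    delta = abs(after["row_count"] - before["row_count"]) / base
    return delta <= max_row_change

before = dataset_stats([1, 2, None, 4] * 25)   # 100 rows
after = dataset_stats([1, 2, 3, 4] * 26)       # 104 rows, nulls corrected
print(stats_diff_ok(before, after))  # True: within 10% row-count drift
```

A failed check should block step 5 (re-running ML pipelines) until the discrepancy is documented or explained.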

Rollback

If a backfill introduces issues:

  • discard the new dataset version,
  • revert to the previous DVC revision,
  • investigate the root cause before retrying.
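Reverting to the previous DVC revision amounts to restoring the `.dvc` pointer file from a known-good commit and syncing the workspace. A sketch of the command sequence, assuming the standard `git checkout <rev> -- <file>` plus `dvc checkout` workflow; the pointer path and revision are hypothetical:

```python
def rollback_commands(dvc_file: str, good_rev: str):
    """Build the commands to revert a dataset to a known-good DVC revision.

    Restores the .dvc pointer file from good_rev, then lets DVC sync the
    workspace contents to match it. The backfilled data itself is left in
    the cache, untouched, for root-cause investigation.
    """
    return [
        ["git", "checkout", good_rev, "--", dvc_file],
        ["dvc", "checkout", dvc_file],
    ]

for cmd in rollback_commands("data/raw/events.dvc", "a1b2c3d"):
    print(" ".join(cmd))
```

Because backfills never mutate existing datasets, the previous version is guaranteed to still exist; rollback is a pointer change, not a restore from backup.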