Ingestion & Canonicalization (Airflow → PostgreSQL)¶
Ingestion boundary¶
Airflow orchestrates the external data ingestion path: scraping, normalization, and writing to PostgreSQL. This is the boundary where unstructured external data becomes a structured canonical record.
Airflow's responsibility ends at PostgreSQL. It does not manage DVC, trigger ML pipelines, or write parquet files. The transition from canonical storage to reproducible ML inputs is a separate boundary described in Raw Export.
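The shape of this path can be sketched as three plain functions chained in the order Airflow would run them. All names here (`scrape_matches`, `normalize`, `upsert_matches`, `"provider-x"`) are illustrative stand-ins, not the project's actual task names:

```python
# Sketch of the ingestion path: scrape -> normalize -> upsert.
# Function and source names are hypothetical, not the real task names.

def scrape_matches(source: str) -> list:
    """Fetch raw match payloads from an external source (stubbed here)."""
    return [{"home": " Team A ", "away": "team b", "status": "FT", "src": source}]

def normalize(raw: list) -> list:
    """Resolve free-form fields into a canonical representation."""
    return [
        {**m, "home": m["home"].strip().title(), "away": m["away"].strip().title()}
        for m in raw
    ]

def upsert_matches(records: list, store: dict) -> None:
    """Write into canonical storage keyed by a natural key (idempotent)."""
    for m in records:
        store[(m["home"], m["away"])] = m

# Airflow's responsibility ends here: the canonical store is populated,
# and nothing downstream (DVC, parquet export) is triggered.
store = {}
upsert_matches(normalize(scrape_matches("provider-x")), store)
```

Replaying the same run simply overwrites the same keys, which is why the boundary is safe to retry.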
Status: ✅ Implemented
PostgreSQL as canonical store¶
PostgreSQL is the authoritative structured representation of all scraped match data. "Canonical" means:
- deduplicated — each match record exists once, identified by natural key,
- normalized — team IDs, tournament IDs, and match status are resolved,
- stable — schema changes are versioned and downstream consumers are notified.
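"Identified by natural key" means the key is derived from the match's own identifying fields rather than an auto-increment ID, so a re-scraped or corrected record maps to the same row. A minimal sketch, assuming hypothetical key fields (the real schema may differ):

```python
import hashlib

def natural_key(match: dict) -> str:
    """Derive a stable natural key from the fields that identify a match.
    The field choice here is illustrative, not the project's actual key."""
    raw = "|".join(
        [match["tournament_id"], match["home_id"], match["away_id"], match["date"]]
    )
    return hashlib.sha1(raw.encode()).hexdigest()

a = {"tournament_id": "t1", "home_id": "h1", "away_id": "a1", "date": "2024-05-01"}
b = dict(a, status="finished")  # same match, arriving later with a correction
assert natural_key(a) == natural_key(b)  # same key -> update, not duplicate
```

Because the key ignores mutable fields like status, late corrections update the existing record instead of creating a second one.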
PostgreSQL is not the ML training source. Training data comes from versioned parquet snapshots exported from PostgreSQL. This separation exists because:
- PostgreSQL data is live and mutable (corrections, late arrivals, schema migrations),
- ML reproducibility requires immutable, content-addressed inputs,
- training directly from a live DB couples the ML pipeline to ingestion infrastructure.
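The "immutable, content-addressed inputs" requirement can be illustrated by naming a snapshot after the hash of its contents. This is a dependency-free sketch: the project exports parquet via DVC, and JSON stands in here only to keep the example self-contained:

```python
import hashlib
import json

def snapshot(records: list) -> tuple:
    """Serialize records deterministically and address the artifact by content hash.
    Illustrative only; the real export writes parquet tracked by DVC."""
    payload = json.dumps(
        sorted(records, key=lambda r: r["id"]), sort_keys=True
    ).encode()
    digest = hashlib.sha256(payload).hexdigest()
    return f"raw_{digest[:12]}.json", payload

name1, _ = snapshot([{"id": 1, "score": "2-1"}])
name2, _ = snapshot([{"id": 1, "score": "2-1"}])
assert name1 == name2  # same content -> same address -> reproducible input
```

A live, mutable table cannot give this guarantee: two reads at different times may return different rows under the same name, which is exactly the coupling the parquet snapshot removes.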
ETL guarantees¶
The ingestion layer provides:
| Guarantee | Mechanism |
|---|---|
| Idempotency | Upsert logic with dedup keys; safe to replay a scrape run |
| Schema stability for downstream | DB schema changes are versioned; breaking changes blocked until raw export is updated |
| Separation from analytics workloads | ML pipeline reads from parquet, not from live PostgreSQL |
| No data loss on partial scrape | Failed scrape runs do not corrupt previously ingested records |
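The idempotency row of the table can be demonstrated with upsert syntax. The sketch below uses SQLite's `INSERT ... ON CONFLICT DO UPDATE`, which is analogous to PostgreSQL's; the table and column names are illustrative, not the project's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE matches (
        natural_key TEXT PRIMARY KEY,
        status      TEXT,
        score       TEXT
    )
""")

def ingest(rows):
    """Upsert on the dedup key, so replaying a scrape run is safe."""
    con.executemany(
        """INSERT INTO matches (natural_key, status, score)
           VALUES (:natural_key, :status, :score)
           ON CONFLICT(natural_key) DO UPDATE SET
               status = excluded.status,
               score  = excluded.score""",
        rows,
    )

run = [{"natural_key": "t1|h1|a1|2024-05-01", "status": "live", "score": "1-0"}]
ingest(run)  # first scrape
ingest(run)  # replayed scrape: same row is updated, no duplicate appears
assert con.execute("SELECT COUNT(*) FROM matches").fetchone()[0] == 1
```

The `excluded.*` pseudo-table refers to the row that failed to insert, so a replay carries forward the latest scraped values without ever creating a second record.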
What ETL does not guarantee¶
- Real-time freshness — ingestion runs on a schedule; late match data is a known gap.
- GE validation at ingest time — data contracts are enforced at the DVC `validate_raw` gate, not at the PostgreSQL write. Invalid records may exist in PostgreSQL temporarily.
- Automated alerting on ETL failure — the Airflow UI surfaces failures; Alertmanager rules are planned but not yet deployed. (📋 Planned)
Downstream connection¶
When an ETL run completes successfully, PostgreSQL contains new or updated records. These records become available for raw export on the next `dvc repro` run.
The ETL stage and the raw export stage are deliberately decoupled:
- ETL runs on a calendar schedule (Airflow),
- raw export runs on an artifact-driven trigger (DVC),
- neither triggers the other automatically.
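The "artifact-driven trigger" can be seen in how a DVC stage is declared: it runs only when `dvc repro` is invoked and its declared dependencies have changed, never because an upstream scheduler finished. A hypothetical `dvc.yaml` fragment (stage and path names are illustrative, not the project's actual ones):

```yaml
stages:
  raw_export:
    cmd: python export_raw.py     # hypothetical export script
    deps:
      - export_raw.py             # code change re-triggers the stage
    outs:
      - data/raw/matches.parquet  # versioned, content-addressed artifact
```

Nothing in this declaration references Airflow, which is precisely the decoupling the section describes.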
This decoupling is intentional. It prevents an ingestion run from silently kicking off model retraining without operator review.
Related¶
- Data Sources & Scraping — what triggers ETL
- Raw Parquet Export — where reproducibility begins
- Architecture: Data & ML Flow — Stage 1
- Runbook: Backfills — replaying ingestion for historical corrections