Ingestion & Canonicalization (Airflow → PostgreSQL)

Ingestion boundary

Airflow orchestrates the external data ingestion path: scraping, normalization, and the write to PostgreSQL. This is the boundary where unstructured external data becomes a structured canonical record.
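
A minimal sketch of such a DAG, written in the Airflow 2.x TaskFlow style, is shown below. The task breakdown, schedule, and connection ID are illustrative assumptions, not this project's actual code.

```python
# Sketch of an ingestion DAG: scrape -> normalize -> load into PostgreSQL.
# Task names, schedule, and connection ID are illustrative assumptions.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ingest_matches():
    @task
    def scrape() -> list[dict]:
        # Fetch raw match payloads from the external source (scraper code omitted).
        return []

    @task
    def normalize(raw: list[dict]) -> list[dict]:
        # Resolve team IDs, tournament IDs, and match status into canonical form
        # (ID-resolution logic omitted).
        return raw

    @task
    def load(records: list[dict]) -> None:
        # Write to PostgreSQL; Airflow's responsibility ends at this task.
        from airflow.providers.postgres.hooks.postgres import PostgresHook

        hook = PostgresHook(postgres_conn_id="canonical_db")
        # dedup-keyed upsert via hook.run(...); sketched under "ETL guarantees" below
        ...

    load(normalize(scrape()))


ingest_matches()
```

Note that the DAG stops at the load task: no DVC, ML, or parquet steps appear here.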

Airflow's responsibility ends at PostgreSQL. It does not manage DVC, trigger ML pipelines, or write parquet files. The transition from canonical storage to reproducible ML inputs is a separate boundary described in Raw Export.

Status: ✅ Implemented


PostgreSQL as canonical store

PostgreSQL holds the authoritative structured representation of all scraped match data. "Canonical" means:

  • deduplicated — each match record exists once, identified by a natural key (sketched below),
  • normalized — team IDs, tournament IDs, and match status are resolved,
  • stable — schema changes are versioned and downstream consumers are notified.
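
To make the dedup-by-natural-key idea concrete, here is a sketch of what such a canonical match table could look like; the table name, columns, and key choice are illustrative assumptions, not the project's actual schema. The DDL is held in a Python string, as an Airflow task might execute it.

```python
# Hypothetical canonical table: one row per match, deduplicated by a natural key.
# Table name, columns, and key choice are illustrative assumptions.
CREATE_MATCHES_SQL = """
CREATE TABLE IF NOT EXISTS matches (
    match_id      BIGSERIAL PRIMARY KEY,
    tournament_id INTEGER     NOT NULL,   -- resolved tournament ID
    home_team_id  INTEGER     NOT NULL,   -- resolved team ID
    away_team_id  INTEGER     NOT NULL,   -- resolved team ID
    started_at    TIMESTAMPTZ NOT NULL,
    status        TEXT        NOT NULL,   -- normalized match status
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tournament_id, home_team_id, away_team_id, started_at)  -- natural key
);
"""
```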

PostgreSQL is not the ML training source. Training data comes from versioned parquet snapshots exported from PostgreSQL. This separation exists because:

  • PostgreSQL data is live and mutable (corrections, late arrivals, schema migrations),
  • ML reproducibility requires immutable, content-addressed inputs,
  • training directly from a live DB couples the ML pipeline to ingestion infrastructure.

ETL guarantees

The ingestion layer provides the following guarantees, each backed by a mechanism:

  • Idempotency: upsert logic with dedup keys; a scrape run is safe to replay.
  • Schema stability for downstream: DB schema changes are versioned; breaking changes are blocked until the raw export is updated.
  • Separation from analytics workloads: the ML pipeline reads from parquet, not from live PostgreSQL.
  • No data loss on partial scrape: failed scrape runs do not corrupt previously ingested records.
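
A minimal sketch of the idempotency mechanism, assuming the hypothetical matches table above: an upsert keyed on the natural key, so replaying the same scrape run updates rows in place instead of duplicating them.

```python
# Hypothetical dedup-keyed upsert; replaying a scrape run is safe because
# conflicting rows are updated in place rather than inserted twice.
UPSERT_MATCH_SQL = """
INSERT INTO matches (tournament_id, home_team_id, away_team_id, started_at, status)
VALUES (%(tournament_id)s, %(home_team_id)s, %(away_team_id)s, %(started_at)s, %(status)s)
ON CONFLICT (tournament_id, home_team_id, away_team_id, started_at)
DO UPDATE SET status = EXCLUDED.status, updated_at = now();
"""


def upsert_matches(conn, records: list[dict]) -> None:
    """Write normalized match records; `conn` is a psycopg2 connection."""
    with conn, conn.cursor() as cur:
        for rec in records:
            cur.execute(UPSERT_MATCH_SQL, rec)
```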

What ETL does not guarantee

  • Real-time freshness — ingestion runs on a schedule; late match data is a known gap.
  • GE validation at ingest time — data contracts are enforced at the DVC validate_raw gate, not at the PostgreSQL write. Invalid records may exist in PostgreSQL temporarily.
  • Automated alerting on ETL failure — Airflow UI surfaces failures; Alertmanager rules are planned but not yet deployed. (📋 Planned)

Downstream connection

When an ETL run completes successfully, PostgreSQL contains new or updated records. These records become available for raw export on the next dvc repro run.
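
For context, the consuming side of that boundary (documented in Raw Export) might look roughly like the sketch below: a DVC-tracked script that reads the canonical table and writes a parquet snapshot. The connection string, query, and output path are assumptions.

```python
# Hypothetical raw-export step executed by `dvc repro`: read the canonical
# table and write a parquet snapshot for the ML pipeline. DSN, query, and
# output path are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine


def export_raw(
    dsn: str = "postgresql+psycopg2://etl@localhost/canonical",
    out_path: str = "data/raw/matches.parquet",
) -> None:
    engine = create_engine(dsn)
    df = pd.read_sql("SELECT * FROM matches", engine)
    df.to_parquet(out_path, index=False)


if __name__ == "__main__":
    export_raw()
```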

The ETL stage and the raw export stage are deliberately decoupled:

  • ETL runs on a calendar schedule (Airflow),
  • raw export runs on an artifact-driven trigger (DVC),
  • neither triggers the other automatically.

This decoupling is intentional. It prevents an ingestion run from silently triggering a new model training run without operator review.