
Canonical Schemas & Lineage

Canonical datasets

The system defines a set of canonical datasets, including:

  • matches,
  • teams,
  • events,
  • aggregated statistics.

Each dataset has:

  • a defined schema,
  • documented semantics,
  • a known update cadence.
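
As a minimal sketch, the three properties above could be recorded together in a small dataset specification. The class names (`Column`, `DatasetSpec`), the column list, and the "daily" cadence are hypothetical and chosen only for illustration; they are not taken from the system itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    description: str  # documented semantics of the column


@dataclass(frozen=True)
class DatasetSpec:
    name: str
    schema_version: int
    columns: tuple[Column, ...]  # the defined schema
    update_cadence: str          # placeholder value such as "daily"

    def column_names(self) -> list[str]:
        return [c.name for c in self.columns]


# Hypothetical spec for the "matches" dataset; columns are illustrative only.
MATCHES = DatasetSpec(
    name="matches",
    schema_version=1,
    columns=(
        Column("match_id", "int64", "Unique identifier of a match"),
        Column("home_team_id", "int64", "Identifier of the home team"),
        Column("away_team_id", "int64", "Identifier of the away team"),
        Column("start_time_utc", "timestamp", "Scheduled start time in UTC"),
    ),
    update_cadence="daily",
)

print(MATCHES.column_names())
```

Keeping the schema, semantics, and cadence in one immutable object means a dataset cannot be registered with only part of its definition.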

Schema evolution

Schema changes are handled via:

  • explicit migrations,
  • backward-compatible extensions when possible,
  • versioned dataset exports.

Breaking changes require:

  • a schema version bump,
  • updated data contracts,
  • downstream validation.
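
The distinction between a backward-compatible extension and a breaking change can be sketched as a simple schema comparison. This assumes a schema is a mapping of column name to type name, that adding columns is compatible, and that removing or retyping a column is breaking. The source does not say whether compatible extensions also bump the version; this sketch leaves the version unchanged for them. The function names are hypothetical.

```python
def classify_change(old: dict[str, str], new: dict[str, str]) -> str:
    """Return "none", "compatible", or "breaking" for a schema change.

    Assumed rule: every existing column must survive with the same type
    for the change to count as backward compatible.
    """
    if old == new:
        return "none"
    for name, dtype in old.items():
        if name not in new or new[name] != dtype:
            return "breaking"
    # All old columns are intact, so the change only adds columns.
    return "compatible"


def next_version(current: int, change: str) -> int:
    """Bump the schema version for breaking changes (assumed policy)."""
    return current + 1 if change == "breaking" else current


# Illustrative schemas; column names are placeholders.
old = {"match_id": "int64", "start_time_utc": "timestamp"}
extended = {**old, "venue": "string"}
retyped = {"match_id": "string", "start_time_utc": "timestamp"}

print(classify_change(old, extended), next_version(1, classify_change(old, extended)))
print(classify_change(old, retyped), next_version(1, classify_change(old, retyped)))
```

A check like this could run before a migration is applied, so that a breaking change is flagged while data contracts and downstream validation can still be updated.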

Data lineage

Lineage is tracked across stages:

  • scraping run → PostgreSQL tables,
  • PostgreSQL snapshot → Parquet export,
  • Parquet version → ML experiment.

Following these links in reverse makes every prediction traceable back to the raw inputs it was derived from.
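
One way to picture this traceability is a registry in which each artifact stores the identifier of the artifact it was built from. The sketch below is an assumption-laden illustration: the `Artifact` class, the identifiers, and the single-parent simplification (one snapshot per scraping run) are all hypothetical and not a description of the actual lineage store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Artifact:
    kind: str                        # e.g. "scraping_run", "parquet_version"
    artifact_id: str
    parent_id: Optional[str] = None  # upstream artifact, None for raw inputs


def trace_to_source(artifact_id: str, registry: dict[str, Artifact]) -> list[Artifact]:
    """Walk parent links from an artifact back to its earliest known ancestor."""
    chain: list[Artifact] = []
    seen: set[str] = set()
    current = registry.get(artifact_id)
    while current is not None and current.artifact_id not in seen:
        seen.add(current.artifact_id)  # guards against accidental cycles
        chain.append(current)
        current = registry.get(current.parent_id) if current.parent_id else None
    return chain


# Hypothetical identifiers, one artifact per stage.
artifacts = [
    Artifact("scraping_run", "run-001"),
    Artifact("postgres_snapshot", "snap-001", parent_id="run-001"),
    Artifact("parquet_version", "pq-v3", parent_id="snap-001"),
    Artifact("ml_experiment", "exp-42", parent_id="pq-v3"),
]
registry = {a.artifact_id: a for a in artifacts}

for step in trace_to_source("exp-42", registry):
    print(step.kind, step.artifact_id)
```

In practice a snapshot may combine several scraping runs, in which case `parent_id` would become a list and the walk would become a graph traversal.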