Canonical Schemas & Lineage
Canonical datasets
The system defines a set of canonical datasets, including:
- matches,
- teams,
- events,
- aggregated statistics.
Each dataset has:
- a defined schema,
- documented semantics,
- a known update cadence.
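As an illustration, the three properties above could be captured in one small descriptor per dataset. This is a minimal sketch, not the system's actual definition: `DatasetSpec`, `missing_columns`, the column names, the type labels, and the `"daily"` cadence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical descriptor for one canonical dataset."""
    name: str
    schema_version: int
    columns: dict[str, str]   # column name -> logical type (assumed labels)
    semantics: str            # what one row means
    update_cadence: str       # how often the dataset is refreshed


# Assumed example values; the real columns of `matches` are not given here.
MATCHES = DatasetSpec(
    name="matches",
    schema_version=1,
    columns={
        "match_id": "int64",
        "home_team_id": "int64",
        "away_team_id": "int64",
        "kickoff_utc": "timestamp",
    },
    semantics="One row per scheduled or played match.",
    update_cadence="daily",
)


def missing_columns(spec: DatasetSpec, row: dict) -> list[str]:
    """Return the schema columns that are absent from a row, sorted by name."""
    return sorted(col for col in spec.columns if col not in row)
```

Keeping schema, semantics, and cadence in one object means a consumer can check a row against the schema and read the documentation from the same place.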
Schema evolution
Schema changes are handled via:
- explicit migrations,
- backward-compatible extensions when possible,
- versioned dataset exports.
Breaking changes require:
- a schema version bump,
- updated data contracts,
- downstream validation.
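The distinction between a backward-compatible extension and a breaking change can be sketched as a simple schema comparison. The rule used here is an assumption: adding a column counts as compatible, while dropping or retyping an existing column counts as breaking. The `(major, minor)` version pair and both function names are hypothetical as well.

```python
def is_backward_compatible(old: dict[str, str], new: dict[str, str]) -> bool:
    """True if every old column still exists in `new` with the same type.

    Extra columns in `new` are allowed (assumed definition of a
    backward-compatible extension).
    """
    return all(new.get(col) == typ for col, typ in old.items())


def next_schema_version(
    version: tuple[int, int],
    old: dict[str, str],
    new: dict[str, str],
) -> tuple[int, int]:
    """Return the version after moving from `old` to `new`.

    Assumed scheme: unchanged schema keeps the version, a compatible
    extension bumps the minor part, a breaking change bumps the major
    part and resets the minor part.
    """
    major, minor = version
    if old == new:
        return version
    if is_backward_compatible(old, new):
        return (major, minor + 1)
    return (major + 1, 0)
```

A check like this could run before a migration is merged, so that a major bump also triggers the contract update and downstream validation listed above.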
Data lineage
Lineage is tracked across stages:
- scraping run → PostgreSQL tables,
- PostgreSQL snapshot → Parquet export,
- Parquet version → ML experiment.
Together, these links make each prediction traceable back to the raw inputs it was derived from.
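One way to picture the lineage chain is as a linked list in which each artifact points to the artifact it was produced from. This is only a sketch under stated assumptions: `LineageNode`, `trace`, the `kind` labels, and the identifiers are hypothetical, and how the system actually stores lineage is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LineageNode:
    """Hypothetical lineage record: one artifact and its direct source."""
    kind: str                                # e.g. "scraping_run" (assumed label)
    ident: str                               # identifier of the artifact
    parent: Optional["LineageNode"] = None   # upstream artifact, if any


def trace(node: LineageNode) -> list[str]:
    """Walk from an artifact back to its original source, nearest first."""
    chain = []
    current: Optional[LineageNode] = node
    while current is not None:
        chain.append(f"{current.kind}:{current.ident}")
        current = current.parent
    return chain


# Assumed identifiers, mirroring the stages listed above.
scrape = LineageNode("scraping_run", "run-001")
snapshot = LineageNode("postgres_snapshot", "snap-001", scrape)
export = LineageNode("parquet_export", "v1", snapshot)
experiment = LineageNode("ml_experiment", "exp-001", export)
```

Calling `trace(experiment)` walks the chain from the experiment through the Parquet export and the PostgreSQL snapshot to the scraping run.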