Data Sources & Scraping

Primary data source

All football match data is scraped from a single upstream source:

  • WhoScored.com — match statistics, teams, events, and outcomes.

The source is external, uncontrolled, and subject to:

  • schema changes,
  • missing or delayed data,
  • partial updates.

This imposes strong requirements on data validation and robustness.
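A minimal validation gate can make these requirements concrete. The sketch below is illustrative only: the field names (`match_id`, `home_team`, `away_team`, `kickoff`) are assumed, not the project's actual schema.

```python
# Hypothetical required fields; the real schema may differ.
REQUIRED = {"match_id", "home_team", "away_team", "kickoff"}

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes the gate."""
    problems = [f"missing field: {f}" for f in REQUIRED - record.keys()]
    # Reject records where the kickoff field is present but blank,
    # a typical symptom of a partially updated upstream page.
    if "kickoff" in record and not str(record["kickoff"]).strip():
        problems.append("empty kickoff")
    return problems
```

Records failing the gate would be quarantined rather than silently dropped, so a schema change upstream surfaces as a spike in rejections.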


Scraping strategy

  • Scraping jobs are scheduled and executed via Airflow.
  • Scrapers extract raw HTML/JSON data and normalize it into structured records.
  • Each scraping run is timestamped and traceable.
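The normalization and traceability steps above can be sketched as follows. This is a simplified illustration, not the project's actual code: the raw payload keys (`matchId`, `home`, `away`) and the `run_id` convention are assumptions.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class MatchRecord:
    match_id: str    # natural key taken from the source payload (assumed field name)
    home_team: str
    away_team: str
    scraped_at: str  # ISO timestamp of the scraping run
    run_id: str      # identifier of the scheduler run that produced the record
    raw_hash: str    # hash of the raw payload, for traceability back to the source

def normalize(raw: dict, run_id: str) -> MatchRecord:
    """Turn one raw scraped payload into a structured, traceable record."""
    payload = json.dumps(raw, sort_keys=True).encode()
    return MatchRecord(
        match_id=str(raw["matchId"]),
        home_team=raw["home"]["name"],
        away_team=raw["away"]["name"],
        scraped_at=datetime.now(timezone.utc).isoformat(),
        run_id=run_id,
        raw_hash=hashlib.sha256(payload).hexdigest(),
    )
```

Storing the raw-payload hash alongside each record lets any downstream anomaly be traced to the exact scrape that produced it.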

Idempotency and deduplication

To mitigate scraper instability:

  • records are upserted using natural or composite keys,
  • duplicate matches/events are detected and filtered,
  • partial runs do not corrupt historical data.
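The upsert behaviour above can be illustrated with an in-memory store keyed on an assumed composite key of `(match_id, event_id)`; the real store and key columns may differ.

```python
def upsert(store: dict, records: list[dict]) -> dict:
    """Idempotent upsert keyed on (match_id, event_id).

    Re-running a partial or repeated scrape overwrites existing rows
    instead of duplicating them, so history is never corrupted.
    """
    for rec in records:
        key = (rec["match_id"], rec["event_id"])  # assumed composite key
        store[key] = rec  # last write wins; duplicates collapse onto one row
    return store
```

Because the operation is keyed, applying the same batch twice leaves the store unchanged, which is what makes retried or partial runs safe.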

Failure modes

Common failure scenarios:

  • upstream HTML structure changes,
  • incomplete match data,
  • temporary network or rate-limit issues.

Mitigation:

  • retries with backoff,
  • monitoring of scrape volume and freshness,
  • downstream quality gates.
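The first mitigation, retries with backoff, can be sketched as below. The function name and retry parameters are illustrative, not taken from the project.

```python
import random
import time

def fetch_with_retry(fetch, retries: int = 4, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Call a flaky fetch function, retrying transient failures.

    Delay grows exponentially (base_delay * 2**attempt) with a little
    jitter, so repeated failures back off instead of hammering the source.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise  # exhausted retries: surface the error to monitoring
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            sleep(delay)
```

Rate-limit responses would be handled the same way; persistent failures escalate to the monitoring and quality gates mentioned above.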

Why scraping is isolated from ML pipelines

Scraping is treated as external data ingestion:

  • it is not reproducible in the strict ML sense,
  • it depends on third-party availability.

ML pipelines operate only on materialized, versioned snapshots.