Data Sources & Scraping

Primary data source

All football match data is scraped from a single upstream source:

  • WhoScored.com — match statistics, teams, events, and outcomes.

The source is external, uncontrolled, and subject to:

  • schema changes,
  • missing or delayed data,
  • partial updates.

This imposes strong requirements on data validation and robustness.
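A minimal validation gate can make these requirements concrete. The sketch below is illustrative only: the field names (`match_id`, `home_team`, `away_team`, `kickoff`) are assumed, not the project's actual schema.

```python
# Hypothetical required fields; the real schema may differ.
REQUIRED = {"match_id", "home_team", "away_team", "kickoff"}

def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes the gate."""
    problems = [f"missing field: {f}" for f in REQUIRED - record.keys()]
    # Reject records where the kickoff field is present but blank,
    # a typical symptom of a partially updated upstream page.
    if "kickoff" in record and not str(record["kickoff"]).strip():
        problems.append("empty kickoff")
    return problems
```

Records failing the gate would be quarantined rather than silently dropped, so a schema change upstream surfaces as a spike in rejections.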


Scraping strategy

  • Scraping jobs are scheduled and executed via Airflow.
  • Scrapers extract raw HTML/JSON data and normalize it into structured records.
  • Each scraping run is timestamped and traceable.
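The normalization and traceability steps above can be sketched as follows. This is a simplified illustration, not the project's actual code: the raw payload keys (`matchId`, `home`, `away`) and the `run_id` convention are assumptions.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class MatchRecord:
    match_id: str    # natural key taken from the source payload (assumed field name)
    home_team: str
    away_team: str
    scraped_at: str  # ISO timestamp of the scraping run
    run_id: str      # identifier of the scheduler run that produced the record
    raw_hash: str    # hash of the raw payload, for traceability back to the source

def normalize(raw: dict, run_id: str) -> MatchRecord:
    """Turn one raw scraped payload into a structured, traceable record."""
    payload = json.dumps(raw, sort_keys=True).encode()
    return MatchRecord(
        match_id=str(raw["matchId"]),
        home_team=raw["home"]["name"],
        away_team=raw["away"]["name"],
        scraped_at=datetime.now(timezone.utc).isoformat(),
        run_id=run_id,
        raw_hash=hashlib.sha256(payload).hexdigest(),
    )
```

Storing the raw-payload hash alongside each record lets any downstream anomaly be traced to the exact scrape that produced it.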

Idempotency and deduplication

To mitigate scraper instability:

  • records are upserted using natural or composite keys,
  • duplicate matches/events are detected and filtered,
  • partial runs do not corrupt historical data.
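The upsert behaviour above can be illustrated with an in-memory store keyed on an assumed composite key of `(match_id, event_id)`; the real store and key columns may differ.

```python
def upsert(store: dict, records: list[dict]) -> dict:
    """Idempotent upsert keyed on (match_id, event_id).

    Re-running a partial or repeated scrape overwrites existing rows
    instead of duplicating them, so history is never corrupted.
    """
    for rec in records:
        key = (rec["match_id"], rec["event_id"])  # assumed composite key
        store[key] = rec  # last write wins; duplicates collapse onto one row
    return store
```

Because the operation is keyed, applying the same batch twice leaves the store unchanged, which is what makes retried or partial runs safe.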

Failure modes

Common failure scenarios:

  • upstream HTML structure changes,
  • incomplete match data,
  • temporary network or rate-limit issues.

Mitigation:

  • retries with backoff,
  • monitoring of scrape volume and freshness,
  • downstream quality gates.
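The first mitigation, retries with backoff, can be sketched as below. The function name and retry parameters are illustrative, not taken from the project.

```python
import random
import time

def fetch_with_retry(fetch, retries: int = 4, base_delay: float = 1.0,
                     sleep=time.sleep):
    """Call a flaky fetch function, retrying transient failures.

    Delay grows exponentially (base_delay * 2**attempt) with a little
    jitter, so repeated failures back off instead of hammering the source.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise  # exhausted retries: surface the error to monitoring
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.5)
            sleep(delay)
```

Rate-limit responses would be handled the same way; persistent failures escalate to the monitoring and quality gates mentioned above.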

Why scraping is isolated from ML pipelines

Scraping is treated as external data ingestion:

  • it is not reproducible in the strict ML sense,
  • it depends on third-party availability.

ML pipelines operate only on materialized, versioned snapshots.