Data Sources & Scraping¶
Primary data source¶
All football match data is obtained by scraping:
- WhoScored.com — match statistics, teams, events, and outcomes.
The source is external, uncontrolled, and subject to:
- schema changes,
- missing or delayed data,
- partial updates.
This imposes strong requirements on data validation and robustness.
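One way to enforce such validation is a per-record quality gate applied before anything is persisted. A minimal sketch, assuming a flat record shape; the required field names here are illustrative, not the project's actual schema:

```python
# Hypothetical required fields for one scraped match record.
REQUIRED_FIELDS = ("match_id", "home_team", "away_team", "kickoff")

def is_valid(record: dict) -> bool:
    """Reject records with missing or empty required fields.

    Guards against partial updates and delayed data from the upstream source.
    """
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)
```

Records failing this check can be quarantined for inspection rather than silently dropped.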
Scraping strategy¶
- Scraping jobs are scheduled and executed via Airflow.
- Scrapers extract raw HTML/JSON data and normalize it into structured records.
- Each scraping run is timestamped and traceable.
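The normalization step can be sketched as a pure function from a raw payload to a structured, run-timestamped record. This is a minimal illustration; the field names and raw-payload shape are assumptions, not the actual WhoScored schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical normalized record; fields are illustrative only.
@dataclass(frozen=True)
class MatchRecord:
    match_id: str
    home_team: str
    away_team: str
    scraped_at: str  # ISO-8601 timestamp of the scraping run, for traceability

def normalize(raw: dict, run_ts: datetime) -> MatchRecord:
    """Map one raw scraped payload into a structured, timestamped record."""
    return MatchRecord(
        match_id=str(raw["id"]),
        home_team=raw["home"]["name"],
        away_team=raw["away"]["name"],
        scraped_at=run_ts.isoformat(),
    )
```

Stamping each record with the run timestamp is what makes individual scraping runs traceable downstream.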
Idempotency and deduplication¶
To mitigate scraper instability:
- records are upserted using natural or composite keys,
- duplicate matches/events are detected and filtered,
- partial runs do not corrupt historical data.
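A key-based upsert can be sketched with SQLite's `INSERT ... ON CONFLICT`; the table layout and the composite key `(match_date, home_team, away_team)` are assumptions for illustration:

```python
import sqlite3

def upsert_match(conn: sqlite3.Connection, match: dict) -> None:
    """Insert or update one match, keyed on a composite natural key.

    Re-running the same scrape overwrites the existing row instead of
    duplicating it, so partial or repeated runs stay idempotent.
    """
    conn.execute(
        """
        INSERT INTO matches (match_date, home_team, away_team, home_goals, away_goals)
        VALUES (:match_date, :home_team, :away_team, :home_goals, :away_goals)
        ON CONFLICT (match_date, home_team, away_team) DO UPDATE SET
            home_goals = excluded.home_goals,
            away_goals = excluded.away_goals
        """,
        match,
    )
```

The `ON CONFLICT` target must match a `UNIQUE` constraint on the key columns; that constraint is what turns re-scrapes into updates rather than duplicates.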
Failure modes¶
Common failure scenarios:
- upstream HTML structure changes,
- incomplete match data,
- temporary network or rate-limit issues.
Mitigation:
- retries with backoff,
- monitoring of scrape volume and freshness,
- downstream quality gates.
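The retry-with-backoff mitigation can be sketched generically; `fetch` and the tuning constants below are placeholders, not the project's actual HTTP client:

```python
import random
import time

def fetch_with_retries(fetch, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fetch(), retrying transient failures with exponential backoff.

    Jitter spreads retries out so concurrent scrapers do not hammer the
    upstream in lockstep after a rate-limit or network blip.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # exhausted retries: surface the failure to monitoring
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

Permanent failures (e.g. a changed HTML structure) should not be retried this way; they fail fast and are caught by the downstream quality gates instead.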
Why scraping is isolated from ML pipelines¶
Scraping is treated as external data ingestion:
- it is not reproducible in the strict ML sense,
- it depends on third-party availability.
ML pipelines operate only on materialized, versioned snapshots.
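Snapshot materialization can be sketched as writing an immutable, versioned file that ML pipelines pin to; the directory layout and `snapshot_id` naming convention are assumptions:

```python
import json
from pathlib import Path

def materialize_snapshot(records: list[dict], root: Path, snapshot_id: str) -> Path:
    """Write an immutable, versioned snapshot of scraped records.

    ML pipelines read only these frozen files, never the live scrape output,
    which decouples training reproducibility from upstream availability.
    """
    out = root / f"snapshot_{snapshot_id}.jsonl"
    if out.exists():
        # Snapshots are write-once: refusing to overwrite keeps them immutable.
        raise FileExistsError(f"snapshot {snapshot_id} already materialized")
    with out.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, sort_keys=True) + "\n")
    return out
```

Because a snapshot is never rewritten, a model trained against `snapshot_id` can always be traced back to the exact data it saw.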