Data Sources & Scraping

Primary source

All football match data originates from WhoScored.com — match statistics, team records, event data, and outcomes. This is the system's only upstream data source.

WhoScored is an untrusted external dependency. The system treats it as such:

  • its HTML structure and data delivery are uncontrolled and subject to change without notice,
  • data may be partial, delayed, or temporarily unavailable,
  • there is no API contract — scraping is fragile by nature.

The scraping layer exists to acquire data from this source. It does not make data reproducible. Reproducibility starts at the raw export stage.


Scraping mechanism

Scraping is executed via a Celery task chain triggered by an Airflow DAG:

  1. Airflow sends POST /scrape to the FastAPI service on a configurable schedule.
  2. The FastAPI service enqueues a task to the RabbitMQ api queue.
  3. celery-worker-api drives a headless browser session via Selenoid to scrape WhoScored.
  4. Scraped records are normalized and written to PostgreSQL.

The scraping path uses browser automation (not a direct HTTP API) because WhoScored requires JavaScript rendering. Selenoid is an operator-managed external service, outside the K8s cluster.

Status: ✅ Implemented


Trust model

| Stage | Trust level | Consequence |
| --- | --- | --- |
| WhoScored.com HTML | Untrusted | Validated by Great Expectations after raw export |
| Scraped records in PostgreSQL | Canonical but unvalidated | Validation deferred to the DVC validate_raw gate |
| Raw parquet snapshot | Validated against contract | Only data that passes validate_raw proceeds |

Downstream ML pipelines never interact directly with the scraping layer. They operate only on materialized, DVC-versioned parquet snapshots that have passed the validate_raw gate.


Idempotency and deduplication

Scraping runs are designed to be safe to replay:

  • records are upserted in PostgreSQL using natural or composite keys,
  • duplicate match/event records are detected and rejected at the DB level,
  • a failed or partial scrape run does not corrupt previously ingested data.

This is a safety property of ingestion. It does not affect DVC reproducibility — DVC reproducibility begins downstream at the raw export stage.
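The upsert behaviour can be sketched with plain SQL. The table, columns, and key below are hypothetical (the real schema lives in the canonical tables); SQLite stands in for PostgreSQL only because it accepts the same `INSERT ... ON CONFLICT` syntax.

```python
# Sketch of replay-safe ingestion. Table name, columns, and key are
# hypothetical; SQLite is used only because it shares PostgreSQL's
# INSERT ... ON CONFLICT syntax.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE matches (
           match_id   INTEGER PRIMARY KEY,   -- natural key
           home_team  TEXT,
           away_team  TEXT,
           home_goals INTEGER
       )"""
)

def upsert_match(row: tuple) -> None:
    # Replaying a scrape run updates the existing row instead of inserting
    # a duplicate, so partial or repeated runs never corrupt earlier data.
    conn.execute(
        """INSERT INTO matches (match_id, home_team, away_team, home_goals)
           VALUES (?, ?, ?, ?)
           ON CONFLICT (match_id)
           DO UPDATE SET home_goals = excluded.home_goals""",
        row,
    )

upsert_match((1001, "Arsenal", "Chelsea", 1))
upsert_match((1001, "Arsenal", "Chelsea", 2))  # replayed run, corrected score
count, goals = conn.execute(
    "SELECT COUNT(*), MAX(home_goals) FROM matches"
).fetchone()
# one row survives, holding the latest value
```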


What is materialized into canonical data

Every successfully ingested scrape run adds records to the canonical PostgreSQL tables. What is not stored:

  • raw HTML — not archived (no HTML fallback in MinIO),
  • intermediate normalized form before DB write — ephemeral in the Celery worker.

The PostgreSQL record is the only durable output of a scraping run.


Current limitations

| Limitation | Status |
| --- | --- |
| Retry with backoff on Celery task failure | ✅ Implemented |
| Airflow DAG failure visibility | ✅ Airflow UI |
| Freshness monitoring / alerts | 📋 Planned — alerting rules not yet deployed in Alertmanager |
| Cached HTML fallback | ❌ Not implemented — raw HTML is not archived |
| Scrape volume anomaly detection | 📋 Planned |


Secondary source: Bookmaker odds (Pari)

Status: ✅ Implemented

  • Production module: src/data/odds_pari.py
  • Daily Airflow DAG: airflow/dags/etl_odds_01.py (schedule @daily, 00:05 UTC)
  • Snapshot output: data/raw/odds_pari/date=YYYY-MM-DD/snapshot.parquet
  • Historical fallback (Path B): src/data/odds_fdco.py + DVC stage load_odds_fdco → data/raw/odds_fdco.parquet

Purpose: External evaluation only — bookmaker implied probabilities serve as the Tier 4 benchmark (see Baseline & Benchmarks). Odds are not used as model input features. Use cases: ROI simulation on the held-out test set, calibration reference, model sanity check.
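For benchmark use, decimal odds are converted to implied probabilities. The sketch below uses proportional margin removal, which is one common convention, not necessarily the project's exact method; the odds values are invented.

```python
def implied_probabilities(odd_home: float, odd_draw: float, odd_away: float) -> list[float]:
    # Inverse odds sum to slightly more than 1 because of the bookmaker
    # margin; dividing by that sum removes the overround proportionally.
    raw = [1 / odd_home, 1 / odd_draw, 1 / odd_away]
    overround = sum(raw)
    return [p / overround for p in raw]

probs = implied_probabilities(2.10, 3.40, 3.60)  # hypothetical 1X2 odds
# probs sums to 1.0 and orders home > draw > away for these odds
```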

API

Unofficial internal API of the Pari bookmaker (Russian market, scopeMarket=1600). No public contract; endpoint stability is not guaranteed.

  • Endpoint: GET https://line-lb61-w.bk6bba-resources.com/ma/events/list
  • Params: lang=en, version=<cursor>, scopeMarket=1600
  • Delta-sync: the version field in the response is the cursor for the next incremental request; version=1 returns the full snapshot.
  • Pydantic validation: src/app/config/validate_bets.PariEvents (extra="forbid" on all models).
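The delta-sync handshake can be sketched as below. Only the endpoint, the query parameters, and the `version` cursor are documented above; response handling and timeouts are assumptions.

```python
# Sketch of the delta-sync cursor protocol. Only the endpoint, params, and
# the `version` cursor are documented; the rest is illustrative.
import requests

ENDPOINT = "https://line-lb61-w.bk6bba-resources.com/ma/events/list"

def fetch_events(version: int = 1) -> dict:
    # version=1 -> full snapshot; a previously returned cursor -> delta.
    resp = requests.get(
        ENDPOINT,
        params={"lang": "en", "version": version, "scopeMarket": 1600},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def next_cursor(payload: dict) -> int:
    # Each response carries the cursor for the next incremental request.
    return payload["version"]

# Usage (network calls, not executed here):
# snapshot = fetch_events(1)
# delta = fetch_events(next_cursor(snapshot))
```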

Market IDs (1X2)

| Factor ID | Market |
| --- | --- |
| 921 | P1 — home win |
| 922 | X — draw |
| 923 | P2 — away win |

Verified by checking the bookmaker margin (vig): observed mean ~1.087, consistent with the expected 1.05–1.10 range for this market.
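The margin check is simply the sum of inverse odds; a quick sketch with invented odds values:

```python
# Sanity check used to confirm the 1X2 factor IDs: genuine 1X2 odds imply
# a margin a few percent above 1. Odds values here are illustrative.
def bookmaker_margin(odd_home: float, odd_draw: float, odd_away: float) -> float:
    return 1 / odd_home + 1 / odd_draw + 1 / odd_away

m = bookmaker_margin(1.90, 3.30, 4.00)
# m ≈ 1.079, inside the expected 1.05–1.10 band; a value near 1.0 or far
# above 1.10 would suggest the IDs do not form a single 1X2 market
```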

Available markets (factor ID reference)

Each entry in customFactors[].factors has four fields:

| Field | Description |
| --- | --- |
| f | Factor ID — market type identifier |
| v | Decimal odds value |
| p | Numeric parameter (e.g. handicap line, total line) |
| pt | String parameter label (e.g. "-1.5", "2.5") |

Markets without p/pt are unconditional (1X2, double chance, both teams score). Markets with p/pt are parameterised (handicap, totals).
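Extracting the 1X2 odds from one event's factor list might look like this. The sample payload follows the field table above, but the odds values are invented.

```python
# Map the documented 1X2 factor IDs to snapshot columns; all other
# markets (handicaps, totals, ...) are skipped.
FACTOR_1X2 = {921: "odd_home", 922: "odd_draw", 923: "odd_away"}

def extract_1x2(factors: list[dict]) -> dict:
    row = {}
    for factor in factors:
        column = FACTOR_1X2.get(factor["f"])
        if column is not None:
            row[column] = factor["v"]   # v holds the decimal odds
    return row

sample = [
    {"f": 921, "v": 2.10},
    {"f": 922, "v": 3.40},
    {"f": 923, "v": 3.60},
    {"f": 930, "v": 1.95},  # Total goals Over 2.5 — ignored here
]
# extract_1x2(sample) -> {"odd_home": 2.1, "odd_draw": 3.4, "odd_away": 3.6}
```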

Full factor ID reference observed in the API:

| Factor ID | Market | Description |
| --- | --- | --- |
| 921 | 1X2: P1 | Home win |
| 922 | 1X2: X | Draw |
| 923 | 1X2: P2 | Away win |
| 924 | Double chance: 1X | Home win or draw |
| 925 | Double chance: 2X | Away win or draw |
| 1571 | Double chance: 12 | Home win or away win (no draw) |
| 4241 | Both teams score: Yes | Both teams score |
| 4242 | Both teams score: No | At least one team does not score |
| 910 | Handicap | Parameterised (e.g. -1.5, +1.5) |
| 912 | Handicap | Parameterised (e.g. +1.5) |
| 989 | Handicap | Parameterised (-1) |
| 991 | Handicap | Parameterised (+1) |
| 1569 | Handicap | Parameterised (+1) |
| 1572 | Handicap | Parameterised (-1) |
| 1672 | Asian handicap | Parameterised (+1.5) |
| 1675 | Asian handicap | Parameterised (-1.5) |
| 927 | Total goals Over | Parameterised (line in pt) |
| 928 | Total goals Under | Parameterised (line in pt) |
| 930 | Total goals Over 2.5 | Fixed line |
| 931 | Total goals Under 2.5 | Fixed line |
| 1696 | Total goals Over 0.5 | Fixed line |
| 1697 | Total goals Under 0.5 | Fixed line |
| 1727 | Total goals Over 1 | Fixed line |
| 1728 | Total goals Under 1 | Fixed line |
| 1730 | Total goals Over 1.5 | Fixed line |
| 1731 | Total goals Under 1.5 | Fixed line |
| 1733 | Total goals Over 2 | Fixed line |
| 1734 | Total goals Under 2 | Fixed line |
| 1736 | Total goals Over 3 | Fixed line |
| 1737 | Total goals Under 3 | Fixed line |
| 1739 | Total goals Over 3.5 | Fixed line |
| 1791 | Total goals Under 3.5 | Fixed line |
| 1809 | Total goals Over 0.5 | Alternate encoding |
| 1810 | Total goals Under 0.5 | Alternate encoding |
| 1812 | Total goals Over 1 | Alternate encoding |
| 1813 | Total goals Under 1 | Alternate encoding |
| 1815 | Total goals Over 1.5 | Alternate encoding |
| 1816 | Total goals Under 1.5 | Alternate encoding |
| 1818 | Total goals Over 2 | Alternate encoding |
| 1819 | Total goals Under 2 | Alternate encoding |
| 1821 | Total goals Over 2.5 | Alternate encoding |
| 1822 | Total goals Under 2.5 | Alternate encoding |
| 1854 | Total goals Over 0.5 | Alternate encoding |
| 1871 | Total goals Under 0.5 | Alternate encoding |
| 1873 | Total goals Over 1 | Alternate encoding |
| 1874 | Total goals Under 1 | Alternate encoding |
| 1880 | Total goals Over 1.5 | Alternate encoding |
| 1881 | Total goals Under 1.5 | Alternate encoding |
| 1883 | Total goals Over 2 | Alternate encoding |
| 1884 | Total goals Under 2 | Alternate encoding |
| 1886 | Total goals Over 2.5 | Alternate encoding |
| 1887 | Total goals Under 2.5 | Alternate encoding |
| 2820 | Specials | Special markets |
| 2821 | Specials | Special markets |

The scraper currently uses only 921/922/923. The remaining IDs are documented here for reference when extending to handicap or totals markets.

Snapshot schema

The scraper produces a flat DataFrame with one row per main match:

| Column | Type | Description |
| --- | --- | --- |
| id | int | Pari event ID |
| league | str | League/tournament name (Pari internal) |
| team1 | str | Home team name |
| team2 | str | Away team name |
| startTime | datetime | Match start time (UTC) |
| odd_home | float | Decimal odds — home win |
| odd_draw | float | Decimal odds — draw |
| odd_away | float | Decimal odds — away win |
| scraped_at | datetime[UTC] | Snapshot collection time |

Proposed storage path: data/raw/odds_snapshot.parquet (not yet materialized).
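A single snapshot row matching the schema above might be built like this; all values are invented, and the real scraper derives its rows from the filtered API response.

```python
import pandas as pd
from datetime import datetime, timezone

# One row per main match, with columns matching the snapshot schema above.
# All values are invented for illustration.
snapshot = pd.DataFrame(
    [
        {
            "id": 123456,
            "league": "England. Premier League",
            "team1": "Arsenal",
            "team2": "Chelsea",
            "startTime": datetime(2025, 8, 16, 14, 0, tzinfo=timezone.utc),
            "odd_home": 2.10,
            "odd_draw": 3.40,
            "odd_away": 3.60,
            "scraped_at": datetime.now(timezone.utc),
        }
    ]
)
# snapshot.to_parquet(...) would then write the parquet snapshot file
```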

Filtering logic

The raw API response contains sports, leagues, and events for every sport the bookmaker offers, including esports. The scraper applies the following filters in order:

  1. sports[parentId == 1] → football league IDs (sport ID 1 = Football)
  2. events[sportId ∈ league_ids, level == 1, team1 IS NOT NULL] → main matches only (level > 1 = halves, corners, etc.)
  3. Drop esports: leagues matching prefixes FC 26, eFootball, FIFA, Esport, esport, Virtual, Cyber
  4. Drop placeholder teams: team1 == "Home"
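The four filters above can be sketched in pandas. The column names on the sports/events frames are assumptions inferred from the filter expressions, not a documented schema.

```python
import pandas as pd

# Esports league-name prefixes listed in filter 3.
ESPORT_PREFIXES = ("FC 26", "eFootball", "FIFA", "Esport", "esport", "Virtual", "Cyber")

def filter_main_matches(sports: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    # 1. Football league IDs: entries whose parent is sport ID 1 (Football).
    league_ids = sports.loc[sports["parentId"] == 1, "id"]
    # 2. Main matches only: level 1 rows (level > 1 = halves, corners, ...)
    #    in a football league, with a non-null home team.
    main = events[
        events["sportId"].isin(league_ids)
        & (events["level"] == 1)
        & events["team1"].notna()
    ]
    # 3. Drop esports leagues by name prefix.
    main = main[~main["league"].map(lambda name: name.startswith(ESPORT_PREFIXES))]
    # 4. Drop placeholder teams.
    return main[main["team1"] != "Home"]
```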

Historical odds fallback (Path B)

For ROI simulation on historical matches that predate forward snapshot collection, use football-data.co.uk closing odds.

  • URL pattern: https://www.football-data.co.uk/mmz4281/{season}/{league_code}.csv
  • Fields used: Date, HomeTeam, AwayTeam, FTR, B365H, B365D, B365A (Bet365 closing odds)
  • Coverage: 40+ leagues, 1990s–present
  • Trust: Third-party published data; no live API, batch CSV download only.
  • Implementation: src/data/odds_fdco.py (fetch_league_csv, normalize_fdco, load_odds_fdco); DVC stage load_odds_fdco; leagues and seasons configured in params.yaml under odds_fdco.
  • Team name matching: WhoScored numeric team IDs resolved via data/metadata/homeTeamId.json; fuzzy join to FDCO string names implemented in src/data/odds_join.py (threshold=85, fuzzywuzzy).

This source is not Pari odds and should not be used as a direct substitute for calibration of Pari-specific margins. It is appropriate for relative ROI comparisons and historical model evaluation.
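The team-name reconciliation can be illustrated with a stdlib stand-in. The real implementation uses fuzzywuzzy with threshold=85 (scores 0–100); `difflib` is used below only to keep the sketch dependency-free, with a roughly comparable 0.85 cutoff.

```python
# Illustrative stand-in for the fuzzy join in src/data/odds_join.py: the
# real code uses fuzzywuzzy (threshold=85); stdlib difflib is used here
# so the sketch needs no extra dependency.
import difflib
from typing import Optional

def match_team(ws_name: str, fdco_names: list[str], cutoff: float = 0.85) -> Optional[str]:
    # Return the closest FDCO name above the cutoff, or None if no
    # candidate is similar enough (such rows need manual review).
    hits = difflib.get_close_matches(ws_name, fdco_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# match_team("Arsenal", ["Arsenal", "Chelsea"]) -> "Arsenal"
```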

Trust model

| Property | Value |
| --- | --- |
| Source type | Unofficial internal API, no public contract |
| Stability | Unknown — endpoint or response schema may change |
| Idempotency | Not guaranteed — odds change before kick-off; a snapshot captures a point-in-time view |
| Data quality | Validated via Pydantic (extra="forbid"); no GE suite (not a pipeline stage) |
| Use in ML pipeline | ❌ Evaluation and ROI simulation only — not an input feature |