Data Sources & Scraping
Primary source
All football match data originates from WhoScored.com — match statistics, team records, event data, and outcomes. This is the system's only upstream data source.
WhoScored is an untrusted external dependency. The system treats it as such:
- its HTML structure and data delivery are uncontrolled and subject to change without notice,
- data may be partial, delayed, or temporarily unavailable,
- there is no API contract — scraping is fragile by nature.
The scraping layer exists to acquire data from this source. It does not make data reproducible. Reproducibility starts at the raw export stage.
Scraping mechanism
Scraping is executed via a Celery task chain triggered by an Airflow DAG:
- Airflow sends `POST /scrape` to the FastAPI service on a configurable schedule.
- The FastAPI service enqueues a task to the RabbitMQ `api` queue.
- `celery-worker-api` drives a headless browser session via Selenoid to scrape WhoScored.
- Scraped records are normalized and written to PostgreSQL.
The scraping path uses browser automation (not a direct HTTP API) because WhoScored requires JavaScript rendering. Selenoid is an operator-managed external service, outside the K8s cluster.
Status: ✅ Implemented
Trust model
| Stage | Trust level | Consequence |
|---|---|---|
| WhoScored.com HTML | Untrusted | Validated by Great Expectations after raw export |
| Scraped records in PostgreSQL | Canonical but unvalidated | Validation deferred to the DVC validate_raw gate |
| Raw parquet snapshot | Validated against contract | Only data that passes validate_raw proceeds |
Downstream ML pipelines never interact directly with the scraping layer. They operate only on
materialized, DVC-versioned parquet snapshots that have passed the validate_raw gate.
Idempotency and deduplication
Scraping runs are designed to be safe to replay:
- records are upserted in PostgreSQL using natural or composite keys,
- duplicate match/event records are detected and rejected at the DB level,
- a failed or partial scrape run does not corrupt previously ingested data.
This is a safety property of ingestion. It does not affect DVC reproducibility — DVC reproducibility begins downstream at the raw export stage.
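The replay-safety property can be illustrated with a minimal sketch. It assumes a hypothetical `matches` table with a composite key of `(match_id, team_id)`; the real table and column names are not given here. An in-memory SQLite database stands in for PostgreSQL, since both accept the same `INSERT ... ON CONFLICT ... DO UPDATE` form.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE matches (
        match_id INTEGER NOT NULL,
        team_id  INTEGER NOT NULL,
        goals    INTEGER,
        PRIMARY KEY (match_id, team_id)  -- composite key used for deduplication
    )
    """
)

UPSERT = """
INSERT INTO matches (match_id, team_id, goals)
VALUES (?, ?, ?)
ON CONFLICT (match_id, team_id) DO UPDATE SET goals = excluded.goals
"""


def ingest(rows):
    """Write scraped rows in one transaction; a failure rolls the whole batch back."""
    with conn:  # commits on success, rolls back on exception
        conn.executemany(UPSERT, rows)


run = [(1001, 10, 2), (1001, 20, 1)]
ingest(run)
ingest(run)  # replaying the same scrape run adds no duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

Because each batch is written in a single transaction, a run that fails midway leaves previously ingested rows untouched.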
What is materialized into canonical data
Every successfully ingested scrape run adds records to PostgreSQL canonical tables. What is not stored:
- raw HTML — not archived (no HTML fallback in MinIO),
- intermediate normalized form before DB write — ephemeral in the Celery worker.
The PostgreSQL record is the only durable output of a scraping run.
Current limitations
| Limitation | Status |
|---|---|
| Retry with backoff on Celery task failure | ✅ Implemented |
| Airflow DAG failure visibility | ✅ Airflow UI |
| Freshness monitoring / alerts | 📋 Planned — alerting rules not yet deployed in Alertmanager |
| Cached HTML fallback | ❌ Not implemented — raw HTML is not archived |
| Scrape volume anomaly detection | 📋 Planned |
Related
- ETL & Ingestion Boundary — what happens after scraping
- Architecture: Data & ML Flow — Stage 1
- Architecture: System Boundary — Selenoid trust zone
- Architecture: Failure Modes — scraper failure scenarios
Secondary source: Bookmaker odds (Pari)
Status: ✅ Implemented
- Production module: `src/data/odds_pari.py`
- Daily Airflow DAG: `airflow/dags/etl_odds_01.py` (schedule `@daily`, 00:05 UTC)
- Snapshot output: `data/raw/odds_pari/date=YYYY-MM-DD/snapshot.parquet`
- Historical fallback (Path B): `src/data/odds_fdco.py` and DVC stage `load_odds_fdco`, which writes `data/raw/odds_fdco.parquet`
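As a small illustration of the date-partitioned snapshot layout, the path for a given day could be built as follows. The helper name `snapshot_path` is hypothetical and is not taken from `src/data/odds_pari.py`.

```python
from datetime import date
from pathlib import PurePosixPath


def snapshot_path(day: date, root: str = "data/raw/odds_pari") -> PurePosixPath:
    """Return the date-partitioned snapshot path for one daily run (hypothetical helper)."""
    return PurePosixPath(root) / f"date={day.isoformat()}" / "snapshot.parquet"


path = snapshot_path(date(2024, 3, 9))
```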
Purpose: External evaluation only — bookmaker implied probabilities serve as the Tier 4 benchmark (see Baseline & Benchmarks). Odds are not used as model input features. Use cases: ROI simulation on the held-out test set, calibration reference, model sanity check.
API
Unofficial internal API of the Pari bookmaker (Russian market, scopeMarket=1600). No public contract; endpoint stability is not guaranteed.
- Endpoint: `GET https://line-lb61-w.bk6bba-resources.com/ma/events/list`
- Params: `lang=en`, `version=<cursor>`, `scopeMarket=1600`
- Delta-sync: the `version` field in the response is the cursor for the next incremental request; `version=1` returns the full snapshot.
- Pydantic validation: `src/app/config/validate_bets.PariEvents` (`extra="forbid"` on all models).
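The delta-sync cursor described above might be handled along these lines. This is a sketch, not the production client: `build_url` and `sync_once` are hypothetical names, and the HTTP call is injected as a `fetch` function so the example runs without network access.

```python
from typing import Callable
from urllib.parse import urlencode

BASE_URL = "https://line-lb61-w.bk6bba-resources.com/ma/events/list"


def build_url(version: int) -> str:
    """Build the request URL for a cursor; version=1 asks for the full snapshot."""
    query = urlencode({"lang": "en", "version": version, "scopeMarket": 1600})
    return f"{BASE_URL}?{query}"


def sync_once(fetch: Callable[[str], dict], cursor: int = 1) -> tuple[dict, int]:
    """Fetch one response and return it together with the cursor for the next request."""
    payload = fetch(build_url(cursor))
    return payload, int(payload["version"])


def fake_fetch(url: str) -> dict:
    """Stand-in for an HTTP GET that returns parsed JSON (illustrative payload only)."""
    return {"version": 42, "events": []}


payload, next_cursor = sync_once(fake_fetch)       # full snapshot
_, later_cursor = sync_once(fake_fetch, next_cursor)  # incremental request
```

Persisting `next_cursor` between runs is what makes the following request incremental; starting again from `version=1` forces a full snapshot.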
Market IDs (1X2)
| Factor ID | Market |
|---|---|
| 921 | P1 — home win |
| 922 | X — draw |
| 923 | P2 — away win |
The ID mapping was verified by checking the bookmaker margin (vig): the observed mean sum of inverse odds across the three outcomes is ~1.087, consistent with the expected 1.05–1.10 range for this market.
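The margin check can be reproduced with a short sketch, assuming the quoted figure is the sum of inverse decimal odds over the three 1X2 outcomes. The function names and the sample odds are illustrative and are not taken from the API or the scraper.

```python
def overround(odd_home: float, odd_draw: float, odd_away: float) -> float:
    """Sum of implied probabilities; the excess over 1.0 is the bookmaker margin."""
    return 1.0 / odd_home + 1.0 / odd_draw + 1.0 / odd_away


def implied_probabilities(odd_home: float, odd_draw: float, odd_away: float) -> tuple:
    """Proportionally rescale the raw implied probabilities so they sum to 1."""
    total = overround(odd_home, odd_draw, odd_away)
    return tuple((1.0 / odd) / total for odd in (odd_home, odd_draw, odd_away))


# Illustrative odds only
margin = overround(2.10, 3.30, 3.40)            # about 1.073
probs = implied_probabilities(2.10, 3.30, 3.40)
```

If the factor IDs were mapped to the wrong outcomes, the sum would typically fall well outside the 1.05–1.10 band, which is why it works as a sanity check.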
Available markets (factor ID reference)
Each entry in customFactors[].factors has four fields:
| Field | Description |
|---|---|
| `f` | Factor ID (market type identifier) |
| `v` | Decimal odds value |
| `p` | Numeric parameter (e.g. handicap line, total line) |
| `pt` | String parameter label (e.g. "-1.5", "2.5") |
Markets without p/pt are unconditional (1X2, double chance, both teams score). Markets with p/pt are parameterised (handicap, totals).
Full factor ID reference observed in the API:
| Factor ID | Market | Description |
|---|---|---|
| 921 | 1X2: P1 | Home win |
| 922 | 1X2: X | Draw |
| 923 | 1X2: P2 | Away win |
| 924 | Double chance: 1X | Home win or draw |
| 925 | Double chance: 2X | Away win or draw |
| 1571 | Double chance: 12 | Home win or away win (no draw) |
| 4241 | Both teams score: Yes | Both teams score |
| 4242 | Both teams score: No | At least one team does not score |
| 910 | Handicap | Parameterised (e.g. -1.5, +1.5) |
| 912 | Handicap | Parameterised (e.g. +1.5) |
| 989 | Handicap | Parameterised (-1) |
| 991 | Handicap | Parameterised (+1) |
| 1569 | Handicap | Parameterised (+1) |
| 1572 | Handicap | Parameterised (-1) |
| 1672 | Asian handicap | Parameterised (+1.5) |
| 1675 | Asian handicap | Parameterised (-1.5) |
| 927 | Total goals Over | Parameterised (line in pt) |
| 928 | Total goals Under | Parameterised (line in pt) |
| 930 | Total goals Over 2.5 | Fixed line |
| 931 | Total goals Under 2.5 | Fixed line |
| 1696 | Total goals Over 0.5 | Fixed line |
| 1697 | Total goals Under 0.5 | Fixed line |
| 1727 | Total goals Over 1 | Fixed line |
| 1728 | Total goals Under 1 | Fixed line |
| 1730 | Total goals Over 1.5 | Fixed line |
| 1731 | Total goals Under 1.5 | Fixed line |
| 1733 | Total goals Over 2 | Fixed line |
| 1734 | Total goals Under 2 | Fixed line |
| 1736 | Total goals Over 3 | Fixed line |
| 1737 | Total goals Under 3 | Fixed line |
| 1739 | Total goals Over 3.5 | Fixed line |
| 1791 | Total goals Under 3.5 | Fixed line |
| 1809 | Total goals Over 0.5 | Alternate encoding |
| 1810 | Total goals Under 0.5 | Alternate encoding |
| 1812 | Total goals Over 1 | Alternate encoding |
| 1813 | Total goals Under 1 | Alternate encoding |
| 1815 | Total goals Over 1.5 | Alternate encoding |
| 1816 | Total goals Under 1.5 | Alternate encoding |
| 1818 | Total goals Over 2 | Alternate encoding |
| 1819 | Total goals Under 2 | Alternate encoding |
| 1821 | Total goals Over 2.5 | Alternate encoding |
| 1822 | Total goals Under 2.5 | Alternate encoding |
| 1854 | Total goals Over 0.5 | Alternate encoding |
| 1871 | Total goals Under 0.5 | Alternate encoding |
| 1873 | Total goals Over 1 | Alternate encoding |
| 1874 | Total goals Under 1 | Alternate encoding |
| 1880 | Total goals Over 1.5 | Alternate encoding |
| 1881 | Total goals Under 1.5 | Alternate encoding |
| 1883 | Total goals Over 2 | Alternate encoding |
| 1884 | Total goals Under 2 | Alternate encoding |
| 1886 | Total goals Over 2.5 | Alternate encoding |
| 1887 | Total goals Under 2.5 | Alternate encoding |
| 2820 | Specials | Special markets |
| 2821 | Specials | Special markets |
The scraper currently uses only 921/922/923. The remaining IDs are documented here for reference when extending to handicap or totals markets.
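Selecting the 1X2 odds from one event's factor list could look like the sketch below. The function name, the column mapping, and the choice to return `None` for an incompletely priced match are assumptions for illustration, not the scraper's confirmed behaviour.

```python
from typing import Optional

# Factor IDs from the 1X2 table, mapped to the snapshot column names
FACTOR_TO_COLUMN = {921: "odd_home", 922: "odd_draw", 923: "odd_away"}


def extract_1x2(factors: list[dict]) -> Optional[dict]:
    """Pick the 1X2 odds out of one event's factor list.

    Returns None when any of the three outcomes is missing, so callers
    never receive a partially priced match.
    """
    odds = {
        FACTOR_TO_COLUMN[factor["f"]]: float(factor["v"])
        for factor in factors
        if factor.get("f") in FACTOR_TO_COLUMN
    }
    return odds if len(odds) == 3 else None


sample = [
    {"f": 921, "v": 2.10},
    {"f": 922, "v": 3.30},
    {"f": 923, "v": 3.40},
    # Parameterised total, ignored here; the p/pt values are illustrative only
    {"f": 927, "v": 1.85, "p": 2.5, "pt": "2.5"},
]
row = extract_1x2(sample)
```

Extending to totals or handicaps would mean keying on `(f, pt)` rather than `f` alone, since parameterised markets share a factor ID across lines.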
Snapshot schema
The scraper produces a flat DataFrame with one row per main match:
| Column | Type | Description |
|---|---|---|
| `id` | int | Pari event ID |
| `league` | str | League/tournament name (Pari internal) |
| `team1` | str | Home team name |
| `team2` | str | Away team name |
| `startTime` | datetime | Match start time (UTC) |
| `odd_home` | float | Decimal odds (home win) |
| `odd_draw` | float | Decimal odds (draw) |
| `odd_away` | float | Decimal odds (away win) |
| `scraped_at` | datetime[UTC] | Snapshot collection time |
Proposed storage path: data/raw/odds_snapshot.parquet (not yet materialized).
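One way to pin the row shape in code is a small dataclass that mirrors the columns listed in the schema. The class name `OddsRow` and the sample values are hypothetical; the production module may represent rows differently.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OddsRow:
    """One snapshot row per main match; field names follow the schema table."""
    id: int
    league: str
    team1: str
    team2: str
    startTime: datetime   # match start, UTC
    odd_home: float
    odd_draw: float
    odd_away: float
    scraped_at: datetime  # snapshot collection time, timezone-aware UTC


row = OddsRow(
    id=1,
    league="Example League",
    team1="Team A",
    team2="Team B",
    startTime=datetime(2024, 3, 9, 15, 0, tzinfo=timezone.utc),
    odd_home=2.10,
    odd_draw=3.30,
    odd_away=3.40,
    scraped_at=datetime.now(timezone.utc),
)
record = asdict(row)  # plain dict, ready to collect into a DataFrame
```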
Filtering logic
The raw API response contains sports, leagues, and events for all sports plus esports. The scraper applies the following filters in order:
1. `sports[parentId == 1]` → football league IDs (sport ID 1 = Football)
2. `events[sportId ∈ league_ids, level == 1, team1 IS NOT NULL]` → main matches only (`level > 1` = halves, corners, etc.)
3. Drop esports: leagues matching prefixes `FC 26`, `eFootball`, `FIFA`, `Esport`, `esport`, `Virtual`, `Cyber`
4. Drop placeholder teams: `team1 == "Home"`
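The four filters can be sketched in plain Python over a parsed response. The `id` and `name` keys on `sports` entries are assumptions about the payload shape, and the sample payload is invented for illustration.

```python
ESPORT_PREFIXES = ("FC 26", "eFootball", "FIFA", "Esport", "esport", "Virtual", "Cyber")


def filter_main_matches(payload: dict) -> list[dict]:
    """Apply the four filters in order and return main football matches."""
    # 1. Football leagues are the sports entries whose parent is sport ID 1.
    leagues = {
        s["id"]: s.get("name", "")
        for s in payload["sports"]
        if s.get("parentId") == 1
    }
    # 2. Keep top-level events in a football league that have a home team.
    events = [
        e for e in payload["events"]
        if e.get("sportId") in leagues
        and e.get("level") == 1
        and e.get("team1") is not None
    ]
    # 3. Drop esports leagues by name prefix.
    events = [e for e in events if not leagues[e["sportId"]].startswith(ESPORT_PREFIXES)]
    # 4. Drop placeholder fixtures.
    return [e for e in events if e["team1"] != "Home"]


payload = {
    "sports": [
        {"id": 1, "name": "Football"},
        {"id": 100, "parentId": 1, "name": "Example League"},
        {"id": 101, "parentId": 1, "name": "Cyber Example Cup"},
        {"id": 200, "parentId": 2, "name": "Other Sport League"},
    ],
    "events": [
        {"id": 1, "sportId": 100, "level": 1, "team1": "Team A", "team2": "Team B"},
        {"id": 2, "sportId": 100, "level": 2, "team1": "Team A", "team2": "Team B"},
        {"id": 3, "sportId": 101, "level": 1, "team1": "Team C", "team2": "Team D"},
        {"id": 4, "sportId": 200, "level": 1, "team1": "Team E", "team2": "Team F"},
        {"id": 5, "sportId": 100, "level": 1, "team1": "Home", "team2": "Away"},
        {"id": 6, "sportId": 100, "level": 1, "team2": "Team G"},
    ],
}
kept = filter_main_matches(payload)
```

Only event 1 survives: the others are a sub-event, an esports fixture, a non-football event, a placeholder, and an event without a home team.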
Historical odds fallback (Path B)
For ROI simulation on historical data before forward collection is running, use football-data.co.uk closing odds.
- URL pattern: `https://www.football-data.co.uk/mmz4281/{season}/{league_code}.csv`
- Fields used: `Date`, `HomeTeam`, `AwayTeam`, `FTR`, `B365H`, `B365D`, `B365A` (Bet365 closing odds)
- Coverage: 40+ leagues, 1990s–present
- Trust: Third-party published data; no live API, batch CSV download only.
- Implementation: `src/data/odds_fdco.py` (`fetch_league_csv`, `normalize_fdco`, `load_odds_fdco`); DVC stage `load_odds_fdco`; leagues and seasons configured in `params.yaml` under `odds_fdco`.
- Team name matching: WhoScored numeric team IDs resolved via `data/metadata/homeTeamId.json`; fuzzy join to FDCO string names implemented in `src/data/odds_join.py` (threshold=85, fuzzywuzzy).
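The URL pattern and field selection above can be exercised with a short standard-library sketch. The helpers `league_url` and `parse_rows` are hypothetical and are not the functions in `src/data/odds_fdco.py`; the sample CSV, the season code `2324`, the league code `E0`, and the day-first date handling are assumptions for illustration.

```python
import csv
import io
from datetime import datetime

URL_PATTERN = "https://www.football-data.co.uk/mmz4281/{season}/{league_code}.csv"
FIELDS = ("Date", "HomeTeam", "AwayTeam", "FTR", "B365H", "B365D", "B365A")


def league_url(season: str, league_code: str) -> str:
    """Fill the URL pattern for one season and league."""
    return URL_PATTERN.format(season=season, league_code=league_code)


def parse_rows(csv_text: str) -> list[dict]:
    """Keep only the fields used and convert types; rows missing any field are skipped."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        if not all(raw.get(name) for name in FIELDS):
            continue
        row = {name: raw[name] for name in FIELDS}
        # Assumption: dates are day-first with a two- or four-digit year.
        fmt = "%d/%m/%Y" if len(row["Date"]) == 10 else "%d/%m/%y"
        row["Date"] = datetime.strptime(row["Date"], fmt).date()
        for name in ("B365H", "B365D", "B365A"):
            row[name] = float(row[name])
        rows.append(row)
    return rows


sample = (
    "Date,HomeTeam,AwayTeam,FTR,B365H,B365D,B365A\n"
    "09/03/2024,Team A,Team B,H,2.10,3.30,3.40\n"
    "10/03/2024,Team C,Team D,D,,,\n"  # odds missing, so this row is skipped
)
url = league_url("2324", "E0")
rows = parse_rows(sample)
```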
This source is not Pari odds and should not be used as a direct substitute for calibration of Pari-specific margins. It is appropriate for relative ROI comparisons and historical model evaluation.
Trust model
| Property | Value |
|---|---|
| Source type | Unofficial internal API, no public contract |
| Stability | Unknown — endpoint or response schema may change |
| Idempotency | Not guaranteed — odds change before kick-off; snapshot captures a point-in-time view |
| Data quality | Validated via Pydantic (extra="forbid"); no GE suite (not a pipeline stage) |
| Use in ML pipeline | ❌ Evaluation and ROI simulation only — not an input feature |