
End-to-End Data & ML Flow

This page describes the system lifecycle from raw data to monitored predictions. Each stage includes: trigger, inputs, outputs, validation gates, failure/recovery, and idempotency notes.


Overview Diagram

```mermaid
flowchart TB
    A[Airflow Schedule] -->|HTTP trigger| B[FastAPI + Celery]
    B -->|browser session| C[Selenoid]
    C -->|scraped data| D[PostgreSQL]
    D -->|DVC export| E[MinIO — data/raw]
    E -->|GE: validate_raw| F[Preprocessing]
    F -->|GE: validate_finished/future| G[Feature Engineering]
    G -->|GE: validate_features| H[Temporal Split]
    H --> I[Train / Tune / Calibrate]
    I --> J[MLflow Registry]
    J -->|model_uri| K[FastAPI Inference]
    K -->|cache check/write| L[Redis]
    K --> M[Prometheus]
```

Stage 1 — Data Ingestion (Scraping → PostgreSQL)

| Attribute | Detail |
| --- | --- |
| Trigger | Airflow DAG on a configurable schedule |
| Input | Airflow calls `POST /scrape` on FastAPI; task enqueued to the RabbitMQ `api` queue |
| Processing | `celery-worker-api` drives a headless browser via Selenoid; scrapes WhoScored.com; normalizes records |
| Output | Canonical match records in PostgreSQL |
| Gates | None at input; GE `validate_raw` applied after export |
| Failure | Celery retry with backoff; Airflow DAG marked failed; data gap logged; no data loss for previous runs |
| Recovery | Airflow backfill once the source recovers; manual trigger via `airflow dags trigger` |
| Idempotency | Upsert logic with dedup keys in PostgreSQL; safe to replay |
| Status | ✅ Implemented |
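The upsert-based idempotency can be illustrated with a small sketch. SQLite stands in for PostgreSQL here (the `ON CONFLICT ... DO UPDATE` clause expresses the same upsert idea); the table and column names are illustrative, not the real schema.

```python
import sqlite3

def upsert_match(conn, match):
    # Dedup key: match_id. Replaying the same scrape is an update,
    # never a duplicate row -- the same idea as the PostgreSQL upsert
    # used by the real worker.
    conn.execute(
        """
        INSERT INTO matches (match_id, home, away, home_goals, away_goals)
        VALUES (:match_id, :home, :away, :home_goals, :away_goals)
        ON CONFLICT(match_id) DO UPDATE SET
            home_goals = excluded.home_goals,
            away_goals = excluded.away_goals
        """,
        match,
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, home TEXT,"
    " away TEXT, home_goals INTEGER, away_goals INTEGER)"
)
row = {"match_id": 1, "home": "A", "away": "B", "home_goals": 2, "away_goals": 1}
upsert_match(conn, row)
upsert_match(conn, row)  # safe replay: no duplicate row
n = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

Replaying a scrape run therefore converges on the same database state instead of inflating it.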

Stage 2 — Raw Export (PostgreSQL → MinIO → DVC)

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `load_data_from_sources`; triggered by `dvc repro` (manual or CI) |
| Input | PostgreSQL canonical tables |
| Output | `data/raw/*.parquet` (DVC-tracked, content-addressed in MinIO) |
| Gates | DVC output hash check (stage only re-runs if dependencies change) |
| Failure | DVC stage exits non-zero; pipeline blocked at this stage |
| Recovery | Fix DB connectivity or MinIO credentials; re-run `dvc repro load_data_from_sources` |
| Idempotency | DVC content-addressing; identical input → identical output hash |
| Status | ✅ Implemented |
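The content-addressing guarantee can be sketched in a few lines: the output address depends only on the bytes, so re-exporting an unchanged database maps to the same object and the stage becomes a cache hit. (DVC does use MD5 file hashes; the rest of this snippet is illustrative.)

```python
import hashlib

def content_address(data: bytes) -> str:
    # Identical input bytes -> identical address -> DVC cache hit,
    # so the stage (and everything downstream) is skipped.
    return hashlib.md5(data).hexdigest()

export_a = b"match_id,home,away\n1,A,B\n"
export_b = b"match_id,home,away\n1,A,B\n"   # re-export, same DB state
export_c = b"match_id,home,away\n1,A,C\n"   # DB changed

unchanged = content_address(export_a) == content_address(export_b)
changed = content_address(export_a) != content_address(export_c)
```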

Stage 3 — Raw Validation (Great Expectations)

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `validate_raw`; runs immediately after `load_data_from_sources` |
| Input | `data/raw/*.parquet` |
| Output | GE validation result (pass/fail); HTML report artifact |
| Gates | Blocking: the pipeline stops if any expectation fails |
| Failure | DVC stage failure; downstream stages cannot run |
| Recovery | Investigate the source data change; update the GE suite if schema evolution is intentional; re-run |
| Idempotency | Read-only; safe to re-run |
| Status | ✅ Implemented |
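The blocking behaviour of the gate amounts to: run the suite, then exit non-zero if any expectation fails so DVC halts the pipeline. A minimal stand-in in plain Python (not the real Great Expectations API; the expectations shown are made up for illustration):

```python
def run_gate(rows, expectations):
    """Return a GE-style result dict; the caller exits non-zero on failure."""
    results = []
    for name, check in expectations:
        ok = all(check(r) for r in rows)
        results.append({"expectation": name, "success": ok})
    return {"success": all(r["success"] for r in results), "results": results}

rows = [{"match_id": 1, "home_goals": 2}, {"match_id": 2, "home_goals": 0}]
expectations = [
    ("match_id_not_null", lambda r: r["match_id"] is not None),
    ("goals_non_negative", lambda r: r["home_goals"] >= 0),
]
result = run_gate(rows, expectations)
# In the DVC stage wrapper: sys.exit(0 if result["success"] else 1)
```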

Stage 4 — Preprocessing

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `preprocessing`; runs after `validate_raw` passes |
| Input | `data/raw/*.parquet` |
| Output | `data/interim/finished.parquet` (completed matches), `data/interim/future.parquet` (upcoming) |
| Processing | Clean records, resolve team/tournament IDs, normalize schema |
| Gates | GE `validate_finished` and `validate_future` gate downstream stages |
| Failure | DVC stage failure; logged to stderr |
| Recovery | Fix preprocessing logic; re-run `dvc repro preprocessing` |
| Idempotency | Deterministic; same input → same output |
| Status | ✅ Implemented |
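The finished/future split is essentially a partition on match status: completed matches feed training, upcoming matches feed inference. A sketch under assumed column values (the real `status` field names may differ):

```python
def partition_matches(records):
    # finished.parquet -> training path; future.parquet -> inference path.
    finished = [r for r in records if r["status"] == "finished"]
    future = [r for r in records if r["status"] == "scheduled"]
    return finished, future

records = [
    {"match_id": 1, "status": "finished"},
    {"match_id": 2, "status": "scheduled"},
    {"match_id": 3, "status": "finished"},
]
finished, future = partition_matches(records)
```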

Stage 5 — Feature Engineering

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `feature_engineering`; runs after preprocessing validation passes |
| Input | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Output | `data/features/*.parquet` |
| Processing | Rolling-window match statistics and tournament-scoped team ratings; implementation parameters in `params.yaml` |
| Gates | GE `validate_features` gates downstream stages |
| Contract | The same feature code (`src/features/`) is reused at inference time; no duplicate implementation |
| Failure | DVC stage failure |
| Recovery | Fix feature logic; re-run `dvc repro feature_engineering` |
| Idempotency | Deterministic pure functions; same input → same output |
| Status | ✅ Implemented |
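A rolling-window statistic can be sketched as below. The window size and the specific statistic (mean goals) are illustrative stand-ins for the parameters in `params.yaml`; the key property is that each match's feature uses only earlier matches, which is what lets the same code run safely at both training and inference time.

```python
from collections import defaultdict, deque

def rolling_goal_means(matches, window=5):
    """Per-team rolling mean of goals over the last `window` matches.

    Matches must be in chronological order; the feature for a match is
    computed before that match's own result enters the history.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    features = []
    for m in matches:
        past = history[m["team"]]
        features.append(sum(past) / len(past) if past else None)
        past.append(m["goals"])
    return features

matches = [
    {"team": "A", "goals": 2},
    {"team": "A", "goals": 0},
    {"team": "A", "goals": 4},
]
feats = rolling_goal_means(matches, window=5)
```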

Stage 6 — Batch Inference Feature Assembly

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `batch_inference`; can run in parallel with training |
| Input | `data/features/` + upcoming match schedule |
| Output | `data/predictions/match_features.parquet` |
| Purpose | Pre-assemble features for all known upcoming matches; used by `GET /predict/{match_id}` |
| Failure | DVC stage failure; existing parquet not updated |
| Idempotency | Deterministic |
| Status | ✅ Implemented |
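Pre-assembly turns the API's work into a plain lookup: features are joined onto each upcoming fixture ahead of time, keyed by `match_id`. The column and feature names below are hypothetical; only the lookup-table shape is the point.

```python
def assemble_features(schedule, team_features):
    # Join current team-level features onto each upcoming fixture so
    # GET /predict/{match_id} only has to read one pre-built row.
    table = {}
    for m in schedule:
        table[m["match_id"]] = {
            "home_rating": team_features[m["home"]],
            "away_rating": team_features[m["away"]],
        }
    return table

schedule = [{"match_id": 10, "home": "A", "away": "B"}]
team_features = {"A": 1520.0, "B": 1480.0}
lookup = assemble_features(schedule, team_features)
```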

Stage 7 — Temporal Split

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `split_data`; runs after feature validation |
| Input | `data/features/*.parquet` |
| Output | `data/splits/*.parquet` (walk-forward CV folds + holdout) |
| Split config | Time-based boundaries from `params.yaml`; no random shuffling |
| Leakage invariant | No future data appears in any training fold; enforced by temporal ordering |
| Failure | DVC stage failure |
| Idempotency | Deterministic given a fixed `params.yaml` |
| Status | ✅ Implemented |
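The walk-forward construction can be sketched as an expanding window over the configured time boundaries (the boundary values below are made up; the real ones live in `params.yaml`). The leakage invariant falls out directly: every validation window starts exactly where its training window ends.

```python
def walk_forward_folds(boundaries):
    """Expanding-window folds from ordered time boundaries.

    Fold i trains on [b0, b_{i+1}) and validates on [b_{i+1}, b_{i+2}),
    so validation always lies strictly after its training window.
    """
    folds = []
    for i in range(len(boundaries) - 2):
        folds.append({
            "train": (boundaries[0], boundaries[i + 1]),
            "valid": (boundaries[i + 1], boundaries[i + 2]),
        })
    return folds

folds = walk_forward_folds(["2021-07", "2022-07", "2023-07", "2024-07"])
```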

Stage 8 — Model Training (Baseline + XGBoost + Ablation + Tuning)

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stages `classification_models`, `ablation_study`, `tune_xgb`; run after `split_data` |
| Input | `data/splits/*.parquet` |
| Output | MLflow runs: params, metrics, serialized model artifacts |
| Tracking | All runs logged to MLflow; experiment name configured in `params.yaml` |
| Failure | DVC stage failure; partial MLflow runs preserved |
| Idempotency | Deterministic given seeded randomness and fixed params |
| Status | ✅ Implemented |
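The shape of an ablation study is worth spelling out: score the full feature set, then drop one group at a time and measure the regression. The group names and the toy scorer below are invented for illustration; the real stage runs the actual train+validate loop in place of `evaluate`.

```python
def ablation_study(feature_groups, evaluate):
    """Score the full set, then each leave-one-group-out variant.

    The drop in score when a group is removed estimates that
    group's contribution to the model.
    """
    baseline = evaluate(set(feature_groups))
    report = {}
    for g in feature_groups:
        report[g] = baseline - evaluate(set(feature_groups) - {g})
    return baseline, report

# Toy scorer: pretend each group adds a fixed amount of signal.
signal = {"rolling_stats": 0.05, "team_ratings": 0.08, "schedule": 0.01}
evaluate = lambda groups: 0.50 + sum(signal[g] for g in groups)
baseline, report = ablation_study(list(signal), evaluate)
```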

Stage 9 — Final Training + Calibration

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `final_train`; runs after tuning |
| Input | `data/splits/*.parquet` + best hyperparameters from `tune_xgb` |
| Output | Calibrated model artifact; MLflow run |
| Post-processing | Probability calibration applied to model outputs |
| Failure | DVC stage failure |
| Idempotency | Deterministic |
| Status | ✅ Implemented |
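The document does not specify which calibration method is used, but the property being bought is checkable: within each probability bin, the mean predicted probability should track the observed outcome rate. A minimal reliability-table sketch (bin count and inputs are illustrative):

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Bin predicted probabilities and compare with observed frequency.

    For a calibrated model, mean predicted probability is close to the
    observed outcome rate inside each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            table.append((round(mean_p, 3), round(rate, 3), len(b)))
    return table

probs = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95]
outcomes = [0, 0, 1, 0, 1, 1]
table = reliability_table(probs, outcomes)
```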

Stage 10 — Model Registration

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `register_model`; runs after `final_train` |
| Input | Final model artifact; MLflow run ID |
| Output | Registered model version in the MLflow Registry; `Staging` alias assigned |
| Contract | MLflow pyfunc signature validated at registration |
| Promotion gate | Manual: an operator reviews metrics in the MLflow UI and updates the `champion` alias |
| Limitation | No automated metric-threshold promotion policy yet |
| Planned | Automated promotion gate (see Roadmap) |
| Status | 🚧 Partially implemented (registration ✅; automated gate 📋) |
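The planned automated gate reduces to a single comparison. A sketch of what such a policy could look like; the metric name and threshold are assumptions, and today this decision is made manually in the MLflow UI:

```python
def should_promote(champion_metrics, challenger_metrics,
                   metric="log_loss", max_regression=0.0):
    # Promote only if the challenger does not regress the primary
    # metric (lower log_loss is better). A positive max_regression
    # would tolerate a small, explicitly accepted regression.
    return challenger_metrics[metric] <= champion_metrics[metric] + max_regression

promote = should_promote({"log_loss": 0.98}, {"log_loss": 0.95})
reject = should_promote({"log_loss": 0.98}, {"log_loss": 1.02})
```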

Stage 11 — Serving (FastAPI + Celery + Redis)

See Runtime View for the detailed serving paths (sync, async, batch lookup).

| Attribute | Detail |
| --- | --- |
| Model source | MLflow Registry `champion` alias; loaded lazily per worker process |
| Cache | Redis: check before inference, write after; TTL-based expiry |
| Failure modes | See Failure Modes |
| Status | ✅ Implemented |
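The "check before inference, write after, TTL-based" row is the classic cache-aside pattern. A self-contained sketch using an in-memory stand-in for Redis (the method names mirror redis-py's `get`/`setex`; key format and TTL are assumptions):

```python
import time

class TTLCache:
    """In-memory stand-in for Redis, just to show the pattern."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None
    def setex(self, key, ttl, value):
        self._store[key] = (value, time.monotonic() + ttl)

def predict_with_cache(cache, match_id, run_model, ttl=300):
    key = f"prediction:{match_id}"
    cached = cache.get(key)          # 1. check before inference
    if cached is not None:
        return cached, True
    result = run_model(match_id)     # 2. cache miss: run the model
    cache.setex(key, ttl, result)    # 3. write back with a TTL
    return result, False

cache = TTLCache()
calls = []
run_model = lambda mid: calls.append(mid) or {"home_win": 0.41}
first, hit1 = predict_with_cache(cache, 10, run_model)
second, hit2 = predict_with_cache(cache, 10, run_model)
```

The TTL bounds staleness: once a cached prediction expires, the next request falls through to the model again.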

Stage 12 — Observability (Prometheus)

| Attribute | Detail |
| --- | --- |
| Metrics | Request rate, latency, error rate, active tasks, and queue depth; scraped from FastAPI and Celery workers |
| Endpoint | `GET /metrics` |
| Dashboards | 📋 Grafana dashboards planned (see Roadmap) |
| Drift detection | 📋 Evidently offline reports planned |
| Status | ✅ Prometheus scraping; 🚧 Grafana dashboards; 📋 drift |
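What Prometheus scrapes from `GET /metrics` is plain text in the Prometheus exposition format. The real app uses a client library for this; the renderer below only sketches the wire format, and the metric names are hypothetical:

```python
def render_metrics(counters):
    # Each counter becomes a HELP line, a TYPE line, and a sample line,
    # matching the Prometheus text exposition format.
    lines = []
    for name, (help_text, value) in counters.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

counters = {
    "http_requests_total": ("Total HTTP requests.", 42),
    "celery_tasks_failed_total": ("Failed Celery tasks.", 3),
}
body = render_metrics(counters)
```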