# End-to-End Data & ML Flow

This page describes the system lifecycle from raw data to monitored predictions. Each stage includes: trigger, inputs, outputs, validation gates, failure/recovery, and idempotency notes.
## Overview Diagram

```mermaid
flowchart TB
    A[Airflow Schedule] -->|HTTP trigger| B[FastAPI + Celery]
    B -->|browser session| C[Selenoid]
    C -->|scraped data| D[PostgreSQL]
    D -->|DVC export| E[MinIO — data/raw]
    E -->|GE: validate_raw| F[Preprocessing]
    F -->|GE: validate_finished/future| G[Feature Engineering]
    G -->|GE: validate_features| H[Temporal Split]
    H --> I[Train / Tune / Calibrate]
    I --> J[MLflow Registry]
    J -->|model_uri| K[FastAPI Inference]
    K -->|cache check/write| L[Redis]
    K --> M[Prometheus]
```
## Stage 1 — Data Ingestion (Scraping → PostgreSQL)

| Attribute | Detail |
|---|---|
| Trigger | Airflow DAG on a configurable schedule |
| Input | Airflow calls `POST /scrape` on FastAPI; task enqueued to the RabbitMQ `api` queue |
| Processing | `celery-worker-api` drives a headless browser via Selenoid; scrapes WhoScored.com; normalizes records |
| Output | Canonical match records in PostgreSQL |
| Gates | None at input; GE `validate_raw` applied after export |
| Failure | Celery retry with backoff; Airflow DAG marked failed; data gap logged; no data loss for previous runs |
| Recovery | Airflow backfill once the source recovers; manual trigger via `airflow dags trigger` |
| Idempotency | Upsert logic with dedup keys in PostgreSQL; safe to replay |
| Status | ✅ Implemented |
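The replay-safe upsert can be sketched as below. This is a minimal illustration of the dedup-key pattern, not the project's actual schema: an in-memory SQLite database stands in for PostgreSQL (the `ON CONFLICT` upsert syntax is the same against a Postgres unique key), and the column names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the ON CONFLICT upsert
# works the same way against a Postgres unique/primary key.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE matches (
           match_id   INTEGER PRIMARY KEY,  -- dedup key
           home_team  TEXT,
           away_team  TEXT,
           home_goals INTEGER
       )"""
)

def upsert_match(record):
    """Insert a scraped match; replaying the same record is a no-op update."""
    conn.execute(
        """INSERT INTO matches (match_id, home_team, away_team, home_goals)
           VALUES (:match_id, :home_team, :away_team, :home_goals)
           ON CONFLICT(match_id) DO UPDATE SET
               home_team  = excluded.home_team,
               away_team  = excluded.away_team,
               home_goals = excluded.home_goals""",
        record,
    )
    conn.commit()

row = {"match_id": 101, "home_team": "A", "away_team": "B", "home_goals": 2}
upsert_match(row)
upsert_match(row)  # safe replay: still exactly one row
count = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

Because the dedup key decides insert-vs-update, an Airflow backfill that re-scrapes an already-ingested period converges to the same table state.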
## Stage 2 — Raw Export (PostgreSQL → MinIO → DVC)

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `load_data_from_sources`; triggered by `dvc repro` (manual or CI) |
| Input | PostgreSQL canonical tables |
| Output | `data/raw/*.parquet` (DVC-tracked, content-addressed in MinIO) |
| Gates | DVC output hash check (stage only re-runs if deps change) |
| Failure | DVC stage exits non-zero; pipeline blocked at this stage |
| Recovery | Fix DB connectivity or MinIO credentials; re-run `dvc repro load_data_from_sources` |
| Idempotency | DVC content-addressing; identical input → identical output hash |
| Status | ✅ Implemented |
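DVC's skip-if-unchanged behavior boils down to comparing content hashes of a stage's dependencies against what it recorded last run. A stdlib sketch of that idea, with a JSON file as a stand-in for `dvc.lock` (file names are illustrative):

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_hash(path):
    """MD5 of file contents — the kind of digest DVC records for stage deps."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def needs_rerun(dep, lockfile):
    """Re-run the stage only if the dependency's content hash changed."""
    current = file_hash(dep)
    if lockfile.exists():
        recorded = json.loads(lockfile.read_text()).get(str(dep))
        if recorded == current:
            return False  # identical input -> skip, outputs already valid
    lockfile.write_text(json.dumps({str(dep): current}))
    return True

tmp = Path(tempfile.mkdtemp())
dep, lock = tmp / "raw.parquet", tmp / "stage.lock"
dep.write_bytes(b"match data v1")
first = needs_rerun(dep, lock)   # no hash recorded yet -> run
second = needs_rerun(dep, lock)  # content unchanged -> skip
dep.write_bytes(b"match data v2")
third = needs_rerun(dep, lock)   # dependency changed -> run again
```

This is why re-running `dvc repro load_data_from_sources` after fixing credentials is cheap: unchanged upstream tables produce unchanged hashes, and already-valid stages are skipped.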
## Stage 3 — Raw Validation (Great Expectations)

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `validate_raw`; runs immediately after `load_data_from_sources` |
| Input | `data/raw/*.parquet` |
| Output | GE validation result (pass/fail); HTML report artifact |
| Gates | Blocking: pipeline stops if any expectation fails |
| Failure | DVC stage failure; downstream stages cannot run |
| Recovery | Investigate source data change; update GE suite if schema evolution is intentional; re-run |
| Idempotency | Read-only; safe to re-run |
| Status | ✅ Implemented |
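The blocking gate works because a failed GE checkpoint makes the DVC stage exit non-zero. A minimal stand-in for the mechanism (the expectations shown are hypothetical, not the project's actual suite):

```python
import sys

def validate_raw(records):
    """Evaluate (name, predicate) expectations against raw records;
    return the names of any that fail — a toy stand-in for a GE suite."""
    expectations = [
        ("match_id is never null", lambda r: r.get("match_id") is not None),
        ("goals are non-negative", lambda r: r.get("home_goals", 0) >= 0),
    ]
    return [
        name for name, check in expectations
        if not all(check(r) for r in records)
    ]

records = [
    {"match_id": 1, "home_goals": 2},
    {"match_id": 2, "home_goals": 0},
]
failures = validate_raw(records)
if failures:
    # Blocking gate: a non-zero exit code stops the DVC pipeline here,
    # so no downstream stage ever sees invalid raw data.
    sys.exit(f"validation failed: {failures}")
```

Since the stage only reads `data/raw/*.parquet` and writes a report, re-running it after a fix has no side effects.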
## Stage 4 — Preprocessing

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `preprocessing`; runs after `validate_raw` passes |
| Input | `data/raw/*.parquet` |
| Output | `data/interim/finished.parquet` (completed matches), `data/interim/future.parquet` (upcoming) |
| Processing | Clean records, resolve team/tournament IDs, normalize schema |
| Gates | GE `validate_finished` and `validate_future` gate downstream |
| Failure | DVC stage failure; logged to stderr |
| Recovery | Fix preprocessing logic; re-run `dvc repro preprocessing` |
| Idempotency | Deterministic; same input → same output |
| Status | ✅ Implemented |
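The finished/future partition can be sketched as below. The rule shown (a match is "finished" when it has a recorded score before the cutoff, "future" when it is unscored and scheduled on or after it) and the field names are assumptions for illustration; the cutoff is pinned to keep the example deterministic.

```python
from datetime import date

# Hypothetical raw records after cleaning and ID resolution.
raw = [
    {"match_id": 1, "kickoff": date(2024, 8, 1), "home_goals": 2},
    {"match_id": 2, "kickoff": date(2024, 8, 8), "home_goals": None},
]
cutoff = date(2024, 8, 5)  # fixed instead of date.today() for reproducibility

# finished.parquet: completed matches with results, played before the cutoff.
finished = [m for m in raw if m["home_goals"] is not None and m["kickoff"] < cutoff]
# future.parquet: upcoming matches with no result yet.
future = [m for m in raw if m["home_goals"] is None and m["kickoff"] >= cutoff]
```

Because the partition is a pure function of the input rows and the cutoff, re-running the stage on identical input reproduces identical outputs.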
## Stage 5 — Feature Engineering

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `feature_engineering`; runs after preprocessing validation passes |
| Input | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Output | `data/features/*.parquet` |
| Processing | Rolling window match statistics and tournament-scoped team ratings; implementation parameters in `params.yaml` |
| Gates | GE `validate_features` gates downstream |
| Contract | Same feature code (`src/features/`) reused at inference time — no duplicate implementation |
| Failure | DVC stage failure |
| Recovery | Fix feature logic; re-run `dvc repro feature_engineering` |
| Idempotency | Deterministic pure functions; same input → same output |
| Status | ✅ Implemented |
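A rolling-window feature can be sketched as a pure function, which is what makes the stage deterministic and lets the same code serve training and inference. The statistic shown (mean goals over a team's last three matches) and the field names are illustrative, not the project's actual feature set:

```python
from collections import defaultdict, deque

def rolling_goal_features(matches, window=3):
    """Rolling mean of goals scored over each team's last `window` matches.

    The feature for a match is computed from matches strictly before it,
    so a match's own result never leaks into its feature row.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for m in matches:  # assumed sorted by kickoff time
        past = history[m["team"]]
        rows.append({
            "match_id": m["match_id"],
            "team": m["team"],
            "avg_goals_last_3": sum(past) / len(past) if past else None,
        })
        past.append(m["goals"])
    return rows

matches = [
    {"match_id": 1, "team": "A", "goals": 2},
    {"match_id": 2, "team": "A", "goals": 0},
    {"match_id": 3, "team": "A", "goals": 4},
    {"match_id": 4, "team": "A", "goals": 1},
]
feats = rolling_goal_features(matches)
```

Pure functions of time-ordered input are exactly what the "no duplicate implementation" contract needs: the batch pipeline and the serving path can import the same function and get identical feature values.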
## Stage 6 — Batch Inference Feature Assembly

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `batch_inference`; can run in parallel with training |
| Input | `data/features/` + upcoming match schedule |
| Output | `data/predictions/match_features.parquet` |
| Purpose | Pre-assemble features for all known upcoming matches; used by `GET /predict/{match_id}` |
| Failure | DVC stage failure; existing parquet not updated |
| Idempotency | Deterministic |
| Status | ✅ Implemented |
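The lookup path this stage enables can be sketched as below: `GET /predict/{match_id}` reads a pre-assembled feature row instead of recomputing features per request. A plain dict stands in for `data/predictions/match_features.parquet`, and the feature names are hypothetical:

```python
from typing import Optional

# Stand-in for data/predictions/match_features.parquet, keyed by match_id.
match_features = {
    101: {"home_rating": 1520.0, "away_rating": 1480.0},
    102: {"home_rating": 1490.0, "away_rating": 1510.0},
}

def features_for(match_id: int) -> Optional[dict]:
    """Return the pre-assembled feature row, or None for an unknown match
    (the API layer would translate None into a 404)."""
    return match_features.get(match_id)

row = features_for(101)      # known upcoming match -> ready-made features
missing = features_for(999)  # unknown match -> no row
```

Pre-assembly also explains the failure note: if the stage fails, the existing parquet is simply left in place, so the serving path keeps answering from the last good snapshot.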
## Stage 7 — Temporal Split

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `split_data`; runs after feature validation |
| Input | `data/features/*.parquet` |
| Output | `data/splits/*.parquet` (walk-forward CV folds + holdout) |
| Split config | Time-based boundaries from `params.yaml`; no random shuffling |
| Leakage invariant | No future data appears in any training fold; enforced by temporal ordering |
| Failure | DVC stage failure |
| Idempotency | Deterministic given fixed `params.yaml` |
| Status | ✅ Implemented |
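A walk-forward split over time-ordered data can be sketched as follows; the equal-sized-fold scheme is an illustrative simplification of boundary selection (the real boundaries come from `params.yaml`):

```python
def walk_forward_folds(rounds, n_folds=3):
    """rounds: time-ordered items (e.g. match rounds).

    Fold k trains on everything before its boundary and validates on the
    next block. Nothing is shuffled, so training data always precedes
    validation data in time.
    """
    fold_size = len(rounds) // (n_folds + 1)
    folds = []
    for k in range(1, n_folds + 1):
        boundary = k * fold_size
        folds.append((rounds[:boundary], rounds[boundary:boundary + fold_size]))
    return folds

rounds = list(range(1, 9))  # 8 match rounds in chronological order
folds = walk_forward_folds(rounds, n_folds=3)

# Leakage invariant: every training item precedes every validation item.
leak_free = all(max(tr) < min(va) for tr, va in folds)
```

The invariant check at the end is the kind of assertion that makes the "no future data in any training fold" guarantee mechanically verifiable rather than a convention.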
## Stage 8 — Model Training (Baseline + XGBoost + Ablation + Tuning)

| Attribute | Detail |
|---|---|
| Trigger | DVC stages `classification_models`, `ablation_study`, `tune_xgb`; run after `split_data` |
| Input | `data/splits/*.parquet` |
| Output | MLflow runs: params, metrics, serialized model artifacts |
| Tracking | All runs logged to MLflow; experiment name configured in `params.yaml` |
| Failure | DVC stage failure; partial MLflow runs preserved |
| Idempotency | Deterministic given seeded randomness and fixed params |
| Status | ✅ Implemented |
## Stage 9 — Final Training + Calibration

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `final_train`; runs after tuning |
| Input | `data/splits/*.parquet` + best hyperparameters from `tune_xgb` |
| Output | Calibrated model artifact; MLflow run |
| Post-processing | Probability calibration applied to model outputs |
| Failure | DVC stage failure |
| Idempotency | Deterministic |
| Status | ✅ Implemented |
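Probability calibration maps raw model scores to empirically observed outcome rates. The sketch below uses histogram binning only to illustrate the idea; it is not the calibration method `final_train` actually uses:

```python
def fit_binned_calibrator(scores, outcomes, n_bins=5):
    """Histogram-binning calibration: replace a raw score with the
    empirical outcome rate observed in that score's bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, outcomes):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append(y)
    rates = [sum(b) / len(b) if b else None for b in bins]

    def calibrate(score):
        idx = min(int(score * n_bins), n_bins - 1)
        # Fall back to the raw score for bins never seen during fitting.
        return rates[idx] if rates[idx] is not None else score

    return calibrate

# An over-confident model: it predicts ~0.9, but the event occurs 50%
# of the time in the calibration data.
scores = [0.9, 0.9, 0.9, 0.9]
outcomes = [1, 0, 1, 0]
calibrate = fit_binned_calibrator(scores, outcomes)
p = calibrate(0.95)  # mapped back down to the observed rate in the top bin
```

Calibrated outputs matter downstream because the serving layer returns these probabilities directly; a 0.6 home-win prediction should mean roughly a 60% historical win rate.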
## Stage 10 — Model Registration

| Attribute | Detail |
|---|---|
| Trigger | DVC stage `register_model`; runs after `final_train` |
| Input | Final model artifact; MLflow run ID |
| Output | Registered model version in MLflow Registry; Staging alias assigned |
| Contract | MLflow pyfunc signature validated at registration |
| Promotion gate | Manual: operator reviews metrics in MLflow UI and updates the `champion` alias |
| Limitation | No automated metric-threshold promotion policy yet |
| Planned | Automated promotion gate (see Roadmap) |
| Status | 🚧 Partially implemented (registration ✅; automated gate 📋) |
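One possible shape for the *planned* automated promotion gate is sketched below. Everything here is an assumption about a policy that does not exist yet: the metric name, the comparison direction, and the improvement margin are all illustrative.

```python
def should_promote(champion_metrics, challenger_metrics, min_gain=0.005):
    """Promote the Staging model to champion only if it beats the current
    champion on log loss (lower is better) by a minimum margin.

    Hypothetical policy sketch — the project currently promotes manually.
    """
    if champion_metrics is None:  # no champion registered yet: promote
        return True
    gain = champion_metrics["log_loss"] - challenger_metrics["log_loss"]
    return gain >= min_gain

promote = should_promote({"log_loss": 0.982}, {"log_loss": 0.970})  # clear win
hold = should_promote({"log_loss": 0.982}, {"log_loss": 0.981})     # within noise
```

A margin threshold like `min_gain` is what keeps an automated gate from flip-flopping the `champion` alias on run-to-run noise.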
## Stage 11 — Serving (FastAPI + Celery + Redis)

See Runtime View for the detailed serving paths (sync, async, batch lookup).

| Attribute | Detail |
|---|---|
| Model source | MLflow Registry `champion` alias; loaded lazily per worker process |
| Cache | Redis: check before inference; write after; TTL-based |
| Failure modes | See Failure Modes |
| Status | ✅ Implemented |
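The cache-aside pattern the table describes (check Redis before inference, write after, expire by TTL) can be sketched with a plain dict standing in for the Redis client. The TTL value, key shape, and prediction payload are illustrative:

```python
import time

CACHE = {}          # stand-in for Redis: match_id -> (value, expires_at)
TTL_SECONDS = 0.05  # illustrative; the real TTL is configuration-driven

def cached_predict(match_id, predict_fn, clock=time.monotonic):
    """Cache-aside: serve a fresh cached value if present, otherwise run
    inference and cache the result with a TTL."""
    entry = CACHE.get(match_id)
    if entry is not None and entry[1] > clock():  # hit, not yet expired
        return entry[0]
    result = predict_fn(match_id)                 # miss: run the model
    CACHE[match_id] = (result, clock() + TTL_SECONDS)
    return result

calls = []
def predict(match_id):
    calls.append(match_id)  # record each real inference call
    return {"home_win": 0.54}

cached_predict(7, predict)  # miss -> model called
cached_predict(7, predict)  # hit  -> served from cache
time.sleep(0.06)
cached_predict(7, predict)  # TTL expired -> model called again
n_model_calls = len(calls)
```

The TTL bounds staleness: a re-trained champion or refreshed batch features become visible to clients once existing entries expire, without any explicit invalidation step.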
## Stage 12 — Observability (Prometheus)

| Attribute | Detail |
|---|---|
| Metrics | Request rate, latency, error rate, active tasks, and queue depth; scraped from FastAPI and Celery workers |
| Endpoint | `GET /metrics` |
| Dashboards | 📋 Grafana dashboards planned (see Roadmap) |
| Drift detection | 📋 Evidently offline reports planned |
| Status | ✅ Prometheus scraping; 🚧 Grafana dashboards; 📋 drift |