
End-to-End Data & ML Flow

This page describes the system lifecycle from raw data to monitored predictions. Each stage includes: trigger, inputs, outputs, validation gates, failure/recovery, and idempotency notes.


Overview Diagram

```mermaid
flowchart TB
    A[Airflow Schedule] -->|HTTP trigger| B[FastAPI + Celery]
    B -->|browser session| C[Selenoid]
    C -->|scraped data| D[PostgreSQL]
    D -->|DVC export| E[MinIO — data/raw]
    E -->|GE: validate_raw| F[Preprocessing]
    F -->|GE: validate_finished/future| G[Feature Engineering]
    G -->|GE: validate_features| H[Temporal Split]
    H --> I[Train / Tune / Calibrate]
    I --> J[MLflow Registry]
    J -->|model_uri| K[FastAPI Inference]
    K -->|cache check/write| L[Redis]
    K --> M[Prometheus]
```

Stage 1 — Data Ingestion (Scraping → PostgreSQL)

| Attribute | Detail |
| --- | --- |
| Trigger | Airflow DAG on a configurable schedule |
| Input | Airflow calls `POST /scrape` on FastAPI; task enqueued to the RabbitMQ `api` queue |
| Processing | `celery-worker-api` drives a headless browser via Selenoid; scrapes WhoScored.com; normalizes records |
| Output | Canonical match records in PostgreSQL |
| Gates | None at input; GE `validate_raw` applied after export |
| Failure | Celery retry with backoff; Airflow DAG marked failed; data gap logged; no data loss for previous runs |
| Recovery | Airflow backfill once the source recovers; manual trigger via `airflow dags trigger` |
| Idempotency | Upsert logic with dedup keys in PostgreSQL; safe to replay |
| Status | ✅ Implemented |
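The upsert-based idempotency can be illustrated with a small sketch. SQLite stands in for PostgreSQL here (the `ON CONFLICT ... DO UPDATE` clause expresses the same upsert idea); the table and column names are illustrative, not the real schema.

```python
import sqlite3

def upsert_match(conn, match):
    # Dedup key: match_id. Replaying the same scrape is an update,
    # never a duplicate row -- the same idea as the PostgreSQL upsert
    # used by the real worker.
    conn.execute(
        """
        INSERT INTO matches (match_id, home, away, home_goals, away_goals)
        VALUES (:match_id, :home, :away, :home_goals, :away_goals)
        ON CONFLICT(match_id) DO UPDATE SET
            home_goals = excluded.home_goals,
            away_goals = excluded.away_goals
        """,
        match,
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, home TEXT,"
    " away TEXT, home_goals INTEGER, away_goals INTEGER)"
)
row = {"match_id": 1, "home": "A", "away": "B", "home_goals": 2, "away_goals": 1}
upsert_match(conn, row)
upsert_match(conn, row)  # safe replay: no duplicate row
n = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

Replaying a scrape run therefore converges on the same database state instead of inflating it.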

Stage 2 — Raw Export (PostgreSQL → MinIO → DVC)

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `load_data_from_sources`; triggered by `dvc repro` (manual or CI) |
| Input | PostgreSQL canonical tables |
| Output | `data/raw/*.parquet` (DVC-tracked, content-addressed in MinIO) |
| Gates | DVC output hash check (stage only re-runs if dependencies change) |
| Failure | DVC stage exits non-zero; pipeline blocked at this stage |
| Recovery | Fix DB connectivity or MinIO credentials; re-run `dvc repro load_data_from_sources` |
| Idempotency | DVC content-addressing; identical input → identical output hash |
| Status | ✅ Implemented |
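The content-addressing guarantee can be sketched in a few lines: the output address depends only on the bytes, so re-exporting an unchanged database maps to the same object and the stage becomes a cache hit. (DVC does use MD5 file hashes; the rest of this snippet is illustrative.)

```python
import hashlib

def content_address(data: bytes) -> str:
    # Identical input bytes -> identical address -> DVC cache hit,
    # so the stage (and everything downstream) is skipped.
    return hashlib.md5(data).hexdigest()

export_a = b"match_id,home,away\n1,A,B\n"
export_b = b"match_id,home,away\n1,A,B\n"   # re-export, same DB state
export_c = b"match_id,home,away\n1,A,C\n"   # DB changed

unchanged = content_address(export_a) == content_address(export_b)
changed = content_address(export_a) != content_address(export_c)
```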

Stage 3 — Raw Validation (Great Expectations)

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `validate_raw`; runs immediately after `load_data_from_sources` |
| Input | `data/raw/*.parquet` |
| Output | GE validation result (pass/fail); HTML report artifact |
| Gates | Blocking: the pipeline stops if any expectation fails |
| Failure | DVC stage failure; downstream stages cannot run |
| Recovery | Investigate the source data change; update the GE suite if schema evolution is intentional; re-run |
| Idempotency | Read-only; safe to re-run |
| Status | ✅ Implemented |
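The blocking behaviour of the gate amounts to: run the suite, then exit non-zero if any expectation fails so DVC halts the pipeline. A minimal stand-in in plain Python (not the real Great Expectations API; the expectations shown are made up for illustration):

```python
def run_gate(rows, expectations):
    """Return a GE-style result dict; the caller exits non-zero on failure."""
    results = []
    for name, check in expectations:
        ok = all(check(r) for r in rows)
        results.append({"expectation": name, "success": ok})
    return {"success": all(r["success"] for r in results), "results": results}

rows = [{"match_id": 1, "home_goals": 2}, {"match_id": 2, "home_goals": 0}]
expectations = [
    ("match_id_not_null", lambda r: r["match_id"] is not None),
    ("goals_non_negative", lambda r: r["home_goals"] >= 0),
]
result = run_gate(rows, expectations)
# In the DVC stage wrapper: sys.exit(0 if result["success"] else 1)
```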

Stage 4 — Preprocessing

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `preprocessing`; runs after `validate_raw` passes |
| Input | `data/raw/*.parquet` |
| Output | `data/interim/finished.parquet` (completed matches), `data/interim/future.parquet` (upcoming) |
| Processing | Clean records, resolve team/tournament IDs, normalize schema |
| Gates | GE `validate_finished` and `validate_future` gate downstream stages |
| Failure | DVC stage failure; logged to stderr |
| Recovery | Fix preprocessing logic; re-run `dvc repro preprocessing` |
| Idempotency | Deterministic; same input → same output |
| Status | ✅ Implemented |
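The finished/future split is essentially a partition on match status: completed matches feed training, upcoming matches feed inference. A sketch under assumed column values (the real `status` field names may differ):

```python
def partition_matches(records):
    # finished.parquet -> training path; future.parquet -> inference path.
    finished = [r for r in records if r["status"] == "finished"]
    future = [r for r in records if r["status"] == "scheduled"]
    return finished, future

records = [
    {"match_id": 1, "status": "finished"},
    {"match_id": 2, "status": "scheduled"},
    {"match_id": 3, "status": "finished"},
]
finished, future = partition_matches(records)
```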

Stage 5 — Feature Engineering

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `feature_engineering`; runs after preprocessing validation passes |
| Input | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Output | `data/features/*.parquet` |
| Processing | Rolling-window match statistics and tournament-scoped team ratings; implementation parameters in `params.yaml` |
| Gates | GE `validate_features` gates downstream stages |
| Contract | The same feature code (`src/features/`) is reused at inference time; no duplicate implementation |
| Failure | DVC stage failure |
| Recovery | Fix feature logic; re-run `dvc repro feature_engineering` |
| Idempotency | Deterministic pure functions; same input → same output |
| Status | ✅ Implemented |
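A rolling-window statistic can be sketched as below. The window size and the specific statistic (mean goals) are illustrative stand-ins for the parameters in `params.yaml`; the key property is that each match's feature uses only earlier matches, which is what lets the same code run safely at both training and inference time.

```python
from collections import defaultdict, deque

def rolling_goal_means(matches, window=5):
    """Per-team rolling mean of goals over the last `window` matches.

    Matches must be in chronological order; the feature for a match is
    computed before that match's own result enters the history.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    features = []
    for m in matches:
        past = history[m["team"]]
        features.append(sum(past) / len(past) if past else None)
        past.append(m["goals"])
    return features

matches = [
    {"team": "A", "goals": 2},
    {"team": "A", "goals": 0},
    {"team": "A", "goals": 4},
]
feats = rolling_goal_means(matches, window=5)
```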

Stage 6 — Batch Inference Feature Assembly

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `batch_inference`; can run in parallel with training |
| Input | `data/features/` + upcoming match schedule |
| Output | `data/predictions/match_features.parquet` |
| Purpose | Pre-assemble features for all known upcoming matches; used by `GET /predict/{match_id}` |
| Failure | DVC stage failure; existing parquet not updated |
| Idempotency | Deterministic |
| Status | ✅ Implemented |
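Pre-assembly turns the API's work into a plain lookup: features are joined onto each upcoming fixture ahead of time, keyed by `match_id`. The column and feature names below are hypothetical; only the lookup-table shape is the point.

```python
def assemble_features(schedule, team_features):
    # Join current team-level features onto each upcoming fixture so
    # GET /predict/{match_id} only has to read one pre-built row.
    table = {}
    for m in schedule:
        table[m["match_id"]] = {
            "home_rating": team_features[m["home"]],
            "away_rating": team_features[m["away"]],
        }
    return table

schedule = [{"match_id": 10, "home": "A", "away": "B"}]
team_features = {"A": 1520.0, "B": 1480.0}
lookup = assemble_features(schedule, team_features)
```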

Stage 7 — Temporal Split

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `split_data`; runs after feature validation |
| Input | `data/features/*.parquet` |
| Output | `data/splits/*.parquet` (walk-forward CV folds + holdout) |
| Split config | Time-based boundaries from `params.yaml`; no random shuffling |
| Leakage invariant | No future data appears in any training fold; enforced by temporal ordering |
| Failure | DVC stage failure |
| Idempotency | Deterministic given a fixed `params.yaml` |
| Status | ✅ Implemented |
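The walk-forward construction can be sketched as an expanding window over the configured time boundaries (the boundary values below are made up; the real ones live in `params.yaml`). The leakage invariant falls out directly: every validation window starts exactly where its training window ends.

```python
def walk_forward_folds(boundaries):
    """Expanding-window folds from ordered time boundaries.

    Fold i trains on [b0, b_{i+1}) and validates on [b_{i+1}, b_{i+2}),
    so validation always lies strictly after its training window.
    """
    folds = []
    for i in range(len(boundaries) - 2):
        folds.append({
            "train": (boundaries[0], boundaries[i + 1]),
            "valid": (boundaries[i + 1], boundaries[i + 2]),
        })
    return folds

folds = walk_forward_folds(["2021-07", "2022-07", "2023-07", "2024-07"])
```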

Stage 8 — Model Training (Baseline + XGBoost + Ablation + Tuning)

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stages `classification_models`, `ablation_study`, `tune_xgb`; run after `split_data` |
| Input | `data/splits/*.parquet` |
| Output | MLflow runs: params, metrics, serialized model artifacts |
| Tracking | All runs logged to MLflow; experiment name configured in `params.yaml` |
| Failure | DVC stage failure; partial MLflow runs preserved |
| Idempotency | Deterministic given seeded randomness and fixed params |
| Status | ✅ Implemented |
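The shape of an ablation study is worth spelling out: score the full feature set, then drop one group at a time and measure the regression. The group names and the toy scorer below are invented for illustration; the real stage runs the actual train+validate loop in place of `evaluate`.

```python
def ablation_study(feature_groups, evaluate):
    """Score the full set, then each leave-one-group-out variant.

    The drop in score when a group is removed estimates that
    group's contribution to the model.
    """
    baseline = evaluate(set(feature_groups))
    report = {}
    for g in feature_groups:
        report[g] = baseline - evaluate(set(feature_groups) - {g})
    return baseline, report

# Toy scorer: pretend each group adds a fixed amount of signal.
signal = {"rolling_stats": 0.05, "team_ratings": 0.08, "schedule": 0.01}
evaluate = lambda groups: 0.50 + sum(signal[g] for g in groups)
baseline, report = ablation_study(list(signal), evaluate)
```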

Stage 9 — Final Training + Calibration

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `final_train`; runs after tuning |
| Input | `data/splits/*.parquet` + best hyperparameters from `tune_xgb` |
| Output | Calibrated model artifact; MLflow run |
| Post-processing | Probability calibration applied to model outputs |
| Failure | DVC stage failure |
| Idempotency | Deterministic |
| Status | ✅ Implemented |
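The document does not specify which calibration method is used, but the property being bought is checkable: within each probability bin, the mean predicted probability should track the observed outcome rate. A minimal reliability-table sketch (bin count and inputs are illustrative):

```python
def reliability_table(probs, outcomes, n_bins=5):
    """Bin predicted probabilities and compare with observed frequency.

    For a calibrated model, mean predicted probability is close to the
    observed outcome rate inside each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    table = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            table.append((round(mean_p, 3), round(rate, 3), len(b)))
    return table

probs = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95]
outcomes = [0, 0, 1, 0, 1, 1]
table = reliability_table(probs, outcomes)
```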

Stage 10 — Model Registration

| Attribute | Detail |
| --- | --- |
| Trigger | DVC stage `register_model`; runs after `final_train` |
| Input | Final model artifact; MLflow run ID |
| Output | Registered model version in the MLflow Registry; `Staging` alias assigned |
| Contract | MLflow pyfunc signature validated at registration |
| Promotion gate | Manual: an operator reviews metrics in the MLflow UI and updates the `champion` alias |
| Limitation | No automated metric-threshold promotion policy yet |
| Planned | Automated promotion gate (see Roadmap) |
| Status | 🚧 Partially implemented (registration ✅; automated gate 📋) |
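The planned automated gate reduces to a single comparison. A sketch of what such a policy could look like; the metric name and threshold are assumptions, and today this decision is made manually in the MLflow UI:

```python
def should_promote(champion_metrics, challenger_metrics,
                   metric="log_loss", max_regression=0.0):
    # Promote only if the challenger does not regress the primary
    # metric (lower log_loss is better). A positive max_regression
    # would tolerate a small, explicitly accepted regression.
    return challenger_metrics[metric] <= champion_metrics[metric] + max_regression

promote = should_promote({"log_loss": 0.98}, {"log_loss": 0.95})
reject = should_promote({"log_loss": 0.98}, {"log_loss": 1.02})
```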

Stage 11 — Serving (FastAPI + Celery + Redis)

See Runtime View for the detailed serving paths (sync, async, batch lookup).

| Attribute | Detail |
| --- | --- |
| Model source | MLflow Registry `champion` alias; loaded lazily per worker process |
| Cache | Redis: check before inference, write after; TTL-based expiry |
| Failure modes | See Failure Modes |
| Status | ✅ Implemented |
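The "check before inference, write after, TTL-based" row is the classic cache-aside pattern. A self-contained sketch using an in-memory stand-in for Redis (the method names mirror redis-py's `get`/`setex`; key format and TTL are assumptions):

```python
import time

class TTLCache:
    """In-memory stand-in for Redis, just to show the pattern."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None
    def setex(self, key, ttl, value):
        self._store[key] = (value, time.monotonic() + ttl)

def predict_with_cache(cache, match_id, run_model, ttl=300):
    key = f"prediction:{match_id}"
    cached = cache.get(key)          # 1. check before inference
    if cached is not None:
        return cached, True
    result = run_model(match_id)     # 2. cache miss: run the model
    cache.setex(key, ttl, result)    # 3. write back with a TTL
    return result, False

cache = TTLCache()
calls = []
run_model = lambda mid: calls.append(mid) or {"home_win": 0.41}
first, hit1 = predict_with_cache(cache, 10, run_model)
second, hit2 = predict_with_cache(cache, 10, run_model)
```

The TTL bounds staleness: once a cached prediction expires, the next request falls through to the model again.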

Stage 12 — Observability (Prometheus)

| Attribute | Detail |
| --- | --- |
| Metrics | Request rate, latency, error rate, active tasks, and queue depth; scraped from FastAPI and Celery workers |
| Endpoint | `GET /metrics` |
| Dashboards | 📋 Grafana dashboards planned (see Roadmap) |
| Drift detection | 📋 Evidently offline reports planned |
| Status | ✅ Prometheus scraping; 🚧 Grafana dashboards; 📋 drift |
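What Prometheus scrapes from `GET /metrics` is plain text in the Prometheus exposition format. The real app uses a client library for this; the renderer below only sketches the wire format, and the metric names are hypothetical:

```python
def render_metrics(counters):
    # Each counter becomes a HELP line, a TYPE line, and a sample line,
    # matching the Prometheus text exposition format.
    lines = []
    for name, (help_text, value) in counters.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

counters = {
    "http_requests_total": ("Total HTTP requests.", 42),
    "celery_tasks_failed_total": ("Failed Celery tasks.", 3),
}
body = render_metrics(counters)
```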