# End-to-End Data & ML Flow
This page describes the system lifecycle from raw data to monitored predictions.
## 1) Ingestion (Scraping → PostgreSQL)
- Airflow schedules scraping tasks against WhoScored.com.
- Scraped data is normalized and stored in PostgreSQL.
- Ingestion is designed to be idempotent where possible (upserts, dedup keys).
Outputs

- canonical tables in PostgreSQL (the source of truth for structured scraped data)
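Idempotent ingestion typically comes down to upserts keyed on a dedup key. A minimal sketch of building such a statement (table and column names here are illustrative, not the project's actual schema):

```python
# Sketch of an idempotent upsert for scraped rows (hypothetical
# table/column names; the real schema lives in PostgreSQL).
def build_upsert(table: str, columns: list[str], conflict_keys: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT DO UPDATE statement so that
    re-running a scrape replays the same rows without duplicates."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in conflict_keys
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_keys)}) DO UPDATE SET {updates}"
    )

sql = build_upsert(
    "matches",
    ["match_id", "home_team", "away_team", "score"],
    conflict_keys=["match_id"],
)
```

Re-running the same Airflow task then converges to the same table state instead of accumulating duplicate rows.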
## 2) Raw export (PostgreSQL → MinIO → DVC)
- Airflow exports raw parquet snapshots to MinIO.
- DVC pulls the raw data into the local/CI workspace.
- Dataset versions are tracked by DVC (data lineage + reproducibility).
Outputs
- data/raw (versioned)
- metadata for lineage and dataset provenance
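Provenance metadata can be as simple as content-hashing each snapshot at export time. A sketch with illustrative field names (not the project's actual metadata schema):

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical lineage record written next to each raw parquet snapshot;
# field names are illustrative, not the project's actual schema.
def snapshot_metadata(payload: bytes, source_table: str) -> dict:
    """Content-hash a snapshot so DVC-tracked data can be traced back
    to the exact export that produced it."""
    return {
        "source_table": source_table,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "exported_at": datetime.now(timezone.utc).isoformat(),
    }

meta = snapshot_metadata(b"fake-parquet-bytes", "matches")
print(json.dumps(meta, indent=2))
```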
## 3) Offline ML pipeline (DVC pipeline)
`dvc repro` orchestrates:

- preprocessing
- feature engineering
- splitting (leakage-safe)
- training
- evaluation and reporting
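The stages above map naturally onto a `dvc.yaml` pipeline. A simplified sketch (stage, script, and path names are illustrative, not the repository's actual layout):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps: [data/raw]
    outs: [data/processed]
  features:
    cmd: python src/features.py
    deps: [data/processed]
    outs: [data/features]
  train:
    cmd: python src/train.py
    deps: [data/features]
    outs: [models/model.pkl]
  evaluate:
    cmd: python src/evaluate.py
    deps: [models/model.pkl, data/features]
    metrics: [reports/metrics.json]
```

Because each stage declares its `deps` and `outs`, `dvc repro` re-runs only the stages whose inputs changed.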
Gates

- Great Expectations (blocking checks) on raw/processed/features (planned/implemented)
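A blocking gate just means any failed check raises before downstream stages run. A minimal stand-in for the idea (this is not the Great Expectations API, only the shape of the control flow):

```python
# Minimal stand-in for a blocking data-quality gate (not the Great
# Expectations API): each check returns True/False, and any failure
# stops the pipeline before downstream stages run.
def run_gate(rows: list[dict], checks: list) -> None:
    failures = [check.__name__ for check in checks if not check(rows)]
    if failures:
        raise ValueError(f"Blocking checks failed: {failures}")

def no_null_match_id(rows):
    return all(r.get("match_id") is not None for r in rows)

def non_empty(rows):
    return len(rows) > 0

rows = [{"match_id": 1}, {"match_id": 2}]
run_gate(rows, [no_null_match_id, non_empty])  # passes silently
```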
## 4) Experiment tracking and model registry (MLflow)
- Training logs:
    - parameters (Hydra config snapshot)
    - metrics (per fold / holdout)
    - artifacts (plots, reports, feature importances)
- Successful models are registered in the MLflow Model Registry with explicit versioning.
Promotion

- models are promoted based on explicit rules (see ML → Model Registry)
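An explicit promotion rule can be a pure function over logged metrics. A hypothetical example (metric name and margin are illustrative; the actual policy is documented in ML → Model Registry):

```python
# Hypothetical promotion rule: promote a candidate only if it beats the
# current production model by a margin on the holdout metric. The metric
# name and 0.01 margin are illustrative, not the project's actual policy.
def should_promote(candidate: dict, production: dict,
                   metric: str = "holdout_f1", min_gain: float = 0.01) -> bool:
    return candidate[metric] >= production[metric] + min_gain

prod = {"holdout_f1": 0.71}
cand = {"holdout_f1": 0.74}
should_promote(cand, prod)  # True: gain of 0.03 exceeds the 0.01 margin
```

Keeping the rule explicit (rather than promoting by hand) makes registry transitions auditable and easy to automate in CI.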
## 5) Serving (FastAPI + optional async)
- FastAPI loads the production model via an MLflow `model_uri`.
- Sync `/predict` supports low-latency predictions.
- Async inference uses RabbitMQ + Celery for heavy workloads and batch jobs.
## 6) Monitoring (Prometheus/Grafana + Evidently)
- Prometheus collects service metrics (latency, errors, throughput).
- Grafana dashboards provide operational visibility and alerts.
- Evidently reports track drift (data + predictions) and quality signals.
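Drift signals of the kind Evidently reports can be illustrated with a population stability index (PSI) on a single feature. A minimal sketch (not Evidently's API; the four bins and the common 0.2 alert threshold are conventional choices, not project settings):

```python
import math

# Minimal population-stability-index (PSI) sketch for one feature.
# Evidently computes much richer drift reports; this only shows the
# shape of the signal.
def psi(expected: list[float], actual: list[float], bins: int = 4) -> float:
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
psi(reference, reference)  # identical distributions: no drift
```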
## Failure modes and recovery (high level)
- Ingestion failures → retry/backoff, alerting, replay via backfill runbooks
- Contract violations → block downstream pipeline stages
- Model regression → rollback via the MLflow registry + deployment automation
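The retry/backoff behavior above is typically exponential backoff with jitter. An illustrative schedule (the base, cap, and jitter strategy are examples, not the project's Airflow retry config):

```python
import random

# Illustrative retry schedule for ingestion failures: exponential
# backoff with "full jitter", capped. Parameters are examples, not the
# project's actual Airflow retry settings.
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))  # jitter spreads retries out
    return delays

backoff_delays(5)  # five waits, bounded by 1s, 2s, 4s, 8s, 16s
```

Jitter matters because many scraping tasks failing at once (e.g. a site outage) would otherwise all retry in lockstep and hammer the source again.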