End-to-End Data & ML Flow

This page describes the system lifecycle from raw data to monitored predictions.


1) Ingestion (Scraping → PostgreSQL)

  • Airflow schedules scraping tasks against WhoScored.com.
  • Scraped data is normalized and stored in PostgreSQL.
  • Ingestion is designed to be idempotent where possible (upserts, dedup keys).

Outputs - canonical tables in PostgreSQL (source of truth for structured scraped data)
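The idempotency mentioned above typically rests on PostgreSQL's `INSERT ... ON CONFLICT` upsert. A minimal stdlib sketch of how such a statement can be assembled (table and column names here are illustrative, not the project's actual schema):

```python
def build_upsert(table, columns, conflict_keys):
    """Build a PostgreSQL INSERT ... ON CONFLICT (upsert) statement so
    re-running the same scrape replays cleanly instead of duplicating rows."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in conflict_keys
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_keys)}) DO UPDATE SET {updates}"
    )

# Hypothetical dedup key: one row per (match_id, player_id), safe to replay.
sql = build_upsert(
    "player_match_stats",
    ["match_id", "player_id", "minutes", "rating"],
    conflict_keys=["match_id", "player_id"],
)
```

The conflict key doubles as the dedup key: a re-scraped match overwrites its own rows rather than inserting duplicates.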


2) Raw export (PostgreSQL → MinIO → DVC)

  • Airflow exports raw parquet snapshots to MinIO.
  • DVC pulls the raw data into the local/CI workspace.
  • Dataset versions are tracked by DVC (data lineage + reproducibility).

Outputs

  • data/raw (versioned)
  • metadata for lineage and dataset provenance
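The versioning above works because DVC tracks data by content hash (md5 by default), so a snapshot's version id changes exactly when its bytes do. A stdlib sketch of that idea:

```python
import hashlib

def snapshot_md5(data: bytes) -> str:
    """Content hash of a raw export, in the spirit of DVC's md5 tracking:
    identical exports hash identically, so lineage metadata can tell
    whether the underlying data actually changed between runs."""
    return hashlib.md5(data).hexdigest()

# Illustrative payloads, not real exports:
v1 = snapshot_md5(b"match_id,team,goals\n1,A,2\n")
v2 = snapshot_md5(b"match_id,team,goals\n1,A,2\n")  # re-export, unchanged
v3 = snapshot_md5(b"match_id,team,goals\n1,A,3\n")  # data changed
```

Two identical exports produce the same version id; any change in the data produces a new one, which is what makes dataset lineage reproducible.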


3) Offline ML pipeline (DVC pipeline)

  • dvc repro orchestrates:
      • preprocessing
      • feature engineering
      • splitting (leakage-safe)
      • training
      • evaluation and reporting

Gates - Great Expectations runs blocking checks on raw, processed, and feature data (some checks implemented, others planned)
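The leakage-safe split in the pipeline above usually means splitting by time rather than at random, so no future match can leak into training. A minimal sketch (field names and the cutoff are illustrative assumptions):

```python
from datetime import date

def time_split(rows, cutoff):
    """Leakage-safe split: everything before the cutoff date trains,
    everything on or after it evaluates. Unlike a random split, no
    row from the future can influence training features."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

rows = [
    {"date": date(2024, 1, 5), "y": 1},
    {"date": date(2024, 2, 10), "y": 0},
    {"date": date(2024, 3, 1), "y": 1},
]
train, test = time_split(rows, cutoff=date(2024, 3, 1))
# → 2 training rows, 1 evaluation row
```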


4) Experiment tracking and model registry (MLflow)

  • Training logs:
      • parameters (Hydra config snapshot)
      • metrics (per fold / holdout)
      • artifacts (plots, reports, feature importances)
  • Successful models are registered to the MLflow Registry with clear versioning.

Promotion - models are promoted based on explicit rules (see ML → Model Registry)
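As one example of what an explicit rule can look like (this is an illustrative sketch, not the project's actual policy; see ML → Model Registry for the real rules), promotion can be gated on a candidate beating production by a margin on a holdout metric:

```python
def should_promote(candidate, production, metric="log_loss", margin=0.01):
    """Illustrative promotion rule: promote only if the candidate improves
    the holdout metric by at least `margin` over the current production
    model. Lower is better for log_loss. First model promotes by default."""
    if production is None:
        return True
    return candidate[metric] <= production[metric] - margin

# 0.52 beats 0.55 by more than the 0.01 margin → promote
ok = should_promote({"log_loss": 0.52}, {"log_loss": 0.55})
# 0.549 is within the margin of 0.55 → hold
hold = should_promote({"log_loss": 0.549}, {"log_loss": 0.55})
```

Encoding the rule as code (rather than a human judgment call) is what makes promotion auditable and repeatable across MLflow model versions.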


5) Serving (FastAPI + optional async)

  • FastAPI loads the production model via MLflow model_uri.
  • Sync /predict supports low-latency predictions.
  • Async inference uses RabbitMQ + Celery for heavy workloads / batch jobs.
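The split between the two paths above is typically a dispatch rule at the API edge. A sketch, assuming a request-size threshold (the limit here is a made-up number, not a documented setting):

```python
def route(n_rows: int, sync_limit: int = 100) -> str:
    """Illustrative dispatch rule: small requests take the low-latency
    sync /predict path; large batches are enqueued (RabbitMQ + Celery)
    and processed by workers so they never block the API event loop."""
    return "sync" if n_rows <= sync_limit else "queue"

# Single prediction → answered inline; bulk job → queued for workers.
single = route(1)
bulk = route(5000)
```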

6) Monitoring (Prometheus/Grafana + Evidently)

  • Prometheus collects service metrics (latency, errors, throughput).
  • Grafana dashboards provide operational visibility and alerts.
  • Evidently reports track drift (data + predictions) and quality signals.
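One of the standard drift signals such reports compute is the Population Stability Index over binned predictions. A stdlib sketch (the bins and thresholds below are the usual rule of thumb, not Evidently's exact internals):

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over pre-binned proportions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drifted."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.25, 0.25, 0.25, 0.25]            # reference prediction bins
stable = psi(baseline, [0.24, 0.26, 0.25, 0.25])   # tiny shift
shifted = psi(baseline, [0.05, 0.15, 0.30, 0.50])  # mass moved to high bins
```

Comparing serving-time prediction bins against the training baseline this way turns "the model feels off" into a thresholdable, alertable number.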

Failure modes and recovery (high level)

  • Ingestion failures → retry/backoff, alerting, replay via backfill runbooks
  • Contract violations → block downstream pipeline stages
  • Model regression → rollback via MLflow registry + deployment automation
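The retry/backoff pattern for ingestion failures is usually exponential backoff with a cap and optional jitter. A sketch with illustrative parameters (Airflow has its own retry settings; this only shows the schedule shape):

```python
import random

def backoff_delays(retries=5, base=2.0, cap=60.0, jitter=False):
    """Exponential backoff schedule for retrying a failed scrape:
    2s, 4s, 8s, ... capped at `cap`, with optional full jitter so
    many failed tasks don't all retry at the same instant."""
    delays = []
    for attempt in range(retries):
        d = min(cap, base * (2 ** attempt))
        if jitter:
            d = random.uniform(0, d)
        delays.append(d)
    return delays

# deterministic schedule: [2.0, 4.0, 8.0, 16.0, 32.0]
schedule = backoff_delays()
```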