# End-to-End Data & ML Flow
This page describes the system lifecycle from raw data to monitored predictions.
## 1) Ingestion (Scraping → PostgreSQL)
- Airflow schedules scraping tasks against WhoScored.com.
- Scraped data is normalized and stored in PostgreSQL.
- Ingestion is designed to be idempotent where possible (upserts, dedup keys).
Outputs

- canonical tables in PostgreSQL (the source of truth for structured scraped data)
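Idempotent ingestion typically comes down to upserts keyed on a dedup key. A minimal sketch of building such a statement (table and column names here are illustrative, not the project's actual schema):

```python
# Sketch of an idempotent upsert for scraped rows (hypothetical
# table/column names; the real schema lives in PostgreSQL).
def build_upsert(table: str, columns: list[str], conflict_keys: list[str]) -> str:
    """Build an INSERT ... ON CONFLICT DO UPDATE statement so that
    re-running a scrape replays the same rows without duplicates."""
    cols = ", ".join(columns)
    placeholders = ", ".join(f"%({c})s" for c in columns)
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c not in conflict_keys
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_keys)}) DO UPDATE SET {updates}"
    )

sql = build_upsert(
    "matches",
    ["match_id", "home_team", "away_team", "score"],
    conflict_keys=["match_id"],
)
```

Re-running the same Airflow task then converges to the same table state instead of accumulating duplicate rows.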
## 2) Raw export (PostgreSQL → MinIO → DVC)
- Airflow exports raw parquet snapshots to MinIO.
- DVC pulls the raw data into the local/CI workspace.
- Dataset versions are tracked by DVC (data lineage + reproducibility).
Outputs
- data/raw (versioned)
- metadata for lineage and dataset provenance
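Provenance metadata can be as simple as content-hashing each snapshot at export time. A sketch with illustrative field names (not the project's actual metadata schema):

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical lineage record written next to each raw parquet snapshot;
# field names are illustrative, not the project's actual schema.
def snapshot_metadata(payload: bytes, source_table: str) -> dict:
    """Content-hash a snapshot so DVC-tracked data can be traced back
    to the exact export that produced it."""
    return {
        "source_table": source_table,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "exported_at": datetime.now(timezone.utc).isoformat(),
    }

meta = snapshot_metadata(b"fake-parquet-bytes", "matches")
print(json.dumps(meta, indent=2))
```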
## 3) Offline ML pipeline (DVC pipeline)
`dvc repro` orchestrates:

- preprocessing
- feature engineering
- splitting (leakage-safe)
- training
- evaluation and reporting
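The stages above map naturally onto a `dvc.yaml` pipeline. A simplified sketch (stage, script, and path names are illustrative, not the repository's actual layout):

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps: [data/raw]
    outs: [data/processed]
  features:
    cmd: python src/features.py
    deps: [data/processed]
    outs: [data/features]
  train:
    cmd: python src/train.py
    deps: [data/features]
    outs: [models/model.pkl]
  evaluate:
    cmd: python src/evaluate.py
    deps: [models/model.pkl, data/features]
    metrics: [reports/metrics.json]
```

Because each stage declares its `deps` and `outs`, `dvc repro` re-runs only the stages whose inputs changed.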
Gates

- Great Expectations (blocking checks) on raw/processed/features (planned/implemented)
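A blocking gate just means any failed check raises before downstream stages run. A minimal stand-in for the idea (this is not the Great Expectations API, only the shape of the control flow):

```python
# Minimal stand-in for a blocking data-quality gate (not the Great
# Expectations API): each check returns True/False, and any failure
# stops the pipeline before downstream stages run.
def run_gate(rows: list[dict], checks: list) -> None:
    failures = [check.__name__ for check in checks if not check(rows)]
    if failures:
        raise ValueError(f"Blocking checks failed: {failures}")

def no_null_match_id(rows):
    return all(r.get("match_id") is not None for r in rows)

def non_empty(rows):
    return len(rows) > 0

rows = [{"match_id": 1}, {"match_id": 2}]
run_gate(rows, [no_null_match_id, non_empty])  # passes silently
```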
## 4) Experiment tracking and model registry (MLflow)
- Training logs:
    - parameters (Hydra config snapshot)
    - metrics (per fold / holdout)
    - artifacts (plots, reports, feature importances)
- Successful models are registered in the MLflow Model Registry with explicit versioning.
Promotion

- models are promoted based on explicit rules (see ML → Model Registry)
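An explicit promotion rule can be a pure function over logged metrics. A hypothetical example (metric name and margin are illustrative; the actual policy is documented in ML → Model Registry):

```python
# Hypothetical promotion rule: promote a candidate only if it beats the
# current production model by a margin on the holdout metric. The metric
# name and 0.01 margin are illustrative, not the project's actual policy.
def should_promote(candidate: dict, production: dict,
                   metric: str = "holdout_f1", min_gain: float = 0.01) -> bool:
    return candidate[metric] >= production[metric] + min_gain

prod = {"holdout_f1": 0.71}
cand = {"holdout_f1": 0.74}
should_promote(cand, prod)  # True: gain of 0.03 exceeds the 0.01 margin
```

Keeping the rule explicit (rather than promoting by hand) makes registry transitions auditable and easy to automate in CI.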
## 5) Serving (FastAPI + optional async)
- FastAPI loads the production model via an MLflow `model_uri`.
- Sync `/predict` supports low-latency predictions.
- Async inference uses RabbitMQ + Celery for heavy workloads and batch jobs.
## 6) Monitoring (Prometheus/Grafana + Evidently)
- Prometheus collects service metrics (latency, errors, throughput).
- Grafana dashboards provide operational visibility and alerts.
- Evidently reports track drift (data + predictions) and quality signals.
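Drift signals of the kind Evidently reports can be illustrated with a population stability index (PSI) on a single feature. A minimal sketch (not Evidently's API; the four bins and the common 0.2 alert threshold are conventional choices, not project settings):

```python
import math

# Minimal population-stability-index (PSI) sketch for one feature.
# Evidently computes much richer drift reports; this only shows the
# shape of the signal.
def psi(expected: list[float], actual: list[float], bins: int = 4) -> float:
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
psi(reference, reference)  # identical distributions: no drift
```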
## Failure modes and recovery (high level)
- Ingestion failures → retry/backoff, alerting, replay via backfill runbooks
- Contract violations → block downstream pipeline stages
- Model regression → rollback via the MLflow registry + deployment automation
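The retry/backoff behavior above is typically exponential backoff with jitter. An illustrative schedule (the base, cap, and jitter strategy are examples, not the project's Airflow retry config):

```python
import random

# Illustrative retry schedule for ingestion failures: exponential
# backoff with "full jitter", capped. Parameters are examples, not the
# project's actual Airflow retry settings.
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))  # jitter spreads retries out
    return delays

backoff_delays(5)  # five waits, bounded by 1s, 2s, 4s, 8s, 16s
```

Jitter matters because many scraping tasks failing at once (e.g. a site outage) would otherwise all retry in lockstep and hammer the source again.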