
Component View (C4 — Level 3)

This view breaks down the internal components of both the offline ML pipeline and the online runtime. Each component has a defined responsibility, contract, failure behavior, and implementation status.


Component Map

```mermaid
flowchart TB
    subgraph Offline[Offline ML Pipeline — DVC Stages]
        SE[Source Extraction]
        PP[Preprocessing]
        GE1[GE: validate_raw]
        GE2[GE: validate_finished / validate_future]
        FE[Feature Engineering]
        GE3[GE: validate_features]
        SP[Temporal Split]
        BL[Baseline Model]
        XGB[Classifier]
        AB[Ablation Study]
        TN[Hyperparameter Tuning]
        FT[Final Train + Calibration]
        BI[Batch Inference\nFeature Assembly]
        MR[Model Registration]
    end
    subgraph Contracts[Contract Layer]
        DC[Data Contract — GE Suites]
        MC[Model Contract — MLflow Signature]
        AC[API Contract — Pydantic Schemas]
    end
    subgraph Runtime[Online Runtime — FastAPI + Celery]
        RV[Request Validation]
        FA[Feature Assembly at Inference]
        IE[Inference Execution]
        TD[Task Dispatch\nSync vs Async]
        TM[Telemetry — Prometheus]
    end
    SE --> GE1 --> PP --> GE2 --> FE --> GE3 --> SP
    SP --> BL
    SP --> XGB
    SP --> AB
    SP --> TN
    TN --> FT
    FT --> MR
    BI -.feature parquet.-> IE
    DC -.gates.-> GE1
    DC -.gates.-> GE2
    DC -.gates.-> GE3
    MC -.enforced.-> MR
    AC -.enforced.-> RV
    MR --> IE
    RV --> TD
    TD --> FA
    FA --> IE
    IE --> TM
```

Offline Pipeline Components

Source Extraction — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Scrape WhoScored.com via Selenoid; normalize data; write to PostgreSQL; export raw parquet to MinIO |
| Inputs | Airflow schedule trigger → FastAPI HTTP → RabbitMQ → celery-worker-api |
| Outputs | PostgreSQL tables (canonical scraped data); `data/raw/*.parquet` (DVC-tracked) |
| Contract | None at input; GE `validate_raw` gate immediately after |
| Failure behavior | Celery retry with backoff; Airflow marks the DAG failed; data gap logged |
| Idempotency | Upsert logic in PostgreSQL; safe to replay |
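The replay-safety claim above hinges on upsert semantics in PostgreSQL. A minimal sketch of the pattern, using SQLite's PostgreSQL-compatible `ON CONFLICT` clause; the table and column names are illustrative, not the actual schema:

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the ON CONFLICT syntax is shared.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, home_goals INTEGER)"
)

UPSERT = """
INSERT INTO matches (match_id, home_goals) VALUES (?, ?)
ON CONFLICT (match_id) DO UPDATE SET home_goals = excluded.home_goals
"""

# Replaying the same extraction is safe: the second write updates in place
# instead of duplicating the row.
conn.execute(UPSERT, (1, 1))
conn.execute(UPSERT, (1, 2))
```

Because the natural key (`match_id` here) decides insert-vs-update, re-running a failed extraction never produces duplicate rows.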

Preprocessing — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Clean and normalize raw match records; resolve team/tournament IDs; produce `finished.parquet` and `future.parquet` |
| Inputs | `data/raw/*.parquet` (DVC stage dep) |
| Outputs | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Contract | GE `validate_finished` and `validate_future` gates downstream |
| Failure behavior | DVC stage failure; pipeline blocked |
| Idempotency | Deterministic; safe to re-run |

Great Expectations Validation Gates — ✅ Implemented

Three distinct GE suites act as blocking gates:

| Gate | DVC stage | Dataset | Failure action |
|---|---|---|---|
| `validate_raw` | After `load_data_from_sources` | Raw parquet | Block pipeline; raise on expectation failure |
| `validate_finished` / `validate_future` | After preprocessing | Interim parquets | Block feature engineering |
| `validate_features` | After `feature_engineering` | Feature parquet | Block training |

Suites are versioned in the repository. Schema evolution requires explicit suite updates.
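The blocking-gate behavior can be sketched in plain Python. This illustrates the pattern only — the real suites are Great Expectations suites, and the expectation names below are hypothetical:

```python
class ValidationError(Exception):
    """Raised on the first failed expectation; fails the DVC stage."""


def run_suite(rows, expectations):
    """Apply each expectation to the dataset; raise on the first failure.

    Because the DVC stage exits non-zero when this raises, everything
    downstream of the gate is blocked until the data (or the suite) is fixed.
    """
    for name, check in expectations:
        if not check(rows):
            raise ValidationError(f"expectation failed: {name}")
    return True


# Hypothetical mini-suite for the raw match dataset.
raw_suite = [
    ("rows_not_empty", lambda rows: len(rows) > 0),
    ("scores_non_negative", lambda rows: all(r["home_goals"] >= 0 for r in rows)),
]

rows = [{"home_goals": 2}, {"home_goals": 0}]
run_suite(rows, raw_suite)
```

Keeping the suites in the repository means a schema change and its gate update land in the same commit, so the gate can never silently drift from the data.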


Feature Engineering — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Compute time-windowed match statistics and rating-based features for each team |
| Inputs | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Outputs | `data/features/*.parquet` |
| Contract | GE `validate_features` gate downstream |
| Architectural invariant | Feature logic (`src/features/`) is shared between the offline pipeline and online inference — no separate implementation for serving |
| Failure behavior | DVC stage failure |
| Idempotency | Deterministic; pure functions; no IO side effects |
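A sketch of what a feature function looks like under this invariant — a pure function with no IO, importable by both the DVC stage and the inference worker. The feature itself (`rolling_goal_diff`) is a hypothetical example, not taken from `src/features/`:

```python
def rolling_goal_diff(results, window=5):
    """Mean goal difference over the team's last `window` finished matches.

    `results` is a chronological list of (goals_scored, goals_conceded)
    tuples. Pure function of its inputs, so the offline pipeline and the
    online path can call exactly the same code and get identical values.
    """
    recent = results[-window:]
    if not recent:
        return 0.0
    return sum(scored - conceded for scored, conceded in recent) / len(recent)
```

Purity is what makes the "no separate serving implementation" invariant cheap to hold: there is no database handle or file path baked into the function to diverge between environments.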

Temporal Split — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Split data into training folds and a holdout set using time-based boundaries (no random shuffling) |
| Inputs | `data/features/*.parquet` |
| Outputs | `data/splits/*.parquet` (folds + holdout) |
| Contract | No data from the holdout period may appear in any training fold (leakage invariant); split boundaries come from `params.yaml` |
| Failure behavior | DVC stage failure if leakage is detected |
| Idempotency | Deterministic given a fixed split configuration |
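The leakage invariant reduces to a date comparison against a configured boundary. A minimal sketch, assuming each row carries a match date and the boundary comes from `params.yaml` (field names here are illustrative):

```python
from datetime import date


def temporal_split(matches, holdout_start):
    """Split chronologically: train strictly before the boundary, holdout at or after.

    No shuffling — a random split would leak future information into training.
    """
    train = [m for m in matches if m["date"] < holdout_start]
    holdout = [m for m in matches if m["date"] >= holdout_start]
    # Leakage invariant: no training row may fall in the holdout period.
    assert all(m["date"] < holdout_start for m in train), "temporal leakage detected"
    return train, holdout


matches = [{"date": date(2024, 1, 5)}, {"date": date(2024, 3, 1)}]
train, holdout = temporal_split(matches, holdout_start=date(2024, 2, 1))
```

In the real stage the assertion failing would surface as a DVC stage failure, which is exactly the blocking behavior the table describes.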

Baseline Model — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Train a reference model to establish a minimum performance bound |
| Inputs | `data/splits/` |
| Outputs | MLflow run with baseline metrics |
| Contract | Provides a lower-bound benchmark; all production candidates must exceed it |
| Failure behavior | DVC stage failure |

Gradient Boosting Classifier — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Train a gradient boosting classifier for match outcome prediction |
| Inputs | `data/splits/` |
| Outputs | MLflow run; serialized model artifact |
| Target | `outcome_1x2` (match result) |
| Failure behavior | DVC stage failure; partial metrics logged to MLflow |

Ablation Study — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Measure the contribution of individual feature groups to model performance |
| Inputs | `data/splits/` |
| Outputs | MLflow runs per feature-set configuration |
| Contract | Results inform which feature groups are retained in the production pipeline |
| Failure behavior | DVC stage failure; individual runs logged to MLflow |

Hyperparameter Tuning — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Search the model hyperparameter space and select the configuration that maximizes holdout performance |
| Inputs | `data/splits/` + tuning configuration from `params.yaml` |
| Outputs | Best hyperparameter set (artifact); MLflow runs per trial |
| Failure behavior | DVC stage failure; partial trial results preserved in MLflow |

Final Train + Calibration — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Train the final model on the full training set with the selected hyperparameters; apply probability calibration |
| Inputs | `data/splits/`, best params from the tuning stage |
| Outputs | Calibrated model artifact; MLflow run |
| Contract | Calibrated probability outputs signed in the MLflow model signature |
| Failure behavior | DVC stage failure |

Model Registration — 🚧 Partially Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Register the final model in the MLflow Registry; assign a version; promote to Staging |
| Inputs | Final calibrated model artifact; MLflow run ID |
| Outputs | MLflow registered model version |
| Contract | MLflow pyfunc model signature enforced at registration |
| Current limitation | Staging → Production promotion is manual; no automated metric-threshold gate |
| Planned | Automated promotion policy (see Roadmap) |

Batch Inference Feature Assembly — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Assemble feature vectors for all upcoming matches; write to `data/predictions/match_features.parquet` |
| Inputs | `data/features/`, future match schedule |
| Outputs | `data/predictions/match_features.parquet` |
| Contract | Feature schema must match the training feature schema |
| Failure behavior | DVC stage failure |
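The schema-match contract can be enforced with a cheap column-set comparison before any predictions are written. A hedged sketch — the helper name and column names are illustrative, not the project's actual code:

```python
def check_feature_schema(training_cols, serving_cols):
    """Fail fast if assembled features drift from the training schema.

    Order-insensitive: column order can be fixed by reindexing, but a
    missing or unexpected column means the model would see garbage.
    """
    missing = sorted(set(training_cols) - set(serving_cols))
    extra = sorted(set(serving_cols) - set(training_cols))
    if missing or extra:
        raise ValueError(f"feature schema mismatch: missing={missing}, extra={extra}")
    return True
```

Running this check inside the batch stage turns silent train/serve skew into an explicit DVC stage failure.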

Online Runtime Components

Request Validation — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Validate all incoming API requests against Pydantic schemas before any processing |
| Schema | `PredictRequest` / `PredictResponse` in `src/app/schemas/` |
| Failure behavior | Returns 422 Unprocessable Entity with structured error details; no inference runs |
| Contract | API contract; OpenAPI schema auto-generated by FastAPI |
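The fail-before-work behavior can be illustrated without running FastAPI. The sketch below uses a plain dataclass as a stand-in for the actual Pydantic `PredictRequest`; the field names and rules are assumptions, not the real schema:

```python
from dataclasses import dataclass


@dataclass
class PredictRequest:
    """Stand-in for the Pydantic schema: invalid input raises at construction,
    mirroring FastAPI's 422 — no queueing or inference work ever starts."""

    home_team_id: int
    away_team_id: int

    def __post_init__(self):
        if not isinstance(self.home_team_id, int) or not isinstance(self.away_team_id, int):
            raise ValueError("team ids must be integers")
        if self.home_team_id == self.away_team_id:
            raise ValueError("home and away team must differ")
```

With real Pydantic models, FastAPI additionally derives the OpenAPI schema from the same definitions, so the published contract can never drift from the enforced one.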

Feature Assembly at Inference — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Assemble feature vectors at inference time using the same `src/features/` code as the offline pipeline |
| Inputs | Match context from the request; historical data from the Redis cache or recomputed |
| Outputs | Feature vector matching the training schema |
| Contract | Must produce identical features to the offline pipeline for the same input |
| Failure behavior | Inference task fails; error returned to FastAPI |

Inference Execution — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Run the loaded model against assembled feature vectors; return a probability distribution |
| Model loading | Lazy, once per worker process; resolved from the MLflow Registry champion alias |
| Outputs | Probability vector `[p_home_win, p_draw, p_away_win]`; model version metadata |
| Failure behavior | Task fails; FastAPI returns 500/504 depending on mode |
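Lazy, once-per-worker model loading is a small singleton pattern. A sketch — the `load_champion` stub stands in for an MLflow Registry lookup (for example, resolving a `models:/<name>@champion` alias via `mlflow.pyfunc.load_model`):

```python
_model = None  # module-level cache, one per worker process


def load_champion():
    """Stand-in for the registry lookup; the real loader would hit MLflow."""
    return object()


def get_model():
    """Resolve the champion model on first use; every later call reuses it.

    Loading at first request rather than import time keeps worker startup
    fast and means a registry outage only affects workers that cold-start.
    """
    global _model
    if _model is None:
        _model = load_champion()
    return _model
```

Because Celery workers are separate processes, each one pays the load cost exactly once; there is no cross-process sharing to coordinate.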

Task Dispatch (Sync vs Async) — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Route inference requests to the Celery `ml` queue; manage the timeout for the sync path; return a `task_id` for async |
| Sync timeout | 30 s (configurable) |
| Async return | `task_id` for polling via `GET /monitoring/task_status/{task_id}` |
| Failure behavior | Sync: 504 on timeout. Async: task state `FAILURE` retrievable via the status endpoint |

Telemetry — ✅ Implemented

| Attribute | Detail |
|---|---|
| Responsibility | Capture and expose Prometheus metrics for all inference requests |
| Metrics (8 total) | Request count; request latency histograms (p50/p95/p99); error rate; active tasks; queue depth; cache hit rate |
| Endpoint | `GET /metrics` |
| Failure behavior | Non-blocking; metrics collection failure does not affect inference |
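The non-blocking guarantee means telemetry errors are swallowed rather than propagated to the request path. A sketch with a toy metrics sink in place of the Prometheus client; the names here are illustrative:

```python
import time
from collections import defaultdict


class SafeMetrics:
    """Stand-in for the Prometheus client: collection never breaks inference."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, name, value):
        try:
            self.samples[name].append(float(value))
        except Exception:
            pass  # swallow telemetry errors; the request still succeeds


def timed(metrics, name):
    """Decorator that records call latency even when the wrapped call raises."""

    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics.observe(name, time.perf_counter() - start)

        return inner

    return wrap
```

The `try/finally` ensures error-path latency is measured too, so the latency histograms do not silently exclude the slowest (failing) requests.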