# Component View (C4 — Level 3)

This view breaks down the internal components of both the offline ML pipeline and the online runtime. Each component has a defined responsibility, contract, failure behavior, and implementation status.
## Component Map

```mermaid
flowchart TB
    subgraph Offline["Offline ML Pipeline — DVC Stages"]
        SE[Source Extraction]
        PP[Preprocessing]
        GE1[GE: validate_raw]
        GE2[GE: validate_finished / validate_future]
        FE[Feature Engineering]
        GE3[GE: validate_features]
        SP[Temporal Split]
        BL[Baseline Model]
        XGB[Classifier]
        AB[Ablation Study]
        TN[Hyperparameter Tuning]
        FT[Final Train + Calibration]
        BI["Batch Inference<br/>Feature Assembly"]
        MR[Model Registration]
    end
    subgraph Contracts["Contract Layer"]
        DC[Data Contract — GE Suites]
        MC[Model Contract — MLflow Signature]
        AC[API Contract — Pydantic Schemas]
    end
    subgraph Runtime["Online Runtime — FastAPI + Celery"]
        RV[Request Validation]
        FA[Feature Assembly at Inference]
        IE[Inference Execution]
        TD["Task Dispatch<br/>Sync vs Async"]
        TM[Telemetry — Prometheus]
    end
    SE --> GE1 --> PP --> GE2 --> FE --> GE3 --> SP
    SP --> BL
    SP --> XGB
    SP --> AB
    SP --> TN
    TN --> FT
    FT --> MR
    BI -. feature parquet .-> IE
    DC -. gates .-> GE1
    DC -. gates .-> GE2
    DC -. gates .-> GE3
    MC -. enforced .-> MR
    AC -. enforced .-> RV
    MR --> IE
    RV --> TD
    TD --> FA
    FA --> IE
    IE --> TM
```
## Offline Pipeline Components

### Source Extraction — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Scrape WhoScored.com via Selenoid; normalize data; write to PostgreSQL; export raw parquet to MinIO |
| Inputs | Airflow schedule trigger → FastAPI HTTP → RabbitMQ → celery-worker-api |
| Outputs | PostgreSQL tables (canonical scraped data); `data/raw/*.parquet` (DVC-tracked) |
| Contract | None at input; GE `validate_raw` gate immediately after |
| Failure behavior | Celery retry with backoff; Airflow marks the DAG failed; data gap logged |
| Idempotency | Upsert logic in PostgreSQL; safe to replay |
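The replay-safety claim rests on the upsert semantics. A minimal sketch of the idea, using in-memory SQLite as a stand-in for PostgreSQL (both share the `ON CONFLICT ... DO UPDATE` syntax); the `matches` table and its columns are illustrative, not the project's actual schema:

```python
import sqlite3

# SQLite stand-in for the PostgreSQL upsert; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (match_id INTEGER PRIMARY KEY, home TEXT, away TEXT, score TEXT)"
)

def upsert_match(row):
    # Replaying the same scrape overwrites the existing row instead of
    # duplicating it — this is what makes the extraction stage safe to re-run.
    conn.execute(
        """INSERT INTO matches (match_id, home, away, score)
           VALUES (:match_id, :home, :away, :score)
           ON CONFLICT(match_id) DO UPDATE SET
               home = excluded.home, away = excluded.away, score = excluded.score""",
        row,
    )

row = {"match_id": 1, "home": "A", "away": "B", "score": "2:1"}
upsert_match(row)
upsert_match(row)  # replay: still exactly one row
count = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

With keyed upserts, a failed-and-retried Airflow run converges to the same database state as a clean run.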
### Preprocessing — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Clean and normalize raw match records; resolve team/tournament IDs; produce `finished.parquet` and `future.parquet` |
| Inputs | `data/raw/*.parquet` (DVC stage dep) |
| Outputs | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Contract | GE `validate_finished` and `validate_future` gates downstream |
| Failure behavior | DVC stage failure; pipeline blocked |
| Idempotency | Deterministic; safe to re-run |
### Great Expectations Validation Gates — ✅ Implemented

Three distinct GE suites act as blocking gates:

| Gate | DVC stage | Dataset | Failure action |
| --- | --- | --- | --- |
| `validate_raw` | After `load_data_from_sources` | Raw parquet | Block pipeline; raise on expectation failure |
| `validate_finished` / `validate_future` | After preprocessing | Interim parquets | Block feature engineering |
| `validate_features` | After `feature_engineering` | Feature parquet | Block training |

Suites are versioned in the repository. Schema evolution requires explicit suite updates.
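The gating behavior can be illustrated with a minimal stdlib sketch — this mimics the fail-fast contract of the GE suites, but the expectations shown (non-null IDs, scores within a plausible range) are illustrative, not the project's actual suite contents:

```python
# Minimal sketch of a blocking validation gate in the spirit of the GE suites.
def validate_finished(rows):
    failures = []
    for i, row in enumerate(rows):
        if row.get("match_id") is None:
            failures.append((i, "match_id must not be null"))
        if not (0 <= row.get("home_goals", -1) <= 20):
            failures.append((i, "home_goals out of range"))
    if failures:
        # Blocking behavior: raising fails the DVC stage and halts the pipeline.
        raise ValueError(f"validation failed: {failures}")
    return True

ok = validate_finished([{"match_id": 1, "home_goals": 2}])
```

The key property is that a gate either passes silently or raises — there is no "warn and continue" path, so bad data cannot flow into downstream stages.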
### Feature Engineering — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Compute time-windowed match statistics and rating-based features for each team |
| Inputs | `data/interim/finished.parquet`, `data/interim/future.parquet` |
| Outputs | `data/features/*.parquet` |
| Contract | GE `validate_features` gate downstream |
| Architectural invariant | Feature logic (`src/features/`) is shared between the offline pipeline and online inference — no separate implementation for serving |
| Failure behavior | DVC stage failure |
| Idempotency | Deterministic; pure functions; no IO side effects |
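The "shared feature logic" invariant can be sketched as a single pure function that both the DVC stage and the inference worker would import. The feature itself (rolling mean of goals over the last *n* matches) is illustrative, not one of the project's actual features:

```python
# One pure, deterministic function, importable by both the offline pipeline
# and the online inference path — no second implementation for serving.
def rolling_goal_mean(goals, window=5):
    """No IO, no state: the same inputs always yield the same feature value."""
    recent = goals[-window:]
    return sum(recent) / len(recent) if recent else 0.0

offline_value = rolling_goal_mean([1, 0, 3, 2, 2])
online_value = rolling_goal_mean([1, 0, 3, 2, 2])  # identical by construction
```

Because the function is pure, training/serving skew from divergent feature code is eliminated by design rather than caught by monitoring.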
### Temporal Split — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Split data into training folds and a holdout set using time-based boundaries (no random shuffling) |
| Inputs | `data/features/*.parquet` |
| Outputs | `data/splits/*.parquet` (folds + holdout) |
| Contract | No data from the holdout period may appear in any training fold (leakage invariant); split boundaries come from `params.yaml` |
| Failure behavior | DVC stage failure if leakage detected |
| Idempotency | Deterministic given fixed split configuration |
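The leakage invariant amounts to one assertion: every training record must predate the holdout boundary. A minimal sketch, assuming a single boundary date (the real boundaries come from `params.yaml`):

```python
from datetime import date

# Sketch of a time-based split with the leakage invariant enforced inline.
def temporal_split(rows, holdout_start):
    train = [r for r in rows if r["date"] < holdout_start]
    holdout = [r for r in rows if r["date"] >= holdout_start]
    # Leakage invariant: fail the stage rather than silently train on future data.
    assert all(r["date"] < holdout_start for r in train), "holdout leaked into training"
    return train, holdout

rows = [{"date": date(2023, m, 1)} for m in range(1, 13)]
train, holdout = temporal_split(rows, holdout_start=date(2023, 10, 1))
```

Because matches are time-ordered events, random shuffling would let a model see "future" results relative to the matches it is evaluated on — hence the hard temporal boundary.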
### Baseline Model — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Train a reference model to establish a minimum performance bound |
| Inputs | `data/splits/` |
| Outputs | MLflow run with baseline metrics |
| Contract | Provides a lower-bound benchmark; all production candidates must exceed it |
| Failure behavior | DVC stage failure |
### Gradient Boosting Classifier — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Train a gradient boosting classifier for match outcome prediction |
| Inputs | `data/splits/` |
| Outputs | MLflow run; serialized model artifact |
| Target | `outcome_1x2` (match result) |
| Failure behavior | DVC stage failure; partial metrics logged to MLflow |
### Ablation Study — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Measure the contribution of individual feature groups to model performance |
| Inputs | `data/splits/` |
| Outputs | MLflow runs per feature-set configuration |
| Contract | Results inform which feature groups are retained in the production pipeline |
| Failure behavior | DVC stage failure; individual runs logged to MLflow |
### Hyperparameter Tuning — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Search the model hyperparameter space and select the configuration that maximizes holdout performance |
| Inputs | `data/splits/` + tuning configuration from `params.yaml` |
| Outputs | Best hyperparameter set (artifact); MLflow runs per trial |
| Failure behavior | DVC stage failure; partial trial results preserved in MLflow |
### Final Train + Calibration — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Train the final model on the full training set with selected hyperparameters; apply probability calibration |
| Inputs | `data/splits/`, best params from the tuning stage |
| Outputs | Calibrated model artifact; MLflow run |
| Contract | Calibrated probability outputs declared in the MLflow model signature |
| Failure behavior | DVC stage failure |
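Whether calibration worked can be checked with a reliability computation: bucket the predicted probabilities and compare each bucket's mean prediction to the observed outcome rate. This is a diagnostic sketch in plain Python, not the project's calibration implementation:

```python
# Illustrative reliability check: for a well-calibrated model, the mean
# predicted probability in each bucket should track the observed frequency.
def reliability(preds, outcomes, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, y))
    report = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            report.append((round(mean_pred, 3), round(observed, 3)))
    return report

report = reliability([0.1, 0.15, 0.8, 0.85], [0, 0, 1, 1])
```

For a betting-adjacent prediction product, calibrated probabilities matter more than raw accuracy: a "60%" prediction should come true about 60% of the time.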
### Model Registration — 🚧 Partially Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Register the final model to the MLflow Registry; assign a version; promote to Staging |
| Inputs | Final calibrated model artifact; MLflow run ID |
| Outputs | MLflow registered model version |
| Contract | MLflow pyfunc model signature enforced at registration |
| Current limitation | Staging → Production promotion is manual; no automated metric-threshold gate |
| Planned | Automated promotion policy (see Roadmap) |
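The missing metric-threshold gate could be as small as a pure decision function that the registration step consults before promoting. A hedged sketch — the metric name (`log_loss`), the improvement margin, and where the values come from are all hypothetical here:

```python
# Sketch of an automated promotion gate: promote a candidate only if it beats
# the current champion by a configured margin on the headline holdout metric.
# Metric name and threshold are hypothetical, not the project's actual policy.
def should_promote(candidate_metrics, champion_metrics, min_improvement=0.005):
    # log-loss: lower is better, so the candidate must undercut the champion.
    return candidate_metrics["log_loss"] <= champion_metrics["log_loss"] - min_improvement

promote = should_promote({"log_loss": 0.92}, {"log_loss": 0.95})
```

Keeping the gate as a pure function makes the policy unit-testable and auditable, independent of the MLflow client calls that would act on its decision.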
### Batch Inference Feature Assembly — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Assemble feature vectors for all upcoming matches; write to `data/predictions/match_features.parquet` |
| Inputs | `data/features/`, future match schedule |
| Outputs | `data/predictions/match_features.parquet` |
| Contract | Feature schema must match the training feature schema |
| Failure behavior | DVC stage failure |
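The schema contract can be enforced with a simple exact-match check before writing the parquet. A minimal sketch — the column names are illustrative, not the project's actual feature schema:

```python
# Sketch of the schema-compatibility check implied by the contract: the columns
# assembled for inference must match the training schema exactly, order included.
TRAINING_SCHEMA = ["home_form_5", "away_form_5", "elo_diff"]  # illustrative names

def check_schema(columns, expected=TRAINING_SCHEMA):
    if list(columns) != list(expected):
        raise ValueError(f"schema drift: got {list(columns)}, expected {list(expected)}")
    return True

ok = check_schema(["home_form_5", "away_form_5", "elo_diff"])
```

Checking order as well as names matters because most serialized models consume feature vectors positionally.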
## Online Runtime Components

### Request Validation — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Validate all incoming API requests against Pydantic schemas before any processing |
| Schema | `PredictRequest` / `PredictResponse` in `src/app/schemas/` |
| Failure behavior | Returns 422 Unprocessable Entity with structured error details; no inference runs |
| Contract | API contract; OpenAPI schema auto-generated by FastAPI |
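The fail-fast idea can be illustrated with a stdlib stand-in. The real schemas are Pydantic models in `src/app/schemas/` (which FastAPI turns into 422 responses automatically); the field names and rules below are hypothetical:

```python
from dataclasses import dataclass

# Stdlib stand-in for a Pydantic request schema: invalid input raises before
# any inference work happens. Field names and rules are illustrative only.
@dataclass
class PredictRequest:
    home_team_id: int
    away_team_id: int

    def __post_init__(self):
        if not isinstance(self.home_team_id, int) or not isinstance(self.away_team_id, int):
            raise ValueError("team ids must be integers")  # would surface as HTTP 422
        if self.home_team_id == self.away_team_id:
            raise ValueError("a team cannot play itself")

req = PredictRequest(home_team_id=10, away_team_id=20)
```

The design point is that no malformed request ever reaches feature assembly or the model — validation is the outermost layer of the runtime.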
### Feature Assembly at Inference — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Assemble feature vectors at inference time using the same `src/features/` code as the offline pipeline |
| Inputs | Match context from the request; historical data from the Redis cache or recomputed |
| Outputs | Feature vector matching the training schema |
| Contract | Must produce identical features to the offline pipeline for the same input |
| Failure behavior | Inference task fails; error returned to FastAPI |
### Inference Execution — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Run the loaded model against assembled feature vectors; return a probability distribution |
| Model loading | Lazy, once per worker process; resolved from the MLflow Registry champion alias |
| Outputs | Probability vector `[p_home_win, p_draw, p_away_win]`; model version metadata |
| Failure behavior | Task fails; FastAPI returns 500/504 depending on mode |
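"Lazy, once per worker process" can be sketched with a cached loader. In the real service the loader would resolve the champion alias from the MLflow Registry; here a stub model stands in, and the load counter exists only to make the caching visible:

```python
from functools import lru_cache

LOAD_CALLS = 0  # instrumentation for the sketch: counts actual loads

# Sketch of lazy, once-per-worker model loading. The first call pays the load
# cost; every later call in the same process reuses the cached model object.
@lru_cache(maxsize=1)
def get_model():
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Stub standing in for mlflow pyfunc loading via the champion alias.
    return {"version": "stub", "predict": lambda feats: [0.45, 0.27, 0.28]}

probs = get_model()["predict"]([1.0, 2.0])
probs_again = get_model()["predict"]([1.0, 2.0])  # no reload on the second call
```

Loading per process rather than per request keeps cold-start cost off the request path while avoiding cross-worker shared state.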
### Task Dispatch (Sync vs Async) — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Route inference requests to the Celery `ml` queue; manage the timeout for the sync path; return a `task_id` for async |
| Sync timeout | 30 s (configurable) |
| Async return | `task_id` for polling via `GET /monitoring/task_status/{task_id}` |
| Failure behavior | Sync: 504 on timeout. Async: task state `FAILURE` retrievable via the status endpoint |
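The sync/async split can be sketched with a thread pool standing in for Celery: the sync path waits up to a timeout and returns 504 on expiry, while the async path returns a task id immediately. The timeout here is shrunk from the 30 s default for demonstration:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time
import uuid

# Thread pool standing in for the Celery ml queue; sketch only.
executor = ThreadPoolExecutor(max_workers=2)
TASKS = {}  # stand-in for the task backend queried by the status endpoint

def slow_inference(delay):
    time.sleep(delay)
    return [0.5, 0.25, 0.25]

def dispatch(delay, mode="sync", timeout=0.2):
    future = executor.submit(slow_inference, delay)
    if mode == "async":
        task_id = str(uuid.uuid4())
        TASKS[task_id] = future  # client polls the status endpoint with this id
        return {"task_id": task_id}
    try:
        return {"result": future.result(timeout=timeout)}
    except FutureTimeout:
        return {"error": 504}  # sync path: gateway timeout

fast = dispatch(0.0)
slow = dispatch(1.0)                  # exceeds the 0.2 s sync budget
queued = dispatch(0.0, mode="async")  # returns immediately with a task_id
```

The point of the split is that the sync path bounds client latency with a hard timeout, while the async path trades immediacy for guaranteed completion tracking.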
### Telemetry — ✅ Implemented

| Attribute | Detail |
| --- | --- |
| Responsibility | Capture and expose Prometheus metrics for all inference requests |
| Metrics | Request count, request latency histograms (p50/p95/p99), error rate, active tasks, queue depth, cache hit rate |
| Endpoint | `GET /metrics` |
| Failure behavior | Non-blocking; a metrics-collection failure does not affect inference |