Architecture Overview

SoccerPredictAI is a layered ML system with hybrid inference — a production-style MLOps platform connecting scheduled data acquisition, versioned datasets, reproducible training, experiment tracking, and sync/async serving through explicit contracts at every boundary.


Architectural Style

| Property | Description |
|---|---|
| System type | Layered end-to-end MLOps system with hybrid sync + async inference |
| Offline / online separation | DVC pipeline (offline) and FastAPI + Celery (online) are independent execution environments sharing contracts and feature logic, not runtime infrastructure |
| Orchestration model | Calendar-driven ingestion (Airflow) + artifact-driven ML pipeline (DVC) — each tool used for its native purpose |
| Deployment model | Self-hosted single-node Kubernetes; stateless services with stateful storage managed by K8s |
| Contract discipline | Every subsystem boundary has a formal, tested contract: Great Expectations (data), MLflow signature (model), Pydantic (API) |
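To make the contract-discipline row concrete, the API side of it can be sketched with Pydantic models. The field names below (`home_team`, `away_team`, the three probability fields) are illustrative assumptions, not the project's actual schema:

```python
from pydantic import BaseModel, ValidationError

class PredictionRequest(BaseModel):
    # Hypothetical request schema -- the real field names live in the API spec.
    home_team: str
    away_team: str

class PredictionResponse(BaseModel):
    # Hypothetical response schema: probabilities for home win / draw / away win.
    home_win: float
    draw: float
    away_win: float

# FastAPI enforces these schemas at the endpoint boundary automatically;
# standalone they behave the same way:
req = PredictionRequest(home_team="Arsenal", away_team="Chelsea")
print(req.home_team)

try:
    PredictionRequest(home_team="Arsenal")  # missing away_team
except ValidationError as exc:
    print("rejected:", len(exc.errors()), "error(s)")
```

A malformed request never reaches the inference path: validation fails before any handler code runs, which is what makes the boundary a tested contract rather than a convention.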

System Summary

  • Scraping: Airflow triggers browser automation (Selenoid) via a Celery task chain; scraped data lands in PostgreSQL.
  • Data pipeline: DVC orchestrates reproducible stages from raw export through feature engineering to model registration.
  • Experiment tracking: MLflow records all runs; models are versioned and promoted via champion/challenger aliases.
  • Serving: FastAPI exposes sync and async prediction endpoints; Celery workers handle inference; Redis caches results.
  • Observability: Prometheus collects 8 metrics from FastAPI and Celery; Grafana dashboards are planned.
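The sync/async split in the serving bullet can be illustrated framework-free. The sketch below replaces FastAPI, Celery, and Redis with in-process stand-ins (a dict cache, a job dict) purely to show the control flow; every name in it is invented for illustration:

```python
import uuid

CACHE: dict[str, dict] = {}   # stand-in for the Redis prediction cache
JOBS: dict[str, dict] = {}    # stand-in for the Celery result backend

def run_model(home: str, away: str) -> dict:
    # Stand-in for the Celery inference task that loads the champion model.
    return {"home_win": 0.5, "draw": 0.3, "away_win": 0.2}

def predict_sync(home: str, away: str) -> dict:
    # Sync path: check the cache, compute on miss, cache the result.
    key = f"{home}:{away}"
    if key not in CACHE:
        CACHE[key] = run_model(home, away)
    return CACHE[key]

def predict_async(home: str, away: str) -> str:
    # Async path: enqueue and return a task id immediately.
    task_id = str(uuid.uuid4())
    JOBS[task_id] = {"status": "PENDING"}
    # A Celery worker would pick the job up; here we "complete" it inline.
    JOBS[task_id] = {"status": "SUCCESS", "result": predict_sync(home, away)}
    return task_id

def poll(task_id: str) -> dict:
    # Polling endpoint: report current status (and result once done).
    return JOBS.get(task_id, {"status": "NOT_FOUND"})

task = predict_async("Arsenal", "Chelsea")
print(poll(task)["status"])
```

The real system makes the same three moves — cache lookup, task enqueue, status poll — but across process boundaries, which is why the broker and cache appear in the failure-mode analysis below.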

Current Implementation vs Target Design

| Layer | Current State | Target / Planned |
|---|---|---|
| Data ingestion | ✅ Airflow → Selenoid → PostgreSQL → MinIO → DVC | Streaming ingestion (long-term) |
| Feature engineering | ✅ Time-windowed stats + ELO, DVC stage | Player-level features (long-term) |
| Training pipeline | ✅ DVC multi-stage, MLflow tracking | Automated retraining trigger (mid-term) |
| Model registry | 🚧 Staging/Production aliases; manual promotion gate | Automated promotion policy (near-term) |
| Sync serving | ✅ FastAPI + Celery, sync + async paths, Redis cache | Cache invalidation on model promotion (mid-term) |
| Async serving | POST /predict/async/ + polling endpoint | Batch HTTP endpoint (mid-term) |
| Data validation | ✅ Great Expectations at raw / interim / features | |
| Metrics export | ✅ Prometheus /metrics, 8 metrics | Grafana dashboards (near-term) |
| Alerting | 📋 Rules designed | Alertmanager rules (near-term) |
| Drift detection | 📋 Evidently designed | Integration pending (mid-term) |
| Secrets | ✅ SOPS + age, GitLab CI injection | |
| Deployment | ✅ K8s single-node, Helm, GitLab CI/CD | HA K8s if scale justifies (long-term) |
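The manual promotion gate in the model-registry row is, at heart, a champion/challenger comparison. The decision logic can be sketched as below; the metric name and threshold are assumptions, and in the real registry the final step would be an MLflow alias update (e.g. `MlflowClient.set_registered_model_alias`) rather than a print:

```python
def should_promote(champion_metrics: dict, challenger_metrics: dict,
                   metric: str = "log_loss", min_improvement: float = 0.01) -> bool:
    """Promote the challenger only if it beats the champion by a margin.

    Lower is better for log_loss; both the metric and the margin here are
    illustrative, not the project's actual promotion criteria.
    """
    return challenger_metrics[metric] <= champion_metrics[metric] - min_improvement

champion = {"log_loss": 0.62}
challenger = {"log_loss": 0.58}

if should_promote(champion, challenger):
    # In MLflow this is where the alias would move:
    # MlflowClient().set_registered_model_alias(model_name, "champion", new_version)
    print("promote challenger to champion")
else:
    print("keep current champion")
```

Today a human applies this judgment by hand; the near-term target is to encode it as an automated policy so promotion becomes a tested gate like the others.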

Layer Contracts Diagram

WhoScored → [Airflow + Selenoid] → PostgreSQL
  → [DVC raw export] → MinIO
  → [GE: validate_raw gate]
  → [DVC preprocessing] → data/interim/
  → [GE: validate_finished / validate_future gates]
  → [DVC feature_engineering] → data/features/
  → [GE: validate_features gate]
  → [DVC train / tune / calibrate] → MLflow Registry
  → [FastAPI + Celery + Redis] → User / UI

Prometheus ← /metrics

Each arrow is an explicit contract:

  • Data schemas validated with Great Expectations at raw, interim, and features stages.
  • Model signature enforced via MLflow pyfunc wrapper.
  • API request/response schemas enforced with Pydantic.
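The semantics of the GE gates above can be shown without the library: a stage validates its input and fails hard (nonzero exit), so DVC never runs downstream stages on bad data. The expectations below are invented examples standing in for the real Great Expectations suites:

```python
import sys

def validate_features(rows: list[dict]) -> list[str]:
    # Minimal stand-in for an expectation suite on the features stage.
    # Both checks are hypothetical, not the project's actual expectations.
    failures = []
    for i, row in enumerate(rows):
        if not -1000 <= row.get("elo_diff", 0) <= 1000:
            failures.append(f"row {i}: elo_diff out of range")
        if row.get("team") in (None, ""):
            failures.append(f"row {i}: team is missing")
    return failures

def gate(rows: list[dict]) -> None:
    # DVC treats a nonzero exit as stage failure, blocking downstream stages.
    failures = validate_features(rows)
    if failures:
        print("\n".join(failures))
        sys.exit(1)
    print("validation passed")

gate([{"team": "Arsenal", "elo_diff": 120}])
```

The important property is the exit code, not the checks themselves: a failed gate stops the pipeline rather than logging a warning, which is what turns validation into a contract.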

Known Architectural Limitations

These are current limitations of the system as deployed. They are documented explicitly to avoid presenting a planned or aspirational design as the current runtime reality.

| Limitation | Detail |
|---|---|
| Single-node Kubernetes | All services run on one VPS (healserver). A node failure is a full-service outage. No pod rescheduling across nodes is possible. |
| No High Availability | No replicated control plane. No multi-node worker pool. Accepted tradeoff against infrastructure cost and operational overhead. |
| Single RabbitMQ broker | The message queue has no replication or clustering. RabbitMQ unavailability blocks all inference (sync and async). |
| No autoscaling | No Horizontal Pod Autoscaler is configured. Celery worker replicas are static. Cannot scale under unexpected load. |
| Manual model promotion | Models are promoted to champion alias manually after review. No automated promotion policy or evaluation gate exists today. |
| No API authentication | All public endpoints are unauthenticated. Access control is limited to network-level TLS termination and operator-managed exposure. |
| Cache invalidation not tied to model promotion | When a new model is promoted to champion, existing Redis cache entries are not flushed. Stale predictions from the previous model may be served until TTL expires. |
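The last limitation follows directly from how cache entries are keyed. One common fix — sketched here as an assumption, not the deployed design — is to include the champion model version in the cache key, so a promotion makes every old key unreachable and the stale entries simply age out via TTL:

```python
import hashlib

def cache_key(home: str, away: str, model_version: str) -> str:
    # Hypothetical key scheme: because model_version is part of the key,
    # promoting a new champion changes every key, so old entries are never
    # read again and expire naturally via TTL -- no explicit flush needed.
    raw = f"{model_version}:{home}:{away}"
    return hashlib.sha256(raw.encode()).hexdigest()

k_v1 = cache_key("Arsenal", "Chelsea", "v1")
k_v2 = cache_key("Arsenal", "Chelsea", "v2")
print(k_v1 != k_v2)
```

This trades an explicit flush (and its coordination with the promotion step) for implicit invalidation, at the cost of a cold cache immediately after each promotion.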

For the path to resolving these, see Roadmap and Trade-offs.


Architecture Pages

Foundation

| Page | Purpose |
|---|---|
| Architecture Principles | Design philosophy and how it shapes every decision |
| System Requirements | Functional, non-functional, constraints, non-goals |
| System Boundary | What is inside vs outside the system; trust zones |

Structural Views

| Page | Purpose |
|---|---|
| System Context (C4 L1) | External actors and system boundary |
| Container View (C4 L2) | Deployable services and their responsibilities |
| Component View (C4 L3) | Module-level breakdown with contracts |

Behavioral and Operational Views

| Page | Purpose |
|---|---|
| Data & ML Flow | End-to-end pipeline: trigger → input → gate → output |
| Runtime View | Sync and async prediction paths, cache rules, model loading |
| Deployment View | Physical topology, namespace layout, ingress path |
| Failure Modes | Detection, impact, recovery, and preventive controls |

Design Quality

| Page | Purpose |
|---|---|
| Environments | Runtime layers and dependency strategy |
| Security | Threat model, secret lifecycle, access boundaries |
| Trade-offs | Key decisions with alternatives and consequences |
| Roadmap | Near / mid / long-term planned improvements |