Architecture Overview

SoccerPredictAI is a layered ML system with hybrid inference — a production-style MLOps platform connecting scheduled data acquisition, versioned datasets, reproducible training, experiment tracking, and sync/async serving through explicit contracts at every boundary.


Architectural Style

| Property | Description |
|---|---|
| System type | Layered end-to-end MLOps system with hybrid sync + async inference |
| Offline / online separation | DVC pipeline (offline) and FastAPI + Celery (online) are independent execution environments sharing contracts and feature logic, not runtime infrastructure |
| Orchestration model | Calendar-driven ingestion (Airflow) + artifact-driven ML pipeline (DVC) — each tool used for its native purpose |
| Deployment model | Self-hosted single-node Kubernetes; stateless services with stateful storage managed by K8s |
| Contract discipline | Every subsystem boundary has a formal, tested contract: Great Expectations (data), MLflow signature (model), Pydantic (API) |
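To make the contract-discipline row concrete, the API side of it can be sketched with Pydantic models. The field names below (`home_team`, `away_team`, the three probability fields) are illustrative assumptions, not the project's actual schema:

```python
from pydantic import BaseModel, ValidationError

class PredictionRequest(BaseModel):
    # Hypothetical request schema -- the real field names live in the API spec.
    home_team: str
    away_team: str

class PredictionResponse(BaseModel):
    # Hypothetical response schema: probabilities for home win / draw / away win.
    home_win: float
    draw: float
    away_win: float

# FastAPI enforces these schemas at the endpoint boundary automatically;
# standalone they behave the same way:
req = PredictionRequest(home_team="Arsenal", away_team="Chelsea")
print(req.home_team)

try:
    PredictionRequest(home_team="Arsenal")  # missing away_team
except ValidationError as exc:
    print("rejected:", len(exc.errors()), "error(s)")
```

A malformed request never reaches the inference path: validation fails before any handler code runs, which is what makes the boundary a tested contract rather than a convention.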

System Summary

  • Scraping: Airflow triggers browser automation (Selenoid) via a Celery task chain; scraped data lands in PostgreSQL.
  • Data pipeline: DVC orchestrates reproducible stages from raw export through feature engineering to model registration.
  • Experiment tracking: MLflow records all runs; models are versioned and promoted via champion/challenger aliases.
  • Serving: FastAPI exposes sync and async prediction endpoints; Celery workers handle inference; Redis caches results.
  • Observability: Prometheus collects 8 metrics from FastAPI and Celery; Grafana dashboards are planned.
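The sync/async split in the serving bullet can be illustrated framework-free. The sketch below replaces FastAPI, Celery, and Redis with in-process stand-ins (a dict cache, a job dict) purely to show the control flow; every name in it is invented for illustration:

```python
import uuid

CACHE: dict[str, dict] = {}   # stand-in for the Redis prediction cache
JOBS: dict[str, dict] = {}    # stand-in for the Celery result backend

def run_model(home: str, away: str) -> dict:
    # Stand-in for the Celery inference task that loads the champion model.
    return {"home_win": 0.5, "draw": 0.3, "away_win": 0.2}

def predict_sync(home: str, away: str) -> dict:
    # Sync path: check the cache, compute on miss, cache the result.
    key = f"{home}:{away}"
    if key not in CACHE:
        CACHE[key] = run_model(home, away)
    return CACHE[key]

def predict_async(home: str, away: str) -> str:
    # Async path: enqueue and return a task id immediately.
    task_id = str(uuid.uuid4())
    JOBS[task_id] = {"status": "PENDING"}
    # A Celery worker would pick the job up; here we "complete" it inline.
    JOBS[task_id] = {"status": "SUCCESS", "result": predict_sync(home, away)}
    return task_id

def poll(task_id: str) -> dict:
    # Polling endpoint: report current status (and result once done).
    return JOBS.get(task_id, {"status": "NOT_FOUND"})

task = predict_async("Arsenal", "Chelsea")
print(poll(task)["status"])
```

The real system makes the same three moves — cache lookup, task enqueue, status poll — but across process boundaries, which is why the broker and cache appear in the failure-mode analysis below.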

Current Implementation vs Target Design

| Layer | Current State | Target / Planned |
|---|---|---|
| Data ingestion | ✅ Airflow → Selenoid → PostgreSQL → MinIO → DVC | Streaming ingestion (long-term) |
| Feature engineering | ✅ Time-windowed stats + ELO, DVC stage | Player-level features (long-term) |
| Training pipeline | ✅ DVC multi-stage, MLflow tracking | Automated retraining trigger (mid-term) |
| Model registry | 🚧 Staging/Production aliases; manual promotion gate | Automated promotion policy (near-term) |
| Sync serving | ✅ FastAPI + Celery, sync + async paths, Redis cache | Cache invalidation on model promotion (mid-term) |
| Async serving | POST /predict/async/ + polling endpoint | Batch HTTP endpoint (mid-term) |
| Data validation | ✅ Great Expectations at raw / interim / features | |
| Metrics export | ✅ Prometheus /metrics, 8 metrics | Grafana dashboards (near-term) |
| Alerting | 📋 Rules designed | Alertmanager rules (near-term) |
| Drift detection | 📋 Evidently designed | Integration pending (mid-term) |
| Secrets | ✅ SOPS + age, GitLab CI injection | |
| Deployment | ✅ K8s single-node, Helm, GitLab CI/CD | HA K8s if scale justifies (long-term) |
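The manual promotion gate in the model-registry row is, at heart, a champion/challenger comparison. The decision logic can be sketched as below; the metric name and threshold are assumptions, and in the real registry the final step would be an MLflow alias update (e.g. `MlflowClient.set_registered_model_alias`) rather than a print:

```python
def should_promote(champion_metrics: dict, challenger_metrics: dict,
                   metric: str = "log_loss", min_improvement: float = 0.01) -> bool:
    """Promote the challenger only if it beats the champion by a margin.

    Lower is better for log_loss; both the metric and the margin here are
    illustrative, not the project's actual promotion criteria.
    """
    return challenger_metrics[metric] <= champion_metrics[metric] - min_improvement

champion = {"log_loss": 0.62}
challenger = {"log_loss": 0.58}

if should_promote(champion, challenger):
    # In MLflow this is where the alias would move:
    # MlflowClient().set_registered_model_alias(model_name, "champion", new_version)
    print("promote challenger to champion")
else:
    print("keep current champion")
```

Today a human applies this judgment by hand; the near-term target is to encode it as an automated policy so promotion becomes a tested gate like the others.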

Layer Contracts Diagram

WhoScored → [Airflow + Selenoid] → PostgreSQL
  → [DVC raw export] → MinIO
  → [GE: validate_raw gate]
  → [DVC preprocessing] → data/interim/
  → [GE: validate_finished / validate_future gates]
  → [DVC feature_engineering] → data/features/
  → [GE: validate_features gate]
  → [DVC train / tune / calibrate] → MLflow Registry
  → [FastAPI + Celery + Redis] → User / UI

Prometheus ← /metrics

Each arrow is an explicit contract:

  • Data schemas validated with Great Expectations at raw, interim, and features stages.
  • Model signature enforced via MLflow pyfunc wrapper.
  • API request/response schemas enforced with Pydantic.
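The semantics of the GE gates above can be shown without the library: a stage validates its input and fails hard (nonzero exit), so DVC never runs downstream stages on bad data. The expectations below are invented examples standing in for the real Great Expectations suites:

```python
import sys

def validate_features(rows: list[dict]) -> list[str]:
    # Minimal stand-in for an expectation suite on the features stage.
    # Both checks are hypothetical, not the project's actual expectations.
    failures = []
    for i, row in enumerate(rows):
        if not -1000 <= row.get("elo_diff", 0) <= 1000:
            failures.append(f"row {i}: elo_diff out of range")
        if row.get("team") in (None, ""):
            failures.append(f"row {i}: team is missing")
    return failures

def gate(rows: list[dict]) -> None:
    # DVC treats a nonzero exit as stage failure, blocking downstream stages.
    failures = validate_features(rows)
    if failures:
        print("\n".join(failures))
        sys.exit(1)
    print("validation passed")

gate([{"team": "Arsenal", "elo_diff": 120}])
```

The important property is the exit code, not the checks themselves: a failed gate stops the pipeline rather than logging a warning, which is what turns validation into a contract.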

Known Architectural Limitations

These are current limitations of the system as deployed. They are documented explicitly to avoid presenting a planned or aspirational design as the current runtime reality.

| Limitation | Detail |
|---|---|
| Single-node Kubernetes | All services run on one VPS (healserver). A node failure is a full-service outage. No pod rescheduling across nodes is possible. |
| No High Availability | No replicated control plane. No multi-node worker pool. Accepted tradeoff against infrastructure cost and operational overhead. |
| Single RabbitMQ broker | The message queue has no replication or clustering. RabbitMQ unavailability blocks all inference (sync and async). |
| No autoscaling | No Horizontal Pod Autoscaler is configured. Celery worker replicas are static. Cannot scale under unexpected load. |
| Manual model promotion | Models are promoted to champion alias manually after review. No automated promotion policy or evaluation gate exists today. |
| No API authentication | All public endpoints are unauthenticated. Access control is limited to network-level TLS termination and operator-managed exposure. |
| Cache invalidation not tied to model promotion | When a new model is promoted to champion, existing Redis cache entries are not flushed. Stale predictions from the previous model may be served until TTL expires. |
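The last limitation follows directly from how cache entries are keyed. One common fix — sketched here as an assumption, not the deployed design — is to include the champion model version in the cache key, so a promotion makes every old key unreachable and the stale entries simply age out via TTL:

```python
import hashlib

def cache_key(home: str, away: str, model_version: str) -> str:
    # Hypothetical key scheme: because model_version is part of the key,
    # promoting a new champion changes every key, so old entries are never
    # read again and expire naturally via TTL -- no explicit flush needed.
    raw = f"{model_version}:{home}:{away}"
    return hashlib.sha256(raw.encode()).hexdigest()

k_v1 = cache_key("Arsenal", "Chelsea", "v1")
k_v2 = cache_key("Arsenal", "Chelsea", "v2")
print(k_v1 != k_v2)
```

This trades an explicit flush (and its coordination with the promotion step) for implicit invalidation, at the cost of a cold cache immediately after each promotion.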

For the path to resolving these, see Roadmap and Trade-offs.


Architecture Pages

Foundation

| Page | Purpose |
|---|---|
| Architecture Principles | Design philosophy and how it shapes every decision |
| System Requirements | Functional, non-functional, constraints, non-goals |
| System Boundary | What is inside vs outside the system; trust zones |

Structural Views

| Page | Purpose |
|---|---|
| System Context (C4 L1) | External actors and system boundary |
| Container View (C4 L2) | Deployable services and their responsibilities |
| Component View (C4 L3) | Module-level breakdown with contracts |

Behavioral and Operational Views

| Page | Purpose |
|---|---|
| Data & ML Flow | End-to-end pipeline: trigger → input → gate → output |
| Runtime View | Sync and async prediction paths, cache rules, model loading |
| Deployment View | Physical topology, namespace layout, ingress path |
| Failure Modes | Detection, impact, recovery, and preventive controls |

Design Quality

| Page | Purpose |
|---|---|
| Environments | Runtime layers and dependency strategy |
| Security | Threat model, secret lifecycle, access boundaries |
| Trade-offs | Key decisions with alternatives and consequences |
| Roadmap | Near / mid / long-term planned improvements |