System Boundary¶
This page defines what is inside the SoccerPredictAI system, what is outside it, and how the two interact. Understanding the boundary is essential for reasoning about ownership, trust, and failure modes.
What Is Inside the System¶
The runtime system boundary includes all services responsible for prediction serving, data ingestion, model lifecycle, and observability. The offline ML pipeline is part of the system when it produces artifacts consumed at runtime (models in MLflow, data in DVC/MinIO).
Runtime services (Kubernetes — healserver)¶
| Component | Namespace | Responsibility |
|---|---|---|
| Nginx Ingress Controller | ingress-nginx | Routes inbound HTTPS traffic to internal services |
| Airflow Scheduler + Workers | ds | Schedules ETL and scraping triggers |
| PostgreSQL | ds | Authoritative store for normalized scraped data |
| MinIO (S3-compatible) | ds | DVC remote: raw parquet exports, ML artifacts |
| MLflow Tracking + Registry | ds | Experiment records, model versions, promotion lifecycle |
| Prometheus | ds | Metrics collection |
| Grafana | ds | Dashboards (📋 Planned: dashboards defined) |
| kube-state-metrics | monitoring | K8s cluster metrics |
| node-exporter | monitoring | Host-level metrics |
| FastAPI Inference Service | soccer-api | REST API, sync + async predictions, health + metrics endpoints |
| RabbitMQ | soccer-api | Message broker for Celery task queues |
| Celery worker-api | soccer-api | Short tasks: scraping trigger, cache operations, request pre-processing |
| Celery worker-ml | soccer-api | Heavy tasks: feature assembly at inference, batch scoring |
| Redis | soccer-api | Prediction and feature vector cache (caching optimization layer) |
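The worker-api / worker-ml split is typically expressed as Celery task routing. A minimal dependency-free sketch of the idea; the task names and queue names below are illustrative assumptions, not the project's actual configuration (in a real Celery app this dict would be assigned to `app.conf.task_routes`):

```python
# Hypothetical routing table: short tasks go to the worker-api queue,
# heavy feature-assembly / scoring tasks go to the worker-ml queue.
task_routes = {
    "tasks.trigger_scrape":   {"queue": "api"},
    "tasks.cache_invalidate": {"queue": "api"},
    "tasks.assemble_features": {"queue": "ml"},
    "tasks.batch_score":       {"queue": "ml"},
}

def queue_for(task_name: str) -> str:
    """Resolve which queue a task would be routed to (default: api)."""
    return task_routes.get(task_name, {"queue": "api"})["queue"]
```

Routing by queue lets the two worker deployments scale independently: worker-ml pods can get more CPU and memory without over-provisioning the short-task workers.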
Offline execution context¶
| Component | Boundary | Responsibility |
|---|---|---|
| DVC pipeline | Local / CI execution | Reproducible ML pipeline: preprocessing through model registration |
Offline Pipeline Boundary¶
The DVC pipeline occupies a deliberate position in the system boundary: it executes outside the runtime cluster (locally or in CI), but it is part of the system as the authoritative producer of all ML artifacts consumed at runtime.
This is not an omission — it is an explicit architectural decision.
Why DVC is outside the runtime boundary:
- The pipeline is artifact-driven and reproducible, not service-based. It does not run continuously.
- Executing training inside Kubernetes would add operational complexity (GPU scheduling, ephemeral storage, long-running job management) without benefit at the current scale.
- CI execution provides a clean, reproducible environment without cluster-side state entanglement.
Why DVC is still part of the system:
- Every model in the runtime registry was produced by a tracked, versioned DVC run.
- Every dataset consumed by training is content-addressed and reproducible via `dvc checkout`.
- The DVC pipeline is the explicit handoff point from data to models: it reads from MinIO and writes registered artifacts into MLflow, which the runtime cluster reads.
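A pipeline with this shape might be declared in `dvc.yaml` roughly as follows. Stage names, scripts, and paths are illustrative assumptions, not the project's actual files:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py   # hypothetical entry point
    deps: [data/raw]                # raw parquet pulled from the MinIO DVC remote
    outs: [data/processed]
  train:
    cmd: python src/train.py
    deps: [data/processed]
    outs: [models/model.pkl]
  register:
    cmd: python src/register.py     # logs and registers the model in MLflow
    deps: [models/model.pkl]
```

Because every stage declares its dependencies and outputs, `dvc repro` re-executes only what changed, and the resulting lock file pins the exact data and code versions behind each registered model.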
Architectural consequence:
The boundary crossing happens at the MLflow Registry: DVC pushes a model artifact and assigns
a champion alias; the serving layer loads it. This is the only coupling point between the offline
pipeline and the runtime system. They share no runtime infrastructure, only contracts (model signature,
feature schema, MLflow alias convention).
[DVC pipeline — local/CI]
│
│ writes model artifact + champion alias
▼
[MLflow Registry — runtime cluster]
│
│ model_uri resolved by champion alias
▼
[FastAPI + Celery workers — runtime serving]
Limitation: there is no automated handoff — model promotion is a manual operation today. See Known Architectural Limitations and Roadmap.
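The alias contract at this coupling point can be made explicit in code. A minimal dependency-free sketch of the URI convention only; the registered model name `soccer_predictor` is an assumption, not necessarily the real registry entry:

```python
CHAMPION_ALIAS = "champion"

def champion_uri(model_name: str, alias: str = CHAMPION_ALIAS) -> str:
    """Build the models:/<name>@<alias> URI that the serving layer resolves.

    The serving side would then load it with something like
    mlflow.pyfunc.load_model(champion_uri("soccer_predictor")); only the
    URI convention is shown here so the sketch stays dependency-free.
    """
    return f"models:/{model_name}@{alias}"
```

Keeping the alias name in one place means promotion (reassigning `champion` to a new model version) requires no change on the serving side.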
What Is Outside the System¶
External runtime dependencies¶
| External component | Owner | Role | Trust level |
|---|---|---|---|
| WhoScored.com | Third party | Source of football match statistics | Untrusted; validated after ingestion |
| Selenoid Server | Operator (external host) | Headless browser grid for scraping; called by celery-worker-api | Trusted operator; separate ops boundary |
| Streamlit Web UI (time2bet.ru) | Operator (external VPS) | User-facing prediction frontend | Trusted; calls the inference API over HTTPS |
| Host-level Nginx (VPS) | Operator | Reverse proxy in front of K8s NodePort; handles TLS termination | Trusted operator |
Delivery and tooling boundary¶
| External component | Owner | Role | Trust level |
|---|---|---|---|
| GitLab CI/CD | SaaS (GitLab.com) | Build, test, and Helm deployment pipeline | Trusted for delivery; accesses encrypted secrets via protected variables |
GitLab CI/CD is outside the runtime system boundary: it does not participate in normal system operation. It crosses the boundary only during deployment events — at which point it decrypts SOPS-encrypted secrets and pushes Helm releases to the cluster.
External Dependency Trust Model¶
Trust Boundaries¶
Public Internet¶
All requests from the public internet are untrusted by default.

- WhoScored.com data is treated as untrusted input; Great Expectations validates it before use.
- User requests to the API pass through Nginx TLS termination and Pydantic schema validation.
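The principle behind schema validation at this boundary is to reject malformed input before any business logic runs. The real service uses Pydantic; the dataclass stand-in below is a dependency-free sketch of the same idea, and its field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PredictionRequest:
    """Hypothetical request schema; the real API defines its fields via Pydantic."""
    home_team: str
    away_team: str

    def __post_init__(self):
        # Reject empty or non-string team names at the boundary.
        for name in (self.home_team, self.away_team):
            if not isinstance(name, str) or not name.strip():
                raise ValueError(f"invalid team name: {name!r}")
```

With Pydantic the equivalent model would additionally coerce types and produce structured 422 responses in FastAPI, but the trust decision is the same: nothing crosses into the application until it matches the declared schema.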
K8s Cluster Internal¶
Services within the same namespace can communicate freely via cluster DNS. Cross-namespace communication is restricted via Kubernetes NetworkPolicy (where defined). No service inside the cluster exposes a plaintext secret to application code — all secrets are injected via Kubernetes Secrets from SOPS-decrypted manifests.
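A cross-namespace restriction of this kind is usually expressed as a NetworkPolicy. An illustrative fragment only; the policy name and selector choices are assumptions, not the cluster's actual manifests:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace   # hypothetical policy name
  namespace: ds
spec:
  podSelector: {}              # applies to all pods in the ds namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: {}      # admits traffic only from pods in the same namespace
```

An empty `podSelector` under `ingress.from` matches all pods in the policy's own namespace, so any cross-namespace caller (for example, soccer-api reaching MLflow) must be granted by an additional, explicit rule.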
External Scraping Host (Selenoid)¶
The Selenoid host is operator-controlled but runs outside the K8s network boundary.
Traffic from celery-worker-api to Selenoid crosses the network boundary.
This is an accepted operational dependency; Selenoid unavailability is a known failure mode.
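One common way to contain this failure mode is bounded retry with exponential backoff around the Selenoid call. A generic stdlib sketch; the real scraping task and its error types are not shown in this document:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, retry_on=(ConnectionError,)):
    """Call fn(), retrying with exponential backoff on transient errors.

    A scrape task would wrap its Selenoid/WebDriver call in this; after the
    final attempt the error propagates so Celery can record the task failure.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Bounding the attempts matters here: an unbounded retry loop against a down Selenoid host would pin worker-api capacity and mask the outage instead of surfacing it as a failed task.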
CI/CD Boundary¶
GitLab CI has access to:

- the source code repository,
- encrypted SOPS secret files (committed to git),
- the age private key (stored as a protected CI variable).
CI decrypts secrets only in scoped deployment steps. No secret appears in CI logs (masked variables enforced).
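A deployment job following this pattern might look roughly like the fragment below. The job name, file paths, and release names are illustrative assumptions; `sops -d` and `helm upgrade --install` are the standard CLI invocations, and SOPS reads the age private key from the `SOPS_AGE_KEY` environment variable that GitLab injects as a masked, protected variable:

```yaml
deploy:
  stage: deploy
  environment: production
  only: [main]                 # protected branch, so protected variables resolve
  script:
    - sops -d secrets/prod.enc.yaml > /tmp/secrets.yaml
    - helm upgrade --install soccer-api ./charts/soccer-api -f /tmp/secrets.yaml
    - rm /tmp/secrets.yaml     # never persist decrypted secrets beyond the job
```

Scoping decryption to the deploy job (rather than a global `before_script`) keeps build and test stages from ever holding plaintext secrets.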
Related¶
- Deployment View — physical topology and namespace layout
- C4 Context Diagram — external actors and system responsibilities
- Security — threat model and secret lifecycle
- Failure Modes — what happens when external dependencies fail