
Serving Deployment

This page covers the serving-specific deployment structure: runtime components, configuration, model loading, and current operational state.

For the full physical topology and traffic routing, see Architecture: Deployment View.


Serving components

All serving components run in the soccer-api Kubernetes namespace on the single-node healserver cluster.

  • FastAPI (worker-api pod): HTTP inference service. Current state: 2 pods via a Deployment.
  • celery-worker-ml: executes predict_match tasks from the ml queue. Current state: 2 pods via a Deployment.
  • RabbitMQ: message broker for the Celery task queues. Current state: single broker pod.
  • Redis: prediction cache and Celery result backend. Current state: single pod.
  • Helm chart: all of the above, managed via k8s/helm/soccer-api/. Current state: parameterized.
  • HPA: horizontal pod autoscaler. Current state: deployed, driven by a queue-depth signal.

Traffic path

Internet
  → host-level Nginx (TLS termination, port 443)
    → K8s NodePort 31390
      → Nginx Ingress Controller (ingress-nginx namespace)
        → FastAPI service (soccer-api namespace)
          → RabbitMQ → celery-worker-ml

The Streamlit UI, hosted on a separate VPS (time2bet.ru), also routes prediction requests through this path over public HTTPS.
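As an illustration of the client side of this path, the sketch below builds the HTTPS request a UI client would send to the API. The base URL and the /predict endpoint with home_team/away_team fields are assumptions for illustration, not the documented contract:

```python
import json
from urllib import request

# Placeholder host; the real API hostname is not part of this page.
API_BASE = "https://api.example.com"

def build_predict_request(home: str, away: str) -> request.Request:
    """Build the POST request a client would send along the traffic path above."""
    body = json.dumps({"home_team": home, "away_team": away}).encode()
    return request.Request(
        f"{API_BASE}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with urllib.request.urlopen (or requests/httpx) then traverses the host-level Nginx, the ingress controller, and the FastAPI service exactly as diagrammed.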


Configuration and secrets

Runtime configuration is provided via:

  • Helm values (k8s/helm/soccer-api/values.yaml)
  • Kubernetes ConfigMaps (non-sensitive settings)
  • Kubernetes Secrets (credentials, decrypted from SOPS-encrypted values-*.enc.yaml at deploy time)

No secrets are baked into Docker images. The age private key used for SOPS decryption is stored as a protected CI variable.
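A minimal sketch of how this layering looks at runtime: ConfigMap and Secret values arrive as environment variables, and the application materializes them into a settings object. The variable names and defaults here are hypothetical, chosen only to show the shape:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class MlflowSettings:
    """Serving-side MLflow settings, built from env vars injected by K8s."""
    tracking_uri: str
    model_name: str
    stage: str

def load_settings(env=os.environ) -> MlflowSettings:
    # Credentials and URIs come from ConfigMap/Secret-backed env vars,
    # never from values baked into the Docker image.
    return MlflowSettings(
        tracking_uri=env["MLFLOW_TRACKING_URI"],        # required: fail fast if absent
        model_name=env.get("MODEL_NAME", "soccer-match-predictor"),
        stage=env.get("MODEL_STAGE", "Production"),
    )
```

Passing `env` explicitly keeps the function testable without touching the real process environment.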


Model loading

PredictionService in src/app/services/predict.py loads the model once per worker process, at process initialization:

mlflow.pyfunc.load_model(f"models:/{model_name}/{stage}")
  • model_name and stage come from application settings (settings.mlflow.*) on the serving side; the training pipeline registers the model using values from params.yaml → register_model.*.
  • The model is loaded once per celery-worker-ml process, via Celery's worker_process_init signal.
  • Subsequent tasks in the same worker process reuse the loaded model; there is no per-request reload.
  • If the MLflow Registry is unreachable at load time, the worker fails to start and the pod restarts.
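The load-once-per-process pattern above can be sketched as follows. This is a hypothetical module, not the actual src/app/services/predict.py; in the real service, init_model runs inside Celery's worker_process_init handler and load_fn is the mlflow.pyfunc.load_model call shown above:

```python
_MODEL = None  # per-process singleton; each worker process gets its own copy

def init_model(load_fn):
    """Load the model exactly once per worker process, failing fast on error."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_fn()  # an exception here aborts worker startup -> pod restart
    return _MODEL

def predict(features):
    """Serve a task using the already-loaded model; no per-request reload."""
    if _MODEL is None:
        raise RuntimeError("model not initialized; worker_process_init did not run")
    return _MODEL.predict(features)
```

Because _MODEL is module-level state, it is shared across tasks within one worker process but never across processes, which matches Celery's prefork execution model.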

Every request is served by a registered MLflow model artifact; no local file paths are used. See ML: Model Registry.


Current vs target scaling

  • API pods: 2 via a Deployment. HPA deployed; scales on request load.
  • ML worker pods: 2 via a Deployment. HPA scales on queue depth.
  • RabbitMQ: single broker, no clustering. Single point of failure for the inference path.
  • Redis: single pod. Cache is lost on pod restart, but inference continues.
  • Node: single-node Kubernetes. Node failure means a full service outage.

The single-node, single-broker constraints are documented tradeoffs for portfolio scope. See Architecture: Deployment View.


Rollback

Model-level rollback: point the Production alias in the MLflow Registry at the previous version. Because the model is loaded at process initialization, running workers pick up the alias change only after their processes restart.
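The alias update can be scripted with MlflowClient.set_registered_model_alias (the standard registry API for aliases in MLflow 2.3+). The client is injected here so the sketch stays decoupled from connection config; the model name and version are illustrative:

```python
def rollback_production(client, model_name: str, target_version: int) -> None:
    """Point the Production alias at a previous registered model version.

    `client` is expected to be an mlflow.tracking.MlflowClient (or any object
    exposing the same set_registered_model_alias signature).
    """
    # Aliases are mutable pointers: reassigning one is an instant, atomic
    # registry-side rollback with no artifact copying.
    client.set_registered_model_alias(model_name, "Production", target_version)
```

After running this, restart the celery-worker-ml pods so each worker process reloads the now-aliased version at init.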

Helm-level rollback: helm rollback soccer-api <revision> — restores the previous Kubernetes resource state including image tags and config.