Deployment & Runtime Architecture

Platform

Serving components are deployed on Kubernetes using Helm for configuration and templating.


Deployed components

  • FastAPI inference service,
  • Celery worker deployment,
  • RabbitMQ message broker,
  • Redis cache (optional),
  • Prometheus scraping targets.

Configuration management

Runtime configuration is provided via:

  • environment variables,
  • Helm values,
  • Kubernetes secrets (decrypted at deploy time).

No secrets are baked into images.
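The configuration sources above can be sketched as a small resolution step at process start. The variable names (`MODEL_URI`, `REDIS_URL`) are illustrative assumptions, not the project's actual settings; Helm renders them into the pod spec, and secret-backed values arrive the same way after decryption at deploy time.

```python
import os

def load_config(env=os.environ):
    """Hypothetical sketch: resolve runtime configuration from the environment."""
    return {
        # Required value: fail fast if the deployment forgot to set it.
        "model_uri": env["MODEL_URI"],
        # The Redis cache is an optional component, so default to None.
        "redis_url": env.get("REDIS_URL"),
    }
```

Requiring mandatory keys with `env[...]` (rather than `.get`) makes a misconfigured pod crash at startup instead of serving with a missing dependency.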


Model loading strategy

  • models are loaded from MLflow via model_uri,
  • startup fails fast if the model is unavailable,
  • model version is logged on startup.

This ensures:

  • explicit dependency on registry availability,
  • clear observability of the active model version.
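The fail-fast startup behaviour can be sketched as below. In the real service the loader would be MLflow's model-loading call; here it is injected so the logic stands alone, and the `version` attribute on the returned model is an assumption for illustration.

```python
import logging
import sys

logger = logging.getLogger("inference")

def load_model_or_exit(model_uri, loader):
    """Sketch: load the model at startup, exiting the process on failure."""
    try:
        model = loader(model_uri)
    except Exception as exc:
        # Fail fast: an unavailable registry is a hard startup error.
        logger.error("model load failed for %s: %s", model_uri, exc)
        sys.exit(1)
    # Log the active version so the running model is observable.
    version = getattr(model, "version", "unknown")
    logger.info("loaded model %s (version=%s)", model_uri, version)
    return model
```

Exiting instead of serving without a model keeps the dependency on the registry explicit: Kubernetes restarts the pod, and a persistent outage surfaces as a crash loop rather than silent degraded behaviour.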


Scaling strategy

  • API scaled horizontally based on request load,
  • workers scaled based on queue depth,
  • scaling policies are independent per component.
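A queue-depth policy for the worker tier can be sketched as a simple target calculation; the thresholds and bounds here are assumptions, not the deployed policy, and the API tier would scale on request load through its own independent rule.

```python
import math

def desired_workers(queue_depth, tasks_per_worker=50,
                    min_replicas=1, max_replicas=20):
    """Sketch: worker replica target from current queue depth."""
    if queue_depth <= 0:
        return min_replicas
    wanted = math.ceil(queue_depth / tasks_per_worker)
    # Clamp to configured bounds so a backlog spike cannot scale unboundedly.
    return max(min_replicas, min(max_replicas, wanted))
```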

Failure handling

  • readiness probes block traffic to unhealthy pods,
  • crash loops surface immediately via alerts,
  • rollback is performed by switching the model version or Helm release.
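The readiness behaviour above can be sketched framework-agnostically; in the FastAPI service this would back a readiness endpoint wired to the pod's `readinessProbe`, and the `model_loaded` / `broker_reachable` flags are assumed pieces of application state.

```python
def readiness(model_loaded, broker_reachable):
    """Sketch: readiness check result as (status code, body)."""
    # Kubernetes only routes traffic to pods returning 200 here; anything
    # else removes the pod from the Service endpoints, which is what lets
    # a rollback (model version or Helm release) cut over safely.
    if model_loaded and broker_reachable:
        return 200, "ready"
    return 503, "not ready"
```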