Deployment & Runtime Architecture¶
Platform¶
Serving components are deployed on Kubernetes using Helm for configuration and templating.
Deployed components¶
- FastAPI inference service,
- Celery worker deployment,
- RabbitMQ message broker,
- Redis cache (optional),
- Prometheus scraping targets.
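These components might be wired together in a Helm values file along the following lines. This is an illustrative sketch only: the key names, image references, and chart structure are assumptions, not the actual chart.

```yaml
# Illustrative values.yaml fragment; all keys and values are assumptions.
api:
  image: registry.example.com/inference-api:1.4.2
  replicas: 3
worker:
  image: registry.example.com/inference-worker:1.4.2
  replicas: 2
rabbitmq:
  enabled: true
redis:
  enabled: false   # optional cache, disabled by default
prometheus:
  scrape: true
```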
Configuration management¶
Runtime configuration is provided via:

- environment variables,
- Helm values,
- Kubernetes secrets (decrypted at deploy time).
No secrets are baked into images.
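A minimal fail-fast configuration loader might look like the sketch below. The variable names (`MODEL_URI`, `BROKER_URL`, `REDIS_URL`) and the function name are illustrative assumptions, not the service's actual settings.

```python
import os

def load_config() -> dict:
    """Read required settings from environment variables, failing fast if any are absent."""
    required = ["MODEL_URI", "BROKER_URL"]
    missing = [name for name in required if name not in os.environ]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {
        "model_uri": os.environ["MODEL_URI"],
        "broker_url": os.environ["BROKER_URL"],
        # Optional cache: an absent REDIS_URL means the Redis layer is disabled.
        "redis_url": os.environ.get("REDIS_URL"),
    }
```

Failing at startup when a required variable is missing keeps misconfiguration visible in the crash-loop alerts described below, rather than surfacing later as a runtime error.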
Model loading strategy¶
- models are loaded from MLflow via `model_uri`,
- startup fails fast if the model is unavailable,
- model version is logged on startup.
This ensures:

- explicit dependency on registry availability,
- clear observability of the active model version.
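The fail-fast loading step can be sketched as follows. The function name and the injectable `loader` parameter are assumptions for illustration and testability; `mlflow.pyfunc.load_model` is MLflow's standard model-loading entry point.

```python
import logging
import sys

logger = logging.getLogger("inference")

def load_model_or_exit(model_uri: str, loader=None):
    """Load a model at startup, exiting immediately if the registry is unreachable.

    `loader` defaults to mlflow.pyfunc.load_model; it is injectable so the
    sketch can be exercised without a running MLflow registry.
    """
    if loader is None:
        import mlflow.pyfunc  # imported lazily so the sketch runs without mlflow installed
        loader = mlflow.pyfunc.load_model
    try:
        model = loader(model_uri)
    except Exception as exc:
        logger.error("Model %s unavailable: %s", model_uri, exc)
        sys.exit(1)  # fail fast: never serve traffic without a model
    logger.info("Loaded model from %s", model_uri)  # active version is visible in logs
    return model
```

Exiting with a non-zero status lets the Kubernetes restart policy and readiness probes handle the unhealthy pod, instead of serving requests with no model loaded.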
Scaling strategy¶
- API scaled horizontally based on request load,
- workers scaled based on queue depth,
- scaling policies are independent per component.
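Queue-depth scaling for the workers could be expressed with a HorizontalPodAutoscaler along these lines. This is a sketch under assumptions: the metric name presumes an external-metrics adapter exposing RabbitMQ queue depth, and the deployment name and thresholds are illustrative.

```yaml
# Illustrative HPA; metric name assumes an external-metrics adapter for RabbitMQ.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: celery-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: celery-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: rabbitmq_queue_messages_ready
        target:
          type: AverageValue
          averageValue: "100"   # target ready messages per worker replica
```

Keeping this object separate from the API's CPU- or request-based autoscaler is what makes the scaling policies independent per component.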
Failure handling¶
- readiness probes block traffic to unhealthy pods,
- crash loops surface immediately via alerts,
- rollback is performed by switching the model version or rolling back the Helm release.
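A readiness probe that gates traffic on pod health might look like the fragment below. The `/health` path and port are assumptions about the FastAPI service, not its documented endpoints.

```yaml
# Illustrative readiness probe on the API container; path and port are assumptions.
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3   # pod is removed from the Service after 3 consecutive failures
```

Because the model loader exits on failure, a pod that cannot reach the registry never becomes ready, so traffic is only routed to pods serving a known model version.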