Serving¶
The serving layer exposes trained models as an operational inference service. It is deployed on Kubernetes, receives prediction requests over HTTP, dispatches inference tasks to Celery workers, and returns structured probability output.
See Current Serving Status for a full readiness matrix.
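As an illustration of that request/response flow, a minimal sketch of a synchronous prediction call is shown below. The host, path, payload fields, and response keys are assumptions for illustration only; the canonical schemas live in API Contract and Examples.

```python
# Minimal sketch of a synchronous prediction call.
# Endpoint URL, payload shape, and response keys are assumed, not canonical.
import requests

response = requests.post(
    "http://localhost:8000/predict",          # assumed endpoint path
    json={"features": {"feature_a": 1.0}},    # assumed request schema
    timeout=10,
)
response.raise_for_status()
result = response.json()

# Structured probability output with model traceability metadata (assumed keys).
print(result.get("probabilities"), result.get("model_version"))
```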
Responsibilities¶
The serving layer is responsible for:
- exposing the canonical inference API (sync and async prediction endpoints),
- validating request schemas before any inference logic runs,
- dispatching inference tasks to the Celery ML queue (sketched after this list),
- loading the registered model from the MLflow Registry (once per worker process),
- returning structured prediction output with model traceability metadata,
- surfacing health and operational metrics.
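The worker-side pattern implied by these responsibilities can be sketched as follows: the registered model is loaded from the MLflow Registry once per worker process, and each task returns probabilities together with traceability metadata. Queue name, task name, environment variables, and response fields are assumptions, not the project's actual identifiers.

```python
# Hypothetical sketch: per-process model loading and an inference task.
import os

import mlflow.pyfunc
import pandas as pd
from celery import Celery
from celery.signals import worker_process_init

app = Celery("serving", broker=os.environ.get("CELERY_BROKER_URL", "redis://localhost:6379/0"))
# Route the inference task to a dedicated queue (queue name assumed).
app.conf.task_routes = {"serving.predict": {"queue": "ml"}}

MODEL_URI = os.environ.get("MODEL_URI", "models:/example-model/Production")  # assumed env var
_model = None  # populated once per worker process


@worker_process_init.connect
def _load_model(**kwargs):
    """Load the registered model from the MLflow Registry when the worker process starts."""
    global _model
    _model = mlflow.pyfunc.load_model(MODEL_URI)


@app.task(name="serving.predict")
def predict(features: dict) -> dict:
    """Run inference and return probabilities plus model traceability metadata."""
    frame = pd.DataFrame([features])
    probabilities = _model.predict(frame)
    return {
        "probabilities": probabilities.tolist() if hasattr(probabilities, "tolist") else list(probabilities),
        "model_uri": MODEL_URI,  # traceability metadata (assumed field name)
    }
```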
The serving layer is not responsible for:
- training or evaluating models — see ML,
- promoting models between registry stages — see Model Registry,
- scraping or storing raw data — see Data.
Scope of this section¶
| Page | Content |
|---|---|
| API Contract | Canonical endpoint set, request/response schemas, error semantics |
| Examples | Concrete cURL and Python examples for all implemented endpoints |
| Inference Modes | Sync vs async execution paths and operational trade-offs |
| Deployment | Serving-specific runtime components, configuration, model loading |
| Health & Failures | Health probes, degraded-mode behavior, failure responses |
| Performance | Latency behavior, SLO targets, interpretation guide |
| Status | Authoritative readiness matrix for the serving subsystem |
Boundaries¶
- High-level runtime design and sequence diagrams belong in Architecture: Runtime View.
- Physical topology and deployment constraints belong in Architecture: Deployment View.
- Model input/output contract belongs in ML: Model Contract.
- Model promotion lifecycle belongs in ML: Model Registry.
- Global implementation readiness belongs in Status.
This section deepens those topics for the inference API subsystem specifically.