Serving

The serving layer exposes trained models as an operational inference service. It is deployed on Kubernetes, receives prediction requests over HTTP, dispatches inference tasks to Celery workers, and returns structured probability output.
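
For orientation, a minimal synchronous call might look like the sketch below; the endpoint path, payload shape, and response fields are illustrative placeholders, not the canonical contract (see API Contract for the authoritative schemas):

```python
import requests

# Hypothetical endpoint, payload shape, and response fields for illustration.
response = requests.post(
    "http://serving.example.internal/predict",  # assumed sync prediction path
    json={"features": {"feature_a": 0.42, "feature_b": 1.7}},
    timeout=10,
)
response.raise_for_status()
print(response.json())
# Illustrative response: class probabilities plus traceability metadata, e.g.
# {"probabilities": {"negative": 0.09, "positive": 0.91},
#  "model": {"name": "classifier", "version": "3"}}
```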

See Current Serving Status for a full readiness matrix.


Responsibilities

The serving layer is responsible for:

  • exposing the canonical inference API (sync and async prediction endpoints; see the sketch after this list),
  • validating request schemas before any inference logic runs,
  • dispatching inference tasks to the Celery ml queue,
  • loading the registered model from the MLflow Registry (once per worker process),
  • returning structured prediction output with model traceability metadata,
  • surfacing health and operational metrics.
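
The sketch below shows one way these responsibilities could compose, assuming a FastAPI front end, a Celery app with a dedicated `ml` queue, and a model registered in MLflow as `classifier`. Every name, URL, and endpoint path here is an illustrative assumption, not the actual implementation:

```python
import mlflow.pyfunc
import pandas as pd
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

# Illustrative broker/backend URLs; real configuration is covered under Deployment.
celery_app = Celery(
    "serving",
    broker="redis://broker:6379/0",
    backend="redis://broker:6379/1",
)
api = FastAPI()

MODEL_URI = "models:/classifier/Production"  # hypothetical registry entry
_model = None  # cached once per worker process


def get_model():
    # Load the registered model from the MLflow Registry on first use,
    # then reuse the cached instance for the lifetime of the worker process.
    global _model
    if _model is None:
        _model = mlflow.pyfunc.load_model(MODEL_URI)
    return _model


@celery_app.task(queue="ml")  # routed to the dedicated `ml` queue
def predict_task(features: dict) -> dict:
    # Assumes the model returns a NumPy array of class probabilities.
    probabilities = get_model().predict(pd.DataFrame([features]))
    # Structured output carries traceability metadata alongside the scores.
    return {"probabilities": probabilities.tolist(), "model_uri": MODEL_URI}


class PredictionRequest(BaseModel):
    # Validated before any inference logic runs; malformed payloads are
    # rejected by FastAPI with a 422 response.
    features: dict[str, float]


@api.post("/predict/async")  # hypothetical endpoint path
def predict_async(request: PredictionRequest):
    # Dispatch the validated payload to the Celery `ml` queue and hand the
    # client a task id it can poll for the result.
    task = predict_task.delay(request.features)
    return {"task_id": task.id}
```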

The serving layer is not responsible for:

  • training or evaluating models — see ML,
  • promoting models between registry stages — see Model Registry,
  • scraping or storing raw data — see Data.

Scope of this section

Page               Content
API Contract       Canonical endpoint set, request/response schemas, error semantics
Examples           Concrete cURL and Python examples for all implemented endpoints
Inference Modes    Sync vs async execution paths and operational trade-offs
Deployment         Serving-specific runtime components, configuration, model loading
Health & Failures  Health probes, degraded-mode behavior, failure responses
Performance        Latency behavior, SLO targets, interpretation guide
Status             Authoritative readiness matrix for the serving subsystem
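
As a preview of the Inference Modes page, an async interaction could follow a submit-then-poll pattern like the sketch below; the `/predict/async` and `/predictions/{task_id}` paths and the response shape are assumptions for illustration:

```python
import time

import requests

BASE = "http://serving.example.internal"  # hypothetical service URL

# Submit the prediction asynchronously and receive a task id.
submitted = requests.post(
    f"{BASE}/predict/async",
    json={"features": {"feature_a": 0.42, "feature_b": 1.7}},
    timeout=10,
)
task_id = submitted.json()["task_id"]

# Poll until the Celery task finishes (bounded to roughly 30 s here).
body = {}
for _ in range(60):
    body = requests.get(f"{BASE}/predictions/{task_id}", timeout=10).json()
    if body.get("status") in ("SUCCESS", "FAILURE"):
        break
    time.sleep(0.5)

print(body)
```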

Boundaries

The pages listed above deepen these topics for the inference API subsystem specifically. Anything outside the responsibilities above (training, registry promotion, raw data handling) is documented in its own section: ML, Model Registry, and Data.