Serving

The serving layer exposes trained models as an operational inference service. It is deployed on Kubernetes, receives prediction requests over HTTP, dispatches inference tasks to Celery workers, and returns structured probability output.
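
For orientation, a minimal synchronous call might look like the sketch below; the endpoint path, payload shape, and response fields are illustrative placeholders, not the canonical contract (see API Contract for the authoritative schemas):

```python
import requests

# Hypothetical endpoint, payload shape, and response fields for illustration.
response = requests.post(
    "http://serving.example.internal/predict",  # assumed sync prediction path
    json={"features": {"feature_a": 0.42, "feature_b": 1.7}},
    timeout=10,
)
response.raise_for_status()
print(response.json())
# Illustrative response: class probabilities plus traceability metadata, e.g.
# {"probabilities": {"negative": 0.09, "positive": 0.91},
#  "model": {"name": "classifier", "version": "3"}}
```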

See Current Serving Status for a full readiness matrix.


Responsibilities

The serving layer is responsible for:

  • exposing the canonical inference API (sync and async prediction endpoints; see the sketch after this list),
  • validating request schemas before any inference logic runs,
  • dispatching inference tasks to the Celery ml queue,
  • loading the registered model from the MLflow Registry (once per worker process),
  • returning structured prediction output with model traceability metadata,
  • surfacing health and operational metrics.
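
The sketch below shows one way these responsibilities could compose, assuming a FastAPI front end, a Celery app with a dedicated `ml` queue, and a model registered in MLflow as `classifier`. Every name, URL, and endpoint path here is an illustrative assumption, not the actual implementation:

```python
import mlflow.pyfunc
import pandas as pd
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

# Illustrative broker/backend URLs; real configuration is covered under Deployment.
celery_app = Celery(
    "serving",
    broker="redis://broker:6379/0",
    backend="redis://broker:6379/1",
)
api = FastAPI()

MODEL_URI = "models:/classifier/Production"  # hypothetical registry entry
_model = None  # cached once per worker process


def get_model():
    # Load the registered model from the MLflow Registry on first use,
    # then reuse the cached instance for the lifetime of the worker process.
    global _model
    if _model is None:
        _model = mlflow.pyfunc.load_model(MODEL_URI)
    return _model


@celery_app.task(queue="ml")  # routed to the dedicated `ml` queue
def predict_task(features: dict) -> dict:
    # Assumes the model returns a NumPy array of class probabilities.
    probabilities = get_model().predict(pd.DataFrame([features]))
    # Structured output carries traceability metadata alongside the scores.
    return {"probabilities": probabilities.tolist(), "model_uri": MODEL_URI}


class PredictionRequest(BaseModel):
    # Validated before any inference logic runs; malformed payloads are
    # rejected by FastAPI with a 422 response.
    features: dict[str, float]


@api.post("/predict/async")  # hypothetical endpoint path
def predict_async(request: PredictionRequest):
    # Dispatch the validated payload to the Celery `ml` queue and hand the
    # client a task id it can poll for the result.
    task = predict_task.delay(request.features)
    return {"task_id": task.id}
```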

The serving layer is not responsible for:

  • training or evaluating models — see ML,
  • promoting models between registry stages — see Model Registry,
  • scraping or storing raw data — see Data.

Scope of this section

Page               Content
API Contract       Canonical endpoint set, request/response schemas, error semantics
Examples           Concrete cURL and Python examples for all implemented endpoints
Inference Modes    Sync vs async execution paths and operational trade-offs
Deployment         Serving-specific runtime components, configuration, model loading
Health & Failures  Health probes, degraded-mode behavior, failure responses
Performance        Latency behavior, SLO targets, interpretation guide
Status             Authoritative readiness matrix for the serving subsystem
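
As a preview of the Inference Modes page, an async interaction could follow a submit-then-poll pattern like the sketch below; the `/predict/async` and `/predictions/{task_id}` paths and the response shape are assumptions for illustration:

```python
import time

import requests

BASE = "http://serving.example.internal"  # hypothetical service URL

# Submit the prediction asynchronously and receive a task id.
submitted = requests.post(
    f"{BASE}/predict/async",
    json={"features": {"feature_a": 0.42, "feature_b": 1.7}},
    timeout=10,
)
task_id = submitted.json()["task_id"]

# Poll until the Celery task finishes (bounded to roughly 30 s here).
body = {}
for _ in range(60):
    body = requests.get(f"{BASE}/predictions/{task_id}", timeout=10).json()
    if body.get("status") in ("SUCCESS", "FAILURE"):
        break
    time.sleep(0.5)

print(body)
```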

Boundaries

The pages listed above deepen these topics for the inference API subsystem specifically. Anything outside the responsibilities above (training, registry promotion, raw data handling) is documented in its own section: ML, Model Registry, and Data.