# Performance & SLOs
This page explains how to interpret serving performance figures, what is currently measured, and what informal SLO targets exist.
## Measurement status
| SLO | Target | Measurement status |
|---|---|---|
| Sync `POST /predict/` p50 latency | ≤ 50 ms | Not formally measured; from manual testing |
| Sync `POST /predict/` p99 latency | ≤ 200 ms | Not formally measured; from manual testing |
| Service availability (30-day) | ≥ 99% | Not formally tracked |
| Async job completion p95 | ≤ 10 s (typical) | Not formally measured |
| Sync hard timeout | 30 s | Enforced in code — not a target, a bound |
These targets are informal operational goals, not contractual SLOs backed by continuous measurement infrastructure. Prometheus exports latency histograms; Grafana dashboards are not yet deployed. See Status.
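As a rough illustration of what that instrumentation pattern looks like, the sketch below records request latency into a `prometheus_client` histogram. The metric name, bucket boundaries, and the `run_prediction` stub are assumptions for illustration, not the service's actual code.

```python
# Illustrative only: the metric name, buckets, and run_prediction stub are
# assumptions, not the service's actual instrumentation.
import time

from prometheus_client import Histogram

PREDICT_LATENCY = Histogram(
    "predict_request_latency_seconds",  # assumed metric name
    "Latency of sync POST /predict/ requests",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0, 30.0),
)


def run_prediction(payload: dict) -> dict:
    # Stand-in for the real inference path (cache lookup + Celery round trip).
    return {"prediction": 0.5}


def handle_predict(payload: dict) -> dict:
    """Time the sync prediction path and record the duration in the histogram."""
    start = time.perf_counter()
    try:
        return run_prediction(payload)
    finally:
        PREDICT_LATENCY.observe(time.perf_counter() - start)
```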
A Locust-based load test baseline is available at `tests/load/locustfile.py`.
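The actual scenario lives in that file; a minimal sketch of the shape such a test takes (the payload fields and wait times below are assumptions, not the contents of `locustfile.py`):

```python
# Minimal sketch of a Locust scenario for the sync endpoint; payload fields
# and wait times are assumptions, not the project's actual load test.
from locust import HttpUser, between, task


class PredictUser(HttpUser):
    wait_time = between(0.5, 2.0)  # simulated think time between requests

    @task
    def sync_predict(self):
        # Hypothetical feature payload; the real schema is defined by the
        # API's Pydantic request model.
        self.client.post("/predict/", json={"match_id": 12345, "features": {}})
```

Running `locust -f tests/load/locustfile.py --host http://localhost:8000` against a local deployment exercises the same path the latency targets above refer to.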
## Sync inference latency
The sync path (`POST /predict/`) involves:
- Pydantic request validation (< 1 ms)
- Redis cache lookup (< 5 ms when Redis is healthy)
- Cache hit: return immediately — sub-10 ms total
- Cache miss: Celery task enqueued → worker picks up → inference → result stored in Redis
On cache miss the dominant costs are:
- Worker queue wait time (near-zero at low load; grows under backlog)
- Model `predict_proba()` call (< 10 ms for current model size)
- Feature assembly from pre-computed batch data (< 5 ms)
The 30 s hard timeout (`_SYNC_TIMEOUT`) is a ceiling, not a performance target.
A timeout response (504 Gateway Timeout) indicates a worker backlog or worker failure, not normal operation.
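A minimal sketch of this flow, assuming a local Redis client, a RabbitMQ broker, and a Celery task registered as `predict_task` (all names below are placeholders except `_SYNC_TIMEOUT`; the real handler differs in detail):

```python
# Sketch of the sync path: cache lookup, Celery round trip, 30 s ceiling.
# redis_client, predict_task, and the cache-key scheme are placeholders.
import json

import redis
from celery import Celery
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import HTTPException

redis_client = redis.Redis()  # assumed local Redis
celery_app = Celery(
    "ml",
    broker="amqp://localhost",          # assumed RabbitMQ broker
    backend="redis://localhost:6379/0",  # assumed Redis result backend
)
predict_task = celery_app.signature("predict_task")  # assumed task name

_SYNC_TIMEOUT = 30  # seconds; a hard bound enforced in code, not a latency target


def sync_predict(features: dict) -> dict:
    cache_key = "prediction:" + json.dumps(features, sort_keys=True)

    # Cache lookup (< 5 ms when Redis is healthy); a hit returns sub-10 ms.
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: enqueue the task and block until a worker returns a result.
    async_result = predict_task.delay(features)
    try:
        prediction = async_result.get(timeout=_SYNC_TIMEOUT)
    except CeleryTimeoutError:
        # Hitting the ceiling signals worker backlog or failure, not normal operation.
        raise HTTPException(status_code=504, detail="Prediction timed out")

    redis_client.set(cache_key, json.dumps(prediction))  # store for later hits
    return prediction
```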
## Async inference
The async path (`POST /predict/async/`) returns immediately (< 20 ms).
Task completion time depends on:
- Worker availability (how many `celery-worker-ml` pods are running)
- Queue depth at submission time
- Model inference time
Under normal conditions with 2 workers and low load, completion is typically < 5 s. Under sustained load or worker restarts, tasks queue in RabbitMQ until a worker is available.
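A client-side sketch of the submit-then-poll pattern is shown below. The result route, response fields, and base URL are assumptions; only `POST /predict/async/` is documented on this page.

```python
# Client-side sketch of the async pattern: submit, then poll for the result.
# The result route (/predict/async/{task_id}) and response fields are assumed.
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment


def predict_async(features: dict, poll_interval: float = 0.5, max_wait: float = 30.0) -> dict:
    # Submission returns almost immediately (< 20 ms) with a task identifier.
    submitted = requests.post(f"{BASE_URL}/predict/async/", json=features, timeout=5)
    submitted.raise_for_status()
    task_id = submitted.json()["task_id"]  # assumed response field

    # Poll until a worker has completed the task or the deadline passes.
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = requests.get(f"{BASE_URL}/predict/async/{task_id}", timeout=5)
        status.raise_for_status()
        body = status.json()
        if body.get("status") == "completed":  # assumed status field
            return body["result"]
        time.sleep(poll_interval)

    raise TimeoutError(f"Task {task_id} did not complete within {max_wait}s")
```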
## Worker cold start / lazy model load
The first inference request handled by a freshly started `celery-worker-ml` process
triggers a model load from the MLflow Registry. This load takes a few seconds.
All subsequent requests in that process reuse the loaded model — no per-request load.
Under HPA scale-up, new worker pods incur this one-time cold start cost before contributing to throughput.
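A sketch of that lazy-load pattern in a worker process, assuming the model is pulled with `mlflow.pyfunc`; the registry URI, task name, and feature handling are placeholders, not the project's actual worker code.

```python
# Sketch of per-process lazy model loading in a Celery worker.
# The model URI, task name, and feature handling are illustrative assumptions.
import mlflow.pyfunc
import pandas as pd
from celery import Celery

celery_app = Celery("ml", broker="amqp://localhost")  # assumed RabbitMQ broker

_MODEL_URI = "models:/match-predictor/Production"  # hypothetical registry URI
_model = None  # per-process cache; empty until the first task arrives


def _get_model():
    """Load the model from the MLflow Registry once per worker process."""
    global _model
    if _model is None:
        # Cold start: this fetch + deserialization step takes a few seconds.
        _model = mlflow.pyfunc.load_model(_MODEL_URI)
    return _model


@celery_app.task(name="predict_task")
def predict_task(features: dict) -> list:
    # Warm path: every call after the first reuses the in-memory model.
    frame = pd.DataFrame([features])
    # Assumes the pyfunc model returns a numpy array of predictions.
    return _get_model().predict(frame).tolist()
```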
## Payload and feature lookup characteristics
- Feature vectors are fixed-width flat dicts; payload size is small (< 2 KB).
- Batch lookup (`GET /predict/{match_id}`) reads from a pre-computed Parquet file; no inference occurs (see the sketch after this list).
- Prediction caching (Redis) eliminates repeated inference for the same input within the TTL window.
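A sketch of the batch-lookup read; the Parquet path and column names are assumptions.

```python
# Sketch of the no-inference batch lookup path.
# The Parquet path, column names, and lookup scheme are assumptions.
from pathlib import Path

import pandas as pd

PREDICTIONS_PATH = Path("data/batch/predictions.parquet")  # hypothetical location


def lookup_prediction(match_id: int) -> dict | None:
    """Return the pre-computed prediction row for a match, or None if absent."""
    frame = pd.read_parquet(PREDICTIONS_PATH)
    rows = frame.loc[frame["match_id"] == match_id]
    if rows.empty:
        return None
    return rows.iloc[0].to_dict()
```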
## How to read performance claims
When interpreting any latency figure in this project's documentation:
- "Measured" means a load test or profiling run produced the number.
- "Target" means an informal operational goal — not yet backed by continuous monitoring.
- "Bound" means a hard limit enforced in code (e.g., 30 s timeout).
The values on this page are targets and bounds, not measured facts, unless explicitly stated otherwise.
## Related
- Inference Modes — sync vs async path details
- Status — SLO table with measurement status
- Architecture: Runtime View