Performance & SLOs

This page explains how to interpret serving performance figures, what is currently measured, and what informal SLO targets exist.


Measurement status

| SLO | Target | Measurement status |
| --- | --- | --- |
| Sync POST /predict/ p50 latency | ≤ 50 ms | Not formally measured; figure from manual testing |
| Sync POST /predict/ p99 latency | ≤ 200 ms | Not formally measured; figure from manual testing |
| Service availability (30-day) | ≥ 99% | Not formally tracked |
| Async job completion p95 | ≤ 10 s (typical) | Not formally measured |
| Sync hard timeout | 30 s | Enforced in code (a bound, not a target) |

These targets are informal operational goals, not contractual SLOs backed by continuous measurement infrastructure. Prometheus already exports latency histograms, but Grafana dashboards are not yet deployed. See Status.

A Locust-based load test baseline is available at tests/load/locustfile.py.
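
For orientation, a minimal Locust user in the same spirit might look like the sketch below; the request payload is an illustrative stand-in, not the project's real feature schema.

```python
from locust import HttpUser, between, task


class PredictUser(HttpUser):
    # Simulated think time between consecutive requests from one user.
    wait_time = between(0.5, 2)

    @task
    def predict(self):
        # Hit the sync endpoint with a placeholder payload.
        self.client.post("/predict/", json={"match_id": "demo-123"})
```

Running something like `locust -f tests/load/locustfile.py --host <service-url>` drives the actual baseline against a live instance.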


Sync inference latency

The sync path (POST /predict/) involves:

  1. Pydantic request validation (< 1 ms)
  2. Redis cache lookup (< 5 ms when Redis is healthy)
  3. Cache hit: return immediately — sub-10 ms total
  4. Cache miss: Celery task enqueued → worker picks up → inference → result stored in Redis

On a cache miss, the dominant costs are:

  • Worker queue wait time (near-zero at low load; grows under backlog)
  • Model predict_proba() call (< 10 ms for current model size)
  • Feature assembly from pre-computed batch data (< 5 ms)

The 30 s hard timeout (_SYNC_TIMEOUT) is a ceiling, not a performance target. A timeout response (504 Gateway Timeout) indicates a worker backlog or worker failure, not normal operation.
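
Putting these pieces together, the sync path could be sketched roughly as follows. Only POST /predict/, the _SYNC_TIMEOUT bound, and the 504 behavior come from this page; the request schema, cache-key scheme, import path, and connection settings are hypothetical stand-ins.

```python
import json

import redis
from celery.exceptions import TimeoutError as CeleryTimeout
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

from app.tasks import predict_task  # hypothetical import path for the Celery task

_SYNC_TIMEOUT = 30  # hard ceiling in seconds; a bound, not a performance target

app = FastAPI()
redis_client = redis.Redis(host="redis", port=6379)  # hypothetical settings


class PredictRequest(BaseModel):
    match_id: str  # hypothetical field; the real schema is wider


@app.post("/predict/")
def predict(req: PredictRequest):
    key = "pred:" + req.match_id        # hypothetical cache-key scheme
    cached = redis_client.get(key)      # cache lookup: < 5 ms when Redis is healthy
    if cached is not None:
        return json.loads(cached)       # cache hit: sub-10 ms total

    # Cache miss: enqueue for a worker and wait, bounded by the hard timeout.
    task = predict_task.delay(req.model_dump())  # model_dump() is Pydantic v2
    try:
        return task.get(timeout=_SYNC_TIMEOUT)
    except CeleryTimeout:
        # Worker backlog or worker failure, not normal operation.
        raise HTTPException(status_code=504, detail="Prediction timed out")
```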


Async inference

The async path (POST /predict/async/) returns immediately (< 20 ms). Task completion time depends on:

  • Worker availability (how many celery-worker-ml pods are running)
  • Queue depth at submission time
  • Model inference time

Under normal conditions with 2 workers and low load, completion is typically < 5 s. Under sustained load or worker restarts, tasks queue in RabbitMQ until a worker is available.
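
Reusing the names from the sync sketch above, the async submit-and-poll shape might look like this; only POST /predict/async/ and its immediate return come from this page, while the status route is an assumed illustration.

```python
from celery.result import AsyncResult


@app.post("/predict/async/")
def predict_async(req: PredictRequest):
    # Enqueue and return immediately; no waiting on the worker.
    task = predict_task.delay(req.model_dump())
    return {"task_id": task.id}  # typically < 20 ms end to end


@app.get("/predict/async/{task_id}")  # hypothetical status route
def predict_async_status(task_id: str):
    res = AsyncResult(task_id, app=predict_task.app)
    if res.successful():
        return {"status": "done", "result": res.result}
    # e.g. PENDING while the task queues in RabbitMQ awaiting a worker.
    return {"status": res.state}
```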


Worker cold start / lazy model load

The first inference request handled by a freshly started celery-worker-ml process triggers a model load from the MLflow Registry. This load takes a few seconds. All subsequent requests in that process reuse the loaded model — no per-request load.
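
A minimal sketch of that once-per-process lazy load, assuming a registered model name of match-predictor (the real name and stage may differ):

```python
import mlflow.pyfunc

_model = None  # module-level cache: one load per worker process


def get_model():
    global _model
    if _model is None:
        # First request in a fresh worker pays the cold start here (a few seconds).
        _model = mlflow.pyfunc.load_model("models:/match-predictor/Production")
    return _model  # every later request in this process reuses the loaded model
```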

Under HPA scale-up, new worker pods incur this one-time cold start cost before contributing to throughput.


Payload and feature lookup characteristics

  • Feature vectors are fixed-width flat dicts; payload size is small (< 2 KB).
  • Batch-lookup (GET /predict/{match_id}) reads from a pre-computed parquet file; no inference occurs.
  • Prediction caching (Redis) eliminates repeated inference for the same input within the TTL window.
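
The caching in the last bullet could be as simple as the sketch below; the key scheme and TTL value are assumptions, not the project's actual settings.

```python
import hashlib
import json

import redis

r = redis.Redis(host="redis", port=6379)  # hypothetical settings
CACHE_TTL = 300  # seconds; illustrative value only


def cache_key(features: dict) -> str:
    # Stable key derived from the fixed-width feature dict (< 2 KB payloads).
    blob = json.dumps(features, sort_keys=True).encode()
    return "pred:" + hashlib.sha256(blob).hexdigest()


def store_prediction(features: dict, result: dict) -> None:
    # SETEX writes the value and its TTL atomically; the entry expires on its own.
    r.setex(cache_key(features), CACHE_TTL, json.dumps(result))
```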

How to read performance claims

When interpreting any latency figure in this project's documentation:

  • "Measured" means a load test or profiling run produced the number.
  • "Target" means an informal operational goal — not yet backed by continuous monitoring.
  • "Bound" means a hard limit enforced in code (e.g., 30 s timeout).

The values on this page are targets and bounds, not measured facts, unless explicitly stated otherwise.