# Performance & SLOs
This page explains how to interpret serving performance figures, what is currently measured, and what informal SLO targets exist.
## Measurement status
| SLO | Target | Measurement status |
|---|---|---|
| Sync `POST /predict/` p50 latency | ≤ 50 ms | Not formally measured; from manual testing |
| Sync `POST /predict/` p99 latency | ≤ 200 ms | Not formally measured; from manual testing |
| Service availability (30-day) | ≥ 99% | Not formally tracked |
| Async job completion p95 | ≤ 10 s (typical) | Not formally measured |
| Sync hard timeout | 30 s | Enforced in code — not a target, a bound |
These targets are informal operational goals, not contractual SLOs backed by continuous measurement infrastructure. Prometheus exports latency histograms; Grafana dashboards are not yet deployed. See Status.
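As a rough illustration of what that instrumentation pattern looks like, the sketch below records request latency into a `prometheus_client` histogram. The metric name, bucket boundaries, and the `run_prediction` stub are assumptions for illustration, not the service's actual code.

```python
# Illustrative only: the metric name, buckets, and run_prediction stub are
# assumptions, not the service's actual instrumentation.
import time

from prometheus_client import Histogram

PREDICT_LATENCY = Histogram(
    "predict_request_latency_seconds",  # assumed metric name
    "Latency of sync POST /predict/ requests",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1.0, 5.0, 30.0),
)


def run_prediction(payload: dict) -> dict:
    # Stand-in for the real inference path (cache lookup + Celery round trip).
    return {"prediction": 0.5}


def handle_predict(payload: dict) -> dict:
    """Time the sync prediction path and record the duration in the histogram."""
    start = time.perf_counter()
    try:
        return run_prediction(payload)
    finally:
        PREDICT_LATENCY.observe(time.perf_counter() - start)
```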
A Locust-based load test baseline is available at `tests/load/locustfile.py`.
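The actual scenario lives in that file; a minimal sketch of the shape such a test takes (the payload fields and wait times below are assumptions, not the contents of `locustfile.py`):

```python
# Minimal sketch of a Locust scenario for the sync endpoint; payload fields
# and wait times are assumptions, not the project's actual load test.
from locust import HttpUser, between, task


class PredictUser(HttpUser):
    wait_time = between(0.5, 2.0)  # simulated think time between requests

    @task
    def sync_predict(self):
        # Hypothetical feature payload; the real schema is defined by the
        # API's Pydantic request model.
        self.client.post("/predict/", json={"match_id": 12345, "features": {}})
```

Running `locust -f tests/load/locustfile.py --host http://localhost:8000` against a local deployment exercises the same path the latency targets above refer to.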
## Sync inference latency
The sync path (`POST /predict/`) involves:
- Pydantic request validation (< 1 ms)
- Redis cache lookup (< 5 ms when Redis is healthy)
- Cache hit: return immediately — sub-10 ms total
- Cache miss: Celery task enqueued → worker picks up → inference → result stored in Redis
On cache miss the dominant costs are:
- Worker queue wait time (near-zero at low load; grows under backlog)
- Model `predict_proba()` call (< 10 ms for current model size)
- Feature assembly from pre-computed batch data (< 5 ms)
The 30 s hard timeout (`_SYNC_TIMEOUT`) is a ceiling, not a performance target.
A timeout response (504 Gateway Timeout) indicates a worker backlog or worker failure, not normal operation.
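A minimal sketch of this flow, assuming a local Redis client, a RabbitMQ broker, and a Celery task registered as `predict_task` (all names below are placeholders except `_SYNC_TIMEOUT`; the real handler differs in detail):

```python
# Sketch of the sync path: cache lookup, Celery round trip, 30 s ceiling.
# redis_client, predict_task, and the cache-key scheme are placeholders.
import json

import redis
from celery import Celery
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import HTTPException

redis_client = redis.Redis()  # assumed local Redis
celery_app = Celery(
    "ml",
    broker="amqp://localhost",          # assumed RabbitMQ broker
    backend="redis://localhost:6379/0",  # assumed Redis result backend
)
predict_task = celery_app.signature("predict_task")  # assumed task name

_SYNC_TIMEOUT = 30  # seconds; a hard bound enforced in code, not a latency target


def sync_predict(features: dict) -> dict:
    cache_key = "prediction:" + json.dumps(features, sort_keys=True)

    # Cache lookup (< 5 ms when Redis is healthy); a hit returns sub-10 ms.
    cached = redis_client.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Cache miss: enqueue the task and block until a worker returns a result.
    async_result = predict_task.delay(features)
    try:
        prediction = async_result.get(timeout=_SYNC_TIMEOUT)
    except CeleryTimeoutError:
        # Hitting the ceiling signals worker backlog or failure, not normal operation.
        raise HTTPException(status_code=504, detail="Prediction timed out")

    redis_client.set(cache_key, json.dumps(prediction))  # store for later hits
    return prediction
```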
## Async inference
The async path (`POST /predict/async/`) returns immediately (< 20 ms).
Task completion time depends on:
- Worker availability (how many `celery-worker-ml` pods are running)
- Queue depth at submission time
- Model inference time
Under normal conditions with 2 workers and low load, completion is typically < 5 s. Under sustained load or worker restarts, tasks queue in RabbitMQ until a worker is available.
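A client-side sketch of the submit-then-poll pattern is shown below. The result route, response fields, and base URL are assumptions; only `POST /predict/async/` is documented on this page.

```python
# Client-side sketch of the async pattern: submit, then poll for the result.
# The result route (/predict/async/{task_id}) and response fields are assumed.
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment


def predict_async(features: dict, poll_interval: float = 0.5, max_wait: float = 30.0) -> dict:
    # Submission returns almost immediately (< 20 ms) with a task identifier.
    submitted = requests.post(f"{BASE_URL}/predict/async/", json=features, timeout=5)
    submitted.raise_for_status()
    task_id = submitted.json()["task_id"]  # assumed response field

    # Poll until a worker has completed the task or the deadline passes.
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        status = requests.get(f"{BASE_URL}/predict/async/{task_id}", timeout=5)
        status.raise_for_status()
        body = status.json()
        if body.get("status") == "completed":  # assumed status field
            return body["result"]
        time.sleep(poll_interval)

    raise TimeoutError(f"Task {task_id} did not complete within {max_wait}s")
```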
## Worker cold start / lazy model load
The first inference request handled by a freshly started `celery-worker-ml` process
triggers a model load from the MLflow Registry. This load takes a few seconds.
All subsequent requests in that process reuse the loaded model — no per-request load.
Under HPA scale-up, new worker pods incur this one-time cold start cost before contributing to throughput.
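A sketch of that lazy-load pattern in a worker process, assuming the model is pulled with `mlflow.pyfunc`; the registry URI, task name, and feature handling are placeholders, not the project's actual worker code.

```python
# Sketch of per-process lazy model loading in a Celery worker.
# The model URI, task name, and feature handling are illustrative assumptions.
import mlflow.pyfunc
import pandas as pd
from celery import Celery

celery_app = Celery("ml", broker="amqp://localhost")  # assumed RabbitMQ broker

_MODEL_URI = "models:/match-predictor/Production"  # hypothetical registry URI
_model = None  # per-process cache; empty until the first task arrives


def _get_model():
    """Load the model from the MLflow Registry once per worker process."""
    global _model
    if _model is None:
        # Cold start: this fetch + deserialization step takes a few seconds.
        _model = mlflow.pyfunc.load_model(_MODEL_URI)
    return _model


@celery_app.task(name="predict_task")
def predict_task(features: dict) -> list:
    # Warm path: every call after the first reuses the in-memory model.
    frame = pd.DataFrame([features])
    # Assumes the pyfunc model returns a numpy array of predictions.
    return _get_model().predict(frame).tolist()
```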
## Payload and feature lookup characteristics
- Feature vectors are fixed-width flat dicts; payload size is small (< 2 KB).
- Batch lookup (`GET /predict/{match_id}`) reads from a pre-computed Parquet file; no inference occurs (see the sketch after this list).
- Prediction caching (Redis) eliminates repeated inference for the same input within the TTL window.
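A sketch of the batch-lookup read; the Parquet path and column names are assumptions.

```python
# Sketch of the no-inference batch lookup path.
# The Parquet path, column names, and lookup scheme are assumptions.
from pathlib import Path

import pandas as pd

PREDICTIONS_PATH = Path("data/batch/predictions.parquet")  # hypothetical location


def lookup_prediction(match_id: int) -> dict | None:
    """Return the pre-computed prediction row for a match, or None if absent."""
    frame = pd.read_parquet(PREDICTIONS_PATH)
    rows = frame.loc[frame["match_id"] == match_id]
    if rows.empty:
        return None
    return rows.iloc[0].to_dict()
```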
## How to read performance claims
When interpreting any latency figure in this project's documentation:
- "Measured" means a load test or profiling run produced the number.
- "Target" means an informal operational goal — not yet backed by continuous monitoring.
- "Bound" means a hard limit enforced in code (e.g., 30 s timeout).
The values on this page are targets and bounds, not measured facts, unless explicitly stated otherwise.
## Related
- Inference Modes — sync vs async path details
- Status — SLO table with measurement status
- Architecture: Runtime View