Performance, Capacity & SLOs
Why SLOs matter
Serving systems must have explicit performance targets. Without SLOs, reliability cannot be measured or improved.
Synchronous inference SLOs
Target values (initial):
- p95 latency: < 500 ms
- p99 latency: < 1 s
- error rate: < 0.5%
- max payload size: bounded and validated
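As a minimal sketch of how these targets could be checked against a window of observed requests, the snippet below computes nearest-rank percentiles and compares them with the thresholds listed above. The names `SyncSLO`, `percentile`, and `check_sync_slo` are hypothetical, and the sketch assumes one latency sample per request in the window; a production setup would more likely read these values from its metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncSLO:
    # Thresholds taken from the initial targets above.
    p95_ms: float = 500.0
    p99_ms: float = 1000.0
    max_error_rate: float = 0.005  # 0.5%


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample such that at
    least q percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def check_sync_slo(latencies_ms, error_count, slo=SyncSLO()):
    """Compare one observation window against the SLO thresholds.

    Assumes latencies_ms holds one entry per request (hypothetical
    input shape), so its length is the total request count.
    """
    error_rate = error_count / len(latencies_ms)
    return {
        "p95_ok": percentile(latencies_ms, 95) < slo.p95_ms,
        "p99_ok": percentile(latencies_ms, 99) < slo.p99_ms,
        "error_rate_ok": error_rate < slo.max_error_rate,
    }
```

For example, a window of 100 requests with one failure has a 1% error rate, so `error_rate_ok` would be `False` even if both latency targets are met.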
Asynchronous inference SLOs
Target values (initial):
- job completion p95: < 30 s
- queue backlog: bounded
- retry success rate: monitored
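One way to keep the queue backlog bounded is to reject new jobs once a fixed limit is reached, and to track retry outcomes as a simple ratio. The sketch below uses the standard-library `queue.Queue`; the class `BoundedJobQueue`, the helper `retry_success_rate`, and the choice to reject rather than block are assumptions for illustration, not a description of the actual service.

```python
import queue


class BoundedJobQueue:
    """In-memory job queue that refuses work beyond max_backlog."""

    def __init__(self, max_backlog):
        self._jobs = queue.Queue(maxsize=max_backlog)
        self.rejected = 0  # jobs refused because the queue was full

    def submit(self, job):
        """Enqueue a job; return False instead of blocking when full."""
        try:
            self._jobs.put_nowait(job)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def backlog(self):
        """Current number of queued jobs (approximate under concurrency)."""
        return self._jobs.qsize()


def retry_success_rate(retried, succeeded_after_retry):
    """Fraction of retried jobs that eventually succeeded.

    Returns None when nothing was retried, so dashboards can tell
    "no data" apart from a 0% success rate.
    """
    if retried == 0:
        return None
    return succeeded_after_retry / retried
```

Rejecting at submission time makes the backlog limit explicit to callers, who can then back off or retry later.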
Capacity planning
Key signals:
- request rate
- CPU/memory utilization
- queue depth
- worker execution time
Scaling decisions are driven by metrics, not by manual intervention.
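As a rough illustration of how two of these signals could feed a scaling decision, the sketch below applies Little's law: the average number of busy workers equals the request rate multiplied by the mean execution time. The function name `required_workers` and the default 70% target utilization are assumptions chosen for the example.

```python
import math


def required_workers(request_rate_per_s, mean_exec_time_s, target_utilization=0.7):
    """Estimate the worker count needed to serve a given load.

    Little's law gives the average number of busy workers as
    rate * execution time; dividing by the target utilization
    leaves headroom for bursts. The 0.7 default is an assumption.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_workers = request_rate_per_s * mean_exec_time_s
    return max(1, math.ceil(busy_workers / target_utilization))
```

For instance, 40 requests per second at 0.25 s each keeps about 10 workers busy on average, so `required_workers(40, 0.25, target_utilization=0.5)` suggests 20 workers. The estimate covers averages only; tail latency and queue depth still need to be observed directly.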
Degradation strategy
In overload scenarios:
- async inference is preferred
- non-critical requests may be rate-limited
- alerts notify operators before SLO violation
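The rules above could be sketched as a small decision function. Everything here is hypothetical: the `Action` enum, the use of observed p95 latency as the overload signal, the 80% warning threshold, and the mapping of critical requests to async and non-critical requests to rate limiting are one possible reading of the strategy, not the documented behavior.

```python
from enum import Enum


class Action(Enum):
    SERVE_SYNC = "serve_sync"
    DEFER_ASYNC = "defer_async"
    RATE_LIMIT = "rate_limit"


def should_alert(p95_ms, slo_p95_ms=500.0, warn_fraction=0.8):
    """Alert once p95 latency reaches a fraction of the SLO target,
    so operators are notified before the SLO itself is violated."""
    return p95_ms >= warn_fraction * slo_p95_ms


def choose_action(p95_ms, critical, slo_p95_ms=500.0, warn_fraction=0.8):
    """Pick a handling mode for one request.

    Assumption: the system counts as overloaded at the same
    threshold that triggers the operator alert.
    """
    if not should_alert(p95_ms, slo_p95_ms, warn_fraction):
        return Action.SERVE_SYNC
    if critical:
        # Under overload, prefer the async path over rejecting.
        return Action.DEFER_ASYNC
    # Non-critical traffic is the first candidate for rate limiting.
    return Action.RATE_LIMIT
```

Using one threshold for both alerting and degradation keeps the sketch short; a real deployment might alert earlier than it starts shedding load.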
Continuous improvement
SLOs are reviewed and adjusted based on:
- observed traffic patterns
- model complexity
- infrastructure changes