Performance, Capacity & SLOs

Why SLOs matter

Serving systems need explicit performance targets. Without service level objectives (SLOs), there is no baseline against which reliability can be measured or improved.


Synchronous inference SLOs

Target values (initial):

  • p95 latency: < 500 ms
  • p99 latency: < 1 s
  • error rate: < 0.5%
  • max payload size: bounded and validated
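
As an illustration of how these targets could be checked against a window of observed requests, here is a minimal sketch. The Request record, the nearest-rank percentile method, and the function names are assumptions made for this example, not part of the serving system described here. The thresholds default to the initial targets listed above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one completed synchronous request."""
    latency_ms: float
    ok: bool


def percentile(values, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def check_sync_slo(requests, p95_ms=500.0, p99_ms=1000.0, max_error_rate=0.005):
    """Return (observed metrics, names of violated targets) for one window of requests."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    observed = {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
    }
    limits = {"p95_ms": p95_ms, "p99_ms": p99_ms, "error_rate": max_error_rate}
    # The targets are strict upper bounds ("< 500 ms"), so reaching the limit counts as a violation.
    violations = [name for name, limit in limits.items() if observed[name] >= limit]
    return observed, violations
```

In practice the percentiles would more likely come from the monitoring system's histograms than from raw samples; the sketch only shows the comparison logic.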

Asynchronous inference SLOs

Target values (initial):

  • job completion p95: < 30 s
  • queue backlog: bounded
  • retry success rate: monitored
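
A bounded backlog and a monitored retry success rate could be sketched as follows. BoundedJobQueue and RetryStats are hypothetical names introduced for this example; the only library used is Python's standard queue module, and a real deployment would typically rely on its broker's own limits and metrics instead.

```python
import queue


class BoundedJobQueue:
    """Hypothetical job queue that rejects new work once the backlog limit is reached."""

    def __init__(self, max_backlog):
        self._q = queue.Queue(maxsize=max_backlog)

    def submit(self, job):
        """Enqueue without blocking; return False if the backlog is full."""
        try:
            self._q.put_nowait(job)
            return True
        except queue.Full:
            return False

    def backlog(self):
        """Approximate number of jobs currently waiting."""
        return self._q.qsize()


class RetryStats:
    """Hypothetical counter for the outcome of retried jobs."""

    def __init__(self):
        self.retried = 0
        self.succeeded = 0

    def record(self, succeeded):
        self.retried += 1
        if succeeded:
            self.succeeded += 1

    def success_rate(self):
        """Fraction of retried jobs that eventually succeeded, or None if nothing was retried."""
        if self.retried == 0:
            return None
        return self.succeeded / self.retried
```

Rejecting at submission time keeps the backlog bounded by construction; the caller decides whether to shed the job or ask the client to try again later.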

Capacity planning

Key signals:

  • request rate
  • CPU/memory utilization
  • queue depth
  • worker execution time

Scaling decisions are driven by metrics, not by manual intervention.
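
One way to turn these signals into a scaling decision is to estimate the number of workers from request rate and worker execution time. The sketch below applies Little's law (average requests in service = arrival rate × mean service time) and divides by a target utilization to leave headroom. The function name, the 0.7 utilization default, and the worker bounds are assumptions for illustration, not values prescribed by this document.

```python
import math


def required_workers(request_rate_per_s, mean_exec_time_s,
                     target_utilization=0.7, min_workers=1, max_workers=50):
    """Estimate the worker count needed to serve the observed load with headroom."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average number of requests being executed at once.
    busy_workers = request_rate_per_s * mean_exec_time_s
    # Leave headroom so workers run at roughly the target utilization.
    needed = math.ceil(busy_workers / target_utilization)
    # Clamp to the configured bounds.
    return max(min_workers, min(max_workers, needed))
```

For example, 40 requests per second at 0.25 s each keeps about 10 workers busy; with a target utilization of 0.5, the estimate is 20 workers.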


Degradation strategy

In overload scenarios:

  • async inference is preferred
  • non-critical requests may be rate-limited
  • alerts notify operators before SLO violation
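
The rate-limiting and early-alert behaviour could look like the sketch below. It uses a token bucket, which is one common rate-limiting technique and not necessarily the one used here. TokenBucket, admit, should_alert, and the 0.8 warning fraction are all hypothetical; the clock is passed in explicitly so the logic is easy to test.

```python
class TokenBucket:
    """Token-bucket limiter: refills at rate_per_s, holds at most capacity tokens."""

    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        """Consume one token if available; `now` is a timestamp in seconds."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admit(critical, overloaded, bucket, now):
    """Critical requests always pass; non-critical ones are rate-limited only under overload."""
    if critical or not overloaded:
        return True
    return bucket.allow(now)


def should_alert(observed_p95_ms, slo_p95_ms=500.0, warn_fraction=0.8):
    """Alert once latency reaches a fraction of the SLO target, before the SLO itself is violated."""
    return observed_p95_ms >= warn_fraction * slo_p95_ms
```

Requests that admit rejects could be redirected to the asynchronous path instead of being dropped, which matches the preference for async inference under overload.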


Continuous improvement

SLOs are reviewed and adjusted based on:

  • observed traffic patterns
  • model complexity
  • infrastructure changes