ADR-0005 — Serving Modes: Sync vs Async Inference

Status

Accepted

Context

The system serves predictions to two classes of consumers:

  • interactive users (low latency),
  • batch or heavy workloads (higher latency tolerance).

No single inference mode can satisfy both profiles.

Decision

We support two inference modes:

  • Synchronous inference (FastAPI):
    • low-latency requests,
    • immediate responses,
    • strict SLOs.

  • Asynchronous inference (Celery + RabbitMQ):
    • heavy feature computation,
    • batch predictions,
    • retries and backpressure.
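The control flow of the two modes can be sketched in miniature. The example below is illustrative only: a thread pool stands in for the Celery + RabbitMQ worker tier, and the names (`predict_sync`, `submit_async`, `fetch_result`, `_model_predict`) are assumptions, not part of the actual service.

```python
# Minimal sketch of the two serving modes. A ThreadPoolExecutor stands in
# for the Celery + RabbitMQ worker tier; all names here are illustrative.
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

_workers = ThreadPoolExecutor(max_workers=4)   # stand-in for Celery workers
_jobs: dict[str, Future] = {}                  # stand-in for a result backend

def _model_predict(features: list[float]) -> float:
    # Placeholder for the real model call.
    return sum(features) / len(features)

def predict_sync(features: list[float]) -> float:
    """Synchronous path: compute and return in-request (strict SLO)."""
    return _model_predict(features)

def submit_async(features: list[float]) -> str:
    """Asynchronous path: enqueue the job and return a job id immediately."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _workers.submit(_model_predict, features)
    return job_id

def fetch_result(job_id: str, timeout: float = 5.0) -> float:
    """Poll the stand-in result backend for a finished job."""
    return _jobs[job_id].result(timeout=timeout)
```

In the real system, `submit_async` would correspond to a `task.delay(...)` call on a Celery task and `fetch_result` to a read from the configured result backend; the sketch only shows the shape of the two paths.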

Alternatives Considered

  • Sync-only: rejected due to scalability limits.
  • Async-only: rejected due to poor UX for interactive use cases.
  • Streaming inference: out of scope for current requirements.

Consequences

Positive

  • Flexible serving architecture.
  • Better resource utilization.
  • Clear performance isolation.

Negative

  • Increased system complexity.
  • Requires careful monitoring, queue sizing, and explicit retry policies.
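To make the retry-policy cost concrete, here is a minimal sketch of bounded retries with exponential backoff, the kind of policy the async tier would need. The function name and defaults are assumptions for illustration.

```python
# Sketch of a bounded-retry policy with exponential backoff.
# Name and defaults (max_retries=3, base_delay=0.01s) are illustrative.
import time

def retry_with_backoff(fn, max_retries: int = 3, base_delay: float = 0.01):
    """Call fn(); on failure, retry up to max_retries times, doubling the delay."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

With Celery, this would more typically be expressed declaratively via task options such as `autoretry_for`, `retry_backoff`, and `max_retries` rather than hand-rolled code.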

Rollback / Change Strategy

Async inference can be disabled, falling back to sync-only serving, if its operational complexity outweighs the benefits.
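One way to keep that rollback cheap is a single configuration kill switch gating the async path. This is a sketch only; the environment-variable name `ASYNC_INFERENCE_ENABLED` is hypothetical.

```python
# Sketch of a rollback kill switch for the async path.
# The variable name ASYNC_INFERENCE_ENABLED is hypothetical.
import os

def async_inference_enabled() -> bool:
    """Read the kill switch; default to enabled when unset."""
    return os.getenv("ASYNC_INFERENCE_ENABLED", "true").lower() == "true"
```

Requests would route to the synchronous path whenever the flag is off, so rollback requires no code change or redeploy of the model itself.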

References

  • FastAPI documentation
  • Celery documentation