# ADR-0005 — Serving Modes: Sync vs Async Inference

## Status

Accepted
## Context

The system serves predictions to different consumers:

- interactive users (low latency),
- batch or heavy workloads (higher latency tolerance).
A single inference mode is insufficient.
## Decision

We support two inference modes:
- Synchronous inference (FastAPI):
    - low-latency requests,
    - immediate responses,
    - strict SLOs.
- Asynchronous inference (Celery + RabbitMQ):
    - heavy feature computation,
    - batch predictions,
    - retries and backpressure.
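
A minimal sketch of how the two modes could coexist in one service, assuming a Python codebase. The broker URL, route paths, and the `run_model` helper are illustrative, not decisions made by this ADR:

```python
from celery import Celery
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical broker URL; the real value comes from deployment config.
celery_app = Celery("inference", broker="amqp://guest:guest@localhost:5672//")


class PredictRequest(BaseModel):
    features: list[float]


def run_model(features: list[float]) -> float:
    # Stand-in for the actual model; assumed to live elsewhere.
    return sum(features)


@app.post("/predict")
def predict_sync(request: PredictRequest) -> dict:
    # Synchronous mode: score inline and respond within the strict SLO.
    return {"score": run_model(request.features)}


@celery_app.task(name="inference.predict_batch")
def predict_batch(batch: list[list[float]]) -> list[float]:
    # Asynchronous mode: heavy feature computation / batch scoring on a worker.
    return [run_model(features) for features in batch]


@app.post("/predict/async")
def submit_batch(requests: list[PredictRequest]) -> dict:
    # Enqueue via RabbitMQ and return a task id the client can poll later.
    result = predict_batch.delay([r.features for r in requests])
    return {"task_id": result.id}
```

Keeping both paths behind one API surface keeps clients simple while isolating heavy work on Celery workers.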
## Alternatives Considered

- Sync-only: rejected due to scalability limits.
- Async-only: rejected due to poor UX for interactive use cases.
- Streaming inference: out of scope for current requirements.
## Consequences

### Positive

- Flexible serving architecture.
- Better resource utilization.
- Clear performance isolation.
### Negative

- Increased system complexity.
- Requires careful monitoring and retry policies.
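
The retry-policy concern can be addressed with Celery's built-in task options; a minimal sketch, reusing the hypothetical `predict_batch` task from the decision above (all option values are illustrative):

```python
from celery import Celery

celery_app = Celery("inference", broker="amqp://guest:guest@localhost:5672//")

# Backpressure: workers prefetch one message at a time instead of a large batch.
celery_app.conf.worker_prefetch_multiplier = 1


@celery_app.task(
    bind=True,
    autoretry_for=(ConnectionError, TimeoutError),
    retry_backoff=True,         # exponential backoff between attempts
    retry_backoff_max=300,      # cap backoff at five minutes
    retry_kwargs={"max_retries": 5},
    acks_late=True,             # requeue if a worker dies mid-task
)
def predict_batch(self, batch: list[list[float]]) -> list[float]:
    # Same batch-scoring body as in the decision sketch; retries are
    # handled entirely by the decorator options above.
    return [sum(features) for features in batch]
```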
## Rollback / Change Strategy

Async inference can be disabled if operational complexity outweighs benefits.
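
Disabling the async path could be as simple as an environment flag checked at the enqueue endpoint; a hypothetical sketch (the flag name and response shape are assumptions, not an agreed mechanism):

```python
import os

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical kill switch; set to "false" to roll back to sync-only serving.
ASYNC_ENABLED = os.getenv("INFERENCE_ASYNC_ENABLED", "true").lower() == "true"


@app.post("/predict/async")
def submit_batch(payload: dict) -> dict:
    if not ASYNC_ENABLED:
        # Fail fast so clients fall back to the synchronous /predict endpoint.
        raise HTTPException(status_code=503, detail="async inference disabled")
    # ... enqueue to Celery as in the decision sketch ...
    return {"task_id": "pending"}
```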
## References

- FastAPI documentation: https://fastapi.tiangolo.com/
- Celery documentation: https://docs.celeryq.dev/