# Sync vs Async Inference Modes

## Motivation
Different consumers have different requirements:

- Interactive users require low latency.
- Batch or heavy workloads tolerate higher latency.
## Synchronous inference
Endpoints: `POST /predict/` (inline features) and `GET /predict/{match_id}` (batch lookup)
Implementation:
- The FastAPI handler submits the `predict_match` task to the Celery `ml` queue.
- Blocks up to `_SYNC_TIMEOUT = 30 s` via `loop.run_in_executor` (non-blocking for asyncio).
- Returns `PredictResponse` directly on success; `504 Gateway Timeout` on timeout.
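A minimal sketch of that handler under these constraints; the `app.schemas` / `app.worker` module paths are assumptions for illustration, not the project's actual layout:

```python
import asyncio

from celery.exceptions import TimeoutError as CeleryTimeout
from fastapi import APIRouter, HTTPException

from app.schemas import PredictRequest, PredictResponse  # hypothetical module path
from app.worker import predict_match                     # hypothetical Celery task import

router = APIRouter()
_SYNC_TIMEOUT = 30  # seconds: the strict sync SLO

@router.post("/predict/", response_model=PredictResponse)
async def predict(request: PredictRequest) -> PredictResponse:
    # Dispatch onto the Celery "ml" queue; returns an AsyncResult handle.
    async_result = predict_match.apply_async(args=[request.model_dump()], queue="ml")
    loop = asyncio.get_running_loop()
    try:
        # AsyncResult.get() blocks, so run it in the default executor
        # to keep the event loop free for other requests.
        payload = await loop.run_in_executor(None, async_result.get, _SYNC_TIMEOUT)
    except CeleryTimeout as exc:
        raise HTTPException(status_code=504, detail="Prediction timed out") from exc
    return PredictResponse(**payload)
```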
Endpoint list:

```
POST /predict/              # inline features → sync Celery dispatch
GET  /predict/{match_id}    # lookup from batch_inference parquet
GET  /predict/matches/      # list upcoming matches
GET  /predict/model/info    # MLflow registry metadata
```
Characteristics:
- Strict 30 s SLO.
- Bounded payload size enforced via Pydantic schema (sketched after this list).
- Immediate failure feedback (4xx/5xx).
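A hypothetical `PredictRequest` / `PredictResponse` pair showing how the payload bound could be enforced at the edge; the field names and the 64-feature cap are illustrative, not the real schema:

```python
from pydantic import BaseModel, Field, field_validator

class PredictRequest(BaseModel):
    # Hypothetical shape; the real schema lives alongside the API code.
    match_id: int = Field(ge=0)
    features: dict[str, float]

    @field_validator("features")
    @classmethod
    def bounded_features(cls, value: dict[str, float]) -> dict[str, float]:
        # Reject oversized feature maps so request payloads stay bounded.
        if len(value) > 64:
            raise ValueError("too many features")
        return value

class PredictResponse(BaseModel):
    match_id: int
    prediction: float
    model_version: str
```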
When to use:
- UI-driven predictions.
- Real-time decision support.
## Asynchronous inference
Endpoint: `POST /predict/async/`
Implementation:
- FastAPI submits the `predict_match` task to the RabbitMQ `ml` queue and returns a `task_id` immediately.
- `PredictionService` is initialised once per worker process via the `worker_process_init` signal, avoiding repeated MLflow model loads.
- Results are stored in Redis (the Celery result backend) and retrieved via `GET /monitoring/task_status/{task_id}`.
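A sketch of both halves, reusing the same hypothetical module paths as above; in practice the worker-side signal handler and the API-side dispatch would live in separate modules:

```python
from celery.signals import worker_process_init
from fastapi import APIRouter

from app.schemas import PredictRequest         # hypothetical import
from app.services import PredictionService     # hypothetical import
from app.worker import predict_match           # hypothetical Celery task import

router = APIRouter()
_service: PredictionService | None = None

# --- worker side ---
@worker_process_init.connect
def init_prediction_service(**kwargs):
    # Fires once per Celery worker process: load the MLflow model a
    # single time rather than on every task invocation.
    global _service
    _service = PredictionService()

# --- API side ---
@router.post("/predict/async/", status_code=202)  # 202 Accepted is an assumption
async def predict_async(request: PredictRequest) -> dict:
    result = predict_match.apply_async(args=[request.model_dump()], queue="ml")
    return {
        "task_id": result.id,
        "status": "submitted",
        "status_url": f"/monitoring/task_status/{result.id}",
    }
```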
Polling:
```bash
# Submit
curl -X POST http://localhost:8000/predict/async/ \
  -H "Content-Type: application/json" \
  -d '{"match_id": 42, "features": {...}}'
# → {"task_id": "abc-123", "status": "submitted", "status_url": "/monitoring/task_status/abc-123"}

# Poll
curl http://localhost:8000/monitoring/task_status/abc-123
# → {"task_id": "abc-123", "status": "success", "result": {...}}
```
Characteristics:
- Higher throughput via task queue.
- Retries and backoff managed by Celery.
- Decoupled request/response lifecycle.
When to use:
- Streamlit UI polling for results.
- Computationally expensive feature assembly.
- Batch workloads.
## Operational trade-offs
| Aspect | Sync | Async |
|---|---|---|
| Latency | Low (≤30 s SLO) | Higher (queue wait) |
| Throughput | Limited | High |
| Complexity | Lower | Higher |
| Failure mode | Immediate (504) | Deferred |
| UX | Direct response | Poll `status_url` |
## Safety considerations
- Async jobs are idempotent: re-submitting the same `match_id` is safe.
- Retries are bounded by Celery configuration.
- A dead-letter queue is configured for failed tasks.
- Prometheus metrics: `prediction_requests_total{source="sync|async"}`, `prediction_timeouts_total`.
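A sketch of how these properties might map onto task configuration, using Celery's standard retry options and `prometheus_client`; the broker/backend URLs and option values are illustrative, and the dead-letter queue itself is broker-side RabbitMQ configuration not shown here:

```python
from celery import Celery
from prometheus_client import Counter

celery_app = Celery("ml", broker="amqp://rabbitmq//", backend="redis://redis:6379/0")

prediction_requests_total = Counter(
    "prediction_requests_total", "Prediction requests by path", ["source"]
)
prediction_timeouts_total = Counter(
    "prediction_timeouts_total", "Predictions that exceeded the sync SLO"
)

def _predict(payload: dict) -> dict:
    # Placeholder for the per-worker PredictionService call
    # (initialised via worker_process_init, see above).
    raise NotImplementedError

@celery_app.task(
    name="predict_match",
    acks_late=True,                   # re-deliver if a worker dies mid-task
    autoretry_for=(Exception,),
    retry_backoff=True,               # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},  # retries stay bounded
)
def predict_match(payload: dict) -> dict:
    prediction_requests_total.labels(source="async").inc()
    # Idempotent by construction: the same match_id always yields the
    # same prediction, so re-submission and retries are safe.
    return _predict(payload)
```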