Sync vs Async Inference Modes¶
Status: ✅ Implemented
Motivation¶
Different consumers have different requirements:

- interactive users require low latency,
- batch or heavy workloads tolerate higher latency.
Synchronous inference¶
Endpoints: POST /predict/ (inline features) and GET /predict/{match_id} (precomputed feature lookup)
Implementation:
- FastAPI handler submits predict_match task to the Celery ml queue.
- Waits up to `_SYNC_TIMEOUT = 30 s` for the task result via `loop.run_in_executor`, keeping the asyncio event loop unblocked.
- Returns PredictResponse directly on success; 504 on timeout.
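The handler's timeout pattern can be sketched as below. This is a minimal, self-contained illustration: `_blocking_predict` is a hypothetical stand-in for blocking on the Celery `AsyncResult` of the `predict_match` task, and the handler/field names are assumptions, not the project's actual code.

```python
import asyncio

_SYNC_TIMEOUT = 30  # seconds; matches the handler's SLO

def _blocking_predict(match_id: int) -> dict:
    # Stand-in for the blocking wait on the Celery AsyncResult
    # (e.g. result.get(timeout=...)) of the predict_match task.
    return {"match_id": match_id, "home_win": 0.61}

async def predict_sync(match_id: int) -> dict:
    loop = asyncio.get_running_loop()
    try:
        # Off-load the blocking wait to a thread so the event loop
        # stays responsive, and bound the total wait at the SLO.
        return await asyncio.wait_for(
            loop.run_in_executor(None, _blocking_predict, match_id),
            timeout=_SYNC_TIMEOUT,
        )
    except asyncio.TimeoutError:
        # The real handler maps this case to an HTTP 504 response.
        raise TimeoutError("prediction timed out") from None

result = asyncio.run(predict_sync(42))
```

Running the blocking wait in an executor is what keeps a 30 s worst-case wait from stalling every other request on the same event loop.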
Endpoint list:

```
POST /predict/              # inline features
GET  /predict/{match_id}    # precomputed feature lookup
GET  /predict/matches/      # list upcoming matches
GET  /predict/model/info    # MLflow registry metadata
```
Characteristics:

- Strict 30 s SLO.
- Bounded payload size via Pydantic schema.
- Immediate failure feedback (4xx/5xx).
When to use:

- UI-driven predictions.
- Real-time decision support.
Asynchronous inference¶
Endpoint: POST /predict/async/
Implementation:
- FastAPI submits predict_match task to RabbitMQ ml queue and returns task_id immediately.
- PredictionService is initialised once per worker process via worker_process_init signal
(avoids reloading the MLflow model for every task).
- Results stored in Celery result backend; retrieved via GET /monitoring/task_status/{task_id}.
Polling:
```bash
# Submit
curl -X POST /predict/async/ -d '{"match_id": 42}'
# → {"task_id": "abc-123", "status": "submitted", "status_url": "/monitoring/task_status/abc-123"}

# Poll
curl /monitoring/task_status/abc-123
# → {"status": "SUCCESS", "result": {...}}
```
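A client-side polling loop for this endpoint might look like the sketch below. The `fetch` callable is an assumption standing in for a real HTTP call (e.g. `requests.get(url).json()`), and the simulated backend is purely for illustration:

```python
import itertools
import time

def poll_task(status_url: str, fetch, interval: float = 1.0, max_wait: float = 60.0) -> dict:
    # Poll GET /monitoring/task_status/{task_id} until a terminal state.
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        body = fetch(status_url)
        if body["status"] in ("SUCCESS", "FAILURE"):
            return body
        time.sleep(interval)
    raise TimeoutError(f"task still pending after {max_wait} s")

# Simulated backend: PENDING twice, then SUCCESS.
_states = itertools.chain(["PENDING", "PENDING"], itertools.repeat("SUCCESS"))

def fake_fetch(url: str) -> dict:
    status = next(_states)
    result = {"match_id": 42} if status == "SUCCESS" else None
    return {"status": status, "result": result}

body = poll_task("/monitoring/task_status/abc-123", fake_fetch, interval=0.01)
```

A bounded `max_wait` keeps clients (such as the Streamlit UI) from polling a stuck task forever.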
Characteristics:

- Higher throughput via task queue.
- Retries and backoff managed by Celery.
- Decoupled request/response lifecycle.
When to use:

- Streamlit UI polling for results.
- Computationally expensive feature assembly.
- Batch workloads.
Operational trade-offs¶
| Aspect | Sync | Async |
|---|---|---|
| Latency | Low (≤30 s SLO) | Higher (queue wait) |
| Throughput | Limited | High |
| Complexity | Lower | Higher |
| Failure mode | Immediate (504) | Deferred |
| UX | Direct response | Poll status_url |
Safety considerations¶
- Async jobs are idempotent: re-submitting the same `match_id` is safe.
- Retries are bounded by Celery config.
- A dead-letter queue is configured for failed tasks.
- Prometheus counters: `prediction_requests_total{source="sync|async"}` and `prediction_timeouts_total`.