Sync vs Async Inference Modes

Status: ✅ Implemented


Motivation

Different consumers have different requirements:

  • Interactive users require low latency.
  • Batch or heavy workloads tolerate higher latency.


Synchronous inference

Endpoints: POST /predict/ (inline features) and GET /predict/{match_id} (precomputed feature lookup)

Implementation:

  • The FastAPI handler submits a predict_match task to the Celery ml queue.
  • It blocks for up to _SYNC_TIMEOUT = 30 s via loop.run_in_executor, which keeps the asyncio event loop free while waiting.
  • On success it returns a PredictResponse directly; on timeout it returns 504.
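The blocking-wait-with-timeout pattern above can be sketched with the standard library alone. Here, blocking_predict is a hypothetical stand-in for fetching the Celery task result; the real handler would call the task's AsyncResult.get() instead:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_SYNC_TIMEOUT = 30  # seconds; matches the handler's SLO

_executor = ThreadPoolExecutor(max_workers=4)

def blocking_predict(match_id: int) -> dict:
    # Hypothetical stand-in for blocking on the Celery task result.
    return {"match_id": match_id, "home_win_prob": 0.61}

async def predict_sync(match_id: int, timeout: float = _SYNC_TIMEOUT) -> dict:
    loop = asyncio.get_running_loop()
    try:
        # run_in_executor moves the blocking call to a worker thread,
        # so the asyncio event loop stays free to serve other requests.
        return await asyncio.wait_for(
            loop.run_in_executor(_executor, blocking_predict, match_id),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        # The real handler maps this to an HTTP 504 response.
        return {"error": "timeout", "status_code": 504}

result = asyncio.run(predict_sync(42))
```

The timeout is enforced on the awaiting side, so a slow worker never holds an HTTP connection open beyond the SLO.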

Endpoint list:

POST /predict/                # inline features
GET  /predict/{match_id}      # precomputed feature lookup
GET  /predict/matches/        # list upcoming matches
GET  /predict/model/info      # MLflow registry metadata

Characteristics:

  • Strict 30 s SLO.
  • Bounded payload size via the Pydantic schema.
  • Immediate failure feedback (4xx/5xx).

When to use:

  • UI-driven predictions.
  • Real-time decision support.


Asynchronous inference

Endpoint: POST /predict/async/

Implementation:

  • FastAPI submits a predict_match task to the RabbitMQ ml queue and returns a task_id immediately.
  • PredictionService is initialised once per worker process via the worker_process_init signal, avoiding a reload of the MLflow model for every task.
  • Results are stored in the Celery result backend and retrieved via GET /monitoring/task_status/{task_id}.
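The once-per-worker initialisation can be sketched as a process-local singleton. In the real app, on_worker_process_init would be connected to celery.signals.worker_process_init; the PredictionService body here is a hypothetical stand-in for the expensive MLflow model load:

```python
LOAD_COUNT = 0  # counts how many times the "model" is loaded

class PredictionService:
    def __init__(self):
        global LOAD_COUNT
        LOAD_COUNT += 1          # stands in for the expensive MLflow model load
        self.model = object()

_service = None  # one instance per worker process

def on_worker_process_init(**kwargs):
    # In the real app: @celery.signals.worker_process_init.connect
    global _service
    _service = PredictionService()

def predict_match(match_id: int) -> dict:
    # Task body reuses the process-wide service instead of reloading per task.
    assert _service is not None, "worker_process_init has not run"
    return {"match_id": match_id}

on_worker_process_init()
results = [predict_match(i) for i in range(3)]
```

Three tasks run, but the model loads exactly once, which is the point of hooking initialisation to the worker-process lifecycle rather than the task lifecycle.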

Polling:

# Submit
curl -X POST /predict/async/ -d '{"match_id": 42}'
# → {"task_id": "abc-123", "status": "submitted", "status_url": "/monitoring/task_status/abc-123"}

# Poll
curl /monitoring/task_status/abc-123
# → {"status": "SUCCESS", "result": {...}}
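A client-side polling loop for the flow above might look like this. get_task_status is a simulated stand-in for an HTTP GET against /monitoring/task_status/{task_id}; here it returns PENDING, then STARTED, then SUCCESS:

```python
import time

# Simulated status endpoint: the real client would issue an HTTP GET
# to /monitoring/task_status/{task_id} instead.
_states = iter(["PENDING", "STARTED", "SUCCESS"])

def get_task_status(task_id: str) -> dict:
    status = next(_states)
    result = {"match_id": 42} if status == "SUCCESS" else None
    return {"status": status, "result": result}

def poll(task_id: str, interval: float = 0.01, max_attempts: int = 10) -> dict:
    """Poll until the task reaches a terminal Celery state."""
    for _ in range(max_attempts):
        payload = get_task_status(task_id)
        if payload["status"] in ("SUCCESS", "FAILURE"):
            return payload
        time.sleep(interval)  # a real client might back off exponentially
    raise TimeoutError(f"task {task_id} did not finish")

final = poll("abc-123")
```

Bounding the number of attempts keeps a stuck task from pinning the client forever; the Streamlit UI applies the same idea on a timer.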

Characteristics:

  • Higher throughput via the task queue.
  • Retries and backoff managed by Celery.
  • Decoupled request/response lifecycle.

When to use:

  • Streamlit UI polling for results.
  • Computationally expensive feature assembly.
  • Batch workloads.


Operational trade-offs

Aspect        Sync                Async
Latency       Low (≤30 s SLO)     Higher (queue wait)
Throughput    Limited             High
Complexity    Lower               Higher
Failure mode  Immediate (504)     Deferred
UX            Direct response     Poll status_url

Safety considerations

  • Async jobs are idempotent: re-submitting the same match_id is safe.
  • Retries are bounded by Celery config.
  • Dead-letter queue configured for failed tasks.
  • Prometheus counters: prediction_requests_total{source="sync|async"}, prediction_timeouts_total.
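The bounded-retry behaviour can be sketched without a broker. The decorator below mirrors what Celery's autoretry_for/max_retries/retry_backoff task options provide; flaky_predict and the specific numbers are hypothetical:

```python
import functools
import time

def bounded_retries(max_retries: int = 3, base_delay: float = 0.01):
    """Retry with exponential backoff, capped at max_retries attempts."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # exhausted: Celery would route this to the DLQ
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return deco

ATTEMPTS = 0

@bounded_retries(max_retries=3)
def flaky_predict(match_id: int) -> dict:
    global ATTEMPTS
    ATTEMPTS += 1
    if ATTEMPTS < 3:
        raise ConnectionError("transient broker error")  # simulated failure
    return {"match_id": match_id}

result = flaky_predict(42)
```

Because re-submission is idempotent, retrying a prediction task (or replaying it from the dead-letter queue) cannot corrupt state, only repeat work.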