Sync vs Async Inference Modes


Motivation

Different consumers have different requirements:

  • Interactive users require low latency.
  • Batch or heavy workloads tolerate higher latency.


Synchronous inference

Endpoints: POST /predict/ (inline features) and GET /predict/{match_id} (batch lookup)

Implementation:

  • FastAPI handler submits the predict_match task to the Celery ml queue.
  • Waits up to _SYNC_TIMEOUT = 30 s for the result via loop.run_in_executor, so a worker thread blocks but the asyncio event loop keeps serving other requests (see the sketch after this list).
  • Returns PredictResponse directly on success; 504 Gateway Timeout on timeout.
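
A minimal sketch of the sync path, assuming celery_app, the predict_match task, and the PredictRequest/PredictResponse schemas are importable under these names (module paths illustrative):

import asyncio

from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import APIRouter, HTTPException

from app.schemas import PredictRequest, PredictResponse  # illustrative paths
from app.worker import predict_match

router = APIRouter(prefix="/predict")
_SYNC_TIMEOUT = 30  # seconds, per the documented SLO

@router.post("/", response_model=PredictResponse)
async def predict(request: PredictRequest) -> PredictResponse:
    # Dispatch onto the Celery "ml" queue.
    async_result = predict_match.apply_async(args=[request.model_dump()], queue="ml")
    loop = asyncio.get_running_loop()
    try:
        # .get() blocks a thread, so run it in the default executor;
        # the asyncio event loop is never blocked.
        result = await loop.run_in_executor(
            None, lambda: async_result.get(timeout=_SYNC_TIMEOUT)
        )
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="Prediction timed out")
    return PredictResponse(**result)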

Endpoint list:

POST /predict/                # inline features → sync Celery dispatch
GET  /predict/{match_id}      # lookup from batch_inference parquet
GET  /predict/matches/        # list upcoming matches
GET  /predict/model/info      # MLflow registry metadata
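
A minimal sketch of the batch lookup behind GET /predict/{match_id}, assuming the batch_inference job writes its output to data/batch_predictions.parquet (path and column names illustrative):

import pandas as pd
from fastapi import HTTPException

@router.get("/{match_id}", response_model=PredictResponse)
async def get_prediction(match_id: int) -> PredictResponse:
    # Look up the precomputed prediction; 404 if the batch job
    # has not produced a row for this match yet.
    df = pd.read_parquet("data/batch_predictions.parquet")
    row = df.loc[df["match_id"] == match_id]
    if row.empty:
        raise HTTPException(status_code=404, detail="No prediction for this match")
    return PredictResponse(**row.iloc[0].to_dict())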

Characteristics:

  • Strict 30 s SLO.
  • Bounded payload size via Pydantic schema (sketched after this list).
  • Immediate failure feedback (4xx/5xx).
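
A minimal sketch of the bounding schema, assuming a flat feature map (the cap of 256 features is illustrative):

from pydantic import BaseModel, field_validator

class PredictRequest(BaseModel):
    match_id: int
    features: dict[str, float]

    @field_validator("features")
    @classmethod
    def cap_payload(cls, value: dict[str, float]) -> dict[str, float]:
        # Reject oversized payloads at validation time (422) before
        # they are serialised onto the queue.
        if len(value) > 256:
            raise ValueError("too many features")
        return value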

When to use:

  • UI-driven predictions.
  • Real-time decision support.

Asynchronous inference

Endpoint: POST /predict/async/

Implementation:

  • FastAPI submits predict_match task to RabbitMQ ml queue and returns task_id immediately.
  • PredictionService is initialised once per worker process via the worker_process_init signal (sketched below), avoiding repeated MLflow model loads.
  • Results stored in Redis (Celery result backend); retrieved via GET /monitoring/task_status/{task_id}.
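
A minimal sketch of the per-worker initialisation, assuming PredictionService wraps an MLflow pyfunc model (the registry URI is illustrative):

import mlflow.pyfunc
from celery.signals import worker_process_init

class PredictionService:
    def __init__(self) -> None:
        # Resolved once against the MLflow model registry.
        self.model = mlflow.pyfunc.load_model("models:/match_predictor/Production")

    def predict(self, features: dict) -> dict:
        ...

prediction_service: PredictionService | None = None

@worker_process_init.connect
def init_prediction_service(**kwargs) -> None:
    # Fires once in each forked worker process, so the model is
    # loaded a single time rather than on every task.
    global prediction_service
    prediction_service = PredictionService()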

Polling:

# Submit
curl -X POST http://localhost:8000/predict/async/ \
  -H "Content-Type: application/json" \
  -d '{"match_id": 42, "features": {...}}'
# → {"task_id": "abc-123", "status": "submitted", "status_url": "/monitoring/task_status/abc-123"}

# Poll
curl http://localhost:8000/monitoring/task_status/abc-123
# → {"task_id": "abc-123", "status": "success", "result": {...}}

Characteristics:

  • Higher throughput via task queue.
  • Retries and backoff are managed by Celery (see the sketch after this list).
  • Decoupled request/response lifecycle.
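
A minimal sketch of the retry policy, using Celery's built-in task options (the exception class and limits are illustrative):

@celery_app.task(
    name="predict_match",
    autoretry_for=(ConnectionError,),  # retry transient failures only
    retry_backoff=True,                # exponential backoff between attempts
    retry_backoff_max=60,              # cap the delay at 60 s
    retry_jitter=True,                 # spread retries to avoid bursts
    max_retries=3,                     # bounded, per the safety notes below
)
def predict_match(payload: dict) -> dict:
    # prediction_service is the per-worker instance from the init sketch above.
    return prediction_service.predict(payload["features"])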

When to use:

  • Streamlit UI polling for results.
  • Computationally expensive feature assembly.
  • Batch workloads.

Operational trade-offs

Aspect        Sync               Async
Latency       Low (≤30 s SLO)    Higher (queue wait)
Throughput    Limited            High
Complexity    Lower              Higher
Failure mode  Immediate (504)    Deferred
UX            Direct response    Poll status_url

Safety considerations

  • Async jobs are idempotent: re-submitting the same match_id is safe.
  • Retries are bounded by Celery configuration.
  • A dead-letter queue is configured for failed tasks.
  • Prometheus metrics (sketched below): prediction_requests_total{source="sync|async"}, prediction_timeouts_total.
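
A minimal sketch of the metric definitions with prometheus_client (help strings illustrative):

from prometheus_client import Counter

prediction_requests_total = Counter(
    "prediction_requests_total",
    "Prediction requests by dispatch mode.",
    labelnames=["source"],  # "sync" or "async"
)
prediction_timeouts_total = Counter(
    "prediction_timeouts_total",
    "Sync predictions that exceeded _SYNC_TIMEOUT.",
)

# Incremented in the handlers, e.g.:
prediction_requests_total.labels(source="sync").inc()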