# Sync vs Async Inference Modes

## Motivation
Different consumers have different requirements:

- Interactive users require low latency.
- Batch or heavy workloads tolerate higher latency.
## Synchronous inference
Endpoints: `POST /predict/` (inline features) and `GET /predict/{match_id}` (batch lookup)
Implementation:
- The FastAPI handler submits the `predict_match` task to the Celery `ml` queue.
- Blocks up to `_SYNC_TIMEOUT = 30 s` via `loop.run_in_executor` (non-blocking for asyncio).
- Returns `PredictResponse` directly on success; `504 Gateway Timeout` on timeout.
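A minimal sketch of that handler under these constraints; the `app.schemas` / `app.worker` module paths are assumptions for illustration, not the project's actual layout:

```python
import asyncio

from celery.exceptions import TimeoutError as CeleryTimeout
from fastapi import APIRouter, HTTPException

from app.schemas import PredictRequest, PredictResponse  # hypothetical module path
from app.worker import predict_match                     # hypothetical Celery task import

router = APIRouter()
_SYNC_TIMEOUT = 30  # seconds: the strict sync SLO

@router.post("/predict/", response_model=PredictResponse)
async def predict(request: PredictRequest) -> PredictResponse:
    # Dispatch onto the Celery "ml" queue; returns an AsyncResult handle.
    async_result = predict_match.apply_async(args=[request.model_dump()], queue="ml")
    loop = asyncio.get_running_loop()
    try:
        # AsyncResult.get() blocks, so run it in the default executor
        # to keep the event loop free for other requests.
        payload = await loop.run_in_executor(None, async_result.get, _SYNC_TIMEOUT)
    except CeleryTimeout as exc:
        raise HTTPException(status_code=504, detail="Prediction timed out") from exc
    return PredictResponse(**payload)
```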
Endpoint list:

```
POST /predict/              # inline features → sync Celery dispatch
GET  /predict/{match_id}    # lookup from batch_inference parquet
GET  /predict/matches/      # list upcoming matches
GET  /predict/model/info    # MLflow registry metadata
```
Characteristics:
- Strict 30 s SLO.
- Bounded payload size enforced via Pydantic schema (sketched after this list).
- Immediate failure feedback (4xx/5xx).
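A hypothetical `PredictRequest` / `PredictResponse` pair showing how the payload bound could be enforced at the edge; the field names and the 64-feature cap are illustrative, not the real schema:

```python
from pydantic import BaseModel, Field, field_validator

class PredictRequest(BaseModel):
    # Hypothetical shape; the real schema lives alongside the API code.
    match_id: int = Field(ge=0)
    features: dict[str, float]

    @field_validator("features")
    @classmethod
    def bounded_features(cls, value: dict[str, float]) -> dict[str, float]:
        # Reject oversized feature maps so request payloads stay bounded.
        if len(value) > 64:
            raise ValueError("too many features")
        return value

class PredictResponse(BaseModel):
    match_id: int
    prediction: float
    model_version: str
```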
When to use:
- UI-driven predictions.
- Real-time decision support.
## Asynchronous inference
Endpoint: `POST /predict/async/`
Implementation:
- FastAPI submits the `predict_match` task to the RabbitMQ `ml` queue and returns a `task_id` immediately.
- `PredictionService` is initialised once per worker process via the `worker_process_init` signal, avoiding repeated MLflow model loads.
- Results are stored in Redis (the Celery result backend) and retrieved via `GET /monitoring/task_status/{task_id}`.
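A sketch of both halves, reusing the same hypothetical module paths as above; in practice the worker-side signal handler and the API-side dispatch would live in separate modules:

```python
from celery.signals import worker_process_init
from fastapi import APIRouter

from app.schemas import PredictRequest         # hypothetical import
from app.services import PredictionService     # hypothetical import
from app.worker import predict_match           # hypothetical Celery task import

router = APIRouter()
_service: PredictionService | None = None

# --- worker side ---
@worker_process_init.connect
def init_prediction_service(**kwargs):
    # Fires once per Celery worker process: load the MLflow model a
    # single time rather than on every task invocation.
    global _service
    _service = PredictionService()

# --- API side ---
@router.post("/predict/async/", status_code=202)  # 202 Accepted is an assumption
async def predict_async(request: PredictRequest) -> dict:
    result = predict_match.apply_async(args=[request.model_dump()], queue="ml")
    return {
        "task_id": result.id,
        "status": "submitted",
        "status_url": f"/monitoring/task_status/{result.id}",
    }
```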
Polling:
```bash
# Submit
curl -X POST http://localhost:8000/predict/async/ \
  -H "Content-Type: application/json" \
  -d '{"match_id": 42, "features": {...}}'
# → {"task_id": "abc-123", "status": "submitted", "status_url": "/monitoring/task_status/abc-123"}

# Poll
curl http://localhost:8000/monitoring/task_status/abc-123
# → {"task_id": "abc-123", "status": "success", "result": {...}}
```
Characteristics:
- Higher throughput via task queue.
- Retries and backoff managed by Celery.
- Decoupled request/response lifecycle.
When to use:
- Streamlit UI polling for results.
- Computationally expensive feature assembly.
- Batch workloads.
## Operational trade-offs
| Aspect | Sync | Async |
|---|---|---|
| Latency | Low (≤30 s SLO) | Higher (queue wait) |
| Throughput | Limited | High |
| Complexity | Lower | Higher |
| Failure mode | Immediate (504) | Deferred |
| UX | Direct response | Poll `status_url` |
## Safety considerations
- Async jobs are idempotent: re-submitting the same `match_id` is safe.
- Retries are bounded by Celery configuration.
- A dead-letter queue is configured for failed tasks.
- Prometheus metrics: `prediction_requests_total{source="sync|async"}`, `prediction_timeouts_total`.
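A sketch of how these properties might map onto task configuration, using Celery's standard retry options and `prometheus_client`; the broker/backend URLs and option values are illustrative, and the dead-letter queue itself is broker-side RabbitMQ configuration not shown here:

```python
from celery import Celery
from prometheus_client import Counter

celery_app = Celery("ml", broker="amqp://rabbitmq//", backend="redis://redis:6379/0")

prediction_requests_total = Counter(
    "prediction_requests_total", "Prediction requests by path", ["source"]
)
prediction_timeouts_total = Counter(
    "prediction_timeouts_total", "Predictions that exceeded the sync SLO"
)

def _predict(payload: dict) -> dict:
    # Placeholder for the per-worker PredictionService call
    # (initialised via worker_process_init, see above).
    raise NotImplementedError

@celery_app.task(
    name="predict_match",
    acks_late=True,                   # re-deliver if a worker dies mid-task
    autoretry_for=(Exception,),
    retry_backoff=True,               # exponential backoff between attempts
    retry_kwargs={"max_retries": 3},  # retries stay bounded
)
def predict_match(payload: dict) -> dict:
    prediction_requests_total.labels(source="async").inc()
    # Idempotent by construction: the same match_id always yields the
    # same prediction, so re-submission and retries are safe.
    return _predict(payload)
```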