# Current Serving Status
This page is the source of truth for what the inference API does today.
## Component status
| Component | Status | Notes |
|---|---|---|
| `POST /predict` (sync, Celery `ml` queue, 30s timeout) | ✅ Implemented | Returns probabilities |
| `GET /predict/{match_id}` (lookup from batch parquet) | ✅ Implemented | Batch result retrieval |
| `POST /predict/async/` (async Celery job) | ✅ Implemented | Returns `task_id` |
| `GET /monitoring/task_status/{task_id}` (poll result) | ✅ Implemented | Status + result polling |
| `GET /predict/model/info` (MLflow registry metadata) | ✅ Implemented | Shows loaded model version |
| `GET /healthcheck/` (liveness; single health probe) | ✅ Implemented | K8s liveness probe; no separate readiness endpoint |
| `GET /metrics` (Prometheus) | ✅ Implemented | 8 counters/histograms/gauges |
| Pydantic request validation | ✅ Implemented | `src/app/schemas/predict.py` |
| Model lazy-loading from MLflow registry | ✅ Implemented | Once per Celery worker process |
| Streamlit UI (match list + predictions + polling) | ✅ Implemented | `src/ui/` |
| HTTP batch endpoint (`POST /predict/batch`) | 📋 Planned | Batch parquet exists; no HTTP endpoint |
| Grafana dashboard for API metrics | 📋 Planned | Prometheus collecting; dashboards not yet deployed |
| Docker image | ✅ Built | Multi-stage build |
| Kubernetes deployment + HPA | ✅ Deployed | 2 API pods, 2 Celery workers |
| Helm chart | ✅ Complete | Parameterized values |
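For orientation, here is a minimal sketch of what the request/response schemas in `src/app/schemas/predict.py` might look like. The field names are illustrative assumptions, not the actual schema; only the file path and the fact that the sync endpoint returns probabilities are documented above.

```python
from pydantic import BaseModel, Field


class PredictRequest(BaseModel):
    """Illustrative request body for POST /predict (field names assumed)."""
    home_team: str = Field(..., min_length=1)
    away_team: str = Field(..., min_length=1)
    match_date: str  # e.g. "2024-05-01"


class PredictResponse(BaseModel):
    """Illustrative response body; the endpoint returns probabilities."""
    home_win: float = Field(..., ge=0.0, le=1.0)
    draw: float = Field(..., ge=0.0, le=1.0)
    away_win: float = Field(..., ge=0.0, le=1.0)
```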
## Inference flow (sync path)
```text
User → POST /predict
  ↓
Pydantic validation
  ↓
Celery task enqueued (ml queue)
  ↓
Worker: load model from MLflow registry (lazy, cached per worker)
  ↓
Worker: compute features → model.predict_proba()
  ↓
Task result returned via Celery result backend
  ↓
FastAPI returns JSON response (≤ 30s timeout)
```
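A minimal sketch of this path is below. The `ml` queue, the per-worker lazy model load, the call to `predict_proba()`, and the 30s timeout come from this page; the task name `app.tasks.predict`, the broker/backend URLs, the registry URI `models:/predictions/Production`, and the single-row feature frame are assumptions for illustration.

```python
import mlflow.sklearn
import pandas as pd
from celery import Celery
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import FastAPI, HTTPException

app = FastAPI()
# Broker and result-backend URLs are illustrative assumptions.
celery_app = Celery(
    "app",
    broker="amqp://rabbitmq:5672//",
    backend="redis://redis:6379/0",
)

_model = None  # cached once per Celery worker process


def get_model():
    """Lazy-load the model from the MLflow registry on first use."""
    global _model
    if _model is None:
        # "models:/predictions/Production" is an assumed registry URI.
        _model = mlflow.sklearn.load_model("models:/predictions/Production")
    return _model


@celery_app.task(name="app.tasks.predict")
def predict_task(payload: dict) -> dict:
    model = get_model()
    features = pd.DataFrame([payload])  # real feature engineering elided
    proba = model.predict_proba(features)[0]
    return {"probabilities": proba.tolist()}


@app.post("/predict")
def predict(payload: dict) -> dict:
    # Enqueue on the "ml" queue and block on the result backend, as in the flow above.
    async_result = predict_task.apply_async(args=[payload], queue="ml")
    try:
        return async_result.get(timeout=30)  # the 30s sync timeout
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="Prediction timed out")
```

Blocking on `AsyncResult.get()` inside the request handler is what makes this path synchronous from the caller's point of view; the async path below avoids that wait.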
## Inference flow (async path)
```text
User → POST /predict/async/ → {"task_id": "abc-123", "status": "queued"}
  ↓
User → GET /monitoring/task_status/abc-123 → {"status": "pending"}
  ↓ (after completion)
User → GET /monitoring/task_status/abc-123 → {"status": "success", "result": {...}}
```
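A hypothetical client-side polling loop for this path. The endpoint paths and response shapes mirror the exchange above; the request payload, base URL, and poll interval are arbitrary choices for the sketch.

```python
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed service address

# Kick off an async prediction; the payload is illustrative.
resp = requests.post(f"{BASE_URL}/predict/async/", json={"match_id": 123})
task_id = resp.json()["task_id"]

# Poll until the task succeeds or we give up (~30s here, arbitrarily).
for _ in range(30):
    status = requests.get(f"{BASE_URL}/monitoring/task_status/{task_id}").json()
    if status["status"] == "success":
        print(status["result"])
        break
    time.sleep(1)
else:
    print(f"task {task_id} did not finish in time")
```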
## Known limitations
- No HTTP batch endpoint yet; batch inference runs via the DVC pipeline only.
- RabbitMQ runs as a single broker (no clustering), so a task backlog is possible under sustained load.
- Model promotion from Staging to Production requires manual approval.
- Evidently drift detection is not yet integrated; model degradation is currently surfaced through monitoring metrics only.
## SLO targets (informal)
| SLO | Target | Status |
|---|---|---|
| Sync p50 latency | ≤ 50ms | Not formally measured |
| Sync p99 latency | ≤ 200ms | Not formally measured |
| Service availability (30d) | ≥ 99% | Not formally tracked |
A load-test baseline is available via `tests/load/locustfile.py`.
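The contents of that locustfile are not reproduced on this page; the sketch below shows one plausible shape for it, with task weights and payload fields invented for illustration (the endpoint paths match the table above).

```python
from locust import HttpUser, between, task


class PredictUser(HttpUser):
    """Drives the sync predict endpoint plus the health probe."""
    wait_time = between(0.5, 2.0)

    @task(3)
    def sync_predict(self):
        # Payload fields are illustrative (see the schema sketch above).
        self.client.post(
            "/predict",
            json={"home_team": "A", "away_team": "B", "match_date": "2024-05-01"},
        )

    @task(1)
    def healthcheck(self):
        self.client.get("/healthcheck/")
```

Running `locust -f tests/load/locustfile.py --host http://localhost:8000` against a local deployment would give a first read on the p50/p99 targets above.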