# Current Serving Status
This page is the source of truth for what the inference API does today.
## Component status
| Component | Status | Notes |
|---|---|---|
| `POST /predict` (sync, Celery `ml` queue, 30s timeout) | ✅ Implemented | Returns probabilities |
| `GET /predict/{match_id}` (lookup from batch parquet) | ✅ Implemented | Batch result retrieval |
| `POST /predict/async/` (async Celery job) | ✅ Implemented | Returns `task_id` |
| `GET /monitoring/task_status/{task_id}` (poll result) | ✅ Implemented | Status + result polling |
| `GET /predict/model/info` (MLflow registry metadata) | ✅ Implemented | Shows loaded model version |
| `GET /healthcheck/` (liveness; single health probe) | ✅ Implemented | K8s liveness probe; no separate readiness endpoint |
| `GET /metrics` (Prometheus) | ✅ Implemented | 8 counters/histograms/gauges |
| Pydantic request validation | ✅ Implemented | `src/app/schemas/predict.py` |
| Model lazy-loading from MLflow registry | ✅ Implemented | Once per Celery worker process |
| Streamlit UI (match list + predictions + polling) | ✅ Implemented | `src/ui/` |
| HTTP batch endpoint (`POST /predict/batch`) | 📋 Planned | Batch parquet exists; no HTTP endpoint |
| Grafana dashboard for API metrics | 📋 Planned | Prometheus collecting; dashboards not yet deployed |
| Docker image | ✅ Built | Multi-stage build |
| Kubernetes deployment + HPA | ✅ Deployed | 2 API pods, 2 Celery workers |
| Helm chart | ✅ Complete | Parameterized values |
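For orientation, here is a minimal sketch of what the request/response schemas in `src/app/schemas/predict.py` might look like. The field names are illustrative assumptions, not the actual schema; only the file path and the fact that the sync endpoint returns probabilities are documented above.

```python
from pydantic import BaseModel, Field


class PredictRequest(BaseModel):
    """Illustrative request body for POST /predict (field names assumed)."""
    home_team: str = Field(..., min_length=1)
    away_team: str = Field(..., min_length=1)
    match_date: str  # e.g. "2024-05-01"


class PredictResponse(BaseModel):
    """Illustrative response body; the endpoint returns probabilities."""
    home_win: float = Field(..., ge=0.0, le=1.0)
    draw: float = Field(..., ge=0.0, le=1.0)
    away_win: float = Field(..., ge=0.0, le=1.0)
```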
## Inference flow (sync path)
```text
User → POST /predict
  ↓
Pydantic validation
  ↓
Celery task enqueued (ml queue)
  ↓
Worker: load model from MLflow registry (lazy, cached per worker)
  ↓
Worker: compute features → model.predict_proba()
  ↓
Task result returned via Celery result backend
  ↓
FastAPI returns JSON response (≤ 30s timeout)
```
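A minimal sketch of this path is below. The `ml` queue, the per-worker lazy model load, the call to `predict_proba()`, and the 30s timeout come from this page; the task name `app.tasks.predict`, the broker/backend URLs, the registry URI `models:/predictions/Production`, and the single-row feature frame are assumptions for illustration.

```python
import mlflow.sklearn
import pandas as pd
from celery import Celery
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import FastAPI, HTTPException

app = FastAPI()
# Broker and result-backend URLs are illustrative assumptions.
celery_app = Celery(
    "app",
    broker="amqp://rabbitmq:5672//",
    backend="redis://redis:6379/0",
)

_model = None  # cached once per Celery worker process


def get_model():
    """Lazy-load the model from the MLflow registry on first use."""
    global _model
    if _model is None:
        # "models:/predictions/Production" is an assumed registry URI.
        _model = mlflow.sklearn.load_model("models:/predictions/Production")
    return _model


@celery_app.task(name="app.tasks.predict")
def predict_task(payload: dict) -> dict:
    model = get_model()
    features = pd.DataFrame([payload])  # real feature engineering elided
    proba = model.predict_proba(features)[0]
    return {"probabilities": proba.tolist()}


@app.post("/predict")
def predict(payload: dict) -> dict:
    # Enqueue on the "ml" queue and block on the result backend, as in the flow above.
    async_result = predict_task.apply_async(args=[payload], queue="ml")
    try:
        return async_result.get(timeout=30)  # the 30s sync timeout
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="Prediction timed out")
```

Blocking on `AsyncResult.get()` inside the request handler is what makes this path synchronous from the caller's point of view; the async path below avoids that wait.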
## Inference flow (async path)
```text
User → POST /predict/async/ → {"task_id": "abc-123", "status": "queued"}
  ↓
User → GET /monitoring/task_status/abc-123 → {"status": "pending"}
  ↓ (after completion)
User → GET /monitoring/task_status/abc-123 → {"status": "success", "result": {...}}
```
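A hypothetical client-side polling loop for this path. The endpoint paths and response shapes mirror the exchange above; the request payload, base URL, and poll interval are arbitrary choices for the sketch.

```python
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed service address

# Kick off an async prediction; the payload is illustrative.
resp = requests.post(f"{BASE_URL}/predict/async/", json={"match_id": 123})
task_id = resp.json()["task_id"]

# Poll until the task succeeds or we give up (~30s here, arbitrarily).
for _ in range(30):
    status = requests.get(f"{BASE_URL}/monitoring/task_status/{task_id}").json()
    if status["status"] == "success":
        print(status["result"])
        break
    time.sleep(1)
else:
    print(f"task {task_id} did not finish in time")
```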
## Known limitations
- No HTTP batch endpoint yet; batch inference runs via the DVC pipeline only.
- RabbitMQ runs as a single broker (no clustering), so a task backlog is possible under sustained load.
- Model promotion from Staging to Production requires manual approval.
- Evidently drift detection is not yet integrated; model degradation is currently surfaced through monitoring metrics only.
## SLO targets (informal)
| SLO | Target | Status |
|---|---|---|
| Sync p50 latency | ≤ 50ms | Not formally measured |
| Sync p99 latency | ≤ 200ms | Not formally measured |
| Service availability (30d) | ≥ 99% | Not formally tracked |
A load-test baseline is available via `tests/load/locustfile.py`.
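The contents of that locustfile are not reproduced on this page; the sketch below shows one plausible shape for it, with task weights and payload fields invented for illustration (the endpoint paths match the table above).

```python
from locust import HttpUser, between, task


class PredictUser(HttpUser):
    """Drives the sync predict endpoint plus the health probe."""
    wait_time = between(0.5, 2.0)

    @task(3)
    def sync_predict(self):
        # Payload fields are illustrative (see the schema sketch above).
        self.client.post(
            "/predict",
            json={"home_team": "A", "away_team": "B", "match_date": "2024-05-01"},
        )

    @task(1)
    def healthcheck(self):
        self.client.get("/healthcheck/")
```

Running `locust -f tests/load/locustfile.py --host http://localhost:8000` against a local deployment would give a first read on the p50/p99 targets above.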