
Current Serving Status

This page is the authoritative source of truth for what the inference API does today.


Component status

| Component | Status | Notes |
|---|---|---|
| `POST /predict` (sync, Celery `ml` queue, 30 s timeout) | ✅ Implemented | Returns probabilities |
| `GET /predict/{match_id}` (lookup from batch parquet) | ✅ Implemented | Batch result retrieval |
| `POST /predict/async/` (async Celery job) | ✅ Implemented | Returns `task_id` |
| `GET /monitoring/task_status/{task_id}` (poll result) | ✅ Implemented | Status + result polling |
| `GET /predict/model/info` (MLflow registry metadata) | ✅ Implemented | Shows loaded model version |
| `GET /healthcheck/` (liveness, single health probe) | ✅ Implemented | K8s liveness probe; no separate readiness endpoint |
| `GET /metrics` (Prometheus) | ✅ Implemented | 8 counters/histograms/gauges |
| Pydantic request validation | ✅ Implemented | `src/app/schemas/predict.py` |
| Model lazy-loading from MLflow registry | ✅ Implemented | Once per Celery worker process |
| Streamlit UI (match list + predictions + polling) | ✅ Implemented | `src/ui/` |
| HTTP batch endpoint (`POST /predict/batch`) | 📋 Planned | Batch parquet exists; no HTTP endpoint |
| Grafana dashboard for API metrics | 📋 Planned | Prometheus collecting; dashboards not yet deployed |
| Docker image | ✅ Built | Multi-stage build |
| Kubernetes deployment + HPA | ✅ Deployed | 2 API pods, 2 Celery workers |
| Helm chart | ✅ Complete | Parameterized values |
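The actual request schema lives in `src/app/schemas/predict.py` and is not reproduced here. As a rough sketch of the validation layer, a Pydantic model of this shape rejects malformed payloads before any Celery task is enqueued (the field names `home_team` and `away_team` are assumptions for illustration, not the real schema):

```python
from pydantic import BaseModel, ValidationError


class PredictRequest(BaseModel):
    """Hypothetical request body for POST /predict (field names assumed)."""
    home_team: str
    away_team: str


# A payload missing a required field fails validation at the API boundary,
# so no task is enqueued for it.
try:
    PredictRequest(home_team="Arsenal")  # away_team missing
except ValidationError as exc:
    print("rejected:", exc.errors()[0]["loc"])
```

FastAPI runs this validation automatically when the model is declared as the endpoint's request body, returning a 422 response for invalid input.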

Inference flow (sync path)

1. User → `POST /predict`
2. Pydantic validation
3. Celery task enqueued (`ml` queue)
4. Worker: load model from MLflow registry (lazy, cached per worker)
5. Worker: compute features → `model.predict_proba()`
6. Task result returned via Celery result backend
7. FastAPI returns JSON response (≤ 30 s timeout)
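The lazy per-worker load in step 4 can be sketched as a module-level cache: the first task in each worker process pays the registry load, later tasks reuse the cached instance. The stub below stands in for the real MLflow registry call and is not the project's actual loader:

```python
import functools

LOAD_CALLS = 0  # instrumentation for this sketch only


@functools.lru_cache(maxsize=1)
def get_model():
    """Load the model once per worker process, then reuse the cached instance."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Real code would load from the MLflow registry here (by registered model
    # name/stage); this stub returns a trivial probability predictor.
    return lambda features: {"home_win": 0.5, "draw": 0.25, "away_win": 0.25}


def predict_task(features):
    """Celery-task-shaped function: resolve the cached model, run inference."""
    model = get_model()
    return model(features)


print(predict_task({"x": 1}), predict_task({"x": 2}), "loads:", LOAD_CALLS)
```

Because `lru_cache` state is per process, each Celery worker loads the model exactly once, regardless of how many tasks it serves.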

Inference flow (async path)

User → POST /predict/async/  →  {"task_id": "abc-123", "status": "queued"}
User → GET /monitoring/task_status/abc-123  →  {"status": "pending"}
                                            ↓ (after completion)
User → GET /monitoring/task_status/abc-123  →  {"status": "success", "result": {...}}
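A client for the async path submits once and then polls until the task reports success. The sketch below fakes the status endpoint with a stub that returns `pending` twice before `success` (a real client would issue HTTP requests to `POST /predict/async/` and `GET /monitoring/task_status/{task_id}`; the response shapes are taken from the flow above):

```python
import itertools
import time

# Stubbed status sequence: what GET /monitoring/task_status/{task_id}
# would return on successive polls.
_responses = itertools.chain(
    [{"status": "pending"}, {"status": "pending"}],
    itertools.repeat({"status": "success", "result": {"home_win": 0.61}}),
)


def get_task_status(task_id):
    """Stand-in for the HTTP call to the monitoring endpoint."""
    return next(_responses)


def wait_for_result(task_id, poll_interval=0.0, timeout=5.0):
    """Poll until the task reports success, or give up at the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        body = get_task_status(task_id)
        if body["status"] == "success":
            return body["result"]
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")


print(wait_for_result("abc-123"))
```

In practice the poll interval should be non-trivial (e.g. 0.5–1 s with backoff) to avoid hammering the API while tasks sit in the queue.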

Known limitations

  • HTTP batch endpoint not yet exposed — batch inference runs via DVC pipeline only.
  • RabbitMQ is single-broker (no clustering); task backlog possible under sustained load.
  • Model promotion from Staging to Production requires manual approval.
  • Evidently drift detection not yet integrated — model degradation detected by monitoring metrics only.

SLO targets (informal)

| SLO | Target | Status |
|---|---|---|
| Sync p50 latency | ≤ 50 ms | Not formally measured |
| Sync p99 latency | ≤ 200 ms | Not formally measured |
| Service availability (30 d) | ≥ 99% | Not formally tracked |

A load-test baseline is available via `tests/load/locustfile.py`.