# Health Checks & Failure Modes
This page documents the serving layer's health surface, how Kubernetes probes use it, and how the API behaves under each failure condition.
For system-wide failure mode coverage see Architecture: Failure Modes.
## Health endpoint

### `GET /healthcheck/`
The only health probe implemented. Returns `200 OK` whenever the FastAPI process is alive.
Kubernetes uses it as the pod liveness probe. When the probe fails, the kubelet restarts the container; while the container is down the pod drops out of the Service endpoints, so the Ingress Controller stops routing traffic to it.
Note: there is no separate readiness endpoint today; the liveness probe at `/healthcheck/` serves both purposes in the current deployment.
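For reference, the handler can be a static response. A minimal sketch (the route path comes from this page; the handler body is an assumption):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthcheck/")
def healthcheck() -> dict:
    # Liveness only: responds as long as the process can serve requests.
    # No downstream dependencies (MLflow, RabbitMQ, Redis) are checked here.
    return {"status": "ok"}
```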
### Kubernetes probe configuration
```yaml
livenessProbe:
  httpGet:
    path: /healthcheck/
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
```

With these settings an unresponsive pod is restarted roughly 45 s after it stops answering (3 consecutive failures at a 15 s period).
## Degraded-mode behavior by failure type

### MLflow Registry unavailable
| Phase | Behavior |
|---|---|
| Worker startup | Worker fails to load model; pod enters CrashLoopBackOff; K8s restarts it automatically |
| Running workers | Already-loaded model continues to serve requests; no impact until pod restart |
Recovery: restart MLflow pod; running workers are unaffected until their next process start. No automated recovery beyond K8s restart policy.
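Why startup is the vulnerable phase, as a minimal sketch: assuming the worker loads its model from the registry at import time (the model URI below is hypothetical), a registry outage makes the process exit before it can serve anything, while a process that already holds the model keeps running.

```python
import mlflow.pyfunc
import pandas as pd

# Executed once at worker start: if the MLflow Registry is unreachable,
# load_model raises, the process exits, and Kubernetes restarts the pod
# (CrashLoopBackOff) until the registry is reachable again.
MODEL_URI = "models:/match-predictor/Production"  # hypothetical model name

model = mlflow.pyfunc.load_model(MODEL_URI)

def run_inference(features: dict) -> list:
    # An already-loaded model keeps serving even if the registry goes down later.
    return model.predict(pd.DataFrame([features])).tolist()
```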
### RabbitMQ unavailable
| Impact path | Behavior |
|---|---|
| `POST /predict/` (sync) | FastAPI cannot enqueue the task; returns 500 or 503 |
| `POST /predict/async/` | Same: task submission fails |
| `GET /predict/{match_id}` | Unaffected; reads from the batch parquet directly |
| `GET /healthcheck/` | Unaffected; the process is alive |
Celery workers reconnect automatically when RabbitMQ recovers. RabbitMQ is a single broker, so its unavailability is a P1 (full inference outage) event.
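A sketch of the submission failure mode, assuming the task is published with `.delay()` (task module, task name, and payload shape are hypothetical). Celery surfaces broker connection failures as `kombu.exceptions.OperationalError` once publish retries are exhausted:

```python
from fastapi import FastAPI, HTTPException
from kombu.exceptions import OperationalError

from worker.tasks import predict_task  # hypothetical Celery task

app = FastAPI()

@app.post("/predict/async/")
def predict_async(payload: dict) -> dict:
    try:
        result = predict_task.delay(payload)  # publishes to RabbitMQ
    except OperationalError:
        # Broker down: fail fast with 503 instead of an unhandled 500.
        raise HTTPException(status_code=503, detail="Task broker unavailable")
    return {"task_id": result.id}
```

Whether clients see 500 or 503 depends on whether this exception is caught, which matches the "500 or 503" row in the table above.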
### Redis unavailable
Redis failures are handled gracefully:
- Cache GET/SET failures are caught and logged as errors.
- Inference proceeds via the full Celery path on every request (no cache hits).
- No data loss; no incorrect predictions.
- Performance degrades (higher latency; more load on workers).
Recovery: Redis pod restarts; cache warms automatically on subsequent requests.
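As a sketch of the graceful handling described above, assuming a plain redis-py client (service name, port, and TTL are assumptions): cache errors are logged and downgraded to misses, never propagated to the caller.

```python
import logging

import redis

logger = logging.getLogger(__name__)
cache = redis.Redis(host="redis", port=6379)  # assumed service name/port

def cache_get(key: str) -> bytes | None:
    try:
        return cache.get(key)
    except redis.RedisError as exc:
        # Logged as an error, surfaced as a miss: inference falls back to
        # the full Celery path instead of failing the request.
        logger.error("Cache GET failed: %s", exc)
        return None

def cache_set(key: str, value: bytes, ttl: int = 3600) -> None:
    try:
        cache.set(key, value, ex=ttl)
    except redis.RedisError as exc:
        logger.error("Cache SET failed; result not cached: %s", exc)
```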
### Celery worker unavailable
- `POST /predict/async/` enqueues successfully; the task stays queued.
- `GET /monitoring/task_status/{task_id}` returns `pending` indefinitely.
- `POST /predict/` (sync) blocks and returns `504 Gateway Timeout` after 30 s.
Recovery: when a worker restarts, it picks up queued tasks from RabbitMQ. The dead-letter queue captures tasks that exhaust their retry budget.
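A sketch of the sync path's 30 s cutoff, assuming the API blocks on the Celery result (the task name and the 504 mapping are assumptions; `AsyncResult.get` raises `celery.exceptions.TimeoutError` when the timeout elapses):

```python
from celery.exceptions import TimeoutError as CeleryTimeout
from fastapi import FastAPI, HTTPException

from worker.tasks import predict_task  # hypothetical Celery task

app = FastAPI()

@app.post("/predict/")
def predict(payload: dict) -> dict:
    result = predict_task.delay(payload)  # enqueues even with no worker alive
    try:
        # Blocks until a worker picks up and finishes the task, or 30 s pass.
        return {"prediction": result.get(timeout=30)}
    except CeleryTimeout:
        raise HTTPException(status_code=504, detail="Prediction timed out")
```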
### Batch parquet unavailable
- `GET /predict/{match_id}` returns `404` or `500`.
- `POST /predict/` (inline features) is unaffected.
Recovery: re-run the `batch_inference` DVC stage.
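A sketch of the lookup's failure behavior, assuming the batch output is a single parquet file read with pandas (the file path and column name are assumptions): a missing file maps to an error response rather than an inference fallback.

```python
from pathlib import Path

import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI()
BATCH_PARQUET = Path("data/predictions/batch.parquet")  # assumed location

@app.get("/predict/{match_id}")
def get_batch_prediction(match_id: str) -> dict:
    if not BATCH_PARQUET.exists():
        # File missing entirely (e.g. the DVC stage has not run).
        raise HTTPException(status_code=404, detail="Batch predictions unavailable")
    df = pd.read_parquet(BATCH_PARQUET)
    rows = df.loc[df["match_id"] == match_id]
    if rows.empty:
        raise HTTPException(status_code=404, detail=f"No prediction for {match_id}")
    return rows.iloc[0].to_dict()
```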
## Failure severity quick reference
| Failure | Severity | Inference impact |
|---|---|---|
| FastAPI pod crash | P1 | Full API unreachable |
| RabbitMQ down | P1 | All sync + async inference fails |
| MLflow unavailable | P2 | New workers cannot load model; running workers unaffected |
| Redis down | P2 | Inference continues; performance degrades (cache miss on all requests) |
| Worker unavailable | P2 | Async tasks queue; sync requests time out after 30 s |
| Batch parquet missing | P3 | Batch lookup fails; inline inference unaffected |
See Architecture: Failure Modes for full severity definitions and recovery runbook pointers.