
Health Checks & Failure Modes

This page documents the serving layer's health surface, how Kubernetes probes use it, and how the API behaves under each failure condition.

For system-wide failure mode coverage see Architecture: Failure Modes.


Health endpoint

GET /healthcheck/

The only health endpoint currently implemented. Returns 200 OK as long as the FastAPI process is alive.

{
  "status": "ok",
  "worker_id": "celery@worker-ml-abc123",
  "memory_usage_mb": 210.4
}

Kubernetes uses this endpoint as the pod liveness probe. When the probe fails repeatedly, the kubelet restarts the container, and the Ingress Controller stops routing traffic to the pod until it is back up.

Note: there is no separate readiness endpoint today. The liveness probe at /healthcheck/ serves both purposes in the current deployment.
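
For reference, a minimal sketch of such a handler, assuming FastAPI and psutil; the worker_id and memory lookups here are illustrative assumptions, not the production implementation:

# Hypothetical sketch of the liveness endpoint, not the production handler.
import os

import psutil
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthcheck/")
def healthcheck() -> dict:
    """Return 200 with basic process info as long as the process can respond."""
    rss_bytes = psutil.Process(os.getpid()).memory_info().rss
    return {
        "status": "ok",
        "worker_id": os.environ.get("HOSTNAME", "unknown"),  # assumption: pod hostname
        "memory_usage_mb": round(rss_bytes / (1024 * 1024), 1),
    }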


Kubernetes probe configuration

livenessProbe:
  httpGet:
    path: /healthcheck/
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
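
With these settings, an unresponsive pod is restarted after roughly 45 seconds of consecutive probe failures (failureThreshold 3 × periodSeconds 15 s), and probing only begins 10 seconds after container start.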

Degraded-mode behavior by failure type

MLflow Registry unavailable

Phase           | Behavior
Worker startup  | Worker fails to load model; pod enters CrashLoopBackOff; K8s restarts it automatically
Running workers | Already-loaded model continues to serve requests; no impact until pod restart

Recovery: restart the MLflow pod; running workers are unaffected until their next process start. There is no automated recovery beyond the K8s restart policy.
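
The hard startup failure follows from loading the model eagerly when the worker process boots. A minimal sketch, assuming the model is pulled with mlflow.pyfunc.load_model (the model URI is a placeholder):

# Illustrative worker startup; model name and stage are placeholders.
import sys

import mlflow

MODEL_URI = "models:/match-predictor/Production"  # hypothetical registry URI

try:
    model = mlflow.pyfunc.load_model(MODEL_URI)
except Exception as exc:
    # Registry unreachable: exit non-zero so the pod fails to start,
    # enters CrashLoopBackOff, and is retried by Kubernetes.
    print(f"Failed to load model from MLflow Registry: {exc}", file=sys.stderr)
    sys.exit(1)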


RabbitMQ unavailable

Impact path             | Behavior
POST /predict/ (sync)   | FastAPI cannot enqueue task; returns 500 or 503
POST /predict/async/    | Same; task submission fails
GET /predict/{match_id} | Unaffected; reads from batch parquet directly
GET /healthcheck/       | Unaffected; process is alive

Celery workers reconnect automatically when RabbitMQ recovers. Because RabbitMQ runs as a single broker, its unavailability is a P1 event (full inference outage).
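
A sketch of the enqueue failure path, assuming the endpoint submits a Celery task with .delay() and maps broker connection errors to 503; the task module and names are hypothetical:

# Illustrative handler; the task module and status code choice are assumptions.
from fastapi import FastAPI, HTTPException
from kombu.exceptions import OperationalError

from worker.tasks import predict_task  # hypothetical Celery task

app = FastAPI()

@app.post("/predict/async/")
def predict_async(payload: dict):
    try:
        result = predict_task.delay(payload)
    except OperationalError:
        # RabbitMQ unreachable: the task cannot be enqueued.
        raise HTTPException(status_code=503, detail="inference queue unavailable")
    return {"task_id": result.id}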


Redis unavailable

Redis failures are handled gracefully:

  • Cache GET/SET failures are caught and logged as errors.
  • Inference proceeds via the full Celery path on every request (no cache hits).
  • No data loss; no incorrect predictions.
  • Performance degrades (higher latency; more load on workers).

Recovery: Redis pod restarts; cache warms automatically on subsequent requests.
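
A sketch of the graceful-degradation pattern, assuming a redis-py client and a read-through helper; host, port, and key scheme are assumptions:

# Illustrative cache read; connection details and key scheme are assumptions.
import logging

import redis

logger = logging.getLogger(__name__)
cache = redis.Redis(host="redis", port=6379)

def get_cached_prediction(cache_key: str):
    """Return the cached value, or None so the caller falls back to the Celery path."""
    try:
        return cache.get(cache_key)
    except redis.exceptions.RedisError as exc:
        # Redis down: log and treat as a cache miss; never fail the request.
        logger.error("Cache read failed: %s", exc)
        return None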


Celery worker unavailable (async path only)

  • POST /predict/async/ enqueues successfully; task stays queued.
  • GET /monitoring/task_status/{task_id} returns pending indefinitely.
  • POST /predict/ (sync) blocks and returns 504 Gateway Timeout after 30 s.

Recovery: when a worker restarts, it picks up queued tasks from RabbitMQ. The dead-letter queue captures tasks that exhaust their retry budget.
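
A sketch of the sync-path timeout, assuming the handler blocks on AsyncResult.get with a 30-second limit; the task module and wiring are hypothetical:

# Illustrative sync handler; task module and timeout wiring are assumptions.
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import FastAPI, HTTPException

from worker.tasks import predict_task  # hypothetical Celery task

app = FastAPI()

@app.post("/predict/")
def predict(payload: dict):
    async_result = predict_task.delay(payload)
    try:
        # Blocks until a worker picks up the task; with no worker alive,
        # this times out and the API surfaces a 504.
        return async_result.get(timeout=30)
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="inference timed out")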


Batch parquet unavailable

  • GET /predict/{match_id} returns 404 or 500.
  • POST /predict/ (inline features) is unaffected.

Recovery: re-run the batch_inference DVC stage.
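
A sketch of the batch lookup failure path, assuming the endpoint reads the parquet output directly with pandas; the file path and column name are assumptions:

# Illustrative batch lookup; parquet path and column name are assumptions.
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI()

BATCH_PREDICTIONS_PATH = "data/predictions/batch.parquet"  # hypothetical path

@app.get("/predict/{match_id}")
def get_batch_prediction(match_id: str):
    try:
        df = pd.read_parquet(BATCH_PREDICTIONS_PATH)
    except (FileNotFoundError, OSError):
        # Batch output missing: nothing to serve until the DVC stage is re-run.
        raise HTTPException(status_code=404, detail="batch predictions unavailable")
    rows = df[df["match_id"] == match_id]
    if rows.empty:
        raise HTTPException(status_code=404, detail="match not found")
    record = rows.iloc[0].to_dict()
    # Cast numpy scalars to plain Python types for JSON serialization.
    return {k: (v.item() if hasattr(v, "item") else v) for k, v in record.items()}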


Failure severity quick reference

Failure               | Severity | Inference impact
FastAPI pod crash     | P1       | Full API unreachable
RabbitMQ down         | P1       | All sync + async inference fails
MLflow unavailable    | P2       | New workers cannot load model; running workers unaffected
Redis down            | P2       | Inference continues; performance degrades (cache miss on all requests)
Worker unavailable    | P2       | Async tasks queue; sync requests time out after 30 s
Batch parquet missing | P3       | Batch lookup fails; inline inference unaffected

See Architecture: Failure Modes for full severity definitions and recovery runbook pointers.