
Health Checks & Failure Modes

This page documents the serving layer's health surface, how Kubernetes probes use it, and how the API behaves under each failure condition.

For system-wide failure mode coverage see Architecture: Failure Modes.


Health endpoint

GET /healthcheck/

The only health endpoint currently implemented. Returns 200 OK as long as the FastAPI process is alive.

{
  "status": "ok",
  "worker_id": "celery@worker-ml-abc123",
  "memory_usage_mb": 210.4
}

Kubernetes uses this endpoint as the pod liveness probe. When the probe fails repeatedly, the kubelet restarts the container, and the Ingress Controller stops routing traffic to the pod until it is back up.

Note: there is no separate readiness endpoint today. The liveness probe at /healthcheck/ serves both purposes in the current deployment.
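
For reference, a minimal sketch of such a handler, assuming FastAPI and psutil; the worker_id and memory lookups here are illustrative assumptions, not the production implementation:

# Hypothetical sketch of the liveness endpoint, not the production handler.
import os

import psutil
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthcheck/")
def healthcheck() -> dict:
    """Return 200 with basic process info as long as the process can respond."""
    rss_bytes = psutil.Process(os.getpid()).memory_info().rss
    return {
        "status": "ok",
        "worker_id": os.environ.get("HOSTNAME", "unknown"),  # assumption: pod hostname
        "memory_usage_mb": round(rss_bytes / (1024 * 1024), 1),
    }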


Kubernetes probe configuration

livenessProbe:
  httpGet:
    path: /healthcheck/
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
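
With these settings, an unresponsive pod is restarted after roughly 45 seconds of consecutive probe failures (failureThreshold 3 × periodSeconds 15 s), and probing only begins 10 seconds after container start.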

Degraded-mode behavior by failure type

MLflow Registry unavailable

Phase           | Behavior
Worker startup  | Worker fails to load model; pod enters CrashLoopBackOff; K8s restarts it automatically
Running workers | Already-loaded model continues to serve requests; no impact until pod restart

Recovery: restart the MLflow pod; running workers are unaffected until their next process start. There is no automated recovery beyond the K8s restart policy.
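
The hard startup failure follows from loading the model eagerly when the worker process boots. A minimal sketch, assuming the model is pulled with mlflow.pyfunc.load_model (the model URI is a placeholder):

# Illustrative worker startup; model name and stage are placeholders.
import sys

import mlflow

MODEL_URI = "models:/match-predictor/Production"  # hypothetical registry URI

try:
    model = mlflow.pyfunc.load_model(MODEL_URI)
except Exception as exc:
    # Registry unreachable: exit non-zero so the pod fails to start,
    # enters CrashLoopBackOff, and is retried by Kubernetes.
    print(f"Failed to load model from MLflow Registry: {exc}", file=sys.stderr)
    sys.exit(1)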


RabbitMQ unavailable

Impact path             | Behavior
POST /predict/ (sync)   | FastAPI cannot enqueue task; returns 500 or 503
POST /predict/async/    | Same; task submission fails
GET /predict/{match_id} | Unaffected; reads from batch parquet directly
GET /healthcheck/       | Unaffected; process is alive

Celery workers reconnect automatically when RabbitMQ recovers. Because RabbitMQ runs as a single broker, its unavailability is a P1 event (full inference outage).
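
A sketch of the enqueue failure path, assuming the endpoint submits a Celery task with .delay() and maps broker connection errors to 503; the task module and names are hypothetical:

# Illustrative handler; the task module and status code choice are assumptions.
from fastapi import FastAPI, HTTPException
from kombu.exceptions import OperationalError

from worker.tasks import predict_task  # hypothetical Celery task

app = FastAPI()

@app.post("/predict/async/")
def predict_async(payload: dict):
    try:
        result = predict_task.delay(payload)
    except OperationalError:
        # RabbitMQ unreachable: the task cannot be enqueued.
        raise HTTPException(status_code=503, detail="inference queue unavailable")
    return {"task_id": result.id}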


Redis unavailable

Redis failures are handled gracefully:

  • Cache GET/SET failures are caught and logged as errors.
  • Inference proceeds via the full Celery path on every request (no cache hits).
  • No data loss; no incorrect predictions.
  • Performance degrades (higher latency; more load on workers).

Recovery: Redis pod restarts; cache warms automatically on subsequent requests.
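
A sketch of the graceful-degradation pattern, assuming a redis-py client and a read-through helper; host, port, and key scheme are assumptions:

# Illustrative cache read; connection details and key scheme are assumptions.
import logging

import redis

logger = logging.getLogger(__name__)
cache = redis.Redis(host="redis", port=6379)

def get_cached_prediction(cache_key: str):
    """Return the cached value, or None so the caller falls back to the Celery path."""
    try:
        return cache.get(cache_key)
    except redis.exceptions.RedisError as exc:
        # Redis down: log and treat as a cache miss; never fail the request.
        logger.error("Cache read failed: %s", exc)
        return None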


Celery worker unavailable (async path only)

  • POST /predict/async/ enqueues successfully; task stays queued.
  • GET /monitoring/task_status/{task_id} returns pending indefinitely.
  • POST /predict/ (sync) blocks and returns 504 Gateway Timeout after 30 s.

Recovery: when a worker restarts, it picks up queued tasks from RabbitMQ. The dead-letter queue captures tasks that exhaust their retry budget.
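
A sketch of the sync-path timeout, assuming the handler blocks on AsyncResult.get with a 30-second limit; the task module and wiring are hypothetical:

# Illustrative sync handler; task module and timeout wiring are assumptions.
from celery.exceptions import TimeoutError as CeleryTimeoutError
from fastapi import FastAPI, HTTPException

from worker.tasks import predict_task  # hypothetical Celery task

app = FastAPI()

@app.post("/predict/")
def predict(payload: dict):
    async_result = predict_task.delay(payload)
    try:
        # Blocks until a worker picks up the task; with no worker alive,
        # this times out and the API surfaces a 504.
        return async_result.get(timeout=30)
    except CeleryTimeoutError:
        raise HTTPException(status_code=504, detail="inference timed out")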


Batch parquet unavailable

  • GET /predict/{match_id} returns 404 or 500.
  • POST /predict/ (inline features) is unaffected.

Recovery: re-run the batch_inference DVC stage.
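
A sketch of the batch lookup failure path, assuming the endpoint reads the parquet output directly with pandas; the file path and column name are assumptions:

# Illustrative batch lookup; parquet path and column name are assumptions.
import pandas as pd
from fastapi import FastAPI, HTTPException

app = FastAPI()

BATCH_PREDICTIONS_PATH = "data/predictions/batch.parquet"  # hypothetical path

@app.get("/predict/{match_id}")
def get_batch_prediction(match_id: str):
    try:
        df = pd.read_parquet(BATCH_PREDICTIONS_PATH)
    except (FileNotFoundError, OSError):
        # Batch output missing: nothing to serve until the DVC stage is re-run.
        raise HTTPException(status_code=404, detail="batch predictions unavailable")
    rows = df[df["match_id"] == match_id]
    if rows.empty:
        raise HTTPException(status_code=404, detail="match not found")
    record = rows.iloc[0].to_dict()
    # Cast numpy scalars to plain Python types for JSON serialization.
    return {k: (v.item() if hasattr(v, "item") else v) for k, v in record.items()}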


Failure severity quick reference

Failure               | Severity | Inference impact
FastAPI pod crash     | P1       | Full API unreachable
RabbitMQ down         | P1       | All sync + async inference fails
MLflow unavailable    | P2       | New workers cannot load model; running workers unaffected
Redis down            | P2       | Inference continues; performance degrades (cache miss on all requests)
Worker unavailable    | P2       | Async tasks queue; sync requests time out after 30 s
Batch parquet missing | P3       | Batch lookup fails; inline inference unaffected

See Architecture: Failure Modes for full severity definitions and recovery runbook pointers.