
Common Failures & Troubleshooting

This runbook lists frequent failure scenarios and recommended investigation steps.


High latency on /predict

Possible causes:
- increased load
- increased model complexity
- resource saturation

Actions:
- check CPU and memory usage
- inspect the active model version
- scale API replicas if needed
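
The triage order above can be sketched in Python. This is a minimal illustration, not the service's actual tooling: the thresholds `P95_LIMIT_MS` and `CPU_LIMIT` and the function names are hypothetical, and the latency samples and CPU figure are assumed to come from whatever monitoring is already in place.

```python
import statistics

# Hypothetical thresholds; tune them to the service's own SLOs.
P95_LIMIT_MS = 500.0
CPU_LIMIT = 0.80


def p95(latencies_ms):
    """Return the 95th percentile of a list of latency samples (ms)."""
    # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=20)[-1]


def triage(latencies_ms, cpu_utilization):
    """Map latency and CPU readings to the next runbook action."""
    if p95(latencies_ms) <= P95_LIMIT_MS:
        return "ok"
    if cpu_utilization >= CPU_LIMIT:
        # High latency with saturated CPU points at load or resources.
        return "scale replicas"
    # High latency without saturation: suspect a heavier model version.
    return "inspect model version"
```

Separating the two high-latency cases avoids scaling replicas when the slowdown actually comes from a newly deployed, more complex model.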


Async queue backlog growing

Possible causes:
- insufficient workers
- stuck or failing tasks

Actions:
- inspect Celery worker logs
- scale workers
- check retry and DLQ metrics
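
To judge whether the backlog is really growing and how many workers would keep up, a rough calculation like the following may help. It is a sketch under stated assumptions: the queue depth samples and the arrival and processing rates are taken to be available from existing metrics, and both function names are hypothetical.

```python
import math


def backlog_trend(depths):
    """Average change in queue depth per sampling interval."""
    if len(depths) < 2:
        return 0.0
    return (depths[-1] - depths[0]) / (len(depths) - 1)


def workers_needed(arrival_rate, per_worker_rate):
    """Smallest worker count whose combined throughput covers arrivals.

    Both rates are in tasks per second.
    """
    if per_worker_rate <= 0:
        raise ValueError("per_worker_rate must be positive")
    return math.ceil(arrival_rate / per_worker_rate)
```

A positive trend with healthy workers suggests under-provisioning. A positive trend while `per_worker_rate` has dropped is more consistent with stuck or failing tasks, in which case scaling alone is unlikely to help.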


Data contract failures

Possible causes:
- upstream schema changes
- incomplete scraping

Actions:
- inspect failing expectations
- block downstream pipelines until the data passes validation
- coordinate the schema update with the upstream owner
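
Blocking downstream pipelines on a contract failure can be illustrated with a plain-Python stand-in for the validation tool. The column names in `EXPECTED_COLUMNS` and both function names are hypothetical; the real contract lives wherever the project's expectations are defined.

```python
# Hypothetical contract: column name -> required Python type.
EXPECTED_COLUMNS = {"id": int, "price": float, "scraped_at": str}


def contract_violations(records, expected=EXPECTED_COLUMNS):
    """Return (row index, column, reason) for every violated expectation."""
    violations = []
    for index, record in enumerate(records):
        for column, required_type in expected.items():
            if record.get(column) is None:
                # Covers both renamed/dropped columns and incomplete scrapes.
                violations.append((index, column, "missing"))
            elif not isinstance(record[column], required_type):
                violations.append((index, column, "wrong type"))
    return violations


def gate_downstream(records):
    """Raise instead of letting invalid data reach downstream pipelines."""
    violations = contract_violations(records)
    if violations:
        raise RuntimeError(f"data contract failed: {violations}")
    return records
```

A column that is missing from every row usually points to an upstream schema change, while scattered missing values are more typical of incomplete scraping.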


MLflow unavailable

Possible causes:
- service downtime
- network issues

Actions:
- verify registry availability
- prevent new deployments
- restore service before continuing
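
A deployment gate based on registry availability might look like the sketch below. It assumes the MLflow server answers `GET /health` at its base URL. The function names, retry count, and delay are illustrative choices, not part of this runbook.

```python
import time
import urllib.request


def registry_is_up(base_url, timeout=3.0):
    """Probe the registry; assumes the server exposes GET /health."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, connection refusals, and timeouts are all OSError subclasses.
        return False


def deployment_allowed(probe, attempts=3, delay_s=0.0):
    """Allow deployment only if the probe succeeds within the given attempts."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # brief pause to ride out transient network issues
    return False
```

In a deployment script this could be called as `deployment_allowed(lambda: registry_is_up(url), delay_s=5.0)`, where `url` is the registry address used in the environment. Retrying a few times separates a brief network blip from real downtime.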