Common Failures & Troubleshooting
This runbook lists frequent failure scenarios and recommended investigation steps.
High latency on /predict
Possible causes:
- increased load
- increased model complexity
- resource saturation

Actions:
- check CPU/memory usage
- inspect the active model version
- scale API replicas if needed
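The checks above can be sketched as a small triage helper. This is a minimal illustration, not the project's tooling: `PredictSnapshot`, `triage_predict_latency`, and both thresholds are hypothetical names and values, and the metrics are assumed to come from whatever monitoring you already have.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own SLOs and instance sizes.
CPU_LIMIT_PCT = 85.0
MEM_LIMIT_PCT = 90.0


@dataclass
class PredictSnapshot:
    """Point-in-time readings for the /predict service (hypothetical shape)."""
    cpu_pct: float
    mem_pct: float
    active_model_version: str
    expected_model_version: str


def triage_predict_latency(snap: PredictSnapshot) -> list[str]:
    """Map observed readings to the runbook actions worth trying first."""
    actions = []
    # Resource saturation: CPU or memory is at or above its limit.
    if snap.cpu_pct >= CPU_LIMIT_PCT or snap.mem_pct >= MEM_LIMIT_PCT:
        actions.append("scale API replicas")
    # An unexpected model version may be heavier than the one planned for.
    if snap.active_model_version != snap.expected_model_version:
        actions.append("review active model version")
    # Nothing obvious locally, so look at incoming traffic next.
    if not actions:
        actions.append("check request volume for increased load")
    return actions
```

For example, `triage_predict_latency(PredictSnapshot(95.0, 40.0, "7", "7"))` returns `["scale API replicas"]`, because only the CPU reading crosses its limit.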
Async queue backlog growing
Possible causes:
- insufficient workers
- stuck or failing tasks

Actions:
- inspect Celery worker logs
- scale workers
- check retry and dead-letter queue (DLQ) metrics
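Before scaling workers, it can help to confirm that the backlog is actually growing rather than spiking briefly. The sketch below assumes you can sample queue depth at regular intervals from your broker or monitoring system. `backlog_is_growing` and its `min_samples` default are hypothetical, and the rule it applies (depth never drops and ends higher than it started) is one simple heuristic among many.

```python
def backlog_is_growing(depths: list[int], min_samples: int = 3) -> bool:
    """Return True when queue depth rose over the window without ever dropping.

    `depths` holds queue-length samples ordered oldest to newest.
    """
    # Too few samples to call it a trend.
    if len(depths) < min_samples:
        return False
    # Compare each sample with the one that follows it.
    never_drops = all(later >= earlier for earlier, later in zip(depths, depths[1:]))
    return never_drops and depths[-1] > depths[0]
```

With samples `[10, 25, 25, 60]` the function returns `True`. With `[10, 25, 5, 60]` it returns `False`, since the drop to 5 shows workers are still draining the queue.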
Data contract failures
Possible causes:
- upstream schema changes
- incomplete scraping

Actions:
- inspect failing expectations
- block downstream pipelines
- coordinate schema update
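To tell a schema change apart from incomplete scraping, a quick field-level comparison can be useful alongside the validation report. The sketch below is a stand-in written in plain Python, not the project's actual expectation suite. `EXPECTED_FIELDS`, `contract_violations`, and `should_block_downstream` are hypothetical, and the field names are invented for illustration.

```python
from collections import Counter

# Hypothetical contract; replace with the fields your pipeline really expects.
EXPECTED_FIELDS = frozenset({"id", "price", "scraped_at"})


def contract_violations(records, expected_fields=EXPECTED_FIELDS):
    """Count, per field, how many records lack it or carry it unexpectedly."""
    missing = Counter()
    unexpected = Counter()
    for record in records:
        keys = set(record)
        missing.update(expected_fields - keys)     # fields the record lacks
        unexpected.update(keys - expected_fields)  # fields outside the contract
    return {"missing": dict(missing), "unexpected": dict(unexpected)}


def should_block_downstream(report) -> bool:
    """Block downstream pipelines if any record violates the contract."""
    return bool(report["missing"] or report["unexpected"])
```

A field that is missing from every record while a new, similarly named field appears everywhere points toward an upstream rename. Fields missing from only some records are more consistent with incomplete scraping.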
MLflow unavailable
Possible causes:
- service downtime
- network issues

Actions:
- verify registry availability
- prevent new deployments
- restore service before continuing
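Registry availability can be verified with a plain HTTP probe, and the result can gate deployments. The sketch below uses only the standard library. It assumes the MLflow server exposes a `/health` endpoint that answers with HTTP 200, which is worth confirming for your MLflow version and any proxy in front of it. `registry_available` and `deployment_allowed` are hypothetical helper names.

```python
import urllib.request


def registry_available(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the registry answers its health endpoint with HTTP 200.

    The "/health" path is an assumption; adjust it to your deployment.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, DNS failures, timeouts, and HTTP errors
        # (urllib's URLError and HTTPError both derive from OSError).
        return False


def deployment_allowed(base_url: str) -> bool:
    """Gate new deployments on the registry being reachable."""
    return registry_available(base_url)
```

Calling `deployment_allowed` at the start of a deploy job gives an early, explicit failure instead of a partial rollout against a registry that cannot be read.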