Common Failures & Troubleshooting

This runbook lists frequent failure scenarios and recommended investigation steps.

High latency on /predict

Possible causes:

- increased load,
- model complexity increase,
- resource saturation.

Actions:

- check CPU/memory usage,
- inspect active model version,
- scale API replicas if needed.
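Before scaling replicas, it can help to put numbers on the latency regression. The following is a minimal sketch, not part of the service itself: `measure_latencies` and `summarize` are hypothetical helper names, and the callable passed in stands in for whatever client call hits /predict in your environment.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies(call: Callable[[], object], runs: int = 50) -> List[float]:
    """Time `call` repeatedly and return the durations in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms: List[float]) -> Dict[str, float]:
    """Return the median, 95th percentile and maximum of the samples.

    Needs at least two samples.
    """
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    # method="inclusive" interpolates within the observed range only.
    cuts = statistics.quantiles(durations_ms, n=20, method="inclusive")
    return {
        "p50_ms": statistics.median(durations_ms),
        "p95_ms": cuts[-1],
        "max_ms": max(durations_ms),
    }


if __name__ == "__main__":
    # Placeholder workload; replace with a real request to /predict.
    samples = measure_latencies(lambda: time.sleep(0.001), runs=20)
    print(summarize(samples))
```

Comparing the p95 value across model versions or replica counts gives a rough indication of whether the slowdown follows load, the model, or the host.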

Possible causes:

- insufficient workers,
- stuck or failing tasks.

Actions:

- inspect Celery worker logs,
- scale workers,
- check retry and DLQ metrics.
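Alongside the logs, a quick per-worker task count can show whether workers are saturated or silent. This sketch assumes the mapping shape returned by Celery's `app.control.inspect()` methods `active()` and `reserved()` (worker name to a list of task dicts, or `None` when no worker replies). `summarize_workers` is a hypothetical helper, and the import path of the Celery app is an assumption.

```python
from typing import Dict, List, Optional

TaskMap = Dict[str, List[dict]]


def summarize_workers(
    active: Optional[TaskMap], reserved: Optional[TaskMap]
) -> Dict[str, Dict[str, int]]:
    """Count running and prefetched tasks per worker.

    Either argument may be None, which is what inspect() returns
    when no worker answers within the timeout.
    """
    active = active or {}
    reserved = reserved or {}
    summary = {}
    for worker in sorted(set(active) | set(reserved)):
        summary[worker] = {
            "active": len(active.get(worker, [])),
            "reserved": len(reserved.get(worker, [])),
        }
    return summary


# With a real Celery app (assumed here to be importable as `app`):
#   insp = app.control.inspect(timeout=5)
#   print(summarize_workers(insp.active(), insp.reserved()))
```

An empty result suggests no worker responded at all, which points to insufficient or crashed workers rather than slow tasks.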

Possible causes:

- upstream schema changes,
- incomplete scraping.

Actions:

- inspect failing expectations,
- block downstream pipelines,
- coordinate schema update.
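To make the first two actions concrete, the sketch below pulls the failing expectations out of a validation result and derives a block/allow decision. It assumes the result is a dictionary shaped like a Great Expectations validation result (a top-level `success` flag and a `results` list whose entries carry `expectation_config`). That shape is an assumption about your validation tool, and both function names are hypothetical.

```python
from typing import Dict, List


def failing_expectations(validation_result: dict) -> List[Dict[str, object]]:
    """List the expectation type and column of every failed check."""
    failures = []
    for item in validation_result.get("results", []):
        if item.get("success", False):
            continue
        config = item.get("expectation_config", {})
        failures.append(
            {
                # The key name differs between tool versions, so try both.
                "expectation": config.get("expectation_type") or config.get("type"),
                "column": config.get("kwargs", {}).get("column"),
            }
        )
    return failures


def should_block_downstream(validation_result: dict) -> bool:
    """Block unless the validation run explicitly reports success."""
    return not validation_result.get("success", False)
```

A failure on a column-existence or type check usually hints at an upstream schema change, while row-count or null-ratio failures are more consistent with incomplete scraping.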

Possible causes:

- service downtime,
- network issues.

Actions:

- verify registry availability,
- prevent new deployments,
- restore service before continuing.
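The availability check and the deployment gate can be scripted with the standard library alone. This is a minimal sketch: `registry_is_available` is a hypothetical helper, and the health URL is a placeholder, since the registry's actual endpoint is not specified here.

```python
import http.client
import urllib.request


def registry_is_available(health_url: str, timeout: float = 3.0) -> bool:
    """Return True only if the URL answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, DNS failures, timeouts, HTTP error
        # statuses raised by urlopen, and malformed responses.
        return False


if __name__ == "__main__":
    # Placeholder URL; substitute the registry's real health endpoint.
    url = "http://registry.example.internal/health"
    if registry_is_available(url):
        print("registry reachable, deployments may proceed")
    else:
        print("registry unreachable, hold new deployments")
```

Running such a check as the first step of a deployment job is one way to prevent new deployments while the registry is down.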