
Model Rollback & Recovery

This runbook describes how to recover from model-related incidents.


When to roll back a model

  • performance regression detected,
  • data or prediction drift alerts,
  • increased error rates or latency,
  • unexpected business behavior.

Rollback strategy

Model rollback is performed via the MLflow Model Registry.

Steps:

  1. Identify the last known good model version.
  2. Update the model alias or stage (e.g., Production) to point to that version.
  3. Redeploy the serving service if required.

No retraining is required: the rollback only repoints serving to a version that is already registered.
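
The alias update in the steps above can be sketched as follows. This is a minimal illustration, not the team's actual tooling: it assumes the registry uses aliases (recent MLflow releases favor aliases over stages), and the function name `rollback_model` is hypothetical. The `client` argument is expected to behave like `mlflow.tracking.MlflowClient`, whose `get_model_version`, `get_model_version_by_alias`, and `set_registered_model_alias` methods are the only calls used.

```python
from typing import Any


def rollback_model(client: Any, name: str, alias: str, good_version: str) -> str:
    """Repoint `alias` at `good_version`; return the previously active version.

    `client` is assumed to expose the MlflowClient registry methods used below.
    """
    # Fail early if the target version is not registered
    # (the client is expected to raise in that case).
    client.get_model_version(name, good_version)

    # Record what is serving now so the change can be logged or undone.
    previous = client.get_model_version_by_alias(name, alias).version

    # Step 2 of the runbook: move the alias to the last known good version.
    client.set_registered_model_alias(name, alias, good_version)
    return str(previous)
```

With a real registry, `client` would be an `MlflowClient()` instance, and the model name, alias, and version are whatever the deployment uses. Whether the serving service picks up the new alias target without a redeploy depends on how it loads models, which is why step 3 is conditional.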


Verification

After rollback:

  • verify the active model version in logs and dashboards,
  • monitor latency and error rate,
  • confirm alert resolution.
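
The first two checks can also be scripted. The sketch below is illustrative only: the function names, the nearest-rank p95 calculation, and the threshold parameters are assumptions, and real limits should come from the service's own SLOs. As above, `client` is assumed to behave like `mlflow.tracking.MlflowClient`.

```python
from typing import Any, Sequence


def alias_points_to(client: Any, name: str, alias: str, expected_version: str) -> bool:
    """Return True if `alias` currently resolves to `expected_version`."""
    active = client.get_model_version_by_alias(name, alias).version
    return str(active) == str(expected_version)


def within_limits(latencies_ms: Sequence[float], errors: int, requests: int,
                  max_p95_ms: float, max_error_rate: float) -> bool:
    """Check p95 latency and error rate against hypothetical thresholds."""
    if requests <= 0 or not latencies_ms:
        # No traffic observed yet, so the rollback cannot be confirmed.
        return False
    ordered = sorted(latencies_ms)
    # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    return p95 <= max_p95_ms and errors / requests <= max_error_rate
```

Both functions only report state; confirming that alerts have resolved still has to happen in the alerting system itself.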


Escalation

If rollback does not resolve the issue:

  • disable affected inference endpoints,
  • fall back to degraded mode if available,
  • escalate to investigation.
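
The ordering of these actions can be captured in a small helper. This is a sketch under stated assumptions: the three callables are hypothetical hooks into whatever endpoint management, fallback, and paging systems are in use, and none of them is part of MLflow.

```python
from typing import Callable, List, Optional


def escalate(disable_endpoints: Callable[[], None],
             enable_degraded_mode: Optional[Callable[[], None]],
             open_investigation: Callable[[List[str]], None]) -> List[str]:
    """Run the escalation actions in runbook order; return what was done."""
    actions: List[str] = []

    # Stop serving predictions from the affected endpoints first.
    disable_endpoints()
    actions.append("endpoints_disabled")

    # Degraded mode is optional: pass None when no fallback exists.
    if enable_degraded_mode is not None:
        enable_degraded_mode()
        actions.append("degraded_mode_enabled")

    # Hand a copy of the actions taken so far to the investigation.
    open_investigation(list(actions))
    actions.append("escalated")
    return actions
```

Returning the list of completed actions gives the incident log a record of what was actually done before the handoff.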