On-call Cheat Sheet¶
This page provides a quick reference for operators during incidents.
First steps¶
- Acknowledge the alert.
- Identify affected layer:
- infrastructure,
- data,
- model.
- Check dashboards and logs.
Key dashboards¶
- Service Overview (latency, errors)
- Async Processing (queue depth)
- Model Health (drift signals)
Common actions¶
- scale API or workers,
- rollback model version,
- pause async processing,
- disable non-critical endpoints.
Golden rules¶
- do not apply hotfixes without traceability,
- prefer rollback over patching,
- document incidents and resolutions.
Post-incident¶
- perform root cause analysis,
- update runbooks if needed,
- consider preventive measures.