On-call Cheat Sheet

This page provides a quick reference for operators during incidents.

First steps

Acknowledge the alert.
Identify the affected layer: infrastructure, data, or model.
Check dashboards and logs.
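
As a rough illustration of the "identify affected layer" step, the sketch below turns three dashboard readings into a list of suspect layers. Everything in it is an assumption made for illustration: the function name, the signal-to-layer mapping (error rate to infrastructure, queue depth to data, drift score to model), and the threshold values are hypothetical and should be replaced with whatever your own alerts use.

```python
from enum import Enum


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"
    DATA = "data"
    MODEL = "model"


def guess_affected_layer(error_rate, queue_depth, drift_score,
                         max_error_rate=0.05, max_queue_depth=1000,
                         max_drift=0.2):
    """Return the layers whose signal exceeds its (hypothetical) threshold.

    The mapping of signal to layer is an assumption for this sketch,
    not something the cheat sheet prescribes.
    """
    suspects = []
    if error_rate > max_error_rate:
        suspects.append(Layer.INFRASTRUCTURE)
    if queue_depth > max_queue_depth:
        suspects.append(Layer.DATA)
    if drift_score > max_drift:
        suspects.append(Layer.MODEL)
    return suspects


if __name__ == "__main__":
    # Elevated errors only: infrastructure is the first suspect.
    print(guess_affected_layer(error_rate=0.10, queue_depth=10, drift_score=0.0))
```

An empty result means no signal crossed its threshold, in which case the logs are the next place to look.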

Key dashboards

Service Overview (latency, errors)
Async Processing (queue depth)
Model Health (drift signals)

Common actions

scale API or workers,
roll back the model version,
pause async processing,
disable non-critical endpoints.
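
One way to keep these actions at hand is a small lookup keyed by affected layer. The sketch below is only an example: which action belongs to which layer is an assumption, and `CANDIDATE_ACTIONS` and `suggest_actions` are hypothetical names rather than part of any existing tooling.

```python
# Hypothetical mapping from affected layer to the actions listed above.
# The assignment of actions to layers is an assumption for illustration.
CANDIDATE_ACTIONS = {
    "infrastructure": ["scale API or workers", "disable non-critical endpoints"],
    "data": ["pause async processing"],
    "model": ["roll back the model version"],
}


def suggest_actions(layer: str) -> list[str]:
    """Return a copy of the candidate actions for a layer."""
    try:
        return list(CANDIDATE_ACTIONS[layer])
    except KeyError:
        raise ValueError(f"unknown layer: {layer!r}") from None


if __name__ == "__main__":
    for action in suggest_actions("infrastructure"):
        print("-", action)
```

The function only suggests; the operator still decides which action to take and records it.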

Golden rules

do not apply hotfixes without traceability,
prefer rollback over patching,
document incidents and resolutions.
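
The traceability and documentation rules can be made concrete with a minimal incident record that refuses to be created without a change reference. This is a sketch under stated assumptions: the `IncidentRecord` class, its field names, and the `CHG-123` identifier are all hypothetical, and a real team would likely store this in its ticketing system instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    title: str
    affected_layer: str
    action_taken: str
    change_reference: str  # hypothetical field: ticket, commit, or deploy ID
    resolution: str = ""
    opened_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def __post_init__(self):
        # Enforce "no hotfixes without traceability": every action
        # must point at something that can be audited later.
        if not self.change_reference.strip():
            raise ValueError("change_reference is required for traceability")

    def summary(self) -> str:
        """One-line summary suitable for an incident log."""
        status = self.resolution or "unresolved"
        return (f"[{self.affected_layer}] {self.title}: "
                f"{self.action_taken} ({self.change_reference}) -> {status}")


if __name__ == "__main__":
    record = IncidentRecord(
        title="High latency",
        affected_layer="model",
        action_taken="rolled back to previous model version",
        change_reference="CHG-123",  # made-up identifier
        resolution="latency recovered",
    )
    print(record.summary())
```

Raising at construction time keeps an undocumented action from entering the log in the first place.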

Post-incident

perform root cause analysis,
update runbooks if needed,
consider preventive measures.