Skip to content

On-call Cheat Sheet

This page provides a quick reference for operators during incidents.


First steps

  1. Acknowledge the alert.
  2. Identify affected layer:
  3. infrastructure,
  4. data,
  5. model.
  6. Check dashboards and logs.

Key dashboards

  • Service Overview (latency, errors)
  • Async Processing (queue depth)
  • Model Health (drift signals)

Common actions

  • scale API or workers,
  • rollback model version,
  • pause async processing,
  • disable non-critical endpoints.

Golden rules

  • do not apply hotfixes without traceability,
  • prefer rollback over patching,
  • document incidents and resolutions.

Post-incident

  • perform root cause analysis,
  • update runbooks if needed,
  • consider preventive measures.