Good runbooks are critical for the smooth functioning of the team. I have seen many teams get frustrated with a poor DevOps experience. Sometimes, the answer was as simple as improving runbooks. Hence, today's topic is dedicated to sharing my tips on what makes runbooks awesome.
Last week, Jordan and I collaborated on 7 types of difficult coworkers and how to deal with them. Check it out for actionable tips!
Tomorrow, I will be sharing my career story, and I hope it has actionable tips for all.
Runbooks are like maps, guiding you with clear instructions to reach your destination. In DevOps, runbooks are a set of instructions that on-calls use to quickly identify and resolve ongoing problems. They remove the need for team members to remember every detail and reduce the risk of mistakes during recovery.
A well-written runbook ensures smooth problem-solving and eliminates challenges caused by tribal knowledge.
Qualities of a Good Runbook:
A single runbook focuses on a single problem: short & sweet.
It has the necessary steps to confirm the problem in the form of graphs & dashboard links.
It is easy to follow, even for a relatively new person on the team, and eliminates any guessing game.
It is easy to distinguish between false positives & real issues.
It provides a list of straightforward actions to take based on the problems.
What Runbooks Aren't
Full-fledged documentation to train engineers on the processes and the tech stack
The ultimate solution for any problem that could ever occur
A single document written once that never needs an update
Runbook Template
Template
Brief explanation with severity and urgency
Instructions to confirm the problem
Common false positives
List of actions for every problem scenario (with exact commands or button clicks)
Instructions on how to escalate to another team or group
Advanced section
References to architecture, past investigations, and metrics
Advanced debugging steps
Expert contacts
Have these for each alert/alarm/problem. These are guidelines rather than rules, so adapt them for your product/system.
Brief explanation with severity & urgency should clearly tell the on-call what would happen if the issue is left unmitigated. It should also state which use cases can degrade further.
Instructions to confirm the problem can be as simple as a graph or a dashboard that clearly shows the problem. If the metrics used are not obvious, briefly explain them.
Common false positives should save the on-call valuable time when the alert turns out to be a false positive. These could be explained as a combination of metrics, log lines, or a specific pattern. Ideally, your alerts should have no false positives, but in practice you may have some.
Actions for every problem scenario should be foolproof. Ideally, the on-call just needs to click on some UI or run short commands. Manual steps increase the chances of disastrous mistakes. Also, the commands should be complete and not leave the on-call hanging, trying to figure out a missing step.
How to escalate should state exactly which teams to reach out to when the root cause of the issue is ABC vs XYZ. It should even point to the on-call alias for those teams and perhaps include a template for how to ask the question.
Advanced section should be links to other wiki pages that contain deep dive information. The on-call should not need to reference these in the common case. Those are for advanced debugging only.
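One way to keep runbooks aligned with the template is to check each one for the expected sections. The sketch below is illustrative only. It assumes runbooks are plain-text files with one heading per line. The heading names in `REQUIRED_SECTIONS` and the function `missing_sections` are hypothetical, chosen to mirror the template above, so adapt them to whatever your team actually uses.

```python
# Hypothetical section headings that mirror the template in this article.
REQUIRED_SECTIONS = [
    "Summary",
    "Confirm the problem",
    "Common false positives",
    "Actions",
    "Escalation",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required sections that do not appear as a line in the runbook."""
    # Normalize each line so that case and surrounding whitespace do not matter.
    lines = {line.strip().lower() for line in runbook_text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lines]


if __name__ == "__main__":
    draft = """Summary
High error rate on checkout. Sev 2.
Confirm the problem
See the checkout dashboard.
Actions
Roll back the latest deploy.
"""
    # This draft has no false-positive or escalation section yet.
    print(missing_sections(draft))
```

A check like this only confirms that the sections exist. Whether the content is actually easy to follow still needs a human review.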
Runbook Health
Keep them live
Runbooks will never be perfect. Therefore, every incident in which they were not helpful should be used as an opportunity to improve them. If the system undergoes a major change, ensure that you update all affected runbooks accordingly.
Ensure quality
New runbooks should be code reviewed. Also, have your reviewer or preferably a new hire run through them. Ensure that there are no hidden assumptions in the runbooks.
Build a culture
A healthy runbook culture needs to be ingrained in how the team operates. Having a champion who can keep the team accountable makes a huge difference. Linters are a great way to prevent new alerts from being shipped without runbooks.
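As a rough illustration of such a linter, the sketch below flags alerts that carry no runbook link. It assumes alert definitions have already been loaded into dictionaries. The field names `name` and `runbook_url`, the function `alerts_missing_runbooks`, and the example URL are all hypothetical, not part of any particular alerting tool.

```python
def alerts_missing_runbooks(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no non-empty runbook link."""
    missing = []
    for alert in alerts:
        # Treat a missing key, None, and a blank string the same way.
        link = alert.get("runbook_url") or ""
        if not link.strip():
            missing.append(alert.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Hypothetical alert definitions, for demonstration only.
    alerts = [
        {"name": "HighErrorRate",
         "runbook_url": "https://wiki.example.com/runbooks/high-error-rate"},
        {"name": "DiskAlmostFull", "runbook_url": ""},
        {"name": "QueueBacklog"},
    ]
    for name in alerts_missing_runbooks(alerts):
        print(f"alert {name} has no runbook link")
```

Run as part of a pre-merge check, a non-empty result could fail the build, so an alert cannot ship until someone writes its runbook.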
Runbook Organization
Keep them discoverable
Runbooks for alarms should be linked or inline with the alarm. For non-alarm based issues, group runbooks of similar functionalities under a single folder or hierarchy. The last thing you want is someone taking 30 minutes just to find a runbook page.
Also, on-call members should familiarize themselves with the runbook hierarchy as a part of their onboarding.
Keep ongoing notes
It is important to carry over context from one on-call to the next for recurring issues. You don't want to update the runbook with transient information that will become obsolete, so you can choose to keep these notes separate.
Keep them accessible
When disaster strikes, you want to have access to your runbooks. So, keep them in a place that is reliable, and perhaps keep a backup copy handy as well, even if it is slightly stale.
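The backup can be as simple as a scheduled copy of the runbook folder. The sketch below is one possible approach, assuming the runbooks live as files in a local directory. The function name `snapshot_runbooks` and the `runbooks-<date>` folder naming are hypothetical choices for illustration.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_runbooks(source_dir: str, backup_root: str) -> Path:
    """Copy the runbook folder into a dated folder under backup_root."""
    target = Path(backup_root) / f"runbooks-{date.today().isoformat()}"
    # dirs_exist_ok lets a second run on the same day refresh the snapshot.
    shutil.copytree(source_dir, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    import tempfile

    # Demonstrate with throwaway directories instead of real runbooks.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "runbooks"
        src.mkdir()
        (src / "high-error-rate.txt").write_text("Summary\n")
        copy = snapshot_runbooks(str(src), str(Path(tmp) / "backup"))
        print(sorted(p.name for p in copy.iterdir()))
```

The copy is only useful if it lives somewhere that stays reachable when the primary wiki or repository is down, for example on the on-call's laptop.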
Don't chase perfection with your runbooks. It is okay to start with the core scenarios and gradually improve quality and coverage. You should begin to notice an improvement in on-call productivity.
Please share if you have more tips that work for your team.
A few of you have already leveraged my pro bono mentorship. If you've been on the fence, then now is your chance. Schedule a time slot here
Here are some growing newsletters that may interest you
by Adrian Stanek
by Fran Soto
by Ricardo Morales
by Tobias Mende
Lastly, my newsletter just hit 2K subscribers earlier this week. Thank you! If you haven’t subscribed yet then please do.
Thanks for the mention, Raviraj!
I love runbooks. No fluffy words, no deep dives. Just the information required. Everyone should be able to follow the instructions if they are alone in the middle of the night. When you are paged at 3 a.m. your sleepy brain is equivalent to a new hire with zero context.
Also, I think they have the hidden benefit of making you think. To write it you have to identify all the possible failure scenarios of the system and come up with the mitigation and recovery steps. Without a runbook, most of them would go unnoticed until they are a real problem in prod.
A practice I find helpful in my team is reviewing an artifact every week in the on-call handoff meeting (runbooks, dashboards, and even the alarms). This ensures we are collectively aware of the content and keep it updated.
Great post once again!
Great article! Thanks for sharing. And thanks for recommending my newsletter. 🙏😊