A ‘safe rollout’ minimizes the chances of you taking down production when enabling features.
Rolling out changes in distributed systems or a service requires you to maintain the SLA. One bad change can mess up your service and make recovery difficult. It can upset your customers and stakeholders. It can also tarnish your reputation if you repeat certain mistakes.
‘Rolling out’ is not just a task that you do on the side. It is core to the software development process. You not only need to have a reliability mindset but also incorporate the rollout strategy into feature development from day 1. You need to embrace RDD (Rollout-Driven Development).
I have spent 8 years building distributed systems and want to share my learnings. This is a vast topic, and today I will try to highlight the most important aspects of it. Writing feature code is just 10% of the work; the remaining 90% is spent on ensuring a safe rollout.
Let’s dive into RDD and understand how you can safely roll out changes.
1. Visibility++
Have the right logs and metrics up front. You won’t be able to log everything, and that is okay, but aiming to be comprehensive means you will miss less. What we often don't realize is that the cost of adding instrumentation later keeps increasing.
How?
Don't rely solely on your top-line alerts. Add some specialized alerts for the rollout that highlight unexpected symptoms.
Add detailed logging that identifies the root cause of potential issues.
Err on the side of more logs; you can hide them behind a flag.
Add the ability to obtain unsampled logs when encountering a hard-to-find bug.
Build a dashboard to provide easy access to these metrics.
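Here is a minimal sketch of the "more logs behind a flag" and "unsampled logs" ideas using Python's standard `logging` module. The `RolloutLogFilter` class and its `verbose_enabled` and `sample_rate` settings are hypothetical names I made up for illustration; in a real service they would be driven by whatever flag system you already use.

```python
import logging
import random


class RolloutLogFilter(logging.Filter):
    """Gates detailed (DEBUG) logs behind a flag and samples them.

    verbose_enabled and sample_rate are hypothetical settings; wire them
    to your own feature-flag or config system.
    """

    def __init__(self, verbose_enabled=False, sample_rate=0.01, rng=None):
        super().__init__()
        self.verbose_enabled = verbose_enabled  # flag: detailed logs on/off
        self.sample_rate = sample_rate          # 1.0 means unsampled
        self.rng = rng or random.Random()

    def filter(self, record):
        # Anything above DEBUG (INFO, WARNING, ERROR...) always passes.
        if record.levelno > logging.DEBUG:
            return True
        # Detailed logs are dropped entirely while the flag is off.
        if not self.verbose_enabled:
            return False
        # With the flag on, keep only a sampled fraction of detailed logs.
        return self.rng.random() < self.sample_rate


logger = logging.getLogger("rollout")
logger.setLevel(logging.DEBUG)  # let DEBUG records reach the filter
log_filter = RolloutLogFilter()
logger.addFilter(log_filter)

# When chasing a hard-to-find bug, flip the flag and turn sampling off.
log_filter.verbose_enabled = True
log_filter.sample_rate = 1.0
```

The point of the design is that the detailed log statements ship with the feature, so turning them on during an incident is a config change rather than a new deploy.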
2. Thorough Testing
Don't test in production for the first time. Your credibility will plummet when things break. Have a solid test plan that provides the right coverage.
How?
Ensure you have high code coverage with your unit tests.
Add integration tests that cover your core scenarios.
Test with production-like workloads in a shadow or canary environment.
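To make the shadow idea concrete, here is a small sketch of how you might compare a new code path against the current one on the same requests. The function name `shadow_compare` and the two handler arguments are assumptions for illustration, not part of any particular framework. Callers only ever receive the current implementation's responses; the candidate's output is just recorded.

```python
def shadow_compare(requests, current_impl, candidate_impl):
    """Run both implementations on each request and collect differences.

    Returns (responses, mismatches). responses always come from
    current_impl, so the candidate cannot affect what callers see.
    """
    responses = []
    mismatches = []
    for request in requests:
        expected = current_impl(request)
        responses.append(expected)
        try:
            actual = candidate_impl(request)
        except Exception as exc:
            # A crash in the shadow path is a finding, not an outage.
            mismatches.append((request, expected, repr(exc)))
            continue
        if actual != expected:
            mismatches.append((request, expected, actual))
    return responses, mismatches


# Usage: replay a sample of production-like requests through both paths.
responses, mismatches = shadow_compare(
    [1, 2, 3],
    current_impl=lambda x: x * 2,
    candidate_impl=lambda x: x + x,
)
```

An empty `mismatches` list on a representative workload is a much stronger signal than green unit tests alone.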
3. Rollout and Rollback Plan
You need to balance risks and find signals sooner. When your plan is too risk-averse, your rollout will seem safe, but you will find critical issues much later. This will delay the project completion. Also, when a disaster strikes, it should be easy to get to a safe state. So, invest in tools that make rolling back easy.
How?
Break down your rollout into meaningful stages, each of which gives signals and has the right balance of risks & safety.
Validate that your rollback steps actually work by exercising them before the real rollout.
Pay attention to the speed of rollback and whether impact during recovery can be tolerated.
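The staged plan and the rollback path can be sketched in a few lines. This is one possible shape, assuming a percentage-based rollout keyed on a user ID; the stage percentages, the `Rollout` class, and the `bucket` helper are all hypothetical. Hashing the user ID keeps each user in the same bucket, so widening a stage only adds users and never flips existing ones back and forth.

```python
import hashlib

# Hypothetical stage percentages; pick ones that match your own risk appetite.
DEFAULT_STAGES = (1, 10, 50, 100)


def bucket(user_id, feature):
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


class Rollout:
    def __init__(self, feature, stages=DEFAULT_STAGES):
        self.feature = feature
        self.stages = list(stages)
        self.stage_index = -1  # -1 means the feature is off for everyone

    @property
    def percent(self):
        return 0 if self.stage_index < 0 else self.stages[self.stage_index]

    def advance(self):
        """Move to the next stage; stays put once the last stage is reached."""
        if self.stage_index < len(self.stages) - 1:
            self.stage_index += 1
        return self.percent

    def rollback(self):
        """Return to the known-safe state: feature off for everyone."""
        self.stage_index = -1
        return self.percent

    def is_enabled(self, user_id):
        return bucket(user_id, self.feature) < self.percent
```

Notice that `rollback()` is a single state change. If your real rollback needs more steps than that, it is worth rehearsing them before the first stage goes out.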
4. Review
Your team will have opinions and experience with similar rollouts. Also, a rollout may affect a stakeholder outside the team, and you don't want to surprise them. A review is effective in getting gaps patched up and communicating the plan.
How?
Have a written-down version of the plan.
Share it with the team and stakeholders for a review. Address concerns that come up.
Consider a meeting if there are unresolved concerns in the document.
5. Have Enough Bake Time
Don't rush through your rollout even if things seem "safe". Sometimes, issues arise under high load or specific peak or disaster conditions. So, a plan that includes enough time between stages can catch these issues.
How?
In addition to having a "long" delay between stages, you should simulate corner cases and observe how your feature performs.
Observe what happens during pushes, drains, peak load, and disasters.
Investigate the one-off issues that arise during the bake time.
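One way to keep yourself honest about bake time is to encode it as a gate that must pass before the next stage. The sketch below is illustrative only: `StageHealth`, `can_advance`, and the default thresholds (24 hours, 0.1% errors, 1000 requests) are assumptions, and you should choose values that fit your own traffic and SLA.

```python
from dataclasses import dataclass


@dataclass
class StageHealth:
    baked_seconds: float  # time since the current stage started
    requests: int         # requests served by the new code path
    errors: int           # failed requests on the new code path


def can_advance(health, min_bake_seconds=24 * 3600,
                max_error_rate=0.001, min_requests=1000):
    """Return (ok, reason) for moving to the next rollout stage."""
    if health.baked_seconds < min_bake_seconds:
        return False, "still baking"
    # Too little traffic means a low error rate proves nothing yet.
    if health.requests < max(min_requests, 1):
        return False, "not enough traffic"
    error_rate = health.errors / health.requests
    if error_rate > max_error_rate:
        return False, "error rate too high"
    return True, "ok"


# Usage: a stage that looks healthy but has only baked for one hour.
ok, reason = can_advance(StageHealth(baked_seconds=3600, requests=50_000, errors=3))
```

A gate like this does not replace looking at pushes, drains, and peak load yourself, but it stops a "looks safe" rollout from moving forward early.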
6. Communicate
You have to keep your team and stakeholders up to date with the sequence of events. Every status matters. When things go south, your on-call team is the one responsible and needs to find the relevant information to recover the system.
How?
Announce when you proceed with the rollout, rollback, or pause.
Ensure that your plan and mitigation steps are accessible to everyone in the team runbook.
Broadcast your milestones and share your learnings when you encounter a major incident.
7. Learn and Adapt
Be aggressive in finding and fixing issues. If you make a mistake, learn from it and don't repeat it. Smaller issues can manifest as you progress through the rollout. So, adjust your plan if you find hidden issues that need special handling.
Planning how you roll out changes is as important as building your features, if not more so. Embrace RDD to be someone who consistently delivers features safely. Share tips that work for you in the comments.
Until the last couple of years, I tended to be overconfident in rollouts, especially about not using the 'undo' button. This is a button I highly suggest being trigger-happy with :)
Often we think that our changes affect only area X, and after our release something in area Y breaks. We tend to ignore it and assume that other people broke something, as we are loath to press the undo button and start from scratch. I had a case where we rolled out a big change that didn't work well, and after a couple of days of debugging, I decided to roll back most of the code. I left in a small part that I was 100% sure was unrelated, and that was the most painful to roll back.
Of course, that small part was the one causing issues and hugely inflating our costs... Only a couple of weeks later did I figure it out.
My lesson was that if you have even the smallest doubt (and you always should when things break after you push to production!), roll back. It's better to roll back needlessly than to be overconfident...