This is a story I regularly tell engineers I work with because it’s profoundly impacted me as an engineering leader and person. When I was a team lead at carsales, my small team regularly made more than 20 deployments into production a week, including hotfixes for bugs that some of those deployments introduced.
I lost no sleep and suffered no anxiety in the lead-up to any of these deployments, even though the sites my team managed received almost two million sessions a month. Quite the opposite: I looked forward to each deployment and encouraged my team to push code out as soon as they responsibly could. Yet before carsales, the mere thought of deploying code into production would give me enormous anxiety and cause me to break out in cold sweats at night. So how did I overcome this fear?
“Big Bang” Deployments
I had come from a workplace culture where any deployment was viewed through a lens of mistrust and required advance notice and approval by a change control board. Any major change needed to be approved two weeks before the deployment date, and any production fix required the approval of a senior manager. If an already approved change needed to be altered, the change control board had to reassess it.
Changes were strictly controlled for these reasons:
- The staging and production environments were significantly misaligned so that testing in staging did not give a high level of confidence
- There was a history of major unexpected issues during deployments
- Failed deployments or major issues in production were not looked on favourably by management, and there would often be strong responses to failure
The process for deploying changes was cumbersome, painful, and not very flexible…the very antithesis of agile. That meant deployments were few and far between and, perversely, tended to be quite risky because they generally involved a lot of changes going out at once.
To minimise the risk of downtime for our customers, deployments were scheduled during our off-peak periods, which meant after 9 p.m. in most cases. I’ll never forget one deployment that started in the office at 10 p.m. and finished at almost 4 a.m., more than double the expected duration. We had so many problems that night; the longer the night went on, the harder it was to focus and fix the issues. My experiences with these “big bang” deployments had conditioned me to fear releasing code.
When I arrived at carsales, there was a completely different mindset. Release quickly, release often, fail fast, and roll forward. This mindset was evident from top to bottom and was enshrined in the fabric of the company’s culture. Releases were anticipated rather than feared, and failures were treated very differently.
Your Baby is Crying
When I started at carsales, I felt like I made a lot of mistakes, and this was starting to affect my confidence. On one particular occasion, I created a message queue subscriber that would send a thank you email to every person who had submitted a review of their car. I pushed the code out one Friday afternoon, and then I got a strange message from my manager that night: “Your baby is crying.”
I jumped online and saw that my subscriber was spamming people who had previously written reviews. Some of these reviews were from several years ago! Some people had received the same email more than ten times in less than an hour. The negative customer feedback even made its way onto Twitter, with an ex-developer mentioning the CTO in a tweet. It was a very public embarrassment. We quickly shut down the subscriber.
On Monday morning, I trudged into the office and looked into what had happened. It turned out that I had made some bad assumptions about the topic my subscriber was listening to. I had assumed that a message would only arrive when a new review was created, but in fact, a message would also arrive whenever the data related to the reviewed car changed. As it turned out, this car data changed quite often!
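In hindsight, two small guards would have contained the damage: checking why a message arrived, and remembering which reviews had already been thanked. The sketch below is an illustration only, not the code we ran at carsales. The message fields (`event_type`, `review_id`, `email`), the event names, and the in-memory set standing in for a durable store are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ThankYouSubscriber:
    """Sends at most one thank-you email per review (hypothetical sketch)."""

    send_email: Callable[[str, str], None]  # (recipient, review_id) -> None
    # Stand-in for a durable store; a real subscriber would persist this.
    already_thanked: set[str] = field(default_factory=set)

    def handle(self, message: dict) -> bool:
        """Return True if an email was sent for this message."""
        # Guard 1: the topic also carries updates, so only act on new reviews.
        if message.get("event_type") != "review_created":
            return False
        review_id = message["review_id"]
        # Guard 2: idempotency, so a redelivered message never emails twice.
        if review_id in self.already_thanked:
            return False
        self.send_email(message["email"], review_id)
        self.already_thanked.add(review_id)
        return True


sent = []
subscriber = ThankYouSubscriber(send_email=lambda to, rid: sent.append((to, rid)))
messages = [
    {"event_type": "review_created", "review_id": "r1", "email": "a@example.com"},
    {"event_type": "car_data_changed", "review_id": "r1", "email": "a@example.com"},
    {"event_type": "review_created", "review_id": "r1", "email": "a@example.com"},
]
results = [subscriber.handle(m) for m in messages]
```

Only the first message results in an email. The car data update and the duplicate are both ignored, which is the behaviour I had wrongly assumed the topic would give me for free.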
Accepting Some Mistakes to Move at Pace
This event was a turning point for me, thanks to how the CTO at the time (now CIO), Jason Blackman, responded. I had expected a harsh rebuke for embarrassing him and the business. Instead, he very calmly asked about the impact and the cause. We discussed how we could fix it and how we could prevent something similar from happening again. He put my mind at ease by saying that mistakes happen, adding, “If you haven’t broken anything yet, you aren’t trying hard enough. Just make sure you don’t make the same mistake twice!”
A Blame-Free Culture Starts with Leadership
It was a huge weight off my shoulders, and I soon shook off my fear of breaking stuff. Things would break from time to time. Our work is complex, dynamic, and highly technical, and it’s impossible to avoid some failures if we want to move at speed.
The key to preventing similar mistakes from happening again is to learn from them. This needs a blame-free culture where the investigative effort focuses on learning, solving the problem, and improving our systems. This is exactly the behaviour that Jason exhibited in response to my mistake, and it is the behaviour I have tried to display as an engineering leader whenever I am in a similar situation.
At Xero, where I currently work, there is an excellent culture of learning from system failures that includes Post Incident Reviews (PIRs). PIRs are always conducted in a blame-free, safe manner, ensuring the focus is on how we can do better.
Engineering for Resilience
Of course, your tolerance for mistakes depends on the potential impact of those mistakes. When dealing with systems where human life is at stake, or where the business could suffer financial penalties or even a loss of licence due to mistakes, we need to build these systems to be more resilient.
Engineering for resilience requires first accepting that failures will occur, both in the systems and applications and in the people who manage them. We then build in the ability for the critical parts of the system to continue functioning despite failures in any one part. Improving these systems requires the same blame-free culture: failure needs to be treated with curiosity and a growth mindset so that the system can be improved and the people who manage it can learn and grow.
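One simple form of this is graceful degradation: when a non-critical dependency fails, the critical path carries on with a safe default. The following is a minimal sketch under assumed names. `fetch_related_listings`, `render_listing_page`, and the simulated outage are hypothetical and do not describe any particular system.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T) -> T:
    """Return primary(), or the fallback value if primary raises."""
    try:
        return primary()
    except Exception:
        # A real system would also log and alert here so the failure
        # can be reviewed and learned from later.
        return fallback


def fetch_related_listings() -> list[str]:
    # Hypothetical non-critical dependency, simulated as being down.
    raise ConnectionError("related-listings service unavailable")


def render_listing_page(listing_id: str) -> dict:
    # The page still renders; it just shows no related listings.
    related = call_with_fallback(fetch_related_listings, fallback=[])
    return {"listing_id": listing_id, "related": related}


page = render_listing_page("abc123")
```

The design choice is to decide in advance which parts are critical and which can degrade, rather than discovering it during an incident.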
Moving at Pace Safely with DevOps Practices
As the DevOps Research and Assessment (DORA) research has shown, it is possible to move at pace whilst meeting very high levels of quality if you have the right tooling and engineering practices. These include some practices and tools that, thankfully, are fairly common nowadays:
- CI/CD
- Shifting left on quality and security
- Test automation
You can assess how you are performing using the DORA metrics:
- Lead time for changes
- Deployment frequency
- Mean time to recovery
- Change failure rate
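As a rough illustration, the four metrics can be computed from a simple log of deployments. This is a sketch under stated assumptions, not an official DORA calculation. The `Deployment` record, its field names, the use of the median for lead time, and the sample data are all invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    # Set only if the deployment caused a failure that was later resolved.
    restored_at: Optional[datetime] = None


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise a non-empty list of deployments made over period_days."""
    failures = [d for d in deployments if d.restored_at is not None]
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600
        for d in deployments
    ]
    recovery_hours = [
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in failures
    ]
    return {
        "lead_time_hours": median(lead_hours),
        "deployments_per_week": len(deployments) / (period_days / 7),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery_hours": mean(recovery_hours) if failures else 0.0,
    }


start = datetime(2024, 1, 1, 9, 0)
hour, day = timedelta(hours=1), timedelta(days=1)
deployments = [
    Deployment(start, start + 2 * hour),
    Deployment(start + day, start + day + 4 * hour),
    # This one broke production and was fixed an hour after it shipped.
    Deployment(start + 2 * day, start + 2 * day + 6 * hour,
               restored_at=start + 2 * day + 7 * hour),
    Deployment(start + 3 * day, start + 3 * day + 4 * hour),
]
metrics = dora_metrics(deployments, period_days=7)
```

For this sample week, the result is a median lead time of 4 hours, 4 deployments per week, a change failure rate of 25%, and a mean time to recovery of 1 hour.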
So no business in this day and age has any reason to continue with outdated release processes such as change control boards and big bang releases that need to be scheduled during outage windows. These practices hurt the ability to deliver business value and have a negative impact on the people and teams that have to deal with them.
Go forth at pace and without fear!
How I Learned To Stop Fearing Deployments was originally published in Better Programming on Medium.