March 20, 2020

Learning from failure - Postmortems at Cleo

It is a truth universally acknowledged that if you develop software, things will break and go wrong. Here at Cleo, we accept that this is a part of life — but it’s important to take the time to learn and improve as a result.

As our business scales, it’s imperative that we continue to provide the best possible experience to our ever-growing user base. That means continually iterating and improving the stability of our product. With that in mind, we run a blameless Postmortem after every incident and outage. Basically, we go over the entire incident to see what went wrong and how we can fix it for next time. And we make sure to include everyone who was involved in debugging, fixing, and communicating during the incident.

We introduced Postmortems here at Cleo back in early 2019 after two particularly gnarly incidents (they always seem to come in pairs).

The first meant Cleo was not able to respond to Facebook Messenger users for just over an hour. This was because an employee accepted a Facebook Business Manager invite, causing our Facebook app access token to be invalidated, meaning we could not connect to Facebook and reply to users — not something you would expect when accepting an email invite!

The second was a full 10-minute outage of Cleo whilst we were stuck deploying a long-running migration. The migration itself was blocked by a number of long-running database queries and, in turn, prevented updates to one of our busiest tables.

During our Postmortems, we seek to uncover the root cause of the issue and what we can do to prevent it from happening again. To provide structure and consistency we aim to answer the following questions:

What went wrong?
What was the impact of the problem?
What was the root cause of the problem?
How did we fix it?
What could we do differently in the future to prevent this from happening (if possible/feasible)?
What could we do to find and fix the issue faster?

All of our Postmortems result in a set of actions aimed at helping us in the future. These typically involve implementing longer-term fixes and improving our observability, alerting, and self-healing.

In the case of our migration woes above, we introduced the ActiveRecord Safer Migrations gem, which adds two new concepts to our migrations: a lock timeout and a statement timeout. I won’t explain the ins and outs here, but you can read about them over on the GitHub page. With this new gem we were able to avoid downtime caused by future migrations, whilst improving our best practices around how we perform data migrations at Cleo.
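
To give a rough feel for what a lock timeout and a statement timeout do, here is a minimal sketch in Python. It is not the gem's API and not our actual setup: it assumes a PostgreSQL database, and the function name and the default values of 750ms and 1500ms are purely illustrative. It wraps a migration's SQL in a transaction that gives up quickly, rather than queueing behind long-running queries and blocking writes to a busy table.

```python
from typing import List


def wrap_migration_with_timeouts(
    statements: List[str],
    lock_timeout_ms: int = 750,        # illustrative default, an assumption
    statement_timeout_ms: int = 1500,  # illustrative default, an assumption
) -> List[str]:
    """Return SQL that runs a migration in one transaction with both timeouts set.

    Hypothetical helper for illustration only; assumes PostgreSQL.
    """
    if lock_timeout_ms <= 0 or statement_timeout_ms <= 0:
        # In PostgreSQL a value of 0 disables the timeout, which defeats the purpose.
        raise ValueError("timeouts must be positive")
    return [
        "BEGIN",
        # Abort if a lock cannot be acquired in time, instead of waiting
        # behind long-running queries while other writers pile up behind us.
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms'",
        # Abort any single statement that runs longer than this.
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms'",
        *statements,
        "COMMIT",
    ]


if __name__ == "__main__":
    # Hypothetical table and column names.
    for line in wrap_migration_with_timeouts(
        ["ALTER TABLE users ADD COLUMN nickname text"]
    ):
        print(line + ";")
```

Because SET LOCAL only lasts until the end of the transaction, the timeouts apply to the migration and nothing else. If either one fires, the migration fails fast and can be retried at a quieter moment.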

It’s essential that we leave blame at the door during these sessions. Instead of pointing fingers at particular people, we look to discover why the systems in place allowed them to take the action they did, or led them to believe it was the appropriate thing to do. We also take the time to share these learnings with the wider engineering team in order to increase visibility and help level up other developers.

In the coming months we aim to share a number of our Postmortems, particularly those with user impact.

Cleo is hiring! Look here for job opportunities in engineering, product, data science, UX, marketing, and more.
