It is a truth universally acknowledged that if you develop software, things will break and go wrong. Here at Cleo, we accept that this is a part of life — but it’s important to take the time to learn and improve as a result.
As our business scales, it’s imperative that we continue to provide the best possible experience to our ever-growing user base. That means continually iterating and improving the stability of our product. With that in mind, following every incident and outage, we run a blameless Postmortem. Basically, we go over the entire process to see what went wrong and how we can fix it for next time. And we make sure to include everyone who was involved in debugging, fixing, and communicating during the incident.
We introduced Postmortems here at Cleo back in early 2019 after two particularly gnarly incidents (they always seem to come in pairs).
The first meant Cleo was not able to respond to Facebook Messenger users for just over an hour. This was because an employee accepted a Facebook Business Manager invite, causing our Facebook app access token to be invalidated, meaning we could not connect to Facebook and reply to users — not something you would expect when accepting an email invite!
The second was a full 10-minute outage of Cleo whilst we were stuck deploying a long-running migration. The migration itself was blocked by a number of long-running database queries and, in turn, prevented updates to one of our busiest tables.
During our Postmortems, we seek to uncover the root cause of the issue and what we can do to prevent it from happening again. To provide structure and consistency we aim to answer the following questions:
All of our Postmortems result in a set of actions aimed at helping us in the future. These typically involve implementing longer-term fixes and improving our observability, alerting, and self-healing.
In the case of our migration woes above, we introduced the ActiveRecord Safer Migrations gem, which adds two new concepts to our migrations: a lock timeout and a statement timeout. I won’t explain the ins and outs here, but you can read about them over on the GitHub page. With this gem in place, we have been able to avoid downtime caused by future migrations, whilst improving our best practices around how we perform data migrations at Cleo.
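To make the two timeouts a little more concrete, here is a minimal plain-Ruby sketch of the idea, assuming a PostgreSQL database. It is not the gem's implementation: `timeout_guarded_sql` is a hypothetical helper, and the table name and millisecond values are purely illustrative. It only builds the statements a guarded schema change might send, so that a migration stuck behind long-running queries gives up quickly instead of holding up writes to a busy table.

```ruby
# Hypothetical helper, not part of any gem: builds the statements that a
# timeout-guarded schema change might send to PostgreSQL.
# The default values below are illustrative assumptions.
def timeout_guarded_sql(ddl, lock_timeout_ms: 750, statement_timeout_ms: 1500)
  [
    "BEGIN",
    # Give up if the lock cannot be acquired within this window, rather than
    # queueing behind long-running queries and blocking other writers.
    "SET LOCAL lock_timeout = '#{Integer(lock_timeout_ms)}ms'",
    # Abort any single statement that runs for longer than this.
    "SET LOCAL statement_timeout = '#{Integer(statement_timeout_ms)}ms'",
    ddl,
    "COMMIT"
  ]
end

# Illustrative usage with a made-up table and column.
statements = timeout_guarded_sql(
  "ALTER TABLE transactions ADD COLUMN note text",
  lock_timeout_ms: 250
)
puts statements
```

With timeouts like these, a migration that cannot get its lock fails fast and can be retried at a quieter time, which is the behaviour that would have turned our 10-minute outage into a failed deploy.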
It’s essential that we leave blame at the door during these sessions. Instead of pointing fingers at particular people, we look to discover why the systems in place allowed them to take an action, or led them to believe it was the appropriate thing to do. We also take the time to share these learnings with the wider engineering team in order to increase visibility and help level up other developers.
In the coming months we aim to share a number of our Postmortems, particularly those with user impact.
Cleo is hiring! Look here for job opportunities in engineering, product, data science, UX, marketing, and more.