July 3, 2020

SageMaker and Metaflow: Modern Machine Learning Deployment at Cleo

Everything you need to know, by the Cleo AI Data Science Team 💡

Successful data science projects need a process for building, improving, and deploying machine learning models on a continual basis. Organisations often build effective models but then struggle to follow through to production. The problem is so acute that some organisations have hired engineers solely to bring machine learning from proof of concept to reality. In this blog post, we talk about the technologies we’ve been implementing at Cleo to make our machine learning deployments simple yet robust.

For some time, we’ve been using Amazon SageMaker to train models in the cloud and deploy them as microservices. For those unfamiliar with it, SageMaker is Amazon’s machine learning training and deployment service. By making use of Docker and separating the data from the code, SageMaker keeps training efficient: the training instances run only for as long as it takes to train the model. Once the model is built, it can be deployed to an endpoint with a single click, and our backend services can then query that endpoint whenever they need an inference. You can read more about Cleo’s use of SageMaker in this guest blog for AWS.
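To make the endpoint query concrete, here is a minimal sketch of how a backend service might ask a deployed endpoint for an inference. It is an illustration, not our production code: the endpoint name, the JSON payload, and the `FakeRuntimeClient` stand-in are all hypothetical. In real use the client would come from `boto3.client("sagemaker-runtime")`, whose `invoke_endpoint` call takes the same keyword arguments and returns the reply as a readable `Body` stream.

```python
import io
import json


def query_endpoint(client, endpoint_name, features):
    """Send one JSON payload to an endpoint and decode the JSON reply."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # assumption: the model accepts JSON
        Body=json.dumps(features),
    )
    # The reply body is a stream, so it has to be read before decoding.
    return json.loads(response["Body"].read())


class FakeRuntimeClient:
    """Hypothetical local stand-in so the sketch runs without AWS access."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        features = json.loads(Body)
        reply = {"endpoint": EndpointName, "n_features": len(features)}
        return {"Body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


# Hypothetical endpoint name and feature payload.
result = query_endpoint(FakeRuntimeClient(), "example-endpoint", {"amount": 12.5})
print(result)
```

Passing the client in as an argument keeps the function easy to test locally, and swapping in the real boto3 client requires no change to `query_endpoint` itself.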

SageMaker tracks data versions across the later steps of training and deployment, but it doesn’t capture details of the data extraction or preprocessing, and it doesn’t have an easy way to relate each endpoint back to the source code that went into it.  Until recently we were operating SageMaker through an untidy and poorly versioned collection of bash and Python scripts, mixed in with some old-fashioned clicking through the AWS console.

Enter Netflix Metaflow.  Open-sourced and released to the public in December 2019, Metaflow is a Python package that can be used to wrap both the training and deployment workflows of an organisation’s machine learning models.  It arrived just as we were looking to harden our deployment and version control and increase our iteration speed.  We became early adopters.

We were able to get up and running early in the new year with our first model.  We moved the deployment code into a single script — a Metaflow ‘flow’ — with Git and Docker commands handled through their Python SDKs.

With a single command we can now:

  • extract a new dataset
  • perform preprocessing
  • train a number of new models to get the best hyperparameters
  • evaluate the model
  • deploy the model as an endpoint
  • test that endpoint’s performance

Each time the flow runs, it increments the version number and logs all of the parameters, hyperparameters, and datasets of the run. It also tags our GitHub and Docker repositories and all our SageMaker cloud objects with the version number, so we can join the dots between them later.
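
As a rough illustration of that bookkeeping, the sketch below builds a run record and derives the labels that would be attached to each system. Everything here is an assumption made for the example: the `model-v<N>` tag format, the repository, image, and endpoint names, and the use of a SHA-256 fingerprint to identify the dataset. No Git, Docker, or AWS calls are made; the function only returns the labels that such calls would apply.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass
class RunRecord:
    version: int
    parameters: dict
    hyperparameters: dict
    dataset_sha256: str

    @property
    def tag(self):
        # Hypothetical tag format shared by every system.
        return f"model-v{self.version}"


def next_version(previous_versions):
    """Increment the highest known version, starting from 1."""
    return max(previous_versions, default=0) + 1


def dataset_fingerprint(rows):
    """Hash the dataset so a run can be tied to the exact data it used."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def tag_targets(record, repo, image, endpoint):
    """Return the labels that would be applied; nothing is sent anywhere."""
    return {
        "git": f"{repo}@{record.tag}",
        "docker": f"{image}:{record.tag}",
        "sagemaker": {
            "endpoint": endpoint,
            "tags": [{"Key": "model-version", "Value": str(record.version)}],
        },
    }


record = RunRecord(
    version=next_version([1, 2, 3]),
    parameters={"sample_days": 90},        # hypothetical run parameter
    hyperparameters={"max_depth": 6},      # hypothetical hyperparameter
    dataset_sha256=dataset_fingerprint([[1, 2], [3, 4]]),
)
tags = tag_targets(record, "example/repo", "example/model", "example-endpoint")
print(json.dumps({"run": asdict(record), "tags": tags}, indent=2))
```

Because the same version string appears in every label, a deployed endpoint can be traced back to the commit, image, and dataset that produced it.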

Metaflow also comes with inbuilt AWS integration: instead of local storage, objects can be persisted to Amazon’s object store, S3. It also comes with a fast S3 client that runs in the background, uploading all the logs and data. That’s a big help for our collaboration: it makes it easy to pick up where a colleague has left off, and allows multiple people to work on the same application without getting in each other’s way.

Building and deploying a new model with a single command means we can focus on data science rather than engineering, and makes switching between models and testing hypotheses speedy and fun.
