How to scale deep learning experimentation in production

Since the rise of the deep learning revolution, springboarded by the Krizhevsky et al. 2012 ImageNet victory, people have thought that data, processing power and data scientists were the three key ingredients to building AI solutions. The companies with the largest datasets, the most GPUs to train neural networks on, and the smartest data scientists were going to dominate forever.

This, however, was only part of the truth.

While it’s true that companies with the most data are able to build better predictive models, model quality does not improve linearly with dataset size. At the same time, most companies today either have Big Data™ or are built with data gathering at their core – so one could argue that while more data is a competitive advantage, it’s not as big an advantage anymore. The challenge with data lies more in having meaningful data lakes with labeled and structured data than in the sheer amount of it. There was a correlation between better AI solutions and data, but not necessarily causation.

Equally, while it’s true that the need for processing power rises in proportion to the amount of data a model is trained on, virtually every company today has access to practically unlimited processing power. Powerful on-premises server farms and big cloud operators offer everyone access to thousands of GPUs over the web. The challenge with processing power is more about making use of these resources in an efficient way than having access to them. Just like airlines optimize the time planes are in the air, effective data science is about optimizing how cloud GPUs are used.

While there is a shortage of data scientists, shown by their rising salaries (30% higher than those of their software engineer counterparts), the need for algorithmic development is not as imperative as cobbling together models built on pre-researched best practices. AI expert, author and venture capitalist Kai-Fu Lee called this shift in competence “a shift from thinkers to tinkerers”. We’ve moved along from research to engineering when it comes to AI, and that requires a different set of skills.

Through this transformation in data, processing power and competence, deep learning has in the past five years matured from the question “how can it be applied?” to the more down-to-earth question “how can we scale to production quickly?”. Building production-scale solutions quickly requires a different set of tools from those required by research or exploration.

Let’s have a look at what that means in practice.

AI Tools and Frameworks to the rescue

The combination of AI hype and the need for more skilled people has drawn people from different fields into data science. Software engineers, mathematicians and statisticians all have different backgrounds and different ways of working. Software engineers are probably the only ones who have worked together in large groups under tight deadlines. Engineers are by definition the tinkerers of solutions proposed by the thinkers, whereas mathematicians and analysts have tended to work alone.

But software engineers have not always worked together either. In the early 1990s and earlier, software development tended to be a one-man job where hero coders hacked together solutions that nobody else understood or could collaborate on. There was no version control that supported real collaboration (files on your desktop, anyone?), not to mention unit testing (println() is testing, right?), continuous integration, cloud computing or Scrum (are UML diagrams and use-case specifications still a thing?). The methodologies have been iteratively developed over the past 30 years to accommodate the needs of accelerating software development and highly efficient teams.

However, we’re still lacking these tools in the field of data science today. Instead of using standard tools, people are engineering their own workflows, tools and frameworks. And it doesn’t help that, this time around, these people come from vastly different backgrounds. Quick and dirty solutions once again abound.

People are sharing “version controlled experiments” in Excel sheets over Slack with links to Jupyter notebooks stored in Dropbox. Have we learned nothing from the last 30 years? Jupyter notebooks in Dropbox shares are like the Scientology of Deep Learning – they have their fanatic supporters, but most of us aren’t throwing our money at it. And it’s not the tools that are wrong – it’s what you use them for.

Before we select our tools, we have to agree on the problems we want to solve. I think there are three main goals we want to achieve:

Quick Experimentation – we want data science to be quick and agile. People should be able to test out things without spending time on boilerplate code or DevOps work.
Reproducibility – we want to ensure reproducibility and an audit trail for every experiment that we conduct. Teams should learn from each other and previous experiments so that we don’t have to reinvent the wheel over and over and over and over again.
Standardized ways of working – we want work to be standardized. So when new people join they know how things work. And when somebody leaves, we know what they’ve done previously.

Let’s have a look at these one at a time.

Quick Experimentation

In Deep Learning, the core of quick experimentation depends on what stage of model building you are in. In the beginning you need to be able to explore your data, visualize it and get to know it. A tool such as H2O is superb for getting to know your data and building your first hypotheses.

When you get further, you might want to experiment a bit in a Jupyter notebook with Pandas. But the further you get to building production scale models, and especially if your team consists of more than just you, you definitely want to move over to an IDE with proper autocompletion and a way to run your experiments on clusters of more powerful GPUs than the one on your local machine.

The key to quick experimentation is thus fully automated machine orchestration that is as transparent for the single data scientist as possible. Clicking one button or running one command on the command line would be optimal.

Reproducibility

The key to reproducibility in any scientific work is rigorous and complete bookkeeping, i.e. version control, for every experiment. Doing version control manually is not an option: it is never your primary focus during model development, so you end up with random snapshots instead of full reproducibility.

But unlike in software engineering, reproducibility should not be limited to your training code; it must also include your training and test data, external hyperparameters, software library versions and more.

In this case automatic version control for every part of each and every training run is the optimal solution.
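As a rough illustration of what such an automatic snapshot could record, here is a minimal Python sketch using only the standard library. The function names (`sha256_of_file`, `library_versions`, `snapshot_run`) and the record layout are assumptions made for this example, not the format of any particular platform. It fingerprints code and data files by content hash and stores them next to the hyperparameters and installed library versions.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def sha256_of_file(path):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def library_versions(names):
    """Look up installed versions; None marks a library that is not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def snapshot_run(code_files, data_files, hyperparameters, libraries):
    """Collect what is needed to trace a training run back to its inputs."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "code": {str(p): sha256_of_file(p) for p in code_files},
        "data": {str(p): sha256_of_file(p) for p in data_files},
        "hyperparameters": dict(hyperparameters),
        "libraries": library_versions(libraries),
    }


if __name__ == "__main__":
    # Hypothetical stand-ins for a real training script and dataset.
    with tempfile.TemporaryDirectory() as workdir:
        code_file = Path(workdir) / "train.py"
        code_file.write_text("print('training')\n")
        data_file = Path(workdir) / "train.csv"
        data_file.write_text("x,y\n1,2\n")
        record = snapshot_run(
            [code_file], [data_file], {"learning_rate": 0.01}, ["numpy"]
        )
        print(json.dumps(record, indent=2))
```

In practice a record like this would be written automatically at the start of every run and stored alongside the resulting model, so nobody has to remember to do it by hand.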

Standardized Pipeline Management

Having the whole team work the same way, with some given degrees of freedom, is a necessity. How do you store data? Where do you deploy models and code? Where do you train them? Which frameworks do you use and how do you chain together feature extraction with model training?

If everyone in the team solved these by themselves, not only would it be a huge waste of time, it would also make collaboration virtually impossible.

The solution is decoupled pipeline steps with standard ways of chaining and orchestrating them. As a trivial solution, it could just be a script that calls every step of the pipeline in order. But the central part is that it’s standardized within the company and within the team.
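The trivial version of such a script might look something like the sketch below. Everything in it is hypothetical: the step names, the toy "model" (just the mean of the features) and the convention that each step takes the shared state as a dict and returns its new outputs as a dict. The point is only that steps stay decoupled and are chained in one agreed-upon way.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Assumed convention: a step reads from the shared state and returns new outputs.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_pipeline(
    steps: List[Tuple[str, Step]], inputs: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Run the steps in order, merging each step's outputs into the state."""
    state = dict(inputs or {})
    for name, step in steps:
        outputs = step(dict(state))  # pass a copy so steps cannot clobber state
        if not isinstance(outputs, dict):
            raise TypeError(f"step {name!r} must return a dict of outputs")
        state.update(outputs)
    return state


def extract_features(state):
    """Toy feature extraction: scale raw values by the largest magnitude."""
    raw = state["raw"]
    peak = max(abs(x) for x in raw) or 1.0
    return {"features": [x / peak for x in raw]}


def train_model(state):
    """Toy training step: the 'model' is just the mean of the features."""
    features = state["features"]
    return {"model": {"mean": sum(features) / len(features)}}


def evaluate(state):
    """Toy evaluation: mean squared error of the features around the mean."""
    mean = state["model"]["mean"]
    features = state["features"]
    return {"mse": sum((x - mean) ** 2 for x in features) / len(features)}


PIPELINE = [
    ("extract_features", extract_features),
    ("train_model", train_model),
    ("evaluate", evaluate),
]

if __name__ == "__main__":
    print(run_pipeline(PIPELINE, {"raw": [2.0, 4.0]}))
```

Because every step follows the same contract, a team member can swap out one step, or a platform can run each step on a different machine, without touching the others.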

The Rebirth of Deep Learning: AI Platforms

To bring clarity and structure to machine learning, the technology unicorns have been building their own overarching platforms that tie solutions to all of the above challenges together, usually in the form of an AI platform that includes libraries, machine orchestration, version control, pipeline management and deployment.

FBLearner Flow is Facebook’s unified platform for orchestrating machines and workflows across the company. BigHead is Airbnb’s Machine Learning platform for standardizing the way to production, built largely on Apache Spark. Michelangelo is Uber’s machine orchestration and ML training platform for quickly training ML models.

The same is true for Google, Netflix and pretty much all larger players that have understood the competitive advantage gained through quick model building.

But what about the rest of us? What about those who cannot invest 10 man-years into building our own orchestration but need results today?

Valohai is what FBLearner Flow, BigHead and Michelangelo are for the tech unicorns, but built for the rest of us. It is a cloud-based service that can run on AWS, GCP, Azure or your on-premises server farm. Valohai lets you run your models in the cloud just as if you were running them on your local host, either as individual steps or as pipelined workflows. It automatically snapshots every training run so that you can always grab a model running in production (on Valohai’s scalable Kubernetes cluster), click a button and trace back to how it was trained, by whom, with which training data, which version of the code and more.

Valohai is, however, not your only alternative – you could build much of it yourself. The important thing is that you ensure quick experimentation, reproducibility of your experiments and standardized ways of working. But the real question is, do you want to hit the ground running or do you want to start by designing and building your own running shoes?