Build vs. Buy – A Scalable Machine Learning Infrastructure
Toni Perämäki / March 19, 2019
In this blog post we’ll look at which parts a machine learning platform consists of and compare building your own infrastructure from scratch to buying a ready-made service that does everything for you.
The end goal of both buying and building an infrastructure is to make your data science teams more efficient by letting them spend as much of their time as possible on understanding their data, building models and moving them to production. Conversely, they should spend as little time as possible on infrastructure and boilerplate code. To achieve this, most ML infrastructure platforms (such as Uber’s Michelangelo) have been built on the following principles:
- Integrate with tools your data scientists already love to use
- Build the platform for fast iterative experimentation
- Automate data management as far as possible for both input data as well as output data
- Give a way to roll back experiments by providing an automatic audit trail
- Support the environments your company uses, from different cloud providers to on-premises hardware
- Build on abstractions, so that you can support new technologies as they emerge
- Design for scalability, both in team size and in resources, so that you can support hundreds of projects, people, data sources and computation units
- Integrate with everything, everywhere, so that mundane tasks can be automated
With these principles in mind, let’s have a look at how to achieve this by building it from scratch.
Building a Scalable Machine Learning Infrastructure
An effective Machine Learning Infrastructure requires a wide variety of competencies from the team building it. You will be solving specific needs for your Data Science team, so let’s have a look at what a typical Machine Learning infrastructure consists of.
Machine Orchestration is the backbone of your machine learning infrastructure. Users reach the orchestration engine through whichever interface you expose, be it integrations in IDEs, Notebooks, a CLI, an API or a UI. The most advanced engines optimize resource usage, predict demand, handle queuing and more. You could base your engine on Virtual Machines, on platforms such as Kubernetes, or on directly orchestrating your own hardware.
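To make the queuing part concrete, here is a minimal sketch of one scheduling pass over a priority queue of jobs. The `Job` class, the `schedule` function and the idea of counting free GPUs are illustrative assumptions, not how any particular engine works; a real engine would also handle preemption, retries and autoscaling.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int                      # lower number is scheduled first
    name: str = field(compare=False)
    gpus: int = field(compare=False)   # GPUs the job needs


def schedule(jobs, free_gpus):
    """Run one scheduling pass and return (started, still_waiting)."""
    queue = list(jobs)
    heapq.heapify(queue)
    started, waiting = [], []
    while queue:
        job = heapq.heappop(queue)
        if job.gpus <= free_gpus:
            free_gpus -= job.gpus
            started.append(job.name)
        else:
            # The job does not fit right now; smaller jobs behind it may.
            waiting.append(job.name)
    return started, waiting


jobs = [Job(2, "tune", gpus=4), Job(1, "train", gpus=2), Job(3, "eval", gpus=1)]
started, waiting = schedule(jobs, free_gpus=3)
print(started, waiting)
```

Letting small jobs start past a blocked large one keeps hardware busy, at the cost of possibly starving the large job, which is the kind of trade-off an orchestration engine has to decide on.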
Recordkeeping, Logging and Version Control
Although not strictly a part of a traditional ML platform, version control is becoming more and more important due to regulations such as GDPR. For your data scientists to be able to reproduce experiments with a single click, you need to version and record the data, parameters, hardware and software environment, logs and more. A good starting point is to use Git for code versioning and Docker for the environment. A starting point on what to version can be found in the picture below.
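As a rough illustration of what such a record could contain, the sketch below collects the code version, environment, parameters and a content hash of the input data into one dictionary. The function names, the field names and the placeholder values (commit ID, image tag) are assumptions made for the example; in practice the commit would come from Git and the image from your Docker setup.

```python
import hashlib
import json
import platform
import sys


def fingerprint(data: bytes) -> str:
    """Content hash, so identical input data always gets the same ID."""
    return hashlib.sha256(data).hexdigest()


def experiment_record(commit, image, params, data: bytes):
    """Collect what is needed to rerun an experiment later."""
    return {
        "code_commit": commit,            # e.g. the output of `git rev-parse HEAD`
        "docker_image": image,            # environment the code ran in
        "parameters": params,
        "data_sha256": fingerprint(data),
        "python": sys.version.split()[0],
        "machine": platform.machine(),
    }


record = experiment_record(
    commit="0123abc",                     # placeholder commit ID
    image="python:3.11-slim",             # placeholder image tag
    params={"learning_rate": 0.01, "epochs": 5},
    data=b"x,y\n1,2\n",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing one such record per run, next to the logs and outputs, is what makes an automatic audit trail possible.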
Project and user management
You’ll want to sandbox experiments into projects and manage privileges for users and teams. Use access control provided by Azure Active Directory or Apache Ranger, or build something yourself. Access control should also define which resources users are able to use, how jobs from different users are queued, which data is available to whom and more. At the same time, you’ll want to remove silos and enable transparency so that your data scientists can learn from each other.
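If you do build something yourself, the core check can be small. The following sketch assumes a simple role model (`viewer`, `member`, `admin`) and a per-project GPU cap; the roles, names and limits are hypothetical and only show how users, data and resources can be tied to a project.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    members: dict = field(default_factory=dict)    # user -> role
    datasets: set = field(default_factory=set)     # data the project may read
    max_gpus: int = 0                              # resource cap per job


ROLE_ACTIONS = {
    "viewer": {"read"},
    "member": {"read", "run"},
    "admin": {"read", "run", "manage"},
}


def can(user, action, project):
    """True if the user's role in the project allows the action."""
    role = project.members.get(user)
    return action in ROLE_ACTIONS.get(role, set())


def can_run(user, project, dataset, gpus):
    """A job needs run rights, an allowed dataset and a size within the cap."""
    return (can(user, "run", project)
            and dataset in project.datasets
            and gpus <= project.max_gpus)


project = Project("churn", members={"ana": "member", "bo": "viewer"},
                  datasets={"customers"}, max_gpus=2)
print(can_run("ana", project, "customers", gpus=1))
print(can_run("bo", project, "customers", gpus=1))
```

Giving everyone at least the `viewer` role on most projects is one way to keep sandboxing without rebuilding silos.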
Automated ML Pipelines
Machine Learning Pipelines are a concept where work is split up into different stages, starting from exploration and continuing with batch processing, normalization, training, deployment and many other steps in between. You need to build in support for handing over the results of one step to the next. You might also want advanced triggers that automatically rerun parts of the pipeline when, for example, the data in your data lake changes significantly. There are several solutions, from lower-level Git-based approaches to Apache Airflow and MLflow pipelines, that you’d want to explore and stitch together.
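The hand-over between steps can be sketched with nothing more than a dependency graph. The example below uses Python's standard `graphlib` module (Python 3.9+) to order the steps; `run_pipeline` and the toy steps are made up for illustration and do not mirror the API of Airflow or MLflow.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, deps, source):
    """Run steps in dependency order, handing each one its upstream outputs.

    steps:  name -> function(list_of_inputs) -> output
    deps:   name -> list of upstream step names
    source: value given to steps that have no upstream step
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        inputs = [results[d] for d in deps.get(name, [])] or [source]
        results[name] = steps[name](inputs)
    return results


steps = {
    "normalize": lambda ins: [x / max(ins[0]) for x in ins[0]],
    "train": lambda ins: sum(ins[0]) / len(ins[0]),   # the "model" is a mean
    "evaluate": lambda ins: abs(ins[0] - 0.5),        # distance from a target
}
deps = {"normalize": [], "train": ["normalize"], "evaluate": ["train"]}

results = run_pipeline(steps, deps, source=[2.0, 4.0, 8.0])
print(results)
```

A change trigger would sit in front of this: compare a hash of the new input data with the one stored for the last run and call `run_pipeline` again only when they differ.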
Integrations, integrations, integrations!
Data scientists should be able to continue using the tools and frameworks they love, be it their favourite IDE, Notebook or anything else. There should also be different ways of interacting with the infrastructure depending on the situation. A best practice is to build everything on top of an open API that can then be accessed from a CLI, from a web UI or even directly from a Jupyter notebook. For hyperparameter tuning you might want to consider an external optimizer such as SigOpt.
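The "API first, front ends on top" idea can be shown in a few lines. In this sketch a plain function stands in for the API call, and the CLI does nothing but parse arguments and call it; the `mlctl` program name, `create_execution` and the response fields are all hypothetical. A notebook helper or web UI would call the same function.

```python
import argparse
import json


def create_execution(project, step, parameters):
    """Stand-in for the API: the one place where work gets submitted."""
    return {"project": project, "step": step,
            "parameters": parameters, "status": "queued"}


def cli(argv):
    """Thin CLI front end: parse arguments, then call the API."""
    parser = argparse.ArgumentParser(prog="mlctl")
    parser.add_argument("project")
    parser.add_argument("step")
    parser.add_argument("--param", action="append", default=[],
                        metavar="KEY=VALUE")
    args = parser.parse_args(argv)
    parameters = dict(p.split("=", 1) for p in args.param)
    return create_execution(args.project, args.step, parameters)


response = cli(["churn", "train", "--param", "epochs=5", "--param", "lr=0.01"])
print(json.dumps(response))
```

Because the front ends contain no logic of their own, adding a new one later does not change how executions behave.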
Visualize ongoing executions
As data scientists work with their models, they need to understand how the individual steps are progressing, whether that means training accuracy or data preparation. For this you need to build a way to show real-time training metrics in a meaningful way. The solution could be a customized visualization library or, for example, TensorBoard or Weights & Biases.
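One simple way to feed such a view, sketched here as an assumption rather than as how any of the tools above work, is to have the training code write one JSON object per line and let the platform pick those lines out of the regular log output. `log_metrics` and `read_metrics` are hypothetical helper names, and the accuracy values are invented.

```python
import io
import json


def log_metrics(stream, **metrics):
    """Write one JSON object per line so a UI can follow the stream."""
    stream.write(json.dumps(metrics) + "\n")


def read_metrics(text):
    """Parse the JSON lines back, skipping ordinary log output."""
    rows = []
    for line in text.splitlines():
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(row, dict):
            rows.append(row)
    return rows


out = io.StringIO()                       # stands in for the job's stdout
out.write("loading data...\n")            # plain log line, ignored by the reader
for epoch, accuracy in enumerate([0.61, 0.74, 0.82], start=1):
    log_metrics(out, epoch=epoch, accuracy=accuracy)

rows = read_metrics(out.getvalue())
best = max(rows, key=lambda r: r["accuracy"])
print(best)
```

The rows can then be plotted by whatever visualization layer you choose.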
The final piece of your ML pipeline is deployment. You might not want to make it possible to deploy directly to a live production environment, but make sure your data science teams are self-sufficient when it comes to deploying new models for the software teams that will be integrating the predictive models into your business apps. The old way of shipping the model to IT is not going to work in the long run due to overhead and a slowed model development cycle. Kubernetes clusters are a de facto standard in the industry for inference deployment in the cloud. Also look into AWS Elastic Inference, Seldon Core or Azure inference engine with ONNX support.
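Whatever serves the model, self-sufficient deployment usually comes down to two operations: put a new version behind a stable endpoint, and switch back if it misbehaves. The sketch below is framework-independent and purely illustrative; the `Endpoint` class, the JSON request shape and the toy "models" are assumptions, and a real setup would run this behind an HTTP server on your cluster.

```python
import json


class Endpoint:
    """Holds deployed model versions and serves the active one."""

    def __init__(self):
        self.versions = {}     # version name -> callable(features) -> prediction
        self.active = None

    def deploy(self, version, model):
        self.versions[version] = model
        self.active = version

    def rollback(self, version):
        if version not in self.versions:
            raise KeyError(f"unknown version: {version}")
        self.active = version

    def handle(self, body: str) -> str:
        """Take a JSON request body and return a JSON response body."""
        features = json.loads(body)["features"]
        prediction = self.versions[self.active](features)
        return json.dumps({"version": self.active, "prediction": prediction})


endpoint = Endpoint()
endpoint.deploy("v1", lambda f: sum(f) > 1.0)    # toy model: a threshold rule
endpoint.deploy("v2", lambda f: sum(f) > 2.0)    # stricter threshold
request = json.dumps({"features": [0.4, 0.9]})
print(endpoint.handle(request))                  # served by v2
endpoint.rollback("v1")
print(endpoint.handle(request))                  # served by v1 again
```

Returning the version with every prediction makes it easy for the integrating software team to tell which model produced a given result.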
Buying a Scalable Machine Learning Infrastructure
Not only does it take several man-years to build the infrastructure yourself (believe me, we’ve got the history to back that up), but you’ll also need to continuously develop the platform as technologies evolve and new ones need to be supported.
Valohai is a complete Scalable Machine Learning Infrastructure service that scales for your team, from 1 to 1000 data scientists. Everything in Valohai is built around projects and teams, and it scales from on-premises installations to hybrid clouds and full cloud solutions in Microsoft Azure, AWS and Google Cloud. Valohai takes care of the machine orchestration for you and automatically keeps track of every experiment you or anybody in your company has ever conducted: the data, code, hyperparameters, logs, software environment and libraries, hardware and more.
Valohai is built on an open API and integrates with your current workflow through a ready-made CLI, an intuitive Web UI and Jupyter notebook integrations. With automated pipelines you’re able to kick off a batch job and run every step of your training process without ever touching it manually. The interactive visualization of your metadata helps your team see what everyone else is doing and watch their models converge until they’re ready to push them into production themselves.
One thing is for sure, however: ML infrastructure is the most important factor in your team’s productivity. At its best, it will cut down model development time by as much as 90%.