MLOps Platform: Build or Buy
Congratulations, you’ve decided to invest in MLOps. You might be in a situation where you already have machine learning models in production, or you know you’ll go to production soon. Either way, you know machine learning will be key for your success in the future, and anything that’ll accelerate your speed to market is worth investing in.
Once you have decided to invest, the next question is how: should we build our own MLOps infrastructure, or should we buy a managed MLOps solution? The right answer depends on your team and your use case. Most technical decision-makers skew towards building because it is their bread and butter, but we want to encourage critical thinking around the subject.
For the purpose of this article, at a minimum, an MLOps platform should offer:
- Tracking and versioning
- Machine learning pipelines
- Model deployment
Option 1: Build your own MLOps platform
Generally, the option of building your own MLOps platform means that you’ll set up an open-source solution and customize it to suit your needs. The most popular open-source options available are KubeFlow and MLFlow. These platforms are supported by large backers (KubeFlow by Google and MLFlow by Databricks) and boast a significant community. There are also a few more obscure alternatives out there, such as Netflix’s MetaFlow and Lyft’s Flyte.
The most significant benefit of building your own MLOps platform is that the possibilities are endless. If you, for example, decide to implement KubeFlow and a particular feature is not to your liking, you can simply fork the repository and customize the application to suit your needs.
Popular open-source solutions also tend to have plenty of available open-source extensions to fill in gaps. For example, you can extend KubeFlow with KFServing for serverless inferencing and with Katib for hyperparameter optimization.
Finally, the community is what makes open-source powerful, and both MLFlow and KubeFlow have a healthy community that produces tutorials, how-tos, case studies, and other content to help tackle issues.
Even with all the ready-made pieces we could use to build our solution; it just becomes an unreasonable budget and resourcing request to build and maintain our own custom MLOps solution. — Thilo Huellmann, CTO & Co-Founder at Levity
The most significant benefit of a custom, open-source based solution is also its biggest drawback: with endless possibilities, the speed of adoption can be excruciatingly slow. For example, with KubeFlow, the most common complaint is the steep learning curve involved in setting up the application in your environment and then adapting your projects to it. As the name implies, KubeFlow runs on Kubernetes, and for companies that have not yet jumped on that train, implementation can be tricky. The other popular open-source alternative, MLFlow, is praised for a more straightforward setup, but it is much more limited feature-wise.
That said, from an ease-of-use perspective, Kubeflow doesn’t feel mature enough, particularly for such a complex system. Moreover, it assumes a lot of competency with Kubernetes and/or containers, which frankly is great if you have that and disappointing if you don’t — not every data science team will. Kubeflow is a tool for a grin-and-bear-it intermediate or truly advanced team of ML engineers. — Byron Allen
In addition to the cost of setting up a custom MLOps solution, you’ll be invested in maintaining it for the foreseeable future. For organizations with many teams working on machine learning solutions and a dedicated operations team, the cost of maintenance is manageable. For smaller organizations without such dedicated resources, backend issues quickly become blockers for the whole data science team.
Cons:
- Speed of adoption
Option 2: Buy a managed MLOps platform
Many startups and established players have stepped into the MLOps market to offer their managed platforms. There are niche platforms that focus on a single aspect of MLOps - for example, Seldon for deployment - and can be integrated with other solutions. This article will concentrate on MLOps platforms that cover the three minimum criteria we set in the beginning. Naturally, we are biased towards our platform, Valohai, but there are plenty of commercial MLOps platforms, including Allegro Trains, Algorithmia, cnvrg.io, Dataiku, and Iguazio. The cloud vendors also have various offerings for machine learning, such as Azure ML and AWS SageMaker.
In our experience, speed of adoption is the first and most apparent benefit of buying an MLOps platform rather than building one. While purchasing decisions can be tricky, depending on your organization, they pale in comparison to the project planning involved in building a custom solution.
To make the comparison a bit more concrete, Valohai offers a commitment-free, two-week proof-of-concept where our experts set the platform up in your environment and walk you through implementing your first project. With most open-source solutions, we are talking about weeks of work by one or more engineers before you even get to implementing actual machine learning projects.
The second benefit of a managed solution is that you’ll get new features without extra effort on your part. With custom platforms, you’ll generally implement a new feature yourself or add an existing open-source add-on, which makes later updates trickier because of dependency management. Although the platform roadmap is not in your direct control, the MLOps space is very competitive, and all the platforms mentioned are improving rapidly.
This brings us to the final benefit: with a managed platform, you partner with the company building the platform. Again, using Valohai as an example, we talk to our customers daily. We share best practices. We help with issues even outside our platform and actively implement features our customers request.
One of the drawbacks we generally see with a managed platform is the actual selection and purchasing process. Data science experts and technologists don’t tend to be expert purchasers, which can make decision-making difficult. Jumping into implementing an open-source solution seems second nature, while getting approval for a five- or six-figure purchase does not. The decision is harder still because the MLOps market is nascent and the differences between platforms only become visible through actual usage. As mentioned above, we’ve tried to mitigate this drawback through the commitment-free POC.
Choosing a platform vendor is a strategic partnership, and if your view of the future doesn’t align with your partner’s, it can become a burden. For example, the platform you adopt might be missing key features that your competitors can access through other platforms. While lock-in is a risk that also exists in the open-source route, it can be more detrimental in the managed platform route.
Switching platforms can be challenging, depending on how much history you have with your current one. There are, however, certain acid tests you can use to evaluate how deep the vendor lock-in is on your platform of choice, namely:
- Are all file formats used nonproprietary?
- Does my ML code stay relatively unchanged?
- Does the platform integrate through APIs?
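To make the first two acid tests more tangible, here is a minimal sketch of what platform-agnostic training code might look like. It is an illustration under our own assumptions, not the API of any platform mentioned in this article: the function names, the `--output-dir` flag, and the toy model are all hypothetical. The idea is that the script takes its parameters from plain command-line arguments and writes its results as CSV and JSON, so a platform only has to wrap it rather than be woven into it.

```python
import argparse
import csv
import json
from pathlib import Path


def train(learning_rate: float, epochs: int) -> list:
    """Toy stand-in for a real training loop: fits y = 2x by gradient descent."""
    data = [(x, 2.0 * x) for x in range(1, 6)]
    weight = 0.0
    history = []
    for epoch in range(1, epochs + 1):
        # Gradient of the mean squared error with respect to the single weight.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * grad
        loss = sum((weight * x - y) ** 2 for x, y in data) / len(data)
        history.append({"epoch": epoch, "weight": weight, "loss": loss})
    return history


def save_outputs(history: list, output_dir: Path) -> None:
    """Write metrics and the model in open formats (CSV and JSON)."""
    output_dir.mkdir(parents=True, exist_ok=True)
    with open(output_dir / "metrics.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "weight", "loss"])
        writer.writeheader()
        writer.writerows(history)
    model = {"weight": history[-1]["weight"]}
    (output_dir / "model.json").write_text(json.dumps(model))


def main(argv=None) -> list:
    # Hypothetical flags: any platform that can pass command-line
    # arguments can run this script without changes to the code.
    parser = argparse.ArgumentParser(description="Platform-agnostic training sketch")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--output-dir", type=Path, default=Path("outputs"))
    args = parser.parse_args(argv)
    history = train(args.learning_rate, args.epochs)
    save_outputs(history, args.output_dir)
    return history
```

Calling `main(["--epochs", "100", "--output-dir", "run-1"])` runs the same code on a laptop as it would inside a managed platform. If adopting a platform forces you to rewrite a script like this around proprietary calls or file formats, that is a sign the lock-in runs deep.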
Pros:
- Speed of adoption
- Features without extra investment
- Strategic partner
Cons:
- Vendor selection
- Vendor lock-in
Which path is right for you?
We’ve read and written about large, trailblazing teams succeeding with custom MLOps platforms built from scratch, as Uber has with Michelangelo and Netflix with MetaFlow. And we’ve heard of companies like Spotify using KubeFlow extensively to create a shared platform for ML across teams.
There is, however, a key thing to note here: these are extraordinarily large, tech-focused organizations. Spotify has, according to LinkedIn, 100 machine learning engineers and 250 data scientists. Most organizations simply don’t have these resources.
For most organizations, the implementation and maintenance risks involved in a custom solution outweigh the possible benefits. Generally speaking, it’s not beneficial to reinvent the wheel and doubly so if it involves significant risks.
Kubeflow pipelines does a great job of breaking down the ML process into discrete pieces (generate data, transform features, train model, verify model, etc). You just have to spend 1000 hours and a TBD dollar amount to get it configured the first time. — Adam Laiacano
We’ve written a few case studies with our customers that tackle this same topic.