A Comprehensive Comparison Between Metaflow and Amazon SageMaker
by Henrik Skogström | on December 05, 2021
In data science, it is critical to have a viable and continuous process of ML model development and improvement. Building an effective ML model is not a major problem here. Following through to production is. Many companies struggle at this stage.
To address this challenge, some organizations hire engineers to help with machine learning deployments and with bringing models from idea to real-life use. However, there are technologies on the market that have been built specifically to make ML deployments easy: from orchestration tools to MLOps platforms, to data management tools and cloud platforms.
You may be wondering which one is best suited for your team. We want to help you make a well-informed decision. Thus, we consider it necessary to make a comprehensive comparison between popular ML tools.
In this article, we will compare the major similarities and critical differences between Metaflow and Amazon SageMaker.
Metaflow and its Components
Metaflow is a Python library that helps teams build and maintain different types of machine learning models. It was initially developed at Netflix to improve the productivity of its data scientists. The major components of the Metaflow architecture are as follows:
Flow: A flow is the smallest unit of computation that can be scheduled for execution. It defines a workflow that pulls data from an external source as input, processes it, and produces output data. To implement a flow, users subclass FlowSpec and define steps as methods, along with any parameters or data triggers. The flow code and its external dependencies are encapsulated in the execution environment.
Graph: Metaflow deduces a directed acyclic graph (DAG) from the transitions between step functions. These transitions must be declared in the flow so that the graph can be parsed statically from its source code.
Step: A step can be defined as a checkpoint that provides fault tolerance for the system. Metaflow typically takes a snapshot of the data produced by a step and uses it as input to the subsequent steps. Therefore, if a step fails, it can be resumed without having to rerun the preceding steps. Decorators can be used to modify the behavior of a step. The body of a step is known as the step code.
Runtime (Scheduler): The runtime, or scheduler, executes a flow. That is, it orchestrates and runs the tasks defined by the steps in topological order. You can use metaflow.client, a Python API, to access the results of runs.
Datastore: This is an object store where both data artifacts and code snapshots are persisted. It must be accessible from every environment where the Metaflow code is executed.
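To make the relationship between these components more concrete, here is a minimal, library-free sketch of the idea. It does not use Metaflow and is not how Metaflow is implemented. The step names, the `run_flow` helper, and the in-memory `datastore` dictionary are all hypothetical. It only illustrates steps forming a DAG, a runtime executing them in topological order, and stored step outputs allowing a failed run to resume without rerunning earlier steps.

```python
from graphlib import TopologicalSorter

# Hypothetical toy flow: each step maps to the set of steps it depends on.
GRAPH = {
    "start": set(),
    "load": {"start"},
    "train": {"load"},
    "end": {"train"},
}


def run_flow(graph, steps, datastore, log):
    """Run steps in topological order, skipping steps whose output is already stored."""
    for name in TopologicalSorter(graph).static_order():
        if name in datastore:
            continue  # a snapshot exists, so this step is not rerun
        inputs = {dep: datastore[dep] for dep in graph[name]}
        datastore[name] = steps[name](inputs)  # persist the step's output
        log.append(name)
    return datastore


def make_steps(fail_train):
    """Build the step functions; `fail_train` simulates a crash in the train step."""
    def train(inputs):
        if fail_train:
            raise RuntimeError("simulated training failure")
        return sum(inputs["load"]) / len(inputs["load"])

    return {
        "start": lambda inputs: "started",
        "load": lambda inputs: [1.0, 2.0, 3.0],
        "train": train,
        "end": lambda inputs: {"model_mean": inputs["train"]},
    }


datastore, log = {}, []
try:
    run_flow(GRAPH, make_steps(fail_train=True), datastore, log)
except RuntimeError:
    pass  # "start" and "load" were stored before the failure

first_attempt = list(log)
log.clear()

# Resume: only the steps without a stored output are executed.
run_flow(GRAPH, make_steps(fail_train=False), datastore, log)
resumed = list(log)
print(first_attempt, resumed, datastore["end"])
```

In this sketch the second call picks up at the failed step because the outputs of the earlier steps are already in the store, which is the same reasoning behind treating a step as a checkpoint.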
AWS SageMaker and its Components
The built-in SageMaker Studio is a core part of the SageMaker experience.
SageMaker is a fully-managed service that provides data scientists and machine learning engineers with the ability and resources to prepare, build, train, and deploy ML models seamlessly. It was created by Amazon as a service running on AWS. Amazon SageMaker components can be described under four major categories as follows:
Collect and Prepare: Under this category, SageMaker Data Wrangler helps you connect to data sources, prepare data, and create model features. Amazon SageMaker Clarify lets you improve model quality through bias detection during data preparation and after training. The SageMaker Ground Truth console helps you develop accurate training datasets and label your data quickly with built-in data labeling workflows. With SageMaker Feature Store, you can securely store, discover, and share ML features for serving in real time or in batches. Finally, Amazon SageMaker Processing lets you connect to existing storage, run your job, and save the output to persistent storage.
Build: Here, SageMaker Studio Notebooks, which are one-click Jupyter notebooks, enable you to spin up or down any available resources. Amazon SageMaker JumpStart empowers you to get started with ML using pre-built solutions that can be easily deployed. Amazon SageMaker Autopilot automatically builds, trains, and tunes machine learning models for you.
Train and Tune: Use Amazon SageMaker Experiments to track every iteration of your machine learning models by capturing the input parameters, configurations, and results, and storing them as 'experiments'. Amazon SageMaker Debugger allows you to capture metrics and profile training jobs in real time. With SageMaker, you can also tune your model automatically by trying many combinations of hyperparameters to find the most accurate predictions.
Deploy: SageMaker Pipelines makes CI/CD easier by letting you build fully automated workflows for your ML lifecycle. SageMaker Model Monitor automatically detects concept drift in your deployed models and raises alerts so that you can identify problems and improve model quality. Amazon Augmented AI offers built-in human review workflows for the most common ML use cases.
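The ideas behind experiment tracking and automatic tuning in the Train and Tune category can be sketched in a few lines of plain Python. This is not the SageMaker API. The `validation_loss` function is a hypothetical stand-in for training and scoring a model, the search space is made up, and an exhaustive grid is used only because it is the simplest strategy to show.

```python
from itertools import product


def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it on held-out data."""
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2


# Hypothetical search space: every combination below is tried once.
search_space = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

experiments = []  # one record per run: the inputs and the resulting metric
for values in product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    experiments.append({"params": params, "loss": validation_loss(**params)})

# The best run is the one with the lowest recorded loss.
best = min(experiments, key=lambda run: run["loss"])
print(len(experiments), best["params"])
```

Keeping the parameters and the metric together in one record per run is what makes the runs comparable afterwards, which is the role the article attributes to 'experiments'.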
Similarities between Metaflow and SageMaker
Here are some of the core features that allow us to put both tools in the same bucket:
Both platforms leverage Python. Metaflow is built as a Python library, while Amazon SageMaker has a Python SDK, an open-source library that provides Python APIs for training and deploying ML models on SageMaker.
The two platforms work with Kubernetes and AWS. Metaflow, for instance, provides built-in integrations with storage, compute, and machine learning services on AWS and Kubernetes. Similarly, Amazon SageMaker Operators for Kubernetes make it easier for developers and data scientists using Kubernetes to train, tune, and deploy ML models in SageMaker.
Both can be used for model tracking, versioning, and pipeline orchestration.
Differences between Metaflow and SageMaker
Metaflow is a Python library that empowers teams to build and manage production machine learning while SageMaker is a fully-managed service that offers an IDE for ML model deployment and workflow management. Based on this fundamental difference, the core differences between Metaflow and SageMaker include the following:
The first major difference between Metaflow and SageMaker lies in their origins. Metaflow was created by Netflix to improve the productivity of data scientists, while SageMaker was created by Amazon as a service running on AWS.
Metaflow is open source. That is, it is accessible and free to use by anyone. Conversely, SageMaker is a fully-managed service. Although you can access a few components like the SageMaker Studio for free, you have to pay to use the AWS services within the studio.
Metaflow is primarily focused on pipelines as it helps organizations track models, version models, and orchestrate pipelines. On the other hand, SageMaker can do more than track and version models or orchestrate pipelines. It covers several other ML tasks including machine learning model deployment.
Metaflow doesn't offer any ML-specific features or algorithms. SageMaker, on the other hand, comes with a bag of algorithms, pre-built models, and AutoML abilities built into the product.
Both platforms have a user interface, but the two serve different purposes. SageMaker Studio is (trying to be) a full IDE with all the bells and whistles, where you write code, inspect data, and execute pipelines, like PyCharm for ML folks. In Metaflow, a user interface has only recently been added as a separate add-on service for passive monitoring of your Metaflow executions.
Summary
No doubt, machine learning is a continuous cycle. You fetch, clean, and prepare the data; train, tune, and evaluate the model; and deploy it to production. After this, you need to keep monitoring inferences to collect your 'ground truth', identify any drift, retrain the model with new data, and increase accuracy. The choice of an ML platform for these tasks largely depends on the focus, interest, and strength of your organization.
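As a rough illustration of what "identify any drift" can mean in practice, the sketch below compares recent values of a single numeric input feature against a baseline taken from the training data. It is a deliberately simple mean-shift check with made-up numbers and a hypothetical `has_drifted` helper. It is not how any particular platform detects drift, and real monitoring usually looks at many features and at model outputs as well.

```python
from statistics import mean, stdev


def has_drifted(baseline, live, threshold=3.0):
    """Flag drift when the live mean is more than `threshold` standard errors
    away from the baseline mean."""
    standard_error = stdev(baseline) / len(live) ** 0.5
    return abs(mean(live) - mean(baseline)) > threshold * standard_error


# Hypothetical feature values seen at training time and in production.
baseline = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stable = [10.1, 9.9, 10.3, 9.7]
shifted = [13.0, 12.5, 13.5, 12.8]

stable_flag = has_drifted(baseline, stable)
shifted_flag = has_drifted(baseline, shifted)
print(stable_flag, shifted_flag)
```

A flag like this would typically be the trigger for the retraining step in the cycle described above.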
Based on the comparison provided in this article, Metaflow is a viable option if you need a specialized platform to build pipelines. In that case, if you eventually require an end-to-end MLOps solution, you will need other products to complement Metaflow. However, if you are more interested in a platform that can handle various ML tasks, you can settle for an MLOps platform with an end-to-end approach, like SageMaker.
Valohai as an Alternative for Metaflow and SageMaker
[CAUTION: Opinions ahead] SageMaker and Metaflow are currently among the most popular MLOps platforms on the market. However, depending on your use case, Valohai might be a better option that covers some of the weaknesses both of them have.
Metaflow is an open-source platform, which is both one of its strengths and one of its core weaknesses. Due to the complexity of the platform, it can be a pain to set up and manage without a dedicated team. Unfortunately, the free price tag comes at a cost.
SageMaker, on the other hand, is a managed platform, and if you are already in the AWS ecosystem, utilizing it is simple and free for the first two months. The SageMaker price tag increases with the services you use. Enias Cailliau from radix.ai points to a 40% cost increase compared to running purely on EC2 instances. This can make costs rise rather rapidly once you use SageMaker for production pipelines.
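To see how a 40% increase adds up, here is a small back-of-the-envelope calculation. The hourly rate and the number of hours are hypothetical and are not real AWS prices. Only the 40% figure comes from the claim quoted above.

```python
def sagemaker_estimate(ec2_hourly_rate, hours, markup=0.40):
    """Estimated cost if the same workload costs `markup` more than on plain EC2."""
    return ec2_hourly_rate * hours * (1 + markup)


# Hypothetical workload: a $1.00/hour instance running for 720 hours (about a month).
ec2_cost = 1.00 * 720
sm_cost = sagemaker_estimate(1.00, 720)
extra = sm_cost - ec2_cost

print(f"EC2: ${ec2_cost:.2f}  SageMaker: ${sm_cost:.2f}  extra: ${extra:.2f}")
```

Under these assumptions, one always-on instance costs roughly $288 more per month, and the gap scales linearly with the number of instances and hours.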
Valohai is also a managed platform with a 2-week free trial period.
This article continues our series on common tools teams are comparing for various machine learning tasks.