Stop Making Your Data Scientists Learn AWS: The True Cost of SageMaker

by Drazen Dodik | on May 21, 2025

When you start building machine learning workflows on AWS, the first impression is dazzling. SageMaker offers clean services for training, deployment, tracking, and more—all modular, all powerful.

But as your projects grow, the reality hits you like a surprise AWS bill at the end of the month: Success depends less on how powerful the tools are—and more on how many hours your data scientists spend teaching themselves cloud engineering on weekends.

Data scientists aren't just building models anymore. They're expected to be:

  • Cloud infrastructure experts
  • IAM policy wizards
  • ECR credential detectives
  • CloudWatch log excavators

And even if you do have a DevOps team dedicated to ML—is this really where your expensive engineering talent should be spending their days? Debugging permission errors instead of building resilient infrastructure that moves your business forward?

The real limit to scaling ML on AWS isn't infrastructure capacity. It's the unreasonable expectation that forms in practice: we consistently see teams where data scientists are somehow expected to master multiple specializations simultaneously. It's simply not realistic for one person to be an expert in neural architectures, distributed training, and the intricacies of IAM policies, ECR registries, and CloudFormation templates—yet that's exactly what ends up happening in many SageMaker implementations.

The Real Complexity Tax: Everyone Becomes IT Support

Sure, AWS complexity can be managed—with the right templates, IaC setups, and tribal knowledge passed down through generations of engineers like ancient folklore.
But every additional moving part—every extra role assumption, storage policy, encryption setting—raises the baseline for what it takes to contribute.

And if you're serious about scaling ML?
You're not just scaling models and datasets—you're scaling a support desk:

  • More onboarding time teaching AWS quirks instead of your actual business problems and data domain
  • More cross-team support requests ("Can someone help me figure out why my SageMaker endpoint is stuck in 'Creating' for the third time this week?")
  • More Slack channels dedicated to AWS troubleshooting than to actual machine learning discussions
  • More hidden friction as your team silently battles the same cloud configuration demons over and over

The dream is self-serve ML workflows. The reality often looks more like DevOps becoming the ML team's personal IT helpdesk, AWS Slack channels overflowing with permission errors, and senior engineers quietly spending their afternoons debugging IAM policies instead of shipping features that actually generate revenue.

The Hidden Cost of Owning the Entire AWS Stack for ML

Let's be clear: AWS is fantastic infrastructure. It powers much of the modern internet—including parts of our own platform. But the experience of building machine learning workflows inside AWS—specifically with SageMaker—is a different beast entirely.

SageMaker promises a unified ML platform. One SDK, one place for training, deployment, tracking, and orchestration.
And it works. Until it doesn't.

Once you start to scale, the cracks appear:

  • You hit arbitrary limits: Max 50 steps per pipeline. Max 10 concurrent HPO jobs per tuning run. These aren't edge cases—any serious experimentation loop runs into them.
  • You chase missing features: Want end-to-end preprocessing + training + post-processing in one workflow? Suddenly the answer is "Use SageMaker… plus Glue… plus Batch… and maybe Step Functions." Now you're back to stitching services together like a cloud quilt.
  • You juggle disjointed UX: Separate UIs, permissions, and workflows. Different pricing models for notebooks vs training jobs vs batch transforms. All glued together by tribal knowledge and internal wikis that are already outdated by the time they're written.
  • You get locked in without realizing it: Code starts to depend on sagemaker.pytorch, not just pytorch (see the sketch after this list). Output formats, artifact management, and metadata tracking become SageMaker-specific. Want to run just one part of your workflow somewhere else? Good luck extracting that single piece from the SageMaker monolith.
  • And yes, it's expensive: SageMaker training instances are 20%+ more expensive than equivalent raw EC2. Even tracking experiments can cost hundreds—a single hosted MLflow server for basic tracking and access control? Try $438–$759/month. Per environment (which could mean per team, per project, or both depending on how your security team decided to slice things up).
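
To make the lock-in bullet concrete, here's a minimal sketch of how the coupling typically creeps in (the role ARN, bucket, instance type, and version strings below are placeholder assumptions, not from any real project): the launcher is written against the SageMaker Python SDK, and even the "plain PyTorch" entry point ends up reading SageMaker-injected environment variables.

```python
# launcher.py -- orchestration code that only the SageMaker SDK understands
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder ARN
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    framework_version="2.0",   # illustrative framework/Python versions
    py_version="py310",
    hyperparameters={"epochs": 10, "lr": 1e-3},
)
# fit() takes S3 "channels", so even your input layout becomes SageMaker-shaped
estimator.fit({"training": "s3://example-bucket/train/"})
```

```python
# train.py -- the entry point now depends on SageMaker conventions, not just PyTorch
import os

data_dir = os.environ["SM_CHANNEL_TRAINING"]  # injected by SageMaker for the "training" channel
model_dir = os.environ["SM_MODEL_DIR"]        # artifacts must be written here to be captured
# ... the actual PyTorch training loop would go here ...
```

Moving even one of these steps to another runner means rewriting both the launcher and those path conventions, which is exactly the extraction problem described above.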

Scaling People, Not Just Compute

When teams assume SageMaker is a one-stop-shop for ML, they don't expect to invest this much in platform engineering.

But every new team member now needs:

  • A crash course in your AWS setup
  • Familiarity with SageMaker's unique SDK patterns and deployment behaviors
  • Access to internal runbooks for how to navigate limits, workarounds, and cross-service configs

In practice, you're not just hiring data scientists—you're hiring people who can also navigate DevOps, architecture, and SageMaker's platform-specific quirks. Or you're building an internal team to abstract that complexity away from them—which usually means bottlenecks, gatekeeping, and long lead times for even simple workflows.

Either way, the cost is real. And it has nothing to do with your EC2 bill.

The Real Trade-Off: Control vs Focus

Yes, AWS gives you full control. And for some teams, that's non-negotiable.
But full control means full responsibility:

  • When features are missing, you work around them with more custom code that someone will have to maintain
  • When usage scales, you hit limits you didn't know existed until production is already down
  • When hiring, you're fishing in a tiny talent pool of unicorn engineers who are somehow both cloud architecture experts AND cutting-edge data scientists AND domain experts in your specific business vertical

The question isn't whether SageMaker can scale—it's whether you can find enough mythical creatures to operate it at scale before your competitors ship their next ten features.

Bridging the Gap: A Different Operating Model

What if your data scientists could focus on models, not middleware?
What if your ML engineers didn't have to relearn SageMaker's SDK quirks before writing their first training job?
What if your best infra people could focus on scaling and hardening your platform—not spending their week debugging why someone's pipeline failed at step 51?

Wait, I know what you're thinking… "Here comes the pitch where they say their product solves everything!" Well, yes, this is a Valohai blog. But hear me out—because your team's sanity might actually depend on it.

That's the idea behind using orchestration layers built specifically for ML workflows—like Valohai.

And I want to be clear here: Valohai isn't a replacement for AWS. It runs inside your AWS. Your VPCs, your storage, your network rules—all stay intact. Your infra team retains full control.

But what it does do is let your DS/ML teams:

  • Write, version, and track experiments without custom infra
  • Run (distributed) jobs without diving into IAM or ECR configs
  • Chain preprocessing + training + validation without 3 different services
  • Avoid vendor-specific SDKs—just run your containers (sketched below)
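
As a rough illustration of the "just run your containers" point (a sketch with made-up flag names and paths, not a required Valohai contract), a training entry point can stay vendor-neutral by taking ordinary arguments and writing artifacts to a plain output directory that whatever runs the container knows to collect:

```python
# train.py -- framework-only entry point, no vendor SDK imports
import argparse
import json
import os


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", default="./data")      # illustrative defaults
    parser.add_argument("--output-dir", default="./outputs")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=1e-3)
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)

    # ... load data from args.data_dir and train with plain PyTorch/sklearn/etc. ...

    # Artifacts and metrics are just files; the runner (your laptop, CI, or an
    # orchestrator such as Valohai) decides where they end up and how they're versioned.
    with open(os.path.join(args.output_dir, "metrics.json"), "w") as f:
        json.dump({"epochs": args.epochs, "lr": args.lr}, f)


if __name__ == "__main__":
    main()
```

The same container image then runs unchanged locally, in CI, or under an orchestration layer; only the arguments and the output location differ.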

In short: It gives you operational simplicity without giving up architectural control.

Because in most ML orgs, the bottleneck isn't compute—it's people drowning in cloud configuration.

There Has to Be a Better Way (And It's Simpler Than You Think)

If your team is happily running high-scale ML in SageMaker with minimal friction—honestly, hats off. That's not sarcasm. You've likely invested deeply in infra maturity, hiring, and internal tooling. And that's a valid strategy.

  • But if you're hearing grumbling from your data scientists,
  • If your experiments are slowing down instead of speeding up,
  • If your DevOps lead is moonlighting as SageMaker support,
  • If your costs are skyrocketing due to compute markup and inefficient use of parallelism, caching, or Spot instances…

Then maybe it's time to ask a different question:

"Do we really want our smartest people spending their time navigating SageMaker limits and debugging AWS configs? Or do we want them building models that ship faster and scale better?"

There's a different operating model out there—one that keeps the power of AWS, but gives your team breathing room to focus on what they were actually hired to do.

Try Valohai Without the Migration Nightmare

Nobody wants to kick off a "strategic platform migration" only to realize halfway through that the tooling doesn't fit—or worse, that it demands more change than it's worth.

That's why we keep the Valohai evaluation process simple, technical, and surprisingly low-risk:

  • Start with a meaningful exploration: We begin with a technical demo and a hands-on discussion. Not just "here's what Valohai does," but: What's slowing your team down? Where are the workflow gaps? What does better actually look like for you?
  • Live dry-run with your code: Next, we fire up a screen share and take one of your real projects—code, infra, quirks and all—and run it on Valohai. No massive rewrites. Just change where you save outputs, plug in your config file, and hit go. In about two hours, you'll have your pipeline running with full versioning, lineage, and reproducibility.
  • Structured team trial: If that first run looks good, we plan a two-week trial with your broader team. We define what success looks like together, help with onboarding, and support your specific workflows—not just "how Valohai works," but how it fits into how you work.
  • Decide with real results: At the end, we look at what changed. Did you get more visibility? Faster iteration? Fewer bugs or "ghost runs"? We'll help quantify the impact—on efficiency, collaboration, and sanity—and you can decide if the fit makes sense.

We do all this to avoid what everyone hates: long-winded trials, heavy planning cycles, or high-risk "let's just migrate and hope" moments. The whole process takes just two weeks from start to finish, and you get to experience Valohai in your environment with your actual code and data. It's not a sterile demo—it's your real workflows running on a platform that lets you test before you commit.

Ready to see if there's a better way for your team?
Let's do a quick proof-of-concept with your actual code and workflow.
Drop me a line on LinkedIn or book a slot for a technical deep dive that focuses on your specific challenges, not just our features.

Start your Valohai trial: try out the MLOps platform for 14 days