Your infrastructure

ML infrastructure that runs where your data lives.

Many ML teams can't move their data to a managed service because the data is on-prem, regulated, or already sitting in a cloud setup that works. Valohai runs on your existing AWS, Azure, GCP, or on-prem infrastructure and gives you compute orchestration, dataset management, and pipeline automation on top of it. The platform comes to your data, not the other way around.

01 · Beyond managed ML services

Managed ML services were built for the general case.

SageMaker, Vertex AI, and Azure ML are built for breadth: every customer, every use case, the same shape. When your workloads get specific (massive datasets, long training runs, custom environments, mixed GPU types), the platform that fit everyone starts to fit you less.

Valohai runs on your existing AWS, GCP, Azure, OVH, Scaleway, or Oracle account. Your compute, your storage, your cloud pricing all stay where they are. Valohai adds the orchestration, experiment tracking, and pipeline automation that an ML team needs at depth, not just at breadth.

Side-by-side view of a managed ML service vs the Valohai orchestration layer running on your own cloud

02 · Across clouds and on-prem

Project A runs on AWS. Project B requires OVH. Project C is on-prem. You need one platform for all three.

Different clients, different compliance requirements, different clouds. Some projects land on the big three. Others need Scaleway, OVH, Oracle, or on-prem hardware. The infrastructure changes; your team's workflow doesn't have to.

Valohai orchestrates across all of them. Same pipeline definitions, same experiment workflows, same tracking. Different compute underneath. Move a project from one cloud to another without rewriting your pipelines.

Valohai orchestrating ML workloads across AWS, Azure, and GCP from one control plane

03 · Clouds without a platform

You have reserved instances on Scaleway. You have no ML platform on Scaleway.

Scaleway, OVH, Verda, and other regional providers are good for compute. They give you GPUs, fair pricing, and often better data sovereignty than the big three. What they don't give you is an ML platform. The compute is there. The workflow layer isn't.

Valohai brings the full platform to whatever cloud you run on. Experiment tracking, pipeline automation, dataset management, and GPU orchestration, even on clouds that don't offer any of that natively.

The Valohai platform stack rendered as a layer above Scaleway, OVH, and Oracle compute

04 · On-prem and hybrid

You have GPUs in a rack. You don't have a platform to run them.

Your team has on-prem hardware for data residency, for cost, or for both. The GPUs are there. The orchestration around them isn't, which means SSH-ing into machines, managing queues by hand, and tracking experiments in spreadsheets.

Valohai treats your on-prem cluster like any other compute environment. Run experiments, schedule pipelines, manage queues, and track results the same way you would on any cloud. When the on-prem cluster is full, jobs can burst to a cloud you've configured. When the cloud is too expensive for daily work, run that work on-prem and keep the cloud for spikes.
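To make the burst idea concrete, here is a minimal sketch of an "on-prem first, cloud when full" placement rule. It is an illustration only, not Valohai's scheduler or API: the `Environment` class, the `place_job` function, and the GPU counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """A hypothetical compute environment; not a Valohai object."""
    name: str
    total_gpus: int
    busy_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.busy_gpus


def place_job(gpus_needed: int, on_prem: Environment, cloud: Environment) -> str:
    """Prefer on-prem; burst to the cloud only when on-prem cannot fit the job."""
    for env in (on_prem, cloud):
        if env.free_gpus >= gpus_needed:
            env.busy_gpus += gpus_needed
            return env.name
    return "queued"  # neither environment has room right now


# Made-up capacities: an 8-GPU rack and a 16-GPU cloud quota.
on_prem = Environment("on-prem", total_gpus=8)
cloud = Environment("cloud-burst", total_gpus=16)
placements = [place_job(n, on_prem, cloud) for n in (4, 4, 2, 16)]
print(placements)
```

The first two jobs fill the rack, the third spills to the cloud, and the fourth waits because nothing can hold it. A real scheduler would also release GPUs when jobs finish and weigh cost, which this sketch leaves out.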

An on-prem GPU cluster and a cloud-burst environment treated as one orchestration target

05 · GPU efficiency

The expensive problem isn't idle GPUs. It's GPUs you're using badly.

Idle GPUs between training runs are the obvious cost problem, and autoscaling solves that part. The harder problem is the GPUs you're using right now: the A100 running a job that fits comfortably on an L40, the long-running training that holds a GPU during an hour of data loading, the inference job that barely cracks 30% utilization.

Most teams can't see the problem because the tools they have don't show it. Valohai tracks GPU memory, utilization, and runtime patterns across every job, so you can see where compute is actually going. Then it autoscales, mixes GPU types per job, and lets you right-size the workload to what it actually needs.

1. Live GPU memory and utilization across every job
2. Right-size GPU choice based on what jobs actually use
3. Autoscale up for big runs, down between jobs
4. Mix GPU types per job: A100s for training, smaller cards for inference
5. Track utilization and spend per project
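As a rough illustration of right-sizing, the sketch below picks the cheapest card whose memory covers a job's observed peak plus some headroom. It is not how Valohai makes recommendations: the `GPU_CATALOGUE` table, the `right_size` function, the 20% headroom, and the relative costs are all assumptions for the example.

```python
from typing import Optional

# Hypothetical catalogue: GPU name -> (memory in GB, relative hourly cost).
# Memory sizes match common configurations; the costs are made-up units.
GPU_CATALOGUE = {
    "T4": (16, 0.3),
    "L40": (48, 1.0),
    "A100-80GB": (80, 2.5),
}


def right_size(peak_memory_gb: float, headroom: float = 0.2) -> Optional[str]:
    """Return the cheapest GPU whose memory covers the peak plus headroom."""
    required = peak_memory_gb * (1 + headroom)
    candidates = [
        (cost, name)
        for name, (memory_gb, cost) in GPU_CATALOGUE.items()
        if memory_gb >= required
    ]
    # min() compares on cost first, so the cheapest fitting card wins.
    return min(candidates)[1] if candidates else None


print(right_size(30))  # a job that peaked at 30 GB of GPU memory
```

With these numbers, a job peaking at 30 GB lands on the L40 rather than the A100. Memory is only one signal; utilization and runtime patterns would matter just as much in practice.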

GPU efficiency dashboard with utilization, autoscale events, and right-sizing recommendations

06 · Dataset caching

Why is your training data being downloaded for the hundredth time today?

When your training data is in terabytes, and in computer vision it always is, redundant downloads are one of the biggest hidden costs. Cloud egress fees on every transfer. Engineer time waiting for data to land. Storage paid for twice because the cached copy doesn't persist between jobs.

Valohai caches datasets across experiments, machines, and environments. A dataset downloaded for one experiment is immediately available for the next, on the next machine, in the next pipeline run. No duplicate transfers, no waiting.
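The core idea can be sketched in a few lines: key the cache by where the dataset came from, and only download on a miss. This is a simplified illustration, not Valohai's caching implementation. The `DatasetCache` class, the `fake_download` helper, and the example URI are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class DatasetCache:
    """Hypothetical cache keyed by dataset URI; not Valohai's implementation."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def get(self, uri: str, download: Callable[[str], bytes]) -> Path:
        """Return a local path for the dataset, downloading only on a miss."""
        key = hashlib.sha256(uri.encode("utf-8")).hexdigest()
        path = self.root / key
        if not path.exists():
            path.write_bytes(download(uri))
        return path


download_calls = []


def fake_download(uri: str) -> bytes:
    """Stand-in for a real object-store fetch; records each call."""
    download_calls.append(uri)
    return b"example bytes"


cache = DatasetCache(Path(tempfile.mkdtemp()))
first = cache.get("s3://example-bucket/train.tar", fake_download)
second = cache.get("s3://example-bucket/train.tar", fake_download)
print(len(download_calls))  # the second request is served from the cache
```

Sharing a cache across machines would need shared storage and invalidation when a dataset changes, which this single-directory sketch does not attempt.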

Valohai Data tab showing a versioned corpus of files with a document preview pane

07 · Data stays where it belongs

The ML platform that doesn't see your data.

Healthcare data that can't leave a region. Financial data locked to a specific provider. Government contracts with sovereignty requirements. These rules aren't suggestions, and the platform you choose has to work within them.

Valohai runs on your infrastructure, in your cloud account, in your VPC, on your on-prem hardware. Your data stays where it is. Define where compute runs and where data lives, and the platform enforces those boundaries by design.
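One way to picture a residency boundary is as a per-project allow-list that both the data location and the compute location must satisfy. The sketch below assumes that model; `PROJECT_POLICY`, `check_placement`, and the project and region names are invented for illustration and are not Valohai configuration.

```python
# Hypothetical policy: project name -> regions where data and compute may live.
PROJECT_POLICY = {
    "patient-imaging": {"eu-west"},
    "fraud-models": {"eu-west", "eu-north"},
}


def check_placement(project: str, data_region: str, compute_region: str) -> bool:
    """Allow a job only if both data and compute sit in an approved region."""
    allowed = PROJECT_POLICY.get(project, set())  # unknown projects get nothing
    return data_region in allowed and compute_region in allowed


print(check_placement("patient-imaging", "eu-west", "eu-west"))
print(check_placement("patient-imaging", "eu-west", "us-east"))
```

Defaulting unknown projects to an empty allow-list means a missing policy blocks the job instead of letting it run anywhere.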

Regional compute and data residency boundaries enforced by where each project's data and compute live

Built in 2016. Survived every hype cycle. Your setup will too.

Run anywhere. Ship anything.

Start a free trial or talk to an engineer about your infrastructure.