Calling the LLM is the easy part.

Your AI product needs a context pipeline, specialized models, and a way to know if it's giving the right answers.

That's what we do. Since 2016.

A context pipeline feeds an LLM, which calls specialized models and tools to produce your AI feature. Valohai provides the context pipelines, model training, LLM selection, and evaluation underneath.

AI products are changing shape.

Here's what shipping them actually takes.

01

LLMs do the talking. Specialized models do the work that decides the answer.

Classifying 100,000 support tickets a day, detecting anomalies in real time, predicting churn before renewal. These run on models trained on your data, working alongside the LLM. The LLM stays in front. The specialized models bring the precision your domain needs.

02

You'll end up with more models than you planned. Built by more people than you expected. Not just your ML team.

A year ago, training a model meant a dedicated ML engineer. Today, any software engineer with a coding assistant can build and ship one. More features, more use cases, more models. All of them need to be evaluated and managed alongside the LLMs they support.

03

AI products aren't shipped once. Models need retraining. Data drifts. Costs change.

Cloud bills can climb quietly when nobody tracks what each request costs. Quality can drift down without triggering uptime or latency alerts. A model can get retrained by another team and deployed without a clear changelog. The infrastructure around your AI determines whether these stay manageable or become surprises.

Want the 2-minute version?

A quick executive summary of why this matters and what to ask your team.

Read the summary →

The "just ask the LLM" phase

Question

"Projected yield?"

LLM

Guesses from training data

Answer

"Exactly 14,847 units"

Hallucinated · Expensive · Inconsistent

With specialized models

Question

"Projected yield?"

LLM

Routes to the right model

Forecast model

Trained on your data

Answer

"12,400 units, ±8%"

Faster · Cheaper · More accurate · Consistent

Platform

Agent skills

Claude Code · Cursor · Copilot · Codex · Gemini · Zencoder

Files read and written in a local directory, parameters via argparse, metrics printed as JSON. Common Python conventions, nothing Valohai-specific. The platform syncs your files with cloud storage, versions every run, and stays out of your code.

That's why the upskilling path is gentle, and why Agent Skills work so well. Migrating is rinse and repeat: rewrite file paths, lift parameters into argparse, print metrics as JSON. Claude Code, Cursor, and Copilot apply those changes across your scripts. You save the time. Your code stays portable.
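As a rough illustration of those three conventions, here is a minimal sketch of a training script. The /valohai/inputs and /valohai/outputs directories come from the migration log on this page; the file names (train.csv, model.txt), the parameter names, and the toy "training" loop are hypothetical placeholders, not part of any Valohai API.

```python
import argparse
import json
from pathlib import Path

# Directory names as shown in the migration log on this page.
# File names and parameters below are hypothetical.
INPUT_DIR = Path("/valohai/inputs")
OUTPUT_DIR = Path("/valohai/outputs")


def parse_args(argv=None):
    """Parameters arrive as ordinary command-line flags."""
    parser = argparse.ArgumentParser(description="Toy training step")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--input-dir", type=Path, default=INPUT_DIR)
    parser.add_argument("--output-dir", type=Path, default=OUTPUT_DIR)
    return parser.parse_args(argv)


def train(args):
    # Read inputs from a plain local directory.
    rows = (args.input_dir / "train.csv").read_text().splitlines()

    loss = 1.0
    for epoch in range(1, args.epochs + 1):
        loss *= 1.0 - args.learning_rate  # stand-in for a real training step
        # Metrics are one JSON object per line on stdout.
        print(json.dumps({"epoch": epoch, "loss": round(loss, 6), "rows": len(rows)}))

    # Write results to a plain local directory.
    args.output_dir.mkdir(parents=True, exist_ok=True)
    (args.output_dir / "model.txt").write_text(f"final_loss={loss}\n")
    return loss
```

Calling `train(parse_args())` at the bottom of the script makes it runnable from the command line. Because the directories are overridable flags, the same script also runs on a laptop by pointing `--input-dir` and `--output-dir` at local folders.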

See how Agent Skills work →
claude-code   ~/my-ml-project

> Use the Valohai skills to migrate this project.

Scanning project structure...

Detected: PyTorch, transformers, scikit-learn

Rewrote file paths → /valohai/inputs/, /valohai/outputs/

Lifted parameters into argparse (7 found)

Added JSON metric printing (train.py, eval.py)

Generated valohai.yaml (3 steps)

vh lint passed

Ready to run: vh execution run train-model --adhoc

Install: npx skills add valohai/valohai-skills --all

One platform. Every model.

Pipelines

Build it step by step. Run it end to end.

One pipeline definition covers experimentation and production. Define your workflow as connected steps. Each one runs a script, compares results, calls an API, or does whatever else your process needs. Each step caches independently, so only what changed reruns.

1

Conditional logic & quality gates

Control what happens next based on results. Accuracy didn't improve? Skip the next step. Cost per query too high? Branch to a cheaper model. New version doesn't beat the baseline? Block the deployment.

2

Parallel execution

A single task node spawns 4 to 100+ parallel executions, whether that's training across architectures or sweeping across model and prompt combinations. Results flow into the next step for comparison and ranking.

3

Human-in-the-loop approvals

Pause the pipeline until someone signs off. Review results before promoting a model. Approve a dataset before a large training run. Gate a deployment behind a manual check. Failed pipelines restart from the last successful step.
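The quality-gate idea from item 1 can be sketched as a small decision function. This is an illustration only: the `EvalResult` type, the `next_action` function, and the action names are hypothetical, and a real pipeline would express the same conditions in its own configuration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    accuracy: float
    cost_per_query: float  # in your billing currency


def next_action(candidate, baseline, max_cost_per_query):
    """Decide what the pipeline does after an evaluation step.

    Returns one of the hypothetical action names:
    'block', 'cheaper-model', or 'deploy'.
    """
    # New version doesn't beat the baseline: block the deployment.
    if candidate.accuracy <= baseline.accuracy:
        return "block"
    # Better, but too expensive per query: branch to a cheaper model.
    if candidate.cost_per_query > max_cost_per_query:
        return "cheaper-model"
    return "deploy"


candidate = EvalResult(accuracy=0.93, cost_per_query=0.004)
baseline = EvalResult(accuracy=0.91, cost_per_query=0.003)
print(next_action(candidate, baseline, max_cost_per_query=0.002))  # cheaper-model
```

Keeping the gate a pure function of the metrics makes the decision easy to test and easy to audit later.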

Valohai pipeline graph: ingest fans out into chunk, embed, index, evaluate, compare, and gate steps
Valohai Data tab showing a versioned corpus of files with a document preview pane

Datasets

18TB? Download once.

Versioned, immutable, and cached at any scale. New versions track which files were added or removed rather than duplicating the entire dataset. Updating your evaluation corpus with this month's production samples? Only the new files get stored.

1

Cached across every execution

Download a dataset once. Every execution reuses the cache, whether that's local to a machine or on shared storage. When a new version adds files, only the new files download.

2

Smart versioning without duplication

Each version references existing files plus additions. Remove mislabeled samples or add new evaluation examples, and a clean version appears instantly.

3

Aliases for promotion

Point production-data to any version. Update the alias, not your code. Works the same whether your dataset is satellite imagery or a document corpus for RAG evaluation.
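To make the three points above concrete, here is a minimal sketch of diff-based versioning with an alias. It is a toy model under stated assumptions, not how the platform stores data: the `DatasetVersion` class, the `new_downloads` helper, and the file names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    """An immutable version that stores only its diff against a parent."""
    name: str
    parent: Optional["DatasetVersion"] = None
    added: frozenset = frozenset()
    removed: frozenset = frozenset()

    def files(self):
        # Resolve the full file set by replaying diffs from the root.
        base = self.parent.files() if self.parent else frozenset()
        return (base - self.removed) | self.added


def new_downloads(version, cached):
    """Files a machine still has to fetch, given what it already cached."""
    return version.files() - cached


v1 = DatasetVersion("v1", added=frozenset({"a.jpg", "b.jpg", "c.jpg"}))
# v2 drops a mislabeled sample and adds one new file; nothing is copied.
v2 = DatasetVersion("v2", parent=v1, added=frozenset({"d.jpg"}),
                    removed=frozenset({"b.jpg"}))

# Promotion is an alias update; code keeps asking for "production-data".
aliases = {"production-data": v1}
aliases["production-data"] = v2

print(sorted(aliases["production-data"].files()))  # ['a.jpg', 'c.jpg', 'd.jpg']
print(sorted(new_downloads(v2, cached=v1.files())))  # ['d.jpg']
```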

RAG AND LLM EVALUATION

Stop tuning by hand.

Sweep across chunk sizes, embedding models, retrieval strategies, and prompts systematically. See what actually works on your data, with cost right next to quality.
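A systematic sweep starts with enumerating every combination. The sketch below shows that step only; the search space values and the `expand_grid` helper are hypothetical placeholders, and each resulting config would then be evaluated as its own run.

```python
from itertools import product

# Hypothetical search space; substitute your own options.
SEARCH_SPACE = {
    "chunk_size": [256, 512, 1024],
    "embedding_model": ["embedder-a", "embedder-b"],
    "retrieval": ["dense", "hybrid"],
    "prompt": ["v1", "v2"],
}


def expand_grid(space):
    """Return one dict per combination of the options in `space`."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]


configs = expand_grid(SEARCH_SPACE)
print(len(configs))  # 3 * 2 * 2 * 2 = 24
print(configs[0])
```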

1

Multi-model comparison

Evaluate 3+ models against your datasets with one call. See quality, latency, and cost per token across every combination.

2

Run anywhere, even locally

pip install valohai-llm, set an API key, and start posting results. No infrastructure lock-in.

3

Deep tracing with Langfuse

Every evaluation links to a full trace: prompt chains, token counts, latency. Click through for root cause analysis.

Valohai LLM Compare view with radar and bar charts ranking Claude, Llama, and GPT across accuracy, cost per call, faithfulness, latency, and relevance
Lineage trace connecting training executions through best-model promotion to the deployed Model Dataset

Lineage & traceability

Trace any model back to the data that built it.

Every file, every execution, every dataset version, automatically tracked. No manual logging. No detective work when a teammate asks what changed between the last two model versions, or which prompt version is running in your support chatbot.

1

Which execution created this model?

Click any model and see the full execution trace, parameters, and code version.

2

What data was it trained on?

Trace backward through pipeline steps to the exact dataset version and preprocessing.

3

Where is it deployed?

Trace forward and see which deployments and environments use this model version.
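Conceptually, the backward trace in the questions above is a walk over a graph of "produced by" records. The sketch below illustrates that idea with made-up records; the identifiers, the `PRODUCED_BY` mapping, and the `trace_back` function are hypothetical and do not reflect the platform's actual data model.

```python
from collections import deque

# Hypothetical lineage records: each item maps to what it was produced from.
PRODUCED_BY = {
    "model:v7": ["execution:train-42"],
    "execution:train-42": ["dataset:corpus@v3", "code:abc123"],
    "dataset:corpus@v3": ["execution:preprocess-9"],
    "execution:preprocess-9": ["dataset:raw@v1"],
}


def trace_back(node, edges):
    """Breadth-first walk from `node` to everything upstream of it."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


for ancestor in trace_back("model:v7", PRODUCED_BY):
    print(ancestor)
```

Tracing forward to deployments is the same walk over the reversed edges.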

Operations Dashboard

Stop stitching dashboards together.

Costs in your cloud console, queue times in your infrastructure dashboard, model performance in your experiment tracker, LLM evaluation results in yet another tool. One dashboard pulls compute spend, quality metrics, and infrastructure utilization together, automatically.

1

Cost and time savings, quantified

Reused compute, cached datasets, parallel evaluations. See exactly what your infrastructure saved you. Actual numbers, from your actual workloads.

2

Quality vs. cost, visualized

Which configuration meets your quality bar at the lowest cost? Scatter plots, Pareto frontiers, constraint filtering. Stop paying GPT-4 prices for tasks a cheaper model handles just as well.

3

Find the bottleneck, not the excuse

Peak wait times by environment, GPU utilization across your fleet, workload distribution. Your team is queuing for A100s while suitable alternatives sit idle. Now you can see it.
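The quality-versus-cost view in item 2 rests on two simple computations: finding the Pareto frontier and filtering by a quality bar. Here is a minimal sketch of both, assuming each configuration is summarized by one quality score (higher is better) and one cost (lower is better). The function names and sample numbers are hypothetical.

```python
def pareto_frontier(configs):
    """Keep configs that no other config beats on both cost and quality."""
    frontier = []
    best_quality = float("-inf")
    # Walk from cheapest to most expensive; within equal cost, best quality first.
    for c in sorted(configs, key=lambda c: (c["cost"], -c["quality"])):
        # A config earns its place only if it improves on everything cheaper.
        if c["quality"] > best_quality:
            frontier.append(c)
            best_quality = c["quality"]
    return frontier


def cheapest_meeting_bar(configs, min_quality):
    """Constraint filter: lowest-cost config at or above the quality bar."""
    eligible = [c for c in configs if c["quality"] >= min_quality]
    return min(eligible, key=lambda c: c["cost"]) if eligible else None


# Made-up evaluation results for illustration.
results = [
    {"name": "A", "quality": 0.90, "cost": 10.0},
    {"name": "B", "quality": 0.88, "cost": 2.0},
    {"name": "C", "quality": 0.80, "cost": 3.0},  # dominated by B
    {"name": "D", "quality": 0.70, "cost": 1.0},
]

print([c["name"] for c in pareto_frontier(results)])  # ['D', 'B', 'A']
print(cheapest_meeting_bar(results, min_quality=0.85)["name"])  # B
```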

Valohai productivity dashboard with Time to Value, Job Reuse Savings, pipeline success rate, project cost breakdown, and 92% GPU utilization gauge

More platform capabilities

Systematic sweeps

Run hundreds of configurations in parallel. Hyperparameters, prompts, models, retrieval strategies. Find what works, systematically.

Docs →

Distributed training

Multi-GPU, multi-node. PyTorch Distributed, DeepSpeed, Horovod, Accelerate. Scale without rewriting your training code.

Docs →

Model registry

Versioned catalog with approval workflows. Every entry carries its performance history and eval metrics. Nothing ships without passing its quality gate.

Docs →

Deployment and serving

HTTP endpoints on Kubernetes. Alias-based routing for model promotion. Built-in support for batch inference and scheduled jobs.

Docs →

Experiment tracking

Every run, every config, every result. Real-time graphs, image comparison, confusion matrices. Sort and filter across any metric.

Docs →

SSH debugging

SSH into running executions. Attach VS Code or PyCharm, set breakpoints, forward ports to TensorBoard. Debug on cloud GPUs like they're local.

Docs →

Your data and compute always stay where you need them to be.

Run on any cloud, any region, your own hardware, or all of them at once. Regional providers, alternative clouds, and on-prem hardware each have their own advantages, and Valohai works the same way on all of them. Your data and compute stay in your environment. For teams that need full control, the entire platform can be self-hosted. Kubernetes optional.

AWS · Azure · Google Cloud · Oracle Cloud · OVHcloud · Scaleway · On-prem · Self-hosted · Air-gapped

What teams running Valohai say

G2 4.9/5
  • “Most responsive vendor we've used.”

    Tens of thousands of executions across CPU and GPU instances. The computational power to analyse thousands of satellite images.

    Tapio F.
    Senior ML Engineer · Aerospace
  • “Daily go-to platform for ML.”

    Enables collaboration by ensuring transparency and traceability of data and models across the team.

    Claudia L. P.
    Data Scientist · Enterprise
  • “Backbone for our medical AI work.”

    Seamless workflow integration plus the ability to use our own compute infrastructure for radiology imaging.

    Maximilian M.
    Floy · Medical AI

Build AI products that stay accurate.

Free to start. No credit card required.