• A diagnostic

For executive teams

The honest stages of building with LLMs.

Three stages, the honest mid-points where most teams sit, and the questions to bring to your next meeting.

01 – 03 Three stages, in order

Stages and challenges.

Free tools cover stages one and two. Stage three is where the paid platform earns its keep.

01

Experimentation

Prototyping LLM features, comparing models, figuring out what works.

This is you if

You're thinking: should I use OpenAI, Claude, or something else?

Start here
LLM Evaluation

Next when: volume starts driving up cost.

02

LLM features in production

Users are interacting with the feature, but quality and cost are hard to pin down.

This is you if

You've chosen a model, but the API bill went up and the answers aren't good enough.

Start here
LLM Cost & Performance

Next when: hallucinations and inconsistency persist, even with prompt engineering.

03

LLMs and specialized models

The LLM handles intent. Specialized models do the heavy lifting behind it.

This is you if

Your LLM isn't good at some tasks. You're exploring custom agents and models.

Start here
Start a Valohai trial

¶ Deep dive

The mid-points, in detail.

The honest version of each stage: what the team is actually doing, where it breaks, and what to focus on next.

01

Experimentation

The demo works. But nobody has compared models on your actual data. Results live on someone's laptop. And nobody has calculated what this costs at scale.
Challenges at this stage
- 01 No standardized comparison. One person tested GPT-4o, another tried Claude. Results are scattered.
- 02 The demo doesn't represent production. Curated examples work. Real user inputs are messy and unpredictable.
- 03 No path from prototype to product. The team can call an API. They don't have evaluation, cost tracking, or a plan for wrong answers.
Priorities
- Compare models on your actual data, not benchmarks
- Get evaluation results into a shared workspace
- Know the cost per query before you commit
What good looks like
- At least three models compared on your actual product data
- Evaluation results visible to the whole team, not on one laptop
- Cost per query calculated before choosing a model
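For the cost-per-query item above, here is a minimal sketch of the arithmetic in Python. The model names, prices, and token counts are placeholders rather than real provider rates; substitute the figures from your provider's pricing page and the token counts from your own logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical prices in USD per one million tokens."""
    name: str
    input_usd_per_million: float
    output_usd_per_million: float


def cost_per_query(pricing: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request from its token counts."""
    input_cost = input_tokens * pricing.input_usd_per_million / 1_000_000
    output_cost = output_tokens * pricing.output_usd_per_million / 1_000_000
    return input_cost + output_cost


def monthly_cost(pricing: ModelPricing, avg_input_tokens: int,
                 avg_output_tokens: int, queries_per_month: int) -> float:
    """Project monthly spend from average token counts and expected volume."""
    return cost_per_query(pricing, avg_input_tokens, avg_output_tokens) * queries_per_month


if __name__ == "__main__":
    # Placeholder models and prices, for illustration only.
    candidates = [
        ModelPricing("large-model", 2.50, 10.00),
        ModelPricing("small-model", 0.25, 1.25),
    ]
    for model in candidates:
        per_query = cost_per_query(model, input_tokens=1200, output_tokens=300)
        per_month = monthly_cost(model, 1200, 300, queries_per_month=500_000)
        print(f"{model.name}: ${per_query:.5f} per query, ${per_month:,.2f} per month")
```

Running the same calculation for each candidate model, with token counts measured on your own inputs, puts the cost comparison next to the quality comparison before a model is chosen.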
Take the free LLMs Applied course. Start comparing models with Valohai Evaluations (free, no credit card).
Start the free course
02

LLM features in production

The API returns 200 even when the response is wrong. Quality can drop without any code change on your side. Your costs doubled last month, and by the time you noticed, the spend was already locked in.
Challenges at this stage
- 01 Quality problems are invisible. The API is up, latency is normal, but the model started giving worse answers two weeks ago.
- 02 Costs surprise you. Token usage varies per request. Nobody tracked which features cost what until the bill arrived.
- 03 No way to measure quality. You're relying on user complaints instead of your own evaluation.
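The first challenge above can be illustrated with a small check that treats transport health and answer quality as separate questions. This is a sketch under simple assumptions: the function names are hypothetical, and the rules (non-empty answer, length limit, expected terms) stand in for whatever quality criteria your feature actually needs.

```python
def quality_issues(answer: str, must_mention: list[str], max_chars: int = 2000) -> list[str]:
    """Return a list of quality problems found in a model answer (empty if none)."""
    if not answer.strip():
        return ["empty answer"]
    issues = []
    if len(answer) > max_chars:
        issues.append("answer too long")
    lowered = answer.lower()
    for term in must_mention:
        # Illustrative rule: the answer should mention each expected term.
        if term.lower() not in lowered:
            issues.append(f"missing expected term: {term}")
    return issues


def is_healthy(status_code: int, answer: str, must_mention: list[str]) -> bool:
    """A request is healthy only if the call succeeded AND the answer passes checks."""
    return status_code == 200 and not quality_issues(answer, must_mention)


if __name__ == "__main__":
    # Both calls "succeeded" at the HTTP level; only one answer is usable.
    print(is_healthy(200, "Refunds are processed within 5 days.", ["refund"]))
    print(is_healthy(200, "I'm sorry, I can't help with that.", ["refund"]))
```

Logging the result of a check like this alongside latency and status codes is one way to make a quality drop visible on the same dashboards the team already watches.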
Priorities
- Measure output quality systematically
- Get cost visibility per feature before the bill arrives
- Find tasks where a cheaper model meets the quality bar
What good looks like
- Quality measured continuously, not just when users complain
- Monthly spend broken down by feature and model
- At least one task identified where a smaller model performs just as well
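A spend breakdown by feature and model, as described above, can be sketched as a simple aggregation over per-request records. The record fields (`feature`, `model`, `cost_usd`) are assumptions about what your own logging captures; this is not a description of how any particular tool stores its data.

```python
from collections import defaultdict


def spend_by_feature_and_model(records):
    """Sum cost per (feature, model) pair, sorted from most to least expensive.

    Each record is assumed to be a dict with 'feature', 'model', and 'cost_usd' keys.
    """
    totals = defaultdict(float)
    for record in records:
        totals[(record["feature"], record["model"])] += record["cost_usd"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Made-up request log, for illustration only.
    request_log = [
        {"feature": "search", "model": "large-model", "cost_usd": 0.5},
        {"feature": "search", "model": "large-model", "cost_usd": 0.25},
        {"feature": "summary", "model": "small-model", "cost_usd": 0.125},
    ]
    for (feature, model), total in spend_by_feature_and_model(request_log).items():
        print(f"{feature} / {model}: ${total:.3f}")
```

The prerequisite is tagging each request with the feature that triggered it at call time; without that tag, the provider's bill only shows a total.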
Install Conduit to track LLM costs (free, open source). Use Evaluations to compare models systematically.
Get cost visibility
03

LLMs and specialized models

More features mean more models. More models mean more things to evaluate, version, and keep running. The LLM provider released a new version. Your team trained a better classifier. How do you know the new version is actually better?
Challenges at this stage
- 01 Hard to evaluate updates. A new model looks better on benchmarks. Does it actually improve your feature on your data?
- 02 Nobody has the full picture. Multiple models, versions, and environments. No single source of truth.
- 03 Growing maintenance. Every new feature adds models, data pipelines, evaluation runs, and operational overhead.
Priorities
- Standardize evaluation across all model types
- Version everything from data to deployment
- Automate pipelines so updates don't depend on manual steps
What good looks like
- Every model versioned and traceable to the data that built it
- Evaluation runs automatically when a model or dataset changes
- You can tell your VP exactly which model version is in production right now
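One way to answer "is the new version actually better?" is a promotion gate: score the current and candidate models on the same evaluation set and promote only when the candidate wins by a chosen margin. The sketch below assumes exact-match accuracy on labeled examples and uses hypothetical function names; real evaluations usually need task-specific metrics.

```python
def accuracy(predict, eval_set):
    """Fraction of (input, expected) pairs where predict(input) equals expected."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    correct = sum(1 for example, expected in eval_set if predict(example) == expected)
    return correct / len(eval_set)


def should_promote(current, candidate, eval_set, min_gain=0.0):
    """Compare two models on the same data.

    Returns (promote, current_score, candidate_score); promote is True only
    when the candidate beats the current model by more than min_gain.
    """
    current_score = accuracy(current, eval_set)
    candidate_score = accuracy(candidate, eval_set)
    return candidate_score - current_score > min_gain, current_score, candidate_score


if __name__ == "__main__":
    # Toy classifiers and labels, for illustration only.
    eval_set = [("refund please", "billing"), ("app crashes", "bug"),
                ("charge twice", "billing"), ("login fails", "bug")]
    current = lambda text: "billing"
    candidate = lambda text: "bug" if ("crash" in text or "fail" in text) else "billing"
    print(should_promote(current, candidate, eval_set, min_gain=0.05))
```

Running a gate like this automatically whenever a model or dataset changes, and recording the scores with the model version, is what makes the comparison repeatable rather than a one-off judgment.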
See how pipelines, evaluation, and lineage work together for your setup.
Book a demo

“ In their words

From verified G2 reviews

Allows our AI team to focus on developing DL models into production without heavy collaboration with DevOps. Centralized place for all of our data science experiments, models, and metrics.

Edward K.

Machine Learning Engineer · Mid-Market

Their product works great and even for edge cases that need investigation, their team is awesome at supporting us. We can rely on their tools within our own infrastructure.

Michael S.

CTO

? A checklist

Screenshot · share · ask

What to ask your team.

These get at the real gaps without putting anyone on the defensive. Bring them to your next standup, or paste them into Slack.

01

Experimentation

How did we pick the model we're using? Did we compare alternatives on our data?
If I wanted to see the evaluation results, where would I look?
What does each API call cost us? Do we have a breakdown by model?
What happens when the model gives a wrong answer? How often does that happen?

02

LLM features in production

How do we measure whether the AI responses are good enough? What's our quality bar?
What's our monthly spend on LLM APIs? Which feature costs the most?
Are there tasks where the LLM is overkill? Where something simpler would be faster and cheaper?
If the model provider updates their model tomorrow, how would we know if our feature got better or worse?

03

LLMs and specialized models

If the classifier's accuracy dropped, how would we find out? How fast?
When we update a model, how do we know it's better than what's currently running?
Can we trace any production model back to the exact data and code that produced it?
How much time does the team spend on manual evaluation and deployment tasks?

04

Multiple systems at scale

Do we have a single view of what our entire AI operation costs and delivers?
If a new team member joins, how long until they can run and evaluate models independently?
Are different teams duplicating infrastructure, or sharing a platform?
When was the last time an auditor or compliance review asked about model provenance? How long did it take to answer?

§ Brief yourself

Free · Six modules · Self-paced

Four modules every leader should read.

Our free Applied LLM course has six modules. These four cover the questions you’ll keep facing, regardless of stage.

01

Understanding LLMs for product development.

When to reach for an LLM, and when traditional ML, rules, or a database query is the better tool.

04

Building LLM-powered features.

Validation layers, retries, provider fallbacks, and when an agent helps versus when it hurts.

05

Evaluating LLM outputs.

Building eval sets, layered evaluation, and regression suites that block PRs.

06

Going to production.

Cost tracking, model routing, prompt caching, quality drift, and the weekly improvement rhythm.

Open the course on Valohai Academy

Ready to move to the next stage?