A diagnostic

The honest stages of building with LLMs.

Three stages, the honest mid-points where most teams sit, and the questions to bring to your next meeting.

01 – 03   Three stages, in order

Stages and challenges.

Free tools cover stages one and two. Stage three is where the paid platform earns its keep.

  1. 01

    Experimentation

    Prototyping LLM features, comparing models, figuring out what works.

    This is you if

    You're thinking: should I use OpenAI, Claude, or something else?

    Next when   Volume starts driving up cost.

  2. 02

    LLM features in production

    Users are interacting with the feature, but quality and cost are hard to pin down.

    This is you if

    You've chosen a model, but the API bill went up and the answers aren't good enough.

    Next when   Hallucinations and inconsistency persist, even with prompt engineering.

  3. 03

    LLMs and specialized models

    The LLM handles intent. Specialized models do the heavy lifting behind it.

    This is you if

    Your LLM isn't good at some tasks. You're exploring custom agents and models.

  Deep dive

The mid-points, in detail.

Each stage with the honest version: what the team is actually doing, where it breaks, and what to focus on next.

  1. The demo works. But nobody has compared models on your actual data. Results live on someone's laptop. And nobody has calculated what this costs at scale.

    Challenges at this stage

    • 01 No standardized comparison. One person tested GPT-4o, another tried Claude. Results are scattered.
    • 02 The demo doesn't represent production. Curated examples work. Real user inputs are messy and unpredictable.
    • 03 No path from prototype to product. The team can call an API. They don't have evaluation, cost tracking, or a plan for wrong answers.

    Priorities

    • Compare models on your actual data, not benchmarks
    • Get evaluation results into a shared workspace
    • Know the cost per query before you commit

    What good looks like

    • At least three models compared on your actual product data
    • Evaluation results visible to the whole team, not on one laptop
    • Cost per query calculated before choosing a model
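
A rough way to get a cost-per-query number before committing: multiply the tokens you expect in each direction by the provider's price for that direction. The sketch below assumes per-million-token pricing; the model names, prices, token counts, and query volume are placeholders, not real quotes, so swap in your provider's current rates and token counts measured on your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Per-million-token prices in USD. The values used below are placeholders."""
    name: str
    input_per_mtok: float
    output_per_mtok: float


def cost_per_query(pricing: ModelPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens in each direction times that direction's price."""
    total = (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    )
    return total / 1_000_000


def monthly_cost(pricing: ModelPricing, input_tokens: int, output_tokens: int,
                 queries_per_month: int) -> float:
    """Scale the per-query cost to an expected monthly volume."""
    return cost_per_query(pricing, input_tokens, output_tokens) * queries_per_month


# Hypothetical candidates: names and prices are made up for illustration.
CANDIDATES = [
    ModelPricing("model-a", input_per_mtok=2.00, output_per_mtok=8.00),
    ModelPricing("model-b", input_per_mtok=0.50, output_per_mtok=1.50),
]

if __name__ == "__main__":
    # Assumed averages; measure these on real user inputs, not curated demos.
    avg_in, avg_out, volume = 1_500, 400, 100_000
    for p in CANDIDATES:
        per_query = cost_per_query(p, avg_in, avg_out)
        per_month = monthly_cost(p, avg_in, avg_out, volume)
        print(f"{p.name}: ${per_query:.5f} per query, ${per_month:,.2f} per month")
```

The per-query figure looks negligible on its own; the monthly line is the one to put next to the evaluation results when choosing a model.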

    Take the free LLMs Applied course. Start comparing models with Valohai Evaluations (free, no credit card).

    Start the free course

  In their words

Allows our AI team to focus on developing DL models into production without heavy collaboration with DevOps. Centralized place for all of our data science experiments, models, and metrics.

Edward K.

Machine Learning Engineer · Mid-Market

Their product works great and even for edge cases that need investigation, their team is awesome at supporting us. We can rely on their tools within our own infrastructure.

Michael S.

CTO

?   A checklist

What to ask your team.

These get at the real gaps without putting anyone on the defensive. Bring them to your next standup, or paste them into Slack.

01

Experimentation

  • How did we pick the model we're using? Did we compare alternatives on our data?
  • If I wanted to see the evaluation results, where would I look?
  • What does each API call cost us? Do we have a breakdown by model?
  • What happens when the model gives a wrong answer? How often does that happen?

02

LLM Features in Production

  • How do we measure whether the AI responses are good enough? What's our quality bar?
  • What's our monthly spend on LLM APIs? Which feature costs the most?
  • Are there tasks where the LLM is overkill? Where something simpler would be faster and cheaper?
  • If the model provider updates their model tomorrow, how would we know if our feature got better or worse?
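
One way to answer the last question is a pinned evaluation set that you re-run on a schedule and compare against a recorded baseline. The sketch below is a minimal illustration, assuming a substring check is a good enough quality signal for your feature; the eval examples, the `fake_model` stand-in, and the 2-point tolerance are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical pinned eval set: (user input, text the answer must contain).
EVAL_SET: Sequence[Tuple[str, str]] = [
    ("What is the refund window?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
    ("How do I reset my password?", "reset link"),
]


def pass_rate(call_model: Callable[[str], str],
              eval_set: Sequence[Tuple[str, str]]) -> float:
    """Fraction of eval examples whose answer contains the expected text."""
    passed = sum(
        1 for prompt, expected in eval_set
        if expected.lower() in call_model(prompt).lower()
    )
    return passed / len(eval_set)


def within_tolerance(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if quality has not dropped more than `tolerance` below the baseline."""
    return current >= baseline - tolerance


def fake_model(prompt: str) -> str:
    """Stand-in for a real provider call; replace with your own client."""
    answers = {
        "What is the refund window?": "You can request a refund within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
        "How do I reset my password?": "Please contact support.",  # regressed answer
    }
    return answers.get(prompt, "")


if __name__ == "__main__":
    baseline = 1.0  # assumed pass rate recorded before the provider update
    current = pass_rate(fake_model, EVAL_SET)
    status = "ok" if within_tolerance(current, baseline) else "REGRESSION"
    print(f"pass rate {current:.2f} vs baseline {baseline:.2f}: {status}")
```

Run on a schedule, a check like this tells you about a provider-side change within a day. Run in CI, it does the same for your own prompt changes.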

03

LLMs + Specialized Models

  • If the classifier's accuracy dropped, how would we find out? How fast?
  • When we update a model, how do we know it's better than what's currently running?
  • Can we trace any production model back to the exact data and code that produced it?
  • How much time does the team spend on manual evaluation and deployment tasks?

04

Multiple Systems at Scale

  • Do we have a single view of what our entire AI operation costs and delivers?
  • If a new team member joins, how long until they can run and evaluate models independently?
  • Are different teams duplicating infrastructure, or sharing a platform?
  • When was the last time an auditor or compliance review asked about model provenance? How long did it take to answer?

§   Brief yourself

Four modules every leader should read.

Our free Applied LLM course has six modules. These four cover the questions you’ll keep facing, regardless of stage.

  1. 01

    Understanding LLMs for product development.

    When to reach for an LLM, and when traditional ML, rules, or a database query is the better tool.

  2. 04

    Building LLM-powered features.

    Validation layers, retries, provider fallbacks, and when an agent helps versus when it hurts.

  3. 05

    Evaluating LLM outputs.

    Building eval sets, layered evaluation, and regression suites that block PRs.

  4. 06

    Going to production.

    Cost tracking, model routing, prompt caching, quality drift, and the weekly improvement rhythm.
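
To make the "validation layers, retries, provider fallbacks" item concrete, here is a minimal sketch of how the three can fit together. It is an illustration under stated assumptions, not the course's implementation: providers are modeled as plain functions from prompt to text, the expected output is a JSON object with an `answer` field, and both stub providers are hypothetical.

```python
import json
from typing import Callable, Optional, Sequence

# Assumption: a provider is any function that takes a prompt and returns raw text.
Provider = Callable[[str], str]


def validate(raw: str) -> Optional[dict]:
    """Validation layer: accept only a JSON object with a non-empty 'answer' string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):
        answer = data.get("answer")
        if isinstance(answer, str) and answer.strip():
            return data
    return None


def ask(prompt: str, providers: Sequence[Provider], max_attempts: int = 2) -> dict:
    """Try each provider in order, retrying up to max_attempts before falling back."""
    for provider in providers:
        for _ in range(max_attempts):
            try:
                result = validate(provider(prompt))
            except Exception:
                continue  # treat a failed call like any other failed attempt
            if result is not None:
                return result
    raise RuntimeError("all providers failed validation")


def broken_primary(prompt: str) -> str:
    """Hypothetical provider that ignores the output format."""
    return "Sorry, here is some text that is not JSON."


def working_backup(prompt: str) -> str:
    """Hypothetical provider that returns well-formed output."""
    return json.dumps({"answer": f"echo: {prompt}"})


if __name__ == "__main__":
    print(ask("hello", [broken_primary, working_backup]))
```

A production version would likely add backoff between retries and log which provider answered, so fallback frequency shows up in cost and quality tracking.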

Open the course on Valohai Academy

  Next step

Ready to move to the next stage?

Most teams start with the free tools. No sales call, no credit card.