
Which LLM Should Power Your Support Bot? How Systematic Evaluation Turned a Gut Feeling Into a Data-Backed Decision
by Petteri Raatikainen | March 03, 2026
Introducing Valohai LLM, a purpose-built tool for running, tracking, and comparing LLM evaluations.
When you're building a product on top of an LLM, there's a moment everyone hits eventually: you've got a working prototype, it feels pretty good, and now you need to decide which model to ship with.
GPT-5? Opus 4.6? Something open-source and self-hosted? They all work. But which one should you use, for your use case, with your data?
Most teams answer this question with a combination of intuition, a few test runs, and a shared Slack thread where someone pastes screenshots of outputs. It works, but it's not systematic.
Here's what a more systematic approach looks like, and what it can reveal.
The Scenario: A Support Bot That's Ready to Ship
You're the engineering lead at a project management SaaS. Your team has built a RAG-powered support bot that retrieves relevant documentation and feeds it to an LLM to generate answers to customer tickets automatically.
The prototype is working, running on GPT-5.
Why GPT-5? Your team already uses OpenAI's API for another feature, so the integration was trivial. It performed well in the initial tests. When you needed something that "just worked" to validate the concept, it was the obvious choice.
But now you're planning the production rollout, and the questions are starting to surface:
- Finance wants to understand the cost model. What happens when you're processing 10,000 tickets a month instead of 100?
- Someone suggested trying Opus 4.6. Would the better reasoning quality reduce follow-up tickets enough to justify the higher cost and latency?
- Your infrastructure lead is pushing for self-hosted. You already run Llama models for another service. Could Llama 3.3 70B handle support tickets at a fraction of the API cost?
GPT-5 works. But before you commit to it at scale, you want to know: is it actually the right choice for this use case? Or did you just pick what was convenient?
To answer that, you need more than a handful of test cases and gut feeling. You need data you can defend in front of your CFO.
Setting Up the Evaluation
The first step is assembling a dataset that represents your production traffic.
The starting point: 20 real support tickets drawn from the help desk history, distributed across four categories:
- Billing (5 tickets): upgrade questions, refund requests, unexpected charges
- Features (5 tickets): "how do I set task dependencies?", recurring task configuration
- Troubleshooting (5 tickets): login issues, stuck uploads, sync failures
- Account (5 tickets): 2FA setup, project ownership transfers
Each ticket includes three things: the documentation context the RAG system retrieves, a human-written expected answer, and a requires_reasoning flag. That flag marks the harder tickets, the ones where the model needs to reason through a problem rather than quote documentation. Diagnosing why a customer was charged twice, for example, requires working through the possible scenarios.
This flag turns out to be important. More on that shortly.
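Concretely, a single dataset entry might look like the dict below. The field names are illustrative, not a prescribed schema; the point is that every ticket carries the retrieved context, a reference answer, and the reasoning flag together.

```python
# One evaluation example from the support-ticket dataset.
# Field names are illustrative; adapt them to your own schema.
ticket = {
    "id": "billing-003",
    "category": "billing",
    "question": "Why was I charged twice this month?",
    # Documentation chunks the RAG system retrieves for this ticket
    "context": [
        "Invoices are issued on the first day of each billing cycle...",
        "Plan upgrades mid-cycle generate a prorated charge...",
    ],
    # Human-written reference answer used for scoring
    "expected_answer": (
        "A second charge usually means a mid-cycle upgrade was prorated. "
        "Check Settings > Billing > Invoices for both line items."
    ),
    # Marks tickets where the model must reason, not just quote docs
    "requires_reasoning": True,
}
```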
What to Measure
Each model response is scored on five metrics:
Answer Relevance (0–1): Did the bot answer what the customer actually asked? A response can be factually correct and still miss the point entirely.
Faithfulness (0–1): Did the bot stick to the documentation, or did it hallucinate? A support bot that invents refund policies or makes up settings pages destroys customer trust faster than no bot at all. This is the metric you don't want to be wrong about.
Completeness (0–1): Did the response cover everything the customer needs? "Go to Settings" is technically correct. "Go to Settings > Billing > Change Plan, then select your new tier and confirm" is useful.
Latency (ms): Support chat breaks down after about two seconds, and that budget also has to cover retrieval time and rendering. A model taking 600ms per response leaves room to work with. One taking 2,000ms doesn't.
Output Tokens: At scale, token count is a direct line to your monthly API bill. A model that's 10% better in quality but 3× more verbose might not be the right call for the majority of your traffic.
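To see how token count translates into spend, a back-of-envelope calculation is enough. The prices and token counts below are made-up placeholders, not actual rates for any model; substitute your provider's pricing before drawing conclusions.

```python
# Back-of-envelope: monthly output-token cost at production volume.
# The price and token figures are illustrative placeholders only.
TICKETS_PER_MONTH = 10_000

def monthly_output_cost(avg_output_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimated monthly spend on output tokens alone."""
    return TICKETS_PER_MONTH * (avg_output_tokens / 1000) * price_per_1k_tokens

# Same assumed price, but one model is 3x more verbose than the other
terse = monthly_output_cost(avg_output_tokens=190, price_per_1k_tokens=0.010)
verbose = monthly_output_cost(avg_output_tokens=570, price_per_1k_tokens=0.010)
print(f"terse: ${terse:.2f}/mo, verbose: ${verbose:.2f}/mo")
```

At identical per-token pricing, the 3× more verbose model simply costs 3× as much per month, which is why output tokens are a first-class metric here.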
Running 60 Evaluations in One Go
With the dataset defined and metrics decided, the evaluation itself is straightforward. A task is created in Valohai LLM with the parameter model: ["gpt-5", "opus-4.6", "llama-3.3-70b"], the dataset is attached, and the eval script runs.
Three models × 20 tickets = 60 evaluations. The task runner handles the iteration automatically. No loops to write, no results to manually collect. Results stream into the dashboard in real time as each evaluation completes. You can spot problems early instead of waiting hours to discover an eval script failed.
The code for posting a single result looks like this:
import valohai_llm

valohai_llm.post_result(
    task="support-bot-eval",
    labels={
        "model": "gpt-5",
        "category": "billing",
        "requires_reasoning": False,
    },
    metrics={
        "relevance": 0.91,
        "faithfulness": 0.96,
        "completeness": 0.88,
        "latency_ms": 420,
        "output_tokens": 187,
    },
)
When all 60 results are in, you have a complete comparison matrix covering every model, every ticket category, and every metric, ready to analyze.
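The dashboard builds this matrix for you, but if you export the raw results you can reproduce the same views locally. A minimal pandas sketch, assuming the results are available as flat records shaped like the post_result payload (the three rows here are illustrative stand-ins for the full 60):

```python
import pandas as pd

# Illustrative stand-ins for exported results; the real export
# would contain all 60 rows with every metric.
results = [
    {"model": "gpt-5", "category": "billing", "relevance": 0.91, "faithfulness": 0.96},
    {"model": "gpt-5", "category": "troubleshooting", "relevance": 0.88, "faithfulness": 0.95},
    {"model": "llama-3.3-70b", "category": "billing", "relevance": 0.89, "faithfulness": 0.94},
]
df = pd.DataFrame(results)

# Headline view: one row per model, averaged over all tickets
by_model = df.groupby("model")[["relevance", "faithfulness"]].mean()

# Second grouping dimension: model x category
by_model_category = df.groupby(["model", "category"])[["relevance", "faithfulness"]].mean()
print(by_model_category)
```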
The Headline Numbers (Which Miss the Real Story)
Grouped by model, the aggregate results tell the expected story:
- Opus 4.6 leads on all three quality metrics: highest relevance, highest faithfulness, best completeness
- GPT-5 is close behind, with meaningfully lower latency and cost
- Llama 3.3 70B scores lower across quality metrics, but is 3× cheaper and significantly faster
At this level, the obvious choice looks like GPT-5: slightly lower quality than Opus, much more practical on speed and cost.
But this view flattens something important.
Where It Gets Interesting: Breaking Down by Category
Add a second grouping dimension (model × category) and a different picture emerges.
Llama 3.3 70B handles billing and account tickets almost as well as GPT-5. The quality gap narrows to 2–3 points on relevance and completeness. These tickets tend to be straightforward: a customer asks how to change their billing cycle, the RAG system retrieves the relevant docs, and the model summarizes what's there. You don't need the most capable model on the market for this.
But on troubleshooting tickets, the gap widens. Relevance and completeness drop by 10–15 points. Faithfulness holds roughly steady. The model isn't hallucinating, but it's skipping steps. These tickets require the model to reason through a problem, to figure out what's going wrong and determine the right sequence of steps, not just quote documentation.
The requires_reasoning filter makes this even clearer. Filter the results down to only the harder tickets, and the gap between Llama 3.3 70B and GPT-5 opens up sharply: 20+ points on completeness. This is where model capability is load-bearing.
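Because requires_reasoning travels with every result as a label, the filter is a one-liner once results are in a DataFrame. The numbers below are illustrative, chosen only to show the shape of the gap, not the actual evaluation scores:

```python
import pandas as pd

# Illustrative subset of results with the requires_reasoning label attached.
results = pd.DataFrame([
    {"model": "gpt-5", "requires_reasoning": True, "completeness": 0.86},
    {"model": "gpt-5", "requires_reasoning": False, "completeness": 0.90},
    {"model": "llama-3.3-70b", "requires_reasoning": True, "completeness": 0.62},
    {"model": "llama-3.3-70b", "requires_reasoning": False, "completeness": 0.87},
])

# Restrict the comparison to the harder tickets only
hard = results[results["requires_reasoning"]]
gap = hard.groupby("model")["completeness"].mean()
print(gap)
```

On the easy tickets the two models sit a few points apart; filtered to the hard ones, the spread opens past 20 points, which is the pattern the evaluation surfaced.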
The key question for your support bot isn't "which model is best?" It's "what percentage of your tickets need the best model?"
The Decision: Route by Complexity
Looking at the ticket distribution, roughly 70% of the volume falls into billing, account, and straightforward feature questions. The remaining 30% (troubleshooting, edge cases, anything flagged requires_reasoning) is where quality matters.
The distribution makes the routing strategy obvious:
- Simple tickets → Llama 3.3 70B. Comparable quality on the things you're routing here, dramatically lower cost, faster response times.
- Complex tickets → Opus 4.6. Use the best model where it earns it, not on tickets where a much cheaper model performs nearly as well.
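In code, the routing rule itself is tiny. A sketch, assuming an upstream classifier has already tagged each incoming ticket with a category and a requires_reasoning flag mirroring the evaluation labels:

```python
# Categories the evaluation showed the cheaper model handles well.
SIMPLE_CATEGORIES = {"billing", "account", "features"}

def choose_model(category: str, requires_reasoning: bool) -> str:
    """Route a ticket to the cheapest model that handles it well.

    Assumes an upstream classifier has tagged the ticket with a
    category and a requires_reasoning flag, mirroring the eval labels.
    """
    if requires_reasoning or category not in SIMPLE_CATEGORIES:
        return "opus-4.6"       # complex: pay for the best model
    return "llama-3.3-70b"      # simple: cheap and nearly as good

print(choose_model("billing", False))         # llama-3.3-70b
print(choose_model("troubleshooting", True))  # opus-4.6
```

The real decision work happened in the evaluation; the router just encodes its conclusion.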
Compared to running GPT-5 for everything, this routing approach cuts API costs by roughly 60% while maintaining, or in some cases improving, quality where it matters most.
That's the decision. Not a guess, not a spreadsheet with manually pasted outputs from a few test runs. A comparison matrix built from 60 evaluations, with statistics that tell you whether differences are meaningful or noise.
This Isn't a One-Time Decision
You've shipped the routing strategy. GPT-5 was the right call when you started prototyping. Llama 3.3 70B for simple tickets and Opus 4.6 for complex ones is the right call now, based on your data.
Three months from now? Six months? The right answer might be different.
New models launch constantly. GPT-5.5 might close the quality gap with Opus while maintaining lower latency. Llama 4 might handle reasoning-heavy tickets well enough to eliminate the need for routing. Gemini 3.0 might offer a better price-performance ratio than anything you tested.
Models get deprecated. OpenAI announced in January 2026 that GPT-4o API access would end in February. If you'd built your product on GPT-4o and treated the model choice as a permanent decision, you'd be scrambling to re-evaluate and migrate under deadline pressure. If evaluation is part of your workflow, it's just another scheduled task.
Model variants multiply. Fine-tuned versions, distilled models, regional deployments, extended context windows — the same base model can fork into a dozen variants, each with different performance characteristics and cost structures. What worked for your use case six months ago might not be the best fit today.
Your constraints change. Maybe finance tightens the budget and cost becomes the dominant factor. Maybe you expand beyond Europe and latency to regional users becomes critical. Maybe ticket volume triples and you need to optimize for throughput. The model that was optimal under your old constraints might not be optimal under your new ones.
The evaluation you ran today isn't "done." It's the first data point in an ongoing process. The question isn't whether you'll need to re-evaluate, it's whether you have the infrastructure to do it without friction when the time comes.
Evaluation Shouldn't Live in Notebooks
The reason most teams don't do this kind of analysis isn't lack of interest; it's friction. You run a few evals locally, and the results end up in a notebook no one else can access. You run them again the next week with a different prompt, and now you're not sure which set of numbers you're comparing. Stakeholders ask which model performs better, and you can't give a confident answer because your "data" is spread across terminals, spreadsheets, and screenshots in a Slack thread. When someone asks "which model should we use?" two weeks later, you're starting from scratch.
Valohai LLM is built to remove that friction. Post results from wherever your evaluations already run (your laptop, a CI pipeline, a notebook, a script) and they land in a shared dashboard where you can filter, group, and compare them. No infrastructure to manage. No YAML files to write. No shared server to babysit.
The comparison view (radar charts, bar charts, scorecards for up to six configurations) is designed to make the right call obvious, and to be shareable with people who don't need to understand what a "span" is to understand which model they should bet the product on.
The Valohai LLM Compare view. Filter by model, category, or any label, and the radar and bar charts update instantly.
Getting Started Takes About Five Minutes
pip install valohai-llm

import valohai_llm

valohai_llm.post_result(
    task="my-first-eval",
    labels={"model": "gpt-5", "dataset": "support-tickets"},
    metrics={"relevance": 0.91, "latency_ms": 420},
)
That's enough to get your first result into the dashboard. From there: upload a dataset, create a task with the models or parameters you want to compare, and run it. The comparison view builds itself as results come in. No manual spreadsheet work, no copying numbers between tools.
Free to start. No credit card required. No infrastructure to set up, just an API key.
Start tracking your evals at llm.valohai.com
Where This Is Heading
The scenario above (three models, 20 tickets, a routing decision) is what Valohai LLM handles today. But it's built on top of Valohai, a platform designed for production AI systems, so as your evaluation needs grow, you grow into those capabilities rather than outgrowing the tool.
- Automated pipelines triggered by new model versions. Evals run when something changes, not when someone remembers to kick them off, so you catch regressions before they reach production, not after.
- Managed compute that provisions and shuts down without leaving idle machines running.
- Approval gates for human review before a new configuration reaches production.
- Dataset versioning, so you can rerun the same evaluation against next month's model and know the comparison is apples-to-apples. No "wait, did we change the test data?" confusion when benchmarking new releases.
You start with pip install and three lines of Python. As your team and requirements evolve, you'll grow into what we call an AI Factory, a systematic approach to building, evaluating, and deploying AI at scale.
The question of which LLM to ship with doesn't have to be a gut call. It can be a decision. Valohai LLM is how you make that decision, and keep making it as your models, prompts, and requirements evolve.
Valohai LLM is free to start. Sign up at llm.valohai.com and run your first evaluation today.