LLM workflows

The right model, the right retrieval setup, the right price. Test them all.

Test any model, any prompt, any retrieval setup against your real data. Track quality alongside cost. Run it on a schedule, not when someone remembers.

Built on two tools: Valohai LLM Evaluations and Conduit (open source).

Side-by-side comparison of LLM configurations across quality and cost

01 · Model comparison

Public benchmarks tell you which model is best in general. Not which one is best for you.

The model that tops the leaderboard might not be the best fit for your product. Public scores don't see your data, your edge cases, or your quality bar. The only way to know is to compare your options against your real inputs.

Valohai lets you test combinations of model, prompt template, temperature, and system message against your data. You see accuracy, latency, and cost side by side, and pick the configuration that meets your bar.

1

Test across GPT-4o, Claude, Gemini, open-source models, or your own

2

See quality metrics next to cost per request

3

Compare hundreds of combinations in one run

Model comparison view with quality, latency, and cost per request shown side by side

02 · Context pipelines

When answers are wrong, you can't tell if it's the retrieval, the prompt, or the model.

A RAG pipeline has a dozen knobs: chunk size, overlap, retrieval depth, embedding model, reranking strategy, prompt structure. Testing them one at a time hides how they interact.

Valohai runs configurations side by side against real queries with known good answers, so you can trace a failure to its source.

Your corpus changes too. New docs, updated policies, refreshed knowledge bases. Valohai handles the reindexing pipeline, full or incremental, so your context stays current.

1

Test chunk sizes, overlap, retrieval depth, and embedding models systematically

2

Measure retrieval precision and recall alongside answer quality

3

Automate full and incremental reindexing as your corpus changes

4

See context length and cost per query

RAG context pipeline with chunking, embedding, indexing, and reranking steps connected end to end

03 · Regression testing

You changed the prompt last Tuesday. Did it make things better or worse?

Small changes can affect quality in ways you don't expect. A prompt update, a knowledge base refresh, a new model version, a config tweak. Without a system to compare before and after across your full dataset, you're guessing whether things got better.

Valohai compares any set of configurations across your complete test set. You see what improved, what regressed, and whether the difference is statistically meaningful. Not just "it seems different".

1

Compare any two configurations against the same dataset

2

Statistical significance testing, not gut feeling

3

Run on schedule to catch changes from model provider updates

Side-by-side prompt comparison showing the same dataset run against two prompt versions

04 · Cost analysis

The cheapest model that meets your bar is rarely the one you're using.

LLM costs scale with usage. A model at $0.03 per request feels cheap until you're running 500,000 requests a month. A model at $0.004 per request might hit the same quality for your use case, but you can't tell without comparing them side by side.

Valohai shows cost per request alongside quality metrics in your evaluations, so you can filter by budget and quality threshold and pick the configuration that meets your bar.

1

Cost per request alongside accuracy, latency, and quality scores

2

Filter: show me everything under $0.02 per request that meets my quality threshold

3

Track cost trends across evaluation runs

Cost per request charted against quality, with each configuration plotted for budget vs accuracy trade-offs

Tools we built for this

Two tools that work together — and on their own.

The work above needs two pieces of plumbing: a shared place to store and compare evaluation results, and a way to see what every LLM call actually costs. We built one of each. Use them with Valohai, or standalone.

Valohai LLM Evaluations

Know which configuration wins.

Most teams compare eval runs in spreadsheets, Slack screenshots, or a notebook only one person has open. There's no shared place to see whether today's prompt actually beats yesterday's.

Valohai LLM Evaluations is a hosted tracker for evaluation results. pip install valohai-llm, post results from your eval script, and compare models, prompts, and retrieval configs side by side with radar charts and scorecards. Ship the change you can defend with numbers.

Conduit · Open source

Know what you're spending.

Provider dashboards show monthly totals. They don't tell you which feature, endpoint, or user is driving the bill, or flag the deploy that quietly tripled it.

Conduit is a local Rust proxy that sits between your code and your LLM providers. Point OPENAI_BASE_URL at it and every call is logged with model, tokens, and cost. No SDK wrapping, no code changes. Catch runaway costs before the invoice does.

Ship LLM features you can stand behind.

Free to start. No credit card required.