Guide
Running Evals in Your AI Workflows
A practical guide to building evals for AI-powered workflows using Claude Code. Why most teams skip evals, what to measure, and how to wire them into your dev loop.
June 2, 2026 · 12 min read
Why this matters
You shipped an AI feature. It worked on the five prompts you tried. Two weeks later, a user pastes in something slightly off-distribution and the whole thing falls apart. You patch the prompt. Now a different case breaks. You’re playing whack-a-mole with no scoreboard.
Evals are the scoreboard. They let you say “version 4 of this prompt is 12% better than version 3 on the cases that matter” instead of “feels better, I think.”
Most teams skip evals because they sound like infrastructure work. They are not. A first pass takes an afternoon. The version that catches real bugs takes a week. Compared to one production incident, this is cheap.
The three eval types
Unit evals. Single input, single check. Did the model extract the right field? Did the JSON parse? Did it refuse the unsafe request? These are fast, deterministic, and the easiest to wire into CI. Start here.
Trajectory evals. Multi-step workflows where the path matters. Did the agent call the right tools in the right order? Did it ask for clarification when it should have? Did it stop when the task was done? These need a transcript, not just an output.
Quality evals. Subjective scoring. Is this summary good? Is this explanation clear? Is this response in the right tone? These need a judge, which is usually another model with a careful rubric. Quality evals are where teams flounder, because the rubric is the work, not the LLM call.
Most teams ship the first type, fake the second, and skip the third. Pretending the third does not exist is the most common eval mistake.
Designing cases that catch real bugs
A good eval case is small, named, and adversarial.
- Small so a human can read it in five seconds and know what is being tested.
- Named so when it fails, the name tells you what broke:
refuses_to_summarize_empty_input, notcase_47. - Adversarial because cases that mirror your happy path will pass forever and tell you nothing.
The cases you need are the ones you would not think to write. Pull them from real user inputs, real failures, and edge cases your tests have not yet seen. A reasonable starting set is 20 cases that span the input distribution, not 200 cases that all look the same.
Claude Code as both author and grader
Claude Code is well-suited to both writing and grading evals, but you have to keep the two roles separate or you will get circular reasoning.
The pattern that works:
- Use Claude Code in one session to generate adversarial cases. Give it your prompt, your tool definitions, and your spec. Ask it to brainstorm inputs that would break the system. Save those as eval cases.
- Use Claude Code in a separate session, with no knowledge of the cases, to run the system. Capture the outputs.
- Use Claude Code with a grading rubric (not the original spec) to score the outputs. The rubric is what makes this non-circular: it tests for properties of the output, not “does this match what the spec writer intended.”
Three sessions, three roles. Generator does not see the grader. Runner does not see the cases beforehand. Grader does not see the spec.
Wiring it into the dev loop
The eval suite that runs only when you remember to run it is the suite that catches nothing.
The minimum useful integration:
- A
npm run evalsscript that runs the full suite and prints pass/fail per case. - A
.github/workflows/evals.ymlthat runs it on every PR that touches the prompt directory or the agent code. - A baseline file (
evals/baseline.json) committed to the repo so regressions show up as diff lines, not as feelings.
The version of this that matters most is the baseline file. Without it, you have a test suite. With it, you have a regression detector.
A working example
Here is a minimal eval runner you can drop into a project today. It assumes your AI workflow exposes a run(input) function and you have a cases.json file with {name, input, expected} entries.
import cases from "./cases.json";
import { run } from "../src/workflow";
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
async function grade(
input: string,
output: string,
rubric: string,
): Promise<boolean> {
const msg = await client.messages.create({
model: "claude-opus-4-7",
max_tokens: 200,
messages: [
{
role: "user",
content: `Rubric:\n${rubric}\n\nInput:\n${input}\n\nOutput:\n${output}\n\nReply with exactly PASS or FAIL on the first line, then a one-sentence reason.`,
},
],
});
const text = msg.content[0].type === "text" ? msg.content[0].text : "";
return text.trim().startsWith("PASS");
}
const rubric = `The output must answer the user's question directly, cite at least one specific number from the input, and not invent facts.`;
let pass = 0,
fail = 0;
for (const c of cases) {
const output = await run(c.input);
const ok = await grade(c.input, output, rubric);
console.log(`${ok ? "PASS" : "FAIL"} ${c.name}`);
ok ? pass++ : fail++;
}
console.log(`\n${pass}/${pass + fail} passed`);
process.exit(fail > 0 ? 1 : 0);
That is the whole pattern. Cases in a file, runner in a script, grader with a separate rubric, exit code for CI. You can dress it up later. The shape is right.
What to do next
- Pick the one AI workflow in your product you are most nervous about.
- Spend an hour writing 10 adversarial cases for it.
- Spend another hour wiring up the runner above.
- Commit the baseline.
- Watch a PR break it within a week.
That is the entire path from “we have no evals” to “we have a scoreboard.” Most teams stop at step 1 because step 2 feels like work that does not ship. The teams that ship reliable AI features all did step 2.
If you build this and want a second pair of eyes on the rubric, send it over. The rubric is where most evals quietly fail.