Your first eval suite with Braintrust

The 60-minute path from 'we ship LLM features on vibes' to 'we block merges on eval regressions'. Datasets, scorers, experiments, and the CI wiring that makes it real.

I’ve watched this pattern play out at maybe a dozen teams in the last year. Someone ships an LLM feature. It works well enough in demo. It goes to production. Three weeks later a customer complaint surfaces — summaries got worse, or classifications flipped, or the tone shifted — and nobody can say which prompt change caused it, because there were seven prompt changes, and nothing in CI was measuring output quality.

That’s the default state of “we test LLM features” in 2026. You test by reading three examples in a staging environment and nodding. Vibes.

This primer is the 60-minute path out of that. Install Braintrust, build a dataset from ten real examples, wire up a scorer, run your first experiment, compare two prompt versions, and put the whole thing in CI behind a branch protection rule. End state: prompt changes that regress output quality can’t silently ship.

The problem, more precisely

LLM features fail differently than regular software. A function that sorts integers either sorts integers or it doesn’t, and you can cover the behavior with twenty unit tests. A prompt that classifies support tickets has a distribution of outputs, and a regression looks like “accuracy on the hard cases dropped from 87% to 82% while easy cases held.” That’s not a failure any unit test catches.

Two things make this worse in 2026 specifically:

Prompts are code now. A real product has dozens of prompts — system prompts for each feature, per-tool descriptions in agent loops, classifier prompts buried in retrieval pipelines, LLM-judge prompts in your own evals. Changing any of them is a deployment. Without eval coverage, you’re deploying untested code.
Models themselves drift underneath you. Even if you never touch your prompts, a provider-side model update can shift behavior. “Pin the model version” helps but doesn’t eliminate the problem — eventually you have to migrate off a deprecated version, and when you do you want to know what changed in your output distribution.

Evals are the only way to keep up. But evals that live in a notebook on someone’s laptop and get run “when we remember” don’t keep up with anything. Evals fix the shipping problem only when they run automatically on every PR and can block a merge.

An eval suite, in three parts

Before any tool, the mental model. An eval suite is three components. That’s it.

Dataset — a collection of (input, optional expected output) pairs that represent the workload you care about. Ten support tickets with their correct categories. Fifty code snippets with their correct summaries. A hundred user questions with known good answers. If your product doesn’t have a dataset like this, you don’t have evals yet; you have sentiment.

Scorer — a function with the signature (input, actual_output, expected_output?) => number in [0, 1]. Zero is worst, one is best. The scorer encodes what “good” means. Exact-match is the simplest scorer. LLM-as-judge is the most flexible. Between them is a spectrum.

Experiment — one run of your model/prompt over the dataset, with each row scored, stored as a single object you can compare against later experiments. An experiment is the unit of “we changed the prompt and here’s what happened.”

Tools give you workflow around these three things. The three things are what matter.
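
To make the mental model concrete, here is a minimal sketch of the three parts as plain TypeScript. The type names and `runExperiment` are illustrative assumptions for this primer, not Braintrust's own definitions; the hosted tools add storage, UI, and comparison on top of roughly this shape.

```typescript
// Illustrative types only; these are not Braintrust's type definitions.
interface Example<I, O> {
  input: I;
  expected?: O;
}

// A scorer maps (input, output, expected?) to a number in [0, 1].
type Scorer<I, O> = (args: { input: I; output: O; expected?: O }) => number;

interface ExperimentRow<I, O> extends Example<I, O> {
  output: O;
  score: number;
}

interface Experiment<I, O> {
  name: string;
  rows: ExperimentRow<I, O>[];
  mean: number;
}

// One experiment = run the task over every example and score each row.
async function runExperiment<I, O>(
  name: string,
  dataset: Example<I, O>[],
  task: (input: I) => Promise<O>,
  scorer: Scorer<I, O>,
): Promise<Experiment<I, O>> {
  const rows: ExperimentRow<I, O>[] = [];
  for (const example of dataset) {
    const output = await task(example.input);
    const score = scorer({ ...example, output });
    rows.push({ ...example, output, score });
  }
  const total = rows.reduce((sum, row) => sum + row.score, 0);
  const mean = rows.length === 0 ? 0 : total / rows.length;
  return { name, rows, mean };
}

// The simplest scorer: 1 when output equals expected, 0 otherwise.
const exactMatch: Scorer<string, string> = ({ output, expected }) =>
  output === expected ? 1 : 0;
```

Everything later in this primer is a richer version of one of these three pieces.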

The tool landscape

The eval-tooling space shook out over 2024–2025. As of early 2026 it looks roughly like this:

Tool	Best for	Notes
Braintrust	Hosted, best-in-class UX for comparing experiments, strong CI integration	Closed-source; per-seat pricing above the free tier
Langfuse	Self-hosted, OSS, compliance-sensitive workloads	MIT-licensed; slightly rougher UX but you own the data
Humanloop	Prompt-management-first teams that want evals attached	More product-ops flavored than dev-first
OpenAI Evals	Teams already all-in on the OpenAI stack	Tighter integration, narrower scope
Custom (Postgres + scripts)	Teams with one unusual workload and real engineering capacity	You’ll reinvent 60% of what Braintrust/Langfuse do

I default to Braintrust for most teams starting out. The UI for diffing two experiments side-by-side is genuinely the best in this space, and the CI integration is a few lines of config away. It's free to try, and the cost of leaving is low because the dataset/scorer concepts are portable.

For anyone with self-hosting requirements (healthcare, finance, gov), Langfuse is the right default. The mental model in this primer applies to both; the code examples use Braintrust.

Setup

Ten minutes end to end.

pnpm add braintrust autoevals

autoevals is the companion package with the out-of-the-box scorers. Both are published by Braintrust, both are MIT-licensed, both work fine with any model provider.

Create an account at app.braintrust.dev, create a project (call it my-app-evals or similar), grab an API key from settings:

export BRAINTRUST_API_KEY=bt_...

That’s the full setup. You can now import { Eval } from "braintrust" and run experiments.

Your first dataset

This is the step teams overthink the most. You do not need 500 examples. Ten to twenty real examples beat a hundred synthetic ones, every time. The reason is obvious once you think about it: a synthetic dataset measures how well your model handles the kinds of inputs you thought to generate, which is exactly not the distribution of inputs your users actually send.

Where do the ten real examples come from? Production logs, for anything already live. Internal dogfooding transcripts, for anything pre-launch. Customer support tickets that surfaced complaints, for anything that failed. Curate actively — you’re looking for a mix of easy cases (the dataset needs a floor), hard cases (where regressions surface first), and adversarial cases (edge cases you want to pin behavior on).

Two ways to get a dataset into Braintrust.

Via the TypeScript SDK, for programmatic dataset construction from existing data:

import { initDataset } from "braintrust";

const dataset = await initDataset("my-app-evals", {
  dataset: "support-ticket-classification-v1",
});

const examples = [
  {
    input: "Hi, I was charged twice for my subscription this month. Can I get a refund?",
    expected: "billing",
  },
  {
    input: "The app crashes every time I open the reports tab on iOS.",
    expected: "bug",
  },
  {
    input: "Can you add a way to export data to CSV?",
    expected: "feature_request",
  },
  // ... 17 more
];

for (const example of examples) {
  dataset.insert({
    input: example.input,
    expected: example.expected,
    metadata: { source: "production_log_2026_02" },
  });
}

await dataset.flush();

Via CSV upload in the UI, for when your dataset lives in a spreadsheet a PM already curated:

Columns: input, expected, plus optional metadata_* columns.
Braintrust’s dataset view in the UI has an “Import CSV” button at the top right.
Uploaded datasets are versioned. Re-uploading with the same name creates a new version rather than overwriting, so the old version stays around for historical comparison.
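
If the examples start life in code rather than a spreadsheet, a small helper can produce a CSV with the columns described above. This is a sketch under assumptions: `DatasetRow`, `toCsv`, and the `metadata_source` column name are hypothetical, and the quoting follows the usual CSV convention (quote fields containing commas, quotes, or newlines; double embedded quotes).

```typescript
interface DatasetRow {
  input: string;
  expected: string;
  // Hypothetical metadata field; written out as a metadata_source column.
  source?: string;
}

// Quote a field only when it contains a comma, a double quote, or a line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Build CSV text with a header row followed by one line per example.
function toCsv(rows: DatasetRow[]): string {
  const header = ["input", "expected", "metadata_source"].join(",");
  const lines = rows.map((row) =>
    [row.input, row.expected, row.source ?? ""].map(csvField).join(","),
  );
  return [header, ...lines].join("\n");
}
```

Ticket text is full of commas and quotes, so the quoting step is the part worth getting right before you upload.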

Either way, you’re done in a few minutes. Resist the urge to scale the dataset up before you’ve wired up scorers — a large dataset without scorers is a rock you paid to move.

Your first scorer

Scorers are where most of the eval design thinking lives. Start with the cheapest thing that works.

autoevals ships about two dozen scorers out of the box. The two most useful on day one:

ExactMatch — returns 1 if output === expected, 0 otherwise. Perfect for classification, label prediction, structured extraction where the field is a fixed vocabulary.

import { ExactMatch } from "autoevals";

const result = await ExactMatch({
  output: "billing",
  expected: "billing",
});
// { name: "ExactMatch", score: 1 }

Factuality — an LLM-as-judge scorer that asks a model whether output is factually consistent with expected. Good for open-ended QA where wording will differ but the answer should be the same.

import { Factuality } from "autoevals";

const result = await Factuality({
  input: "When did the Apollo 11 landing happen?",
  output: "Apollo 11 landed on the Moon in July of 1969.",
  expected: "July 20, 1969",
});
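// Resolves to { name: "Factuality", score: <number in [0, 1]>, metadata: { rationale: "..." } }.
// The judge can give partial credit, for example when output is less specific than expected.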

For a classification task, ExactMatch is all you need. For a summarization task, you need something that tolerates rewording, which usually means an LLM judge.

LLM-as-judge, carefully

When exact-match doesn’t apply and no deterministic heuristic captures the intent, LLM-as-judge is the tool. The pattern: another LLM reads the input, the output, optionally the expected output, and returns a score (or a categorical label you map to a score).

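In TypeScript, autoevals exposes this as LLMClassifierFromTemplate, which takes a single options object:

import { LLMClassifierFromTemplate } from "autoevals";

// The type parameter declares the extra template variable ({{input}}) the scorer expects.
const Helpfulness = LLMClassifierFromTemplate<{ input: string }>({
  name: "Helpfulness",
  promptTemplate: `You are judging how helpful a support response is.
Given the customer message and the agent response, return one of:
- "excellent": fully answers the question, clear, actionable
- "adequate": addresses the question but missing minor detail
- "poor": misses the point, incorrect, or unhelpful

Customer: {{input}}
Agent response: {{output}}

Return only the label.`,
  // Map each label the judge can return to a score in [0, 1].
  choiceScores: {
    excellent: 1.0,
    adequate: 0.5,
    poor: 0.0,
  },
  // Pin an exact judge model version, never a "latest" alias.
  model: "claude-sonnet-4-6",
});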

Three things I’ve learned the hard way about LLM judges.

Judges drift. A judge model that rates your outputs today will rate them differently six months from now, because the judge model itself updates. Pin the judge model to a specific version (claude-sonnet-4-6, not claude-sonnet-latest). Treat a judge model change as a breaking change to your eval suite — you’ll need to re-baseline when you upgrade.

Judges disagree with themselves. Run the same judge over the same (input, output) pair five times and you'll often get more than one distinct score. Temperature zero mitigates but doesn't eliminate it. For anything high-stakes, run the judge three times and take the median.

Judges are expensive. Every scored row becomes an extra model call. A 500-row dataset with three judges is 1,500 model calls per experiment. Budget accordingly — run judges only where deterministic scoring genuinely doesn’t apply, and keep datasets small enough that judge cost stays reasonable.

Rule of thumb: deterministic scorers for anything you can make deterministic; LLM judges for the qualitative residue that’s left over.
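
The "run the judge three times and take the median" advice can be sketched as a wrapper around any async scorer. `ScoreResult`, `AsyncScorer`, and `medianOf` are hypothetical names for this illustration; real autoevals scorers return a richer object, so treat this as the shape of the idea rather than a drop-in.

```typescript
// Simplified scorer result; an assumption for this sketch.
type ScoreResult = { name: string; score: number };
type AsyncScorer<A> = (args: A) => Promise<ScoreResult>;

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Wrap a scorer so each row is judged `runs` times and the median score is kept.
function medianOf<A>(
  name: string,
  scorer: AsyncScorer<A>,
  runs = 3,
): AsyncScorer<A> {
  if (runs < 1) throw new RangeError("runs must be at least 1");
  return async (args) => {
    const results = await Promise.all(
      Array.from({ length: runs }, () => scorer(args)),
    );
    return { name, score: median(results.map((r) => r.score)) };
  };
}
```

The trade-off is cost: with three runs per judge, the 500-row, three-judge example above goes from 1,500 to 4,500 model calls per experiment.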

Running your first experiment

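Glue the pieces together. Here's a complete, minimal eval in one file. Besides braintrust and autoevals, it needs @anthropic-ai/sdk installed and ANTHROPIC_API_KEY set in your environment: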

// evals/support-ticket-classifier.eval.ts
import { Eval, initDataset } from "braintrust";
import { ExactMatch } from "autoevals";
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

async function classify(ticket: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 32,
    system:
      "Classify the support ticket into exactly one of: billing, bug, feature_request, account, other. Respond with only the label.",
    messages: [{ role: "user", content: ticket }],
  });
  const text = response.content[0];
  if (text.type !== "text") throw new Error("expected text");
  // Normalize so ExactMatch is not tripped by casing or stray whitespace.
  return text.text.trim().toLowerCase();
}

Eval("my-app-evals", {
  // Load the hosted dataset created earlier.
  data: initDataset("my-app-evals", {
    dataset: "support-ticket-classification-v1",
  }),
  task: async (input) => classify(input),
  scores: [ExactMatch],
  experimentName: "baseline-sonnet-4-6",
  metadata: {
    model: "claude-sonnet-4-6",
    prompt_version: "v1",
  },
});

Run it:

npx braintrust eval evals/support-ticket-classifier.eval.ts

Braintrust logs each row as it runs, then prints a summary with the aggregate scorer average and a link to the experiment in the web UI. Click through.

What you see in the UI:

A row per dataset example, with the input, the model’s output, the expected value, and every scorer’s score.
An aggregate score at the top (the mean across rows), plus per-scorer breakdowns.
Filters to slice by score (“show me only rows where ExactMatch was 0”) to find what broke.

That row-level filterable view is where most of the real work happens. You’re not looking at “we’re at 0.84 accuracy” — you’re looking at the specific five rows that failed and asking why.

Deterministic custom scorers

Built-in scorers cover the common cases. For everything else, a scorer is just a function with the right signature. Some examples I’ve written recently:

// Scorer: output is valid JSON matching a shape
import { z } from "zod";

const ResponseShape = z.object({
  intent: z.string(),
  entities: z.array(z.string()),
  confidence: z.number(),
});

function ValidJson({ output }: { output: string }) {
  try {
    const parsed = JSON.parse(output);
    ResponseShape.parse(parsed);
    return { name: "ValidJson", score: 1 };
  } catch {
    return { name: "ValidJson", score: 0 };
  }
}

// Scorer: output stays under a length budget
function UnderLengthLimit(limit: number) {
  return ({ output }: { output: string }) => ({
    name: `UnderLengthLimit(${limit})`,
    score: output.length <= limit ? 1 : 0,
  });
}

// Scorer: generated code parses as valid TypeScript
import * as ts from "typescript";

function CompilesCleanly({ output }: { output: string }) {
  const sourceFile = ts.createSourceFile(
    "out.ts",
    output,
    ts.ScriptTarget.Latest,
  );
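  // parseDiagnostics is an internal, untyped field (hence the cast).
  // It reports syntax errors only, so this checks parsing, not type-checking.
  const hasError = (sourceFile as any).parseDiagnostics?.length > 0;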
  return { name: "CompilesCleanly", score: hasError ? 0 : 1 };
}

These take five minutes to write and they run in milliseconds. Any behavior you can check with a deterministic function, check with a deterministic function. Save the LLM judge for the parts that genuinely need judgment.

Comparing experiments side-by-side

This is where evals stop being theoretical. Make a prompt change, run a second experiment against the same dataset:

Eval("my-app-evals", {
  data: () => ({ dataset: "support-ticket-classification-v1" }),
  task: async (input) => classifyWithNewPrompt(input),
  scores: [ExactMatch],
  experimentName: "new-prompt-v2",
  metadata: {
    model: "claude-sonnet-4-6",
    prompt_version: "v2",
  },
});

Open the Braintrust UI, select both experiments, click Compare. You get:

Aggregate score deltas (v2 is +2.1% overall).
A per-row diff view — for every row, the v1 output, the v2 output, and how each scorer moved.
Histograms of score distributions across both experiments.

What you’re looking for is not just “did the average go up.” The average going up while twelve specific hard cases regressed is a terrible outcome dressed up as progress. The side-by-side is for catching that.
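
The same check can be sketched offline, assuming you have per-row scores for both experiments keyed by a stable row id. `ScoredRow` and `compareExperiments` are hypothetical names for this illustration; this is not the Braintrust comparison API.

```typescript
interface ScoredRow {
  id: string;
  score: number;
}

function meanScore(rows: ScoredRow[]): number {
  if (rows.length === 0) return 0;
  return rows.reduce((sum, row) => sum + row.score, 0) / rows.length;
}

// Report the aggregate delta alongside the specific rows that moved.
function compareExperiments(baseline: ScoredRow[], candidate: ScoredRow[]) {
  const baselineById = new Map<string, number>(
    baseline.map((row): [string, number] => [row.id, row.score]),
  );
  const regressed: string[] = [];
  const improved: string[] = [];
  for (const row of candidate) {
    const before = baselineById.get(row.id);
    if (before === undefined) continue; // row is new in the candidate run
    if (row.score < before) regressed.push(row.id);
    else if (row.score > before) improved.push(row.id);
  }
  return {
    meanDelta: meanScore(candidate) - meanScore(baseline),
    regressed,
    improved,
  };
}
```

A positive `meanDelta` with a non-empty `regressed` list is exactly the "progress" you want to read row by row before merging.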

Most of my actual prompt-engineering work now happens in this loop: change prompt, run experiment, compare, look at regressions, adjust, rerun. With a small dataset, each pass takes a few minutes. That's the unlock: the iteration speed on prompt quality finally matches the iteration speed on the code around it.

CI integration

Here’s where this goes from “useful tool I’ll remember to run” to “the safety net is always on.”

Braintrust experiments can be triggered from CI. The minimal GitHub Action:

# .github/workflows/eval.yml
name: Eval

on:
  pull_request:
    paths:
      - "src/prompts/**"
      - "src/agents/**"
      - "evals/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 9

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: "pnpm"

      - run: pnpm install --frozen-lockfile

      - name: Run eval
        env:
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npx braintrust eval \
            --push \
            --fail-on-regression 0.02 \
            evals/

Two flags doing the work here:

--push logs the experiment to Braintrust so it appears in the web UI alongside manual runs. This matters because reviewers want to click through to the comparison.
--fail-on-regression 0.02 makes the action exit non-zero if any scorer drops more than 2% against the configured baseline (typically the last run on main). Tune this threshold — 2% is a reasonable starting point for most workloads; 1% is stricter but noisier with small datasets.

The paths filter is important. You don’t want every PR paying for an eval run — you want eval runs on PRs that touch prompts, agent code, or evals themselves. Tune the paths to match your repo.
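
To reason about the threshold, it helps to see the comparison a regression gate performs. The sketch below is an illustration, not Braintrust's implementation: `findRegressions` and `scoreStep` are hypothetical, and it assumes the threshold is an absolute drop in a scorer's mean on the 0-1 scale.

```typescript
// Mean score per scorer name, e.g. { ExactMatch: 0.85 }.
type ScorerMeans = Record<string, number>;

// List every scorer whose mean dropped by more than `threshold` versus the baseline.
function findRegressions(
  baseline: ScorerMeans,
  candidate: ScorerMeans,
  threshold: number,
): string[] {
  const failures: string[] = [];
  for (const [scorer, before] of Object.entries(baseline)) {
    const after = candidate[scorer];
    if (after === undefined) continue; // scorer absent from the candidate run
    if (before - after > threshold) {
      failures.push(`${scorer}: ${before.toFixed(3)} -> ${after.toFixed(3)}`);
    }
  }
  return failures;
}

// Smallest possible move in the mean of a 0/1 scorer over `rows` rows.
function scoreStep(rows: number): number {
  return 1 / rows;
}
```

Under that assumption, dataset size sets the floor on sensitivity: with 20 rows and a 0/1 scorer, a single flipped row moves the mean by 0.05, so a 0.02 threshold fails on any one-row regression. That is the sense in which tighter thresholds get noisy on small datasets.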

Blocking merges on regressions

The action is now running on PRs and will fail the check on regressions. Last step: require the check to pass before merging.

Repository Settings → Branches → branch protection for main → Require status checks to pass before merging → add Eval / eval to the required checks.

Now a prompt change that regresses classification accuracy by more than 2% cannot silently ship. The author has to either fix the prompt, update the dataset (with justification), or explicitly override the branch protection rule, which leaves an audit trail.

That’s the moment evals become load-bearing in your process. Before the branch protection rule, evals were a suggestion. After, they’re a contract.

What to eval, and what not to eval

The mistake I see most often is eval-coverage-as-aspiration: teams try to build a dataset that covers every possible usage pattern, produce a three-hundred-row monster, and then never refresh it because refreshing it is three hundred rows of work.

The alternative I recommend:

Eval the 10 tasks your product actually runs. Not 1,000 synthetic ones. If your product has one classification prompt, one summarization prompt, and one agent loop, you need three eval suites, not thirty.
Keep each dataset small enough to refresh monthly. 20-50 rows per suite. Small enough that you can re-curate from production traffic in an hour.
Bias the dataset toward failures you’ve seen. When a customer complaint surfaces a bad output, that input belongs in the dataset. A dataset curated from failures is a dataset that predicts failures.
Don’t eval things that are easy to unit-test. If your code parses a JSON response and throws on malformed output, you don’t need a ValidJson scorer — a unit test does that faster and for free.

A good eval suite is small, actively maintained, and targeted at the quality dimensions that actually matter for your product. A bad eval suite is large, stale, and measures many things nobody reads.

Eval rot

Datasets rot. Three specific ways:

Distribution drift. Your users’ inputs in March look different from their inputs in October. Your dataset, curated in March, drifts out of alignment with reality.

Model drift. You upgrade from Sonnet 4.5 to Sonnet 4.6. The baseline scores shift across the board. You now have to re-baseline every experiment.

Scorer drift. An LLM judge model updates underneath you and rates the same outputs differently. Deterministic scorers don’t drift, which is another reason to prefer them.

Counter-measures: schedule a dataset review quarterly. Pick a day, re-curate 20% of the dataset from recent production traffic, retire rows that no longer represent real usage, re-baseline. Treat the dataset as a maintained artifact, not a one-time deliverable.
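
A sketch of the mechanical part of that quarterly refresh, under a loud assumption: it retires the oldest rows first, using age as a stand-in for "no longer represents real usage". In practice you would pick the rows to retire by hand. `DatedRow` and `refreshDataset` are hypothetical names.

```typescript
interface DatedRow {
  input: string;
  expected: string;
  // ISO date (YYYY-MM-DD) the row was added; sorts chronologically as a string.
  addedAt: string;
}

// Replace roughly `fraction` of the dataset with freshly curated rows,
// dropping the oldest rows to make room. Dataset size stays constant.
function refreshDataset(
  current: DatedRow[],
  fresh: DatedRow[],
  fraction = 0.2,
): DatedRow[] {
  const replaceCount = Math.min(
    Math.round(current.length * fraction),
    fresh.length,
  );
  const newestFirst = [...current].sort((a, b) =>
    a.addedAt < b.addedAt ? 1 : a.addedAt > b.addedAt ? -1 : 0,
  );
  const kept = newestFirst.slice(0, current.length - replaceCount);
  return [...kept, ...fresh.slice(0, replaceCount)];
}
```

After a refresh the scores are no longer comparable to earlier runs, which is why the re-baseline step comes last.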

Pitfalls

Datasets that are too easy. If every experiment scores 0.97, regressions are invisible — a 2% drop on a 0.97 mean is noise. Add harder rows until the mean sits somewhere around 0.75-0.85. You want headroom in both directions.

Dataset leakage. If you’re feeding eval dataset examples into your prompt as few-shot examples, you’re measuring memorization, not generalization. Keep your eval dataset strictly disjoint from any examples you include in prompts.
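
A cheap guard against that leakage is a check you can run in a unit test or at the top of the eval file. This is a minimal sketch: `findLeakedExamples` is a hypothetical helper, and it only catches matches that are identical after whitespace and case normalization, not paraphrases.

```typescript
// Collapse case and whitespace so trivially reformatted copies still match.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Return the eval inputs that also appear among the prompt's few-shot examples.
function findLeakedExamples(
  evalInputs: string[],
  fewShotInputs: string[],
): string[] {
  const fewShot = new Set(fewShotInputs.map(normalize));
  return evalInputs.filter((input) => fewShot.has(normalize(input)));
}
```

Failing fast when the returned list is non-empty keeps the two sets disjoint as both evolve.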

LLM-judge model version drift. Pinning the judge model matters more than pinning the tested model. If your Factuality scorer uses claude-sonnet-latest, you have a moving scorer; your scores are not comparable across time.

Metrics nobody reads. A CI check that fails and everyone merges around defeats the purpose. If the eval fails and the team overrides the protection, the threshold is wrong or the dataset is wrong. Recalibrate. Don’t let “override the eval” become a normal workflow step.

One giant eval suite. A single 500-row eval that takes 20 minutes is a CI bottleneck. Split by concern — one small suite per prompt or feature — and run only the suites affected by each PR.
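
One way to run only the affected suites is a small mapping from source paths to eval files, fed with the PR's changed-file list. Everything here is hypothetical (the `SuiteRule` shape, the prefixes, the summarizer eval file); it is a sketch of the idea, and separate workflows with their own paths filters achieve the same thing.

```typescript
interface SuiteRule {
  // Path prefix that, when touched, should trigger the suite.
  prefix: string;
  // Eval file to run for that prefix.
  evalFile: string;
}

// Pick the eval files whose source prefixes overlap the changed files.
function suitesForChanges(
  changedFiles: string[],
  rules: SuiteRule[],
): string[] {
  const selected = new Set<string>();
  for (const rule of rules) {
    if (changedFiles.some((file) => file.startsWith(rule.prefix))) {
      selected.add(rule.evalFile);
    }
  }
  return Array.from(selected).sort();
}

// Hypothetical layout: one prompt directory per feature, one eval file each.
const rules: SuiteRule[] = [
  {
    prefix: "src/prompts/classifier/",
    evalFile: "evals/support-ticket-classifier.eval.ts",
  },
  { prefix: "src/prompts/summarizer/", evalFile: "evals/summarizer.eval.ts" },
];
```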

The 60-minute starter path

If you do exactly this, in this order, you have evals in CI by lunch:

pnpm add braintrust autoevals, grab API key, set env var. (5 min)
Pick one prompt in your product. Curate 10 real examples from production with expected outputs. Upload as a dataset. (20 min)
Pick one scorer from autoevals (ExactMatch for classification, Factuality for QA). (5 min)
Write the Eval({ ... }) block. Run it locally, verify the numbers appear in Braintrust. (10 min)
Add an LLM-as-judge scorer if the output is open-ended. Pin the judge model version. (10 min)
Copy the GitHub Action YAML above. Commit. Add branch protection on main requiring the eval check. (10 min)
Open a PR that intentionally regresses the prompt. Watch the check fail. Close the PR. (5 min)

That last step is the one teams skip, and it’s the most important. You want to see the failure mode before you rely on it. Prove the safety net catches something before you bet on it.

Where this fits

Evals are the quality-side complement to a few other systems you’re probably thinking about:

Prompt caching is the cost side. Evals measure quality; caching measures cost. Both are production-readiness requirements — pair them. See the prompt-caching primer.
Retrieval needs its own evals. If you have a RAG pipeline, you need retrieval-quality metrics (recall@k, MRR) in addition to end-to-end answer quality. The pgvector primer covers the retrieval side; eval both layers independently.
Longer-running agents need evals the most. Multi-step agent tasks have more surface area to regress on. If you’re running anything longer than a single turn — see the task-budget primer — evals on success rate per task shape are the only way to tell when agent behavior drifts.

The short version: LLM features without evals are features without tests. You’d never merge untested code into main. Stop shipping untested prompts. The tooling is mature now; the excuse isn’t there anymore. Sixty minutes, one afternoon, and the class of “silent regression weeks after deploy” bug stops happening to you.