Evaluation and Verification: You Own Correctness

Intermediate

After this you can prove an AI system is right instead of trusting that it looks right. You build a small fixed test set, change one thing at a time, and measure against the cheapest reliable check that catches each kind of error — so when you ship a change you know whether it helped, hurt, or did nothing.

Understand

A deterministic system either passes its tests or it doesn't, and once it passes you stop thinking about it. An AI system has no such floor. The same prompt that nails one input quietly mangles the sixth, the model that was right last week regresses when the provider updates it underneath you, and none of this throws an error. The output that is wrong looks exactly as confident as the output that is right. That is the whole problem this lesson exists to solve: correctness is not a property the system guarantees, it is a property you have to measure and keep measuring, and if you don't own that, nobody does.

The first thing to internalize is that verification is not the cheap half of the work. It is the expensive half. Generating plausible output is nearly free now, while checking whether it is actually correct is slow, takes expertise, and is the part a novice cannot do. The rough shape is familiar: a model gets you most of the way fast, and the last stretch (the debugging, the edge cases, the judgment about whether it is really right) is where it falls apart. This is the "70% problem," in Addy Osmani's framing. That last stretch is exactly what someone who couldn't have produced the output themselves has no way to supply. The practical consequence is uncomfortable. "This looks correct" should not relax you. On anything load-bearing it should make you verify harder, because the dangerous output is never the obviously broken one; that kind you catch. It's the one that compiles, runs, passes the happy path, and is wrong somewhere you didn't look.

Most people's instinct when a prompt underperforms is to tweak the wording, eyeball one output, and ship. That isn't evaluation, it's fiddling, and it produces a prompt overfit to one lucky input. The reason eyeballing one output fails is sharper than "small sample." Changing things the model is sensitive to moves accuracy far more than intuition predicts, and the swings come from places that feel cosmetic. In the canonical study behind few-shot brittleness (Zhao et al., "Calibrate Before Use"), reordering the same few examples — identical content, different sequence — moved sentiment accuracy from 54%, barely better than a coin flip, to 93%, near the best the model could do. Nothing about the information changed. If a reshuffle you'd never think to test can swing a result by nearly forty points, then one good-looking output tells you almost nothing about whether the system is reliable. You need a fixed set of inputs you run every time.

Why one output proves nothing: the order-bias swing. The same examples in a different order move accuracy from near-chance to near-best, so a single eyeballed output can't tell you whether the system is reliable.
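
To make the order effect concrete, here is a minimal sketch that scores every ordering of the same few-shot examples against one fixed test set. Everything in it is an assumption for illustration: `accuracy_by_order` is a hypothetical helper, the examples and test cases are made up, and `toy_predict` is a deliberately biased stand-in for a model call (it echoes the label of the last example when the input has no obvious cue). It does not reproduce the cited study; it only shows the shape of the measurement.

```python
from itertools import permutations
from typing import Callable, Sequence

Example = tuple[str, str]  # (text, label)


def accuracy_by_order(
    examples: Sequence[Example],
    test_set: Sequence[Example],
    predict: Callable[[Sequence[Example], str], str],
) -> dict[tuple[int, ...], float]:
    """Score every ordering of the same few-shot examples on a fixed test set."""
    results = {}
    for order in permutations(range(len(examples))):
        shots = [examples[i] for i in order]
        correct = sum(predict(shots, text) == label for text, label in test_set)
        results[order] = correct / len(test_set)
    return results


def toy_predict(shots: Sequence[Example], text: str) -> str:
    """Toy stand-in for a model (an assumption, not a real model call)."""
    if "great" in text:
        return "positive"
    if "awful" in text:
        return "negative"
    # No obvious cue: copy the label of the most recent example.
    return shots[-1][1]


# Hypothetical few-shot examples and test cases.
examples = [("loved it", "positive"), ("hated it", "negative"), ("fine film", "positive")]
test_set = [
    ("great acting", "positive"),
    ("awful pacing", "negative"),
    ("not my thing", "negative"),
    ("would skip it", "negative"),
    ("would watch again", "positive"),
]

scores = accuracy_by_order(examples, test_set, toy_predict)
print(f"worst order: {min(scores.values()):.2f}, best order: {max(scores.values()):.2f}")
```

With a real model you would replace `toy_predict` with your own call and keep the rest unchanged. The useful output is the gap between the worst and best ordering, not any single score.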

The eval loop that replaces fiddling is small and unglamorous. Assemble a fixed test set of real inputs, and deliberately include the edge cases and the ambiguous ones, not three easy examples that everything passes. Change exactly one variable, the prompt or the model or the retrieval, never several at once. Run the whole set. Compare against the last run. Then ship or iterate. The discipline that makes it work is the one-variable rule: if you change the prompt and the model together and the score moves, you have learned nothing about which one did it. A small fixed set you actually run every time is worth more than an elaborate eval framework you set up once and never populate.

The eval loop: the loop that replaces tweak-and-eyeball. A fixed test set including edge cases, one variable changed, the whole set re-run and compared before shipping.
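
The loop described above can be sketched in a few lines. This is one possible shape under stated assumptions, not a prescribed harness: `Case`, `run_eval`, and `compare` are hypothetical names, and `system_v1` / `system_v2` are trivial text functions standing in for "the system before" and "the system after exactly one change."

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    name: str
    input: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_eval(cases: list[Case], system: Callable[[str], str]) -> dict[str, bool]:
    """Run the whole fixed set and record pass/fail per case."""
    return {case.name: case.check(system(case.input)) for case in cases}


def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Diff two runs over the same cases: what got fixed and what broke."""
    return {
        "fixed": [n for n in candidate if candidate[n] and not baseline[n]],
        "broke": [n for n in candidate if baseline[n] and not candidate[n]],
    }


# Stand-ins for a model-backed step (assumptions for illustration only).
def system_v1(text: str) -> str:
    return text.strip().lower()


def system_v2(text: str) -> str:
    # The ONE change in this run: also collapse inner whitespace.
    return " ".join(text.split()).lower()


cases = [
    Case("plain", "Hello", lambda out: out == "hello"),
    Case("padded", "  Hello  ", lambda out: out == "hello"),
    Case("inner-spaces", "Hello   World", lambda out: out == "hello world"),
]

baseline = run_eval(cases, system_v1)
candidate = run_eval(cases, system_v2)
diff = compare(baseline, candidate)
print(sum(baseline.values()), "->", sum(candidate.values()), "of", len(cases), diff)
```

Reporting "fixed" and "broke" separately is a deliberate choice: a change that fixes two cases and breaks two others leaves the total unchanged, and only the per-case diff shows it.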

Once you are measuring, the next question is how to measure, and the operator move is to reach for the cheapest check that reliably catches a given error, not the most powerful one. There is a hierarchy here, cheapest and most reliable first. At the base sit deterministic assertions: is the JSON valid, are the required fields present, is it within length bounds, are the forbidden phrases absent. These cost almost nothing and catch a surprising share of failures, and you build them first. Above that, anywhere you can reframe the task as classification you get precision and recall, which are well-defined and stable. Above that, a narrow specialized model, like an entailment model that checks whether an answer actually follows from its source, runs cheaper and steadier than a general judge. Only at the top, for the genuinely subjective remainder that nothing simpler can measure, do you reach for an LLM-as-judge. The common mistake is to start at the top. Teams reach for an LLM judge first and skip the eighty percent they could have caught with a regex and a schema check.

The measurement hierarchy: match each error to the cheapest reliable check. Deterministic assertions catch most failures for near-nothing; the LLM judge is reserved for the subjective remainder, not the default.
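
As a sketch of the bottom tier, the four deterministic assertions named above (valid JSON, required fields, length bounds, forbidden phrases) fit in one small function using only the standard library. The function name, the field names, and the sample outputs are assumptions chosen for illustration; substitute whatever your own task requires.

```python
import json


def deterministic_checks(raw, required_fields, max_chars, forbidden_phrases):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []

    # Length bound on the raw output.
    if len(raw) > max_chars:
        failures.append(f"too long: {len(raw)} > {max_chars} chars")

    # Forbidden phrases, matched case-insensitively.
    lowered = raw.lower()
    for phrase in forbidden_phrases:
        if phrase.lower() in lowered:
            failures.append(f"forbidden phrase: {phrase!r}")

    # Valid JSON with an object at the top level.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        failures.append(f"invalid JSON: {exc.msg}")
        return failures
    if not isinstance(data, dict):
        failures.append("top-level JSON value is not an object")
        return failures

    # Required fields present.
    missing = [field for field in required_fields if field not in data]
    if missing:
        failures.append(f"missing fields: {missing}")
    return failures


# Hypothetical model outputs for a tagging task.
good = '{"category": "billing", "confidence": "high"}'
bad = '{"category": "billing", "note": "As an AI language model, I cannot be sure"}'
rules = dict(
    required_fields=["category", "confidence"],
    max_chars=200,
    forbidden_phrases=["as an ai language model"],
)

print(deterministic_checks(good, **rules))
print(deterministic_checks(bad, **rules))
```

Returning a list of failures rather than a single boolean keeps the result readable when you later go through the misses case by case.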

There is one judge that everyone trusts and shouldn't: the model judging itself. "Ask the model to check its own work" is the most cited verification tip and one of the least reliable. Without some external signal of correctness, self-correction often degrades the output, because the model talks itself out of right answers as readily as wrong ones (Huang et al. find exactly this for reasoning, Stechly et al. for plans). The distinction that matters is whether the critic has access to something the generator didn't. A critic with test execution, a separate model, ground-truth data, or a human in the loop can genuinely improve the output, because it brings new information. "Reflect on your answer" with nothing new to reflect against is theater. It adds words and a feeling of rigor without adding correctness.
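
One way to encode that distinction is to let a revision happen only in response to evidence from outside the generator. The sketch below is illustrative and makes several assumptions: `revise_with_evidence` is a hypothetical helper, `run_tests` plays the external check by actually executing the draft, and `toy_reviser` is a hard-coded stand-in for a model that is handed the failing evidence.

```python
from typing import Callable


def revise_with_evidence(
    draft: str,
    propose_revision: Callable[[str, str], str],
    external_check: Callable[[str], tuple[bool, str]],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Revise only when an external check fails; stop as soon as it passes."""
    current = draft
    for _ in range(max_rounds):
        ok, evidence = external_check(current)
        if ok:
            return current, True  # verified by something outside the generator
        current = propose_revision(current, evidence)
    ok, _ = external_check(current)
    return current, ok


def run_tests(source: str) -> tuple[bool, str]:
    """External signal: execute the draft and test its behavior."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # acceptable for a toy; sandbox real generated code
        result = namespace["add"](2, 3)
    except Exception as exc:
        return False, f"raised {type(exc).__name__}: {exc}"
    if result == 5:
        return True, ""
    return False, f"add(2, 3) returned {result}, expected 5"


def toy_reviser(source: str, evidence: str) -> str:
    """Stand-in for a model call that receives the failing evidence."""
    return source.replace("a - b", "a + b")


buggy = "def add(a, b):\n    return a - b\n"
final, verified = revise_with_evidence(buggy, toy_reviser, run_tests)
print(verified)
print(final)
```

Note what the loop never does: it never asks for a revision of a draft that already passes the check. That is the property that guards against a critic talking the system out of a right answer.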

Where it breaks

The measurement discipline has its own failure modes, and naming them is what keeps it honest. Generic benchmarks are the first trap. A high leaderboard score doesn't transfer to your task, because public benchmarks get contaminated when the test leaks into training, and gamed when labs optimize for the headline. The only number that tracks your reality is a small private eval set built from your own inputs. The deeper failure is human, not technical. When a system is usually right, reviewers stop genuinely reviewing, verification decays into rubber-stamping, and the rare wrong output sails through. Speed feeds the same decay: once AI generates faster than you can internalize, your review quietly slides from mentally simulating the output into pattern-matching on whether it looks plausible, which is the worst position to verify from. The defense is behavioral, not a tool. Assume something is wrong on every load-bearing review, and verify the parts that matter most rather than skimming everything evenly. There is also a boundary condition worth stating: an eval set is worth building only when a task runs repeatedly or the cost of being wrong is real. For a one-off, low-stakes answer, a single careful read is the right amount of verification, and a twenty-example harness is over-engineering.

Do it now

Before you trust an AI step that runs more than once or carries real consequences, stand up the smallest eval that catches its likely failures. Open a file, paste this scaffold, and fill it in for your own task — the point is that you run it, not that it's elaborate:

EVAL SET — <the one task this checks>

Test inputs (5–20 real cases — include the hard / ambiguous ones, not just easy passes):
  1. <input>            → expected: <what a correct output must contain or satisfy>
  2. <edge case>        → expected: <...>
  3. <ambiguous case>   → expected: <... or "should ask, not guess">
  ...

Checks, cheapest first — match each likely error to the lowest tier that catches it:
  [ ] DETERMINISTIC — valid format? required fields present? within length? forbidden phrases absent?
  [ ] CLASSIFICATION — can I reframe "is this right?" as a label I can score (yes/no, category)?
  [ ] JUDGE (only for the subjective remainder) — binary Pass/Fail + one-line written reason. No 1–5 scale.

Protocol:
  - Change ONE variable per run (prompt OR model OR retrieval — never several).
  - Run the WHOLE set every time. Compare to the last run before shipping.
  - If a critic/self-check step "improves" the output, confirm it had NEW information
    (a tool, a source, a second model). "Reflect harder" with nothing new is not verification.

The scaffold does two things at once. The test-inputs block forces you to write down what correct actually means before you measure, which is most of the value — half the time the act of specifying the expected output reveals that "good" was never defined. The checks block forces the cheap-first order, so you don't burn an LLM judge on something a schema check would have caught for free.

Worked example

Illustrative. A constructed case to show the discipline, not a real run.

A support team wires up an AI step that reads an incoming ticket and tags it with one of six categories so it routes to the right queue. In a quick demo it tags four sample tickets correctly, so it ships. Mis-routes start trickling in — not many, just enough to annoy people — and the instinct is to rewrite the tagging prompt to be "clearer." Someone adds a few more rules, the demo tickets still pass, and the mis-routes keep coming.

The eval-loop version starts by refusing to trust the four-ticket demo. They pull twenty real tickets, deliberately including the awkward ones: a billing complaint that mentions a bug, a feature request phrased as an angry rant, a two-word ticket with almost no signal. Because tagging is a label, the check is pure classification: does the output match the known-correct category? It is scored automatically, with no judge needed. The first run scores 14 of 20. Reading the six misses, looking at the actual failures rather than the aggregate, makes the pattern obvious within minutes: every ticket that mentions two topics gets tagged by whichever topic appears last in the text. That is a recency effect, not a wording problem at all.

Demo-pass versus eval-loop: the same tagging step. The four-ticket demo greenlights it and the real failure mode stays invisible, while a twenty-case classification set surfaces the recency bias the prompt rewrite would never have found.
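
The scoring step in a case like this needs very little code. The sketch below is a possible shape rather than the team's actual tooling: the six rows are hypothetical placeholders (not the twenty tickets from the story), and `precision_recall` is an assumed helper name. It computes accuracy and per-category precision and recall, then groups the misses by a property of the input so that a pattern such as "all misses mention two topics" becomes visible.

```python
from collections import Counter

# Hypothetical rows: (ticket id, expected category, predicted category, topics mentioned)
results = [
    ("t1", "billing", "billing", 1),
    ("t2", "billing", "bug", 2),
    ("t3", "feature", "feature", 1),
    ("t4", "bug", "billing", 2),
    ("t5", "bug", "bug", 1),
    ("t6", "feature", "feature", 2),
]


def precision_recall(rows, category):
    """Precision and recall for one category, treated as the positive class."""
    tp = sum(1 for _, exp, pred, _ in rows if exp == category and pred == category)
    fp = sum(1 for _, exp, pred, _ in rows if exp != category and pred == category)
    fn = sum(1 for _, exp, pred, _ in rows if exp == category and pred != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


accuracy = sum(exp == pred for _, exp, pred, _ in results) / len(results)

# Read the misses, not just the aggregate.
misses = [row for row in results if row[1] != row[2]]
misses_by_topic_count = Counter(topics for *_, topics in misses)

print(f"accuracy: {accuracy:.2f}")
print("billing precision/recall:", precision_recall(results, "billing"))
for ticket_id, expected, predicted, topics in misses:
    print(f"MISS {ticket_id}: expected {expected}, got {predicted}, topics={topics}")
print("misses by topics mentioned:", dict(misses_by_topic_count))
```

The grouping column is the part worth adapting. Which property to group by (topic count, ticket length, language) is a judgment call you make after reading the first few failures.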

Now the fix has a target. They change one thing, instructing the model to weigh the ticket's primary complaint over later mentions, and re-run all twenty. The score goes to 19 of 20, the last miss being the genuinely two-word ticket that no rule can save, and that one gets routed to a human by design. The point is not the number. It's that the prompt-rewrite path would have kept editing wording forever, because it never looked at which tickets failed and never saw that the failure was structural. Reading six real misses found in minutes what tweaking-and-eyeballing would have missed indefinitely. The system didn't tell anyone it was getting multi-topic tickets wrong. Someone had to measure it.