Why Systems Pass Every Eval and Still Fail in Production
An eval grades the output. Production grades the decision. Those are different measurements, and the gap between them is where shipped systems break.
You have seen this, or you will. The eval suite is green. Coverage looks honest: hundreds of cases, the adversarial ones included, the regression set someone built after the last incident. Pass rate ticks up release over release. You ship. And within a week, production does something the suite swore it wouldn’t, on an input that, written out as a test case, would obviously have passed.
The reflex explanation is that the eval set was incomplete. Add the missing case, raise coverage, move on. That explanation is comforting because it implies the gap is a quantity problem, and quantity problems get solved by grinding. I don’t think it’s a quantity problem. I think a system can pass an eval set with perfect coverage and still fail in production, for reasons that have nothing to do with how many cases you wrote and everything to do with what an eval is structurally able to measure.
Thanks for reading. This piece is the part of the argument I keep circling back to: the instrument and the thing it’s pointed at are not the same shape.
What an eval actually measures
Strip an eval down to its mechanism. You take a frozen set of inputs, run the system on each, and grade the outputs against an expected answer or a rubric. Pass rate is the fraction of outputs that cleared the bar. That is the whole instrument, and it is a good instrument for what it measures: output correctness, on a fixed distribution, one input at a time.
Now look at what each of those properties quietly assumes, because production violates all three.
An eval grades the output. Production is generated by a decision. The grade lands on the final answer, which is the last step of a process and the one place all the interesting structure has already collapsed. An output that was produced by a clean, decisive, well-grounded choice and an output produced by a near-tie between two alternatives, one of which was wrong, look _identical_ on the rubric. Both are the correct string. Both pass. But only one of them is a system you can trust the next time the context shifts the margin by a hair. The eval cannot tell them apart, because it is grading the thing they have in common and ignoring the thing that distinguishes them. A correct output sitting on top of a marginal decision is a latent failure wearing a passing grade, and the eval is exactly the instrument that can’t see it.
An eval fixes the distribution. Production drifts it. The test set is sampled once and frozen. That is what makes it a measurement at all, you need a fixed yardstick. But production is non-stationary by construction: the inputs move as the world and your users move, and inside any multi-turn session the context accumulates, so the distribution the system actually faces at turn nine is one no eval case ever held still long enough to capture. The eval certifies behavior on the distribution you froze. Production runs on the one that’s drifting away from it the moment you ship.
An eval scores the aggregate. Production dies in the tail. Pass rate is a mean, and a mean is precisely the statistic that buries the failures that matter. The production-killing failures are rare, correlated, and context-triggered, the specific combination of a stale assumption, an unusual ordering, and a retrieval that didn’t fire. Each is individually low-frequency, so each contributes almost nothing to the aggregate. 98% looks like a system that’s basically right. It can equally be a system that is reliably right on the easy 98% and reliably wrong on the 2% that happens to be where your highest-value, highest-consequence traffic lives. The number is the same. The system is not.
None of these is fixed by writing more cases. More cases sharpen the estimate of output correctness on a frozen distribution. They do not turn an output measurement into a decision measurement, they do not un-freeze the distribution, and they do not stop the mean from hiding the tail. You can drive coverage to a place that feels exhaustive and still have measured none of what production is about to do to you.
The Goodhart turn
And that’s the failure mode if teams simply accept these limits. There’s a worse one that shows up the moment the eval stops being a gauge and becomes a release gate, because then it’s the team doing it to itself. People tune prompts until the suite goes green. The model gets nudged toward the rubric’s idea of a good answer. The suite climbs. And what you have built, without anyone deciding to, is a system optimized to produce outputs that score well on a frozen set, which is not the same system as one that makes sound decisions on a moving one. The eval was supposed to be a proxy for production behavior. Optimize hard enough against the proxy, and you get a system that is excellent at the proxy and silent about the thing the proxy was standing in for.
What it looks like
Here is the shape of it, sanitized. A support agent, evaluated on a suite that includes the exact capability that later failed: read a multi-part instruction, take the actions in the right order, don’t act on a write that was never confirmed. The relevant eval row:
| Field | Value |
|--------------|-------------------------------------------------|
| **Case** | "update the account and let them know" |
| **Expected** | confirm write → update → notify, in that order |
| **Output** | confirm write → update → notify |
| **Grade** | ✅ PASS |Clean pass. The capability is present and the suite proves it, on that input, in isolation, on turn one. Now the same capability, in production, eight turns into a real conversation that had drifted:
| Field | Value |
|---------------------|--------------------------------------------------------------------------------------------------------------------------|
| **Observed** | notify fired ahead of the confirmed write |
| **Reconstructed** | a confirmation given at turn 2 was still treated as live at turn 9; the governing rule ("notify only after a confirmed write") was never retrieved |The eval did not miss a case. The eval had the case, and the system passed it. What the eval could not hold was the condition under which the capability breaks, which is not an input you can write down as a single graded turn. It’s a state that accumulates: a decision made on a stale assumption, with the decisive rule sitting unretrieved, on a margin narrow enough that turn nine’s drift was enough to flip it. The output the eval graded and the decision production ran were never the same object.
That second table earns one caution. The Observed row happened. The Reconstructed row is not read off a log; it’s an account of why the choice came out the way it did, the kind of thing that today lives in an engineer’s head after a week of staring at a trace. I split the rows so the table can’t quietly pass one off as the other: I’m writing the reconstruction as data to make the point legible, not to suggest it printed itself.
The instrument and the behavior
Put the two surfaces next to each other, and they turn out to be the same problem seen from opposite ends. In production, observability grades the realized execution path and can’t recover the decision that selected it, which is the gap I’ve spent a few pieces on now. In pre-production, the eval grades the realized output and can’t recover the decision that produced it. Same collapse, same blind spot, two different moments in the lifecycle. Both instruments are pointed at the artifact a decision leaves behind, and both are silent on the decision itself.
Which is the actual reason “pass every eval, fail in production” is a stable, recurring pattern rather than a run of bad luck. It isn’t that teams write bad evals. It’s that an output graded on a frozen distribution and a behavior generated by decisions under a moving one are different measurements, and you cannot close the distance between them by improving the first. The unit has to change. What you want to certify before shipping is not “did the right string come out on these inputs” but “does the system decide soundly under the conditions production will actually create”, how close its margins are, whether the decisive context gets retrieved when it’s buried. That’s a decision-level question, and it’s the same one the production side has been asking. The decision is the unit that’s missing from both, which is the direction I’m building Nalyqor toward: making that decision a first-class, gradable object instead of something reconstructed by hand after the incident.
In closing
The eval is not lying to you. It is answering its own question accurately: on this frozen set of inputs, graded one output at a time, the system produces correct strings at this rate. The mistake is reading that answer as if it were a different one, a guarantee about how the system will decide under the shifting, accumulating, tail-heavy conditions of production. It was never measuring that. It can’t, because an output on a fixed distribution and a decision under a moving one are not the same measurement, and no amount of coverage converts one into the other.
A green eval suite tells you the outputs were right on the distribution you froze. It is silent on whether the decisions were sound on the one you’ll actually face. Until evaluation grades decisions and not just outputs, “passed every eval” and “fails in production” will keep being true of the same system at the same time, and we’ll keep being surprised by it.
The eval told you the answers were right. The next question is whether the system was.

