EvaluationMar 17, 2026 · 3 min read

Why agent benchmarks lie — and what to measure instead

Most agent leaderboards measure the wrong thing. A practitioner's framework for evals that correlate with production reliability.

Open any agent leaderboard. You will see scores. You will see deltas. You will see celebratory release notes claiming new state-of-the-art on SWE-bench, GAIA, WebArena, OSWorld, or whichever benchmark is in fashion this quarter.

Now go ship one of those agents to a real customer. Read the bug reports. Listen to the support calls.

The disconnect is not subtle. The disconnect is enormous.

What benchmarks measure

Most agent benchmarks measure task completion under generous constraints:

A clean, isolated environment with no concurrent state.
A well-specified task — the kind a researcher would write.
A generous turn budget, often with retries permitted.
A deterministic grading rubric that rewards finishing the task, regardless of how it was finished.

This is fine for what it is. It tells you whether the agent can, in principle, accomplish a task in the benchmark's distribution. It tells you very little about whether the agent will be usable when deployed against real work.

What production demands

A production agent is graded on a different rubric — one that the benchmarks do not measure:

Recovery from partial failure. Real environments are stateful and dirty. The agent will be invoked mid-workflow, with stale caches, a half-applied migration, and a tool that returned an error the agent has never seen before. Can it recover?
Cost predictability. The same task, run 100 times, will produce 100 different token costs in most agents. The variance is often 100× between the best and worst case. Production cannot tolerate this.
Side-effect discipline. Did the agent only modify the files it should have modified? Did it leave behind a clean working tree? Did it write to anything outside its sandbox?
Honesty under uncertainty. When the agent does not know whether its plan worked, does it say so — or does it confabulate a success?

These are not soft metrics. They are the metrics that determine whether your customer trusts the agent enough to use it twice.

The evals we actually run

Internally, every Mercury Agent release is gated on a set of evals we call the production-grade suite. The composition shifts as we learn, but the shape is consistent:

Dirty environment evals. The agent is dropped into a repository with a half-applied migration, a failing test suite, and three open PRs. Can it figure out what it is and is not supposed to fix?
Cost-variance evals. The same task, run 30 times. Score is the ratio between p99 and p50 token cost. Smaller is better.
Side-effect tripwires. Adversarial tools that record every file read and write. A correct completion that touches files outside the permitted set fails the eval.
Honest-failure evals. Tasks designed to be unsolvable in the given environment. The agent passes only if it returns "I cannot complete this and here is why" instead of fabricating success.

None of these will get you a leaderboard headline. All of them correlate with whether the agent works in production.

A modest proposal

The agent field would be measurably better off if every benchmark release came with a paired "production-rubric" eval suite — the honest companion — that scored the same agent on dirty environments, cost variance, side-effect discipline, and honest failure.

Until that happens, the leaderboards will keep climbing and the bug reports will keep coming. The two are uncorrelated, and the gap is exactly the gap between research and product.

Written by

Cosmic Stack

Share on X Share on HN

← Newer

Edge-native inference and the cost of cold paths