Long-horizon tool use without context collapse
What we learned from running coding agents for 1,000+ turns: the surprising failure mode isn't reasoning, it's memory shape.
We ran our internal coding agent — an early build of what is now Mercury Agent — through a battery of long-horizon engineering tasks. Bug fixes that spanned dozens of files. Refactors that took hours of wall-clock time. Greenfield features delivered end-to-end in a single autonomous session.
The result was not what we expected.
The failure mode isn't reasoning
Most published agent evals stop somewhere between turn 10 and turn 50. That's roughly the point at which the agent has loaded enough of the codebase to make a non-trivial decision. It is also the point at which most benchmarks declare victory.
In production, the interesting failures start at turn 200.
The agent does not suddenly forget how to reason. It does not start hallucinating APIs. What happens is far more subtle: the shape of its working memory deforms in ways that make subsequent reasoning subtly wrong, even when each individual reasoning step looks fine in isolation.
We call this context collapse, and it has at least three observable modes:
- Pivot amnesia. The agent makes a decision at turn 40, executes against it for 80 turns, then quietly reverses the decision at turn 140 — because the original rationale has been compressed out of working memory and only the artifacts remain.
- Recency tyranny. The most recent tool output dominates planning, regardless of whether it was a load-bearing observation or incidental noise.
- Self-deception via summarization. When the agent summarizes its own progress, the summary loses the failures faster than it loses the successes. A 30-turn debugging dead-end compresses into "we investigated this avenue" — and the agent then re-investigates the same avenue 200 turns later.
What worked
The interventions that helped most were not larger context windows, better models, or more capable reasoning. They were structural changes to how memory is shaped:
- Persistent, append-only decision logs that survive summarization and are surfaced verbatim into planning prompts.
- Hierarchical scratchpads with explicit promotion rules: turn-level → task-level → session-level memory, with each tier compressed separately.
- Failure-weighted retention. Failures get higher retention priority than successes, because successes are encoded in the final artifact; failures are not.
None of these are research breakthroughs. They are systems engineering applied to a problem that the field has been treating as a model-quality problem.
What this means for product
Our working hypothesis: the durable defensibility of an agent product is not the model behind it, nor even the tools it can call. It is the memory architecture — the data structures and update rules that shape what the agent remembers and what it forgets.
If that hypothesis is right, the agent that wins is not the one with the smartest model. It is the one with the most principled answer to what its working memory should look like at turn 1,000.
We are betting on that hypothesis with Mercury Agent. The work continues.
Written by
Cosmic Stack