The first failure usually is not the expensive one.
The expensive part is what happens after the first failure, when the system keeps trying, keeps spending, and keeps producing the same outcome because nothing about the situation has changed.
We kept running into a simple pattern: the agent would miss a step, the runtime would retry, the next attempt would see the same state, and the loop would repeat until the cost was visible in the bill or the operator log. At that point the problem stops being a model-quality issue and becomes a control-system issue.
Why the loop hurts more than the mistake
A single bad step is recoverable. An unbounded retry loop compounds the mistake.
That is true for token spend, API calls, and operator attention. It is also true for trust. Once a system gets a reputation for wandering, people stop letting it touch real work.
The failure mode is boring, which is why it gets missed. Nobody looks at a happy-path demo and thinks about what happens after the third identical error. But that is where the real cost lives.
What we tried first
The obvious moves are usually the wrong ones:
- make the prompt longer
- add a generic retry
- increase the timeout
- let the model reason more
- rerun the same command with slightly different wording
Those changes can make a demo look better, but they do not fix a stuck loop.
If the environment is unchanged, a retry is often just a second copy of the same mistake.
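One way to make "the environment is unchanged" checkable is to fingerprint whatever state the agent can observe before and after an attempt. The sketch below is illustrative only: `fingerprint` and `retry_is_justified` are hypothetical names, and it assumes the observable state can be captured as a JSON-serializable dict.

```python
import hashlib
import json


def fingerprint(state: dict) -> str:
    """Hash a snapshot of the observable environment.

    Assumption: `state` is a dict of JSON-serializable values. Keys are
    sorted so that ordering differences do not look like a change.
    """
    canonical = json.dumps(state, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def retry_is_justified(before: dict, after: dict) -> bool:
    """Return True only if something observable changed between attempts."""
    return fingerprint(before) != fingerprint(after)
```

If the fingerprints match, the next attempt would run against the same inputs, which is the case where a retry is most likely to repeat the mistake.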
What actually worked
The fix was not smarter language. It was stricter boundaries.
We had to make the runtime answer four questions before it kept going:
- What is the budget?
- What counts as success?
- What is the verifier?
- What happens when the same failure repeats?
A small policy block is often enough to make that concrete:
{
  "budget_cap": 250,
  "max_attempts": 3,
  "stop_on_same_error": true,
  "require_verifier": true,
  "emit_receipt": true
}
That does not sound ambitious. That is the point.
The biggest reliability gain came from refusing to treat repeated failure as progress. Once the runtime could detect the same blocker two or three times in a row, it had permission to stop instead of pretending the next rerun would somehow be different.
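To show how a policy like the one above might be enforced, here is a minimal sketch of a bounded attempt loop. It is not MartinLoop's implementation: `AttemptResult`, `run_bounded`, and the `attempt` and `verify` callables are hypothetical, the budget is treated as a unitless number, and `emit_receipt` is not modeled beyond returning a summary dict.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AttemptResult:
    cost: float            # spend for this attempt, in budget units
    error: Optional[str]   # None when the step reported no error


def run_bounded(policy: dict,
                attempt: Callable[[], AttemptResult],
                verify: Callable[[], bool]) -> dict:
    """Run `attempt` until success, a repeated error, or a policy limit."""
    spent = 0.0            # cumulative across attempts, not per attempt
    tries = 0
    last_error: Optional[str] = None
    stop_reason = "max_attempts"

    while tries < policy["max_attempts"]:
        # Checked before each attempt, so spend can overshoot by one attempt.
        if spent >= policy["budget_cap"]:
            stop_reason = "budget_exhausted"
            break

        result = attempt()
        tries += 1
        spent += result.cost

        if result.error is None:
            # A clean step only counts as success if the verifier agrees.
            if not policy["require_verifier"] or verify():
                stop_reason = "success"
                last_error = None
                break
            error = "verifier_rejected"
        else:
            error = result.error

        # The same blocker twice in a row is a signal to stop, not to rerun.
        if policy["stop_on_same_error"] and error == last_error:
            stop_reason = "same_error_repeated"
            break
        last_error = error

    return {"stop_reason": stop_reason, "attempts": tries,
            "spent": spent, "last_error": last_error}
```

With the policy values shown earlier and an attempt that fails with the same error every time, this sketch stops after the second attempt instead of using all three.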
Why receipts matter
Receipts turn a run from a vague story into a checkable fact.
A receipt should show:
- what the agent tried
- what changed
- what failed
- why the run stopped
Without that, a loop can hide inside a confident-sounding summary. With it, you can see the exact stopping point and decide whether the next action should be a human intervention, a different tool, or no action at all.
That is also why this kind of work ends up feeling less like prompt engineering and more like operations.
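A receipt with the four fields listed above could be as small as the following sketch. The `Receipt` class and its field names are assumptions made for illustration, not a format the runtime is known to use.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Receipt:
    tried: List[str] = field(default_factory=list)    # what the agent tried
    changed: List[str] = field(default_factory=list)  # what changed
    failed: List[str] = field(default_factory=list)   # what failed
    stop_reason: str = ""                             # why the run stopped

    def to_json(self) -> str:
        """Serialize to stable JSON so receipts can be diffed and archived."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical run: one action attempted twice, nothing changed, then a stop.
receipt = Receipt(
    tried=["submit_form", "submit_form"],
    changed=[],
    failed=["missing form field", "missing form field"],
    stop_reason="same_error_repeated",
)
print(receipt.to_json())
```

An empty `changed` list next to a repeated entry in `failed` is the kind of detail a prose summary tends to smooth over.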
The tradeoff
Stricter control means the system stops earlier.
That can feel annoying when you want the agent to push through friction. But earlier stopping is cheaper than a long blind retry sequence. More importantly, it preserves operator trust.
A bounded agent is less flashy than an agent that never gives up. It is also much more usable.
That is the core of the control-layer approach we keep coming back to in MartinLoop: the runtime should know when to stop, when to ask for help, and when to write down what happened.
What we are watching next
The next improvement is not more retries.
It is better failure classification so the runtime can separate:
- missing permission
- stale state
- tool mismatch
- external outage
- real task completion
When those are distinct, the system can choose a better next step instead of recycling the same command.
That is the line between an agent that looks autonomous and an agent that is actually operable.
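The categories above can be sketched as an enum with one next step per class. The mapping is an assumption for illustration: the right action for each class depends on the system, and `FailureClass`, `NEXT_STEP`, and `next_step` are hypothetical names.

```python
from enum import Enum


class FailureClass(Enum):
    MISSING_PERMISSION = "missing_permission"
    STALE_STATE = "stale_state"
    TOOL_MISMATCH = "tool_mismatch"
    EXTERNAL_OUTAGE = "external_outage"
    TASK_COMPLETE = "task_complete"


# Assumed policy: each class gets its own recovery path, never a blind rerun.
NEXT_STEP = {
    FailureClass.MISSING_PERMISSION: "escalate_to_human",
    FailureClass.STALE_STATE: "refresh_state_then_retry",
    FailureClass.TOOL_MISMATCH: "switch_tool",
    FailureClass.EXTERNAL_OUTAGE: "back_off_and_wait",
    FailureClass.TASK_COMPLETE: "stop",
}


def next_step(failure: FailureClass) -> str:
    """Look up the recovery action for a classified outcome."""
    return NEXT_STEP[failure]
```

Treating completion as its own class matters: a runtime that cannot recognize a finished task will keep retrying it.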
What failure shape are you still letting your runtime retry too many times?
Top comments (2)
This resonates deeply. We kept hitting the same failure mode with our customer automation agent: it would miss a form field, the runtime would retry, same state, same miss, until the token bill shouted at us.
The policy block approach is exactly what fixed it for us too. But I'd add a fifth question to your list: "Has the environment changed since the last attempt?" If nothing changed, retrying is just betting money on a different outcome from the same inputs. We also found that grouping failures by type (permission vs. stale state vs. tool mismatch) helped the runtime pick a different recovery path instead of just re-running.
One thing we added: a "budget per task" counter that accumulates across retries, not just per-attempt. That way even if individual attempts are cheap, the cumulative spend is always visible. The agent can see "I've spent ¥0.3 on this task and made zero progress" — that's often the signal to escalate instead of retry.
This is a very real cost center. The first failed attempt is information; the fifth identical retry is just spend with confidence. The missing primitive in many agent systems is a check for whether state has changed enough to justify another try. Without that, retries become a tax on ambiguity.