Designing the Autonomy Leash
After this you can hand a multi-step task to an agent and bound it, so a long unattended run cannot quietly drift into expensive or irreversible mistakes. You decide where the loop stops, what it is allowed to do without you, and which steps it has to prove before moving on.
The whole pitch for an agent is that it runs on its own. You give it a goal and some tools, and it loops: act, observe, decide, act again, until it judges the work done. The trouble is that every extra turn of that loop is another place the run can go wrong, and the errors do not add up, they multiply. If each step is right 95% of the time and the steps fail independently, two steps are right about 90% of the time, ten steps about 60%, and twenty steps about 36%, close to one in three. A loop that looks reliable on any single action is fragile across a long horizon, and that fragility is the reason "more autonomous" is not the same as "better."
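The arithmetic behind those figures is easy to reproduce. The sketch below assumes every step succeeds with the same probability and that failures are independent, which is a simplification; `chain_success` is a name made up for this illustration.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step succeeds independently with the same probability.
    """
    return per_step ** steps


# Print the compounded success rate for a few horizon lengths.
for n in (1, 2, 10, 20):
    print(f"{n:>2} steps at 95% each: {chain_success(0.95, n):.0%} overall")
```

Changing 0.95 to your own per-step estimate shows how quickly a long horizon erodes a rate that looks safe on a single action.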
Four constraints do most of the work of keeping that loop honest, and together they are the leash. A hard iteration cap, so the loop physically cannot run forever. A budget on tokens, cost, or time, so it cannot quietly spend the afternoon. Loop-detection, so an agent that keeps re-running the same search or oscillating between two actions gets caught instead of grinding. And verifiable sub-goals, which are checkpoints where the agent has to surface a result that you or a cheap deterministic check can confirm before it earns the next step. The iteration cap is the non-negotiable one. An agent that never quite decides it is done, re-reading the same file or almost-finishing forever, is the most common way an unbounded loop burns a whole budget with nothing to show, and the cap is what turns "ran away overnight" into "stopped at twenty steps and asked."
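The four constraints can be sketched as a wrapper around the loop. This is a minimal illustration, not a real agent framework: `run_leashed`, the `step` and `verify` callbacks, the default limits, and the stop-reason strings are all hypothetical, and the repeat check is a crude heuristic that trips when the last few actions contain very few distinct values.

```python
import time
from collections import deque
from typing import Callable, Tuple

# Hypothetical step contract: returns (action_label, cost_spent, done).
Step = Callable[[], Tuple[str, float, bool]]


def run_leashed(
    step: Step,
    verify: Callable[[str], bool] = lambda action: True,  # cheap deterministic check
    max_steps: int = 20,
    max_cost: float = 5.0,
    max_seconds: float = 600.0,
    repeat_window: int = 4,
) -> str:
    """Run `step` until it reports done or a bound trips; return the stop reason."""
    recent = deque(maxlen=repeat_window)
    spent = 0.0
    started = time.monotonic()
    for _ in range(max_steps):  # hard iteration cap: the loop cannot run forever
        action, cost, done = step()
        spent += cost
        if not verify(action):  # verifiable sub-goal: stop rather than push on
            return "checkpoint failed"
        if done:
            return "done"
        if spent > max_cost:  # budget on cost
            return "budget exceeded"
        if time.monotonic() - started > max_seconds:  # budget on wall-clock time
            return "time exceeded"
        recent.append(action)
        # Loop guard: a full window with at most half as many distinct actions
        # catches both A,A,A,A and A,B,A,B.
        if len(recent) == repeat_window and len(set(recent)) <= repeat_window // 2:
            return "loop detected"
    return "step cap reached"
```

Every exit path returns a reason, so the caller can hand the run back to a human with an explanation instead of a silent stop.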
Harder than deciding how long to let the loop run is deciding where to put yourself inside it. The naive instinct is to gate on the model's confidence and stop to ask whenever it seems unsure. That is the wrong axis. Confidence is not correctness, and the actions that actually hurt are not the uncertain ones, they are the irreversible ones. Gate on irreversibility and blast radius instead. A reversible, low-stakes action should run unattended even when the model is unsure, because the cost of a wrong call is a cheap undo. An irreversible, high-stakes action, like sending the email or running the migration or charging the card or posting in public, should gate even when the model is completely confident, because you cannot un-send it and the confidence buys you nothing the moment it turns out wrong.
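A gate built on that axis might look like the sketch below. The `Action` fields and the "low"/"high" labels are assumptions made for the example; the point is that `confidence` is carried along but never consulted by the gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    blast_radius: str  # "low" or "high" (hypothetical labels)
    confidence: float  # model's self-reported confidence; ignored by the gate


def needs_approval(action: Action) -> bool:
    """Gate on what the action can break, not on how sure the model is."""
    return (not action.reversible) or action.blast_radius == "high"


# An unsure draft runs unattended; a confident send still waits for a human.
draft = Action("draft_email", reversible=True, blast_radius="low", confidence=0.30)
send = Action("send_email", reversible=False, blast_radius="high", confidence=0.99)
print(needs_approval(draft), needs_approval(send))
```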
One more move cuts a whole class of quiet damage. Keep the model's decision separate from the action it triggers. The model call that decides what to do is non-deterministic, so retrying it gives you a different plan. The side effect it triggers is the part you cannot take back. Keep them apart. Record the decision so a retry replays the same choice instead of rolling a fresh one, and make the action itself safe to repeat. A loop that blurs the two will, on a retry after a half-finished step, happily send the same email twice or write a record that contradicts what already landed.
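One way to sketch that separation is to record the decision under a step ID and key the side effect on the recorded decision. Everything here is illustrative: `decide_once`, `act_once`, and the in-memory `decision_log` and `completed` stores are hypothetical stand-ins for durable storage.

```python
import hashlib
import json
from typing import Callable, Dict, Set

decision_log: Dict[str, dict] = {}  # step_id -> recorded decision
completed: Set[str] = set()         # idempotency keys of effects that already ran


def decide_once(step_id: str, ask_model: Callable[[], dict]) -> dict:
    """Call the model at most once per step; a retry replays the stored decision."""
    if step_id not in decision_log:
        decision_log[step_id] = ask_model()
    return decision_log[step_id]


def act_once(step_id: str, decision: dict, effect: Callable[[dict], None]) -> bool:
    """Run the side effect unless this exact decision already landed.

    Returns True if the effect ran, False if it was skipped as a repeat.
    """
    payload = json.dumps(decision, sort_keys=True)
    key = hashlib.sha256(f"{step_id}:{payload}".encode()).hexdigest()
    if key in completed:
        return False
    effect(decision)
    completed.add(key)
    return True
```

A crash between `effect(decision)` and `completed.add(key)` could still repeat the action in this sketch. A real system would likely need the receiving side to honor the same key, or a durable record written atomically with the effect.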
Where it breaks
The two failures sit on opposite ends. Over-delegation is handing the loop a task it cannot verify and a tool it cannot undo. The vivid version comes from outbound sales. Point an AI agent at a contact list with no human gate on the send, and a sending domain can land in spam-folder purgatory within two to three weeks. One team's recovery took 62 days across seven domains at a self-imposed 26-emails-per-day cap just to claw a reply rate from 3% back to 6%. The send was irreversible, the reputation damage outlived the campaign by months, and no amount of model cleverness upstream could buy it back. The opposite failure is over-leashing. Gate every step and you have built a slow manual process with AI-flavored extra clicks. An approval queue that interrupts forty times in a run trains the human to approve without reading, which is worse than no gate at all because it wears the costume of oversight while delivering none. The leash is for the actions that cannot be undone, not for every action.
Before you set any agent loose on a multi-step task, paste this and fill it in. It forces the four bounds and the one gate that actually matter:
Goal: <the one outcome, stated so the agent can tell when it's met>
Stop when: <the explicit done-condition the agent checks each loop>
Hard caps: max <N> steps · max <$ or tokens> · max <minutes>
Loop guard: stop if the last few actions just repeat one another
Verify before continuing: <the sub-goal checkpoints a cheap check can confirm>
Runs without me (reversible, low blast radius): <read, search, draft, summarize, ...>
Needs my approval (irreversible or high blast radius): <send, delete, deploy, pay, post>
If you hit a cap or can't verify a checkpoint, stop and hand back. Do not push on.
The last line is the load-bearing one. A loop with no floor under it will spend the entire budget circling a step it cannot finish. The caps put that floor in, and the approval split makes sure the one action with a real blast radius cannot fire while you are asleep.
Worked example
Illustrative. A constructed run to show the leash, not a real session.
You ask an agent to research 50 prospects and reach out. With no leash, it does the whole thing in one breath:
Agent: Found 50 contacts. Drafting and sending… 50 emails sent.
It felt like leverage right up until the bounces came back. A third of the addresses were stale, the bounce rate crossed the line that flags a domain as a spammer, and now every email from the company, not just the agent's, starts landing in junk. The send was the one step that could not be undone, and it ran with no gate in front of it. Leashed, the same task stops exactly where it should:
Agent: Found 50 contacts. 38 verified deliverable, 12 unverifiable and held back. 38 drafts ready. These need your approval before send — the send is irreversible and affects domain reputation. Approve all, or review individually?
The leashed run is no more capable than the loose one. The only difference is that it cannot reach the one irreversible action without a human in the path, and it would not treat twelve guessed addresses as real. A leash does not make an agent smarter. It puts a floor under how wrong an unattended loop can get before anyone notices.
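The leashed run above can be sketched as a function that prepares everything and sends nothing. The names `prepare_outreach` and `is_deliverable` are hypothetical, and the deliverability check is passed in as a callback because the example does not say how addresses were verified.

```python
from typing import Callable, Dict, List


def prepare_outreach(
    contacts: List[str], is_deliverable: Callable[[str], bool]
) -> Dict[str, object]:
    """Split contacts into drafts awaiting approval and addresses held back.

    Nothing is sent here: the irreversible step stays behind a human gate.
    """
    ready: List[str] = []
    held: List[str] = []
    for contact in contacts:
        (ready if is_deliverable(contact) else held).append(contact)
    return {
        "ready_for_approval": ready,
        "held_back": held,
        "sent": 0,  # sending happens only after explicit approval
    }
```

The function has no send path at all, so no amount of model confidence upstream can reach the irreversible action from inside the loop.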