Coding With AI Agents: Stay the Verifier-in-Chief

Advanced

After this you can hand real coding work to an agent and keep the leverage, because you have a way to verify what comes back. You scope the work so it lands in reviewable pieces, you read every diff, and you treat the moment you can no longer follow what the agent wrote as a stop signal, not a speed bump.

Understand

The shift that defines this skill is not "the AI writes code now." It is what your job becomes once it does. A coding agent is a loop with hands: you give it a goal, it reads the relevant files, edits across however many it decides it needs, runs the tests, reads the output, and iterates against the result until it thinks it is done. You are no longer typing alongside it. The mental model that practitioners converge on after living with this is the one Nicholas Khami names after six months of daily use: the move is from "pair programming with an assistant" to "delegating to a junior engineer." You are not in the loop on every keystroke. You are reviewing the work after the fact, and almost everything that goes wrong with coding agents is a failure to make that shift well.

The coding-agent loop: the agent runs its own observe-edit-test-read cycle to "done," and your review gate sits after that loop closes, not inside it, which is exactly why the review has to be real.

Why the terminal, and not the editor you already have an AI bolted into? Because the terminal is the universal interface to the environment, and the agent can reach all of it: Git for cheap rollback, the test runner to check its own work, the package manager, the linter, the logs, anything you can invoke from a shell. Charlie Marsh's framing is the sharp version: an agent in the terminal can do everything a developer can do, not just edit files, while an IDE is fundamentally a single-file, human-speed, human-in-the-loop interface, and bolting an agent onto it constrains the agent to a human's interface. The composability is what makes the delegation safe. An agent that can run the tests is an agent that can tell when its own edit broke something, which is the difference between handing back working code and handing back a guess. None of this removes the IDE from your workflow. The honest practitioner pattern is multi-surface within one task: the terminal agent for delegation and bulk work, the editor for reading, reviewing, and the fiddly taste-driven changes you can do faster by hand than you can describe.

The leverage is real, and it is conditional. It holds for exactly as long as you can verify what the agent produced, and it inverts the moment you can't. This is the 70/30 problem, named precisely by Addy Osmani. The agent gets you roughly the first 70% of the way fast, the scaffolding and the obvious structure and the happy path, and the final 30% is where it falls apart. That last 30% is the understanding, the debugging, and the judgment about whether the thing is actually right, and it is exactly the part a novice cannot supply. So the same tool is an accelerant for someone who already knows what good looks like and a trap for someone who doesn't. A senior uses the agent to move faster on work they could already do, which is precisely why they spot the subtle wrongness when it appears. A junior accepts plausible output because they cannot evaluate what they could not yet have produced themselves. The tool does not close the skill gap. It widens the consequences of it.

The 70/30 split: the agent covers the first 70% fast and cheap, then the curve turns vertical. The final 30% (understanding, debugging, judgment) is slow human work, and skipping it is where comprehension debt accrues.

The danger is not the obviously broken output. You catch that one. It fails to compile, the test goes red, you throw it away. The dangerous output is the one that compiles, runs, passes the happy-path test, and is subtly wrong somewhere you didn't look: an edge case, a security boundary, a maintainability landmine. Plausibility is the trap. "This looks right" should not relax you on load-bearing code. It should make you verify harder, because looking right is the exact disguise the costly failures wear. Pete Hodgson's worst case is the one to keep in mind when you decide how much to delegate at once: an agent confidently making a change that is subtly wrong across twelve files, and you spending more time reviewing and fixing it than you would have spent writing it yourself. That is the point where the leverage has not just disappeared but inverted.

Where it breaks

The failure that compounds quietly is comprehension debt, named by Bartek Kus. When the agent generates faster than you can internalize what it wrote, your verification degrades without you noticing. It slides from genuine review, where you mentally simulate the code and would have written something like it yourself, into pattern-matching on whether the output looks plausible. You become a reviewer of code you don't deeply understand, which Kus calls the worst position to verify from. It accrues silently because nothing fails today. It surfaces later as bugs you can't explain in a codebase you can't reason about, and by then the debt is spread across months of accepted diffs. This generalizes past code. Any output you accept but couldn't reconstruct yourself, a data pipeline or a config or a migration, is borrowed understanding that your future self pays back with interest. The deliverable on anything load-bearing is your understanding of the change, not the change passing its tests.

Two related traps sit alongside it. The first is rubber-stamp review. When the agent is right most of the time, attention decays. You stop genuinely reading the diff and start skimming for whether it looks fine, and the rare wrong change sails through precisely because the agent is usually right and you've stopped looking hard. The defense is behavioral, not a tool: assume something is wrong on every load-bearing review, and read the parts that matter most carefully rather than skimming everything evenly. The second is delegating work that shouldn't be delegated. Poorly-understood problems, subtle taste-driven judgment, and changes whose blast radius is too large to review meaningfully are exactly where the agent's confidence outruns its correctness and your review cost explodes. The operator move is to reason about the failure distribution, not the best case: match how much you delegate to how reviewable and recoverable the failure would be, and keep the hard-to-specify, high-blast-radius work closer to hand. The agent is for the work you can verify, not the work you wish you didn't have to do.

There is a boundary on the discipline itself. This is advice for load-bearing code that ships and that others depend on. For a throwaway script you'll run once, a prototype you'll delete, or a sketch to see if an approach is even viable, the cost of being wrong is near zero, and reading every line is over-verification. The skill is not maximal paranoia. It is matching verification effort to what a wrong answer would actually cost.

Do it now

The leverage holds when verification keeps pace with generation. The way you guarantee that is to never let the agent generate more than you can review in one sitting, and to make every step independently checkable. Paste this loop and run it for any non-trivial change:

CODING-AGENT LOOP — keep yourself the verifier

1. PLAN FIRST. Before it touches code, have the agent state its approach:
   which files it will change, what could break, how it'll prove it works.
   Read this. A wrong plan is cheap to fix now and expensive to fix in a diff.

2. SCOPE SMALL. One reviewable change per run. If the plan spans many files
   or several unrelated concerns, split it. You are sizing the work to your
   review capacity, not the agent's generation capacity.

3. DEMAND THE PROOF. Make "done" mean tests run and pass, not "looks complete."
   The agent runs the tests and shows you the output. No green, not done.

4. READ EVERY DIFF. Line by line, before you accept it. The moment you cannot
   explain WHY a change is correct, stop — that is comprehension debt arriving.
   Ask the agent to walk you through it, or do that part yourself.

5. COMMIT FOR ROLLBACK. Commit each accepted change on its own. Cheap rollback
   is what lets you delegate boldly; a clean revert point is the safety net.

STOP CONDITIONS — pause and take the work back when:
 - you're accepting diffs you don't fully understand (debt accruing)
 - review-and-fix is taking longer than writing it yourself would have
 - the change is taste-driven or its blast radius is too large to review

The plan-first step is the cheapest insurance in the loop. Vague instructions produce vague results, amplified across every file the agent touches, so the spec-writing burden shifts onto you the moment you raise the autonomy. Catching a wrong approach in a three-line plan costs you a sentence. Catching it in a twelve-file diff costs you an afternoon. The read-every-diff step is the one people quietly drop first, and it's the one the whole thing rests on. The agent will confidently do the wrong thing sometimes, and the diff is the only place you'll see it before production does.
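One way to make "scope small" mechanical is to measure the pending diff before you start reading it. The sketch below parses the output of `git diff --numstat` (tab-separated added lines, deleted lines, and path, with `-` in the count columns for binary files) and compares the totals with a review budget. The names `DiffSize`, `parse_numstat`, `within_review_budget`, and `current_diff_size`, and the default thresholds of 5 files and 300 lines, are illustrative assumptions rather than recommended values; set them to what you can actually read in one sitting.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class DiffSize:
    files: int
    lines: int


def parse_numstat(numstat: str) -> DiffSize:
    """Total the files and changed lines in `git diff --numstat` output."""
    files = 0
    lines = 0
    for row in numstat.splitlines():
        parts = row.split("\t", 2)
        if len(parts) != 3:
            continue  # skip blank or unexpected rows
        added, deleted, _path = parts
        files += 1
        # Binary files show "-" instead of counts: count the file, not lines.
        if added.isdigit():
            lines += int(added)
        if deleted.isdigit():
            lines += int(deleted)
    return DiffSize(files=files, lines=lines)


def within_review_budget(size: DiffSize, max_files: int = 5, max_lines: int = 300) -> bool:
    """Hypothetical thresholds; tune them to your own review capacity."""
    return size.files <= max_files and size.lines <= max_lines


def current_diff_size() -> DiffSize:
    """Measure the uncommitted working-tree diff (run inside a Git repository)."""
    result = subprocess.run(
        ["git", "diff", "--numstat"], capture_output=True, text=True, check=True
    )
    return parse_numstat(result.stdout)
```

Calling `within_review_budget(current_diff_size())` after an agent run gives a quick signal: if it returns `False`, that is a prompt to split the task before reviewing, not a substitute for reading the diff.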

For work that repeats against the same codebase, lift the loop out of your head and into a project-instructions file the agent reads on every run (a CLAUDE.md or AGENTS.md at the repo root). Put in it the things you'd otherwise re-explain every session:

# Project instructions for the coding agent

## How to verify your work
- Run `<the test command>` before declaring any task done. Show the output.
- Run `<the linter / typecheck command>` and fix what it flags.
- A change is not done until tests are green AND I can review it.

## How to scope
- One logical change per run. If a task spans unrelated concerns, stop and
  ask me to split it rather than doing all of it at once.
- Make a short plan before editing. Wait for my go-ahead on non-trivial work.

## What this codebase assumes
- <conventions a newcomer would get wrong: error handling, naming, structure>
- <the load-bearing invariants that must not break>

This is the durable-context version of the same discipline: it makes verification cheap by making the agent's output predictable, and it stops you re-typing your standards into the prompt every time.
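The "tests are green" rule in that file can also be enforced by a small wrapper instead of trusted from the agent's summary. The following is a minimal sketch, assuming each verification step is a command that exits with code 0 on success. `run_check`, `is_done`, and `CheckResult` are hypothetical names, and the demo commands are stand-ins that invoke the Python interpreter so the sketch runs anywhere; your real test and lint commands would take their place.

```python
import subprocess
import sys
from typing import NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool
    output: str


def run_check(name: str, command: list[str]) -> CheckResult:
    """Run one command; it passes only if the exit code is 0."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


def is_done(results: list[CheckResult]) -> bool:
    """'Done' means at least one check ran and every check passed."""
    return bool(results) and all(r.passed for r in results)


# Stand-in commands for illustration only: one that succeeds, one that fails.
demo = [
    run_check("tests", [sys.executable, "-c", "print('stand-in test run')"]),
    run_check("lint", [sys.executable, "-c", "import sys; sys.exit(1)"]),
]
for result in demo:
    print(f"{result.name}: {'green' if result.passed else 'red'}")
print("done" if is_done(demo) else "not done")
```

Treating an empty result list as "not done" is a deliberate choice here: a run where no checks executed should not count as proof.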

Worked example

Illustrative. A constructed sequence to show the discipline, not a real session.

A useful operator example of structuring this well is how a phased build harness keeps verification in pace with generation across a larger piece of work. Rather than handing an agent one giant "build the feature" goal and reviewing a sprawling diff at the end, the work moves through ordered phases. The harness scaffolds, then specifies, then plans, then executes and reviews in tight alternation, then runs a quality pass, then a security pass, then ships, and each transition is a gate the work has to clear before the next phase opens. The point is not the phase names. It is that each gate is a verification checkpoint sized to a human's review capacity, so the agent is never generating faster than someone can check, and a wrong turn is caught at the gate it happened at instead of compounding through three more phases. That is the same loop above, scaled up: small reviewable steps, a proof of correctness at each, and a stop point built into the structure rather than left to willpower.
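The gating structure described above can be sketched in a few lines. This is an illustration under stated assumptions, not any particular harness's implementation: `PHASES` mirrors the phase order in the paragraph, and `gate` is a hypothetical callable standing in for whatever review or check decides that a phase has cleared.

```python
from typing import Callable, Optional

PHASES = [
    "scaffold",
    "specify",
    "plan",
    "execute-and-review",
    "quality",
    "security",
    "ship",
]


def run_phases(phases: list[str], gate: Callable[[str], bool]) -> Optional[str]:
    """Advance in order; return the phase whose gate failed, or None if all cleared."""
    for phase in phases:
        if not gate(phase):
            return phase  # stop here: later phases never open
    return None
```

The design choice worth noticing is that the stop is structural. A failed gate returns immediately, so nothing downstream can build on an unverified phase.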

Here is the contrast at the single-change scale. The trap is the one-shot delegation:

You: Refactor the auth module to use the new token format. Make sure everything still works.

Agent: Done. I updated the token parser, the middleware, the session store, the login handler, and the three call sites. All tests pass.

You: (skims the green checkmark, accepts the diff across five files)

The tests are green and it looks complete, so it gets rubber-stamped. Two weeks later a token from the old format hits a code path the happy-path tests never exercised, and the failure is now spread across five files you never actually read. The disciplined version sizes the work to your review and demands the proof:

You: Plan a migration to the new token format. List the files, what could break, and how you'll prove the old format still round-trips. Don't edit yet.

Agent: Plan: change the parser first, behind a format-detecting branch. Files: parser, then middleware, then the three call sites — separately. Risk: old tokens in active sessions. I'll add a test that an old-format token still validates before I touch anything else.

You: Good, but write the old-format test first and show it passing against current code, so we know it's a real test and not one tuned to your change. Then do the parser only.

Agent: (adds the test, shows it green against current code, then makes the parser change — one reviewable diff)

Pair-programming vs delegate-then-review: the attention shift. You move from in-the-loop on every keystroke to reviewing completed work, which only stays safe when the review is real and the diffs are small enough to actually read.

Same task, same agent. The second works because each diff is small enough to actually read, the proof is a test written before the change rather than tuned to pass it, and there is an explicit point where not understanding the diff stops the line instead of getting waved through. The leverage came not from the agent being smarter but from never letting it outrun your ability to check it.
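To make the disciplined version concrete, here is a minimal sketch of the old-format test and the format-detecting parser branch from the plan. Both token formats are invented for illustration (`user|expires` for the old one, `v2:user:expires` for the new one), as are the names `Token`, `parse_old`, `parse_new`, and `parse_token`. Nothing here is a real auth scheme, and signature checking is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    user: str
    expires: int


def parse_old(raw: str) -> Token:
    # Hypothetical legacy format: "user|expires"
    user, expires = raw.split("|")
    return Token(user, int(expires))


def parse_new(raw: str) -> Token:
    # Hypothetical new format: "v2:user:expires"
    _prefix, user, expires = raw.split(":")
    return Token(user, int(expires))


def parse_token(raw: str) -> Token:
    """Format-detecting branch: new tokens carry a version prefix, old ones do not."""
    if raw.startswith("v2:"):
        return parse_new(raw)
    return parse_old(raw)


def test_old_format_still_validates() -> None:
    # Written first and shown green against the existing parser (parse_old),
    # so it pins current behavior instead of being tuned to the change.
    expected = Token("alice", 1700000000)
    assert parse_old("alice|1700000000") == expected
    # After the parser change, the same token must give the same result.
    assert parse_token("alice|1700000000") == expected


def test_new_format_parses() -> None:
    assert parse_token("v2:bob:1800000000") == Token("bob", 1800000000)
```

The ordering is the point: the first assertion in `test_old_format_still_validates` can pass before `parse_token` exists, which is what shows it is a real test of current behavior.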