Build the Improvement Loop Into the System
After this you can turn corrections into something the system keeps, instead of fixing the same mistake forever. You read what actually went wrong, route each lesson to the one place it belongs by how general and how durable it is, and make the fix change future behavior rather than just the output in front of you.
Fixing a mistake changes one output. Improving a system changes what it produces next time. Most people only ever do the first: the AI produces something wrong, you correct it, you move on, and next week the same wrong thing comes back because nothing about your correction outlived the moment. The output got better once. The system that produced it learned nothing. Building the improvement loop in means treating each correction as data the system retains, so the second time a class of mistake shows up it has already been designed out.
The starting point is unglamorous and most teams skip it: you have to actually look. Not at a dashboard of scores, at the real outputs, one after another, reading what the model did and writing down where it went wrong in plain language. The strongest articulation of this comes from the evaluation field, where the move is to read on the order of a hundred real traces, take open-ended notes on each failure, and then group those notes into recurring failure modes. The reliable finding when teams finally do it is that around eighty percent of their errors trace to a small handful of issues they had no idea existed, problems no generic metric was ever going to surface because nobody had looked closely enough to name them. A dashboard of generic scores can measure a great deal while explaining none of it. Reading the actual outputs is what tells you which failures are real and which one to fix first.
Once you know what is breaking, the lesson has to go somewhere it will actually change future runs, and this is where the real craft sits. Not every correction belongs in the same place, because corrections differ in how general they are and how long they stay true. A quirk of one specific tool, the way some API returns dates in an odd format, is true only for that tool and belongs right next to the code that talks to it. A gap in a particular reusable workflow belongs with that workflow, queued as a change to make. A genuinely general lesson, something that holds across tools and projects and would help even on work that never touches the thing that taught it, belongs in your durable cross-project knowledge. And a doc that has simply grown too long to read belongs in a trimming pass. Routing by reach and half-life, meaning how far a lesson applies and how long it stays true, is what keeps each store useful, because a lesson filed in the wrong place either clogs something that should stay lean or gets buried somewhere nobody rereads.
Two disciplines make the routing hold up over time. The first is a generalization gate you apply before promoting anything to durable knowledge: would another agent, or you on a completely different project, benefit from this even if they never touch the tool that produced it? If not, it stays local, because a durable store that fills with project-specific trivia stops being read, and an unread store improves nothing. The second is separating capture from application. The moment you notice a lesson is usually a bad moment to apply it well, because you are mid-task and low on attention, so the move is to log it cleanly now and apply it in a dedicated pass later when you can do it properly. The workspace these lessons come from runs exactly this split. Skill-specific lessons get logged to a queue and applied in a session whose only job is draining that queue. Cross-cutting lessons get written as durable memory entries with the reason and the how-to-apply attached. The architecture that actually enforces behavior lives in the workflows and checks, not in the memory, which holds only the general lessons. Memory reminds. It does not enforce.
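The capture/apply split described above can be sketched in a few lines. This is a minimal illustration, not the workspace's actual tooling: `Lesson`, `ChangeQueue`, `capture`, and `drain` are hypothetical names, the queue is held in memory for simplicity, and the second captured note is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    workflow: str  # hypothetical: which workflow/skill the lesson targets
    note: str      # plain-language description of the change to make


@dataclass
class ChangeQueue:
    """Hypothetical in-memory queue; a real one might be a file per workflow."""
    pending: list[Lesson] = field(default_factory=list)
    applied: list[Lesson] = field(default_factory=list)

    def capture(self, workflow: str, note: str) -> None:
        # Capture is deliberately cheap: record the lesson, go back to the task.
        self.pending.append(Lesson(workflow, note))

    def drain(self, apply) -> int:
        # Dedicated pass: apply every pending lesson in order, emptying the queue.
        count = 0
        while self.pending:
            lesson = self.pending.pop(0)
            apply(lesson)
            self.applied.append(lesson)
            count += 1
        return count


# Mid-task: log and move on.
queue = ChangeQueue()
queue.capture("drafting", "ban the opener 'in today's fast-paced world'")
queue.capture("drafting", "keep summaries under five sentences")  # made-up note

# Later, in a session whose only job is draining the queue.
instructions: dict[str, list[str]] = {}
drained = queue.drain(
    lambda lesson: instructions.setdefault(lesson.workflow, []).append(lesson.note)
)
```

Keeping `capture` trivial and putting all the real work in `drain` is the point of the design: a lesson noticed mid-task costs one line to record, and nothing counts as fixed until the drain pass has run.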
Where it breaks
The loop has one hard dependency that is easy to violate: the improvement signal has to come from outside the model. Asking a model to "check your work" with no new information is not a correctness signal. A model told to revise without an external check will often turn a right answer into a wrong one, because nothing independent is telling it which is which, and it cannot separate a real error from a fluent non-error by staring harder at its own output. The loop works when the signal is a real correction, a failing test, a human catch, a verifier the model cannot talk its way past. It does not work when the signal is the model grading itself, and a loop built on ungrounded self-reflection can degrade the thing it was meant to improve.
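One way to picture the dependency on an outside signal is a revision loop that only revises when an external check fails, and only with that check's feedback. The sketch below assumes a hypothetical `verify` callable (a test suite, a human catch, any independent verifier) and a hypothetical `revise` callable standing in for the model. `check_sum` is a toy stand-in verifier, not a real one.

```python
from typing import Callable


def improve(
    candidate: str,
    revise: Callable[[str, str], str],
    verify: Callable[[str], tuple[bool, str]],
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Revise only when an external check fails, using only its feedback.

    Returns the final candidate and whether the external check passed on it.
    """
    for _ in range(max_rounds):
        ok, feedback = verify(candidate)  # the signal comes from outside the model
        if ok:
            return candidate, True        # a passing answer is never revised
        candidate = revise(candidate, feedback)
    ok, _ = verify(candidate)             # check the last revision too
    return candidate, ok


def check_sum(answer: str) -> tuple[bool, str]:
    # Toy verifier for the question "2 + 2"; a real one might run a test suite.
    return answer.strip() == "4", "expected the sum of 2 and 2"


fixed, verified = improve("5", lambda answer, feedback: "4", check_sum)
```

Because `revise` is never called on a candidate that already passes, this loop has no path by which a right answer gets "improved" into a wrong one, which is the failure the paragraph above warns about.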
The other failures are about the bookkeeping rotting. Capture without application is the common one: logging corrections feels like progress, but a queue nobody drains is a graveyard of good intentions, and the mistakes keep happening while the lessons pile up unread. The application has to actually happen. Miscategorization is the slow one: send every lesson to a single store and it bloats past the point of being read, which silently breaks the whole loop because the store you stopped reading can no longer change anything. And there is a human failure that gets worse precisely as the system gets better. When outputs are usually right, reviewers stop genuinely reviewing and start rubber-stamping, so the rare wrong output sails through unchecked exactly because the system earned that trust. The defense is behavioral, not technical: review as if something is wrong, especially when you expect it to be fine.
Keep two things within reach. One is the routing rule, so a correction lands in the right store by reflex instead of in whatever file you happen to have open. The other is the error-analysis pass, for when the same kind of failure keeps recurring and you want its real cause instead of another symptom patch.
ROUTE A CORRECTION (ask in order, stop at the first yes):
1. True only for ONE specific tool/integration? → write it inline, next to that tool's code. Done.
2. Specific to ONE workflow/skill? → log it to that workflow's change-queue (apply later, not now).
3. General, helping even work that never touches the thing that taught it? → durable cross-project knowledge, WITH the reason + how to apply.
4. None of the above? → keep it local. Not every lesson generalizes, and that's fine.
Capture now, apply in a dedicated pass. Logging is not fixing.
ERROR-ANALYSIS PASS (when one kind of failure keeps recurring):
1. Collect ~20–100 real outputs of the failing task (more than feels necessary).
2. Read each. In plain language, note what went wrong. Don't categorize yet — just describe.
3. Now group the notes into recurring failure modes. Name each one.
4. Count: which mode accounts for the most failures?
5. Fix the biggest bucket at its source (the instruction/workflow, not this one output).
6. Re-run the task on fresh inputs. Did that bucket shrink? Repeat on the next-biggest.
The line that does the work in the first block is "apply in a dedicated pass." Capturing a lesson and applying it are different jobs done well at different times, and collapsing them is how good corrections end up logged and never used.
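Steps 3 and 4 of the error-analysis pass (group, then count) are simple enough to sketch. The notes below are constructed for illustration, not taken from a real log, and `rank_failure_modes` is a hypothetical helper. The reading and naming in steps 2 and 3 still have to be done by a person; the code only does the tally.

```python
from collections import Counter

# Constructed example: each open-ended note from step 2, paired with the
# failure-mode name it was grouped under in step 3.
labeled_notes = [
    ("generic opener", "draft starts with a stock phrase"),
    ("missing timezone", "event time shown without a zone"),
    ("generic opener", "first sentence could open any article"),
    ("wrong tone", "too formal for a chat reply"),
    ("generic opener", "opens with a cliche about the modern world"),
]


def rank_failure_modes(notes):
    """Return (mode, count, share of all failures), biggest bucket first."""
    counts = Counter(mode for mode, _ in notes)
    total = sum(counts.values())
    return [(mode, n, n / total) for mode, n in counts.most_common()]


ranked = rank_failure_modes(labeled_notes)
biggest_mode, biggest_count, biggest_share = ranked[0]
# In this constructed data, "generic opener" accounts for 3 of 5 notes,
# so it is the bucket to fix at its source first (step 5).
```

Re-running the same tally on fresh outputs after the fix is what step 6 asks for: the share for the bucket you fixed should drop, and the next-biggest mode becomes the new target.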
Worked example
Illustrative. A constructed pair to show routing, not a real log.
Two corrections come up in the same afternoon. The first: a draft keeps opening with "in today's fast-paced world." You could just delete it, again, as you have three times this month. Routed instead, you ask the questions in order. Is it specific to one tool? No. One workflow? It is specific to your drafting, so yes: it is logged to that workflow's change queue and, in the next dedicated pass, added to the instructions the workflow uses as a banned-phrase line. The next draft does not open that way, and you have stopped re-deleting it by hand.
The second: a particular calendar API returns event times without a timezone, which bit you once. Routed, the first question already answers it. This is true only for that one API, so the note lives in a comment right next to the code that calls it, and it never gets promoted to general knowledge because it would help nobody who is not calling that exact API.
The drafting fix changed every future draft. The API note changed nothing beyond warning the next person to touch that one integration. Both went where they did because the place was chosen by how far each lesson reaches: the banned phrase reaches every draft, so it belongs in the drafting instructions; the timezone quirk reaches one integration, so it belongs beside that integration. Ask how far a correction reaches and it tends to route itself.
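The two corrections in this example can be run through the routing questions as a small sketch. The field names and the `Destination` labels are hypothetical, chosen only to mirror the ordered questions. Answering each question is still a human judgment; the code only encodes "ask in order, stop at the first yes."

```python
from dataclasses import dataclass
from enum import Enum


class Destination(Enum):
    INLINE_NEXT_TO_TOOL = "comment beside that tool's code"
    WORKFLOW_CHANGE_QUEUE = "that workflow's change queue"
    DURABLE_KNOWLEDGE = "durable cross-project knowledge"
    KEEP_LOCAL = "keep it local"


@dataclass
class Correction:
    note: str
    one_tool_only: bool = False         # question 1
    one_workflow_only: bool = False     # question 2
    helps_unrelated_work: bool = False  # question 3, the generalization gate


def route(correction: Correction) -> Destination:
    # Ask in order, stop at the first yes.
    if correction.one_tool_only:
        return Destination.INLINE_NEXT_TO_TOOL
    if correction.one_workflow_only:
        return Destination.WORKFLOW_CHANGE_QUEUE
    if correction.helps_unrelated_work:
        return Destination.DURABLE_KNOWLEDGE
    return Destination.KEEP_LOCAL


opener = Correction(
    "drafts open with 'in today's fast-paced world'",
    one_workflow_only=True,
)
timezone = Correction(
    "calendar API returns event times without a timezone",
    one_tool_only=True,
)

opener_destination = route(opener)      # Destination.WORKFLOW_CHANGE_QUEUE
timezone_destination = route(timezone)  # Destination.INLINE_NEXT_TO_TOOL
```

The ordering matters: a correction that is both tool-specific and workflow-specific stops at the first question and stays beside the tool, which keeps the narrowest lessons out of the broader stores.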