Context Rot in Long Conversations

Intermediate

After this you can tell when a long thread has rotted, name which of four ways it went wrong, and repair it by removing the poisoned tokens instead of appending a correction the model will ignore.

Understand

Curating the window keeps a fresh task sharp. This lesson is about the opposite situation: a session that has already run long and gone wrong. You curated well, the session ran long, and somewhere in the last forty turns a wrong fact, a stale assumption, or an abandoned approach got written into the context. The model is now anchoring on it, and your instinct is to type the fix. Appending the fix is the natural move, and it does not work, because the wrong fact stays in the window sitting next to your correction.

Context rot is the name for what is happening underneath. A model does not read a long input with even attention, and across 18 leading models the Chroma Context Rot study found that accuracy degrades non-uniformly as the input grows, even on trivially easy tasks. The degradation is not "the relevant fact fell out." A separate EMNLP 2025 result showed drops of 13.9 to 85 percent as input length grew even when the irrelevant tokens were replaced with whitespace and the model was forced to attend only to the relevant span. Length itself is the tax. This kills the comforting assumption that if the information is still in the window, the model will use it.

As the previous lesson noted, the usable window is much smaller than the advertised one, often reliable only to 32K or 64K of a window sold as 200K or a million. What rot adds is the shape of the fall. One study watched Qwen2.5-7B drop rather than slope, its F1 falling from 0.55 to 0.30, a 45 percent collapse, at roughly 40 to 50 percent of its claimed maximum. The same Chroma context-rot study found the decline is non-uniform, with cliffs rather than a smooth ramp: a session that is fine at 50K tokens can turn noticeably worse by 80K. The operator habit is to read that drop as rot, not as the task suddenly getting harder.

Effective vs advertised windowperformance holding then collapsing at a cliff well before the advertised window fills, with the reliable-precision zone and the rot soft-limit marked.

The reason the append-a-correction instinct fails sits one level deeper. When you write "actually that was wrong, the deadline is Friday," both the original wrong deadline and your correction are now in the window at once. The model keeps attending to the poisoned tokens because they are still there, still phrased as fact, often still sitting earlier in the context where the model anchored. You are not overwriting the poison. You are sitting it next to its own contradiction and hoping the model picks the right one. The only reliable repair is to make the poisoned tokens leave the window. Rewind to before they entered, restart from a clean summary, or prune them out.

Why appending a correction failsthe wrong response (poison and correction both stay, model still anchors) against the right one (rewind drops the poison out of the window entirely).

Drew Breunig's taxonomy names four failure modes, and naming the one you are in is the first operator move because the repair follows from the name. Poisoning is a wrong fact or hallucination that got written in and now gets referenced repeatedly. DeepMind's Gemini 2.5 Pokémon agent poisoned its own goals section and stayed fixated on impossible goals for a very long time. The repair is removal, never a correction. Distraction is the window growing so long the model over-focuses on its own accumulated history and stops forming new plans. That same Pokémon agent, past roughly 100K tokens, started repeating past actions instead of synthesizing anything novel. The repair is to compact the history down to decisions and outcomes. Confusion is irrelevant but not wrong content steering the answer. One off-topic sentence appended to a math problem can flip the result. The repair is to prune the noise. Clash is two parts of the context disagreeing. When a task's details arrive spread across many turns instead of one upfront prompt, performance drops 39 percent on average, and o3 fell from 98.1 to 64.1. The mechanism is blunt. When a model takes a wrong turn early in a conversation, it gets lost and does not recover, because the wrong turn stays in the window and biases everything after it. The repair is to start clean with the details consolidated.

The four context-failure modeseach Breunig mode as trigger then symptom then the repair it actually needs, so naming the mode points straight at the fix.

Lost-in-the-middle, from the curation lesson, compounds all four. An instruction that was load-bearing forty turns ago now sits in the low-attention middle of a long thread (Liu et al., TACL 2023), technically present and effectively gone. That is not the model disobeying. It is the model not attending to where the instruction ended up.

Where it breaks

The hard part is that not every bad answer in a long session is rot, and treating everything as rot wastes the repair. Some perceived degradation is provider-side. The recurring "the model got dumber this week" threads are partly serving-layer load and time-of-day variance, not your context. Teach yourself to treat context discipline as the variable you control without blaming it for everything. The honest tell that it is rot is reproducibility. If a fresh session with the same curated prompt answers correctly and the long one keeps missing, the difference is the accumulated context, and removal is the fix. If the fresh session is also wrong, you have a reasoning or knowledge gap, and rewinding will not touch it.

The other place this breaks is over-aggressive removal. Rewinding or compacting drops tokens, and the subtle, critical detail whose importance only shows up later is exactly the kind of thing a summary quietly loses. Autocompaction makes this worse because it fires when the model is at its least intelligent point, already rotted, unable to predict where your work is heading, so it summarizes a debugging session right before you pivot to the thing the summary dropped. The defense is to never let removal happen automatically mid-task. Compact at a clean boundary, on your terms, and only drop what you can reconstruct. Keep the URL when you drop the page, keep the path when you drop the file. Never discard anything with no path back to it.

Do it now

When a long thread starts missing, do not type the correction. Run this two-step instead. First paste the diagnostic line into the rotted session to force the model to surface what it is actually anchoring on. Then take the answer to a fresh session using the clean-restart block, which removes the poison by leaving it behind rather than arguing with it.

Paste this

Diagnostic (paste into the rotted thread):
List, as bare facts with no commentary, every assumption and decision you are
currently treating as settled for this task. For each, mark it [confirmed by me],
[you inferred it], or [unsure]. Do not act on anything yet.

Paste this

Clean restart (paste into a NEW session, filled from the diagnostic):
Task: <the one outcome you want>
Confirmed facts (carry forward): <only the [confirmed by me] lines, corrected>
Settled decisions (carry forward): <the decisions still valid, restated cleanly>
Dropped on purpose: <the abandoned approach / wrong fact you are leaving behind>

Start from this state. Do not ask me to re-explain the earlier session.

The diagnostic works because rot hides in the [you inferred it] and [unsure] lines. That is where a hallucinated fact or a quietly-adopted bad assumption shows up, and seeing it listed tells you exactly what the poison is. The restart works because the wrong fact is never typed into the new window. It is named once under "dropped on purpose" so you do not re-derive it, then left out entirely. The poison cannot anchor a model that never received it.

One reach-for rule sits above the block. Match the move to the mode. Poisoning and clash need removal, so rewind or restart clean. Distraction needs the history shrunk, so compact to decisions and outcomes. Confusion needs the one noisy passage pruned. Appending more text repairs none of them.

Worked example

Illustrative

Illustrative. A constructed exchange to show the repair, not a real session.

A long planning thread has been running for two hours. Forty turns ago the assistant inferred the launch date was March 14 from an ambiguous note, and the date was actually never confirmed. Every plan since has been built around it. The operator notices the schedule looks off and reaches for the correction first, the way most people do:

You: Wait, March 14 isn't confirmed. Don't assume that date.

Assistant: Understood, I won't assume March 14. So working back from the March 14 launch, the code freeze should land around...

The correction did nothing. Both the instruction to drop the date and the date itself are in the window, and the model is still anchored on the more-stated fact sitting further up the thread. So the operator stops appending and runs the diagnostic instead:

You: [pastes the diagnostic line]

Assistant: Settled facts and decisions:

Launch date is March 14 — [you inferred it]

Two-week QA window before launch — [confirmed by you]

Marketing owns the announcement copy — [confirmed by you]

Code freeze 10 days before launch — [you inferred it]

There it is, listed plainly. The poison is the inferred March 14, and the code-freeze rule was derived from it, so both are suspect. The operator opens a fresh session and fills the restart block, carrying the two confirmed lines, leaving the inferred date out, naming it once so it does not creep back:

You: Task: build the launch plan. Confirmed: two-week QA window, marketing owns announcement copy. Dropped on purpose: the March 14 date (never confirmed) and any freeze date derived from it. Start from this state.

Assistant: Got it. The plan is date-relative until you confirm a launch date. QA holds the last two weeks before launch, so once you set a date the freeze falls at QA-start. What is the launch date, or should I leave it as a variable?

The difference between the two runs is not the model or the project. The second works because the bad date was never in the new window to anchor on, so the model asks for the missing fact instead of quietly reusing the wrong one. Nothing was argued away. The wrong token was simply removed.