RAG & Knowledge Access
After this you can decide how a model gets the knowledge a task needs — load it into the window, retrieve it on demand, or bake it into the weights — and you can tell when a confident answer is actually sitting on bad retrieval rather than on the facts you meant it to use.
Most people meet retrieval-augmented generation as a generation feature, a way to make the model "know your documents." It is closer to the opposite. RAG is a retrieval problem wearing a generation costume. The model never sees your corpus. It sees whatever a retrieval step selected and pasted into the window, and it reasons over exactly that. If retrieval hands it the wrong three paragraphs, the model writes a fluent, well-organized answer grounded in the wrong three paragraphs, and nothing in the output marks the miss. The generation half almost always looks fine. The quality you actually get was decided one step earlier, by what got retrieved.
There are three ways to put knowledge in front of a model, and they sit on a spectrum of effort and freshness. You can load it directly into the context window by pasting or attaching it. You can retrieve it on demand, which is what RAG means: embed the corpus into vectors once, then at question time fetch the top handful of chunks by similarity and inject those. Or you can bake it into the weights with fine-tuning, which teaches durable skills and formats far better than it teaches facts. The choice between them is not a matter of taste. It follows computable properties: how big the corpus is, how often it changes, what fraction of it is relevant to a typical question, and how much latency and cost you can spend per answer.
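The retrieve-on-demand path can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, and `build_index`, `retrieve`, and `build_prompt` are hypothetical names chosen for this example. It shows the shape of the mechanism: embed once, rank by similarity at question time, and inject only the top few chunks.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Embed the corpus once, ahead of any question."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(index: list[tuple[str, Counter]], query: str, k: int = 3) -> str:
    """The model sees only what retrieval selected, never the whole corpus."""
    context = "\n\n".join(retrieve(index, query, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

chunks = [
    "The Team plan costs twelve dollars per seat.",
    "Our office is closed on public holidays.",
    "Password resets are handled from the account page.",
]
index = build_index(chunks)
prompt = build_prompt(index, "How much does the Team plan cost per seat?", k=1)
```

Everything the model can reason over is in `prompt`. Whatever `retrieve` ranks below the cutoff does not exist as far as the generation step is concerned.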
The reflex to reach for RAG is usually premature. If the material that bears on a question fits comfortably inside the effective window — and remember from context engineering that the usable window is smaller than the advertised one — loading it is simpler, fresher, and needs no infrastructure. RAG earns its keep when the corpus is too large to load, changes faster than you can paste, or buries a small relevant slice inside a large irrelevant bulk. The cost comparison is worth getting the right way round, because it inverts easily. At scale, holding everything in a long context is the expensive path, not the cheap one: long-context requests have been measured at roughly thirty to sixty times the latency and on the order of a thousand times the per-query cost of answering the same question through retrieval, which is exactly why retrieval exists for large corpora. The case for loading a small corpus directly is therefore not per-query price. It is that you skip the embedding store, the chunking decisions, and the retrieval tuning entirely, getting fresher data with far less to maintain.
The number that should change how you treat retrieval is the silent-miss rate. On a single-needle test, where one fact is hidden in a long document, a strong model like Gemini 1.5 Pro retrieved it about 99.7% of the time, which is the kind of figure that makes people trust long context blindly. On realistic multi-fact retrieval, where the answer depends on stitching several passages together, recall fell to roughly 60%. That is a miss rate of roughly 40% that throws no error and reads as confident prose. And more raw context is not the ceiling either: an order-preserving retrieval setup using 48K well-chosen tokens beat a full-context run at 117K tokens by 13 F1 points, at about two-fifths of the token budget. What moved accuracy was the selection, not the extra sixty-nine thousand tokens.
Because the answer is decided by retrieval, retrieval quality sets the ceiling on what the generation step can do, and the moves that raise it are mostly unglamorous. Chunking, the plumbing step of splitting documents before you embed them, tends to matter as much as or more than which embedding model you pick. The reason is mechanical. Split at the wrong boundary and the sentence carrying the answer lands in one chunk while the context that qualifies it lands in the next, so neither chunk matches the query well enough to be retrieved and the answer is lost even though both halves sit in the corpus. Over-fetching to make up for it tends to hurt rather than help: pulling thirty chunks because the window has room drags in near-misses, and accuracy on a RAG system often rises with the first few documents and then falls as more are added, because the extra chunks are hard negatives that look relevant and mislead.
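One common mitigation for the boundary problem is to split on sentence boundaries and let adjacent chunks overlap, so a sentence and the one that qualifies it appear together in at least one chunk. The sketch below assumes a naive regex sentence splitter. The names `split_sentences` and `chunk_sentences` and the default sizes are illustrative choices, not values from the text.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def chunk_sentences(text: str, max_sentences: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into chunks; consecutive chunks share `overlap` sentences."""
    if not 0 <= overlap < max_sentences:
        raise ValueError("overlap must be >= 0 and smaller than max_sentences")
    sentences = split_sentences(text)
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + max_sentences]))
        if start + max_sentences >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks

text = (
    "Refunds are available. They apply within 30 days. "
    "This excludes annual plans. Contact support to start."
)
overlapping = chunk_sentences(text, max_sentences=2, overlap=1)
disjoint = chunk_sentences(text, max_sentences=2, overlap=0)
```

With `overlap=0`, the 30-day rule and the annual-plan exclusion land in different chunks. With `overlap=1`, one chunk carries both. Overlap costs extra storage and does not replace choosing sensible boundaries in the first place.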
Where it breaks
The dangerous failure is the quiet one. A retrieval miss does not crash. It returns a plausible passage that is subtly wrong — last quarter's pricing page, a deprecated policy, a similarly-worded section from a different product — and the model answers from it with full confidence. You only find out when someone acts on the answer. Bad chunk boundaries cause a related failure, where the fact a question needs is split across two chunks and neither one matches the query well enough to be retrieved. And chunking decisions are near-irreversible at scale, because changing them means re-ingesting the entire corpus, so a careless early choice becomes a standing tax. There is a quieter waste in the other direction. Standing up a retrieval pipeline for a corpus you could have pasted pays the operational surface of RAG to look up knowledge that would have fit in the window in the first place. Match the mechanism to the corpus, not to what sounds advanced.
Make the routing decision before you build anything, and put a trust gate on any retrieval-answered question that matters. Paste this and fill it in:
Knowledge routing — answer before building retrieval:
- Corpus size: does the relevant material fit in ~32–64K tokens of well-chosen context? If yes, load it and skip RAG.
- Change rate: does it change faster than you can re-load by hand? If yes, lean RAG.
- Relevance ratio: is the slice that matters less than ~20% of the corpus? If yes, lean RAG.
- Stakes: on a wrong answer, who is harmed, and how loudly do you need the miss to surface?
Retrieval-trust gate — paste above any RAG-answered question where a wrong answer costs something:
Answer ONLY from the retrieved passages provided. For each claim, name the passage it came from. If the passages do not contain the answer, reply "not in the retrieved context". Do not fill the gap from general knowledge.
The four routing lines stop you from building a pipeline you do not need. Paste the gate so the retrieved passages sit alongside it, since the model can only attribute claims to passages it can actually see. The gate then does the load-bearing work: it converts the silent-miss failure into a visible one. A model told to answer only from the retrieved passages, and to say so when they fall short, surfaces a bad retrieval as a refusal you can see instead of a confident answer you cannot.
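The first three routing questions are computable, so they can be written down as a small function. This is a sketch under stated assumptions: the `Corpus` fields, the 32K default budget (the low end of the range in the checklist), and the precedence (size and change rate checked first) are illustrative choices, and the stakes question is deliberately left to a human.

```python
from dataclasses import dataclass

@dataclass
class Corpus:
    relevant_tokens: int              # tokens that bear on a typical question
    total_tokens: int                 # tokens in the whole corpus
    changes_faster_than_reload: bool  # True if hand re-loading cannot keep up

def route(corpus: Corpus, window_budget: int = 32_000,
          relevance_cutoff: float = 0.20) -> tuple[str, list[str]]:
    """Return ("load" or "rag", reasons). Precedence here is an assumption."""
    fits = corpus.relevant_tokens <= window_budget
    if fits and not corpus.changes_faster_than_reload:
        return "load", ["relevant material fits the window budget"]
    reasons = []
    if not fits:
        reasons.append("relevant material exceeds the window budget")
    if corpus.changes_faster_than_reload:
        reasons.append("corpus changes faster than it can be re-loaded by hand")
    if corpus.total_tokens and corpus.relevant_tokens / corpus.total_tokens < relevance_cutoff:
        reasons.append("relevant slice is a small fraction of the corpus")
    return "rag", reasons

small = route(Corpus(relevant_tokens=20_000, total_tokens=25_000,
                     changes_faster_than_reload=False))
large = route(Corpus(relevant_tokens=500_000, total_tokens=5_000_000,
                     changes_faster_than_reload=False))
```

Returning the reasons alongside the decision keeps the routing choice reviewable rather than a bare label.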
Worked example
Illustrative. A constructed exchange to show the failure mode and the fix, not a real system.
A support assistant answers from a documentation corpus. A user asks for the current per-seat price. Retrieval ranks an old pricing page just above the current one, because the two are nearly identical in wording, and injects the stale chunk:
User: What's the per-seat price on the Team plan?
Assistant: The Team plan is $12 per seat per month, billed annually.
The number is wrong by a release, and nothing flags it. The model reasoned correctly over the passage it was given. The passage was the problem. Now the same retrieval, with the trust gate in front of it:
Assistant: The retrieved passages describe the Team plan's seat model and feature tiers, but the only price figure in them is labelled as a prior-year rate. The current per-seat price is not in the retrieved context, so I won't state one — pull the live pricing page if you need the exact number.
Both paths ran the same retrieval, and the stale chunk ranked first in both. The gate changed nothing about the ranking. What it changed was whether a bad ranking reached the user as a confident answer or surfaced as a refusal someone could catch.
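Assembling the gate in code might look like the sketch below. The gate wording is adapted from the template above. The `[P1]`-style passage labels, the function names, and the substring check for refusals are assumptions made for illustration. Labelling each passage gives the model something concrete to cite, and the refusal check lets a caller log or escalate visible misses.

```python
GATE = (
    "Answer ONLY from the retrieved passages provided. "
    "For each claim, name the passage it came from. "
    'If the passages do not contain the answer, reply "not in the retrieved context". '
    "Do not fill the gap from general knowledge."
)

def build_gated_prompt(question: str, passages: list[str]) -> str:
    """Place the gate, the labelled passages, and the question in one prompt."""
    labelled = "\n\n".join(f"[P{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{GATE}\n\nRetrieved passages:\n{labelled}\n\nQuestion: {question}"

def is_refusal(answer: str) -> bool:
    """Crude check: did the model report that the passages fell short?"""
    return "not in the retrieved context" in answer.lower()

prompt = build_gated_prompt(
    "What's the per-seat price on the Team plan?",
    ["Prior-year rate: the Team plan was $12 per seat per month, billed annually."],
)
gated_answer = (
    "The current per-seat price is not in the retrieved context, "
    "so I won't state one."
)
ungated_answer = "The Team plan is $12 per seat per month, billed annually."
```

A substring match is brittle if the model paraphrases its refusal, so treat `is_refusal` as a starting point for monitoring rather than a guarantee.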