Skip to lesson

Research and Synthesis at Scale

After this you can run a multi-source research pass that scales without rotting. You fan the reading out across parallel extractors, hand the results to a separate synthesis step, and gate the final claims so every number traces back to a source you actually supplied.

Understand

The naive move when you want a researched answer is to hand one agent the question, point it at the web or a folder of documents, and ask for the report. It comes back fluent, structured, footnoted, and you cannot tell whether it is right. That is the trap. The work reads as authoritative precisely because the model is good at the part that does not matter, the prose, while the failure hides in the part that does, whether the claims are grounded in anything real. A growth team that swapped its analytics stack for one Claude agent named the result exactly: "garbage in, compelling-looking garbage out, now with AI confidence." Scale does not improve this; it makes a wrong answer look more finished.

The structural reason is that one agent reading many sources is doing two incompatible jobs in one context window, and both degrade as the window fills. The first job is extraction, reading each source closely and pulling what it says. The second is synthesis, holding everything at once and reconciling it into one account. Pile ten sources into a single window and the early ones drift out of working attention while the model is still reading the late ones, so by the time it writes, it is synthesizing from a half-remembered blur rather than from the sources in front of it. This is the same long-input decay that bites long-form generation. A builder who shipped fifty thousand articles found the model "loses recency, anything in the first 30% of the system prompt drifts out of working attention," and tested it across three frontier models that all drifted past the same point. It is not a model-size problem. It is what happens when one window carries the whole job.

The fix is to split the two jobs the way the work actually divides. Fan the extraction out: one sub-agent per source, each reading its source in a clean window and returning a tight structured summary, never the raw text. Then a dedicated synthesis step takes those summaries, not the originals, and reconciles them into the answer. The synthesis agent never reads forty documents. It reads forty digests, which fit. This is the move that turns "more sources" from a liability into a strength, because reading scales horizontally across parallel windows while the thing that does not scale, reconciliation, runs once over compressed inputs.

One agent versus the pipelinethe contrast — one window doing extraction and synthesis at once degrades as sources pile in, while the pipeline gives each source its own clean window and hands compressed summaries to one synthesis step that never reads the originals.
One agent versus the pipelinethe contrast — one window doing extraction and synthesis at once degrades as sources pile in, while the pipeline gives each source its own clean window and hands compressed summaries to one synthesis step that never reads the originals.

The pipeline has a recognizable shape: pose the question precisely, fan out per-source extraction, enrich the gaps the first pass missed, synthesize once, then verify. This very Handbook was built that way. Twenty-one extraction agents each read one research surface and returned a structured digest; a later synthesis pass reconciled those digests into the source map a module author reads, not by re-reading twenty-one raw briefs but by reading their summaries. The document you would author against is itself the output of this pipeline.

But fan-out and synthesis only get you a well-organized answer. They do nothing for whether the answer is true, and a confidently organized falsehood is worse than an obvious one because it survives review. This is where the gate earns its place. The drafting failure that real pipelines guard against is precise: the model will invent a plausible-sounding number to fill a gap, and "those numbers will survive editorial review unless an editor opens every link." The countermeasure is a hard constraint stated upfront, that every statistic must trace to a source in the supplied set, paired with a verification pass whose only job is to walk each claim back to its source and flag the ones that have none. That pass is not editing for flow. It is a distinct step asking one question per claim: where in the supplied sources does this come from?

The trace-every-claim gatethe verification pass walking each claim in the synthesized draft back to a supplied source, splitting claims into grounded (keep), unsupported (cut or flag), and fabricated figures (the model narrated a number no source contains).
The trace-every-claim gatethe verification pass walking each claim in the synthesized draft back to a supplied source, splitting claims into grounded (keep), unsupported (cut or flag), and fabricated figures (the model narrated a number no source contains).

Where it breaks

The trace-every-claim gate is the load-bearing part, and it is also where the discipline most often quietly fails. The first failure is the hallucinated figure that reads like a finding. A model will narrate "conversion rose from 3.2% to 4.8%" with the same cadence it uses for real numbers, and if your gate only checks claims that look uncertain, the confident ones walk straight through. The gate has to be indiscriminate, checking every number rather than the suspicious-looking ones, because the dangerous fabrications are the ones that sound the most settled. This Handbook's own pipeline caught a fabricated chart midpoint in a finished module during exactly this kind of pass; it had survived the draft because it looked like every other grounded number around it.

The second failure is synthesis that smooths over disagreement instead of surfacing it. When two sources conflict, the model's instinct is to blend them into one untroubled sentence, because agreeable prose is what it was trained to produce — and that blend quietly omits the conflict, which is often the most useful signal in the pass. Practitioners have named the reflex bluntly: LLMs "try to agree with everyone," and the same instinct that flattens a content draft flattens a research synthesis. Operator-grade synthesis does the opposite. It names the disagreement, attributes each side, and says which is better supported, because the conflict is often the most useful thing in the whole pass. If your synthesis never reports that two sources disagreed, suspect it is hiding the disagreement, not that the sources happened to align.

The third failure is fan-out that outruns synthesis. It is cheap to spawn forty extractors and expensive to reconcile forty digests well, so the temptation is to scale the reading and assume the synthesis will keep up. It will not. A single synthesis context has a real ceiling. Past roughly a handful of dense batches it starts compressing rather than reconciling, summarizing the summaries and losing the cross-source contradictions that were the point. The constraint to respect is that fan-out should produce only as much as one synthesis pass can actually hold and reconcile. If the synthesis step is drowning, the answer is fewer, denser extractions or a tiered synthesis, not a bigger pile.

And the whole pipeline is the wrong tool when the question has one source or no real sources. If the answer lives in a single document, fanning out is theatre, because one careful read beats five parallel partial ones. If the question is genuinely open-ended judgment with nothing to ground against, no amount of extraction produces a grounded claim, because there is nothing to trace to. The pipeline earns its cost only when there are many real sources and the risk is that one agent cannot hold them all honestly at once.

Do it now

Two artifacts run the spine. The first is the per-source extraction brief you give each parallel extractor, one per source, each in its own clean context. It forces a structured digest instead of a dump, which is what keeps the synthesis step readable:

Paste this
You are extracting ONE source for a research synthesis. Read it in full before writing.

SOURCE: <url or file path — the one source this agent owns>
RESEARCH QUESTION: <the single question the whole pass is answering>

Return ONLY this structure. Do not write a narrative.

- Claims relevant to the question: each as one line, with the exact figure/quote
  if the source gives one. No claim without a locator (section, page, or quote).
- Numbers/stats: list each with its surrounding context. If the source asserts a
  number without backing, mark it (source-asserted, unverified).
- Where this source DISAGREES with the common framing, or hedges, or caveats.
- What this source does NOT cover (so synthesis knows the gap).

Rule: if it is not in this source, it does not go in your summary. Do not add
context from your own knowledge. Flag anything you are inferring vs. reading.

The second is the verification gate the synthesis output passes before anyone trusts it. Run it as a separate pass, ideally a fresh context that did not write the draft:

Paste this
You are verifying a research synthesis against its supplied sources ONLY.
You are not improving the writing. You are checking grounding.

INPUT: the synthesized draft + the full set of per-source summaries it drew from.

For EVERY claim in the draft, and EVERY number without exception:
1. Name the supplied source it traces to. Quote the supporting line.
2. If no supplied source supports it → label UNSUPPORTED.
3. If it is a number that no supplied source contains → label FABRICATED, remove it.
4. If two sources disagreed and the draft reports only one side → label SMOOTHED,
   and state what the draft omitted.

Return a table: claim | verdict | source / quote | action.
Finding ungrounded claims is success, not failure. Do not pass anything you
could not trace.

The two together are the discipline. The extraction brief stops each reader from blurring sources, and forbids it from importing outside knowledge that would later read as a source-grounded claim. The gate's last line is doing quiet work: an agent told that finding gaps is success will hunt for ungrounded claims, where an agent told to verify will rubber-stamp. Frame the verifier as adversarial to the draft, not as its defender.

Worked example

Illustrative

Illustrative. A constructed run to show the pipeline, not a real session.

The question: "What does practitioner evidence say about the reliability of AI A/B-test analysis?" Five sources are in scope: three engineering write-ups, one CRO audit, one academic paper. The one-agent move is to paste all five and ask for the summary:

You: [pastes all five sources] Summarize what these say about whether AI A/B-test analysis is reliable.

Assistant: AI-assisted analysis is broadly reliable when paired with good data practices. Most experiments show no effect, and tools have improved significantly, with accuracy reaching the high 90s in modern setups...

It reads clean and it is quietly wrong. "Accuracy reaching the high 90s" blends a semantic-layer figure from one source into a general claim none of them made, and it has flattened a real disagreement, where one source argues AI lowers reliability and another that a metrics layer restores it, into one agreeable sentence. Run the pipeline instead. Five extractors each read one source and return a digest. The CRO-audit extractor returns: "Reclassified 57 claimed wins down to 17 under consistent standards, a 70% reduction (source-stated)." The semantic-layer extractor returns: "Raw-schema text-to-SQL accuracy near 40%; a governed metrics layer approaches ~100% (vendor-reported)." Synthesis reconciles the five digests and, critically, reports the disagreement rather than dissolving it:

Synthesis: Sources split on reliability. The audit and post-mortem sources hold that AI raises the rate of plausible-but-wrong analysis, with one audit cutting 57 claimed wins to 17. The tooling sources argue reliability is recoverable but only through a governed metrics layer, citing a jump from ~40% raw-schema accuracy toward ~100%. These are not in conflict once scoped: the failure is in ungoverned ad-hoc querying; the recovery is in constraining the model to a defined layer. No source supports general reliability without that constraint.

Then the gate runs on that draft against the five digests:

Verifier: "57 wins to 17" → GROUNDED (CRO audit, source-stated). "~40% toward ~100%" → GROUNDED (semantic-layer source, vendor-reported; flag as vendor claim, not independent). "raises the rate of plausible-but-wrong analysis" → GROUNDED (post-mortem). No FABRICATED figures. No SMOOTHED disagreement, the draft reports the split. PASS, with one number down-flagged as vendor-reported.

Where each step caught whatthe same five sources through both routes — the one-agent path blends a figure and flattens the disagreement, the pipeline path keeps each source separate, surfaces the split, and the gate down-flags a vendor number instead of passing it as fact.
Where each step caught whatthe same five sources through both routes — the one-agent path blends a figure and flattens the disagreement, the pipeline path keeps each source separate, surfaces the split, and the gate down-flags a vendor number instead of passing it as fact.

Same five sources, same model. The pipeline answer is less tidy and far more trustworthy, because it tells you where the sources disagree and which number you should not lean on without checking. Those are the two things the clean single-agent summary buried.