Resources & Further Reading

Beginner

After this you can reach for the one vetted primary source behind each concept in this handbook instead of a search-result link pile, because you know what each source is the canonical source *for* and where it stops being current.

Understand

A reading list is easy to assemble and almost worthless. Anyone can paste fifty links under a heading. The reason most resource pages fail the moment a skilled reader lands on them is that they index breadth and call it curation. An awesome-* list on GitHub is a real and useful thing, but its maintainers are usually explicit that inclusion is not a vetting or a security guarantee. A page that ranks an abandoned weekend project next to a provider-maintained reference, with no signal between them, has done the cheap half of the job and skipped the expensive half.

The expensive half is subtraction with a reason attached. The discipline here is the same one that runs through the rest of the handbook: the operator subtracts and verifies, the beginner accumulates and trusts. For a resource shelf that means three commitments. One source per concept, not ten, so the reader knows which one to open first. Each entry placed after the lesson it backs, so you reach for it when you actually need the depth, not as front-loaded homework. And a stated reason it earns its place, because a link with no why is indistinguishable from a link someone pasted to look well-read.

The shortcut breaks in three predictable ways. Treating a curated list as a quality signal is the first. The second is the bare link-dump, a title and a URL with nothing telling you why this one and not the other forty. The third is the one that catches careful people: the superseded source. A specific danger in this field is the dated leaderboard or the "best tool for X" post, accurate the week it shipped and quietly wrong six months later. The fix is to surface each source for its durable contribution, the thing it taught that survives a version bump, and to hedge anything volatile rather than freeze a ranking into the page.

Do it now

The shelf below is the asset. It is organized by the concept each source explains, not alphabetically and not by popularity, so when a lesson sends you here you land on the one entry that goes deeper. Every entry names what the source is the canonical source for, why it earns the slot, and the signal to check before you trust it today.

Which source is canonical for which conceptthe spine of the handbook mapped to the single primary source behind each idea, so the shelf reads as a concept index rather than a link pile.

Context engineering and the window

Anthropic, "Effective context engineering for AI agents" (anthropic.com/engineering/effective-context-engineering-for-ai-agents). The canonical source for the reframe from prompt engineering to context engineering: the question shifts from finding the right words to deciding "what configuration of context is most likely to generate the desired behavior." It earns its place because it is the primary statement from the engineering org that defines the term the rest of the handbook builds on, and it carries the "examples are the pictures worth a thousand words" guidance against stuffing edge cases into prose. Signal to check: the framing is durable, but specific token-window numbers and product behaviors drift between model releases, so read it for the principle and verify any hard number against current docs.

Drew Breunig, "How contexts fail and how to fix them" (dbreunig.com). The canonical source for the taxonomy of context failure: poisoning (a hallucination enters context and gets referenced repeatedly), distraction (context grows so long the model over-focuses on it), confusion (superfluous content degrades the answer), and clash (new content conflicts with earlier prompt content). It earns its place because it is the most teachable failure framework in this whole area and the names have become shared vocabulary. Pair it with Karpathy's and Philipp Schmid's framing of the same shift; Schmid's one line is the cleanest for teaching (philschmid.de/context-engineering): prompt engineering is what you say, context engineering is the system that decides what the model sees.

Tools, MCP, and its context cost

Anthropic, "Writing effective tools for agents" (anthropic.com/engineering/writing-tools-for-agents). The canonical source for why fewer tools beat more: consolidate into fewer high-level tools that accomplish complete tasks, because every tool you expose adds to the agent's decision burden and spends context. The concrete example to remember is a single find_contact tool standing in for three granular ones (list_contacts + get_contact + search_contacts). This is the resource the worked example below reaches for, and it earns its place because almost no generic toolkit content names the counterintuitive part. Signal to check: the design principles are stable; treat it as the reference you open when you are about to wire in a tool, not a one-time read.

Anthropic, "Code execution with MCP" (anthropic.com/engineering/code-execution-with-mcp). The canonical source for the insight beginners never see: tool definitions eat context before the agent does any work. With dozens of servers and hundreds of tools, the definitions alone can consume tens of thousands of tokens up front. The reframe it teaches is that MCP is an API surface you program against, loading definitions on demand and filtering results in the execution environment so only the answer returns to the model; Anthropic reports a greater-than-90% reduction in tool-related context in some cases. It earns its place because it is the primary source with a real number behind the "more tools is a tax" claim.

modelcontextprotocol.io (the MCP specification). The canonical source for how MCP actually works: a host embeds a client, servers expose tools, resources, and prompts, and the model only ever sees the tool descriptions injected into context. It earns its place as the reference of record rather than someone's summary of it. Signal to check: the spec versions, so confirm the current transport and auth state on the site rather than from a blog post that may predate the latest revision.

Invariant Labs, "MCP security notification: tool poisoning attacks" (invariantlabs.ai). The canonical source for the trust boundary collapse, and the one piece almost no generic toolkit content carries. Clients send the full tool description to the model but typically show users only a name and short summary, so a server can embed a hidden instruction the model follows and the user never sees. It names three attack classes worth memorizing: tool poisoning, the rug-pull (a benign server changes its descriptions after you approve it), and cross-server shadowing (one server alters how another's tools behave). It earns its place because installing a server means running someone else's code with your credentials, and this is the source that makes that concrete.

Workflows, agents, and frameworks

Anthropic, "Building effective agents" (anthropic.com/engineering/building-effective-agents). The canonical source for the workflow-versus-agent distinction and the "start simple, add complexity only when it pays" discipline: start with simple prompts, optimize with evaluation, and add multi-step agentic systems only when simpler solutions fall short. It earns its place because much of what is sold as "agents" is really workflows (chaining, routing, parallelization), and this is the primary source that draws the line cleanly.

Octomind, "Why we no longer use LangChain," and Cognition, "Don't build multi-agents." The canonical sources for frameworks and architecture as a liability when they stop matching your shape. Octomind (octomind.dev) is the clearest case study: high-level abstractions were great while needs were simple but hid the prompts being sent to the model, so they replaced the framework with direct API calls plus a handful of utilities and got a smaller, more debuggable codebase. Cognition (cognition.ai/blog/dont-build-multi-agents) makes the architectural cut: for most applications a single-threaded linear agent with carefully engineered context is more reliable than a multi-agent system, because parallel subagents make conflicting implicit assumptions. Together they earn their place as the counterweight to framework-first content. Signal to check: these are dated opinions about specific tools; the lesson (a framework helps only when it matches your shape) outlasts the specific verdicts, so do not read either as a permanent ranking of any named tool.

Evaluation and verification

Hamel Husain, "Your AI product needs evals" (hamel.dev/blog/posts/evals). The canonical practitioner source for building an eval discipline: start with cheap code assertions, grow to LLM-as-judge, and the most underrated step, actually read your outputs. The move that separates teams who improve from teams who guess is that the most valuable eval cases are your own failures, not a generic benchmark. It earns its place because it is the essay most people in the field point to first.

Where things break and where to start

swyx, "The AI Engineer Reading List" (latent.space/p/reading-list). The canonical meta-resource, a curated set of roughly fifty resources organized by topic. It earns its place precisely because the value is in the selection, not the enumeration, which is the same principle this page runs on. Use it as the broader map once a specific concept here sends you deeper than a single source can take you.

If you want the named-papers shelf behind the prompting boundaries, three earn their slot by being the why under a common failure: "Calibrate Before Use" (arxiv.org/abs/2102.09690) for recency, majority-label, and common-token bias in few-shot, where changing only the order of identical examples swung sentiment accuracy from 54% to 93%; the demos-position-bias paper (arxiv.org/abs/2507.22887) for the finding that relocating an identical demo block from start to end can swing accuracy by 20 to 50 points with no content change; and IFScale (arxiv.org/abs/2507.11538) for the why behind over-prompting, where even the best frontier models hit only 68% instruction-following accuracy at maximum instruction density. Read these only when a lesson cites the effect and you want the primary measurement.

Here is the one-line vet-check to paste at the top of any new resource before you add it to your own shelf. It is the copy-pasteable artifact of this page.

Paste this

Vet before adding to the shelf:
[ ] Names a concept it is the CANONICAL source FOR (not just a good read)
[ ] Maintained / dated within the last ~12 months, or a primary spec/paper that doesn't decay
[ ] Ships the reasoning, not just the artifact (a why, not only a what)
[ ] Volatile claims (rankings, "best tool", token numbers) hedged, not frozen
Fail any line -> it's a bookmark, not a shelf entry.

Generic shelf vs operator shelfthe same set of links handled two ways, and why the second survives a skilled reader.

Worked example

Illustrative

Illustrative. A constructed decision to show the shelf in use, not a real session.

You are about to give an agent access to your CRM and you have to decide how to expose it. The CRM's API has a dozen endpoints, and the obvious move is to wire one tool per endpoint so the agent can do anything. Before committing, you open the one entry this shelf places under tool design, Anthropic's "Writing effective tools for agents," for the specific takeaway rather than a general re-read.

Takeaway you reach for: consolidate into fewer high-level tools that complete a whole task. The example in the source is a single find_contact replacing list_contacts + get_contact + search_contacts, because fewer capable tools cut the agent's decision burden and conserve context.

So instead of a dozen thin REST mirrors you expose three task-shaped tools: find_contact, log_interaction, and update_deal_stage. The agent now picks from three obvious options instead of guessing among twelve overlapping ones, and the tool definitions spend far less of the window before any work starts. The source did the work the title alone could not: it told you which direction the design pressure runs, and the reason behind it, so the decision took one read instead of a trial-and-error loop, because the entry carried the direction the design pressure runs and the reason behind it.