Tools Are the Model's Hands

Intermediate

After this you can design a tool an agent calls correctly the first time, because the name, the description, the arguments, and the return shape all tell it exactly when to reach for this tool and what it will get back.

Understand

An agent is only as capable as the tools you hand it. The model supplies the reasoning, but it cannot read a file, query a database, or send a message on its own. Each of those is a tool you wrote and exposed, a function the model invokes by emitting structured arguments. So the ceiling on what your agent can do is the set of tools in front of it, and the quality of each call is set almost entirely by how you described that tool. The model decides whether to call a tool, and with what arguments, from its name, its description, and its parameter schema. Nothing else. It never sees your implementation. The description is the only interface the model has, which means the description is the prompt, and a tool with a vague one fails in a way that looks like the model being stupid when it is really you under-specifying the job.

Watch what a loose description does. "Query the database" tells the model nothing about scope, so it writes a query against the wrong table, or a write where you meant a read, and it does this confidently because nothing told it not to. Tighten the same tool to "Run a read-only SQL SELECT against the analytics Postgres. Returns at most 100 rows. Do not use for writes or schema changes." and the wrong call mostly stops happening, because the boundary is now stated where the model reads. The strongest descriptions say two things a beginner leaves out: when to use the tool and when not to. The "when not to" is what stops the model from grabbing the nearest plausible tool the moment a task gets ambiguous.
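A minimal sketch of the two descriptions above, written as tool definitions. The dictionary layout (name, description, input_schema) is a generic JSON-Schema-style shape assumed for illustration, not any particular vendor's API, and the names run_analytics_select and is_read_only_select are hypothetical. The small guard shows the stated boundary also being checked in code, since a description only makes the wrong call less likely.

```python
import re

# Loose version: says nothing about scope, read vs. write, or limits.
VAGUE_TOOL = {
    "name": "query",
    "description": "Query the database",
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}

# Tightened version: the boundary is stated where the model reads it.
PRECISE_TOOL = {
    "name": "run_analytics_select",  # hypothetical name
    "description": (
        "Run a read-only SQL SELECT against the analytics Postgres. "
        "Returns at most 100 rows. "
        "Do not use for writes or schema changes."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"sql": {"type": "string"}},
        "required": ["sql"],
    },
}


def is_read_only_select(sql: str) -> bool:
    """Rough guard: accept one statement that starts with SELECT.

    This is a sketch, not a security boundary. A read-only database
    role would be the real enforcement.
    """
    statement = sql.strip().rstrip(";")
    if ";" in statement:  # more than one statement
        return False
    return re.match(r"(?i)select\b", statement) is not None
```

The description and the guard say the same thing twice on purpose: one copy is read by the model before the call, the other catches whatever still gets through.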

Vague tool versus ACI-grade tool: the same database tool, before and after. A loose name and description produce a confident wrong call, while a precise one with a stated boundary and a recoverable error produces the right call.

Naming carries as much weight as the description. Two tools called get_data and fetch_data give the model a coin-flip every time it needs one of them, and it will sometimes call the wrong one and pass arguments shaped for the other. Names should be unambiguous about what the tool does and distinct enough that the model never has to guess between near-synonyms. The same discipline runs into the arguments. If a tool takes a free-text date field, the model will hand it "5/30", "next Tuesday", and "2026-05-30" on different turns. Constrain the shape instead: an enum for a status, an ISO-8601 type for a date, a bounded integer for a limit. You are making the wrong call hard to express, which is poka-yoke (mistake-proofing) applied to a function signature. A constrained argument blocks a whole class of wrong calls before the agent ever runs.
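The constraints above can be sketched as a schema plus a validator. Everything named here is hypothetical: a list_tickets tool, its TicketStatus values, and the limit ceiling of 50 are assumptions chosen only to show an enum, an ISO-8601 date, and a bounded integer side by side.

```python
from datetime import date
from enum import Enum


class TicketStatus(Enum):
    OPEN = "open"
    PENDING = "pending"
    CLOSED = "closed"


MAX_LIMIT = 50  # assumed ceiling for this sketch

# JSON-Schema-style parameter block the model would see.
LIST_TICKETS_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": [s.value for s in TicketStatus]},
        "since": {"type": "string", "format": "date"},
        "limit": {"type": "integer", "minimum": 1, "maximum": MAX_LIMIT},
    },
    "required": ["status", "since", "limit"],
}


def parse_list_tickets_args(raw: dict) -> dict:
    """Validate raw arguments; raise ValueError with a message the model can act on."""
    try:
        status = TicketStatus(raw["status"])
    except (KeyError, ValueError):
        allowed = ", ".join(s.value for s in TicketStatus)
        raise ValueError(f"status must be one of: {allowed}") from None

    try:
        since = date.fromisoformat(raw["since"])
    except (KeyError, TypeError, ValueError):
        raise ValueError("since must be an ISO-8601 date such as 2026-05-30") from None

    limit = raw.get("limit")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(limit, int) or isinstance(limit, bool) or not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be an integer from 1 to {MAX_LIMIT}")

    return {"status": status, "since": since, "limit": limit}
```

With this in place, "5/30" and "next Tuesday" are rejected with a message that names the expected format, while "2026-05-30" parses to a date.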

The return matters as much as the call. The naive instinct is to mirror your REST API one endpoint to one tool and pass the raw response straight back. That return is shaped for a programmer reading JSON, not for a model spending a context budget. A tool that hands back a 50KB JSON blob does two bad things at once. It burns thousands of tokens of context, and it buries the three fields the agent actually needed under pagination metadata, null fields, and nested envelopes the model now has to reason through. Return what the task needs, in the shape it can use. Filter, summarize, and page on the server side so the agent gets the answer rather than the API's idea of a response.
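As a rough illustration of filtering on the server side, the sketch below takes a raw API-style envelope and returns only what a task needs. The invoice payload, its field names, and shape_overdue_invoices are all invented for this example, and the cap of five items is an arbitrary assumption.

```python
NEEDED_FIELDS = ("id", "customer", "amount_due")


def shape_overdue_invoices(raw_response: dict, max_items: int = 5) -> dict:
    """Reduce a raw envelope to the overdue invoices, largest first, a few fields each."""
    items = raw_response.get("data", {}).get("items", [])
    overdue = [item for item in items if item.get("status") == "overdue"]
    overdue.sort(key=lambda item: item.get("amount_due", 0), reverse=True)
    shown = [
        {field: item.get(field) for field in NEEDED_FIELDS}
        for item in overdue[:max_items]
    ]
    # Report the full count so the agent knows whether anything was cut.
    return {"total_overdue": len(overdue), "shown": len(shown), "invoices": shown}


# A hypothetical raw payload with the envelope and metadata a REST API might add.
RAW_RESPONSE = {
    "data": {
        "items": [
            {"id": "inv_1", "customer": "Acme", "amount_due": 120.0,
             "status": "overdue", "currency": "USD", "notes": None},
            {"id": "inv_2", "customer": "Globex", "amount_due": 900.0,
             "status": "overdue", "currency": "USD", "notes": None},
            {"id": "inv_3", "customer": "Initech", "amount_due": 50.0,
             "status": "paid", "currency": "USD", "notes": None},
        ]
    },
    "meta": {"page": 1, "total_pages": 7},
    "links": {"next": None},
}

result = shape_overdue_invoices(RAW_RESPONSE)
```

The agent receives a count and a short list instead of the envelope, so it has nothing to page through and no null fields to reason past.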

Raw REST mirror versus task-shaped return: a 1:1 endpoint-mirroring tool dumping raw JSON into context versus a task-shaped tool returning only the filtered fields the agent needs.

The same logic applies one level up, to how many tools you expose. The intuition is that more tools means a more capable agent. The opposite is true past a point that arrives faster than anyone expects. Every tool you connect injects its name and full description into the model's context on every single call, so a wall of tools is a standing tax paid before the agent does any work, and worse, it is a selection problem. The model has to pick the right tool from the list, and that choice degrades as the list grows. On the Berkeley Function-Calling Leaderboard, every model tested performed worse with more than one tool available, and the slide is steep. Past roughly thirty tools the descriptions start overlapping and selection accuracy falls. Past a hundred, failure is close to guaranteed. The sharpest single data point: Llama 3.1 8B failed a benchmark when handed 46 tools and succeeded on the same work when the set was cut to 19. The tools did not change. The number of them did.
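One way to keep the list short is to expose only the tools tagged for the task at hand. The sketch below assumes a hypothetical registry where each tool declares the tasks it serves. The cap of 20 is an assumption picked to sit under the rough thirty-tool mark mentioned above, not a measured threshold, and character count is only a crude stand-in for token cost.

```python
from dataclasses import dataclass

MAX_EXPOSED_TOOLS = 20  # assumed budget for this sketch


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    tasks: frozenset[str]  # task tags this tool is useful for


def tools_for_task(registry: list[ToolSpec], task: str) -> list[ToolSpec]:
    """Return only the tools tagged for this task, refusing to exceed the budget."""
    selected = [tool for tool in registry if task in tool.tasks]
    if len(selected) > MAX_EXPOSED_TOOLS:
        raise ValueError(
            f"{len(selected)} tools match task '{task}'; "
            f"budget is {MAX_EXPOSED_TOOLS}. Split the task or merge tools."
        )
    return selected


def standing_cost_chars(tools: list[ToolSpec]) -> int:
    """Characters of name plus description sent along with every call."""
    return sum(len(tool.name) + len(tool.description) for tool in tools)


REGISTRY = [
    ToolSpec("find_contact", "Find a single contact by name or email.",
             frozenset({"support", "sales"})),
    ToolSpec("run_analytics_select", "Run a read-only SQL SELECT.",
             frozenset({"reporting"})),
    ToolSpec("create_ticket", "Open a support ticket for a known contact.",
             frozenset({"support"})),
]

support_tools = tools_for_task(REGISTRY, "support")
```

Raising when the budget is exceeded, rather than silently truncating, makes an overgrown tool set show up as a design error instead of as degraded selection at run time.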

Tool count versus selection accuracy: selection accuracy falling as the tool list grows. Workable in the low tens, degrading past ~30, near-certain failure past 100.

Where it breaks

Good tool design has a floor it cannot lift you above. If the underlying task genuinely needs reasoning the model does not have, no description rescues it. A perfectly named, tightly scoped tool still fails when the work itself is beyond the model, and reaching for better tool copy there is the wrong fix. The opposite trap is consolidating too aggressively. "Fewer tools" is real advice, but collapsing three honest tools into one overloaded tool with a mode flag and eight conditional arguments just moves the ambiguity from tool selection into argument selection, where it is harder for the model to get right and harder for you to see go wrong. Consolidate around a task the agent actually performs, not around your urge to shrink the list. None of this trips an alarm. Nothing warns you that a description is too vague or that your fortieth tool tipped selection into the failure zone. The signal shows up in the calls themselves: the model reaching for the wrong tool, passing an argument shaped for a different one, or looping because a tool returned an error it could not act on. Those are tool-design symptoms wearing a model-stupidity mask, and the fix is in the tool, not the prompt.

Do it now

Run any tool you expose through this checklist before you ship it. Paste it next to the tool definition and answer every line. A no anywhere is a redesign, not a note for later.

TOOL DESIGN CHECK — answer yes to ship

NAME
[ ] Unambiguous about what the tool does, not get_data / fetch_data
[ ] Distinct from every other tool's name (no near-synonyms the model coin-flips)

DESCRIPTION (this is the prompt the model reads)
[ ] States WHEN to use it
[ ] States WHEN NOT to use it (the line that stops wrong grabs)
[ ] Names the system it touches and whether it reads or writes
[ ] States any hard limit (max rows, max size, rate)

ARGUMENTS (make the wrong call hard to express)
[ ] Constrained types, not free-text: enum for choices, ISO-8601 for dates, bounded ints for limits
[ ] Every required arg is genuinely required; no silent defaults that hide intent

RETURN (shaped for the agent, not your REST API)
[ ] Returns only the fields the task needs, not the raw API payload
[ ] Filtered / summarized / paged server-side, not a 50KB dump into context
[ ] Errors are recoverable: 'unknown table foo, did you mean orders?' not a stack trace

THE SET (every tool taxes every call)
[ ] This tool earns its slot; the agent's task actually needs it
[ ] The number of exposed tools stays well under the count where selection degrades (aim for the low tens, well short of thirty)

The two lines that catch the most failures are "when not to use it" and "errors are recoverable." The first stops the confident wrong grab before it happens. The second turns a dead-end failure into a turn the model can recover from on its own, which is the difference between an agent that self-corrects and one that loops.
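A recoverable error of the kind the checklist asks for can be sketched with the standard library's difflib. The table names and resolve_table are hypothetical. The point is only that the error carries a suggestion the model can act on in its next turn.

```python
import difflib

KNOWN_TABLES = ["orders", "customers", "events"]  # hypothetical schema


def resolve_table(name: str) -> dict:
    """Return the table if it exists, otherwise an error the model can recover from."""
    if name in KNOWN_TABLES:
        return {"ok": True, "table": name}

    close = difflib.get_close_matches(name, KNOWN_TABLES, n=3, cutoff=0.6)
    if close:
        hint = f" Did you mean {', '.join(close)}?"
    else:
        # No near match: list the options instead of leaving a dead end.
        hint = f" Known tables: {', '.join(KNOWN_TABLES)}."
    return {"ok": False, "error": f"Unknown table '{name}'.{hint}"}
```

A misspelling such as "ordres" comes back with a suggestion, and a name with no near match comes back with the full list, so in both cases the model has something concrete to try next.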

Worked example

Illustrative. A constructed exchange to show the pattern, not a real session.

An agent has to pull one person's contact details. The first design mirrors the CRM's REST API directly, three thin tools wrapping three endpoints, each returning whatever the endpoint returns:

Tools exposed: list_contacts, get_contact, search_contacts

Agent's reasoning: I need Dana's email. search_contacts? Or list_contacts and scan? I will list them.

list_contacts returns: { "data": [ ...312 contact objects, each with 40 fields... ], "page": 1, "total_pages": 7, "meta": {...} }

Agent: [now holding 38KB of contacts across a paged response, re-reasoning over which one is Dana, deciding whether to fetch page 2]

The agent picked a defensible tool and still ended up in a bad place. It has to choose among three overlapping tools, then wade through a raw paginated dump, then page again. Every step spends context and adds a chance to drift. Redesign around what the agent is actually trying to do, which is find one contact:

Tool exposed: find_contact

Description: "Find a single contact by name or email. Returns name, email, company, and last-contacted date for the best match, or a not-found error listing close matches. Use for looking up one known person. Do not use to browse or export the full contact list."

Agent: find_contact(query="Dana")

Returns: { "name": "Dana Okafor", "email": "dana@acme.co", "company": "Acme", "last_contacted": "2026-05-12" }

One tool, one call, four fields. The agent never had to choose between near-identical tools, never held a paginated blob, never reasoned over fields it did not need. The decision burden the first design pushed onto the model, which tool, which page, which record, got absorbed into the tool itself, where you could design it once instead of hoping the model resolves it correctly on every run. That is the move underneath all of this. A good tool does the agent's hard thinking before the agent ever has to.
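A minimal sketch of what might sit behind find_contact, using an in-memory list as a stand-in for the CRM. The second contact, the extra fields, and the matching rules are assumptions made for illustration. In particular, returning the close-match error when several contacts match is one possible reading of "best match", chosen here so the tool never guesses silently.

```python
import difflib

# Stand-in for the CRM; records carry extra fields that are never returned.
CONTACTS = [
    {"name": "Dana Okafor", "email": "dana@acme.co", "company": "Acme",
     "last_contacted": "2026-05-12", "owner_id": 17, "tags": ["lead"]},
    {"name": "Daniel Reyes", "email": "daniel@example.com", "company": "Example Ltd",
     "last_contacted": "2026-04-02", "owner_id": 9, "tags": []},
]

RETURN_FIELDS = ("name", "email", "company", "last_contacted")


def _not_found(query: str, names: list[str]) -> dict:
    return {"error": f"No single contact matches '{query}'.", "close_matches": names}


def find_contact(query: str) -> dict:
    """Find one contact by name or email; on failure, list close matches."""
    q = query.strip().lower()
    hits = [
        c for c in CONTACTS
        if q and (q == c["email"].lower() or q in c["name"].lower())
    ]
    if len(hits) == 1:
        return {field: hits[0][field] for field in RETURN_FIELDS}
    if hits:  # ambiguous: let the model pick from the candidates
        return _not_found(query, [c["name"] for c in hits])

    # Nothing matched: suggest contacts whose name parts look similar.
    by_token = {tok: c["name"] for c in CONTACTS for tok in c["name"].lower().split()}
    close = difflib.get_close_matches(q, list(by_token), n=3, cutoff=0.5)
    return _not_found(query, sorted({by_token[tok] for tok in close}))
```

Calling find_contact("Dana") returns the four fields for Dana Okafor, while find_contact("Dan") matches two people and returns both names so the agent can ask or retry with something more specific.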