The Token Economy: Cost as a Design Constraint

Intermediate

After this you can treat cost as a design constraint that shapes the system from the first decision, not a bill that arrives later. You can find the axis that actually drives a workload's spend — cache misses, model tier, standing context, agent fan-out — and optimize that one instead of the axis that feels productive to fiddle with.

Understand

Cost in an AI system is not a number you read off an invoice at the end. It is a property you design in or out at the start, and it almost always reduces to one shape: how many tokens, at what per-token rate, how many times. Once you see spend that way, most cost problems turn out to be architecture problems wearing a pricing label, and the fix is rarely "write a shorter prompt." The expensive mistakes happen on axes that prompt-trimming never touches.

The axis most people never look at is cache hit rate. Providers cache the prefix of your context so that repeated calls re-reading the same system prompt and history do not pay full price each time, and the discount is large. Cached input tokens run roughly ten times cheaper than uncached. On a workhorse-tier model like Claude Sonnet the gap has run about $0.30 against $3.00 per million tokens, and while the absolute rates drift release to release, the rough tenfold ratio is the part you design around. One team that runs agents in production calls cache hit rate the single most important metric for a production agent, and the reason is the trap built into it. The cache keys on a stable prefix, so a single changed token near the start of the context invalidates everything after it. A second-precision timestamp stamped into your system prompt, an in-place edit of earlier history, or a tool list that reorders between calls will silently drop your hit rate to nothing. The output looks identical. The bill is ten times larger and nobody changed a feature. The defensive habits follow directly: keep the prefix byte-stable, append to context rather than editing it in place, and order anything structured deterministically.

Figure: Cache hit versus miss. The ~10x gap between cached and uncached input, and the small edits that silently invalidate the cache and re-bill everything downstream at full rate.
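A minimal sketch of the prefix-stability habits, under stated assumptions: `build_context` and `shared_prefix_len` are hypothetical helpers, not any provider's API, and the comparison is done on characters, whereas real caches work on tokens and provider-defined boundaries. It only illustrates why placement and ordering matter.

```python
import json


def build_context(system_prompt: str, tools: list[dict], timestamp: str,
                  timestamp_first: bool) -> str:
    """Assemble a context string with a deterministic tool block.

    Tools are sorted by name and their keys are sorted, so the same tool set
    always serializes to the same bytes no matter what order it arrives in.
    """
    tool_block = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    stable = system_prompt + "\n" + tool_block
    if timestamp_first:
        # Anti-pattern: volatile data ahead of the stable content.
        return f"Now: {timestamp}\n{stable}"
    # Preferred: volatile data appended after the stable content.
    return f"{stable}\nNow: {timestamp}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    tools = [{"name": "search", "description": "Search tickets"},
             {"name": "calc", "description": "Do arithmetic"}]
    system = "You summarize support tickets."
    t1, t2 = "2024-01-01T00:00:01", "2024-01-01T00:00:02"

    bad = shared_prefix_len(build_context(system, tools, t1, True),
                            build_context(system, tools, t2, True))
    good = shared_prefix_len(build_context(system, tools, t1, False),
                             build_context(system, tools, t2, False))
    print(f"timestamp first: {bad} shared chars; timestamp last: {good} shared chars")
```

With the timestamp in front, two consecutive calls share almost nothing that a prefix cache could reuse. With it at the end, the whole stable block is shared.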

Which model does the work is the next lever, and it pays only because the gap between tiers is large enough to matter. The landscape has held a stable three-tier shape for years even as the model names churn every quarter: a frontier reasoning tier, a workhorse tier, and a cheap-fast tier, with roughly a fivefold cost step between them. The operator move is to route by role rather than picking one model for everything. Orchestration, synthesis, and any judgment that pulls several sources together go to the strong tier, where the reasoning has to be right. Mechanical single-source work — extraction, reformatting, classification — goes down a tier or two, where it is good enough and a fraction of the price. The asymmetry that makes this worth doing also sets its limit. A wrong answer from a cheap model that you then act on costs far more than the token delta you saved, so the rule is to spend on the decisions and economize on the grunt work, never the reverse.
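Routing by role can be sketched as a small lookup. Everything named here is an assumption for illustration: the role labels, the `route` function, and the relative costs, which simply apply the rough fivefold step per tier. Unknown roles default to the strong tier, following the rule that a wrong answer costs more than the token delta saved.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    WORKHORSE = "workhorse"
    CHEAP_FAST = "cheap-fast"


# Hypothetical relative costs derived from a ~5x step between tiers.
RELATIVE_COST = {Tier.CHEAP_FAST: 1, Tier.WORKHORSE: 5, Tier.FRONTIER: 25}

# Hypothetical role labels; a real system would define its own.
JUDGMENT_ROLES = {"orchestration", "synthesis", "multi_source_judgment"}
MECHANICAL_ROLES = {"extraction", "reformatting", "classification"}


def route(role: str) -> Tier:
    """Pick a tier from the role of the work, not from the task's size."""
    if role in MECHANICAL_ROLES:
        return Tier.CHEAP_FAST
    if role in JUDGMENT_ROLES:
        return Tier.FRONTIER
    # Assumption: when the role is unknown, pay for the strong tier rather
    # than risk acting on a cheap model's wrong answer.
    return Tier.FRONTIER


def relative_cost(role: str) -> int:
    """Relative per-token cost of the tier a role is routed to."""
    return RELATIVE_COST[route(role)]


if __name__ == "__main__":
    for role in ("extraction", "synthesis", "something_new"):
        print(role, "->", route(role).value, f"(relative cost {relative_cost(role)})")
```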

Cost also hides in everything that loads before the user's input even arrives. Every connected tool taxes every call: with dozens of MCP servers wired in, the tool definitions alone can consume tens of thousands of tokens up front, on every single request, before the user has typed a word. Progressive disclosure, meaning tool definitions are loaded on demand and only the slice of a result that matters is returned, has cut that standing cost by over 90% in some measured cases. Agents compound the same dynamic. Anthropic's multi-agent research system used on the order of fifteen times the tokens of a single chat, and in their evals token usage alone explained about 80% of the performance variance, which means fan-out is an economic decision to make deliberately, not a default to reach for. The lever here is to trim the standing load before you trim the prompt, because the standing load is paid again on every call, while a given prompt is paid for once.
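The arithmetic behind the standing load is easy to sketch. The token count, call volume, and rate below are hypothetical inputs chosen for round numbers, and the helper names are illustrative. The 90% reduction is the figure quoted above for some measured cases, not a guarantee.

```python
def standing_cost_per_day(standing_tokens: int, calls_per_day: int,
                          rate_per_mtok: float) -> float:
    """Daily dollars spent on tokens that load before the user's input."""
    return standing_tokens * calls_per_day * rate_per_mtok / 1_000_000


def after_reduction(standing_tokens: int, reduction: float) -> int:
    """Standing tokens left after cutting the load by the given fraction."""
    return round(standing_tokens * (1 - reduction))


if __name__ == "__main__":
    # Hypothetical workload: 40k tokens of tool definitions, 10k calls a day,
    # billed at an uncached rate of $3.00 per million input tokens.
    tokens, calls, rate = 40_000, 10_000, 3.00

    before = standing_cost_per_day(tokens, calls, rate)
    after = standing_cost_per_day(after_reduction(tokens, 0.90), calls, rate)
    print(f"standing load before: ${before:,.2f}/day")
    print(f"standing load after:  ${after:,.2f}/day")
```

The point of the sketch is the multiplication by `calls_per_day`: a fixed per-call overhead scales with traffic, which is why it outweighs edits to any single prompt.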

Where it breaks

The signature failure is optimizing the wrong axis. Someone shaves adjectives off a prompt to save tokens while their cache hit rate sits at zero, because a timestamp in the system prompt invalidates it on every call — the real bill is the cache miss, and the prompt edits do nothing. Cheap-everywhere breaks on the hard minority, where the cheap model produces a wrong-but-plausible answer that is worse than a loud failure, because you cannot tell it apart from a right one. Flagship-everywhere is fine at chat scale and ruinous at agent scale, where the fifteen-times multiplier turns a tolerable per-call price into a runaway one. And premature routing complexity — building a machine-learned router before you have proven a single model is too expensive — usually loses to a plain try-the-cheap-model-then-escalate cascade. In each case the spend was real, but the axis under the knife was not the one carrying the cost.
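The plain cascade mentioned above can be sketched in a few lines. The callables `cheap`, `strong`, and `accept` are hypothetical stand-ins for model calls and a validation check; no real client library is assumed. The cascade is only as good as `accept`, since a wrong-but-plausible answer that passes the check is exactly the failure described here.

```python
from typing import Callable


def cascade(task: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str],
            accept: Callable[[str], bool]) -> tuple[str, str]:
    """Try the cheap model first and escalate only when its answer is rejected.

    Returns the answer and the name of the tier that produced it.
    """
    answer = cheap(task)
    if accept(answer):
        return answer, "cheap"
    # Escalation pays for both calls, so it should be the minority path.
    return strong(task), "strong"


if __name__ == "__main__":
    # Stub models for illustration: the cheap one gives up on long inputs.
    def cheap_model(task: str) -> str:
        return "" if len(task) > 20 else f"cheap:{task}"

    def strong_model(task: str) -> str:
        return f"strong:{task}"

    def non_empty(answer: str) -> bool:
        return bool(answer)

    print(cascade("short task", cheap_model, strong_model, non_empty))
    print(cascade("a much longer and harder task", cheap_model, strong_model, non_empty))
```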

Do it now

Before optimizing spend, find the axis that actually drives it. Run this triage in order and fix the biggest item, not the most visible one:

Paste this

Cost-axis triage:
1. CACHE — Is your prompt prefix byte-stable across calls?
   No timestamps, no in-place edits of earlier turns, deterministic key/tool order.
   A cache miss costs ~10x. This is the cheapest win there is — fix it first.
2. TIER — Are you using a frontier model for mechanical single-source work?
   Route grunt work down a tier (~5x cheaper); keep judgment and synthesis on the strong model.
3. CONTEXT — How many tokens load before the user's input?
   Count tool definitions + standing history + attached files. Trim the standing load before the prompt.
4. FAN-OUT — Does this task truly need parallel agents (~15x tokens),
   or is one coherent thread both cheaper and more reliable?

Rule: optimize the biggest axis. Never shave prompt words while the cache is cold.

The triage ships as an artifact you run against any workload, not as advice to keep in mind. Its whole job is to stop the reflex — shortening prompts — and point you at the axis that is actually moving the number.

Worked example

Illustrative. A constructed case to show the failure and the fix, not a real run.

An agent that summarizes incoming tickets "got expensive" — the daily token bill roughly tripled with no change in volume. The instinct is to trim the prompt, so the team cuts the system instructions down and saves a few hundred tokens per call, and the bill barely moves. Running the triage, the cache line fails immediately: the system prompt opens with a freshly generated timestamp on every call. That one token near the front invalidates the cached prefix every time, so 100% of the input — the long stable instructions included — is billed at the uncached rate.

Figure: Where the spend actually was. The prompt-trim fix touching a tiny slice of cost while the real driver, a per-call timestamp forcing a full cache miss, sits untouched.

The fix is to move the timestamp to the end of the context, after the stable prefix, or to drop it if it was never load-bearing. The prefix goes stable again, the hit rate recovers, and the cached portion of the input drops back to roughly a tenth of its uncached price, which brings the bill back toward its earlier level. What the agent is asked to produce does not change, because the fix touches only where the timestamp sits. The cost lived on the cache axis the whole time. Every minute spent on the prompt axis was spent on the wrong constraint.
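The comparison between the two fixes can be put in numbers. This is a sketch with hypothetical sizes (an 8,000-token stable prefix, a 500-token ticket, a 300-token trim) and the illustrative rates quoted earlier. It counts input tokens only and ignores output tokens and any cache-write surcharge a provider may apply.

```python
UNCACHED_PER_MTOK = 3.00  # illustrative rate; absolute prices drift
CACHED_PER_MTOK = 0.30    # roughly a tenth of the uncached rate


def input_cost_per_call(prefix_tokens: int, tail_tokens: int,
                        prefix_cached: bool) -> float:
    """Input cost of one call; only the stable prefix can bill at the cached rate."""
    prefix_rate = CACHED_PER_MTOK if prefix_cached else UNCACHED_PER_MTOK
    return (prefix_tokens * prefix_rate + tail_tokens * UNCACHED_PER_MTOK) / 1_000_000


# Hypothetical sizes for the constructed ticket-summary case.
PREFIX_TOKENS = 8_000
TICKET_TOKENS = 500
TRIMMED_TOKENS = 300

broken = input_cost_per_call(PREFIX_TOKENS, TICKET_TOKENS, prefix_cached=False)
trimmed = input_cost_per_call(PREFIX_TOKENS - TRIMMED_TOKENS, TICKET_TOKENS,
                              prefix_cached=False)
fixed = input_cost_per_call(PREFIX_TOKENS, TICKET_TOKENS, prefix_cached=True)

if __name__ == "__main__":
    print(f"timestamp first, full miss: ${broken:.4f} per call")
    print(f"prompt trimmed, still miss: ${trimmed:.4f} per call "
          f"({1 - trimmed / broken:.1%} saved)")
    print(f"timestamp moved, cache hit: ${fixed:.4f} per call "
          f"({1 - fixed / broken:.1%} saved)")
```

Under these assumptions the trim saves a few percent while the cache fix removes most of the input cost. The total saving stays below tenfold because the ticket text itself is never cached.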