The Model Landscape, Without the Leaderboards

Beginner

After this you can pick the right model for a task by matching its real requirements (context window, modality, latency, cost, reasoning depth) to a tier, instead of reaching for whatever sits on top of a benchmark leaderboard.

Understand

New model names arrive every few weeks. The leaderboards reshuffle. A launch post calls one of them the best model in the world, and three weeks later a different one wears the crown. From the outside this looks like a field you have to track full-time just to stay current.

You don't. Underneath the churn the shape is stable, and has been for a while. Models sort into three tiers by what they cost and what they can do, and the tiers stay put even as the names inside them rotate.

The frontier / reasoning tier holds the most capable, most expensive, slowest models. They do multi-step reasoning, hold long and messy context, and handle the genuinely hard, ambiguous problems where the path isn't obvious. Novel code architecture, deep analysis, judgment calls. You pay the most per token and you wait the longest for an answer.

The workhorse tier sits in the middle. Strong general capability at a fraction of frontier cost. This is where most real production work belongs: summarizing, classifying, drafting, writing standard code, pulling structured fields out of messy text. Balanced on cost and speed, and the right default once you actually know what the task is.

The cheap / fast tier holds the small, quick, low-cost models. They shine on high-volume, simple, latency-sensitive jobs: routing a request to the right handler, a yes/no classification, autocomplete, a cheap pre-filter that runs before a more expensive step. Ask them to reason across many steps or recall broad world knowledge and they fall over.

Figure: The three tiers stay put while names rotate. The durable tier structure, with example model names slotted in as disposable placeholders that you should expect to change.

Treat the names as disposable and the tiers as durable. When a new model drops, the only real question is which tier it lands in and whether it changes your answer for a specific task. Most of the time it doesn't.

There's a second confusion worth clearing early. "The model" and "the product you reach it through" are not the same thing. The model is the trained weights. The chat app, the API, the IDE assistant, the agent harness each wrap those weights in their own system prompt, their own tools, their own hidden context handling. A prompt that behaves one way in a chat app can behave differently through the raw API, because the app quietly injects instructions you never see. When someone says "the model got worse this week," it's often the product that changed its system prompt, not the weights. If you want to judge a model honestly, reach it through the API, where you control the surface. The chat app is not a clean room.

The selection move follows from all of this. Define what the task actually needs, then reach for the cheapest tier that clears the bar, not the most capable one you can afford.

Figure: Selecting a tier: define the bar, then optimize down. The decision path from task requirements to a chosen tier, including the veto checks that can force you back up.

The instinct most newcomers carry is to start at the top and only come down if cost hurts. Flip it. Start where you think the bar lives, prove the tier below it fails, and only then settle. Cost and latency both improve as you move down, so every tier you can drop lowers the bill and speeds up the response at the same time.

The selection axes give you the language for "what does the task need." Context window is how much text the model can hold at once. Long-document work needs a big one, short calls don't. Modality is whether it handles text only, or vision, or audio, and paying for vision on a text task is pure waste. Latency matters for anything a human waits on, less for an overnight batch job. Cost per token scales with volume, so at high volume your tier choice is most of your bill. Reasoning depth is the real divider. Does the task need genuine multi-step thinking, or is it pattern-matching dressed up to look hard?
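The axes above can be written down as data, which makes the hard requirements checkable before any model is called. The following is a minimal sketch, not a real catalog: the `Tier` and `TaskNeeds` classes, the `candidate_tiers` function, and every number in `TIERS` are hypothetical placeholders. Reasoning depth is deliberately left out, because it can't be read off a spec sheet; only running your own examples shows whether a tier clears it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    cost_rank: int                 # 0 = cheapest (stands in for cost per token)
    modalities: FrozenSet[str]     # what the tier accepts, e.g. {"text", "vision"}
    max_context_tokens: int        # hypothetical context window
    typical_latency_s: float       # hypothetical typical response time


@dataclass(frozen=True)
class TaskNeeds:
    modalities: FrozenSet[str]             # what the task must send
    context_tokens: int                    # largest input you expect
    max_latency_s: Optional[float] = None  # None = batch job, nobody waiting


def candidate_tiers(needs: TaskNeeds, tiers: List[Tier]) -> List[Tier]:
    """Drop tiers that miss a hard requirement; return the rest cheapest first."""
    survivors = [
        t for t in tiers
        if needs.modalities <= t.modalities                   # modality supported
        and needs.context_tokens <= t.max_context_tokens      # input fits
        and (needs.max_latency_s is None
             or t.typical_latency_s <= needs.max_latency_s)   # fast enough
    ]
    return sorted(survivors, key=lambda t: t.cost_rank)


# Placeholder numbers, made up for illustration only.
TIERS = [
    Tier("frontier", 2, frozenset({"text", "vision"}), 200_000, 10.0),
    Tier("workhorse", 1, frozenset({"text", "vision"}), 128_000, 2.0),
    Tier("cheap/fast", 0, frozenset({"text"}), 32_000, 0.5),
]

short_text_job = TaskNeeds(modalities=frozenset({"text"}), context_tokens=2_000)
print([t.name for t in candidate_tiers(short_text_job, TIERS)])
# -> ['cheap/fast', 'workhorse', 'frontier']
```

The output is an ordered list of tiers worth testing, not a decision. A task that needs vision would drop the placeholder cheap tier from the list before cost is even considered.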

Where it breaks

Over-buying capability. This is the most common and most expensive mistake, so name it first. Teams default to the frontier tier because it's the safe-feeling choice, then pay 10-30x more per call for reasoning the task never uses. A classification job that a cheap model nails for a fraction of a cent gets run on a frontier model at frontier prices, thousands of times a day. The bill is invisible in testing and brutal at volume. The fix is the procedure above: prove the cheap tier fails before you pay for the expensive one.

Trusting the leaderboard. A benchmark number is not your task's number, for two structural reasons. The first is contamination. Public test sets leak into the web-scraped data models train on, so a model can score high by having effectively seen the answers rather than by reasoning, and you usually can't tell which. The second is overfitting to the benchmark, an instance of Goodhart's law: when a measure becomes a target, it stops being a good measure. Once a benchmark becomes the number labs get judged on, they optimize for it, and it stops tracking the underlying capability it once stood in for. The most-cited public leaderboards are also the ones with the strongest incentive to game. The model that tops them can still lose on your actual codebase, your prompt style, your constraints.

Confusing the product with the model. Benchmarking through a chat app and concluding "this model can't do X" when the app's hidden system prompt is the thing getting in the way. Test through the API where you control the surface, or you're measuring the wrapper, not the weights.

Letting "bigger window is better" pick for you. Filling a larger context window costs more, and answer quality can degrade on long inputs (attention spreads thin and facts in the middle get missed). Match the window to the task. A big window you don't need is a slower, pricier call that can also answer worse.

Treating the choice as permanent. You picked a tier six months ago and never re-checked. Names rotate, prices drop, a new workhorse model clears a bar that used to need the frontier. Your eval is the constant. Re-run it when models ship. Don't assume the new shiny one is better for you, and don't assume your old pick is still right.

Each failure mode has the same shape underneath. The generic advice optimizes for sounding safe (use the best model, track every release, bigger is better) and the operator move optimizes for the task in front of you. That same skeptical question, what specific pain does this remove and do I actually have it, applies just as well to the tooling that surrounds models, which the Toolkit takes up on its own.

Figure: Generic advice vs. operator practice. The same five decisions, with the content-farm default on the left and the operator move on the right.

Do it now

Pick one task you currently run on an AI model, or one you're about to build. Run it through this selection sheet before you commit to a model. It takes ten minutes and routinely saves a tier.

Paste this:

MODEL TIER SELECTION SHEET

TASK: ________________________________________________

1. DEFINE "GOOD ENOUGH"
   - What does a passing answer look like? ________________
   - What is the ONE hard constraint?
     [ ] accuracy   [ ] latency   [ ] cost
     [ ] context length   [ ] modality (vision/audio)

2. BUILD A TINY EVAL (do this once, reuse forever)
   - Collect 10-20 REAL examples from this task
   - Write the expected/acceptable output for each
   - Keep it private. Never publish it.

3. VETO CHECKS (these can force a tier regardless of cost)
   [ ] Needs vision / audio?  -> tier must support modality
   [ ] Input longer than ~50 pages? -> needs large context window
   [ ] Human waiting on it live? -> latency matters, avoid slowest tier

4. RUN THE LADDER (cheapest first that survives the vetoes)
   [ ] Run CHEAP/FAST tier against your eval.  Pass? -> stop here.
   [ ] If it fails, run WORKHORSE tier.        Pass? -> stop here.
   [ ] If it fails, run FRONTIER tier.         Pass? -> use it.

5. RECORD THE DECISION
   - Chosen tier: ________   Why the cheaper one failed: ________
   - Re-run this eval on the next model release. Your eval is
     the constant; the model name is disposable.

The discipline that makes this work is step 2. Ten to twenty real examples from your own task beat any public leaderboard for your decision, because they measure the thing you actually ship against. Once the eval exists, every future model release is a fifteen-minute re-test, not a research project.
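Steps 2 and 4 of the sheet can be sketched as a small harness. This is one possible shape under stated assumptions, not a reference implementation: `accuracy` and `run_ladder` are hypothetical names, each "model" is any callable that takes a string and returns a string (in practice a thin wrapper around whichever API you use), and scoring is exact match, which suits label-style tasks and little else. The two stand-in models and the three eval rows are toys so the block runs on its own.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str]        # (input text, expected output)
Model = Callable[[str], str]     # stand-in for a call to a real model


def accuracy(model: Model, eval_set: List[Example]) -> float:
    """Fraction of examples where the model's answer matches exactly."""
    correct = sum(
        1 for text, expected in eval_set
        if model(text).strip().lower() == expected.strip().lower()
    )
    return correct / len(eval_set)


def run_ladder(
    ladder: List[Tuple[str, Model]],   # ordered cheapest first
    eval_set: List[Example],
    pass_bar: float,                   # e.g. 0.9 = 90% of examples correct
) -> Tuple[Optional[str], Dict[str, float]]:
    """Try tiers cheapest first; stop at the first one that clears the bar."""
    scores: Dict[str, float] = {}
    for name, model in ladder:
        scores[name] = accuracy(model, eval_set)
        if scores[name] >= pass_bar:
            return name, scores
    return None, scores                # nothing passed: revisit the task or the bar


# Toy stand-ins so the sketch runs without any API access.
def toy_cheap(text: str) -> str:
    return "other"                     # always answers the same thing


def toy_workhorse(text: str) -> str:
    return "billing" if "refund" in text else "bug"


toy_eval = [
    ("please refund my last invoice", "billing"),
    ("app crashes when I log in", "bug"),
    ("refund not received yet", "billing"),
]

chosen, scores = run_ladder(
    [("cheap/fast", toy_cheap), ("workhorse", toy_workhorse)],
    toy_eval,
    pass_bar=0.9,
)
print(chosen, scores)
```

Keeping the eval set and the pass bar in one place is what turns a new model release into a re-run: swap the callable, keep everything else.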

Worked example

Illustrative scenario — constructed to show the pattern. Numbers are made up to be realistic, not measured from a specific product.

A small team is building a support inbox triage feature. Every incoming email needs one of four labels (billing, bug, feature-request, or other) so it routes to the right queue. They expect roughly 8,000 emails a day.

Their first instinct is the one almost everyone has: use the best model available, because correctness matters and support is customer-facing. They wire it to the frontier tier and it works beautifully in testing. Then someone runs the selection sheet before launch.

Step 1, the requirement: a four-way label on short text. Step 2, they pull 20 real emails and write the correct label for each. Step 3, the veto checks. Text only, short inputs, and the labeling happens in the background, so no live latency pressure. Nothing forces a high tier. Step 4, they run the ladder from the bottom.

Paste this:

EVAL: 20 real support emails, one of 4 labels each
(illustrative numbers)

CHEAP/FAST tier
  Accuracy on eval ........ 19/20 correct
  Cost per 1k calls ....... ~$0.05
  -> PASSES the bar. Stop the ladder here.

WORKHORSE tier (run for comparison only)
  Accuracy on eval ........ 20/20 correct
  Cost per 1k calls ....... ~$0.80

FRONTIER tier (their original pick)
  Accuracy on eval ........ 20/20 correct
  Cost per 1k calls ....... ~$8.00

DAILY VOLUME: ~8,000 emails
  Frontier:  ~$64/day   -> ~$1,900/month
  Cheap/fast: ~$0.40/day -> ~$12/month

The cheap tier missed one of twenty. The frontier tier their instinct chose was perfect on the eval and roughly 160 times the price at this volume. The honest question is not "which scores highest" but "what does one extra correct label out of twenty cost, and is it worth roughly $1,900 a month?" Scale the miss rate before answering. If one in twenty held at 8,000 emails a day, that is about 400 misroutes a day, and twenty examples only give a rough estimate of the true rate. For many triage flows the answer is still no, because a misrouted email is cheap to fix downstream: someone re-queues it. If misroutes were costly instead, the workhorse tier, perfect on this eval at one-tenth the frontier price, would still beat the original pick.
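The arithmetic behind that comparison is short enough to check by hand or in a few lines. The sketch below reuses the illustrative numbers from the eval table. The 30-day month is an assumption, and so is scaling the eval miss rate straight to daily volume, since 20 examples give only a rough estimate of the real rate.

```python
DAILY_VOLUME = 8_000      # emails per day, from the scenario
DAYS_PER_MONTH = 30       # assumption for the monthly figure
EVAL_SIZE = 20

# Illustrative numbers copied from the eval table above, not measurements.
COST_PER_1K_CALLS = {"cheap/fast": 0.05, "workhorse": 0.80, "frontier": 8.00}
EVAL_CORRECT = {"cheap/fast": 19, "workhorse": 20, "frontier": 20}


def monthly_cost(tier: str) -> float:
    """Dollars per month at the scenario's daily volume."""
    return COST_PER_1K_CALLS[tier] * DAILY_VOLUME / 1000 * DAYS_PER_MONTH


def expected_daily_misses(tier: str) -> float:
    """Misroutes per day if the eval miss rate held at full volume."""
    miss_rate = (EVAL_SIZE - EVAL_CORRECT[tier]) / EVAL_SIZE
    return miss_rate * DAILY_VOLUME


for tier in COST_PER_1K_CALLS:
    print(
        f"{tier:<11} ${monthly_cost(tier):>8,.0f}/month   "
        f"~{expected_daily_misses(tier):,.0f} expected misroutes/day"
    )

ratio = monthly_cost("frontier") / monthly_cost("cheap/fast")
print(f"frontier costs about {ratio:.0f}x the cheap tier")
```

Putting the monthly cost and the expected misroutes side by side is the point: the decision is a trade between dollars and misroutes, and neither number means much alone.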

Nothing about the task changed between the two choices. The operator simply defined the bar first and bought down to it, rather than reaching for the top of a leaderboard and paying for capability the task never touched.