Prompt Patterns Library
After this you can pull a named prompt pattern off a shelf, paste its template into your task, and know the one condition that breaks it before it does.
A pattern is not a magic phrase. It is a shape of instruction that holds up across many inputs, and every one of them has a place where it stops working. The shelf below pairs each pattern with that place, because the boundary is the part that turns a copied trick into something you can rely on. Naming where a technique fails is the fastest tell that someone has actually shipped with it rather than read about it.
The instinct most people bring is accumulate-and-trust. They collect clever prompts, paste them verbatim, and assume more sophistication means more accuracy. Both halves of that instinct break. Sophistication is non-monotonic, so a more elaborate prompt can score lower than a plain one on the same task. And a prompt that worked in someone else's post worked because of its specificity to their input, not its wording. When you adapt a pattern, carry the specificity that made it work, not the exact sentences. The template is the skeleton; your task supplies the muscle.
The deeper point is that these patterns are the small visible part of a measurement-driven discipline. The thing that separates a prompt that ships from one that fiddles is not the pattern, it is whether you ran it against a small fixed set of inputs and watched what happened. The named anti-pattern here is treating prompts as a gallery, copy-pasting phrasings with no evals and no versioning, optimizing the smallest slice of the problem while the failures hide in the inputs you never tested. The shelf gives you the shapes. You still own the part where you check them.
A browsable set of patterns. Each entry is the pattern, the template you can paste, and the boundary that decides whether to trust it on your task. Read the boundary line first.
Chain-of-thought. Ask the model to work through its reasoning before it answers, instead of jumping to a verdict. The classic trigger is a line telling it to think first.
Work through this step by step before giving your final answer.
<the problem>
Show your reasoning, then end with: Final answer: <answer>Where it breaks: the explicit "think step by step" instruction earned its reputation on older models that did not reason by default. Newer models often reason internally whether you ask or not, so the line can add tokens and latency without moving accuracy. Treat it as something to test on your model, not a reflex to keep forever.
ReAct (reason then act). The loop pattern behind tool-using agents. The model alternates between a reasoning step and an action step, observing the result of each action before deciding the next. Introduced in the ReAct paper (arxiv 2210.03629), it is the bridge from a single prompt to an agent that does things.
You have these tools: <tool list with one-line descriptions>
For each step, output:
Thought: <what you're reasoning about>
Action: <the tool to call and its input>
Then I'll give you the Observation. Repeat until you can answer, then output:
Final answer: <answer>Where it breaks: every step depends on the one before it, so a wrong observation early poisons everything downstream. At 95% reliability per step a ten-step loop lands near 60% end-to-end. The pattern is only as good as your stop condition and your willingness to check the steps, not just the answer.
Few-shot. Show two to five worked examples of the input-to-output mapping you want, then give the real input. Examples beat explanations when you want a specific shape, tone, or edge-case handling. One good demonstration outperforms a paragraph describing the rule.
Here are examples of the task:
Input: <example input 1>
Output: <ideal output 1>
Input: <example input 2>
Output: <ideal output 2>
Now do the same for:
Input: <your real input>Where it breaks: this is the pattern with the sharpest cliff, so it gets the worked example below. More examples is not strictly better. A routing task on Gemini 3 Flash scored 93% at zero-shot, climbed as examples were added, then collapsed to 30% at eight examples; Qwen 3.5 fell from 56% to 0% on a code-fixing task once more examples piled on.
Order matters too. Changing only the order of the same examples swung sentiment accuracy from 54% to 93% in the Calibrate-Before-Use study (Zhao et al., arxiv 2102.09690), because the model over-repeats whichever label sits last. And position swings it again. Relocating an identical demo block from the start to the end of the prompt moved accuracy by 20 to 50 points with no content change (DPP, arxiv 2507.22887), with demos placed before the query reliably more stable. So few-shot is a precision instrument, not a volume knob. Use a few balanced, boundary-spanning examples, place them before the task, and cap the count.
Self-consistency. Run the same reasoning prompt several times at a nonzero temperature, then take the majority answer instead of trusting a single roll. It buys accuracy on problems with one correct answer by averaging out the unlucky paths.
[Run this chain-of-thought prompt N times, then keep the most common final answer.]
Work through this step by step, then end with: Final answer: <answer>
<the problem>Where it breaks: you pay N times the tokens for one answer. It is worth it only when correctness justifies the cost (a number that feeds a decision) and pointless when any reasonable answer will do (a brainstorm). It also assumes there is a single answer to vote on, so it does nothing for open-ended generation.
Structured output over JSON-begging. When you need machine-readable output, stop writing "return only valid JSON, no markdown" in the prompt and use the model's structured-output or tool-calling mode instead. One production team cut JSON parse errors from 8% to 0.3% by switching from prose instructions to tool-calling. Begging in prose tries to enforce something the model does not guarantee.
[Use the API's structured-output / tool-calling feature, not a prose instruction.]
Define the schema in the tool definition:
{ "name": "...", "fields": { "summary": "string", "confidence": "number" } }Where it breaks: forcing a single pass to both reason and emit strict JSON can degrade the reasoning, because the model spends attention on shape instead of substance. When the answer needs real thinking, let it reason in free text first, then extract the structured fields in a second step. Shape and reasoning compete for the same pass; split them when both matter.
Positive over negative specification. Tell the model the target to hit, not the thing to avoid. "Respond in at most three sentences" beats "don't be verbose." "Only discuss our product" beats "don't mention competitors." Models hit a named target more reliably than they avoid a named hazard.
Respond in at most 3 sentences. (not: don't be verbose)
Only reference the attached doc. (not: don't make things up)
Output a single number. (not: don't add explanation)Where it breaks: it depends on you actually having a target to name. When you genuinely only know what you don't want, you have a specification gap to close first, and a negative rule is a sign you have not decided the positive shape yet.
One job per prompt. Ask for one outcome per call. A prompt told to summarize, translate, score, and reformat in one shot produces mediocre work on all four, because every added instruction competes for attention. The IFScale benchmark found even the best frontier models top out near 68% instruction-following at maximum instruction density, with a bias toward the earlier instructions. Split the jobs.
[One prompt = one outcome.]
Job: <the single thing you want>
Input: <the input>
Output format: <the one shape you need back>Where it breaks: splitting into a chain trades one risk for another. Each handoff is a place a wrong output can poison the next step, so the more you decompose, the more you need a checkpoint between stages. One job per prompt is right; naive chaining without a verification gate between links is how the error you split out comes back compounded.
Role / persona. Open with "You are a <role>" to set the model's register, vocabulary, and the slice of knowledge it draws from. Useful for tone and framing.
You are a <role> writing for <audience>.
Your job: <task>. Keep the register <formal / plain / technical>.Where it breaks: a role sets voice, not correctness. It does not make the model meaningfully better at hard reasoning, and an overly creative persona can make output less stable. Reach for it to shape how an answer reads, not to fix whether an answer is right.
The copy-pasteable artifacts here are the templates themselves. The thing that makes them operator-grade rather than a gallery is the boundary line on each. Before you trust any pattern on a real task, run it against three to five fixed inputs, change one thing, and compare. The pattern gives you the shape; that small loop is what tells you it works on your inputs and not just in the example.
Worked example
IllustrativeIllustrative. A constructed task to show the few-shot boundary respected, not a real run.
The task is classifying incoming support emails as Billing, Bug, or Feature. The accumulate-and-trust move is to paste eight examples and assume more is safer:
Prompt: Here are 8 examples of emails labeled Billing, Bug, or Feature [eight examples, six of them Feature because those were easiest to find]. Now classify: "The export button spins forever and never downloads."
Output: Feature
It calls a plain bug a Feature. Two boundaries fired at once. The example set was unbalanced (six of eight were Feature), so the model leaned toward the majority label, and at eight examples the task was past the point where more demonstrations help. The fix respects both cliffs: fewer examples, balanced across the labels, and one of them deliberately on the hard boundary:
Prompt: Examples: Input: "Charged twice this month." Output: Billing Input: "App crashes when I hit save." Output: Bug Input: "Wish there was a dark mode." Output: Feature Input: "The export spins and never finishes." Output: Bug Now classify: "The export button spins forever and never downloads."
Output: Bug
Same model, same email. The second works because three balanced examples plus one that spans the bug-versus-feature boundary teach the decision better than eight lopsided ones. If you had only eyeballed the first output on one lucky email it might have looked fine; running both versions against a fixed set of twenty labeled emails is what surfaces the saturation-and-imbalance failure before it ships.