Prompting Fundamentals That Actually Move Output

Beginner

After this lesson, you can take a prompt that is giving you mediocre results and improve it on the next try by being more specific and cutting what does not earn its place, rather than piling on more instructions and examples and hoping.

Understand

Most beginners fix a bad prompt by adding to it: a role, three more examples, a paragraph of rules, a magic phrase someone swore by. The prompt swells, the output gets worse rather than better, and the reflex is to add even more. This is the single most common self-inflicted wound in prompting, and it runs backwards from what a practiced operator does.

Two forces drive results, and neither one is volume. The first is specificity. A model cannot read your mind. When you ask for "a short professional email" it has to guess what short means, what professional means to you, and what the email is supposed to accomplish, and it fills every gap with the statistical average of everything it has seen. The fix is not more words. It is more specific words. "Three sentences, plain, no exclamation marks, asking the client to confirm the Friday deadline" removes the guesses. Notice that is shorter than a vague version padded with adjectives, not longer.

The second force is subtraction, and it is the part nobody teaches a beginner. Everything you put in a prompt competes for the model's attention. Every instruction you add can quietly conflict with one you wrote earlier and forgot. Past a point, more material does not sharpen the output. It dilutes it. This is not a motivational claim; it has been measured. On a benchmark that packs hundreds of instructions into a single prompt, even the strongest current models top out around 68 percent instruction-following at high density, and they favor the instructions near the start of the prompt while the later ones quietly fall away. Sophistication does not save you either: more elaborate prompts sometimes score lower than plain ones on the same task. The prompt that finally works is often shorter than the draft you started with.

Examples are the sharpest version of this trap, because they feel so obviously helpful. Showing the model one good example of the output you want usually beats a paragraph describing it. So beginners reason that if one example helps, eight must help more. They do not. Example count has a saturation curve, and past the peak it does not plateau, it collapses.

Figure: The few-shot saturation curve, the characteristic climb-then-collapse shape. Illustrative: the two measured points are 0-shot at 93% and 8-shot at 30% (Gemini 3 Flash, routing task); the curve between them shows the shape, not per-shot measurements.

That curve is real. A team testing a routing task scored 93 percent with zero examples, climbed a little as they added a few, then watched accuracy collapse to 30 percent by the eighth example. On a separate code-fixing task another model fell from 56 percent all the way to 0 percent as examples piled up. Order matters as much as count. Rearranging the same set of examples, changing nothing but the sequence, has swung accuracy on a classification task from near-chance to near-best, because the model over-weights whatever label it saw last and whatever label it saw most. Examples work like a precision instrument rather than a volume knob: a few that span the boundary you care about, including the hard or ambiguous case, beat a long list of easy ones.
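
The count-and-balance point can be checked mechanically before a prompt ever runs. The following is a minimal sketch, not a standard tool: `audit_examples` and its `max_share` threshold of 0.5 are hypothetical names and values chosen for illustration, and the sample messages are made up. It flags a label that dominates the example list and any label with no example at all. It does not check ordering.

```python
from collections import Counter


def audit_examples(examples, labels, max_share=0.5):
    """Return warnings about label skew in a few-shot example list.

    examples:  list of (text, label) pairs, in prompt order.
    labels:    every label the task allows.
    max_share: assumed threshold; warn when one label exceeds this share.
    """
    warnings = []
    counts = Counter(label for _, label in examples)

    # A label with no example gives the model nothing to anchor on.
    for label in labels:
        if counts[label] == 0:
            warnings.append(f"no example for label '{label}'")

    # A dominant label pulls predictions toward itself.
    if examples:
        top_label, top_count = counts.most_common(1)[0]
        share = top_count / len(examples)
        if share > max_share:
            warnings.append(f"'{top_label}' is {share:.0%} of the examples")

    return warnings


LABELS = ["bug", "billing", "feature-request"]

# Hypothetical lopsided set: six of eight examples share one label.
lopsided = [(f"billing message {i}", "billing") for i in range(6)]
lopsided += [("app crashes on login", "bug"), ("add dark mode", "feature-request")]

# Hypothetical balanced set: one example per label.
balanced = [
    ("charged twice", "billing"),
    ("button does nothing", "bug"),
    ("add CSV export", "feature-request"),
]

print(audit_examples(lopsided, LABELS))
print(audit_examples(balanced, LABELS))
```

A clean audit only rules out the obvious skew. Whether the examples span the hard boundary is still a judgment you make by reading them.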

So the operator habit is the inverse of the beginner reflex. When a prompt underperforms, get more specific about what you actually want, and strip out everything that is not pulling weight, before you add a single new thing.

Where it breaks

Subtraction is not a universal law, and treating it as one fails just as hard as over-prompting. If the model is wrong because it genuinely does not know something, no amount of trimming or rephrasing fixes a knowledge gap. That is a job for giving it the actual information, which is a different lesson. Cut too far and you hit the opposite failure: starve the prompt of a constraint it really needed, and the model will not stop to ask. It will guess, confidently, and the guess looks exactly as polished as a correct answer, so you will not catch it by glancing at the output. There is also no gauge for any of this. Nothing tells you the moment a prompt tips from lean to starved or from rich to bloated. You read it from the results, which means you only know after you run it. One more honest caveat: a model provider can quietly update the model underneath you, so a prompt you tuned last month can drift even though you never touched a word of it.

Do it now

Change one thing at a time, and judge it on more than a single lucky output. Beginners tweak the wording, look at one result, decide it is better, and ship a prompt that is overfit to that one input. Operators keep three or four test inputs around, change exactly one variable, run all of them, and compare. You do not need tooling for this. Paste this control block at the top of your own prompt and fill the brackets. It forces the specificity and signals the subtraction in one move.

Paste this

You are answering for: <who reads this and what they need to do with it>
Produce exactly: <the output shape — e.g. "3 bullet points, under 15 words each">
Must include: <the 1-3 things that genuinely bind — deadline, format, constraint>
Do NOT: <name the failure you keep seeing — e.g. "no preamble, no 'I hope this helps'">
If a detail you need is missing above, ask for it. Do not fill the gap with a guess.

Two lines do the heavy lifting. "Must include" caps you at the few constraints that actually bind, which stops the pile-on, because you cannot list fifteen things on a line meant for three. The last line covers the starvation risk: a prompt you trimmed too hard surfaces as a question instead of a confident wrong answer.
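
If you fill the block from code rather than by hand, the cap on "Must include" can be enforced instead of merely implied. This is a minimal sketch under that assumption: `build_control_block` and its `max_constraints` parameter are hypothetical names, and the sample values are invented for illustration.

```python
def build_control_block(reader, output_shape, must_include, do_not, max_constraints=3):
    """Fill the five-line control block, refusing more than a few binding constraints.

    must_include: list of 1 to max_constraints short constraint strings.
    """
    if not 1 <= len(must_include) <= max_constraints:
        raise ValueError(
            f"expected 1 to {max_constraints} binding constraints, got {len(must_include)}"
        )
    lines = [
        f"You are answering for: {reader}",
        f"Produce exactly: {output_shape}",
        f"Must include: {'; '.join(must_include)}",
        f"Do NOT: {do_not}",
        "If a detail you need is missing above, ask for it. Do not fill the gap with a guess.",
    ]
    return "\n".join(lines)


# Hypothetical values, for illustration only.
block = build_control_block(
    reader="a client who must confirm a deadline",
    output_shape="3 sentences, plain, no exclamation marks",
    must_include=["the Friday deadline", "a request to confirm"],
    do_not="no preamble, no 'I hope this helps'",
)
print(block)
```

Raising an error on a fourth constraint is a deliberate design choice here: it forces you to decide which constraints actually bind rather than letting the list grow.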

When a prompt is underperforming and you are tempted to fix everything at once, do not. Run the change through this:

Figure: One change, then measure. The loop that separates engineering a prompt from fiddling with it: one hypothesis, one variable, judged on a fixed set of inputs.

The rule the diagram enforces: never change the wording, the examples, and the model in the same pass. If you change three things and it improves, you have no idea which one helped, so you cannot do it again. Change one, measure on the set, keep or revert.
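
The change-one-then-measure loop can be sketched in a few lines. Everything named here is an assumption for illustration: `score` and `keep_or_revert` are hypothetical helpers, the test messages are invented, and `fake_model` is a keyword stub standing in for a real model call so the sketch runs offline. It does not reflect how any real model behaves.

```python
def score(prompt, test_cases, call_model):
    """Fraction of test cases where the output matches the expected label."""
    hits = 0
    for message, expected in test_cases:
        output = call_model(prompt, message)
        hits += int(output.strip().lower() == expected)
    return hits / len(test_cases)


def keep_or_revert(baseline, candidate, test_cases, call_model):
    """Compare two prompts that differ in ONE variable on the same fixed inputs."""
    base = score(baseline, test_cases, call_model)
    cand = score(candidate, test_cases, call_model)
    decision = "keep" if cand > base else "revert"
    return decision, base, cand


def fake_model(prompt, message):
    # Offline stub, NOT real model behavior: replace with your own model call.
    text = message.lower()
    if "wish" in text or "export" in text:
        return "feature-request"
    knows_rule = "broken payment is a bug" in prompt.lower()
    if knows_rule and ("spun" in text or "does nothing" in text):
        return "bug"
    return "billing"


# Hypothetical fixed test set, kept separate from any examples in the prompt.
TEST_CASES = [
    ("I was billed for two seats but only have one", "billing"),
    ("The pay button does nothing when clicked", "bug"),
    ("Wish I could export invoices to CSV", "feature-request"),
    ("Payment page spun forever and then charged me", "bug"),
]

baseline = "Classify the message as exactly one of: bug, billing, feature-request."
# The single change under test: one added boundary rule.
candidate = baseline + " A broken payment is a bug (something is malfunctioning), not billing."

print(keep_or_revert(baseline, candidate, TEST_CASES, fake_model))
```

With a real model the outputs vary from run to run, so a small gain on four inputs may be noise. Treat the decision as a prompt to look at the individual failures, not as proof.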

Worked example

Illustrative. A constructed before-and-after to show the habit, not a real session.

Someone wants a model to label incoming support messages as bug, billing, or feature-request. The beginner instinct is to teach it hard, so they write a rule for every situation and stack up eight examples:

Prompt: You are an expert support triage AI with years of experience. Carefully and thoughtfully classify each message. If it mentions money, it might be billing, unless it's about a broken payment, which is a bug, unless they're asking for a refund feature... [six more rules] Here are examples: [eight labeled messages, six of them billing]. Now classify this message.

Output: billing (the message was a clear bug report)

It gets it wrong, and it will keep over-predicting billing, because six of the eight examples were billing and the model leans on the label it saw most. The rules also contradict each other, and the model has no way to tell which one wins. Now the subtraction-and-specificity version. Cut the rules to the one distinction that actually trips it up, and give three balanced examples that span the hard boundary instead of eight lopsided ones:

Prompt: Classify the message as exactly one of: bug, billing, feature-request. A broken payment is a bug (something is malfunctioning), not billing. Examples: "Charged twice this month" → billing. "Checkout button does nothing when I pay" → bug. "Wish I could export to CSV" → feature-request. Message: "Card was declined but the page just spun forever and then charged me anyway." → ?

Output: bug

Shorter prompt, fewer examples, better answer. The win came from cutting the contradictory rules down to the single boundary that mattered and balancing the three examples across the categories, not from teaching the model harder. If you later find a category it still confuses, you change that one example pair and re-run your handful of test messages, not the whole prompt at once.

A related habit makes this stick. State what you want, not what you do not want. "Respond in at most three sentences" lands more reliably than "don't be verbose," and "only discuss our product" beats "don't mention competitors," because a model hits a target you name far better than it avoids one you merely forbid. Shaping the output's structure beyond this, such as getting clean tables or strict formats out of a model, is its own lesson and worth treating separately.