laranevans.com
Few-Shot Prompting

Few-shot prompting is the practice of including a small number of input-output examples directly in the prompt to guide the model's behavior on a new input. The model is not retrained on the examples. It sees them as part of the prompt context and uses them to infer the task. The practice is also called in-context learning, the term Brown et al. (2020) used when GPT-3 demonstrated that sufficiently large language models could perform competitively on many NLP tasks from a handful of examples alone, without any gradient updates.

One decision generates the whole space

Before choosing few-shot, you are choosing among four ways to specialize a model for a task: zero-shot, few-shot, dynamic few-shot, and fine-tuning. Two questions generate all four. How many examples does the model see, and are those examples fixed in the prompt or selected per query? Below the threshold of a model update, the answer to the first question runs from none to a handful. Cross that with whether the examples stay constant, and the space resolves to those four options.

The naming convention reflects the example count. Zero-shot uses no examples, instruction only. One-shot uses one example. Few-shot uses a handful. Above that count, the practice merges with the retrieval-augmented patterns where examples are selected dynamically rather than fixed in the prompt.
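As a minimal sketch of that naming convention, the snippet below assembles zero-shot, one-shot, and few-shot prompts from the same instruction. The `build_prompt` helper, the `Input:`/`Output:` layout, and the sample reviews are illustrative assumptions, not a prescribed format.

```python
from typing import Sequence


def build_prompt(instruction: str, examples: Sequence[tuple[str, str]], query: str) -> str:
    """Assemble a prompt: the instruction, then k worked examples, then the open query."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The query reuses the example layout, with the output slot left empty.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical task and data, for illustration only.
INSTRUCTION = "Classify the sentiment of each review as positive or negative."
EXAMPLES = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
    ("Setup took five minutes and just worked.", "positive"),
]
QUERY = "The keyboard feels cheap."

zero_shot = build_prompt(INSTRUCTION, [], QUERY)            # instruction only
one_shot = build_prompt(INSTRUCTION, EXAMPLES[:1], QUERY)   # one example
few_shot = build_prompt(INSTRUCTION, EXAMPLES, QUERY)       # a handful

print(few_shot)
```

The only thing that changes between the three variants is the length of the example list; the instruction and the query slot stay the same.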

Examples activate a capability rather than teaching one

The most intuitive story is that examples teach the model the task. Show it the input-output mapping, and it generalizes the pattern. That story is partly right and partly misleading.

It is right in the sense that demonstrations make the desired pattern concrete in a way that descriptions cannot. A description leaves room for interpretation. An example demonstrates what counts. Examples also set format, register, and level of detail implicitly, without requiring the prompt to spell each one out.

It is misleading in the sense that the model is not "learning" the task from the examples in the way that word usually implies. Webson and Pavlick (2022) found that models often learn about as fast from prompts that are intentionally irrelevant or even pathologically misleading as they do from instructively good ones, and that performance depends more on the choice of output labels than on the text of the prompt itself. Pre-training already established the model's capability. The examples primarily activate the right capability, in the right format, with the right granularity. That framing is a better guide to choosing examples than the simple show-don't-tell story.

When few-shot helps

Few-shot helps most when the task has properties that pure instruction-following handles poorly.

  • Ambiguous tasks. When the instruction admits multiple reasonable interpretations, examples disambiguate by showing which interpretation the author wants. "Summarize this article" could mean a one-sentence headline or a three-paragraph abstract. An example fixes the answer.
  • Format-constrained tasks. When the output must follow a specific shape (a particular JSON schema, a specific markdown structure, a fixed prose pattern), examples set the shape more reliably than descriptions of the shape. This overlaps with structured outputs but does not replace it. The strongest format guarantees still come from constrained decoding or schema validation, not from in-prompt examples alone.
  • Boundary cases the instruction cannot describe. Some distinctions are easier to demonstrate than to articulate. "Treat sarcastic praise as negative sentiment" is a rule. The example "The interface is so intuitive I needed a tutorial to find the save button", labeled negative, teaches the same rule by demonstration and can cover cases the rule did not anticipate.
  • Novel or unusual tasks. When the task sits outside common training distributions, examples ground the model in what the author wants rather than the closest familiar task the model would otherwise default to.

For well-understood, unambiguous tasks with no special format requirements, zero-shot is often fine and sometimes better. Adding examples to a task the model already handles well can introduce noise, bias the output toward the examples' specific phrasing, or distract the model from instructions that were working.

Four levers control example quality

Once you commit to few-shot, the choice of examples carries the result. The same examples in a different order can produce wildly different results, and a poor selection can be worse than no examples at all. Four levers control quality: how many examples (count), which ones (selection), in what sequence (order), and in what layout (format). The reference table below states what each lever moves and what to do about it. The discussion that follows works through why each one matters.

Count tops out fast. The naive intuition is that more examples are always better. The actual relationship is task-dependent. For many classification and short-form generation tasks, performance often plateaus by five to ten examples and can degrade beyond that. For complex reasoning tasks, more examples sometimes help by giving the model more variations of the desired reasoning pattern, but the marginal gain shrinks quickly. Start with a small count (three to five) and add only when evaluation shows it helps.
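One way to let evaluation decide the count is to sweep it and measure accuracy at each value. The sketch below assumes a `model(shots, query)` callable that returns a label; `toy_model` is a hypothetical word-overlap stand-in for a real LLM call, and the pool and evaluation set are made-up data.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, gold label)
Model = Callable[[Sequence[Example], str], str]


def accuracy_by_count(model: Model, pool: Sequence[Example],
                      eval_set: Sequence[Example], counts: Sequence[int]) -> dict[int, float]:
    """Measure accuracy on eval_set when the prompt carries the first k pool examples."""
    results = {}
    for k in counts:
        shots = pool[:k]
        correct = sum(model(shots, text) == gold for text, gold in eval_set)
        results[k] = correct / len(eval_set)
    return results


def toy_model(shots: Sequence[Example], query: str) -> str:
    """Hypothetical stand-in for an LLM: copy the label of the shot sharing the most words."""
    if not shots:
        return "positive"  # arbitrary default when no examples are given
    words = set(query.lower().split())
    best = max(shots, key=lambda shot: len(words & set(shot[0].lower().split())))
    return best[1]


# Made-up data for illustration.
POOL = [
    ("battery lasts all day", "positive"),
    ("screen cracked in a week", "negative"),
    ("love the bright screen", "positive"),
    ("battery died after an hour", "negative"),
    ("great sound and great price", "positive"),
]
EVAL_SET = [
    ("great battery life", "positive"),
    ("love the screen", "positive"),
    ("battery died fast", "negative"),
    ("screen cracked", "negative"),
]

results = accuracy_by_count(toy_model, POOL, EVAL_SET, counts=[0, 1, 3, 5])
for k, acc in results.items():
    print(f"k={k}: accuracy={acc:.2f}")
```

With a real model, the same loop shows where the curve flattens, which is the point at which adding examples stops paying for the extra tokens.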

Selection should cover the space of inputs the model will see, not the space the author finds easy to imagine. Real production inputs often differ from cases an author would mentally enumerate. Sampling examples from actual usage tends to outperform writing them from scratch. When examples are written by hand, deliberately include cases that look slightly different from each other rather than rephrasings of the same case. Diversity in the examples shows the model what dimensions are allowed to vary.
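A rough sketch of selecting for diversity, assuming the candidates are inputs sampled from real usage: greedily pick the candidate that is farthest from everything already chosen. Token-level Jaccard distance is used here only as a crude, dependency-free stand-in for a proper similarity measure, and `select_diverse` is a hypothetical helper name.

```python
def jaccard_distance(a: str, b: str) -> float:
    """1 minus the token-set overlap of two strings; 0.0 means identical vocabularies."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def select_diverse(candidates: list[str], k: int) -> list[str]:
    """Greedy farthest-point selection: each pick maximizes distance to its nearest chosen item."""
    if k <= 0 or not candidates:
        return []
    chosen = [candidates[0]]
    remaining = candidates[1:]
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda c: min(jaccard_distance(c, s) for s in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Made-up support-ticket inputs; the second is a near-rephrasing of the first.
candidates = [
    "refund my order please",
    "please refund my order now",
    "app crashes on login",
    "how do I change my email",
]
picked = select_diverse(candidates, k=3)
print(picked)
```

The near-duplicate is skipped in favor of inputs that vary along a different dimension, which is the behavior the paragraph above argues for.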

Order matters more than most authors expect. Lu et al. (2022) showed that the same set of examples in a different permutation can swing few-shot accuracy from near state-of-the-art to near-random, and that the effect persists across model sizes. The paper proposes an entropy-based heuristic for ordering. In practice, most teams treat order as a hyperparameter to evaluate rather than something to optimize analytically.
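Treating order as a hyperparameter can be as simple as scoring a bounded number of permutations on a held-out set. This sketch is not the entropy-based method from Lu et al.; it is plain brute-force evaluation. The `recency_biased_model` stub is a deliberately exaggerated, hypothetical stand-in that always echoes the label of the last example, so the effect of order is visible without a real model.

```python
from itertools import permutations
from typing import Callable, Sequence

Example = tuple[str, str]
Model = Callable[[Sequence[Example], str], str]


def score_orderings(model: Model, shots: Sequence[Example],
                    eval_set: Sequence[Example], max_orderings: int = 24):
    """Return (accuracy, ordering) pairs, best first, for up to max_orderings permutations."""
    scores = []
    for i, ordering in enumerate(permutations(shots)):
        if i >= max_orderings:  # permutations grow factorially, so cap the search
            break
        correct = sum(model(ordering, text) == gold for text, gold in eval_set)
        scores.append((correct / len(eval_set), ordering))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores


def recency_biased_model(shots: Sequence[Example], query: str) -> str:
    """Hypothetical stub: ignores the query and repeats the label of the final example."""
    return shots[-1][1] if shots else "positive"


# Made-up data for illustration.
SHOTS = [
    ("battery lasts all day", "positive"),
    ("screen cracked in a week", "negative"),
    ("love the bright screen", "positive"),
]
EVAL_SET = [
    ("great sound", "positive"),
    ("works as advertised", "positive"),
    ("fast shipping", "positive"),
    ("stopped charging", "negative"),
]

scores = score_orderings(recency_biased_model, SHOTS, EVAL_SET)
best_acc, best_order = scores[0]
worst_acc, _ = scores[-1]
print(f"best={best_acc:.2f} worst={worst_acc:.2f}")
```

The gap between the best and worst ordering is the number to watch; if it is large on a real model, the ordering should be fixed and recorded alongside the prompt.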

Format has to stay consistent. Sclar et al. (2023) found a spread of up to 76 accuracy points on LLaMA-2-13B from semantically equivalent format variations: different separator characters, whitespace, label punctuation, and so on. The same examples laid out two different ways can produce two different model behaviors. Pick a format and apply it uniformly across examples. Do not let the format vary between the demonstrated examples and the actual query.
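To see how sensitive a given model is before committing to a layout, one option is to render the same examples under several semantically equivalent formats and evaluate each. The sketch below only generates the variants; the separators and field styles are illustrative choices, not the set studied by Sclar et al., and scoring them against a model is left to whatever evaluation harness is in use.

```python
from itertools import product

# Illustrative, semantically equivalent layout choices (assumptions, not a standard).
SEPARATORS = ["\n", "\n\n", "\n###\n"]
FIELD_STYLES = ["{name}: {value}", "{name} - {value}", "{name}:\n{value}"]


def render(shots, query, separator, field_style):
    """Render shots and the open query with one separator and one field style throughout."""
    def item(text, label):
        fields = [
            field_style.format(name="Input", value=text),
            field_style.format(name="Output", value=label),
        ]
        return "\n".join(fields)

    blocks = [item(text, label) for text, label in shots]
    blocks.append(item(query, "").rstrip())  # query: same layout, empty output slot
    return separator.join(blocks)


def format_variants(shots, query):
    """Map each (separator, field_style) pair to the prompt it produces."""
    return {
        (sep, style): render(shots, query, sep, style)
        for sep, style in product(SEPARATORS, FIELD_STYLES)
    }


# Made-up data for illustration.
SHOTS = [("battery lasts all day", "positive"), ("screen cracked in a week", "negative")]
variants = format_variants(SHOTS, "keyboard feels cheap")
print(f"{len(variants)} variants")
print(variants[("\n###\n", "{name}: {value}")])
```

Because every variant is produced by the same `render` function, the examples and the query always share a layout within a variant; only the layout itself changes between variants.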

How few-shot prompting fails

Order sensitivity and format sensitivity are two of the most consequential failure modes, and the four-lever discussion above covers both. Four others are worth watching for.

Distributional bias in examples. If the examples skew toward one class, one length, or one phrasing pattern, the model outputs more responses in that mold than the underlying data would justify. A sentiment-classification prompt with five positive examples and one negative tends to overpredict positive on ambiguous cases. Balance examples deliberately when the underlying distribution is meaningful.
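A quick check for label skew can run whenever the example set changes. This is a minimal sketch; the `max_share` threshold of 0.6 is an arbitrary assumption and should be set from the distribution the task actually calls for.

```python
from collections import Counter


def label_balance(shots):
    """Return each label's share of the example set."""
    counts = Counter(label for _, label in shots)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def is_skewed(shots, max_share=0.6):
    """True if any single label exceeds max_share of the examples (threshold is an assumption)."""
    return any(share > max_share for share in label_balance(shots).values())


# The five-positive, one-negative case described above (made-up texts).
skewed_shots = [
    ("great battery", "positive"),
    ("love the screen", "positive"),
    ("fast shipping", "positive"),
    ("works perfectly", "positive"),
    ("excellent sound", "positive"),
    ("stopped charging", "negative"),
]
balanced_shots = skewed_shots[:3] + [
    ("stopped charging", "negative"),
    ("arrived broken", "negative"),
    ("terrible support", "negative"),
]

print(label_balance(skewed_shots))
print(is_skewed(skewed_shots), is_skewed(balanced_shots))
```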

Memorization rather than generalization. When examples are highly similar to the production input, the model produces outputs that pattern-match the example rather than reason from the input. The output looks correct because it resembles a real example. The reasoning is not happening. This is more common with smaller models and with examples that share surface features with the query.

Format drift between examples and query. If the demonstrated examples use one format and the actual query uses a different one, the model's behavior becomes unpredictable. This is a subset of the format-sensitivity problem and is one of the easiest to introduce by accident, when programmatic prompt assembly stitches examples and query from different code paths.
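When examples and query come from different code paths, a cheap guard is to validate the assembled blocks against one shared pattern before sending the prompt. The sketch below assumes a single-line `Input:`/`Output:` layout; both the layout and the `find_format_drift` helper are hypothetical and would need to match whatever format the prompt actually uses.

```python
import re

# Assumed layout: every example is "Input: <text>\nOutput: <label>",
# and the final block is the query with an empty output slot.
EXAMPLE_PATTERN = re.compile(r"Input: .+\nOutput: \S.*")
QUERY_PATTERN = re.compile(r"Input: .+\nOutput:")


def find_format_drift(blocks):
    """Return the indices of blocks that do not match the shared layout."""
    problems = []
    for i, block in enumerate(blocks):
        is_query = i == len(blocks) - 1
        pattern = QUERY_PATTERN if is_query else EXAMPLE_PATTERN
        if not pattern.fullmatch(block):
            problems.append(i)
    return problems


consistent = [
    "Input: battery lasts all day\nOutput: positive",
    "Input: screen cracked in a week\nOutput: negative",
    "Input: keyboard feels cheap\nOutput:",
]
# Same examples, but the query was assembled by a different code path.
drifted = consistent[:2] + ["Q: keyboard feels cheap\nA:"]

print(find_format_drift(consistent))
print(find_format_drift(drifted))
```

Running the check in the prompt-assembly path, or in a unit test around it, turns a silent behavior change into an explicit failure.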

Static examples on a drifting input distribution. Examples chosen well for last quarter's inputs do not generalize to this quarter's. When the input distribution shifts, the examples become misleading rather than informative, pointing the model at a task that no longer matches what users are sending.

Choosing among the four options

The four options from the top of the page are the decision view of few-shot. Each one wins under different conditions. The choice is mostly empirical, and prompt evaluation is what makes that choice principled rather than guessed.

Zero-shot carries the whole task in the instruction. It is the cheapest option and often enough on its own.

Few-shot fixes a small example set in the prompt to carry what the instruction leaves underspecified. It is rarely the long-term answer for a production system. It is almost always a good first answer: fast to set up, easy to iterate, and informative about what the task needs even when the eventual implementation moves to something else.

Dynamic few-shot selects examples at runtime by similarity to the current input, so different queries see different example sets. This handles distribution shift better than static few-shot at the cost of complexity in the retrieval pipeline.
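A minimal sketch of the selection step, assuming a pool of labeled examples and using bag-of-words cosine similarity so the code has no dependencies. A production pipeline would more likely use embeddings and a vector index; `select_shots` and the sample pool are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude stand-in for an embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_shots(pool, query, k=2):
    """Pick the k pool examples whose inputs are most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(pool, key=lambda ex: cosine(bag_of_words(ex[0]), query_vec), reverse=True)
    return ranked[:k]


# Made-up support-ticket pool for illustration.
POOL = [
    ("refund for a damaged item", "billing"),
    ("app crashes when I log in", "bug"),
    ("charge appeared twice on my card", "billing"),
    ("login button does nothing", "bug"),
]

shots = select_shots(POOL, "the app crashes on login", k=2)
for text, label in shots:
    print(f"{label}: {text}")
```

Each query gets its own example set, so the prompt tracks the input rather than a fixed snapshot of it; the cost is that the pool, the similarity measure, and the retrieval step all become things to evaluate and maintain.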

Fine-tuning folds the examples into the model's weights. It does not eliminate the role of evaluation, which still determines whether the fine-tuned model beats the prompted one.