Prompt Engineering
Prompt engineering is the discipline of shaping the text input to a language model so its output is more useful, more reliable, and better aligned with the task. The practice treats the prompt as a working interface to iterate on, measure against, and version, not as a one-shot guess.
What it spans
Prompt engineering shows up at several layers of an LLM-using system:
- System prompts — the persistent layer that sets role, format, and constraints across a conversation or workload.
- User prompts — task framing, anchors, examples, and the specific request itself.
- Techniques — patterns like chain of thought, few-shot examples, structured outputs, tool-use specifications, and the agentic patterns that compose them.
- Composition — when prompts are reused, layered, or programmatically assembled (templates, prompt programs, prompt-as-code). At this layer prompts stop being "a string of text" and become an artifact with versioning, ownership, and review.
What makes a prompt work
A prompt is doing its job when it produces the desired output reliably across the inputs the system will see in production. What "good" looks like depends on the task: instructions that work well for code review can underperform for summarization.
Four factors influence what a prompt can do:
- Task instructions — what the model is being asked to do, expressed clearly and unambiguously.
- In-context examples — demonstrations of the desired input-output mapping rather than descriptions of it. See few-shot prompting.
- Format constraints — the shape the output must take, often via structured outputs or schema validation. LLMs can output structured JSON, CSV, code, or prose all equally well. They need to be told what format is desired.
- Capability invocation — explicit use of what the surrounding system can do: tool calls, sub-agents, skills, retrieval. A prompt written without regard for the host environment (Claude, Cursor, Codex, etc.) gives up capability that environment already offers.
A prompt that isn't working usually fails due to one of the following: ambiguous instructions, missing or misleading examples, no enforced output shape, or unused capabilities of the surrounding system.
Underneath the four factors is a fifth skill that doesn't appear in the prompt text: model awareness. The same instruction can land differently across models because models differ in what they do without being asked, where they focus their attention, and which kinds of inputs throw them off. Prompts written for a specific model's strengths outperform prompts written in general terms. Model awareness doesn't make outcomes deterministic, but it does shape which levers a prompt author reaches for and how hard to lean on each.
How prompts fail in production
Production prompts fail in patterned ways. The five patterns named below aren't a canonical list, but they show up often enough to organize around.
The brittleness pattern: small phrasing changes have big effects. Wording, ordering, whitespace, and example selection can produce disproportionately large changes in output. Lu et al. (2022) showed that example order alone can swing few-shot accuracy from near state-of-the-art to near-random across model sizes, and Sclar et al. (2023) found accuracy variance of up to 76 points across semantically equivalent format variations on LLaMA-2 13B. A prompt that performs well on a small set of hand-picked inputs can underperform on messier, real-world inputs real users send. This can be corrected by evaluating against samples drawn from real usage rather than a small set of familiar examples hand-picked by the author.
That phrasing sensitivity compounds when the model changes underneath the prompt. The cross-model transfer pattern: prompts don't transfer cleanly across models. A prompt tuned for one model rarely works the same on another. The order-sensitivity work cited above also reports that a good prompt configuration for one model often fails to achieve the same performance on others. Migrating between providers (or even between minor versions from the same provider) can result in meaningful differences in behavior. Prompts that rely on a model's specific capabilities (tool use, sub-agents, skills, extended thinking, etc.) won't transfer at all to models that don't support those capabilities.
Even on a stable model, more isn't better. The over-prompting pattern: simple, focused prompts tend to outperform complex ones. Beyond a threshold, adding more instructions, more examples, or more guardrails can make outputs worse.
This tends to come from three factors:
- The model spends attention parsing rules instead of doing the task
- Some rules contradict each other
- Some rules suppress behaviors that were already correct.
Editing a prompt to remove unnecessary or conflicting parts can often be more effective than adding additional clarifications. Removing a rule reveals whether it was doing work, and rules that "fix" a rare edge case often suppress behaviors that were correct on the common case.
The other source of variability is the input, not the prompt. The distribution-shift pattern: production inputs differ from evaluation inputs. Production traffic can include weirder phrasing, longer text, malformed structure, and unexpected requests that no curated evaluation set captures. The reliable way to correct for this is to regularly sample real production inputs back into the evaluation set and re-test prompt changes to see how they hold up.
The final failure pattern comes down to trust. The prompt-injection pattern: untrusted content can carry instructions. Anything in the prompt (or the context in general) that isn't fully controlled, whether retrieved documents, tool outputs, or user-supplied content, can include adversarial instructions aimed at the model. The type of attack was named "prompt injection" by Simon Willison in September 2022. It was formalized for retrieval and tool-integrated systems by Greshake et al. (2023), and it now sits at the top of the OWASP Top 10 for LLM Applications as LLM01.
Each of these failures has the same fundamental antidote: a way to determine whether a specific change to a prompt improved results. Without an evaluation method, prompt engineering reduces to taste, and taste doesn't survive contact with real production traffic. See prompt evaluation for more.
The five failure patterns at a glance
| Pattern | What goes wrong | The correction |
|---|---|---|
| Brittleness | Wording, ordering, whitespace, and example selection swing output disproportionately | Evaluate against samples drawn from real usage, not hand-picked examples |
| Cross-model transfer | A prompt tuned for one model rarely behaves the same on another | Re-evaluate on each model and account for its capabilities and tendencies |
| Over-prompting | Past a threshold, more instructions, examples, or guardrails make output worse | Remove unnecessary or conflicting parts to see what was doing work |
| Distribution shift | Production inputs are weirder, longer, and more malformed than the evaluation set | Sample real production inputs back into the evaluation set and re-test |
| Prompt injection | Untrusted content carries adversarial instructions aimed at the model | Treat untrusted content as data, never as instructions |
Atomic pages
- Chain of Thought — telling the model to "think step by step": when it helps, when it doesn't, what's replaced it
- Zero-Shot CoT — the "Let's think step by step" cue alone, without exemplars
- Self-Consistency — sampling multiple chains and voting on the answer
- Least-to-Most Prompting — decomposing a hard problem into a sequence of easier sub-problems
- Tree of Thoughts — search over candidate intermediate states with self-evaluation and backtracking
- Automatic Prompt Optimization — replacing manual iteration with measured search over prompts
- Few-Shot Prompting — providing examples in-context: selection, ordering, and the limits of in-context learning
- System Prompts — the persistent role-and-constraint layer and how it composes with the user turn
- Structured Outputs — constraining responses to a schema: JSON mode, tool calls, grammar decoding
- Prompt Evaluation — measuring whether a prompt change is real or noise
- Prompt Versioning — treating prompts as code: version control, code review, rollback
- Prompt Injection — adversarial inputs and the defenses that hold up