laranevans.com
Topics / AI / Context Engineering

Context engineering is the practice of curating the information a language model sees at inference time. The work covers system prompts, retrieved documents, prior conversation turns, tool descriptions, tool outputs, and any working memory the agent maintains across turns. Anthropic's engineering team defines it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts".

The term emerged from production practice. As LLM applications grew from single-turn chat into multi-turn agents with tools, memory, and long-horizon tasks, the question shifted. Prompt engineering asks how to phrase the instruction. Context engineering asks the broader question of what should appear in the model's view at all, across a finite token window and workloads spanning many turns.

One constraint generates the whole field

Every choice in context engineering descends from a single constraint. The context window is finite, and a model does not read it uniformly. Two forces follow from that constraint, and together they generate the rest of the field.

The five components below are what competes for the budget. The four patterns are the moves that keep the budget under control as a task runs long.

The reference spine

Two tables hold the facts the rest of the page rests on. The first names what occupies the window. The second names the moves that manage it.

What lives in context

A typical agent's context at any given turn is built from five components.

Component What it holds Atomic page
System prompt Persistent role, format, and behavioral constraints System Prompts
Tool definitions JSON schemas describing each tool the model has access to Tool Calling
Conversation history Prior user and assistant turns, including past tool calls and their results n/a
Retrieved content Documents pulled in by retrieval, search results, file contents the agent has read Retrieval-Augmented Generation
Working memory Scratchpads, notes, or planning artifacts the agent maintains across turns n/a

Patterns that manage the window

Four patterns recur across production agentic systems, named in the Anthropic post linked above. Each one trades a cost for a smaller or fresher window.

Pattern What it does Best suited to
Compaction Summarize prior history near the window limit. Keep decisions and open work, discard redundant tool output Back-and-forth conversational workloads
Just-in-time retrieval Expose retrieval tools the agent calls on demand instead of loading everything up front Large or open-ended source material
Structured note-taking Hold state in files or external storage, read back into context when needed Iterative work with clear milestones
Sub-agent architectures Spawn sub-agents with clean windows that return condensed summaries Focused subtasks and parallel exploration

The context window has structure

Modern LLMs report context windows of 200,000 to 2 million tokens. Those numbers describe a maximum, not uniform usable capacity. Position inside the window affects retrieval reliability. Information at the start and end of the window tends to be recalled more reliably than information buried in the middle, a pattern documented across model families.

Extending context windows has a cost beyond memory and inference latency. The technique that drove the move from 2K to 100K+ tokens in many model families came from work like Position Interpolation, which extended rotary position embeddings to longer sequences. Newer models lean on a mix of architectural changes and training techniques to support long contexts, and the practical capacity often runs shorter than the advertised window.

A few specifics from the study worth carrying:

  • Performance drops more steeply when the needle and the question share less lexical overlap. Semantic-only matches degrade faster with length than direct keyword matches.
  • A single topically-related distractor reduces accuracy. Multiple distractors compound the effect non-linearly.
  • Structured haystacks performed worse than shuffled ones. Coherent surrounding text seems to create false-salience patterns that mislead the model.
  • Models differ. Claude models showed the lowest hallucination rates with distractors across the study's 18-model sweep. GPT models showed the highest. Hallucination patterns also varied across models in the same family.

Compaction trims history near the limit

Compaction summarizes prior conversation history when the context approaches the window limit. Preserve architectural decisions and unresolved work. Discard redundant tool outputs. Compaction is conversational by nature, and best suited to workloads with a back-and-forth shape.

A safer first step before full compaction is to drop stale tool results. Tool outputs from earlier turns are often redundant once the model has acted on them, and clearing them is a lighter touch than re-summarizing prose.

Just-in-time retrieval defers loading until needed

Just-in-time retrieval exposes retrieval tools the agent invokes on demand instead of loading all potentially-relevant data into context up front. Pass lightweight identifiers around. Fetch full content only when the agent needs to act on it. This pattern reduces baseline context size and lets the agent's exploration drive what enters the window.

Structured note-taking holds state outside the window

Structured note-taking maintains persistent memory outside the context window. Files, scratchpads, or external storage hold state the agent reads back into context when needed. This pattern suits iterative work with clear milestones, where the cost of re-reading is lower than the cost of carrying the full history forward.

Sub-agent architectures keep the coordinator bounded

Sub-agent architectures spawn specialized sub-agents with clean context windows for focused tasks. The sub-agent returns a condensed summary, not its full transcript. The coordinating agent's context stays bounded even as the underlying work grows. This pattern composes well with parallel exploration.

Tool design shapes the budget

Tool design is part of context engineering, not separate from it. Every tool definition lives in the window. Every tool output lands back in the window. A bloated tool set or verbose tool outputs eat into the budget you have for reasoning.

Anthropic's Writing Tools for Agents post argues for consolidating tools by intent rather than wrapping every existing API endpoint. A tool that handles "schedule an event" (search availability, check conflicts, book) leaves the agent less to figure out than three separate tools for the same workflow. The Model Context Protocol formalizes how tools, resources, and prompts get exposed to clients. See Model Context Protocol.

Atomic pages

More atomic pages land as the cluster grows.