Context Engineering — laranevans.com

Context engineering is the practice of curating the information a language model sees at inference time. The work covers system prompts, retrieved documents, prior conversation turns, tool descriptions, tool outputs, and any working memory the agent maintains across turns.

The term emerged from production practice. As LLM applications grew from single-turn chat into multi-turn agents with tools, memory, and long-horizon tasks, the question shifted. Prompt engineering asks how to phrase the instruction. Context engineering asks the broader question of what should appear in the model's view at all, across a finite token window and workloads spanning many turns. Anthropic's engineering team defines it as "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts".

What lives in context

A typical agent's context at any given turn includes:

The system prompt. Persistent role, format, and behavioral constraints. See System Prompts.
Tool definitions. JSON schemas describing each tool the model has access to. See Tool Calling.
The conversation history. Prior user turns and assistant turns, including past tool calls and their results.
Retrieved content. Documents pulled in by retrieval-augmented patterns, search results, file contents the agent has read.
Working memory. Scratchpads, notes, or planning artifacts the agent maintains across turns.

Every byte in that pile shares the same finite window. The choices an engineer makes about what to include, what to compact, and what to discard shape what the model produces.

The context window has structure

Modern LLMs report context windows of 200,000 to 2 million tokens. Those numbers describe a maximum, not uniform usable capacity. Position inside the window affects retrieval reliability. Information at the start and end of the window tends to be recalled more reliably than information buried in the middle, a pattern documented across model families.

Extending context windows has a cost beyond memory and inference latency. The technique that drove the move from 2K to 100K+ tokens in many model families came from work like Position Interpolation, which extended rotary position embeddings to longer sequences. Newer models lean on a mix of architectural changes and training techniques to support long contexts, and the practical capacity often runs shorter than the advertised window.

Context rot: performance degrades as context grows

Chroma published Context Rot in 2025, an empirical study of 18 LLMs across needle-in-a-haystack variants and other long-context benchmarks. Their finding: "models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows." The degradation showed up even on tasks rated easy at short context lengths.

A few specifics worth carrying:

Performance drops more steeply when the needle and the question share less lexical overlap. Semantic-only matches degrade faster with length than direct keyword matches.
A single topically-related distractor reduces accuracy. Multiple distractors compound the effect non-linearly.
Counterintuitively, structured haystacks performed worse than shuffled ones. Coherent surrounding text seems to create false-salience patterns that mislead the model.
Models differ. Claude models showed the lowest hallucination rates with distractors across the study's 18-model sweep. GPT models showed the highest. Hallucination patterns also varied across models in the same family.

The practical takeaway: a longer window does not buy you uniform reliability across that window. Curation still matters.

Patterns that work

Four patterns recur across production agentic systems, named in the Anthropic post linked above.

Compaction

Summarize prior conversation history when approaching the window limit. Preserve architectural decisions and unresolved work. Discard redundant tool outputs. Compaction is conversational by nature, and best suited to workloads with a back-and-forth shape.

A safer first step before full compaction: drop stale tool results. Tool outputs from earlier turns are often redundant once the model has acted on them, and clearing them is a lighter touch than re-summarizing prose.

Just-in-time retrieval

Instead of loading all potentially-relevant data into context up front, expose retrieval tools the agent invokes on demand. Pass lightweight identifiers around. Fetch full content only when the agent needs to act on it. This pattern reduces baseline context size and lets the agent's exploration drive what enters the window.

Structured note-taking

Maintain persistent memory outside the context window. Files, scratchpads, or external storage hold state the agent reads back into context when needed. Useful for iterative work with clear milestones, where the cost of re-reading is lower than the cost of carrying the full history forward.

Sub-agent architectures

Spawn specialized sub-agents with clean context windows for focused tasks. The sub-agent returns a condensed summary, not its full transcript. The coordinating agent's context stays bounded even as the underlying work grows. This pattern composes well with parallel exploration.

How tools shape context

Tool design is part of context engineering, not separate from context engineering. Every tool definition lives in the window. Every tool output lands back in the window. A bloated tool set or verbose tool outputs eat into the budget you have for reasoning.

Anthropic's Writing Tools for Agents post argues for consolidating tools by intent rather than wrapping every existing API endpoint. A tool that handles "schedule an event" (search availability, check conflicts, book) leaves the agent less to figure out than three separate tools for the same workflow. The Model Context Protocol formalizes how tools, resources, and prompts get exposed to clients. See Model Context Protocol.

Atomic pages

Model Context Protocol — the open protocol for connecting LLMs to data sources, tools, and workflows.
Tool Calling — how LLMs invoke external functions and APIs.
Retrieval-Augmented Generation — the dominant pattern for grounding LLM output in current, private, or domain-specific information.

More atomic pages land as the cluster grows.