laranevans.com
Topics / Context Engineering / Retrieval-Augmented Generation

Retrieval-augmented generation combines a parametric language model with a non-parametric document store. At inference, a retriever fetches relevant passages from the store, the generator conditions on those passages alongside the user query, and the output is grounded in retrieved evidence rather than only in the model's training weights. The technique was introduced by Lewis et al. (2020) under the name Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks and has since become the default pattern for any LLM application that needs current, private, domain-specific, or citable information.

The original RAG paper's contribution was treating retrieval as a differentiable component of the generation pipeline, not a separate preprocessing step. Production systems have since moved away from joint training while keeping the broader architecture: retrieve, condition, generate.

The original architecture

Lewis et al. paired two components and trained them end-to-end.

Dense Passage Retrieval (DPR) provided the retriever: a dual encoder where one BERT encodes the question into a dense vector and another BERT encodes each passage. The corpus (Wikipedia, in the original work) was pre-encoded and indexed with FAISS. At query time, dot-product similarity between the question and passage vectors selected the top-K passages.

BART provided the generator: a sequence-to-sequence model that conditioned on the retrieved passages alongside the question and produced the answer autoregressively.

The two were combined probabilistically, marginalizing the answer distribution over the retrieved documents. The paper proposed two ways of doing this marginalization:

  • RAG-Sequence. A single document is sampled, and the entire answer is generated conditioned on that one document. The probabilities of full sequences are mixed across documents at the end.
  • RAG-Token. At every output token, the model independently marginalizes over all retrieved documents. Different parts of the answer draw from different passages.

RAG-Token was more expressive and sometimes performed better when an answer needed evidence from multiple documents. RAG-Sequence was simpler and more stable.

End-to-end training propagated gradients through the retriever encoders, letting them specialize for the downstream generation task.

What RAG outperformed

The original evaluation focused on knowledge-intensive tasks: open-domain question answering on Natural Questions, TriviaQA, WebQuestions, and CuratedTREC, plus fact verification on FEVER. RAG outperformed:

  • Parametric-only seq2seq baselines. A fine-tuned BART without retrieval, where all knowledge had to live in the parameters. The gap was largest on questions whose answers depended on rare or recent facts.
  • Traditional retrieve-and-extract pipelines. Pre-RAG systems retrieved with BM25 or DPR and then ran a separate reader. End-to-end training of retriever and generator beat these task-specific compositions.
  • Contemporary open-domain QA systems. RAG set state-of-the-art or near-SOTA on several benchmarks at publication.

The qualitative observation from the paper has held up well in practitioner work since: RAG outputs are more specific, more diverse, and more factual than parametric-only generation on knowledge-intensive tasks.

How production RAG diverged from the original

Modern RAG systems carry the broader architecture but rarely use the original components verbatim. The shifts worth carrying:

Chunking strategy

The original paper used Wikipedia split into fixed 100-word passages. Production systems use chunks tuned to the LLM context window and to the document type. As a practitioner rule of thumb, chunk lengths commonly land in the few-hundred-token range with sentence-aware splits, sliding windows, or hybrid rules that respect document structure (headings, paragraphs, list items). Chunking decisions affect retrieval recall (smaller chunks for higher precision per chunk) and answer coherence (larger chunks for context).

Embeddings replace DPR-style retrievers

The original DPR was trained specifically for retrieval. Production systems typically use general-purpose embedding models (Cohere, OpenAI, sentence-transformers, or open-weights embedding LLMs) without task-specific fine-tuning. Multi-vector and late-interaction approaches (ColBERT and its descendants) appear when recall on long documents matters more than per-query latency.

Vector databases replace in-memory FAISS

The original was a research setup. Production replaces it with Pinecone, Weaviate, Qdrant, Milvus, pgvector, or similar. These add metadata filtering (by time, source, user, access control), sharding, replication, and hybrid retrieval (dense and BM25 together with score fusion).

Reranking sits between retrieval and generation

The original architecture had no separate rerank step. Modern systems use a two-stage retrieve-then-rerank: a fast first-pass retrieval (dense or hybrid) returns an over-fetched candidate set, then a slower cross-encoder reranker (monoT5, BGE reranker, or an LLM-based scorer) re-orders the top results by relevance. Production over-fetch counts vary by system and corpus size, typically running an order of magnitude larger than the final top-K that enters the prompt. The pattern improves precision on the passages that actually enter the prompt.

Black-box LLMs replace the trained BART

The original trained the generator alongside the retriever. Production systems treat the generator as a black-box API call (GPT, Claude, Gemini, or a self-hosted Llama). End-to-end training is rare. The retriever and the generator are coupled only through the prompt-construction step that formats retrieved passages into the LLM's context.

Orchestration adds a layer

Modern RAG is rarely a single retrieve-then-generate call. Production systems wrap the core with query rewriting, intent classification, multi-hop retrieval, agentic decomposition, and post-generation verification. Some systems use the model to plan retrievals dynamically (see Agentic Workflows for the agent-loop pattern that drives this).

Where RAG fails

The pattern has well-documented failure modes:

  • Retrieval misses. The right passage is in the corpus but the retriever does not surface it. Causes range from poor embedding fit (the question and answer don't share vocabulary) to chunk boundaries that split the answer. Mitigations: hybrid retrieval, larger K with reranking, query expansion.
  • Context conflict. The retrieved passages disagree with each other, or with the model's prior knowledge. The model picks one source confidently without flagging the conflict. Mitigations: prompt the model to compare sources, return citations the user reviews directly.
  • Retrieved-context poisoning. An attacker injects content into the corpus that contains adversarial instructions or false facts. The retriever surfaces the poisoned passage, the generator follows the injected instructions. The PoisonedRAG work by Zou et al. demonstrates this empirically. Mitigations: treat retrieved content as data not instruction, validate sources, restrict who writes to the corpus. See Prompt Injection for the broader threat model.
  • Truncation under context limits. Retrieved passages plus the query plus the system prompt exceed the model's context window. The system silently drops passages or truncates them. Mitigations: explicit token accounting, reranker-driven selection, context-compression patterns from context engineering.
  • Lost-in-the-middle. The right passage is in context but positioned where the model attends less. Liu et al. (2023) showed that information at the start and end of a long context is recalled more reliably than information in the middle. Mitigations: rerank to put the most relevant passage near the beginning, keep the retrieved set small enough to escape the middle.
  • Stale corpus. The retriever returns a passage that was correct when indexed but is no longer accurate. Mitigations: re-indexing pipelines, source dates in metadata, freshness-aware ranking.

Evaluating RAG

Standard QA metrics (exact-match, F1) measure whether the answer matches a reference. They do not catch hallucination supported by retrieved-but-irrelevant context, or correct answers produced despite retrieval failure. The RAGAs framework by Es et al. (2023) decomposes RAG evaluation into four component-wise metrics:

  • Retrieval precision — how relevant the retrieved passages are to the gold answer.
  • Groundedness — whether the generated answer's claims are supported by the retrieved passages.
  • Answer relevance — whether the answer addresses the user's question, independent of the source.
  • Faithfulness — whether the answer introduces information not present in the retrieved context (the inverse of hallucination).

The four metrics decouple failure modes that single-number metrics conflate. A system scores high on answer relevance and low on faithfulness when its output is confident, plausible, and unsupported. A system scores high on faithfulness and low on retrieval precision when it is honest about insufficient sources but the sources never had the answer. See Prompt Evaluation for the broader practice this fits into.

When RAG is the right tool

RAG fits well when:

  • The answer depends on information that changes faster than the model is retrained (news, prices, internal docs, ticket history).
  • The information is too large to fit in a training corpus or context window (millions of documents, terabytes of logs).
  • Provenance matters — the user or compliance regime needs to know which source produced the claim.
  • The domain is narrow enough that the model's prior knowledge is unreliable but the retrievable corpus is authoritative (legal, medical, internal policies).

RAG fits less well when:

  • The task is purely reasoning over information the model already has (math, code generation, well-known facts).
  • The retrievable corpus is itself unreliable (random web pages without curation), in which case retrieval grounds the answer in noise.
  • Latency matters more than freshness, and the marginal accuracy gain from retrieval does not pay for the extra round-trip.

The pattern is not a default. It is the right answer for grounding-and-freshness problems, and the wrong answer for reasoning-and-computation problems.