laranevans.com
Agentic Systems

An agentic system is the production deployment of an agentic workflow. The workflow is a structural choice (prompt chaining, routing, an agent loop). The system is the set of operational decisions wrapped around that workflow once it runs against concurrent users, reused contexts, compounding failures, and no engineer watching.

The organizing frame for the whole page is the gap between the workflow and the system. A workflow pattern that runs in a Jupyter notebook with a single user, a fresh context, and an attentive engineer behaves differently in production. Seven operational dimensions generate the engineering work that closes the gap: observability, cost, latency, safety, reliability, memory, and the human-in-the-loop boundary. Pick the workflow pattern, then settle these seven, and the system's behavior is settled.

The seven dimensions at a glance

Each dimension is a question the workflow does not answer on its own. The middle column names what surfaces in production that the notebook hid. The right column names the lever that holds it.

Dimension | What breaks in production | The lever
Observability | An agent misbehaves and debugging is guesswork without a trace | Trace IDs linking every LLM call and tool call of one run, plus structured metadata
Cost | A single looping agent runs a five-figure bill overnight | Per-step model right-sizing, prompt caching, harness-enforced budget caps
Latency | Loop iterations stack and a user-facing agent stalls | An iteration budget that drives model choice, parallel tool calls, streaming
Safety | A tool call modifies state or follows injected instructions | Sandboxing, permissions, rollback, rate limits, prompt-injection defense
Reliability | A model or tool returns a 5xx, or a retry duplicates a write | Model and tool fallbacks, idempotency keys, loop termination
Memory | The agent forgets everything, or remembers everything and prompts unpredictably | Bounded session memory, opt-in persistent memory, working-memory scratchpads
Human-in-the-loop | Autonomy drifts in because nobody set a boundary | An explicitly named approval, override, or escalation pattern

The sections below are the discursive read on each dimension. The table is the reference spine. Read the table to scan the surface, then read the sections for the engineering detail behind each lever.

Observability: trace every call or debug blind

Every LLM call, every tool call, every retry, every fallback leaves a trace. Without that trace, debugging an agent that misbehaved is guesswork. A production agent needs:

  • Request and response logging at the LLM level, with token counts.
  • Tool-call logging at the tool level, with arguments and results.
  • A trace identifier linking the LLM calls and tool calls of a single agent run.
  • Structured metadata (user, session, agent version, model version) on every call.
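
The four requirements above can be sketched as a small in-memory recorder. This is a minimal illustration, not the API of any tool named on this page: the `Trace` class, its method names, and the event field names are all hypothetical, and a real deployment would ship these events to a log pipeline instead of holding them in a list.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects structured events for one agent run under a shared trace ID."""
    user: str
    session: str
    agent_version: str
    model_version: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def _record(self, kind: str, **payload) -> dict:
        # Every event carries the trace ID and the same structured metadata.
        event = {
            "trace_id": self.trace_id,
            "kind": kind,
            "user": self.user,
            "session": self.session,
            "agent_version": self.agent_version,
            "model_version": self.model_version,
            **payload,
        }
        self.events.append(event)
        return event

    def log_llm_call(self, prompt, response, input_tokens, output_tokens):
        # Request and response logging at the LLM level, with token counts.
        return self._record("llm_call", prompt=prompt, response=response,
                            input_tokens=input_tokens, output_tokens=output_tokens)

    def log_tool_call(self, tool, arguments, result):
        # Tool-call logging with arguments and results.
        return self._record("tool_call", tool=tool, arguments=arguments, result=result)

    def to_jsonl(self) -> str:
        # One JSON object per line, ready for a log shipper.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)
```

Because every event shares one `trace_id`, a query on that ID reconstructs the full run in order, which is what turns a misbehaving agent from guesswork into a readable sequence.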

Tools like LangSmith, Langfuse, and Helicone cover the LLM-observability surface. Conventional APM tooling (Datadog, Honeycomb, and the OpenTelemetry instrumentation standard) covers the wider system. Connect both. An agent that fails because the database is slow looks identical at the LLM layer to an agent that fails because the model hallucinated. Only the system trace separates them.

Cost: token spend scales faster than call count

Token spend is the bill nobody warns you about. A single agent run with ten tool calls costs more than ten unrelated LLM calls, because each tool result is added to the context and every later call re-sends that growing context as input. At current frontier-model pricing, a naive deployment across a million-user-a-day product can reach five-figure daily costs quickly.
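
A rough worked example makes the growth concrete. The token sizes below are hypothetical, chosen only for easy arithmetic, and the sketch counts input tokens only and ignores caching.

```python
def loop_input_tokens(steps: int, base: int, growth_per_step: int) -> int:
    """Input tokens billed when each step re-sends all prior context."""
    return sum(base + k * growth_per_step for k in range(steps))


def unrelated_input_tokens(steps: int, base: int) -> int:
    """Input tokens billed for the same number of independent calls."""
    return steps * base


# Hypothetical sizes: a 1,000-token starting context, and each step adds a
# 500-token tool result plus a 100-token model turn to the context.
looped = loop_input_tokens(steps=10, base=1000, growth_per_step=600)
unrelated = unrelated_input_tokens(steps=10, base=1000)
print(looped, unrelated, looped / unrelated)  # 37000 10000 3.7
```

Under these assumptions the ten-step loop bills 3.7 times the input tokens of ten unrelated calls, and the ratio keeps climbing as the step count grows.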

Three levers hold the cost down:

  • Right-size the model per step. A routing step rarely needs the strongest model. A code-generation step often does. Mixed-model agents (small model for classification, large model for synthesis) cut cost without proportional quality loss.
  • Cache. Anthropic's prompt caching, OpenAI's automatic prefix caching, and provider-side context caching all reduce the per-call cost of repeated system prompts and stable context.
  • Budget caps. A per-run token budget the harness enforces stops runaway loops before they become incidents. See Context Engineering for the patterns that keep individual contexts bounded.
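
The budget-cap lever can be sketched as a counter the harness charges after every model call. This is an assumption-laden illustration: `TokenBudget`, `BudgetExceeded`, and `run_with_budget` are hypothetical names, and each step is modeled as a callable that reports the tokens it used.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run has spent more tokens than its cap allows."""


class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        # Record the spend first, then fail the run if it crossed the cap.
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")


def run_with_budget(steps, budget: TokenBudget) -> dict:
    """Run steps in order; each step returns (input_tokens, output_tokens)."""
    completed = 0
    for step in steps:
        try:
            budget.charge(*step())
        except BudgetExceeded:
            # The harness, not the model, decides the run is over.
            return {"status": "stopped_on_budget", "completed": completed,
                    "used": budget.used}
        completed += 1
    return {"status": "done", "completed": completed, "used": budget.used}
```

The cap is checked after the charge, so a run can overshoot by at most one call. A stricter harness would estimate the next call's cost and refuse it up front.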

Latency: the iteration budget drives the rest

Agent latency is roughly the per-iteration LLM latency plus tool latency, multiplied by the number of loop iterations. The iteration count is the variable the workflow shape controls.

Parallel tool calls (the model invokes several tools in a single message) cut wall-clock time on multi-source queries. Streaming the model's output lets the user see progress on long generations. Speculative execution (start tool calls before the model finishes deciding) is an emerging pattern that trades cost for latency in specific cases.
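
On the harness side, parallel tool calls amount to running the requested tools concurrently instead of one after another. The sketch below uses `asyncio.gather` with `asyncio.sleep` as a stand-in for network I/O. The tool names and delays are hypothetical.

```python
import asyncio


async def call_tool(name: str, delay: float, log: list) -> str:
    """Simulated tool call; the sleep stands in for a network round trip."""
    log.append(f"start:{name}")
    await asyncio.sleep(delay)
    log.append(f"end:{name}")
    return f"{name}-result"


async def run_parallel(calls, log: list) -> list:
    # gather starts every call before waiting on any of them and returns
    # results in the order the calls were given.
    return await asyncio.gather(*(call_tool(n, d, log) for n, d in calls))


log = []
results = asyncio.run(
    run_parallel([("search", 0.02), ("weather", 0.01), ("calendar", 0.01)], log)
)
print(results)  # ['search-result', 'weather-result', 'calendar-result']
```

Wall-clock time approaches that of the slowest call instead of the sum of all three, which is where the saving on multi-source queries comes from.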

Set a budget. A user-facing agent with a 30-second cap behaves differently from a batch agent with a 30-minute cap. The harness's iteration limit, model selection, and fallback policy all flow from that budget.
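
One way to turn a latency budget into an iteration limit is a back-of-envelope division. The sketch assumes each iteration costs roughly one LLM call plus one tool call. The latency figures and the `reserve_s` allowance for the final answer are hypothetical.

```python
import math


def max_iterations(budget_s: float, llm_latency_s: float,
                   tool_latency_s: float, reserve_s: float = 0.0) -> int:
    """Iterations that fit in the budget after reserving time for the answer."""
    per_iteration = llm_latency_s + tool_latency_s
    if per_iteration <= 0:
        raise ValueError("per-iteration latency must be positive")
    usable = budget_s - reserve_s
    return max(0, math.floor(usable / per_iteration))


# Hypothetical latencies: 4 s per LLM call, 1.5 s per tool call, 3 s reserved.
interactive = max_iterations(30, 4.0, 1.5, reserve_s=3.0)
batch = max_iterations(30 * 60, 4.0, 1.5, reserve_s=3.0)
print(interactive, batch)  # 4 326
```

With these numbers the 30-second agent gets four iterations, so a faster model or parallel tool calls may be the only way to fit the task. The 30-minute batch agent can afford a slower, stronger model.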

Safety: constraints the harness enforces around tool use

Safety in an agentic system is the set of constraints the harness enforces around the model's tool use:

  • Sandboxing. Code-execution tools run inside an isolated environment with no access to host systems. Network access, filesystem access, and process limits all matter.
  • Permissions. Tools that modify state (send email, create issue, charge card) require either explicit user confirmation or a verified policy that authorizes the specific action.
  • Rollback. State-modifying tool calls log enough information to reverse. An agent that booked the wrong appointment needs a cancellation pathway the harness owns, not one the model proposes.
  • Rate limits. A misbehaving agent in a loop can send 1000 tool calls per minute. Rate limits at the tool level cap blast radius.
  • Prompt-injection defense. Untrusted content returned from tools (web search results, file contents, scraped pages) is treated as data, not as instruction. See Prompt Injection.
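
Two of the constraints above, permissions and rate limits, can be sketched as one check the harness runs before every tool call. The tool names, the return strings, and the sliding-window limiter are illustrative assumptions. The clock value is passed in explicitly so the behavior is easy to test.

```python
from collections import deque

# Hypothetical set of tools that modify state and need confirmation.
STATE_MODIFYING = {"send_email", "create_issue", "charge_card"}


class RateLimiter:
    """Sliding-window limiter: at most max_calls per window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.calls = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


def authorize(tool: str, confirmed_by_user: bool,
              limiter: RateLimiter, now: float) -> str:
    """Decide whether the harness should execute a tool call."""
    if tool in STATE_MODIFYING and not confirmed_by_user:
        return "needs_confirmation"
    if not limiter.allow(now):
        return "rate_limited"
    return "allowed"
```

The model never sees this function. It only sees the outcome, which keeps the enforcement in the harness, where a prompt cannot argue with it.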

Reliability: production agents fail in ways toy agents do not

Production agents fail in ways toy agents do not. The reliability surface:

  • Model fallbacks. If the primary model returns a 5xx, the harness retries against a secondary. The fallback model often returns a result of slightly different quality. Track the differential.
  • Tool fallbacks. A failed search tool either retries against the same provider, falls back to a secondary provider, or surfaces the error to the agent so the agent picks a different approach.
  • Idempotency. Tool calls that the harness retries (network errors, timeouts) need to be idempotent at the tool level. A "create issue" tool that runs twice creates two issues unless the implementation handles deduplication.
  • Loop termination. Iteration caps, time caps, and confidence-based termination all matter. An agent that loops on a task it cannot solve consumes budget until something forces it to stop. The harness owns "something."
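
Model fallback and loop termination can be sketched in a few lines each. The sketch makes assumptions: `ModelUnavailable` stands in for a provider's 5xx error, models are plain callables, and `step` returns `True` when the task is done. Confidence-based termination is left out.

```python
import time


class ModelUnavailable(Exception):
    """Stand-in for a 5xx response from a model provider."""


def call_with_fallback(models, prompt: str):
    """Try each (name, fn) pair in order; return (name, output) of the first success."""
    errors = []
    for name, fn in models:
        try:
            # Returning the name lets the caller track the quality differential.
            return name, fn(prompt)
        except ModelUnavailable as exc:
            errors.append((name, str(exc)))
    raise ModelUnavailable(f"all models failed: {errors}")


def run_loop(step, max_iterations: int, max_seconds: float, clock=time.monotonic):
    """Run step(i) until it returns True or a cap forces a stop."""
    start = clock()
    for i in range(max_iterations):
        if clock() - start > max_seconds:
            return "time_cap", i
        if step(i):
            return "done", i + 1
    return "iteration_cap", max_iterations
```

Both caps live in the harness, so the loop stops even when the model is certain that one more attempt will work.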

Memory and state: forget too much or remember too much

An agent that forgets everything between sessions is limited. An agent that remembers everything is expensive and prompts unpredictably. The middle ground is the engineering work:

  • Session memory. The conversation history of the current session. Bounded by context-engineering patterns (compaction, just-in-time retrieval).
  • Persistent memory. Facts the agent learned across sessions (user preferences, prior decisions, project context). Stored in a database or vector store, retrieved at session start.
  • Working memory. Scratchpads or planning artifacts the agent maintains across turns inside a single session. Often a structured-note pattern on a filesystem the agent reads back.

The line between session memory and persistent memory is a design choice. Conservative defaults treat persistent memory as opt-in per fact, with the user (or the agent's principal) approving what gets stored.
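
A minimal sketch of the two ends of that line, under simplifying assumptions: session memory is bounded by message count (real compaction would summarize old messages, not drop them), and persistent memory stores a fact only after an explicit approval. Class and method names are hypothetical.

```python
from collections import deque


class SessionMemory:
    """Keeps only the most recent messages of the current session."""

    def __init__(self, max_messages: int):
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        # The deque silently drops the oldest message once it is full.
        self.messages.append((role, content))


class PersistentMemory:
    """Opt-in store: a fact is saved only after the user approves it."""

    def __init__(self):
        self.pending = {}
        self.facts = {}

    def propose(self, key: str, value: str) -> None:
        self.pending[key] = value

    def approve(self, key: str) -> None:
        self.facts[key] = self.pending.pop(key)

    def reject(self, key: str) -> None:
        self.pending.pop(key, None)
```

For example, an agent that notices a preference would call `propose("timezone", "UTC+2")`, and nothing reaches `facts` until the user's approval triggers `approve("timezone")`.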

Human-in-the-loop boundaries: name the pattern per system

The most consequential design decision in an agentic system is where the human enters the loop. Three patterns cover most systems:

  • Approval gate. The agent does its work, presents a plan or a diff, and waits for human approval before acting. Used for high-stakes irreversible actions (sending external email, deploying to production, large payments).
  • Override channel. The agent acts autonomously, but the human has a real-time channel to interrupt or redirect. Used for long-running tasks where the human is co-present but not approving each step.
  • Escalation path. The agent acts autonomously most of the time, but escalates to a human on detected uncertainty (low confidence, repeated failure, ambiguous input). Used for high-volume customer-facing systems.
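
The approval gate and the escalation path can be expressed as one routing decision made before each action. The action names, the confidence floor, and the failure limit below are hypothetical placeholders, and the override channel is left out because it is an interrupt mechanism, not a per-action check.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    AWAIT_APPROVAL = "await_approval"
    ESCALATE = "escalate"


# Hypothetical set of high-stakes, irreversible actions.
IRREVERSIBLE = {"send_external_email", "deploy_production", "large_payment"}


def route_action(action: str, confidence: float, consecutive_failures: int,
                 confidence_floor: float = 0.6, failure_limit: int = 3) -> Decision:
    """Decide whether the agent acts, waits for approval, or escalates."""
    if action in IRREVERSIBLE:
        # Approval gate: a human signs off before anything irreversible.
        return Decision.AWAIT_APPROVAL
    if confidence < confidence_floor or consecutive_failures >= failure_limit:
        # Escalation path: hand off on detected uncertainty or repeated failure.
        return Decision.ESCALATE
    return Decision.PROCEED
```

Writing the boundary down as code is one way to make it explicit, so that autonomy expands only when someone changes the set or the thresholds on purpose.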

What goes wrong

A short taxonomy of agentic-system failure modes worth carrying. Each maps back to one of the seven dimensions and the lever that holds it.

  • Context window saturation under load. A pattern that worked in testing breaks in production when conversations get longer or tool results get bigger. The fix lives in context engineering patterns.
  • Cost spirals. A single misbehaving agent in a loop runs up a five-figure bill overnight. Budget caps at the harness level are the defense.
  • Tool-poisoning. Untrusted content returned from a search or file-read tool injects instructions the agent follows. Defense: treat tool results as data, not instruction.
  • Confident wrong answers under tool failure. A silently failed tool returns an empty result. The agent proceeds as if the result were meaningful. Defense: structured error returns the agent recognizes as failure.
  • State corruption from non-idempotent retries. The harness retries a "send email" tool after a network timeout. Two emails go out. Defense: idempotency keys at the tool boundary.
  • Drift. An agent that worked yesterday behaves differently today because the model version changed silently. Defense: pin model versions and treat upgrades as deployments.
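
Two of the defenses above, idempotency keys and structured error returns, can be sketched together. Everything here is illustrative: `EmailTool` is an in-memory stand-in, the key is derived from a hypothetical run ID and step number, and the `ok`/`data`/`error` result shape is one possible convention.

```python
import hashlib
import json


def make_idempotency_key(run_id: str, step: int, arguments: dict) -> str:
    """Same run, step, and arguments always yield the same key."""
    payload = json.dumps({"run_id": run_id, "step": step, "arguments": arguments},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EmailTool:
    """In-memory stand-in for a send-email tool that deduplicates by key."""

    def __init__(self):
        self.sent = []
        self._results = {}

    def send(self, to: str, body: str, idempotency_key: str) -> dict:
        if idempotency_key in self._results:
            # A retry of a call that already succeeded: replay the result.
            return {**self._results[idempotency_key], "deduplicated": True}
        self.sent.append((to, body))
        result = {"ok": True, "data": {"message_number": len(self.sent)}, "error": None}
        self._results[idempotency_key] = result
        return {**result, "deduplicated": False}


def safe_call(fn, *args, **kwargs) -> dict:
    """Wrap a tool so failure is an explicit result, never an empty one."""
    try:
        return {"ok": True, "data": fn(*args, **kwargs), "error": None}
    except Exception as exc:
        return {"ok": False, "data": None, "error": f"{type(exc).__name__}: {exc}"}
```

With this shape, a retry after a timeout replays the stored result, so no second email goes out. An empty search result (`ok` true, `data` empty) is also distinguishable from a failed search (`ok` false), which keeps the agent from answering confidently on top of a silent failure.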