Agentic Systems
Agentic systems are the production deployments of agentic workflows. The terminology matters because the engineering work shifts at this layer. A workflow pattern (prompt chaining, routing, an agent loop) is a structural choice. A system around that pattern is a set of decisions about observability, cost, latency, safety, reliability, memory, and the human-in-the-loop boundary.
The systems layer is where the cost of running agentic patterns in production lives. A workflow pattern that works in a Jupyter notebook with a single user, a fresh context, and an attentive engineer watching behaves differently when users are concurrent, contexts are reused, failures compound, and no engineer is watching. The dimensions below are where that engineering work shows up.
Observability
Every LLM call, every tool call, every retry, every fallback leaves a trace. Without that trace, debugging an agent that misbehaved is guesswork. A production agent needs:
- Request and response logging at the LLM level, with token counts.
- Tool-call logging at the tool level, with arguments and results.
- A trace identifier linking the LLM calls and tool calls of a single agent run.
- Structured metadata (user, session, agent version, model version) on every call.
Tools like LangSmith, Langfuse, and Helicone cover the LLM-observability surface. Conventional APM and tracing tooling (Datadog, Honeycomb, or a pipeline built on OpenTelemetry) covers the wider system. Connect both. An agent that fails because the database is slow looks identical at the LLM layer to an agent that fails because the model hallucinated; only the system trace separates them.
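The four logging requirements above can be sketched in a few lines. This is a minimal in-memory illustration, not the API of any of the tools named; the Trace class, its method names, and the metadata values are all hypothetical.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Trace:
    """Collects every LLM and tool event of one agent run under one trace ID."""
    metadata: dict[str, str]  # user, session, agent version, model version
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list[dict[str, Any]] = field(default_factory=list)

    def _record(self, kind: str, **fields: Any) -> None:
        # Every event carries the trace ID and the structured metadata.
        self.events.append(
            {"trace_id": self.trace_id, "kind": kind, "ts": time.time(),
             **self.metadata, **fields}
        )

    def log_llm_call(self, prompt: str, response: str,
                     input_tokens: int, output_tokens: int) -> None:
        self._record("llm_call", prompt=prompt, response=response,
                     input_tokens=input_tokens, output_tokens=output_tokens)

    def log_tool_call(self, tool: str, arguments: dict[str, Any], result: Any) -> None:
        self._record("tool_call", tool=tool, arguments=arguments, result=result)


# Hypothetical run: one LLM call followed by one tool call.
trace = Trace(metadata={"user": "u-123", "session": "s-456",
                        "agent_version": "0.1.0", "model_version": "example-model-1"})
trace.log_llm_call("Find the order status", "Calling lookup_order", 42, 9)
trace.log_tool_call("lookup_order", {"order_id": "A1"}, {"status": "shipped"})
```

In a real deployment the events would go to an observability backend rather than a list, but the shape (one ID per run, metadata on every event) is the part that makes a misbehaving run reconstructable.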
Cost
Token spend is the bill nobody warns you about. A single agent run with ten tool calls, each result added to context, costs more than ten unrelated LLM calls, because every call re-sends the accumulated context and the input grows with each iteration. At current frontier-model pricing, a naive deployment serving a million users a day can reach five-figure daily costs quickly.
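A rough way to see why the agent run costs more is to count input tokens. The sketch below assumes a simplified model in which every LLM call re-sends the full context and each tool result is appended before the next call; the token counts are illustrative numbers, not measurements, and output tokens are ignored.

```python
def agent_run_input_tokens(base_context: int, result_tokens: int, llm_calls: int) -> int:
    """Total input tokens when each tool result is appended to the context."""
    total = 0
    context = base_context
    for _ in range(llm_calls):
        total += context          # the whole context is re-sent on every call
        context += result_tokens  # the tool result is appended for the next call
    return total


def unrelated_calls_input_tokens(base_context: int, llm_calls: int) -> int:
    """Total input tokens for the same number of independent calls."""
    return base_context * llm_calls


# Illustrative assumption: 2,000-token base context, 1,500-token tool results.
agent_total = agent_run_input_tokens(2000, 1500, 10)         # 87500
unrelated_total = unrelated_calls_input_tokens(2000, 10)     # 20000
```

Under these assumptions the agent run consumes more than four times the input tokens of ten unrelated calls, and the gap widens as iterations or result sizes grow.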
Practical levers:
- Right-size the model per step. A routing step rarely needs the strongest model. A code-generation step often does. Mixed-model agents (small model for classification, large model for synthesis) cut cost without proportional quality loss.
- Cache. Anthropic's prompt caching, OpenAI's automatic prefix caching, and provider-side context caching all reduce the per-call cost of repeated system prompts and stable context.
- Budget caps. A per-run token budget the harness enforces stops runaway loops before they become incidents. See Context Engineering for the patterns that keep individual contexts bounded.
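Two of these levers, per-step model selection and a per-run budget cap, can be sketched as follows. The model names, step names, and the TokenBudget class are hypothetical placeholders; a real harness would read token counts from the provider's usage metadata.

```python
class BudgetExceeded(RuntimeError):
    """Raised by the harness when a run spends more than its token budget."""


class TokenBudget:
    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        # Called after every LLM call; stops the run once the cap is crossed.
        self.used += input_tokens + output_tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"used {self.used} of {self.max_tokens} tokens")


# Hypothetical mapping: cheap model for classification, strong model for synthesis.
MODEL_FOR_STEP = {
    "route": "small-model",
    "generate_code": "large-model",
    "synthesize": "large-model",
}


def pick_model(step: str) -> str:
    # Unknown steps default to the small model so cost fails safe.
    return MODEL_FOR_STEP.get(step, "small-model")
```

Raising an exception rather than returning a flag means a runaway loop cannot ignore the cap; the harness catches BudgetExceeded once, at the top of the run.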
Latency
Agent latency is roughly the sum, over loop iterations, of LLM latency plus tool latency. The number of loop iterations is the variable the workflow shape controls.
Parallel tool calls (the model invokes several tools in a single message) cut wall-clock time on multi-source queries. Streaming the model's output lets the user see progress on long generations. Speculative execution (start tool calls before the model finishes deciding) is an emerging pattern that trades cost for latency in specific cases.
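The wall-clock effect of parallel tool calls can be illustrated with asyncio. This is a sketch with a simulated tool (fake_tool, a hypothetical stand-in that just sleeps); it shows only the harness side, where several requested calls are awaited together instead of one after another.

```python
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    """Stand-in for a network-bound tool; sleeps instead of doing real I/O."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


calls = [("search", 0.2), ("db_lookup", 0.2), ("weather", 0.2)]
start = time.perf_counter()
results = asyncio.run(run_parallel(calls))
elapsed = time.perf_counter() - start  # close to the slowest call, not the sum
```

Run sequentially, the three simulated calls would take about 0.6 seconds; run together, the total is bounded by the slowest one.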
Set a budget. A user-facing agent with a 30-second cap behaves differently from a batch agent with a 30-minute cap, and the harness's iteration limit, model selection, and fallback policy all flow from that budget.
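One way a harness might turn a latency budget into enforcement is an iteration cap combined with a deadline. The sketch below is an assumption about how such a loop could look; run_agent_loop and the step callback are hypothetical names, and step stands in for one LLM-plus-tools iteration.

```python
import time
from typing import Callable, Optional


def run_agent_loop(
    step: Callable[[int], Optional[str]],
    max_iterations: int,
    deadline_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, Optional[str]]:
    """Run step(i) until it returns an answer, the deadline passes, or the cap is hit."""
    start = clock()
    for i in range(max_iterations):
        if clock() - start >= deadline_seconds:
            return ("timed_out", None)
        answer = step(i)  # returns a final answer, or None to keep looping
        if answer is not None:
            return ("done", answer)
    return ("iteration_cap", None)


# A user-facing agent might get a 30-second cap and a handful of iterations.
status, answer = run_agent_loop(
    step=lambda i: "final answer" if i == 2 else None,
    max_iterations=8,
    deadline_seconds=30.0,
)
```

The deadline is checked between iterations, so a single slow step can still overrun; a production harness would also put timeouts on the individual LLM and tool calls.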
Safety
Safety in an agentic system is the set of constraints the harness enforces around the model's tool use:
- Sandboxing. Code-execution tools run inside an isolated environment with no access to host systems. Network access, filesystem access, and process limits all matter.
- Permissions. Tools that modify state (send email, create issue, charge card) require either explicit user confirmation or a verified policy that authorizes the specific action.
- Rollback. State-modifying tool calls log enough information to reverse. An agent that booked the wrong appointment needs a cancellation pathway the harness owns, not one the model proposes.
- Rate limits. A misbehaving agent in a loop can send 1000 tool calls per minute. Rate limits at the tool level cap blast radius.
- Prompt-injection defense. Untrusted content returned from tools (web search results, file contents, scraped pages) is treated as data, not as instruction. See Prompt Injection.
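The permission and rate-limit constraints above can be sketched as a single guard the harness runs before every tool call. The tool names, the ToolRateLimiter class, and guard_tool_call are hypothetical; the limiter is a simple sliding window with an injectable clock so it can be tested.

```python
import time
from collections import deque
from typing import Callable

# Hypothetical set of state-modifying tools that need explicit confirmation.
NEEDS_CONFIRMATION = {"send_email", "create_issue", "charge_card"}


class ToolRateLimiter:
    """Sliding-window limiter: at most max_calls within any per_seconds window."""

    def __init__(self, max_calls: int, per_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self._clock = clock
        self._calls: deque = deque()  # timestamps of recent allowed calls

    def allow(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.per_seconds:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            return False
        self._calls.append(now)
        return True


def guard_tool_call(tool: str, limiter: ToolRateLimiter,
                    user_confirmed: bool = False) -> str:
    # Permission check first, so unconfirmed attempts do not consume rate budget.
    if tool in NEEDS_CONFIRMATION and not user_confirmed:
        return "rejected: needs confirmation"
    if not limiter.allow():
        return "rejected: rate limit"
    return "allowed"
```

The guard lives in the harness, outside the model's control, which is the point: the model can request any call, but only the harness decides whether it runs.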
Reliability
Production agents fail in ways toy agents do not. The reliability surface:
- Model fallbacks. If the primary model returns a 5xx, the harness retries against a secondary. The fallback model often returns a result of slightly different quality. Track the differential.
- Tool fallbacks. A failed search tool either retries against the same provider, falls back to a secondary provider, or surfaces the error to the agent so the agent picks a different approach.
- Idempotency. Tool calls that the harness retries (network errors, timeouts) need to be idempotent at the tool level. A "create issue" tool that runs twice creates two issues unless the implementation handles deduplication.
- Loop termination. Iteration caps, time caps, and confidence-based termination all matter. An agent that loops on a task it cannot solve consumes budget until something forces it to stop. The harness owns "something."
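The idempotency point is the easiest of these to get wrong, so here is a minimal sketch. It assumes a hypothetical IssueTracker backend that deduplicates on a caller-supplied idempotency key; the simulated failure is a timeout that occurs after the write has already succeeded, which is the case that produces duplicates.

```python
import uuid


class IssueTracker:
    """Hypothetical tool backend that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.issues: dict[str, dict] = {}  # idempotency key -> stored issue
        self.fail_next_response = False    # test hook: lose the next response

    def create_issue(self, title: str, idempotency_key: str) -> dict:
        # A repeated key returns the existing issue instead of creating another.
        if idempotency_key not in self.issues:
            self.issues[idempotency_key] = {"id": len(self.issues) + 1, "title": title}
        issue = self.issues[idempotency_key]
        if self.fail_next_response:
            self.fail_next_response = False
            raise TimeoutError("response lost after the write succeeded")
        return issue


def create_issue_with_retry(tracker: IssueTracker, title: str,
                            max_attempts: int = 3) -> dict:
    key = uuid.uuid4().hex  # one key per logical action, reused on every retry
    for attempt in range(max_attempts):
        try:
            return tracker.create_issue(title, idempotency_key=key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller


tracker = IssueTracker()
tracker.fail_next_response = True
issue = create_issue_with_retry(tracker, "Login page returns 500")
```

The key is generated once, outside the retry loop. Generating a fresh key per attempt would defeat the deduplication.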
Memory and state
An agent that forgets everything between sessions is limited. An agent that remembers everything is expensive and produces unpredictable prompts. The middle ground is the engineering work:
- Session memory. The conversation history of the current session. Bounded by context-engineering patterns (compaction, just-in-time retrieval).
- Persistent memory. Facts the agent learned across sessions (user preferences, prior decisions, project context). Stored in a database or vector store, retrieved at session start.
- Working memory. Scratchpads or planning artifacts the agent maintains across turns inside a single session. Often a structured-note pattern on a filesystem the agent reads back.
The line between session memory and persistent memory is a design choice. Conservative defaults treat persistent memory as opt-in per fact, with the user (or the agent's principal) approving what gets stored.
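The three memory tiers and the opt-in default can be sketched as a small data structure. This is an illustration under assumed names (AgentMemory and its methods are hypothetical), with dictionaries standing in for the database or vector store.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    session: list[str] = field(default_factory=list)          # current conversation
    working: dict[str, str] = field(default_factory=dict)     # scratchpad notes
    persistent: dict[str, str] = field(default_factory=dict)  # survives sessions
    pending: dict[str, str] = field(default_factory=dict)     # proposed, not yet approved

    def propose_fact(self, key: str, value: str) -> None:
        # The agent may only propose; nothing is stored until someone approves.
        self.pending[key] = value

    def approve_fact(self, key: str) -> None:
        self.persistent[key] = self.pending.pop(key)

    def reject_fact(self, key: str) -> None:
        self.pending.pop(key, None)

    def end_session(self) -> None:
        # Session and working memory are discarded; unapproved facts are dropped.
        self.session.clear()
        self.working.clear()
        self.pending.clear()


memory = AgentMemory()
memory.session.append("user: I prefer metric units")
memory.propose_fact("units", "metric")
memory.propose_fact("home_address", "unknown street")
memory.approve_fact("units")   # the user approved this one
memory.end_session()           # the unapproved address is never persisted
```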
Human-in-the-loop boundaries
The most consequential design decision in an agentic system is where the human enters the loop. Three common patterns:
- Approval gate. The agent does its work, presents a plan or a diff, and waits for human approval before acting. Used for high-stakes irreversible actions (sending external email, deploying to production, large payments).
- Override channel. The agent acts autonomously, but the human has a real-time channel to interrupt or redirect. Used for long-running tasks where the human is co-present but not approving each step.
- Escalation path. The agent acts autonomously most of the time, but escalates to a human on detected uncertainty (low confidence, repeated failure, ambiguous input). Used for high-volume customer-facing systems.
Naming the pattern explicitly per system prevents the silent drift toward "fully autonomous because nobody set a boundary." Autonomy without an explicit decision is not autonomy. It is deferred boundary work.
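Naming the pattern can be as literal as an enum the harness must be configured with. The sketch below is one hypothetical encoding of the three patterns; the Boundary enum, the next_step function, and the 0.7 confidence threshold are assumptions for illustration.

```python
from enum import Enum


class Boundary(Enum):
    APPROVAL_GATE = "approval_gate"
    OVERRIDE_CHANNEL = "override_channel"
    ESCALATION_PATH = "escalation_path"


def next_step(boundary: Boundary, *, approved: bool = False,
              interrupted: bool = False, confidence: float = 1.0,
              threshold: float = 0.7) -> str:
    """Decide what the harness does before executing a proposed action."""
    if boundary is Boundary.APPROVAL_GATE:
        # Nothing happens until a human approves the plan or diff.
        return "act" if approved else "wait_for_approval"
    if boundary is Boundary.OVERRIDE_CHANNEL:
        # The agent proceeds unless the human has interrupted.
        return "stop" if interrupted else "act"
    if boundary is Boundary.ESCALATION_PATH:
        # The agent proceeds unless its own uncertainty signal trips.
        return "escalate" if confidence < threshold else "act"
    raise ValueError(f"unknown boundary: {boundary!r}")
```

Because next_step has no default boundary, a system cannot be deployed without someone choosing one, which is the drift the paragraph above warns about.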
What goes wrong
A short taxonomy of agentic-system failure modes worth carrying:
- Context window saturation under load. A pattern that worked in testing breaks in production when conversations get longer or tool results get bigger. The fix lives in context engineering patterns.
- Cost spirals. A single misbehaving agent in a loop can run up a five-figure bill overnight. Budget caps at the harness level are the defense.
- Tool-poisoning. Untrusted content returned from a search or file-read tool injects instructions the agent follows. Defense: treat tool results as data, not instruction.
- Confident wrong answers under tool failure. A silently failed tool returns an empty result. The agent proceeds as if the result were meaningful. Defense: structured error returns the agent recognizes as failure.
- State corruption from non-idempotent retries. The harness retries a "send email" tool after a network timeout. Two emails go out. Defense: idempotency keys at the tool boundary.
- Drift. An agent that worked yesterday behaves differently today because the model version changed silently. Defense: pin model versions and treat upgrades as deployments.
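The defense against confident wrong answers, structured error returns, can be sketched as a wrapper around every tool. ToolResult, run_tool, and the two example tools are hypothetical names; the point is that a failure and a genuinely empty result become distinguishable values.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    data: Any = None
    error: Optional[str] = None


def run_tool(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> ToolResult:
    """Wrap a tool so failures come back as explicit errors, never as empty data."""
    try:
        return ToolResult(ok=True, data=fn(*args, **kwargs))
    except Exception as exc:
        return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")


def flaky_search(query: str) -> list[str]:
    raise ConnectionError("search backend unreachable")


def empty_search(query: str) -> list[str]:
    return []  # the search worked and found nothing


failed = run_tool(flaky_search, "agent observability")
empty = run_tool(empty_search, "agent observability")
```

Here failed.ok is False with an error string the model can read, while empty.ok is True with an empty list, so the agent can tell "the tool broke" from "there is nothing to find".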
Related
- Agentic Workflows — the patterns these systems deploy.
- Context Engineering — the practice that keeps individual contexts bounded.
- Tool Calling — the message-shape underneath every tool boundary.
- Prompt Injection — the threat model for tool-result content.