laranevans.com
Topics / AI / Context Engineering / Tool Calling

Tool calling is the mechanic by which a language model invokes external code through a structured message. One idea generates the whole design. The model emits a request, and the application stays in control of what runs. The model never executes anything. It produces a tool-call message naming a tool and its arguments, and the application decides whether and how to act on that request.

Hold that separation and the rest of the page follows from it. Tool definitions are the application telling the model what it is allowed to request. The message loop is the request-and-result handshake. Error handling, parallel calls, and tool choice are all questions about what the application does with a request once the model emits one.

The message loop: request, execute, result

A tool-calling turn has three messages even though they happen in one loop. The model produces the first, the application produces the next two.

  1. The tool-use message. The model emits this in place of (or alongside) regular text. It carries the tool's name, a JSON object of arguments, and a unique tool-use ID.
  2. The execution step. The application receives the tool-use message, looks up the named tool, validates the arguments against the schema, runs the code, and produces a result. Nothing in this step is the model's doing.
  3. The tool-result message. The application sends this back, referencing the tool-use ID and carrying the result as text, JSON, or a structured content block. The model reads it on the next turn and continues.

Some providers wrap this loop in a higher-level abstraction, a single complete_with_tools call that runs the loop internally. The underlying mechanic is the same: the model requests, the application executes.

The pattern is older than the term. Early LLM agents stitched it together with regex parsing of free-text output. Modern provider APIs (Anthropic's, OpenAI's, Google's) now expose tool calling as a first-class message type with structured inputs and outputs, which removes the brittle parsing step and gives the model a clean signal that it is requesting a tool rather than emitting prose.

sequenceDiagram
    participant Model
    participant App as Application
    participant Tool
    Model->>App: tool-use (name, args, id)
    App->>App: validate args against schema
    App->>Tool: run with args
    Tool-->>App: result
    App->>Model: tool-result (id, result)
    Note over Model: reads result on next turn, continues

Tool definitions: what the application lets the model request

A tool definition is the application's side of the contract. It tells the model what it is allowed to request and how to shape the request. A definition has four parts.

  • A name. Unique per session. The model uses it to reference the tool.
  • A description. Natural-language guidance the model reads to decide when to call the tool. This is the most important part of tool design, and the most frequently underwritten.
  • A parameter schema. JSON Schema describing what arguments the tool accepts. The model uses it to construct valid calls.
  • The implementation. Code that runs when the tool is invoked. It lives in the application, not in the protocol, which is the model-emits-request, application-executes line made concrete.

The description is where most tool design lives. A description that names the tool's purpose, lists its inputs in plain language, mentions when not to use it, and gives one example of a typical call gives the model what it needs to use the tool well. A description that says "Calls the search endpoint" leaves the model guessing.

Anthropic's Writing Tools for Agents post recommends consolidating tools by intent rather than wrapping every existing API endpoint. A schedule_event tool that handles search, conflict check, and booking inside one call leaves the model less to figure out than three separate tools wired in sequence.

Tool choice: how hard the application forces a request

The application controls whether the model is free to request a tool or required to. Provider APIs expose a tool_choice parameter (names differ across providers) that sets the model's behavior at the tool-selection boundary. The four values run from full model discretion to no tools at all.

tool_choice What the model does When to use it
Auto Decides on its own whether to call a tool or respond with text Default for most workloads
Any Required to call some tool When the application needs a structured output shaped like a tool call
Specific Required to call one named tool Testing, or workflows where the next action is fixed
None Emits text only, no tool call A final summarization turn after a multi-step tool-using interaction

Most production workloads stay on Auto. Forcing tool choice fits narrow cases: structured-output extraction, evaluation harnesses, deterministic workflow steps.

Error handling: returning a failure the model can read

When a tool fails (bad input, network error, downstream service down), the application returns the error as a tool-result message. It sets an is_error flag, an error code, or distinct content the model recognizes as an error. The model reads the failure on the next turn and either retries with adjusted arguments, picks a different tool, or asks the user for guidance.

Returning a raw exception as the tool result usually works in practice. The model recognizes a stack trace and adjusts. A structured error block (a short message and a code) gives the model a cleaner signal and avoids wasting context on noise.

What to avoid is silently returning an empty result on failure. The model assumes the tool succeeded with no useful output and proceeds. The downstream effect is a confidently wrong response from the assistant.

Parallel tool calls: many requests in one message

Modern providers support multiple tool-use blocks in a single assistant message. The model uses this when several independent calls answer parts of the same question. A query like "What's the weather in San Francisco and New York?" triggers two parallel get_weather calls in one turn rather than two sequential turns.

The application runs the calls in parallel, collects the results, and returns all of them in a single user message containing multiple tool-result blocks. The model sees both results on its next turn and answers.

Parallel calls reduce latency on multi-source queries. They also stress the application's concurrency model. A tool that holds a database connection from a small pool serializes the parallel calls in practice even when the model issued them concurrently.

Tool calling and context budget

Every tool definition lives in the system context. Every tool-use and tool-result lives in the conversation history. A long agentic session accumulates tool traffic faster than user prose, and the context budget tightens correspondingly. See Context Engineering for the patterns that keep this manageable: just-in-time retrieval, structured note-taking, sub-agent architectures, compaction.

Tool design decisions ripple into context cost.

  • A verbose tool description costs tokens on every turn the tool is available. Tight, complete descriptions beat long ones.
  • A tool returning a 10KB result spends 10KB of context on every subsequent turn the result is in scope. Return identifiers, not full payloads, when the next step is another tool call rather than a final answer.
  • A tool that should rarely fire still occupies its slot in the description block. A tool the model has used twice in a month is a candidate for removal.

Failure modes worth naming

Each failure mode below is a place the model-emits-request, application-executes line gets crossed or trusted too far.

  • Hallucinated tool calls. The model invokes a tool name not in the schema, or invents an argument the schema does not accept. Validate every tool call against the schema before execution. Return a structured error if validation fails so the model self-corrects.
  • Looping on the same call. The model retries the same failing tool with the same arguments. Detect repetition at the application layer and either inject a guidance message ("This call has failed three times. Try a different approach.") or stop the loop and surface to the user.
  • Tool-result poisoning. A tool returns attacker-controlled content (web search results, scraped pages, untrusted file contents) and the model treats it as instructions. The defense is the same as for any untrusted input: mark the content as data, not as instruction, and apply the prompt-injection guard.
  • Confidence mismatch. The model invokes a tool to "check" a fact when the fact is already in the system prompt or the recent conversation. It wastes a turn. Usually a sign the system prompt is unclear about what the model already knows.