Tree of Thoughts — laranevans.com

Tree of Thoughts generalizes chain-of-thought into a deliberate search over candidate intermediate states. The model produces multiple candidate thoughts at each step, evaluates them with self-judgment, and chooses which branch to expand. Yao et al. (2023), in Tree of Thoughts: Deliberate Problem Solving with Large Language Models, introduced the framework and showed it outperformed CoT on tasks requiring planning, lookahead, or backtracking.

The whole family of reasoning-prompt patterns comes from one move: each pattern lets the model explore a wider region of the solution space and recover from a wrong turn. A linear chain commits to one path. A vote averages several paths. A tree searches over paths with the ability to back up. Read the patterns in order of how much structure they let the model explore, and Tree of Thoughts sits one step short of an arbitrary graph.

ⓘ Info · The reasoning-prompt patterns form one expressiveness ladder

Each pattern below adds search structure, and adds cost, over the one before it. The generator is a single question: how much of the solution space does the model get to explore, and can it back out of a dead end?

Chain of Thought samples one linear reasoning path. No exploration, no recovery.
Self-consistency samples N independent chains and votes. Exploration by parallel sampling, no recovery within a chain.
Least-to-most decomposes the problem once, then solves a fixed linear sequence of subproblems. Structure in the decomposition, still a single path.
Tree of Thoughts searches a tree of candidate thoughts with lookahead and backtracking. Exploration and recovery both.
Graph of Thoughts generalizes the tree to arbitrary dependency graphs, so intermediate results re-enter the reasoning later.

Pick the lowest rung that solves your task. Each rung up buys expressiveness with LLM calls.

The motivating contrast sits between rungs 1 and 4. CoT samples one linear reasoning path. If an early step goes wrong, the chain runs to a wrong answer without revisiting the bad step. Tree of Thoughts treats the problem as search, with the model acting as both the move generator and the evaluator.

The patterns at a glance

Pattern	Search structure	Recovery from a bad step	Cost
Chain of Thought	One linear path	None	One pass
Self-consistency	N independent chains, then vote	None within a chain	N passes
Least-to-most	One decomposition, linear solve	None	One pass plus decomposition
Tree of Thoughts	Tree of candidate thoughts, lookahead and backtracking	Backtracking	Many passes per query
Graph of Thoughts	Arbitrary dependency graph	Re-entry and merging	Many passes, more bookkeeping

The hierarchy adds expressiveness and cost in roughly that order.

The mechanism rests on three moves: thought, state evaluation, search

A Tree of Thoughts run is a search loop built from three named pieces plus a decision about granularity.

Thought decomposition sets the granularity

A thought is the unit the search operates on. What counts as one depends on the task: a line of math, a sentence in a writing task, a tile move in a puzzle. The decomposition determines how fine-grained the search tree is.

Thought generation produces the candidate branches

At each search-tree node, the model produces multiple candidate next thoughts. Two strategies are common: sample diverse continuations from the same prompt, or propose distinct alternatives in a single call. These candidates are the branches the search chooses among.

State evaluation scores each candidate

State evaluation is the model judging the value of each candidate thought. The judgment takes the shape of a scalar score (rate this state on a scale), a comparative ranking, or a classification (sure / maybe / impossible). The evaluator is itself an LLM call, which makes its reliability the load on which the whole method rests.

Search walks the tree

A standard search algorithm (BFS, DFS, beam search) walks the tree, expanding promising states and pruning hopeless ones. Yao et al. used BFS and DFS depending on the task structure.

When Tree of Thoughts helps

Tree of Thoughts is appropriate when:

The task has structure that admits intermediate-state evaluation. Math, puzzles, planning, code generation with checkable subgoals all fit.
Early errors are expensive, because the chain commits and cannot recover gracefully.
The cost of search (many LLM calls per query) is acceptable relative to the accuracy gain.

The original paper reported large gains on Game of 24 (a small math puzzle benchmark), Creative Writing (where multiple plans compete), and Mini Crosswords. CoT solved Game of 24 about 4% of the time. Tree of Thoughts solved it about 74%.

When Tree of Thoughts does not help

Tasks without checkable intermediate states. Free-form generation where every continuation is plausible offers the evaluator no signal. The search collapses.
Tasks where the model cannot self-evaluate reliably. Tree of Thoughts depends on the evaluator being calibrated. A model that rates every state as "promising" cannot prune.
Latency-sensitive interactive systems. The cost of search is N-fold more LLM calls. Wall-clock time grows accordingly.

The faithfulness issue from Turpin et al. (2023) extends to the evaluator: the model's stated reason for preferring one branch does not necessarily match the actual basis of its preference. The search is only as honest as the evaluator.

Graph of Thoughts extends the tree

Besta et al. (2024) extended the tree to a Graph of Thoughts, allowing thoughts to form arbitrary dependency graphs rather than only linear chains or trees. The graph generalization fits tasks where intermediate results re-enter the reasoning later, such as merging two computed quantities back into a third. This is the top rung of the expressiveness ladder above.

Practical guidance

A few rules of thumb:

Reach for ToT when the task has clear intermediate states and CoT under-performs. Do not adopt it as a default.
Validate that the model evaluates intermediate states reliably for the specific task before scaling up. A weak evaluator wastes the search budget.
Cap the search depth and branching factor. Unbounded search burns through LLM cost without proportional gain.
For tasks where intermediate states are runnable code, prefer the executable check over an LLM-judged one. The interpreter does not hallucinate.

Chain of Thought — the linear baseline ToT extends.
Self-Consistency — the voting-over-chains alternative.
Least-to-Most Prompting — the decomposition cousin.
Prompt Engineering — the broader cluster.