Tree of Thoughts
Tree of Thoughts generalizes chain-of-thought into a deliberate search over candidate intermediate states. The model produces multiple candidate thoughts at each step, evaluates them with self-judgment, and chooses which branch to expand. Yao et al. (2023), in Tree of Thoughts: Deliberate Problem Solving with Large Language Models, introduced the framework and showed it outperformed CoT on tasks requiring planning, lookahead, or backtracking.
The whole family of reasoning-prompt patterns comes from one move: each pattern lets the model explore a wider region of the solution space and recover from a wrong turn. A linear chain commits to one path. A vote averages several paths. A tree searches over paths with the ability to back up. Read the patterns in order of how much structure they let the model explore, and Tree of Thoughts sits one step short of an arbitrary graph.
The motivating contrast sits between rungs 1 and 4. CoT samples one linear reasoning path. If an early step goes wrong, the chain runs to a wrong answer without revisiting the bad step. Tree of Thoughts treats the problem as search, with the model acting as both the move generator and the evaluator.
The patterns at a glance
| Pattern | Search structure | Recovery from a bad step | Cost |
|---|---|---|---|
| Chain of Thought | One linear path | None | One pass |
| Self-consistency | N independent chains, then vote | None within a chain | N passes |
| Least-to-most | One decomposition, linear solve | None | One pass plus decomposition |
| Tree of Thoughts | Tree of candidate thoughts, lookahead and backtracking | Backtracking | Many passes per query |
| Graph of Thoughts | Arbitrary dependency graph | Re-entry and merging | Many passes, more bookkeeping |
The hierarchy adds expressiveness and cost in roughly that order.
The mechanism rests on three moves: thought, state evaluation, search
A Tree of Thoughts run is a search loop built from three named pieces plus a decision about granularity.
Thought decomposition sets the granularity
A thought is the unit the search operates on. What counts as one depends on the task: a line of math, a sentence in a writing task, a tile move in a puzzle. The decomposition determines how fine-grained the search tree is.
Thought generation produces the candidate branches
At each search-tree node, the model produces multiple candidate next thoughts. Two strategies are common: sample diverse continuations from the same prompt, or propose distinct alternatives in a single call. These candidates are the branches the search chooses among.
State evaluation scores each candidate
State evaluation is the model judging the value of each candidate thought. The judgment takes the shape of a scalar score (rate this state on a scale), a comparative ranking, or a classification (sure / maybe / impossible). The evaluator is itself an LLM call, which makes its reliability the load on which the whole method rests.
Search walks the tree
A standard search algorithm (BFS, DFS, beam search) walks the tree, expanding promising states and pruning hopeless ones. Yao et al. used BFS and DFS depending on the task structure.
When Tree of Thoughts helps
Tree of Thoughts is appropriate when:
- The task has structure that admits intermediate-state evaluation. Math, puzzles, planning, code generation with checkable subgoals all fit.
- Early errors are expensive, because the chain commits and cannot recover gracefully.
- The cost of search (many LLM calls per query) is acceptable relative to the accuracy gain.
The original paper reported large gains on Game of 24 (a small math puzzle benchmark), Creative Writing (where multiple plans compete), and Mini Crosswords. CoT solved Game of 24 about 4% of the time. Tree of Thoughts solved it about 74%.
When Tree of Thoughts does not help
- Tasks without checkable intermediate states. Free-form generation where every continuation is plausible offers the evaluator no signal. The search collapses.
- Tasks where the model cannot self-evaluate reliably. Tree of Thoughts depends on the evaluator being calibrated. A model that rates every state as "promising" cannot prune.
- Latency-sensitive interactive systems. The cost of search is N-fold more LLM calls. Wall-clock time grows accordingly.
The faithfulness issue from Turpin et al. (2023) extends to the evaluator: the model's stated reason for preferring one branch does not necessarily match the actual basis of its preference. The search is only as honest as the evaluator.
Graph of Thoughts extends the tree
Besta et al. (2024) extended the tree to a Graph of Thoughts, allowing thoughts to form arbitrary dependency graphs rather than only linear chains or trees. The graph generalization fits tasks where intermediate results re-enter the reasoning later, such as merging two computed quantities back into a third. This is the top rung of the expressiveness ladder above.
Practical guidance
A few rules of thumb:
- Reach for ToT when the task has clear intermediate states and CoT under-performs. Do not adopt it as a default.
- Validate that the model evaluates intermediate states reliably for the specific task before scaling up. A weak evaluator wastes the search budget.
- Cap the search depth and branching factor. Unbounded search burns through LLM cost without proportional gain.
- For tasks where intermediate states are runnable code, prefer the executable check over an LLM-judged one. The interpreter does not hallucinate.
Related
- Chain of Thought — the linear baseline ToT extends.
- Self-Consistency — the voting-over-chains alternative.
- Least-to-Most Prompting — the decomposition cousin.
- Prompt Engineering — the broader cluster.