Self-Consistency
Self-consistency is a decoding strategy that replaces a single greedy chain-of-thought sample with multiple sampled chains, then aggregates the final answers by voting. Wang et al. (2022), in "Self-Consistency Improves Chain of Thought Reasoning in Language Models," showed that the technique improves accuracy on arithmetic, commonsense, and symbolic-reasoning benchmarks over that greedy baseline.
The intuition is that a problem with a single correct answer often has many reasoning paths leading to it. A model that samples five chains and arrives at the same answer in four of them is more likely correct than a model that produces one chain leading to one answer. Diversity of reasoning is the signal. Agreement on the answer is the aggregation rule.
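A back-of-the-envelope calculation makes the intuition concrete. The sketch below assumes a deliberately simplified model that is not from the paper: each chain is independently correct with probability p, there are only two possible answers, and N is odd so ties cannot occur. Real chains from one model are correlated and usually have many possible wrong answers, so treat the numbers as illustrative only. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n chains is correct.

    Simplifying assumptions (ours, not the paper's): chains are
    independent, each is correct with probability p, the answer is
    binary, and n is odd so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd under this simplified model")
    needed = n // 2 + 1  # smallest number of correct chains that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


# A chain that is right 60% of the time benefits from voting;
# one that is right 40% of the time is made worse by it.
print(round(majority_vote_accuracy(0.6, 5), 5))  # 0.68256
print(round(majority_vote_accuracy(0.4, 5), 5))  # 0.31744
```

Under these assumptions, voting helps only when individual chains are right more often than wrong. With more than two possible answers the threshold is looser, because wrong chains tend to scatter across different wrong answers.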
The mechanism
The recipe has three steps:
- Sample N reasoning chains with non-zero temperature. The chains diverge because the sampling adds randomness to token selection.
- Extract the final answer from each chain. The extraction is task-specific: a number for math, a category for classification, a span for extraction.
- Vote. Return the most common answer. For numerical or categorical answers, this is a straightforward majority vote. For free-form answers, semantic equivalence checks (LLM-as-judge, normalization rules) are needed before counting.
The paper used N values from 5 to 40. Returns diminish after roughly 20 chains on most benchmarks. The temperature was typically 0.5 to 0.7: enough randomness to diversify the chains, but not so much that they degenerate.
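The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: `sample_chain` is a hypothetical stand-in for whatever LLM client is in use, `last_number` is a deliberately naive extractor for numeric answers, and `canned_sampler` exists only so the example runs without a model.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Optional


def last_number(chain: str) -> Optional[str]:
    """Naive extractor (assumption): the final answer is the last number in the chain."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return matches[-1] if matches else None


def self_consistency(
    sample_chain: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> chain text
    prompt: str,
    n: int = 10,
    temperature: float = 0.7,
    extract: Callable[[str], Optional[str]] = last_number,
) -> Optional[str]:
    """Sample n chains, extract each final answer, and return the most common one."""
    votes = Counter()
    for _ in range(n):
        chain = sample_chain(prompt, temperature)  # step 1: sample
        answer = extract(chain)                    # step 2: extract
        if answer is not None:                     # chains with no parseable answer do not vote
            votes[answer] += 1
    if not votes:
        return None
    # Step 3: vote. Ties go to the answer that was seen first.
    return votes.most_common(1)[0][0]


def canned_sampler(chains: Iterable[str]) -> Callable[[str, float], str]:
    """Test double that replays fixed chains instead of calling a model."""
    it = iter(chains)
    return lambda prompt, temperature: next(it)


demo_chains = [
    "16 - 3 - 4 = 9 eggs left, 9 * 2 = 18. The answer is 18.",
    "She sells 16 - 7 = 9 eggs at $2 each. The answer is 18.",
    "16 - 3 = 13, 13 * 2 = 26. The answer is 26.",
    "9 eggs times 2 dollars is 18 dollars. The answer is 18.",
    "I am not sure.",
]
print(self_consistency(canned_sampler(demo_chains), "How much does she make?", n=5))  # 18
```

Passing `extract` as a parameter keeps the voting loop task-agnostic: a classification task would supply a label extractor, and a free-form task would need an equivalence check before counting.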
When it helps
Self-consistency is appropriate when:
- The task has a single correct answer (math, multiple-choice, classification).
- Multiple reasoning paths plausibly reach that answer (the model has more than one route in mind, and the routes are not all wrong in the same way).
- Accuracy matters more than cost. Self-consistency multiplies LLM cost by N.
The largest gains in the original paper landed on math word problems (GSM8K, SVAMP, AQuA), where intermediate arithmetic errors are common in individual chains but the modal answer across chains is often correct.
When it does not help
A few patterns where self-consistency adds cost without accuracy:
- Tasks where most sampled chains agree by default. The model is already confident, and voting changes nothing. The marginal sample is wasted.
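- Tasks with subjective or free-form output. "Write a poem" has no modal answer: exact matches are rare, and any aggregation that forces agreement drifts toward a generic middle that nobody asked for.
- Tasks where the model is systematically wrong. If every chain makes the same error, voting reinforces the error instead of correcting it: the wrong answer wins, and the agreement makes it look well supported.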
Self-consistency also assumes the answer-extraction step is reliable. A bug in extraction (parsing the wrong number, missing a negative sign) corrupts the vote.
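As an illustration of the extraction pitfalls, here is a sketch of a slightly more defensive numeric extractor. It assumes that the final answer is the last number in the chain and that votes should be counted on a canonical string. The function name and the normalization rules are illustrative choices, not part of the original method.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")


def normalize_numeric_answer(chain: str) -> Optional[str]:
    """Return the last number in the chain in canonical form, or None.

    Canonical form means "1,000", "1000.0", and "$1000" all vote as "1000",
    and a leading minus sign is preserved.
    """
    # Models sometimes emit the Unicode minus (U+2212); map it to ASCII
    # so the sign is not silently dropped.
    cleaned = chain.replace("\u2212", "-")
    matches = _NUMBER.findall(cleaned)
    if not matches:
        return None
    raw = matches[-1].replace(",", "")
    try:
        value = Decimal(raw)
    except InvalidOperation:
        return None
    if value == value.to_integral_value():
        # Avoid Decimal.normalize() here: it would render 1000 as "1E+3".
        return str(value.to_integral_value())
    return str(value.normalize())  # strips trailing zeros, e.g. 2.50 -> "2.5"


print(normalize_numeric_answer("The answer is $1,000."))    # 1000
print(normalize_numeric_answer("Total: 1000.0 dollars"))    # 1000
print(normalize_numeric_answer("So x = \u22125"))           # -5
```

Without normalization, "1,000" and "1000.0" would split the vote between two strings that mean the same thing, which can hand the win to a wrong answer.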
Cost trade-offs
The cost multiplier is the most cited objection. N=20 means 20× the LLM calls for one user query. Several mitigations:
- Early stopping. If the first K chains all agree, return that answer without sampling the rest.
- Adaptive N. Sample more chains only when the early ones disagree.
- Smaller model for chains, larger model for verification. Sample chains with a cheap model, verify the modal answer with a more capable one.
None of these are in the original paper. The paper established the baseline. Production deployments adapt the recipe.
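One way to combine early stopping with adaptive N is sketched below. The stopping rule (stop once the leading answer holds at least a given share of the chains drawn so far) and all parameter names and defaults are assumptions made for illustration. `sample_answer` is a hypothetical callable that samples one chain and returns its extracted answer, or None if extraction fails.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def adaptive_self_consistency(
    sample_answer: Callable[[], Optional[str]],  # hypothetical: one chain -> extracted answer
    min_samples: int = 3,
    max_samples: int = 20,
    agreement: float = 0.8,
) -> Tuple[Optional[str], int]:
    """Sample until the leading answer is dominant or the budget runs out.

    Returns (answer, number_of_chains_drawn).
    """
    votes = Counter()
    drawn = 0
    while drawn < max_samples:
        answer = sample_answer()
        drawn += 1
        if answer is not None:
            votes[answer] += 1
        if drawn >= min_samples and votes:
            top, count = votes.most_common(1)[0]
            # Failed extractions count against agreement, since drawn includes them.
            if count / drawn >= agreement:
                return top, drawn  # early stop: the chains agree
    if not votes:
        return None, drawn
    return votes.most_common(1)[0][0], drawn  # budget exhausted: plain plurality vote


# Usage with canned answers in place of a model.
easy = iter(["18"] * 20)
print(adaptive_self_consistency(lambda: next(easy)))  # ('18', 3)

harder = iter(["18", "26", "18", "18", "18"] + ["18"] * 15)
print(adaptive_self_consistency(lambda: next(harder)))  # ('18', 5)
```

Easy queries then cost `min_samples` calls instead of the full budget, and only contested queries pay for more chains. The threshold trades cost against the risk of stopping on an early, unrepresentative streak.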
Related
- Chain of Thought — the underlying technique self-consistency wraps.
- Zero-Shot CoT — pairs naturally with self-consistency (sample multiple zero-shot chains).
- Prompt Evaluation — measuring the accuracy/cost trade-off.
- Prompt Engineering — the broader cluster.