Self-Consistency — laranevans.com

Self-consistency is a decoding strategy that replaces a single greedily decoded chain of thought with multiple sampled chains, then aggregates the final answers by voting. Wang et al. (2022), in "Self-Consistency Improves Chain of Thought Reasoning in Language Models", showed that the technique improves accuracy on arithmetic, commonsense, and symbolic-reasoning benchmarks over greedy chain-of-thought decoding.

Two forces generate the whole technique

The technique comes from one observation about problems that have a single correct answer: many distinct reasoning paths reach that answer, and wrong paths tend to fail in different ways. Self-consistency turns that observation into a method by pairing two forces. Set both, and the rest of the recipe follows.

The two forces every self-consistency setup balances

Diversity of reasoning is the signal. Sampling at non-zero temperature makes the chains diverge, so the model explores more than one route to the answer. Without divergence the chains repeat each other and the method collapses into a single sample at higher cost.
Agreement on the answer is the aggregation rule. The final answer is the one the most chains reach. Correct chains tend to converge on the same answer while errors scatter, so the modal answer is more likely correct than any single chain.

A model that samples five chains and reaches the same answer in four of them is more likely correct than a model that produces one chain leading to one answer. Diversity supplies candidate answers. Agreement selects among them. The parameters below tune how far the diversity runs and how the agreement gets counted.
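The intuition can be checked with a little arithmetic. The sketch below is a toy model, not a result from the paper: it assumes the chains are independent, that each is correct with the same probability p, and that every wrong chain lands on the same wrong answer (the pessimistic case, since scattered errors only make the correct answer easier to elect). Real chains from one model are correlated, so treat the numbers as an illustration of the trend only. The function name `majority_correct_prob` is made up for this example.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """Probability that more than half of n independent chains are correct.

    Toy assumption: each chain is correct with probability p, and all
    wrong chains agree on one wrong answer (worst case for voting).
    """
    need = n // 2 + 1  # smallest count that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for n in (1, 5, 21):
        print(n, round(majority_correct_prob(0.6, n), 4))
```

With p = 0.6, one chain is right 60% of the time and a five-chain vote is right about 68% of the time. The same formula with p below 0.5 makes voting worse than a single chain, which is the toy-model version of a model that is wrong more often than it is right.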

The mechanism turns the two forces into three steps

Sample N reasoning chains at non-zero temperature. The chains diverge because sampling adds randomness to token selection. This is the diversity force in operation.
Extract the final answer from each chain. The extraction is task-specific: a number for math, a category for classification, a span for extraction.
Vote. Return the most common answer. For numerical or categorical answers this is a straightforward plurality vote: the answer with the most chains wins, even if it falls short of an absolute majority. For free-form answers, semantic equivalence checks (LLM-as-judge, normalization rules) run before counting. This is the agreement force in operation.
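The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `sample_chain` stands in for whatever call returns one reasoning chain at non-zero temperature, the `Answer:` marker is an assumed output convention, and `make_canned_sampler` is a hypothetical stub that replays fixed chains so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def extract_after_marker(chain: str, marker: str = "Answer:") -> Optional[str]:
    """Step 2: treat whatever follows the last marker as the final answer."""
    _, found, tail = chain.rpartition(marker)
    if not found:
        return None  # the chain never stated a final answer
    return tail.strip() or None


def self_consistency(
    sample_chain: Callable[[str], str],
    question: str,
    n: int = 5,
    extract: Callable[[str], Optional[str]] = extract_after_marker,
) -> Optional[str]:
    answers = []
    for _ in range(n):
        chain = sample_chain(question)  # Step 1: one sampled reasoning chain
        answer = extract(chain)         # Step 2: task-specific extraction
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # Step 3: plurality vote; on a tie, the answer seen first wins.
    return Counter(answers).most_common(1)[0][0]


def make_canned_sampler(chains: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: replays fixed chains in order."""
    it = iter(chains)
    return lambda _question: next(it)


if __name__ == "__main__":
    canned = [
        "16 - 3 - 4 = 9 eggs left. 9 * 2 = 18. Answer: 18",
        "She keeps 16 - 7 = 9 eggs and sells each for 2. Answer: 18",
        "16 - 3 = 13, then 13 * 2 = 26. Answer: 26",
        "9 eggs at 2 each. Answer: 18",
        "I am not sure how many eggs remain.",
    ]
    print(self_consistency(make_canned_sampler(canned), "How many dollars?", n=5))
```

Chains with no extractable answer are dropped before the vote instead of being counted as a category of their own. That is a design choice for this sketch; a production system might prefer to log or retry them.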

The parameters

The values below come from the original paper. They set how far the diversity force runs and where it stops paying off.

Parameter	What it controls	Value from the paper
N (chains sampled)	How many reasoning paths the model explores before voting	5 to 40
Diminishing returns	The point past which more chains stop improving accuracy on most benchmarks	After roughly 20 chains
Temperature	How much the sampled chains diverge. Values that are too low make the chains repeat one route. Values that are too high degrade the chains into incoherent reasoning	Typically 0.5 to 0.7
Cost multiplier	LLM calls per user query, relative to a single chain	N (so 20 chains means 20 calls)
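One way to keep these parameters together in code is a small settings object. The sketch below is an assumption about how a caller might organize them, not something the paper prescribes: the class name, the field names, the defaults (picked from inside the ranges in the table), and the `cost_per_call` figure in the usage note are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfConsistencyConfig:
    """Illustrative container for the parameters in the table above."""

    n_chains: int = 20        # N: reasoning chains sampled per query
    temperature: float = 0.7  # must be non-zero so the chains diverge

    def __post_init__(self) -> None:
        if self.n_chains < 1:
            raise ValueError("n_chains must be at least 1")
        if self.temperature <= 0:
            raise ValueError("temperature must be > 0, or every chain repeats one route")

    def calls_per_query(self) -> int:
        # Cost multiplier: one LLM call per sampled chain.
        return self.n_chains


def estimated_cost(config: SelfConsistencyConfig, cost_per_call: float) -> float:
    """Cost of one user query, given a (hypothetical) price per LLM call."""
    return config.calls_per_query() * cost_per_call
```

For example, `estimated_cost(SelfConsistencyConfig(n_chains=20), cost_per_call=0.002)` multiplies a made-up price of 0.002 per call by 20, which makes the N-fold cost visible before any chains are sampled.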

When it helps

Self-consistency fits when all three of these hold:

The task has a single correct answer (math, multiple-choice, classification).
Multiple reasoning paths plausibly reach that answer. The model holds more than one route in mind, and the routes do not all fail in the same way.
Accuracy matters more than cost, because self-consistency multiplies LLM cost by N.

The largest gains in the original paper landed on math word problems (GSM8K, SVAMP, AQuA). Intermediate arithmetic errors are common in individual chains, and the modal answer across chains is often correct even when single chains are not.

When it does not help

Three patterns add cost without accuracy:

Tasks where most sampled chains agree by default. The model is already confident, voting changes nothing, and the marginal sample is wasted.
Tasks with subjective or free-form output. "Write a poem" has no modal answer: sampled outputs rarely match, so there is nothing meaningful to count, and forcing agreement collapses onto a generic mean nobody asked for.
Tasks where the model is systematically wrong. If every chain makes the same error, voting returns that error with apparent consensus rather than correcting it.

Self-consistency also assumes the answer-extraction step is reliable. A bug in extraction (parsing the wrong number, missing a negative sign) corrupts the vote before the agreement force ever runs.
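A sketch of a more defensive numeric extractor shows the kind of care this step needs. It is only one possible approach, and it assumes conventions the article does not specify: that chains usually end with a phrase like "answer is", that commas are thousands separators, and that "18" and "18.0" should count as the same vote. The name `extract_numeric_answer` is made up for this example.

```python
import re
from decimal import Decimal
from typing import Optional

# Optional minus sign, digits with optional thousands commas, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numeric_answer(chain: str, marker: str = "answer is") -> Optional[str]:
    """Return the final number in a chain as a canonical string, or None."""
    lowered = chain.lower()
    idx = lowered.rfind(marker)
    if idx != -1:
        # Prefer the first number after the last "answer is" marker.
        found = _NUMBER.search(lowered[idx + len(marker):])
        raw = found.group(0) if found else None
    else:
        # No marker: fall back to the last number anywhere in the chain.
        numbers = _NUMBER.findall(lowered)
        raw = numbers[-1] if numbers else None
    if raw is None:
        return None
    value = Decimal(raw.replace(",", ""))
    # Canonical form so "18", "18.0", and "18.00" all land on the same vote.
    return format(value.normalize(), "f")


if __name__ == "__main__":
    print(extract_numeric_answer("5 - 12 = -7. The answer is -7."))
    print(extract_numeric_answer("Total: 18.0"))
```

A pattern such as `\d+` alone would return 7 for the first example, which is exactly the dropped negative sign described above. Normalizing before counting matters as much as parsing: without it, chains that agree on the value can still split the vote on formatting.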

Cost trade-offs

The cost multiplier is the most cited objection. N=20 means 20 times the LLM calls for one user query. Three mitigations reduce that bill:

Early stopping. If the first K chains all agree, return that answer without sampling the rest.
Adaptive N. Sample more chains only when the early ones disagree.
Split the work across model sizes. Sample chains with a cheap model, then verify the modal answer with a more capable one.

None of these appear in the original paper. The paper established the baseline. Production deployments adapt the recipe.
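The first two mitigations can be combined in one loop. The sketch below is an illustration under stated assumptions rather than a recipe from the paper or from any particular deployment: `sample_answer` is a hypothetical zero-argument callable that samples one chain and returns its already-extracted answer, and the extra stopping rule (quit once the remaining budget cannot change the winner) is an addition of this example.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_self_consistency(
    sample_answer: Callable[[], str],
    k: int = 3,
    max_chains: int = 20,
) -> Tuple[str, int]:
    """Return (winning answer, number of chains actually sampled)."""
    # Early stopping: if the first k chains are unanimous, stop there.
    answers = [sample_answer() for _ in range(k)]
    if len(set(answers)) == 1:
        return answers[0], k

    # Adaptive N: the early chains disagree, so keep sampling.
    while len(answers) < max_chains:
        answers.append(sample_answer())
        top = Counter(answers).most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        lead = top[0][1] - runner_up
        remaining = max_chains - len(answers)
        if lead > remaining:
            break  # no other answer can catch up within the budget

    return Counter(answers).most_common(1)[0][0], len(answers)


if __name__ == "__main__":
    replay = iter(["18", "26", "18", "18", "18", "18", "18"])
    print(adaptive_self_consistency(lambda: next(replay), k=3, max_chains=7))
```

In the usage example the first three answers disagree, so sampling continues until the leading answer is ahead by more than the chains left in the budget, which happens after five of the seven allowed chains. The unanimity shortcut can return a different answer than a full vote would, so K trades accuracy for cost; the catch-up rule does not change the result, it only skips chains that could no longer matter.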

Chain of Thought. The underlying technique self-consistency wraps.
Zero-Shot CoT. Pairs naturally with self-consistency (sample multiple zero-shot chains).
Prompt Evaluation. Measuring the accuracy/cost trade-off.
Prompt Engineering. The broader cluster.