
Self-consistency is a decoding strategy that replaces a single greedy chain-of-thought sample with multiple sampled chains, then aggregates the final answers by voting. Wang et al. (2022), in "Self-Consistency Improves Chain of Thought Reasoning in Language Models," showed that the technique improves accuracy on arithmetic, commonsense, and symbolic-reasoning benchmarks over greedy chain-of-thought decoding.

Two forces generate the whole technique

The technique comes from one observation about problems that have a single correct answer: many distinct reasoning paths reach that answer, and wrong paths tend to fail in different ways. Self-consistency turns that observation into a method by pairing two forces. Diversity pushes the model to explore many different reasoning paths. Agreement trusts the answer that most of those paths converge on. Set both, and the rest of the recipe follows.

The mechanism turns the two forces into three steps

  1. Sample N reasoning chains at non-zero temperature. The chains diverge because sampling adds randomness to token selection. This is the diversity force in operation.
  2. Extract the final answer from each chain. The extraction is task-specific: a number for math, a category for classification, a span for extraction.
  3. Vote. Return the most common answer. For numerical or categorical answers this is a straightforward majority vote. For free-form answers, semantic equivalence checks (LLM-as-judge, normalization rules) run before counting. This is the agreement force in operation.
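The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `sample_chain` and `extract_answer` are hypothetical callables standing in for a real model call at non-zero temperature and a task-specific parser, and `fake_sampler` is a toy stand-in that only simulates noisy chains.

```python
import random
from collections import Counter
from typing import Callable, Optional


def self_consistency(
    sample_chain: Callable[[str], str],
    extract_answer: Callable[[str], Optional[str]],
    question: str,
    n: int = 20,
) -> Optional[str]:
    """Sample n chains, extract each final answer, return the most common one."""
    answers = []
    for _ in range(n):
        chain = sample_chain(question)   # step 1: one sampled reasoning chain
        answer = extract_answer(chain)   # step 2: task-specific extraction
        if answer is not None:           # chains with no parseable answer are skipped
            answers.append(answer)
    if not answers:
        return None
    # step 3: majority vote over the extracted answers
    return Counter(answers).most_common(1)[0][0]


# Toy stand-in for a model call (assumption): right about 60% of the time.
_rng = random.Random(0)


def fake_sampler(question: str) -> str:
    final = "18" if _rng.random() < 0.6 else _rng.choice(["16", "17", "20"])
    return f"... some reasoning ... The answer is {final}"


def last_token(chain: str) -> Optional[str]:
    """Toy extractor (assumption): the answer is the last whitespace-separated token."""
    return chain.split()[-1] if chain.strip() else None


print(self_consistency(fake_sampler, last_token, "toy question", n=20))
```

Passing the sampler and extractor in as functions keeps the voting logic independent of any particular model API or task format.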

The parameters

The values below come from the original paper. They set how far the diversity force runs and where it stops paying off.

Parameter | What it controls | Value from the paper
N (chains sampled) | How many reasoning paths the model explores before voting | 5 to 40
Diminishing returns | The point past which more chains stop improving accuracy on most benchmarks | After roughly 20 chains
Temperature | How far sampling spreads the chains apart. Too low, and the chains repeat one route. Too high, and they degrade into incoherent reasoning | Typically 0.5 to 0.7
Cost multiplier | LLM calls per user query, relative to a single chain | N (so 20 chains means 20 calls)

When it helps

Self-consistency fits when all three of these hold:

  • The task has a single correct answer (math, multiple-choice, classification).
  • Multiple reasoning paths plausibly reach that answer. The model holds more than one route in mind, and the routes do not all fail in the same way.
  • Accuracy matters more than cost, because self-consistency multiplies LLM cost by N.

The largest gains in the original paper landed on math word problems (GSM8K, SVAMP, AQuA). Intermediate arithmetic errors are common in individual chains, and the modal answer across chains is often correct even when single chains are not.

When it does not help

Three patterns add cost without accuracy:

  • Tasks where most sampled chains agree by default. The model is already confident, voting changes nothing, and the marginal sample is wasted.
  • Tasks with subjective or free-form output. "Write a poem" has no modal answer, so there is nothing to count. Forcing a vote collapses onto a generic answer nobody asked for.
  • Tasks where the model is systematically wrong. If every chain makes the same error, voting amplifies the error rather than correcting it.

Self-consistency also assumes the answer-extraction step is reliable. A bug in extraction (parsing the wrong number, missing a negative sign) corrupts the vote before the agreement force ever runs.
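As an illustration of what a careful extractor has to handle, here is a minimal sketch for numeric answers. It assumes the final answer is the last number that appears in the chain, which is a heuristic rather than anything the paper prescribes, and the name `extract_final_number` is hypothetical. It keeps a leading minus sign, strips thousands separators, and canonicalizes the result so that "18" and "18.0" count as the same vote.

```python
import re
from typing import Optional

# Optional minus sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(chain: str) -> Optional[str]:
    """Return the last number in the chain as a canonical string, or None."""
    matches = _NUMBER.findall(chain)
    if not matches:
        return None  # nothing to vote on; the caller should skip this chain
    raw = matches[-1].replace(",", "")
    value = float(raw)
    # Canonical form: integers lose the trailing ".0" so equal answers compare equal.
    return str(int(value)) if value.is_integer() else str(value)


print(extract_final_number("5 - 8 = -3, so the answer is -3."))
print(extract_final_number("The total comes to 1,250 dollars."))
print(extract_final_number("I am not sure."))
```

A prompt that asks the model to end with a fixed phrase such as "The answer is X" would make the last-number assumption safer. Going through `float` loses precision on very large integers, so exact-integer tasks may need a different canonical form.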

Cost trade-offs

The cost multiplier is the most cited objection. N=20 means 20 times the LLM calls for one user query. Three mitigations reduce that bill:

  • Early stopping. If the first K chains agree, return the majority answer without sampling the rest.
  • Adaptive N. Sample more chains only when the early ones disagree.
  • Split the work across model sizes. Sample chains with a cheap model, then verify the modal answer with a more capable one.
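The first two mitigations can be combined in one loop. The sketch below is one possible reading of them, not a recipe from the paper: `sample_answer` is a hypothetical callable that samples one chain and returns its extracted answer (or `None`), and `k` and `max_n` are illustrative parameter names. If the first `k` chains are unanimous, the loop stops. Otherwise it keeps sampling up to `max_n` and votes over everything it has seen.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def adaptive_vote(
    sample_answer: Callable[[], Optional[str]],
    k: int = 5,
    max_n: int = 20,
) -> Tuple[Optional[str], int]:
    """Return (answer, chains_used). Stops after k chains if they all agree."""
    votes: Counter = Counter()
    used = 0
    while used < max_n:
        answer = sample_answer()
        used += 1
        if answer is not None:
            votes[answer] += 1
        # Early stop: the first k chains all parsed and all gave the same answer.
        if used == k and len(votes) == 1 and sum(votes.values()) == k:
            break
    if not votes:
        return None, used
    return votes.most_common(1)[0][0], used


# Toy usage with canned answers in place of model calls (assumption).
easy = iter(["7"] * 20)
print(adaptive_vote(lambda: next(easy)))   # unanimous early, stops at k
hard = iter(["7", "9", "7", "7", "7", "7", "9", "7"])
print(adaptive_vote(lambda: next(hard), k=5, max_n=8))
```

On queries where the model is already confident, this spends `k` calls instead of `max_n`. The full budget is reserved for queries where the early chains disagree.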

None of these appear in the original paper. The paper established the baseline. Production deployments adapt the recipe.