Who Actually Chooses the Next Token?

The neural network does not emit the next token. It emits logits. A decoding policy transforms those logits, discards some candidates, and either takes an argmax or performs a random draw.

Two products can run the same model weights and produce different text because their decoders make different choices. This is policy, not formatting.

A language model is not writing a paragraph all at once. It generates one token at a time: take the text so far, compute a score for every possible next token, turn those scores into probabilities, choose one token, append it, and repeat.

The loop exposes every decoding decision to the user, one token at a time.
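The loop can be sketched in a few lines. This is a minimal illustration, not any particular product's decoder: `LogitFn`, `DecodingPolicy`, `generate`, and the four-token `toyModel` are hypothetical names invented for the example, and the prompt is assumed to be non-empty.

```typescript
// Hypothetical stand-in for a model: maps the tokens so far to one logit per vocabulary entry.
type LogitFn = (tokens: number[]) => number[];

// A decoding policy turns logits into exactly one chosen token id.
type DecodingPolicy = (logits: number[]) => number;

// Greedy policy: always take the index of the largest logit.
const greedy: DecodingPolicy = (logits) => {
  let best = 0;
  for (let i = 1; i < logits.length; i++) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
};

function generate(
  model: LogitFn,
  policy: DecodingPolicy,
  prompt: number[],
  maxNewTokens: number,
  eosToken: number,
): number[] {
  const tokens = [...prompt];
  for (let step = 0; step < maxNewTokens; step++) {
    const logits = model(tokens); // the network scores every candidate
    const next = policy(logits);  // the policy, not the network, picks one
    tokens.push(next);            // the choice is appended and conditions every later step
    if (next === eosToken) break;
  }
  return tokens;
}

// Toy model over a 4-token vocabulary: favors (last token + 1) mod 4.
const toyModel: LogitFn = (tokens) => {
  const last = tokens[tokens.length - 1];
  return [0, 1, 2, 3].map((i) => (i === (last + 1) % 4 ? 2.0 : 0.0));
};

const out = generate(toyModel, greedy, [0], 5, 3);
console.log(out); // [0, 1, 2, 3]: generation stops once token 3 (the assumed end marker) is chosen
```

Swapping `greedy` for a sampling policy changes the text without touching `toyModel`, which is the sense in which two products can share weights and still disagree.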

Temperature: the simplest knob

Suppose the model has logits $z_1,\dots,z_V$. A temperature-scaled distribution is

$$p_i(T)=\frac{e^{z_i/T}}{\sum_j e^{z_j/T}}.$$

Lower $T$ sharpens the distribution. Higher $T$ flattens it.

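As $T$ approaches zero from above, the distribution concentrates on the token with the largest logit (or splits evenly among tied maxima), which recovers greedy decoding in the limit. As $T$ grows, it moves toward uniform over finite logits. Temperature reshapes the whole distribution without changing the ranking of tokens, and it does not decide which tail tokens should remain eligible.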
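A small numeric check makes the sharpening and flattening concrete. The helper name `softmaxT` and the three logits are illustrative assumptions, not values from a real model.

```typescript
// Temperature-scaled softmax, following p_i(T) = exp(z_i / T) / sum_j exp(z_j / T).
function softmaxT(logits: number[], T: number): number[] {
  if (!(T > 0)) throw new Error("T must be greater than zero");
  const scaled = logits.map((z) => z / T);
  // Subtracting the max leaves the result unchanged and avoids overflow in Math.exp.
  const m = scaled.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = scaled.map((z) => Math.exp(z - m));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

const exampleLogits = [2.0, 1.0, 0.0]; // illustrative values only

for (const T of [0.5, 1.0, 2.0]) {
  const probs = softmaxT(exampleLogits, T);
  console.log(`T=${T}:`, probs.map((p) => p.toFixed(3)).join(" "));
}
```

The top token's share falls from roughly 0.87 at $T=0.5$ to about 0.67 at $T=1$ and about 0.51 at $T=2$, while the ordering of the three tokens stays the same.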

This is the same algebra used for post-hoc temperature calibration, but it does a different job. Calibration fits $T$ on held-out outcomes to improve probability estimates. Decoding chooses $T$ to change generation behavior.

Top-k: fixed-width truncation

Top-k sampling says: keep only the k most probable tokens, renormalize their probabilities, and sample from that smaller set.

It is a practical idea. It throws away the long tail and says: the next token should come from the head of the distribution, but not necessarily the single top token.

The catch is that a fixed k is awkward across contexts. In some contexts the next-token distribution is flat across many reasonable options. In others most of the mass sits on one or two tokens. A constant k is therefore too narrow in the first case, where it cuts off reasonable options, and too wide in the second, where it readmits tail tokens.
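The rigidity is easy to see on two made-up distributions. In this sketch `topKFilter`, `keptMass`, `peaked`, and `flat` are names and numbers chosen for illustration.

```typescript
// Keep the k largest probabilities, zero the rest, and renormalize.
function topKFilter(probs: number[], k: number): number[] {
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const keep = new Set(order.slice(0, k));
  const kept = probs.map((p, i) => (keep.has(i) ? p : 0));
  const total = kept.reduce((a, b) => a + b, 0);
  return kept.map((p) => p / total);
}

// Probability mass that survives truncation, measured before renormalization.
function keptMass(probs: number[], k: number): number {
  return [...probs]
    .sort((a, b) => b - a)
    .slice(0, k)
    .reduce((a, b) => a + b, 0);
}

const peaked = [0.9, 0.05, 0.03, 0.01, 0.01]; // one dominant token
const flat = [0.2, 0.2, 0.2, 0.2, 0.2];       // five equally plausible tokens

console.log(keptMass(peaked, 2).toFixed(2)); // "0.95"
console.log(keptMass(flat, 2).toFixed(2));   // "0.40"
console.log(topKFilter(flat, 2));
```

With the same k = 2, the peaked case keeps 95% of the mass while the flat case discards 60% of it, even though every discarded token was as plausible as the two that survived.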

Top-p, or nucleus sampling

Top-p sampling, also called nucleus sampling, keeps the smallest set of tokens whose cumulative probability mass is at least p.¹

Formally, if $V^{(p)}$ is the smallest set such that

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \ge p,$$

then you renormalize over that set and sample from it.

The set size adjusts automatically. A concentrated distribution may cross the threshold with a few tokens; a diffuse one needs more. Calling that “confidence” would repeat the calibration mistake from the softmax article. Top-p only sees probability mass produced by the model.
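The adaptive set size can be checked directly. The function name `nucleus` and both distributions are assumptions made for this sketch; the thresholds are chosen well away from the cumulative sums so that floating-point rounding does not affect the counts.

```typescript
// Returns the indices of the smallest prefix (in descending probability order)
// whose cumulative probability reaches p.
function nucleus(probs: number[], p: number): number[] {
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const kept: number[] = [];
  let mass = 0;
  for (const i of order) {
    kept.push(i);
    mass += probs[i];
    if (mass >= p) break;
  }
  return kept;
}

const concentrated = [0.9, 0.05, 0.03, 0.01, 0.01];
const diffuse = [0.25, 0.25, 0.2, 0.2, 0.1];

console.log(nucleus(concentrated, 0.8).length); // 1
console.log(nucleus(diffuse, 0.8).length);      // 4
```

The same p = 0.8 keeps one token in the concentrated case and four in the diffuse one. The count reflects only the shape of the model's output, not whether that output is right.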

Token Decoding Playground

Interactive demo, shown here as a static snapshot. Prompt context: "The cat sat on the ?". Decoding parameters: T = 1.0, top-k off, top-p off. All 12 of 12 tokens are active; top mass 50%; effective 4.9.

Probability distribution: mat 50.0%, floor 16.6%, bed 12.3%, roof 7.5%, table 5.0%, chair 3.4%, sofa 2.3%, grass 1.4%, fence 0.8%, moon 0.5%, pizza 0.2%, cloud 0.1%.

In the interactive version, temperature reshapes the entire distribution, top-k keeps a fixed number of candidates, and top-p adapts the candidate count to the distribution's concentration. Sampling repeatedly fills a tape of recent draws: the bars show probabilities, and the tape makes the resulting behavior visible.
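The playground's default numbers can be pushed through the same filters by hand. The sketch below copies the twelve displayed percentages; because they are rounded they sum to slightly more than 100%, so the code renormalizes first. Reading "effective" as the exponential of the Shannon entropy is an assumption on my part (the page does not define it), although that reading does reproduce the displayed 4.9. `nucleusSize` is a hypothetical helper.

```typescript
// Displayed playground values at T = 1.0, already sorted from most to least probable.
const playground: [string, number][] = [
  ["mat", 0.5], ["floor", 0.166], ["bed", 0.123], ["roof", 0.075],
  ["table", 0.05], ["chair", 0.034], ["sofa", 0.023], ["grass", 0.014],
  ["fence", 0.008], ["moon", 0.005], ["pizza", 0.002], ["cloud", 0.001],
];

// The rounded percentages do not sum to exactly 1, so renormalize.
const rawTotal = playground.reduce((s, [, p]) => s + p, 0);
const pg = playground.map(([, p]) => p / rawTotal);

// Assumed definition of "effective": exp(Shannon entropy), i.e. perplexity.
const entropy = -pg.reduce((s, p) => s + p * Math.log(p), 0);
const effective = Math.exp(entropy);

// Number of tokens a top-p filter keeps, given probabilities sorted descending.
function nucleusSize(sorted: number[], p: number): number {
  let mass = 0;
  for (let i = 0; i < sorted.length; i++) {
    mass += sorted[i];
    if (mass >= p) return i + 1;
  }
  return sorted.length;
}

const top3Mass = pg.slice(0, 3).reduce((a, b) => a + b, 0);

console.log(effective.toFixed(1));  // "4.9"
console.log(nucleusSize(pg, 0.9));  // 5 (mat, floor, bed, roof, table)
console.log(top3Mass.toFixed(2));   // "0.79"
```

On this distribution top-p = 0.9 keeps five of the twelve tokens, while top-k = 3 keeps about 79% of the mass before renormalization.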

A tiny decoder

tiny-decoder.ts

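// Converts logits into probabilities at the given temperature.
function softmax(logits: number[], temperature: number): number[] {
  // The negated comparison also rejects NaN.
  if (!(temperature > 0)) {
    throw new Error("Temperature must be greater than zero");
  }

  const scaled = logits.map(z => z / temperature);
  // Subtract the max before exponentiating so Math.exp cannot overflow.
  // reduce avoids spreading a vocabulary-sized array into Math.max arguments.
  const max = scaled.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}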

// Keeps the k most probable tokens, zeroes the rest, and renormalizes.
function applyTopK(probs: number[], k: number): number[] {
  if (!Number.isInteger(k) || k < 1 || k > probs.length) {
    throw new Error("k must be a valid candidate count");
  }

  const indexed = probs.map((p, i) => ({ p, i }));
  indexed.sort((a, b) => b.p - a.p);
  const keep = new Set(indexed.slice(0, k).map(x => x.i));
  const filtered = probs.map((p, i) => keep.has(i) ? p : 0);
  const sum = filtered.reduce((a, b) => a + b, 0);
  return filtered.map(p => p / sum);
}

// Keeps the smallest high-probability set whose cumulative mass reaches p, then renormalizes.
function applyTopP(probs: number[], p: number): number[] {
  if (!(p > 0 && p <= 1)) {
    throw new Error("p must be in (0, 1]");
  }

  const indexed = probs.map((prob, i) => ({ prob, i }));
  indexed.sort((a, b) => b.prob - a.prob);

  let cumSum = 0;
  const keep = new Set<number>();
  for (const { prob, i } of indexed) {
    keep.add(i);
    cumSum += prob;
    if (cumSum >= p) break;
  }

  const filtered = probs.map((prob, i) => keep.has(i) ? prob : 0);
  const sum = filtered.reduce((a, b) => a + b, 0);
  return filtered.map(prob => prob / sum);
}

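// Draws one index by walking the cumulative distribution until it passes a uniform draw.
// The final return is a fallback for rounding that leaves the running sum just below 1.
function sampleFrom(probs: number[]): number {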
  const u = Math.random();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }
  return probs.length - 1;
}

The snippet applies temperature before truncation. Reversing the two steps, or combining top-k and top-p in a different order, changes the resulting distribution. Production decoders also apply repetition penalties, grammar masks, banned-token rules, and other logit processors, so the policy needs an order rather than a bag of knobs.
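A small experiment shows the order dependence. The sketch compares two pipelines on the same illustrative logits: temper then apply top-p, versus choose the nucleus at $T=1$ and then temper the survivors. All names here (`softmaxAt`, `nucleusMask`, `renormalize`) and the values of the logits, `T`, and `p` are assumptions chosen for the example.

```typescript
function softmaxAt(logits: number[], T: number): number[] {
  const scaled = logits.map((z) => z / T);
  const m = scaled.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = scaled.map((z) => Math.exp(z - m));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// true for tokens inside the smallest set whose cumulative probability reaches p.
function nucleusMask(probs: number[], p: number): boolean[] {
  const order = probs.map((_, i) => i).sort((a, b) => probs[b] - probs[a]);
  const mask = probs.map(() => false);
  let mass = 0;
  for (const i of order) {
    mask[i] = true;
    mass += probs[i];
    if (mass >= p) break;
  }
  return mask;
}

// Zero out masked-off tokens and rescale the rest to sum to 1.
function renormalize(probs: number[], mask: boolean[]): number[] {
  const kept = probs.map((q, i) => (mask[i] ? q : 0));
  const total = kept.reduce((a, b) => a + b, 0);
  return kept.map((q) => q / total);
}

const logits = [2, 1, 0, -1]; // illustrative values
const T = 2;
const p = 0.8;

// Order A: temperature first, then top-p on the flattened distribution.
const tempered = softmaxAt(logits, T);
const orderA = renormalize(tempered, nucleusMask(tempered, p));

// Order B: pick the nucleus at T = 1, then apply temperature to the survivors only.
const maskB = nucleusMask(softmaxAt(logits, 1), p);
const orderB = renormalize(softmaxAt(logits, T), maskB);

console.log(orderA.filter((q) => q > 0).length); // 3
console.log(orderB.filter((q) => q > 0).length); // 2
```

With these numbers, flattening first spreads the mass enough that a third token is needed to reach 0.8, while truncating first locks in two candidates before temperature has any say. Same knobs, different candidate sets.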

Training and decoding make different promises

There is a subtle distinction in the nucleus-sampling paper: maximum likelihood can train a useful language model, while using likelihood as the decoding objective through greedy or beam search can produce bland, repetitive text in open-ended generation.² Sampling from the full distribution, unreliable tail included, fails in a different way: the text can drift into incoherence.

Decoding becomes a balancing act: keep enough structure to stay coherent, keep enough randomness to avoid rigid repetition.

Decoding is downstream of the model and upstream of every visible token. Calling it “just post-processing” hides the only step that converts a distribution into one irreversible choice.