How Language Models Choose the Next Token
A language model can be brilliant and still be boring.
That sounds unfair, but it is one of the clearest lessons from decoding research. A model that always picks the highest-probability next token can sound rigid, bland, or repetitive. A model that samples too freely from the full distribution can wander into incoherence.
A language model is not writing a paragraph all at once. It generates one token at a time: take the text so far, compute a score for every possible next token, turn those scores into probabilities, choose one token, append it, and repeat.
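To make that loop concrete, here is a minimal sketch. Everything in it is hypothetical: `toyModel` is a made-up stand-in for a real network over a four-token vocabulary, and `toProbabilities`, `argmax`, and `generate` are illustrative names, not any library's API. The default chooser is greedy; a sampler could be passed in its place.

```typescript
// Hypothetical stand-in for a real model: maps the tokens so far
// to one logit (score) per vocabulary item.
type LogitFn = (tokens: number[]) => number[];

// Turn scores into probabilities (plain softmax, no temperature yet).
function toProbabilities(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Greedy choice: index of the largest probability.
function argmax(probs: number[]): number {
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return best;
}

// The loop itself: score, normalize, choose, append, repeat.
function generate(
  model: LogitFn,
  promptTokens: number[],
  steps: number,
  choose: (probs: number[]) => number = argmax
): number[] {
  const tokens = [...promptTokens];
  for (let s = 0; s < steps; s++) {
    const logits = model(tokens);          // a score for every possible next token
    const probs = toProbabilities(logits); // scores -> probabilities
    tokens.push(choose(probs));            // choose one token and append it
  }
  return tokens;
}

// Toy model: strongly prefers (last token + 1) mod 4.
const toyModel: LogitFn = tokens => {
  const last = tokens[tokens.length - 1];
  return [0, 1, 2, 3].map(i => (i === (last + 1) % 4 ? 5 : 0));
};

const generated = generate(toyModel, [0], 5);
console.log(generated); // [0, 1, 2, 3, 0, 1]
```

The only part that changes between decoding strategies is the `choose` argument, which is why that one step deserves so much attention.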
That “choose one token” step sounds small. It is not. It is where much of the model’s behavior shows up.
Temperature: the simplest knob
Suppose the model has logits $z_1, \dots, z_V$, one per vocabulary item. A temperature-scaled distribution is

$$p_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}$$

where $T > 0$ is the temperature. Lower $T$ sharpens the distribution. Higher $T$ flattens it.
If $T$ is very small, the model behaves greedily. If $T$ is large, the tail gets more chance to speak. Temperature reshapes the whole distribution but does not decide which tail tokens should be allowed in.
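A quick numeric check helps here. The sketch below applies three temperatures to the same made-up logits `[2, 1, 0]`; the function name `softmaxWithTemperature` and the logits are illustrative assumptions, and the printed values are rounded.

```typescript
// Temperature-scaled softmax; temperature must be strictly positive.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (!(temperature > 0)) throw new Error("temperature must be positive");
  const scaled = logits.map(z => z / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const exampleLogits = [2.0, 1.0, 0.0]; // made-up scores for three tokens

const sharpDist = softmaxWithTemperature(exampleLogits, 0.5);   // ≈ [0.87, 0.12, 0.02]
const neutralDist = softmaxWithTemperature(exampleLogits, 1.0); // ≈ [0.67, 0.24, 0.09]
const flatterDist = softmaxWithTemperature(exampleLogits, 2.0); // ≈ [0.51, 0.31, 0.19]

console.log(sharpDist, neutralDist, flatterDist);
```

Note that the ranking of the three tokens never changes. Temperature only changes how much probability the lower-ranked tokens receive, and every token keeps a nonzero chance.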
Top-k: fixed-width truncation
Top-k sampling says: keep only the k most probable tokens, renormalize their probabilities, and sample from that smaller set.
This is a practical idea. It throws away the long tail and says: the next token should come from the head of the distribution, but not necessarily the single top token.
The catch is that a fixed k is awkward across contexts. In some contexts the next-token distribution is flat across many reasonable options. In others most of the mass sits on one or two tokens. A constant k is therefore often too rigid.
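The rigidity is easy to see with numbers. In this sketch, the two distributions are invented for illustration and `topKMass` is a hypothetical helper that reports how much probability mass survives the cut before renormalization.

```typescript
// Total probability mass kept by top-k, before renormalization.
function topKMass(probs: number[], k: number): number {
  const sorted = [...probs].sort((a, b) => b - a); // highest probability first
  return sorted.slice(0, k).reduce((a, b) => a + b, 0);
}

// Made-up next-token distributions over a five-token vocabulary.
const peakedDist = [0.9, 0.05, 0.03, 0.01, 0.01]; // model is nearly certain
const evenDist = [0.2, 0.2, 0.2, 0.2, 0.2];       // five equally plausible options

const peakedKept = topKMass(peakedDist, 3); // about 0.98
const evenKept = topKMass(evenDist, 3);     // about 0.60

console.log(peakedKept, evenKept);
```

With the same k = 3, the peaked case still admits two tokens the model considers unlikely, while the even case discards about 40% of the mass, all of it on options just as plausible as the ones kept.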
Top-p, or nucleus sampling
Top-p sampling keeps the smallest set of tokens whose cumulative probability mass is at least p.
Formally, if $S$ is the smallest set of tokens, filled in order of decreasing probability, such that

$$\sum_{i \in S} p_i \ge p$$

then you renormalize over that set and sample from it.
The nice part: the set size adjusts automatically. If the model is very confident, the nucleus may be tiny. If the model is uncertain, the nucleus expands.
That is why top-p became such a standard knob in generation UIs. It matches the candidate set to the model’s own confidence profile.
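To see the set size adapt, the sketch below counts how many tokens land in the nucleus for two invented distributions at the same threshold. `nucleusSize` is a hypothetical helper written for this illustration.

```typescript
// Number of tokens in the smallest set whose cumulative mass reaches p,
// adding tokens in order of decreasing probability.
function nucleusSize(probs: number[], p: number): number {
  const sorted = [...probs].sort((a, b) => b - a);
  let cumulative = 0;
  let size = 0;
  for (const prob of sorted) {
    cumulative += prob;
    size++;
    if (cumulative >= p) break; // threshold reached: the nucleus is complete
  }
  return size;
}

// Made-up next-token distributions over a five-token vocabulary.
const confidentDist = [0.92, 0.04, 0.02, 0.01, 0.01];
const uncertainDist = [0.25, 0.2, 0.2, 0.2, 0.15];

const confidentSize = nucleusSize(confidentDist, 0.8); // 1
const uncertainSize = nucleusSize(uncertainDist, 0.8); // 4

console.log(confidentSize, uncertainSize);
```

The threshold is the same in both calls; only the shape of the distribution changes the number of candidates.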
Temperature reshapes the entire distribution, while top-k always trims to a fixed number of candidates and top-p adapts the candidate set to the model’s confidence. That single contrast explains a lot about why top-p feels more natural than top-k for open-ended generation.
A tiny decoder
// Convert logits to probabilities after dividing by the temperature.
// Assumes temperature > 0.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map(z => z / temperature);
  // Subtract the max before exponentiating for numerical stability.
  const max = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Keep the k most probable tokens, zero out the rest, renormalize.
// Assumes k >= 1.
function applyTopK(probs: number[], k: number): number[] {
  const indexed = probs.map((p, i) => ({ p, i }));
  indexed.sort((a, b) => b.p - a.p);
  const keep = new Set(indexed.slice(0, k).map(x => x.i));
  const filtered = probs.map((p, i) => keep.has(i) ? p : 0);
  const sum = filtered.reduce((a, b) => a + b, 0);
  return filtered.map(p => p / sum);
}

// Keep the smallest set of tokens whose cumulative mass reaches p,
// zero out the rest, renormalize.
function applyTopP(probs: number[], p: number): number[] {
  const indexed = probs.map((prob, i) => ({ prob, i }));
  indexed.sort((a, b) => b.prob - a.prob);

  let cumSum = 0;
  const keep = new Set<number>();
  for (const { prob, i } of indexed) {
    keep.add(i);
    cumSum += prob;
    if (cumSum >= p) break;
  }

  const filtered = probs.map((prob, i) => keep.has(i) ? prob : 0);
  const sum = filtered.reduce((a, b) => a + b, 0);
  return filtered.map(prob => prob / sum);
}

// Draw an index by walking the cumulative distribution.
function sampleFrom(probs: number[]): number {
  const u = Math.random(); // uniform in [0, 1)
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    // Strict comparison so a zero-probability token is never chosen when u is 0.
    if (u < acc) return i;
  }
  // Fallback for floating-point rounding when acc ends slightly below 1.
  return probs.length - 1;
}

The model gives scores. You decide how much randomness to preserve. Then you sample.
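Those three steps can also be folded into a single decode-step function. The sketch below is one possible arrangement, not a reference implementation: the order (temperature, then top-k, then top-p on the renormalized survivors) is an assumption, and `SamplingOptions`, `sampleNextToken`, and the injectable `random` parameter are hypothetical names introduced so the behavior can be checked deterministically.

```typescript
interface SamplingOptions {
  temperature: number; // must be > 0
  topK?: number;       // optional: keep at most this many tokens
  topP?: number;       // optional: keep the smallest set reaching this mass
}

function sampleNextToken(
  logits: number[],
  options: SamplingOptions,
  random: () => number = Math.random // injectable for testing
): number {
  // 1. Temperature-scaled softmax.
  const scaled = logits.map(z => z / options.temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map(z => Math.exp(z - max));
  const total = exps.reduce((a, b) => a + b, 0);

  // 2. Rank tokens from most to least probable.
  let candidates = exps
    .map((e, i) => ({ p: e / total, i }))
    .sort((a, b) => b.p - a.p);

  // 3. Top-k: fixed-width truncation (always keep at least one token).
  if (options.topK !== undefined) {
    candidates = candidates.slice(0, Math.max(1, options.topK));
  }

  // 4. Top-p: smallest prefix whose renormalized mass reaches topP.
  if (options.topP !== undefined) {
    const mass = candidates.reduce((a, c) => a + c.p, 0);
    let cumulative = 0;
    let cut = candidates.length;
    for (let n = 0; n < candidates.length; n++) {
      cumulative += candidates[n].p / mass;
      if (cumulative >= options.topP) { cut = n + 1; break; }
    }
    candidates = candidates.slice(0, cut);
  }

  // 5. Sample among the survivors, scaling u by their mass instead of renormalizing.
  const kept = candidates.reduce((a, c) => a + c.p, 0);
  const u = random() * kept;
  let acc = 0;
  for (const c of candidates) {
    acc += c.p;
    if (u < acc) return c.i;
  }
  return candidates[candidates.length - 1].i; // rounding fallback
}

// With made-up logits, a very high draw reaches the least likely token
// unless truncation removes it first.
const demoLogits = [1, 3, 2, 0];
const highDraw = () => 0.99;
console.log(sampleNextToken(demoLogits, { temperature: 1 }, highDraw));            // 3
console.log(sampleNextToken(demoLogits, { temperature: 1, topK: 2 }, highDraw));   // 2
console.log(sampleNextToken(demoLogits, { temperature: 1, topP: 0.5 }, highDraw)); // 1
```

Passing the random source as a parameter is purely a testing convenience; in normal use the default `Math.random` applies.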
Why these choices matter so much
Pure likelihood maximization tends to produce generic, repetitive language in open-ended generation. Pure unrestricted sampling from the full tail can go off the rails too.
Decoding becomes a balancing act: keep enough structure to stay coherent, keep enough randomness to avoid rigid repetition.
That is why sampling deserves its own chapter in any LLM explainer. It is not just post-processing. It directly shapes the generated text.
At small scale, this is just another weighted-sampling step. At the scale of a real LLM, it happens over tens of thousands of vocabulary items at every decode step. What if the probability problem is also a memory-traffic problem? That is the next post.