One Token, Zero Materialized Logits

One decode step computes a score for every vocabulary item and returns one integer. A conventional pipeline may write the full logits tensor to GPU memory, read it back for sampling, and discard every value except the winner.

FlashSampling asks a sharp systems question: if the output is one index, why materialize all the logits in high-bandwidth memory at all?¹

The hardware vocabulary

HBM means high-bandwidth memory, the GPU’s large off-chip memory. A tile is a chunk of the matrix small enough for a kernel to process with fast on-chip storage. The distinction matters because FlashSampling still reads model weights and activations; “zero materialized logits” does not mean zero HBM traffic.

The standard pipeline

The final projection of a language model, often called the LM head, is just a matrix multiply:

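$$Y = HW^\top \in \mathbb{R}^{B\times V},$$

where $H \in \mathbb{R}^{B\times d}$ is the batch of hidden states, $W \in \mathbb{R}^{V\times d}$ is the vocabulary-sized output matrix, $B$ is the batch size, $d$ is the hidden dimension, and $V$ is the vocabulary size. Each row of $Y$ holds one logit per vocabulary item.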

A conventional exact-sampling pipeline then does roughly this:

1. Compute logits.
2. Write them to HBM.
3. Read them back.
4. Apply temperature, masking, or other transforms.
5. Compute softmax.
6. Build a cumulative distribution.
7. Draw a uniform random number.
8. Search for the first cumulative mass that crosses it.

Mathematically, nothing is wrong with that. Systems-wise, it materializes a [B, V] intermediate in order to produce an output of shape [B].
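The steps above can be sketched in NumPy. This is an illustration only: the function name `conventional_sample` and its parameters are made up for this example, temperature is the only transform shown, and NumPy arrays stand in for GPU memory, so the `[B, V]` arrays here play the role of the materialized intermediate.

```python
import numpy as np

def conventional_sample(hidden, weight, temperature=1.0, rng=None):
    """Sample one token id per row via softmax plus an inverse-CDF search.

    hidden: [B, d] hidden states, weight: [V, d] LM-head matrix (assumed shapes).
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = hidden @ weight.T                     # [B, V]: the materialized intermediate
    scaled = logits / temperature                  # transform step
    scaled = scaled - scaled.max(axis=1, keepdims=True)  # stabilize exp
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax
    cdf = np.cumsum(probs, axis=1)                 # cumulative distribution
    u = rng.random(size=(hidden.shape[0], 1))      # one uniform draw per row
    # Counting entries with cdf <= u gives the first index whose mass exceeds u.
    idx = (cdf <= u).sum(axis=1)
    # Guard against round-off leaving the last cdf entry slightly below 1.
    return np.minimum(idx, weight.shape[0] - 1)

# Tiny example with near-deterministic rows.
hidden = np.array([[100.0, 0.0], [0.0, 100.0]])
weight = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, -1.0]])
sampled = conventional_sample(hidden, weight, rng=np.random.default_rng(0))
```

Every intermediate in this sketch (`logits`, `probs`, `cdf`) has shape `[B, V]`, while `sampled` has shape `[B]`.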

One avoidable round trip

On-chip storage is small and fast. HBM is large, but an avoidable round trip still consumes bandwidth and time.

If you only need one sampled index, why keep hauling the whole logits vector back and forth through slower off-chip memory?

Probabilities are cheap on paper and expensive in bandwidth.
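A rough back-of-the-envelope calculation shows the scale of that round trip. The batch size, vocabulary size, and fp32 logit width below are hypothetical values chosen for illustration, not figures from the paper, and the helper `logits_round_trip_bytes` counts only one write and one read of the logits tensor.

```python
def logits_round_trip_bytes(batch, vocab, bytes_per_logit=4):
    """Bytes written plus bytes read back for a materialized [B, V] logits tensor."""
    one_way = batch * vocab * bytes_per_logit
    return 2 * one_way

# Hypothetical sizes, chosen only for illustration.
batch, vocab = 64, 128_000
traffic = logits_round_trip_bytes(batch, vocab)   # fp32 logits, one write + one read
useful = batch * 4                                # one int32 index per row
print(f"logits round trip: {traffic / 1e6:.1f} MB, sampled indices: {useful} bytes")
# prints: logits round trip: 65.5 MB, sampled indices: 256 bytes
```

Separate softmax and cumulative-sum passes over the same tensor would add further reads and writes on top of this count. The weight matrix still has to be read either way, so this is traffic saved, not total traffic.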

The mathematical escape hatch

The key move is to stop thinking “softmax, then sample” and start thinking “argmax after the right random perturbation.”

This is the Gumbel-Max trick. If $\tilde{\ell}_i$ are transformed logits and $g_i$ are independent standard Gumbel random variables, which can be drawn as $g_i = -\log(-\log u_i)$ with $u_i$ uniform on $(0, 1)$, then

$$i^* = \arg\max_i (\tilde{\ell}_i + g_i)$$

is an exact sample from the categorical distribution with probabilities proportional to $e^{\tilde{\ell}_i}$.

Why does this work? Each perturbed score $\tilde{\ell}_i + g_i$ is a Gumbel variable shifted by $\tilde{\ell}_i$, and the probability that index $i$ holds the largest of them works out to $e^{\tilde{\ell}_i} / \sum_j e^{\tilde{\ell}_j}$, which is exactly the softmax probability. The engineering consequence is the important part: sampling becomes an argmax.
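A short derivation makes the identity concrete. It assumes standard Gumbel noise with CDF $F(g) = \exp(-e^{-g})$; the symbols $z_i$ and $S$ are introduced here only for the derivation. The last integral uses the substitution $t = e^{-x}$.

$$
\begin{aligned}
z_i &= \tilde{\ell}_i + g_i, \qquad \Pr(z_j \le x) = \exp\!\left(-e^{\tilde{\ell}_j} e^{-x}\right), \qquad S = \sum_j e^{\tilde{\ell}_j}, \\
\Pr(i^* = i) &= \int_{-\infty}^{\infty} e^{\tilde{\ell}_i} e^{-x} \exp\!\left(-e^{\tilde{\ell}_i} e^{-x}\right) \prod_{j \ne i} \exp\!\left(-e^{\tilde{\ell}_j} e^{-x}\right) dx \\
&= e^{\tilde{\ell}_i} \int_{-\infty}^{\infty} e^{-x} \exp\!\left(-S e^{-x}\right) dx
 = e^{\tilde{\ell}_i} \int_{0}^{\infty} e^{-S t}\, dt
 = \frac{e^{\tilde{\ell}_i}}{S}.
\end{aligned}
$$

The first factor under the integral is the density of $z_i$ at $x$, and the product is the probability that every other perturbed score falls below $x$.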

With that change, sampling is no longer a normalization problem. It is a reduction problem: find the maximum perturbed score and return its index.
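One way to see where temperature and masking fit is to apply them before the noise and then compare empirical frequencies against softmax. The sketch below is a sanity check under assumptions: the function name, the boolean-mask convention (`True` means allowed), and the example values are all invented for illustration.

```python
import numpy as np

def gumbel_max_sample(logits, temperature=1.0, mask=None, rng=None):
    """Draw one index by argmax over transformed logits plus Gumbel noise."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(logits, dtype=np.float64) / temperature
    if mask is not None:
        # Disallowed tokens get -inf, so they can never win the argmax.
        scores = np.where(mask, scores, -np.inf)
    return int(np.argmax(scores + rng.gumbel(size=scores.shape)))

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0, -1.0])
mask = np.array([True, True, False, True])
draws = [gumbel_max_sample(logits, temperature=0.7, mask=mask, rng=rng)
         for _ in range(20000)]
freq = np.bincount(draws, minlength=len(logits)) / len(draws)
target = softmax(np.where(mask, logits / 0.7, -np.inf))
print(np.round(freq, 3), np.round(target, 3))
```

The empirical frequencies should sit close to the softmax of the transformed logits, with the masked index never drawn, even though no normalization was ever computed.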

Why tiling works

Once sampling becomes “find the max,” tiling becomes natural.

FlashSampling fuses this idea into the LM-head matrix multiplication. It processes one vocabulary tile at a time, computes logits on chip, adds Gumbel noise, and keeps only the best candidate from that tile. A small second-stage reduction chooses among tile winners.
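The two-stage structure described above can be mimicked in NumPy. This is a structural sketch, not the paper's kernel: the function name and arguments are hypothetical, and the noise is passed in as a full `[B, V]` array purely so the result can be compared against an untiled reference. A fused kernel would generate the noise inside each tile instead.

```python
import numpy as np

def fused_tiled_sample(hidden, weight, noise, tile_size):
    """Gumbel-Max over vocabulary tiles; only a [B, tile] slab of logits exists at once.

    hidden: [B, d], weight: [V, d], noise: [B, V] Gumbel noise (assumed shapes).
    """
    batch, vocab = hidden.shape[0], weight.shape[0]
    n_tiles = -(-vocab // tile_size)               # ceiling division
    tile_best = np.empty((batch, n_tiles))
    tile_arg = np.empty((batch, n_tiles), dtype=np.int64)
    rows = np.arange(batch)

    # Stage 1: per-tile logits, perturbation, and tile-local winner.
    for t, start in enumerate(range(0, vocab, tile_size)):
        end = min(start + tile_size, vocab)
        scores = hidden @ weight[start:end].T + noise[:, start:end]  # [B, tile]
        local = np.argmax(scores, axis=1)
        tile_best[:, t] = scores[rows, local]
        tile_arg[:, t] = start + local             # store global vocabulary index

    # Stage 2: small reduction over the tile winners.
    winner = np.argmax(tile_best, axis=1)
    return tile_arg[rows, winner]

rng = np.random.default_rng(0)
hidden = rng.standard_normal((3, 8))
weight = rng.standard_normal((50, 8))
noise = rng.gumbel(size=(3, 50))
tiled = fused_tiled_sample(hidden, weight, noise, tile_size=16)
reference = np.argmax(hidden @ weight.T + noise, axis=1)
```

With the same noise, `tiled` and `reference` pick the same indices. The only per-tile state carried into stage 2 is one score and one index per row.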

This is exact, not approximate. Maxima decompose over partitions: if a global winner exists, it must also be the winner of its own tile. So a “max of tile-maxima” gives the same answer as a max over the whole vocabulary.
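Written out, with $s_i = \tilde{\ell}_i + g_i$ for the perturbed scores and $T_1, \dots, T_m$ for disjoint tiles covering the vocabulary (notation introduced here for illustration):

$$
\max_{i \in \{1, \dots, V\}} s_i \;=\; \max_{t \in \{1, \dots, m\}} \Big( \max_{i \in T_t} s_i \Big).
$$

With continuous noise, exact ties have probability zero, so the maximizing index is unique almost surely. In finite precision ties can occur, so an implementation needs one fixed tie-breaking rule applied at both stages.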

The paper reports no full-logits materialization in HBM and up to 19% lower decoding time in its evaluated settings. “Up to” matters: the gain depends on model shape, batch, hardware, and how much of end-to-end decoding this kernel dominates.

Where decomposition stops

Exact categorical sampling and top-k fit tiled reduction because maxima and top-k candidates compose across disjoint tiles. Nucleus sampling does not. Top-p needs globally ordered cumulative probability mass, so an arbitrary tile cannot decide by itself whether a token belongs to the nucleus.
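The top-k half of that claim can be checked directly: every member of the global top-k must be in the top-k of its own tile, so merging per-tile candidates loses nothing. The helper below is a hypothetical illustration for a 1-D score vector, not code from the paper.

```python
import numpy as np

def tiled_top_k(scores, k, tile_size):
    """Global top-k indices, best first, built from per-tile top-k candidates."""
    scores = np.asarray(scores, dtype=np.float64)
    candidates = []
    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        keep = min(k, len(tile))
        # Tile-local top-k (unordered), shifted to global indices.
        local = np.argpartition(tile, -keep)[-keep:]
        candidates.append(start + local)
    candidates = np.concatenate(candidates)
    # Second stage: reduce the small candidate set to the final k.
    order = np.argsort(scores[candidates])[::-1][:k]
    return candidates[order]

rng = np.random.default_rng(1)
scores = rng.standard_normal(37)
merged = tiled_top_k(scores, k=5, tile_size=8)
expected = np.argsort(scores)[::-1][:5]
```

No comparable tile-local rule exists for top-p, because the cutoff depends on the normalizer and ordering of the entire vocabulary.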

The paper discusses applying top-p after a reduced top-k candidate set. That can be practical, but it is a sequential top-k-then-top-p policy, not exact top-p over the untouched full vocabulary. The optimization is impressive precisely because it preserves a stated distribution; changing that statement would make the benchmark easier and the result less interesting.
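To make the difference in policy visible, here is a minimal sketch of a sequential top-k-then-top-p sampler. It is not taken from the paper; the function name and parameters are assumptions. The point to notice is that the cumulative mass is measured over probabilities renormalized across the k candidates, not across the full vocabulary.

```python
import numpy as np

def top_k_then_top_p_sample(logits, k, p, rng=None):
    """Keep the top-k logits, cut to the smallest prefix with mass >= p, then sample."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=np.float64)
    k = min(k, len(logits))
    cand = np.argsort(logits)[::-1][:k]            # top-k candidates, best first
    probs = np.exp(logits[cand] - logits[cand[0]])
    probs /= probs.sum()                           # renormalized over candidates only
    cum = np.cumsum(probs)
    # First position where cumulative mass reaches p, converted to a prefix length.
    keep = min(int(np.searchsorted(cum, p)) + 1, k)
    nucleus = cand[:keep]
    # Gumbel-Max over the surviving candidates.
    scores = logits[nucleus] + rng.gumbel(size=keep)
    return int(nucleus[np.argmax(scores)])

rng = np.random.default_rng(0)
logits = [5.0, 4.0, 0.0, -1.0, -2.0]
wide = {top_k_then_top_p_sample(logits, k=3, p=0.9, rng=rng) for _ in range(200)}
narrow = {top_k_then_top_p_sample(logits, k=3, p=0.5, rng=rng) for _ in range(200)}
```

In this example the first setting keeps two tokens and the second keeps one. Exact top-p over the full vocabulary could keep a different set whenever the tokens outside the top-k carry non-negligible mass.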

A toy version

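import numpy as np

def gumbel(shape):
    # Standard Gumbel(0, 1) noise via the inverse CDF: -log(-log(u)).
    u = np.random.rand(*shape)
    eps = np.finfo(np.float64).eps
    # Keep u strictly inside (0, 1) so neither logarithm returns an infinity.
    u = np.clip(u, eps, 1.0 - eps)
    return -np.log(-np.log(u))

def gumbel_max_sample(logits):
    # Reference version: perturb every logit, then take one global argmax.
    scores = np.array(logits, dtype=np.float64)
    scores += gumbel(scores.shape)
    return int(np.argmax(scores))

def tiled_gumbel_max_sample(logits, tile_size=4):
    # Tiled version for a 1-D logits vector: same distribution, one tile at a time.
    logits = np.array(logits, dtype=np.float64)
    best_score = -np.inf
    best_index = -1

    for start in range(0, len(logits), tile_size):
        end = min(start + tile_size, len(logits))
        tile = logits[start:end]
        # Each logit receives exactly one independent noise draw.
        scores = tile + gumbel(tile.shape)

        # Tile-local winner.
        local_index = int(np.argmax(scores))
        local_score = float(scores[local_index])

        # Running reduction over tile winners; store the global index.
        if local_score > best_score:
            best_score = local_score
            best_index = start + local_index

    return best_index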

This NumPy code is neither fast nor fused, but it preserves the pathwise identity: perturb every logit once, keep one winner per tile, then reduce the winners. Clipping u away from 0 and 1 avoids infinities in the double logarithm; production kernels use counter-based random-number generation and stricter numerical choices.
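Counter-based generation matters because tiles may run in any order, or in parallel, and each one must still see the same noise. The sketch below uses NumPy's Philox bit generator as a stand-in for whatever a production kernel uses. Packing a seed and a tile id into the Philox key is an assumption made for this illustration, and the helper names are hypothetical.

```python
import numpy as np

def tile_gumbel(seed, tile_id, size):
    """Gumbel noise for one tile, determined only by (seed, tile_id)."""
    # Assumes seed and tile_id each fit in 64 bits; together they form the 128-bit key.
    key = (tile_id << 64) | seed
    gen = np.random.Generator(np.random.Philox(key=key))
    return gen.gumbel(size=size)

def sample_with_tile_order(logits, tile_size, seed, order):
    """Tiled Gumbel-Max that visits tiles in an arbitrary order."""
    logits = np.asarray(logits, dtype=np.float64)
    best_score, best_index = -np.inf, -1
    for tile_id in order:
        start = tile_id * tile_size
        tile = logits[start:start + tile_size]
        scores = tile + tile_gumbel(seed, tile_id, tile.shape[0])
        local = int(np.argmax(scores))
        if scores[local] > best_score:
            best_score, best_index = float(scores[local]), start + local
    return best_index

logits = np.linspace(-1.0, 1.0, 10)
forward = sample_with_tile_order(logits, 4, seed=123, order=[0, 1, 2])
backward = sample_with_tile_order(logits, 4, seed=123, order=[2, 1, 0])
```

Because each tile's noise depends only on its key, `forward` and `backward` agree. A stateful generator shared across tiles would give each tile different noise depending on visiting order.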

The Gumbel-Max identity changes a probability-normalization pipeline into an argmax reduction. The hardware optimization follows from that mathematical rewrite; the benchmark follows only after the distribution stays exact.

The sample was always one integer. The engineering problem was everything required to make that integer honest.