Before a Token Is Chosen, Context Has to Move

Sampling cannot recover information the model failed to put into its logits. Before one token is chosen, the network must move information from earlier positions into the representation used for that choice.

Attention already existed in recurrent encoder-decoder models. The 2017 Transformer went further: it removed recurrence and convolution from the main sequence model and built the encoder and decoder around attention instead.¹ Modern decoder-only language models keep the causal self-attention and drop the original paper’s separate encoder and cross-attention path.

That does not mean attention is the whole model. Embeddings, feed-forward blocks, residual connections, normalization, training data, and scale all matter. But attention is the part that makes context movable: each token can mix information from other tokens before the model scores what should come next.

Queries, keys, and values

The basic attention mechanism is usually written as

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$

The formula is simpler than it looks.

The dot products inside $QK^\top$ produce learned compatibility scores between positions. Softmax turns each row into mixing weights, and multiplication by $V$ combines value vectors into a new representation.

A useful mental model is:

A query is what this token is looking for.
A key is what each token offers.
A value is the information each token can contribute.

If a query and a key have a large dot product, the value paired with that key receives a larger mixing coefficient in this head.
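The query, key, and value roles can be traced by hand for a single query. The sketch below is only an illustration: the vectors are hypothetical, picked so the dot products are easy to read, and the values are one-hot so the mixed output shows the weights directly.

```python
import numpy as np

# Hypothetical toy vectors, chosen by hand for illustration only.
q = np.array([0.0, 2.0])             # what this token is looking for
K = np.array([[2.0, 0.0],            # key 0: orthogonal to q
              [0.0, 2.0],            # key 1: aligned with q
              [1.0, 1.0]])           # key 2: partially aligned with q
V = np.array([[1.0, 0.0, 0.0],       # one-hot values, so the mixture
              [0.0, 1.0, 0.0],       # equals the attention weights
              [0.0, 0.0, 1.0]])

scores = K @ q / np.sqrt(q.shape[-1])    # one scaled dot product per key
weights = np.exp(scores - scores.max())  # softmax over positions,
weights = weights / weights.sum()        # shifted by the max for stability
mixed = weights @ V                      # weighted sum of value vectors

print("scores: ", scores)
print("weights:", weights)
print("mixed:  ", mixed)
```

Key 1 has the largest dot product with the query, so its value dominates the mixture, while keys 0 and 2 still contribute smaller shares.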

Two softmaxes, two different jobs

The softmax inside attention normalizes over token positions. In ordinary inference, the model does not sample one position and discard the rest; it computes a weighted mixture of value vectors.

The output softmax arrives later and normalizes over vocabulary items. That distribution may be sampled to choose one token. The same mathematical function appears twice, but the axes and semantics differ: one mixes context, the other defines candidate-token mass.
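The two roles can be put side by side in a few lines. This is a sketch under stated assumptions: the scores and logits are random numbers rather than outputs of a trained model, and the five-word vocabulary is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Job 1: the attention softmax normalizes over token positions.
n_positions, d = 4, 3
scores = rng.normal(size=(n_positions,))     # one query scored against 4 positions
values = rng.normal(size=(n_positions, d))
attn_weights = softmax(scores)               # one weight per position
context = attn_weights @ values              # every position contributes; nothing is sampled

# Job 2: the output softmax normalizes over vocabulary items.
vocab = ["the", "cat", "sat", "mat", "."]    # hypothetical vocabulary
logits = rng.normal(size=(len(vocab),))      # stand-in for real model logits
token_probs = softmax(logits)                # one probability per vocabulary item
next_id = rng.choice(len(vocab), p=token_probs)  # exactly one item is chosen

print("context vector:", context)
print("sampled token: ", vocab[next_id])
```

The first softmax feeds a weighted sum and never picks a winner. Only the second produces a distribution that is sampled from.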

Multi-head attention

The model does not do this only once. Transformers use multi-head attention: several attention mechanisms run in parallel over the same tokens, each with its own learned query, key, and value projections.

Because the projections differ, the layer can form several mixtures at once; the per-head outputs are then concatenated and projected back to the model width. Some heads exhibit patterns humans can name; many do not. Attention weights are computational coefficients, not a guaranteed causal explanation of the model’s decision.
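A minimal sketch of the multi-head idea follows. It makes several assumptions: random matrices stand in for learned projections, the names `multi_head_self_attention`, `Wq`, `Wk`, `Wv`, and `Wo` are hypothetical, and no mask, bias, or normalization is applied.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """X: (n, d_model). Wq, Wk, Wv: (h, d_model, d_head). Wo: (h * d_head, d_model)."""
    n = X.shape[0]
    # Project the same tokens once per head.
    Q = np.einsum("nd,hde->hne", X, Wq)   # (h, n, d_head)
    K = np.einsum("nd,hde->hne", X, Wk)
    V = np.einsum("nd,hde->hne", X, Wv)
    d_head = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_head)  # (h, n, n)
    weights = softmax(scores)             # each head gets its own mixing table
    per_head = weights @ V                # (h, n, d_head)
    # Concatenate the heads along the feature axis, then mix them with Wo.
    concat = np.swapaxes(per_head, 0, 1).reshape(n, -1)    # (n, h * d_head)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
n, d_model, h, d_head = 3, 8, 2, 4
X = rng.normal(size=(n, d_model))
# Random stand-ins for learned projection matrices (an assumption of this sketch).
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(h * d_head, d_model))

out, weights = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
print("output shape:", out.shape)
print("head 0 weights:\n", weights[0])
print("head 1 weights:\n", weights[1])
```

The two printed weight tables differ even though both heads see the same three tokens, because each head scores them through its own projections.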

The decoder is masked

This is the part that matters for language generation.

In an autoregressive language model, when predicting the token at position i, the model is allowed to look backward but not forward. That is what the causal mask enforces.

Without the mask, the model could peek at the answer. With the mask, each position can build its representation only from itself and earlier positions. That is the bridge from “general attention machinery” to “next-token prediction.”
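One way to see what the mask guarantees is to change a later position and confirm that earlier outputs do not move. The sketch below assumes a single head with random inputs; `causal_attention` is a hypothetical helper, and the mask lets each row attend to itself and to earlier rows.

```python
import numpy as np

def causal_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    allowed = np.tril(np.ones((n, n), dtype=bool))  # row i may see columns 0..i
    scores = np.where(allowed, scores, -np.inf)     # blocked entries get weight 0
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 4, 3
print(np.tril(np.ones((n, n), dtype=int)))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = causal_attention(Q, K, V)

# Change only the final position's key and value.
K2, V2 = K.copy(), V.copy()
K2[-1] += 10.0
V2[-1] -= 10.0
out2 = causal_attention(Q, K2, V2)

print("earlier rows unchanged:", np.allclose(out[:-1], out2[:-1]))
print("last row changed:      ", not np.allclose(out[-1], out2[-1]))
```

Rows before the edited position are identical in both runs because their weights on the final column are exactly zero.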

A tiny toy implementation

import numpy as np

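def softmax(x):
    # Subtract the row maximum first so np.exp cannot overflow.
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    # Q, K: (n, d_k); V: (n, d_v). Returns the mixed values and the weights.
    dk = Q.shape[-1]
    # Scaled dot-product scores: row i holds query i against every key.
    scores = Q @ K.T / np.sqrt(dk)

    if causal:
        # True strictly above the diagonal, where key j comes after query i.
        future = np.triu(
            np.ones(scores.shape, dtype=bool), k=1
        )
        # -inf becomes weight 0 after softmax, so future positions add nothing.
        scores = np.where(future, -np.inf, scores)

    # Each row of weights sums to 1 over the allowed positions.
    weights = softmax(scores)
    return weights @ V, weights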

# 3 tokens, hidden size 2
Q = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0]])

K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

output, weights = attention(Q, K, V, causal=True)

print("attention weights:")
print(weights)
print("output:")
print(output)

The weights form a small table of how each position mixes its own and earlier value vectors. The first row is always [1, 0, 0], because position 0 can see only itself. This is not a full transformer: it omits learned projections, multiple heads, MLPs, residual connections, normalization, positional information, batching, and training. It includes the causal mask because without it the generation example could look into the future.

Transformers as context builders

One useful way to think about transformers is as context builders rather than giant inscrutable text machines.

Each layer creates new token representations by applying attention mixing and position-wise computation to the previous layer’s output. Later layers repeat the process over already contextualized vectors.

For language models, all of that eventually feeds into a score for every candidate next token.

During autoregressive inference, implementations usually cache earlier keys and values instead of recomputing the entire prefix at every step. The new position still passes through every layer, then the final hidden state is projected into one logit per vocabulary item.
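The caching idea can be sketched for a single attention head. Everything here is an assumption made for illustration: the inputs are random, `decode_step` and the dictionary-based `cache` are hypothetical, and a real implementation keeps one cache per layer and per head. The check at the end compares step-by-step decoding against recomputing the whole prefix with a causal mask.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def full_causal_attention(Q, K, V):
    """Recompute attention for every position at once, with a causal mask."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ V

def decode_step(q_new, k_new, v_new, cache):
    """Attend from one new position over the cached keys and values plus its own."""
    cache["K"].append(k_new)
    cache["V"].append(v_new)
    K = np.stack(cache["K"])            # (t + 1, d): earlier keys are reused as stored
    V = np.stack(cache["V"])
    scores = K @ q_new / np.sqrt(q_new.shape[-1])
    # No mask is needed: the cache never contains future positions.
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

cache = {"K": [], "V": []}
incremental = np.stack([decode_step(Q[t], K[t], V[t], cache) for t in range(n)])
full = full_causal_attention(Q, K, V)

print("cached decoding matches full recompute:", np.allclose(incremental, full))
```

Each step computes only one new row of scores against the stored keys, which is why caching saves work as the prefix grows.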

Those logits contain the model’s work so far. They are still not text.