What Transformers Really Do
Before transformers, sequence models had a habit of squeezing language through a narrow pipe.
A recurrent model would read tokens one by one, carry a fixed-size hidden state forward, and hope that the state still held the important parts of the sentence by the time it reached the end.
The transformer changed that by making attention the main event. At a high level, each token asks a simple question:
Which other tokens matter for me right now?
That is much less mystical than the usual hype makes it sound.
Queries, keys, and values
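The basic attention mechanism is usually written as

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

The formula is simpler than it looks.
The dot products in $QK^\top$ measure how relevant other tokens are to the current token, and dividing by $\sqrt{d_k}$ (the key dimension) keeps those scores in a reasonable range. The softmax turns the relevance scores into weights that sum to 1. The multiplication by $V$ mixes information from the relevant tokens into a new representation.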
A useful mental model is:
- A query is what this token is looking for.
- A key is what each token offers.
- A value is the information each token can contribute.
If a query matches a key strongly, that token gets more say in the output.
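To make the query/key/value picture concrete, here is a minimal sketch for a single token. The vectors are hypothetical toy values picked so the matching is easy to see; they are not taken from any trained model.

```python
import numpy as np

# Hypothetical toy vectors, chosen only to make the matching visible.
query = np.array([1.0, 0.0])        # what this token is looking for
keys = np.array([[1.0, 0.0],        # token 0: points the same way as the query
                 [0.0, 1.0],        # token 1: orthogonal to the query
                 [0.5, 0.5]])       # token 2: partial match
values = np.array([[10.0, 0.0],     # what each token can contribute
                   [0.0, 10.0],
                   [5.0, 5.0]])

# One relevance score per token: dot product of the query with each key.
scores = keys @ query / np.sqrt(query.shape[-1])

# Softmax turns the scores into positive weights that sum to 1.
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

# The output is a weighted mix of the values.
output = weights @ values

print("weights:", weights)
print("output:", output)
```

Token 0 gets the largest weight because its key lines up with the query, and token 1 gets the smallest because its key is orthogonal to it.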
Multi-head attention
The model does not do this only once. Transformers use multi-head attention: several attention mechanisms run in parallel, each looking at the same tokens through a slightly different learned projection.
That means one head can focus on local syntax, another on long-distance reference, another on something more lexical. The point is not that each head gets a clean human-readable job. The point is that the model gets several different ways to compare tokens in context.
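As a rough sketch of what "several attention mechanisms in parallel" can look like in code, the function below gives each head its own projection matrices, runs attention per head, then concatenates the results and mixes them with an output projection. The names `multi_head_attention`, `Wq`, `Wk`, `Wv`, and `Wo` are illustrative assumptions, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (tokens, d_model). Wq/Wk/Wv: (heads, d_model, d_head). Wo: (heads * d_head, d_model)."""
    head_outputs = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        # Each head sees the same tokens through its own projections.
        Q, K, V = X @ wq, X @ wk, X @ wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        head_outputs.append(softmax(scores) @ V)      # (tokens, d_head)
    concat = np.concatenate(head_outputs, axis=-1)    # (tokens, heads * d_head)
    return concat @ Wo                                # back to (tokens, d_model)

# Random stand-ins for learned parameters (assumption for illustration).
rng = np.random.default_rng(0)
tokens, d_model, heads, d_head = 3, 4, 2, 2
X = rng.normal(size=(tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(heads * d_head, d_model))

out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (3, 4)
```

The loop over heads is written for readability; real implementations typically batch the heads into a single tensor operation.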
The decoder is masked
This is the part that matters for language generation.
In an autoregressive language model, when predicting the token at position i, the model is allowed to look backward but not forward. That is what the causal mask enforces.
Without the mask, the model could peek at the answer. With the mask, each position can only build its representation using earlier positions. That is the bridge from “general attention machinery” to “next-token prediction.”
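One common way to implement the causal mask, sketched here under the assumption of a single head and no batching, is to set the scores for future positions to negative infinity before the softmax so they receive exactly zero weight. The function name `causal_attention` is a hypothetical helper for this post.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True above the diagonal: position i must not see positions j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)   # exp(-inf) = 0, so future positions get zero weight
    return weights @ V, weights

# 3 tokens, hidden size 2
Q = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

output, weights = causal_attention(Q, K, V)
print(weights)
```

The printed weight matrix is lower triangular. The first token can only attend to itself, so its output is just its own value vector.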
A tiny toy implementation
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V):
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    weights = softmax(scores)
    return weights @ V, weights

# 3 tokens, hidden size 2
Q = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

output, weights = attention(Q, K, V)

print("attention weights:")
print(weights)
print("output:")
print(output)

If you print the weights, you get a little table saying who listened to whom. That is a transformer in miniature.
Transformers as context builders
One useful way to think about transformers is as context builders rather than giant inscrutable text machines.
Each layer updates each token representation by asking: what in the current context should influence me? Then later layers do the same thing again, but using richer token representations.
For language models, all of that eventually feeds into a score over the next token.
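The layer-by-layer picture can be sketched as a loop: each layer re-runs masked attention over the previous layer's output, and the last token's final representation is projected to one score per vocabulary entry. This is a deliberately stripped-down sketch. The residual addition is an assumption based on common practice, the feed-forward blocks and normalization that real transformer layers include are omitted, and `attention_layer`, `W_vocab`, and all sizes are hypothetical with random weights.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n = X.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)    # causal mask
    # Residual addition (assumed): update the representation rather than replace it.
    return X + softmax(scores) @ V

# Random stand-ins for learned parameters (assumption for illustration).
rng = np.random.default_rng(0)
tokens, d_model, vocab_size, n_layers = 4, 8, 10, 2
X = rng.normal(size=(tokens, d_model))
layers = [
    tuple(rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
    for _ in range(n_layers)
]
W_vocab = rng.normal(size=(d_model, vocab_size))

# Later layers do the same thing again, on richer representations.
for Wq, Wk, Wv in layers:
    X = attention_layer(X, Wq, Wk, Wv)

# The last position's representation becomes a score per vocabulary entry.
logits = X[-1] @ W_vocab
next_token_probs = softmax(logits)
print(next_token_probs.shape, next_token_probs.sum())
```

With random weights the resulting distribution is meaningless, but the shapes show the flow: token representations go in, and one probability per vocabulary entry comes out.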
Once the model has built context-aware representations, it still has to do one very concrete thing: produce a distribution over the next token, and then choose from it. That sounds like a small final step. It is not. A lot of the model’s personality lives there. That is the next post.