
What Transformers Really Do


Before transformers, sequence models had a habit of squeezing language through a narrow pipe.

A recurrent model would read tokens one by one, keep some hidden state, and you would hope that it could still remember the important parts of the sentence by the time it reached the end.

The transformer changed that by making attention the main event. At a high level, each token asks a simple question:

Which other tokens matter for me right now?

That is much less mystical than the usual hype makes it sound.

Queries, keys, and values

The basic attention mechanism is usually written as

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$

The formula is simpler than it looks.

The dot products inside $QK^\top$ measure how relevant other tokens are to the current token. The softmax turns those relevance scores into weights. The multiplication by $V$ mixes information from the relevant tokens into a new representation.

A useful mental model is:

  • A query is what this token is looking for.
  • A key is what each token offers.
  • A value is the information each token can contribute.

If a query matches a key strongly, that token gets more say in the output.

Multi-head attention

The model does not do this only once. Transformers use multi-head attention: several attention mechanisms run in parallel, each looking at the same tokens through a slightly different learned projection.

That means one head can focus on local syntax, another on long-distance reference, another on something more lexical. The point is not that each head gets a clean human-readable job. The point is that the model gets several different ways to compare tokens in context.
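To make the "several projections in parallel" idea concrete, here is a minimal NumPy sketch. The function name `multi_head_attention`, the weight names `Wq`, `Wk`, `Wv`, `Wo`, and the random weights are all assumptions made for illustration; a trained model would learn these matrices, and real implementations usually batch the heads rather than loop over them.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # X: (n_tokens, d_model)
    # Wq, Wk, Wv: (n_heads, d_model, d_head), one projection per head
    # Wo: (n_heads * d_head, d_model), mixes the heads back together
    head_outputs, head_weights = [], []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        # Each head sees the same tokens through its own projections.
        Q, K, V = X @ Wq_h, X @ Wk_h, X @ Wv_h
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        head_outputs.append(weights @ V)
        head_weights.append(weights)
    # Concatenate the per-head outputs and project back to d_model.
    concat = np.concatenate(head_outputs, axis=-1)
    return concat @ Wo, np.stack(head_weights)

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_head = 3, 4, 2, 2
X = rng.normal(size=(n_tokens, d_model))
Wq = rng.normal(size=(n_heads, d_model, d_head))
Wk = rng.normal(size=(n_heads, d_model, d_head))
Wv = rng.normal(size=(n_heads, d_model, d_head))
Wo = rng.normal(size=(n_heads * d_head, d_model))

mha_output, mha_weights = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(mha_output.shape)   # one d_model-sized vector per token
print(mha_weights.shape)  # one "who listened to whom" table per head
```

Because each head has its own projections, the two weight tables generally differ even though both heads look at the same three tokens.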

The decoder is masked

This is the part that matters for language generation.

In an autoregressive language model, when predicting the token at position i, the model is allowed to look backward but not forward. That is what the causal mask enforces.

Without the mask, the model could peek at the answer. With the mask, each position can only build its representation using itself and earlier positions. That is the bridge from “general attention machinery” to “next-token prediction.”
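One common way to apply a causal mask is to set the scores for future positions to negative infinity before the softmax, so their weights come out as exactly zero. The sketch below assumes that approach; the function name `causal_attention` and the small example matrices are illustrative choices, not taken from any particular library.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True strictly above the diagonal: position i may not see position j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    # exp(-inf) is 0, so masked positions get zero weight after the softmax.
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

# 3 tokens, hidden size 2 (illustrative values)
Q = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

causal_output, causal_weights = causal_attention(Q, K, V)
print(causal_weights)  # lower-triangular: zeros above the diagonal
print(causal_output)
```

The first token can only attend to itself, so its weight row is `[1, 0, 0]` and its output is simply the first row of `V`.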

A tiny toy implementation

import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V):
    dk = Q.shape[-1]
    # Row i holds the scaled relevance score of every key for query i.
    scores = Q @ K.T / np.sqrt(dk)
    # Softmax over each row, so every row of weights sums to 1.
    weights = softmax(scores)
    # Each output row is a weighted mix of the value rows.
    return weights @ V, weights

# 3 tokens, hidden size 2
Q = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

output, weights = attention(Q, K, V)
print("attention weights:")
print(weights)
print("output:")
print(output)

If you print the weights, you get a little table saying who listened to whom: row i holds the weights that token i placed on every token, and each row sums to 1. That is the core of a transformer in miniature.

Transformers as context builders

One useful way to think about transformers is as context builders rather than giant inscrutable text machines.

Each layer updates each token representation by asking: what in the current context should influence me? Then later layers do the same thing again, but using richer token representations.
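The "same thing again, on richer representations" idea can be sketched as a loop over attention layers. This is a deliberately stripped-down sketch: the additive (residual) update is an assumption borrowed from standard transformer designs, the weights are random rather than learned, and the feed-forward blocks and normalization found in real layers are left out. All names here are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def self_attention_layer(X, Wq, Wk, Wv):
    # Queries, keys, and values are all projections of the same tokens.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 3, 4, 2
X = rng.normal(size=(n_tokens, d_model))
layers = [
    tuple(rng.normal(size=(d_model, d_model)) * 0.5 for _ in range(3))
    for _ in range(n_layers)
]

H = X
for Wq, Wk, Wv in layers:
    # Assumed residual update: keep the old representation and add
    # whatever the current context contributes.
    H = H + self_attention_layer(H, Wq, Wk, Wv)

print(H.shape)  # same shape as X, but each row now depends on the context
```

The second layer attends over the output of the first, which is what "using richer token representations" amounts to in this sketch.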

For language models, all of that eventually feeds into a score over the next token.
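As a rough sketch of that last step, the representation at the final position can be multiplied by an output matrix to get one score per vocabulary entry. Everything below is a stand-in: the four-word vocabulary, the matrix name `W_unembed`, and the random values are assumptions for illustration, not part of any real model.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat"]   # hypothetical toy vocabulary
d_model = 4

# Stand-in for the final-layer representations of 3 tokens.
H = rng.normal(size=(3, d_model))
# Hypothetical output projection from d_model to vocabulary size.
W_unembed = rng.normal(size=(d_model, len(vocab)))

# Only the last position is needed to score the next token.
logits = H[-1] @ W_unembed
probs = softmax(logits)

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

The scores become a probability distribution only after the softmax; how a token is then chosen from that distribution is a separate question.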


Once the model has built context-aware representations, it still has to do one very concrete thing: produce a distribution over the next token, and then choose from it. That sounds like a small final step. It is not. A lot of the model’s personality lives there. That is the next post.