From Scores to Probabilities
A classifier usually does not begin by saying “this is a cat with probability 0.93.”
It begins with something more primitive: a list of raw numbers, one per class, and leaves you to decide what they mean.
If the classes are cat, dog, and ship, the model might produce scores like:
- cat: 3.1
- dog: 1.8
- ship: -0.7
Those are not probabilities yet. They are just preferences. And once you say that out loud, the next question becomes unavoidable:
How do raw scores become something you can interpret, and maybe even sample from?
The simplest answer
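The standard answer is softmax. Given a vector of scores $z = (z_1, \dots, z_K)$, one per class, the softmax function turns them into positive numbers that sum to one:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$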
The raw scores are usually called logits, and the key phrase for them is unnormalized log-probabilities. The model is really saying: I like class A this much, class B somewhat less, class C not much at all. Softmax is the normalization step that turns those relative preferences into a proper categorical distribution.
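As a worked example, here is softmax applied by hand to the cat, dog, and ship scores above. The values are rounded, so treat the last digit as approximate:

$$
\begin{aligned}
e^{3.1} &\approx 22.198, \quad e^{1.8} \approx 6.050, \quad e^{-0.7} \approx 0.497, \quad \text{sum} \approx 28.74 \\
p_{\text{cat}} &\approx 22.198 / 28.74 \approx 0.772 \\
p_{\text{dog}} &\approx 6.050 / 28.74 \approx 0.210 \\
p_{\text{ship}} &\approx 0.497 / 28.74 \approx 0.017
\end{aligned}
$$

The three probabilities add up to one (up to rounding), and the ordering of the classes is the same as the ordering of the raw scores.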
Why exponentials show up
The exponentials do two useful things.
First, they make every output positive. Second, they amplify score differences before normalization. If one class has a score a bit larger than the others, exponentiation makes that advantage more pronounced: a gap of $d$ between two scores becomes a ratio of $e^{d}$ between the corresponding probabilities.
Softmax is not merely “divide by the sum.” It is “exponentiate, then divide by the sum.” That is why it feels more decisive than a simple linear rescaling.
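To make the contrast concrete, here is a minimal sketch that compares softmax with a plain divide-by-the-sum rescaling on the cat, dog, and ship scores. The helper names `linearRescale` and `softmaxDemo` are hypothetical and exist only for this illustration.

```typescript
// Hypothetical helper: rescale scores by dividing by their plain sum.
function linearRescale(scores: number[]): number[] {
  const total = scores.reduce((a, b) => a + b, 0);
  return scores.map(s => s / total);
}

// Softmax: exponentiate first, then divide by the sum.
function softmaxDemo(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

const demoScores = [3.1, 1.8, -0.7]; // cat, dog, ship
const linear = linearRescale(demoScores);
const soft = softmaxDemo(demoScores);

console.log(linear); // the ship entry is negative, so this is not a distribution
console.log(soft);   // every entry is positive and the entries sum to 1

// A score gap becomes a probability ratio: p(cat) / p(dog) equals exp(3.1 - 1.8).
console.log(soft[0] / soft[1], Math.exp(3.1 - 1.8));
```

The plain rescaling breaks as soon as a score is negative, which is one practical reason the exponential comes first.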
A useful invariance
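If you add the same constant $c$ to every score $z_i$, the probabilities do not change, because the factor $e^{c}$ cancels between numerator and denominator:

$$\frac{e^{z_i + c}}{\sum_j e^{z_j + c}} = \frac{e^{c}\, e^{z_i}}{e^{c} \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}.$$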
This sounds like a technical footnote, but it is very useful. It means softmax cares about relative scores, not absolute level. And it is why the standard numerically stable implementation can subtract the maximum score before exponentiating: the probabilities are unchanged, but the largest exponent becomes zero, so nothing overflows.
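The sketch below illustrates both points: shifting every score by the same amount leaves the output unchanged, and skipping the max subtraction fails on large scores. The names `softmaxNaive` and `softmaxStable` and the test inputs are assumptions made for this example.

```typescript
// Naive softmax: exponentiates the raw scores directly.
function softmaxNaive(scores: number[]): number[] {
  const exps = scores.map(s => Math.exp(s));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// Stable softmax: subtracts the maximum first, so the largest exponent is 0.
function softmaxStable(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

const small = [3.1, 1.8, -0.7];
const shifted = small.map(s => s + 100); // same scores, all moved up by 100

// Shift invariance: both inputs give the same probabilities (up to rounding).
console.log(softmaxStable(small), softmaxStable(shifted));

const huge = [1000, 1001, 1002];
console.log(softmaxNaive(huge));  // Math.exp(1000) overflows to Infinity, so every entry is NaN
console.log(softmaxStable(huge)); // finite probabilities that sum to 1
```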
A tiny implementation
    // Numerically stable softmax: subtract the max score before exponentiating.
    export function softmax(scores: number[]): number[] {
      const maxScore = Math.max(...scores);
      const exps = scores.map(s => Math.exp(s - maxScore));
      const total = exps.reduce((a, b) => a + b, 0);
      return exps.map(v => v / total);
    }

    // Draw one index from a categorical distribution by walking the cumulative sum.
    export function sampleCategorical(probs: number[]): number {
      const u = Math.random(); // uniform in [0, 1)
      let acc = 0;

      for (let i = 0; i < probs.length; i++) {
        acc += probs[i];
        if (u < acc) return i; // strict, so a zero-probability class is never chosen
      }

      // Fallback for rounding error when the probabilities sum to slightly less than 1.
      return probs.length - 1;
    }

    const labels = ["cat", "dog", "ship"];
    const scores = [3.1, 1.8, -0.7];

    const probs = softmax(scores);
    const chosen = labels[sampleCategorical(probs)];

    console.log(probs, chosen);

At the implementation level, this is the same cumulative-sum walk as sampleWeighted from the second post; only now the weights came from a model and were normalized by softmax.
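One way to convince yourself the sampler is doing the right thing is to draw many samples and compare the observed frequencies with the softmax probabilities. The following is a rough sketch, not a rigorous statistical test. The names `softmaxProbs`, `drawIndex`, and `empiricalFrequencies`, the injectable `rng` parameter, and the draw count of 100000 are all assumptions made for this example.

```typescript
function softmaxProbs(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// Cumulative-sum walk; `rng` is injectable so the walk can be checked deterministically.
function drawIndex(probs: number[], rng: () => number = Math.random): number {
  const u = rng();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }
  return probs.length - 1;
}

// Fraction of draws that landed on each index.
function empiricalFrequencies(probs: number[], draws: number): number[] {
  const counts = probs.map(() => 0);
  for (let n = 0; n < draws; n++) counts[drawIndex(probs)]++;
  return counts.map(c => c / draws);
}

const targetProbs = softmaxProbs([3.1, 1.8, -0.7]);
const observed = empiricalFrequencies(targetProbs, 100000);

// The two arrays should agree to roughly two decimal places.
console.log(targetProbs, observed);
```

Because the draws are random, the observed frequencies will differ slightly from run to run; only the long-run agreement matters.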
What training does with this
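Softmax is not only a decoding trick. It is part of a standard training story. A common multiclass loss is the negative log-probability of the correct class. If $y$ is the correct class and $p_y$ is the probability softmax assigns to it from the scores $z$:

$$\mathcal{L} = -\log p_y = -z_y + \log \sum_j e^{z_j}.$$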
The model is rewarded when it assigns high probability to the right label, and punished when it spreads probability mass in the wrong places. That makes softmax feel less like an arbitrary formula and more like the natural bridge between scores and probabilistic supervision.
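Here is a minimal sketch of that loss for a single example, computed directly from the scores with a log-sum-exp so the probabilities never have to be formed explicitly. The function names `logSumExp` and `crossEntropy` are assumptions for this illustration, not part of the series code.

```typescript
// Log of the sum of exponentials, with the max subtracted for stability.
function logSumExp(scores: number[]): number {
  const maxScore = Math.max(...scores);
  const total = scores.reduce((acc, s) => acc + Math.exp(s - maxScore), 0);
  return maxScore + Math.log(total);
}

// Negative log-probability of the correct class: -log softmax(scores)[target].
function crossEntropy(scores: number[], target: number): number {
  return logSumExp(scores) - scores[target];
}

const classScores = [3.1, 1.8, -0.7]; // cat, dog, ship

console.log(crossEntropy(classScores, 0)); // small loss when "cat" is the correct label
console.log(crossEntropy(classScores, 2)); // much larger loss when "ship" is the correct label
```

The loss is small when the correct class already has the highest score and grows as the correct class falls further behind the others.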
Why this belongs in a series about sampling
Because once the scores become probabilities, you have a real discrete distribution. You can take the argmax and pick the most likely class. Or you can actually sample from it.
A classifier is often used greedily: just take the top label. But the same machinery later powers stochastic choices in generative models, where sampling is the whole point.
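The difference between the two modes can be sketched in a few lines. The probabilities below are rounded values assumed for illustration, and the helper names `argmaxIndex` and `sampleIndexFrom` are hypothetical.

```typescript
// Greedy choice: index of the largest probability (first one wins on ties).
function argmaxIndex(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Stochastic choice: cumulative-sum walk over the probabilities.
function sampleIndexFrom(probs: number[]): number {
  const u = Math.random();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }
  return probs.length - 1;
}

const classLabels = ["cat", "dog", "ship"];
const classProbs = [0.77, 0.21, 0.02]; // rounded, assumed for illustration

console.log("greedy:", classLabels[argmaxIndex(classProbs)]); // always "cat"

const picks: string[] = [];
for (let n = 0; n < 10; n++) picks.push(classLabels[sampleIndexFrom(classProbs)]);
console.log("sampled:", picks.join(", ")); // mostly "cat", with other labels appearing occasionally
```

Greedy decoding is deterministic for a fixed distribution, while sampling reflects the whole distribution over repeated draws.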
Softmax solves a clean problem: given visible classes and visible scores, turn one into the other. But some models want more structure than that. They want to explain the visible world by introducing something unseen underneath it: a topic, a cluster, an intent, a state, a cause. That is the next post.