From Scores to Probabilities


A classifier usually does not begin by saying “this is a cat with probability 0.93.”

It begins with something more primitive: a list of raw numbers, one per class, and leaves you to decide what they mean.

If the classes are cat, dog, and ship, the model might produce scores like:

  • cat: 3.1
  • dog: 1.8
  • ship: -0.7

Those are not probabilities yet. They are just preferences. And once you say that out loud, the next question becomes unavoidable:

How do raw scores become something you can interpret, and maybe even sample from?

The simplest answer

The standard answer is softmax. Given a vector of scores $z_1,\dots,z_K$, the softmax function turns them into positive numbers that sum to one:

$$\operatorname{softmax}(z_i)=\frac{e^{z_i}}{\sum_j e^{z_j}}.$$

The key phrase is unnormalized log-probabilities: the raw scores (often called logits) act like log-probabilities that have not been normalized yet. The model is really saying: I like class A this much, class B somewhat less, class C not much at all. Softmax is the normalization step that turns those relative preferences into a proper categorical distribution.
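
As a quick worked check, here is the formula applied by hand to the cat, dog, and ship scores from the start of the post. All values are rounded to three decimals, so the three probabilities add up to one only approximately.

```latex
e^{3.1} \approx 22.198, \qquad e^{1.8} \approx 6.050, \qquad e^{-0.7} \approx 0.497,
\qquad \sum_j e^{z_j} \approx 28.744

p_{\text{cat}} \approx \frac{22.198}{28.744} \approx 0.772, \qquad
p_{\text{dog}} \approx \frac{6.050}{28.744} \approx 0.210, \qquad
p_{\text{ship}} \approx \frac{0.497}{28.744} \approx 0.017
```

A score lead of 1.3 over dog turns into roughly 77% of the probability mass for cat.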

Why exponentials show up

The exponentials do two useful things.

First, they make every output positive. Second, they amplify score differences before normalization. If one class has a score a bit larger than the others, exponentiation makes that advantage more pronounced.

Softmax is not merely “divide by the sum.” It is “exponentiate, then divide by the sum.” That is why it feels more decisive than a simple linear rescaling.
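
To make the contrast concrete, here is a small sketch that compares softmax with a plain "divide by the sum" rescaling. The `linearRescale` helper and the scores `[5, 4, 3]` are made up for illustration; they are not part of the series code.

```typescript
// Softmax with the usual max-subtraction for numerical stability.
function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// Hypothetical baseline for comparison: plain "divide by the sum".
// Only meaningful when every score is positive.
function linearRescale(scores: number[]): number[] {
  const total = scores.reduce((a, b) => a + b, 0);
  return scores.map(s => s / total);
}

const scores = [5, 4, 3];
const linear = linearRescale(scores); // about [0.417, 0.333, 0.250]
const soft = softmax(scores);         // about [0.665, 0.245, 0.090]

// A score gap of d becomes a probability ratio of e^d.
const ratio = soft[0] / soft[1];      // e^(5 - 4), about 2.718

console.log(linear, soft, ratio);
```

The same one-point gaps that linear rescaling treats gently become a factor of about 2.7 per step under softmax. Linear rescaling would also break on the original example, because a score of -0.7 would produce a negative "probability".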

A useful invariance

If you add the same constant $c$ to every score, the probabilities do not change:

$$\frac{e^{z_i+c}}{\sum_j e^{z_j+c}} = \frac{e^c e^{z_i}}{e^c \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}.$$

This sounds like a technical footnote, but it is very useful. It means softmax cares about relative scores, not absolute level. And it is why the standard numerically stable implementation subtracts the maximum score before exponentiating: the largest exponent becomes $e^0 = 1$, so nothing can overflow.
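
Both halves of that claim can be checked numerically. The sketch below is illustrative only: `softmaxNaive` and `softmaxStable` are names chosen here, and the shift of 100 and the scores around 1000 are arbitrary test values.

```typescript
// Naive version: exponentiate the raw scores directly.
function softmaxNaive(scores: number[]): number[] {
  const exps = scores.map(s => Math.exp(s));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// Stable version: shift so the largest score is 0 before exponentiating.
function softmaxStable(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// 1. Shift invariance: adding the same constant to every score changes nothing.
const base = [3.1, 1.8, -0.7];
const shifted = base.map(s => s + 100);
const pBase = softmaxStable(base);
const pShifted = softmaxStable(shifted);
// Largest elementwise difference; only floating-point noise is expected.
const maxDiff = Math.max(...pBase.map((p, i) => Math.abs(p - pShifted[i])));

// 2. Overflow: Math.exp(1000) is Infinity, so the naive version yields NaN.
const big = [1000, 999, 998];
const naiveBig = softmaxNaive(big);   // [NaN, NaN, NaN]
const stableBig = softmaxStable(big); // about [0.665, 0.245, 0.090]

console.log(maxDiff, naiveBig, stableBig);
```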

A tiny implementation

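// Numerically stable softmax: subtracting the max leaves the result unchanged
// (shift invariance) but keeps Math.exp from overflowing.
export function softmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// Draw one index from a categorical distribution by walking the cumulative sum.
export function sampleCategorical(probs: number[]): number {
  const u = Math.random(); // uniform in [0, 1)
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u <= acc) return i;
  }
  // Guard against floating-point round-off leaving acc slightly below 1.
  return probs.length - 1;
}

const labels = ["cat", "dog", "ship"];
const scores = [3.1, 1.8, -0.7];
const probs = softmax(scores);
const chosen = labels[sampleCategorical(probs)];
console.log(probs, chosen);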

At the implementation level, this is the same cumulative-sum walk as sampleWeighted from the second post; only now the weights came from a model and were normalized by softmax.
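
To see the cumulative-sum walk without any randomness in the way, the sketch below passes the uniform draw in as an argument. `pickIndex` is a hypothetical helper introduced only for this illustration, and the probabilities are roughly the softmax of the example scores, rounded so they sum to one.

```typescript
// Hypothetical helper: the same walk as sampleCategorical, but the uniform
// draw u is supplied by the caller so the result is reproducible.
function pickIndex(probs: number[], u: number): number {
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u <= acc) return i;
  }
  return probs.length - 1; // round-off guard
}

const labels = ["cat", "dog", "ship"];
// Roughly softmax([3.1, 1.8, -0.7]); cumulative sums are 0.772, 0.982, 1.000.
const probs = [0.772, 0.21, 0.018];

const low = labels[pickIndex(probs, 0.5)];   // 0.5 falls in the first bucket
const mid = labels[pickIndex(probs, 0.9)];   // 0.9 falls between 0.772 and 0.982
const high = labels[pickIndex(probs, 0.99)]; // 0.99 falls in the last bucket

console.log(low, mid, high);
```

Each class owns a slice of the interval from 0 to 1 whose width equals its probability, and the draw simply lands in one of the slices.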

What training does with this

Softmax is not only a decoding trick. It is part of a standard training story. A common multiclass loss is the negative log-probability of the correct class:

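$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right).$$

Here $f_j$ is the score for class $j$ on example $i$ (the role $z_j$ played above), and $y_i$ is the index of the correct class.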

The model is rewarded when it assigns high probability to the right label, and punished when it spreads probability mass in the wrong places. That makes softmax feel less like an arbitrary formula and more like the natural bridge between scores and probabilistic supervision.
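
A minimal sketch of that loss for a single example, assuming the scores arrive as a plain array and the label is a class index. The name `crossEntropy` is chosen here for illustration. The sketch uses the log-sum-exp form of the same expression, so it never computes a probability that could underflow to zero before taking the log.

```typescript
// Negative log-probability of the correct class, via log-sum-exp:
// L = log(sum_j exp(f_j - m)) - (f_y - m), where m = max_j f_j.
function crossEntropy(scores: number[], label: number): number {
  const m = Math.max(...scores);
  const sumExp = scores.reduce((acc, f) => acc + Math.exp(f - m), 0);
  return Math.log(sumExp) - (scores[label] - m);
}

const scores = [3.1, 1.8, -0.7]; // cat, dog, ship

// If the true label is cat, the model already favors it: small loss.
const lossIfCat = crossEntropy(scores, 0);  // about 0.258

// If the true label is ship, the model gave it little mass: large loss.
const lossIfShip = crossEntropy(scores, 2); // about 4.058

console.log(lossIfCat, lossIfShip);
```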

Why this belongs in a series about sampling

Because once the scores become probabilities, you have a real discrete distribution. You can take the argmax and pick the most likely class. Or you can actually sample from it.

A classifier is often used greedily: just take the top label. But the same machinery later powers stochastic choices in generative models, where sampling is the whole point.
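
The two modes can be put side by side in a short sketch. The `argmax` helper and the injectable `rng` parameter are additions made for this illustration, and the probabilities are roughly the softmax of the example scores, rounded so they sum to one.

```typescript
// Greedy decoding: index of the largest probability.
function argmax(probs: number[]): number {
  let best = 0;
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i;
  }
  return best;
}

// Stochastic decoding: cumulative-sum walk over the distribution.
// rng is injectable for testing and defaults to Math.random.
function sampleCategorical(probs: number[], rng: () => number = Math.random): number {
  const u = rng();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u <= acc) return i;
  }
  return probs.length - 1; // round-off guard
}

const labels = ["cat", "dog", "ship"];
const probs = [0.772, 0.21, 0.018];

// Greedy always returns the same answer for the same distribution.
const greedy = labels[argmax(probs)];

// Sampling returns each class in proportion to its probability.
const draws = 100000;
const counts = [0, 0, 0];
for (let n = 0; n < draws; n++) {
  counts[sampleCategorical(probs)]++;
}
const freqs = counts.map(c => c / draws); // close to probs, varies from run to run

console.log(greedy, freqs);
```

Greedy decoding never says dog or ship here, while sampling says dog about one time in five.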


Softmax solves a clean problem: given visible classes and a visible score for each, turn the scores into probabilities. But some models want more structure than that. They want to explain the visible world by introducing something unseen underneath it: a topic, a cluster, an intent, a state, a cause. That is the next post.