Softmax Is Not Confidence — Alberto Schiabel

A classifier can print 0.93 and still be wrong much more than 7% of the time. The number may be a valid softmax probability and a poor confidence estimate at once.

The distinction starts one step earlier. A classifier produces one raw score per class. Those scores, usually called logits, can be compared with one another, but they are not yet probabilities.

If the classes are cat, dog, and ship, the model might produce scores like:

cat: 3.1
dog: 1.8
ship: -0.7

3.1 does not mean “3.1 cats.” The gap between 3.1 and 1.8 matters; adding 500 to every score does not. Softmax turns those relative scores into a categorical distribution.

The usual bridge

The standard way to turn scores into probabilities is softmax. Given a vector of scores $z = (z_1,\dots,z_K)$, the softmax function turns them into positive numbers that sum to one:

\operatorname{softmax}(z)_i=\frac{e^{z_i}}{\sum_j e^{z_j}}.

Within the categorical model, logits act as unnormalized log-weights. Softmax exponentiates them into positive weights and normalizes those weights to sum to one.

That is the bridge. Scores go in. A categorical distribution comes out.
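
As a worked instance, here is the arithmetic for the cat, dog, and ship scores from earlier. Values are rounded to three decimals, so the three probabilities sum to one only up to rounding:

```latex
e^{3.1} \approx 22.198, \quad e^{1.8} \approx 6.050, \quad e^{-0.7} \approx 0.497, \quad \sum_j e^{z_j} \approx 28.744

p_{\text{cat}} \approx \frac{22.198}{28.744} \approx 0.772, \quad
p_{\text{dog}} \approx \frac{6.050}{28.744} \approx 0.210, \quad
p_{\text{ship}} \approx \frac{0.497}{28.744} \approx 0.017
```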

Why exponentials show up

Exponentials turn every finite score into a positive weight and turn additive score gaps into multiplicative odds. If one class is two logit units above another, its weight is $e^2 \approx 7.39$ times larger, and normalization leaves that ratio unchanged.

Softmax is not merely “divide by the sum.” It is “exponentiate, then divide by the sum.” That is why it feels more decisive than a simple linear rescaling.
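
A small sketch can make both points concrete. It repeats a compact softmax so it runs on its own; `softmaxSketch` and `linearRescale` are hypothetical helper names introduced only for this illustration. It checks that the cat/dog probability ratio equals the exponential of the logit gap, and shows what plain "divide by the sum" does to a negative score.

```typescript
// Compact softmax, repeated here so the sketch is self-contained.
function softmaxSketch(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / total);
}

// Hypothetical helper: "divide by the sum" with no exponentiation.
function linearRescale(scores: number[]): number[] {
  const total = scores.reduce((a, b) => a + b, 0);
  return scores.map((s) => s / total);
}

const gapScores = [3.1, 1.8, -0.7]; // cat, dog, ship
const gapProbs = softmaxSketch(gapScores);

// The cat/dog probability ratio depends only on the logit gap 3.1 - 1.8.
const catDogRatio = gapProbs[0] / gapProbs[1];
const expectedRatio = Math.exp(3.1 - 1.8);

// Linear rescaling keeps the negative score negative, so its output
// cannot be read as a probability vector.
const rescaled = linearRescale(gapScores);

console.log(catDogRatio, expectedRatio, rescaled);
```

The two ratios agree up to floating-point error, while the rescaled vector still contains a negative entry for ship.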

A useful invariance

If you add the same constant $c$ to every score, the probabilities do not change:

\frac{e^{z_i+c}}{\sum_j e^{z_j+c}} = \frac{e^c e^{z_i}}{e^c \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}.

Softmax cares about relative scores, not their absolute level. The invariance also gives us the standard numerical fix: subtract the largest logit before exponentiating, so the largest exponential becomes 1 instead of overflowing.
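
The overflow is easy to reproduce. The sketch below compares a naive softmax with the max-subtracted version on deliberately large logits; `naiveSoftmax`, `stableSoftmax`, and the sample values are assumptions made up for this illustration.

```typescript
// Naive softmax: exponentiates the raw scores directly.
function naiveSoftmax(scores: number[]): number[] {
  const exps = scores.map((s) => Math.exp(s));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / total);
}

// Stable softmax: subtracts the largest score first, which the shift
// invariance says cannot change the result.
function stableSoftmax(scores: number[]): number[] {
  const maxScore = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / total);
}

const largeScores = [1000, 999, 995];

// Math.exp(1000) overflows to Infinity, and Infinity / Infinity is NaN.
const naiveProbs = naiveSoftmax(largeScores);

// The same scores survive once the maximum is subtracted.
const stableProbs = stableSoftmax(largeScores);

// Shifting every score by -1000 by hand gives the same distribution.
const shiftedProbs = stableSoftmax(largeScores.map((s) => s - 1000));

console.log(naiveProbs, stableProbs, shiftedProbs);
```

The naive version returns only NaN values here, while the stable and hand-shifted versions agree.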

A tiny implementation

softmax-sample.ts

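export function softmax(scores: number[]): number[] {
  // Subtract the largest score so the biggest exponent is exp(0) = 1 and cannot overflow.
  const maxScore = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - maxScore));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / total);
}

// Draws index i with probability probs[i] by walking the cumulative sum.
export function sampleCategorical(probs: number[]): number {
  const u = Math.random(); // uniform in [0, 1)
  let acc = 0;

  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }

  // Rounding can leave the cumulative sum slightly below 1; fall back to the last class.
  return probs.length - 1;
}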

const labels = ["cat", "dog", "ship"];
const scores = [3.1, 1.8, -0.7];

const probs = softmax(scores);
const chosen = labels[sampleCategorical(probs)];

console.log(probs, chosen);

At the implementation level, this is the same cumulative-sum walk as sampleWeighted from the second post; only now the weights came from a model and were normalized by softmax.
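
One way to sanity-check a sampler like this is to draw many times and compare the observed frequencies with the input probabilities. The sketch below assumes a variant that accepts an injectable random source; `sampleIndex`, `empiricalFrequencies`, and the `rand` parameter are hypothetical names for this illustration, not part of the listing above.

```typescript
// Same cumulative-sum walk, but with an injectable uniform source in [0, 1)
// so the behavior can be tested deterministically.
function sampleIndex(probs: number[], rand: () => number = Math.random): number {
  const u = rand();
  let acc = 0;
  for (let i = 0; i < probs.length; i++) {
    acc += probs[i];
    if (u < acc) return i;
  }
  // Guard against the cumulative sum landing slightly below 1.
  return probs.length - 1;
}

// Draws `draws` samples and returns the fraction that landed on each index.
function empiricalFrequencies(
  probs: number[],
  draws: number,
  rand: () => number = Math.random
): number[] {
  const counts = probs.map(() => 0);
  for (let n = 0; n < draws; n++) {
    counts[sampleIndex(probs, rand)] += 1;
  }
  return counts.map((c) => c / draws);
}

// Illustrative target distribution (roughly the cat/dog/ship softmax output).
const targetProbs = [0.772, 0.21, 0.018];
const observed = empiricalFrequencies(targetProbs, 100000);

console.log(observed);
```

With enough draws the observed frequencies should sit close to the target probabilities; the exact values differ from run to run because the default source is `Math.random`.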

What training does with this

Softmax is not only a decoding trick. It is part of a standard training story. A common multiclass loss is the negative log-probability of the correct class. For training example $i$ with correct label $y_i$ and logits $z_1,\dots,z_K$:

L_i = -\log \left(\frac{e^{z_{y_i}}}{\sum_j e^{z_j}}\right).

The model is rewarded when it assigns high probability to the right label, and punished when it spreads probability mass in the wrong places. That makes softmax feel less like an arbitrary formula and more like the natural bridge between scores and probabilistic supervision.
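
For a single example, this loss can be computed without ever forming the probabilities, because the negative log of the softmax ratio equals the log-sum-exp of the logits minus the correct class's logit. The following is a minimal sketch under that identity; `logSumExp` and `crossEntropy` are names chosen for this illustration.

```typescript
// log(sum_j exp(logit_j)), computed stably by factoring out the largest logit.
function logSumExp(logits: number[]): number {
  const maxLogit = Math.max(...logits);
  const sum = logits.reduce((acc, z) => acc + Math.exp(z - maxLogit), 0);
  return maxLogit + Math.log(sum);
}

// Negative log-probability of the correct class:
// -log(softmax(logits)[correctIndex]) = logSumExp(logits) - logits[correctIndex]
function crossEntropy(logits: number[], correctIndex: number): number {
  return logSumExp(logits) - logits[correctIndex];
}

const exampleLogits = [3.1, 1.8, -0.7]; // cat, dog, ship

// Small loss: cat already holds most of the probability mass.
const lossIfCat = crossEntropy(exampleLogits, 0);

// Large loss: ship was given very little probability.
const lossIfShip = crossEntropy(exampleLogits, 2);

console.log(lossIfCat, lossIfShip);
```

For these scores the loss is about 0.26 if the true label is cat and about 4.06 if it is ship, which is the reward-and-punish behavior described above.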

The important caveat

A softmax probability is not automatically a calibrated promise about the world.

If a model assigns 0.93 to “cat,” the precise claim is narrower than people often assume: under this score vector, after this normalization, the cat class received 93% of the probability mass. That is not the same thing as saying that 93 out of 100 such images will be cats unless the model has also been calibrated and evaluated that way.

Calibration asks an empirical question: among predictions made with confidence near 0.93, is the model correct about 93% of the time? Modern neural networks can fail that test even when their accuracy is good.¹ Methods such as temperature scaling fit a correction on held-out data. A softmax layer on its own performs no such correction.
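
To make the empirical question concrete, here is a minimal sketch of one common summary, expected calibration error with equal-width confidence bins, plus a temperature-scaled softmax. The `Prediction` type, the function names, and the ten toy predictions are assumptions made up for illustration, not measurements from any model. Fitting the temperature itself, typically by minimizing negative log-likelihood on held-out data, is left out.

```typescript
// One held-out prediction: the top softmax probability and whether it was right.
interface Prediction {
  confidence: number; // max softmax probability, in [0, 1]
  correct: boolean;
}

// Sample-weighted average of |accuracy - mean confidence| over confidence bins.
function expectedCalibrationError(preds: Prediction[], numBins: number = 10): number {
  const counts: number[] = [];
  const hits: number[] = [];
  const confidenceSums: number[] = [];
  for (let b = 0; b < numBins; b++) {
    counts.push(0);
    hits.push(0);
    confidenceSums.push(0);
  }
  for (const p of preds) {
    // Confidence 1.0 would index past the end, so clamp it into the last bin.
    const bin = Math.min(numBins - 1, Math.floor(p.confidence * numBins));
    counts[bin] += 1;
    if (p.correct) hits[bin] += 1;
    confidenceSums[bin] += p.confidence;
  }
  let ece = 0;
  for (let b = 0; b < numBins; b++) {
    if (counts[b] === 0) continue;
    const accuracy = hits[b] / counts[b];
    const meanConfidence = confidenceSums[b] / counts[b];
    ece += (counts[b] / preds.length) * Math.abs(accuracy - meanConfidence);
  }
  return ece;
}

// Temperature scaling divides every logit by T > 0 before softmax.
// T > 1 flattens the distribution, T < 1 sharpens it; the top class is unchanged.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (!(temperature > 0)) throw new Error("temperature must be positive");
  const scaled = logits.map((z) => z / temperature);
  const maxScaled = Math.max(...scaled);
  const exps = scaled.map((z) => Math.exp(z - maxScaled));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / total);
}

// Toy held-out set: ten predictions at confidence 0.93, only seven correct.
const heldOut: Prediction[] = [];
for (let i = 0; i < 10; i++) {
  heldOut.push({ confidence: 0.93, correct: i < 7 });
}
const toyEce = expectedCalibrationError(heldOut);

const baseProbs = softmaxWithTemperature([3.1, 1.8, -0.7], 1);
const cooledProbs = softmaxWithTemperature([3.1, 1.8, -0.7], 2);

console.log(toyEce, baseProbs, cooledProbs);
```

On the toy set the gap between stated confidence (0.93) and observed accuracy (0.7) shows up as an error of about 0.23. Raising the temperature to 2 lowers the top probability without changing which class wins.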

Softmax gives code a distribution it can sample from. It does not tell a person how much to trust the model. Keeping those contracts separate prevents a normalization function from masquerading as evidence.