Skip to main content

Matrix Calculus for Deep Learning, Without the Fog

The formulas I keep re-learning, rewritten as a visual memory palace.

· 12 min read

TL;DR

Treat a Jacobian as the local matrix of a nonlinear map. Memorize a small toolbox: sum, dot product, affine map, and element-wise nonlinearity. Then backprop is just local linear maps composed together, with two famous simplifications to remember: binary cross-entropy gives $a - y$, and softmax-cross-entropy gives $p - y$.

On this page

Every few months I need the same deep learning derivatives again. Not all of them. The same small cluster: affine maps, element-wise activations, Jacobians, chain rules, and the softmax-cross-entropy shortcut. I keep ending up back at Parr and Howard’s excellent article1. This post is my own rewrite of a large chunk of that material, organized the way I wish my notes looked when I am half-awake and trying to remember why a gradient has the shape it has.

The thread running through all of it is simple:

  • Geometry: a Jacobian is the best local linear map near the current point.
  • Algebra: backprop is repeated composition of those local maps.
  • Probability: the famous output-layer gradients are just “predicted distribution minus what reality said.”

Geometry

Columns tell you where the basis vectors go. That is true for ordinary matrices and for Jacobians viewed locally.

Algebra

Once you keep shapes straight, the chain rule stops feeling magical. It becomes matrix multiplication with a few diagonals.

Probability

Cross-entropy gradients are not random symbols. They are the correction that moves probability mass toward the event that actually happened.

The One Convention That Matters

Convention

I will use column vectors. For a scalar loss L(x)L(x), the gradient xL\nabla_x L has the same shape as xx. For a vector-valued function f:RnRmf : \mathbb{R}^n \to \mathbb{R}^m, the Jacobian is the m×nm \times n matrix

Jf(x)ij=fixj.J_f(x)_{ij} = \frac{\partial f_i}{\partial x_j}.

If your favorite notes use row-vector gradients, transpose everything consistently and the substance does not change.

With that convention, there are only a few cases to remember:

mappingobject you getshape
f:RRf : \mathbb{R} \to \mathbb{R}ordinary derivative f(x)f'(x)scalar
f:RnRf : \mathbb{R}^n \to \mathbb{R}gradient xf\nabla_x fn×1n \times 1
f:RnRmf : \mathbb{R}^n \to \mathbb{R}^mJacobian Jf(x)J_f(x)m×nm \times n

Most confusion in matrix calculus is not real math confusion. It is convention confusion. Once you settle the orientation, the rest is bookkeeping.

A Jacobian Is a Local Matrix

The Jacobian is not a bag of partial derivatives. It is the matrix that best describes how a nonlinear function behaves near the current point:

f(x+Δx)f(x)+Jf(x)Δx.f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x.

That is the entire geometric story. The Jacobian is the local linear approximation. If you zoom in far enough, a smooth nonlinear function looks linear, and the Jacobian is the matrix of that local linear map.

The columns of Jf(x)J_f(x) are especially useful to keep in mind. The jj-th column tells you what happens to the jj-th basis direction eje_j. If you know where the basis goes, you know where any vector goes, because any vector is a weighted sum of basis vectors.

Linear Map Explorer

The cleanest way to read a Jacobian is as a local linear map. Its columns tell you where the basis vectors go. Once you know that, the output of the map is just a weighted mix of those columns.

What to notice

Columns tell the whole story: one column tilts right, the other lifts upward.

det(W)

1.88

Orientation is preserved.

Input Space

Your vector x and the standard basis.

e₁e₂x

Output Space

auto-fit

The same map after the grid and basis have been transformed.

e₁e₂Wx

Read the columns

W e₁[1.10, 0.80]
W e₂[-0.70, 1.20]
x = 1.75 e₁ +1.10 e₂

Wx = 1.75 · W e₁ +1.10 · W e₂ = [1.16, 2.72]

Controls

Why this matters for deep learning

For a nonlinear function, the Jacobian J_f(x) plays the role of this matrix only near the current point. Backprop works because local linear maps compose cleanly.

That widget is literally the mental model I use for Jacobians. A matrix is a story about basis vectors. A Jacobian is the same story, but only locally.

Finite differences are a good way to keep yourself honest2. If you ever derive a Jacobian on paper and want a quick check, approximate each column numerically:

jacobian_fd.py
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x), dtype=float)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

def f(x):
    x1, x2 = x
    return np.array([
        x1 * x2,
        np.sin(x1) + x2**2,
    ])

x = np.array([1.2, -0.7])
J_analytic = np.array([
    [x[1], x[0]],
    [np.cos(x[0]), 2 * x[1]],
])

np.testing.assert_allclose(
    jacobian_fd(f, x),
    J_analytic,
    rtol=1e-5,
    atol=1e-5,
)

Four Local Maps Worth Memorizing

Backprop is easier when you stop memorizing giant end-to-end formulas and memorize the local pieces instead.

1. Sum Reduction

If

s=i=1nxi=1x,s = \sum_{i=1}^{n} x_i = \mathbf{1}^\top x,

then the gradient with respect to xx is just the all-ones vector:

xs=1.\nabla_x s = \mathbf{1}.

Each coordinate contributes equally to the sum. Nothing subtle is happening.

2. Dot Product

If

s=wx,s = w^\top x,

then

xs=w,ws=x.\nabla_x s = w, \qquad \nabla_w s = x.

This is one of the most reused facts in neural networks. A neuron pre-activation is just a dot product plus a bias term.

3. Affine Map

If

z=Wx+b,z = Wx + b,

then the Jacobian of zz with respect to xx is simply

Jz(x)=W.J_z(x) = W.

That is unsurprising: affine maps are already linear, up to a constant shift. The bias disappears in the derivative because constants do not change under infinitesimal perturbations.

4. Element-Wise Nonlinearity

If

a=ϕ(z),a = \phi(z),

where ϕ\phi acts coordinate-wise, then the Jacobian is diagonal:

Jϕ(z)=[ϕ(z1)000ϕ(z2)000ϕ(zn)].J_\phi(z) = \begin{bmatrix} \phi'(z_1) & 0 & \cdots & 0 \\ 0 & \phi'(z_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \phi'(z_n) \end{bmatrix}.

Why diagonal? Because each output coordinate depends only on the matching input coordinate. There is no cross-talk.

Put together, those four pieces cover a surprising amount of deep learning practice.

PrimitiveRuleMemory hook
s = Σᵢ xᵢ∇x s = 1The sum listens equally to every coordinate.
s = wᵀx∇x s = w, ∇w s = xDifferentiate with respect to one side, keep the other.
z = Wx + bJz(x) = WAffine maps already are linear maps.
a = φ(z)Jφ(z) = diag(φ′(z))Element-wise operations only touch the diagonal.

The Chain Rules That Power Backprop

Here is the version of the chain rule that shows up most often in neural nets.

Suppose

y=f(x),L=(y),y = f(x), \qquad L = \ell(y),

where LL is a scalar loss. Then

xL=Jf(x)yL.\nabla_x L = J_f(x)^\top \nabla_y L.

This is the backprop pattern in one line. A gradient flowing backward through a vector-valued function gets multiplied by the transpose of the local Jacobian.

Why the transpose? Because the Jacobian maps forward perturbations Δx\Delta x into forward perturbations Δy\Delta y. Gradients live in the dual direction: they tell you how a scalar changes with respect to those perturbations. Backward flow is the transpose action.

For two vector-valued functions,

z=g(x),y=f(z),z = g(x), \qquad y = f(z),

the forward Jacobian chain rule is

Jfg(x)=Jf(g(x))Jg(x).J_{f \circ g}(x) = J_f(g(x))\,J_g(x).

So the whole picture is:

  • forward pass: multiply Jacobians in forward order
  • backward pass: multiply transposed Jacobians in reverse order

If the function in the middle is element-wise, its Jacobian is diagonal, so the backward step becomes an element-wise multiplication:

zL=aLϕ(z),\nabla_z L = \nabla_a L \odot \phi'(z),

where \odot is the Hadamard product.

That is why backprop code is full of expressions like grad *= relu_prime(z) or dz = da * sigmoid_prime(z).

The easiest way to make that pattern concrete is to run one tiny network forward and then walk the scalar loss backward through each local map:

Backprop Flow Explorer

This is the scalar-after-vector chain rule as a concrete machine. A two-layer network runs forward, then one scalar loss walks backward through transposed local maps. Step through the reverse pass and keep the shapes straight.

What to notice

Both hidden units stay active, so every local map participates in the backward chain.

ŷ

0.934

target

0.350

loss

0.171

|∇xL|

0.141

x2×1

[0.900, -1.100]

forward

z = W₁x + b₁ (2×2)

backward

z2×1

[1.480, -0.690]

forward

a = σ(z) (2×2)

backward

a2×1

[0.815, 0.334]

forward

ŷ = w₂ᵀa + b₂ (1×2)

backward

ŷscalar

0.934

forward

L = 1/2 (ŷ - y)²

backward

← (ŷ - y)

Lscalar

0.171

Selected backward step

A scalar loss starts the backward pass with one scalar slope. Everything else is just that number being remapped.

prediction ŷ0.934
scalar slope0.584
∂L/∂ŷ0.584

local map shape

1×1

outgoing shape

1×1

Collapsed gradient

∇xL = W₁ᵀ diag(σ'(z)) w₂ (ŷ - y)

Final input gradient: [0.071, -0.121]

The stronger pull lands on x₂, because that coordinate receives the larger mixed signal after the transpose of W₁.

Hidden sensitivities: σ'(z) = [0.151, 0.222]

One Neuron, End to End

Take the smallest nontrivial neural building block:

z=wx+b,a=σ(z),z = w^\top x + b, \qquad a = \sigma(z),

with sigmoid

σ(z)=11+ez.\sigma(z) = \frac{1}{1 + e^{-z}}.

Assume a binary label y{0,1}y \in \{0,1\} and binary cross-entropy loss

L(a,y)=yloga(1y)log(1a).L(a, y) = -y \log a - (1-y)\log(1-a).

The derivative with respect to the activation is

La=ya+1y1a.\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}.

The sigmoid derivative is

az=a(1a).\frac{\partial a}{\partial z} = a(1-a).

Multiply them:

Lz=(ya+1y1a)a(1a)=ay.\frac{\partial L}{\partial z} = \left( -\frac{y}{a} + \frac{1-y}{1-a} \right)a(1-a) = a - y.

That simplification is worth remembering. It is the binary cousin of the softmax-cross-entropy shortcut.

Once you have Lz\frac{\partial L}{\partial z}, the rest is immediate:

wL=Lzx=(ay)x,Lb=ay,xL=(ay)w.\nabla_w L = \frac{\partial L}{\partial z}\,x = (a-y)x, \qquad \frac{\partial L}{\partial b} = a-y, \qquad \nabla_x L = (a-y)w.

This is both algebraically useful and probabilistically intuitive.

  • If y=1y = 1 but the neuron predicts a=0.1a = 0.1, then ay=0.9a-y = -0.9. Gradient descent subtracts a negative quantity, pushing zz upward and making the positive class more likely.
  • If y=0y = 0 but the neuron predicts a=0.9a = 0.9, then ay=0.9a-y = 0.9. Gradient descent pushes zz downward.

The sign is not arbitrary. It is the correction needed to move the model’s probability toward the observed outcome.

single_neuron.py
import numpy as np

x = np.array([1.5, -2.0, 0.7])
w = np.array([0.4, -0.3, 0.2])
b = -0.1
y = 1.0

z = w @ x + b
a = 1.0 / (1.0 + np.exp(-z))
loss = -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

dz = a - y
dw = dz * x
db = dz

def loss_from_w(w_):
    z_ = w_ @ x + b
    a_ = 1.0 / (1.0 + np.exp(-z_))
    return -(y * np.log(a_) + (1.0 - y) * np.log(1.0 - a_))

eps = 1e-6
dw_fd = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    dw_fd[i] = (loss_from_w(w + step) - loss_from_w(w - step)) / (2 * eps)

np.testing.assert_allclose(dw, dw_fd, rtol=1e-5, atol=1e-5)

Softmax + Cross-Entropy: The Famous Collapse

For multiclass classification, the output layer is usually

p=softmax(z),pi=ezijezj,p = \operatorname{softmax}(z), \qquad p_i = \frac{e^{z_i}}{\sum_j e^{z_j}},

followed by cross-entropy with one-hot target vector yy:

L(p,y)=iyilogpi.L(p, y) = -\sum_i y_i \log p_i.

If the true class is tt, this becomes

L=logpt.L = -\log p_t.

That sentence already tells you the probabilistic meaning. Cross-entropy is the surprisal of the probability assigned to the event that actually happened. Low probability on the true class means high loss.

The full softmax Jacobian is

Jsoftmax(z)=diag(p)pp.J_{\mathrm{softmax}}(z) = \operatorname{diag}(p) - pp^\top.

Component-wise, that means

pizj={pi(1pi)i=j,pipjij.\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1-p_i) & i=j, \\ -p_i p_j & i \neq j. \end{cases}

The diagonal terms are positive because raising a class’s own logit raises its probability. The off-diagonal terms are negative because probability mass is conserved: if one class goes up, others must come down.

The miracle is what happens after you compose that Jacobian with cross-entropy. The gradient with respect to logits collapses to

zL=py.\nabla_z L = p - y.

No giant expression. No explicit Jacobian matrix in your implementation. Just predicted probabilities minus target probabilities.

Softmax + Cross-Entropy Explorer

Softmax converts raw scores into probabilities. Cross-entropy then asks a blunt question: how much probability did the model assign to the true class? The famous gradient p - y is the whole story in one line.

What to notice

The true class is underweighted, so the loss sends a sharp corrective signal.

loss

2.968

predicted

otter

p(target)

0.051

margin

0.883

class

otter

logit z3.40
probability p0.935
gradient p - y0.935

class

lynx

logit z0.50
probability p0.051
gradient p - y-0.949

class

falcon

logit z-0.80
probability p0.014
gradient p - y0.014

Read the sign

Negative gradient means gradient descent will raise that logit, because the update is z ← z - η∇zL. Positive gradient means the optimizer pushes that logit down.

The shortcut

z[3.40, 0.50, -0.80]
p = softmax(z)[0.935, 0.051, 0.014]
y[0, 1, 0]
∇z L = p - y[0.935, -0.949, 0.014]

Softmax Jacobian

Before cross-entropy collapses the expression, the local linear map is diag(p) - ppᵀ. Diagonal terms are positive; off-diagonal terms are negative, because boosting one class steals probability mass from the others.

positive self-sensitivitynegative cross-couplingentropy = 0.276
otterlynxfalconΣ row
otter0.061-0.048-0.0130.000
lynx-0.0480.049-0.0010.000
falcon-0.013-0.0010.0140.000

Each row sums to zero because softmax only redistributes probability mass. The heatmap makes that conservation law visible: green lives on the diagonal, red fills the trade-offs.

This formula is easier to remember if you read it probabilistically:

  • pp is what the model currently believes.
  • yy is what reality said.
  • pyp-y is the discrepancy.

Gradient descent acts on that discrepancy.

If the model gave too much mass to the wrong classes, those coordinates are positive and get pushed down. If it gave too little mass to the true class, that coordinate is negative and gets pushed up.

One clean derivation starts from

pL=yp,\nabla_p L = -\frac{y}{p},

where the division is element-wise, and then applies the scalar-after-vector chain rule:

zL=Jsoftmax(z)pL.\nabla_z L = J_{\mathrm{softmax}}(z)^\top \nabla_p L.

Carrying out the algebra gives the compact result pyp-y. In practice, most libraries fuse softmax and cross-entropy into one numerically stable operation because that compact form is both cheaper and more stable.

softmax_ce.py
import numpy as np

def softmax(z):
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([2.0, -0.5, 0.2])
y = np.array([1.0, 0.0, 0.0])

def cross_entropy_from_logits(z_):
    p_ = softmax(z_)
    return -np.sum(y * np.log(p_))

p = softmax(z)
grad = p - y

eps = 1e-6
grad_fd = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    grad_fd[i] = (
        cross_entropy_from_logits(z + step)
        - cross_entropy_from_logits(z - step)
    ) / (2 * eps)

np.testing.assert_allclose(grad, grad_fd, rtol=1e-5, atol=1e-5)

A Cheat Sheet I Actually Remember

Local maps

  • s = Σᵢxᵢ: gradient is all ones.
  • s = wᵀx: differentiate one side, keep the other.
  • z = Wx + b: Jacobian with respect to x is W.
  • a = φ(z): Jacobian is diagonal.

Backprop shortcuts

  • ∇xL = Jf(x)ᵀ ∇yL: scalar loss through a vector map.
  • ∇zL = ∇aL ⊙ φ′(z): element-wise nonlinearity backward pass.
  • ∂L/∂z = a - y: sigmoid + binary cross-entropy.
  • ∇zL = p - y: softmax + multiclass cross-entropy.

If I had to compress the whole page into one sentence, it would be this:

Deep learning derivatives look hard only until you see them as local linear maps plus probability corrections.

That is what a Jacobian is. That is what the chain rule composes. That is what the output-layer gradients are doing.

Further Reading

If you want the denser reference version, go straight to Parr and Howard’s original write-up3. If you want a practical habit that pays off immediately, keep finite-difference checks around in small NumPy scratch files. They are fast, they are humbling, and they save a lot of algebraic overconfidence.