Matrix Calculus for Deep Learning, Without the Fog

Every few months I need the same deep learning derivatives again. Not all of them. The same small cluster: affine maps, element-wise activations, Jacobians, chain rules, and the softmax-cross-entropy shortcut. I keep ending up back at Parr and Howard’s excellent article¹. This post is my own rewrite of a large chunk of that material, organized the way I wish my notes looked when I am half-awake and trying to remember why a gradient has the shape it has.

The thread running through all of it is simple:

Geometry: a Jacobian is the best local linear map near the current point.
Algebra: backprop is repeated composition of those local maps.
Probability: the famous output-layer gradients are just “predicted distribution minus what reality said.”

Geometry

Columns tell you where the basis vectors go. That is true for ordinary matrices and for Jacobians viewed locally.

Algebra

Once you keep shapes straight, the chain rule stops feeling magical. It becomes matrix multiplication with a few diagonals.

Probability

Cross-entropy gradients are not random symbols. They are the correction that moves probability mass toward the event that actually happened.

The One Convention That Matters

Convention

I will use column vectors. For a scalar loss $L(x)$ , the gradient $\nabla_x L$ has the same shape as $x$ . For a vector-valued function $f : \mathbb{R}^n \to \mathbb{R}^m$ , the Jacobian is the $m \times n$ matrix

J_f(x)_{ij} = \frac{\partial f_i}{\partial x_j}.

If your favorite notes use row-vector gradients, transpose everything consistently and the substance does not change.

With that convention, there are only a few cases to remember:

mapping	object you get	shape
$f : \mathbb{R} \to \mathbb{R}$	ordinary derivative $f'(x)$	scalar
$f : \mathbb{R}^n \to \mathbb{R}$	gradient $\nabla_x f$	$n \times 1$
$f : \mathbb{R}^n \to \mathbb{R}^m$	Jacobian $J_f(x)$	$m \times n$

Most confusion in matrix calculus is not real math confusion. It is convention confusion. Once you settle the orientation, the rest is bookkeeping.

A Jacobian Is a Local Matrix

The Jacobian is not a bag of partial derivatives. It is the matrix that best describes how a nonlinear function behaves near the current point:

f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x.

That is the entire geometric story. The Jacobian is the local linear approximation. If you zoom in far enough, a smooth nonlinear function looks linear, and the Jacobian is the matrix of that local linear map.

The columns of $J_f(x)$ are especially useful to keep in mind. The $j$ -th column tells you what happens to the $j$ -th basis direction $e_j$ . If you know where the basis goes, you know where any vector goes, because any vector is a weighted sum of basis vectors.

Linear Map Explorer

The cleanest way to read a Jacobian is as a local linear map. Its columns tell you where the basis vectors go. Once you know that, the output of the map is just a weighted mix of those columns.

What to notice

Columns tell the whole story: one column tilts right, the other lifts upward.

1.88

Orientation is preserved.

Input Space

Your vector x and the standard basis.

Output Space

auto-fit

The same map after the grid and basis have been transformed.

Read the columns

W e₁[1.10, 0.80]

W e₂[-0.70, 1.20]

x = 1.75 e₁ +1.10 e₂

Wx = 1.75 · W e₁ +1.10 · W e₂ = [1.16, 2.72]

Controls

W₁₁1.10

W₁₂-0.70

x₁1.75

W₂₁0.80

W₂₂1.20

x₂1.10

Why this matters for deep learning

For a nonlinear function, the Jacobian J_f(x) plays the role of this matrix only near the current point. Backprop works because local linear maps compose cleanly.

That widget is literally the mental model I use for Jacobians. A matrix is a story about basis vectors. A Jacobian is the same story, but only locally.

Finite differences are a good way to keep yourself honest². If you ever derive a Jacobian on paper and want a quick check, approximate each column numerically:

jacobian_fd.py

import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x), dtype=float)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return J

def f(x):
    x1, x2 = x
    return np.array([
        x1 * x2,
        np.sin(x1) + x2**2,
    ])

x = np.array([1.2, -0.7])
J_analytic = np.array([
    [x[1], x[0]],
    [np.cos(x[0]), 2 * x[1]],
])

np.testing.assert_allclose(
    jacobian_fd(f, x),
    J_analytic,
    rtol=1e-5,
    atol=1e-5,
)

Four Local Maps Worth Memorizing

Backprop is easier when you stop memorizing giant end-to-end formulas and memorize the local pieces instead.

1. Sum Reduction

s = \sum_{i=1}^{n} x_i = \mathbf{1}^\top x,

then the gradient with respect to $x$ is just the all-ones vector:

\nabla_x s = \mathbf{1}.

Each coordinate contributes equally to the sum. Nothing subtle is happening.

2. Dot Product

s = w^\top x,

then

\nabla_x s = w, \qquad \nabla_w s = x.

This is one of the most reused facts in neural networks. A neuron pre-activation is just a dot product plus a bias term.

3. Affine Map

z = Wx + b,

then the Jacobian of $z$ with respect to $x$ is simply

J_z(x) = W.

That is unsurprising: affine maps are already linear, up to a constant shift. The bias disappears in the derivative because constants do not change under infinitesimal perturbations.

4. Element-Wise Nonlinearity

a = \phi(z),

where $\phi$ acts coordinate-wise, then the Jacobian is diagonal:

J_\phi(z) = \begin{bmatrix} \phi'(z_1) & 0 & \cdots & 0 \\ 0 & \phi'(z_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \phi'(z_n) \end{bmatrix}.

Why diagonal? Because each output coordinate depends only on the matching input coordinate. There is no cross-talk.

Put together, those four pieces cover a surprising amount of deep learning practice.

Primitive	Rule	Memory hook
s = Σᵢ xᵢ	∇x s = 1	The sum listens equally to every coordinate.
s = wᵀx	∇x s = w, ∇w s = x	Differentiate with respect to one side, keep the other.
z = Wx + b	Jz(x) = W	Affine maps already are linear maps.
a = φ(z)	Jφ(z) = diag(φ′(z))	Element-wise operations only touch the diagonal.

The Chain Rules That Power Backprop

Here is the version of the chain rule that shows up most often in neural nets.

Suppose

y = f(x), \qquad L = \ell(y),

where $L$ is a scalar loss. Then

\nabla_x L = J_f(x)^\top \nabla_y L.

This is the backprop pattern in one line. A gradient flowing backward through a vector-valued function gets multiplied by the transpose of the local Jacobian.

Why the transpose? Because the Jacobian maps forward perturbations $\Delta x$ into forward perturbations $\Delta y$ . Gradients live in the dual direction: they tell you how a scalar changes with respect to those perturbations. Backward flow is the transpose action.

For two vector-valued functions,

z = g(x), \qquad y = f(z),

the forward Jacobian chain rule is

J_{f \circ g}(x) = J_f(g(x))\,J_g(x).

So the whole picture is:

forward pass: multiply Jacobians in forward order
backward pass: multiply transposed Jacobians in reverse order

If the function in the middle is element-wise, its Jacobian is diagonal, so the backward step becomes an element-wise multiplication:

\nabla_z L = \nabla_a L \odot \phi'(z),

where $\odot$ is the Hadamard product.

That is why backprop code is full of expressions like grad *= relu_prime(z) or dz = da * sigmoid_prime(z).

The easiest way to make that pattern concrete is to run one tiny network forward and then walk the scalar loss backward through each local map:

Backprop Flow Explorer

This is the scalar-after-vector chain rule as a concrete machine. A two-layer network runs forward, then one scalar loss walks backward through transposed local maps. Step through the reverse pass and keep the shapes straight.

What to notice

Both hidden units stay active, so every local map participates in the backward chain.

0.934

target

0.350

loss

0.171

|∇xL|

0.141

x2×1

[0.900, -1.100]

forward

z = W₁x + b₁ (2×2)

backward

←

z2×1

[1.480, -0.690]

forward

a = σ(z) (2×2)

backward

←

a2×1

[0.815, 0.334]

forward

ŷ = w₂ᵀa + b₂ (1×2)

backward

←

ŷscalar

0.934

forward

L = 1/2 (ŷ - y)²

backward

← (ŷ - y)

Lscalar

0.171

Selected backward step

A scalar loss starts the backward pass with one scalar slope. Everything else is just that number being remapped.

prediction ŷ0.934

scalar slope0.584

∂L/∂ŷ0.584

local map shape

1×1

outgoing shape

1×1

Collapsed gradient

∇xL = W₁ᵀ diag(σ'(z)) w₂ (ŷ - y)

Final input gradient: [0.071, -0.121]

The stronger pull lands on x₂, because that coordinate receives the larger mixed signal after the transpose of W₁.

Hidden sensitivities: σ'(z) = [0.151, 0.222]

One Neuron, End to End

Take the smallest nontrivial neural building block:

z = w^\top x + b, \qquad a = \sigma(z),

with sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}.

Assume a binary label $y \in \{0,1\}$ and binary cross-entropy loss

L(a, y) = -y \log a - (1-y)\log(1-a).

The derivative with respect to the activation is

\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}.

The sigmoid derivative is

\frac{\partial a}{\partial z} = a(1-a).

Multiply them:

\frac{\partial L}{\partial z} = \left( -\frac{y}{a} + \frac{1-y}{1-a} \right)a(1-a) = a - y.

That simplification is worth remembering. It is the binary cousin of the softmax-cross-entropy shortcut.

Once you have $\frac{\partial L}{\partial z}$ , the rest is immediate:

\displaystyle \nabla_w L = \frac{\partial L}{\partial z}\,x = (a-y)x,

\displaystyle \frac{\partial L}{\partial b} = a-y,

\nabla_x L = (a-y)w.

This is both algebraically useful and probabilistically intuitive.

If $y = 1$ but the neuron predicts $a = 0.1$ , then $a-y = -0.9$ . Gradient descent subtracts a negative quantity, pushing $z$ upward and making the positive class more likely.
If $y = 0$ but the neuron predicts $a = 0.9$ , then $a-y = 0.9$ . Gradient descent pushes $z$ downward.

The sign is not arbitrary. It is the correction needed to move the model’s probability toward the observed outcome.

single_neuron.py

import numpy as np

x = np.array([1.5, -2.0, 0.7])
w = np.array([0.4, -0.3, 0.2])
b = -0.1
y = 1.0

z = w @ x + b
a = 1.0 / (1.0 + np.exp(-z))
loss = -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

dz = a - y
dw = dz * x
db = dz

def loss_from_w(w_):
    z_ = w_ @ x + b
    a_ = 1.0 / (1.0 + np.exp(-z_))
    return -(y * np.log(a_) + (1.0 - y) * np.log(1.0 - a_))

eps = 1e-6
dw_fd = np.zeros_like(w)
for i in range(w.size):
    step = np.zeros_like(w)
    step[i] = eps
    dw_fd[i] = (loss_from_w(w + step) - loss_from_w(w - step)) / (2 * eps)

np.testing.assert_allclose(dw, dw_fd, rtol=1e-5, atol=1e-5)

Softmax + Cross-Entropy: The Famous Collapse

For multiclass classification, the output layer is usually

p = \operatorname{softmax}(z), \qquad p_i = \frac{e^{z_i}}{\sum_j e^{z_j}},

followed by cross-entropy with one-hot target vector $y$ :

L(p, y) = -\sum_i y_i \log p_i.

If the true class is $t$ , this becomes

L = -\log p_t.

That sentence already tells you the probabilistic meaning. Cross-entropy is the surprisal of the probability assigned to the event that actually happened. Low probability on the true class means high loss.

The full softmax Jacobian is

J_{\mathrm{softmax}}(z) = \operatorname{diag}(p) - pp^\top.

Component-wise, that means

\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1-p_i) & i=j, \\ -p_i p_j & i \neq j. \end{cases}

The diagonal terms are positive because raising a class’s own logit raises its probability. The off-diagonal terms are negative because probability mass is conserved: if one class goes up, others must come down.

The miracle is what happens after you compose that Jacobian with cross-entropy. The gradient with respect to logits collapses to

\nabla_z L = p - y.

No giant expression. No explicit Jacobian matrix in your implementation. Just predicted probabilities minus target probabilities.

Softmax + Cross-Entropy Explorer

Softmax converts raw scores into probabilities. Cross-entropy then asks a blunt question: how much probability did the model assign to the true class? The famous gradient p - y is the whole story in one line.

What to notice

The true class is underweighted, so the loss sends a sharp corrective signal.

loss

2.968

predicted

otter

p(target)

0.051

margin

0.883

class

otter

logit z3.40

probability p0.935

gradient p - y0.935

class

lynx

logit z0.50

probability p0.051

gradient p - y-0.949

class

falcon

logit z-0.80

probability p0.014

gradient p - y0.014

Read the sign

Negative gradient means gradient descent will raise that logit, because the update is z ← z - η∇_zL. Positive gradient means the optimizer pushes that logit down.

The shortcut

z[3.40, 0.50, -0.80]

p = softmax(z)[0.935, 0.051, 0.014]

y[0, 1, 0]

∇z L = p - y[0.935, -0.949, 0.014]

Softmax Jacobian

Before cross-entropy collapses the expression, the local linear map is diag(p) - ppᵀ. Diagonal terms are positive; off-diagonal terms are negative, because boosting one class steals probability mass from the others.

positive self-sensitivitynegative cross-couplingentropy = 0.276

	otter	lynx	falcon
otter	0.061	-0.048	-0.013
lynx	-0.048	0.049	-0.001
falcon	-0.013	-0.001	0.014

Each row sums to zero because softmax only redistributes probability mass. The heatmap makes that conservation law visible: green lives on the diagonal, red fills the trade-offs.

This formula is easier to remember if you read it probabilistically:

$p$ is what the model currently believes.
$y$ is what reality said.
$p-y$ is the discrepancy.

Gradient descent acts on that discrepancy.

If the model gave too much mass to the wrong classes, those coordinates are positive and get pushed down. If it gave too little mass to the true class, that coordinate is negative and gets pushed up.

One clean derivation starts from

\nabla_p L = -\frac{y}{p},

where the division is element-wise, and then applies the scalar-after-vector chain rule:

\nabla_z L = J_{\mathrm{softmax}}(z)^\top \nabla_p L.

Carrying out the algebra gives the compact result $p-y$ . In practice, most libraries fuse softmax and cross-entropy into one numerically stable operation because that compact form is both cheaper and more stable.

softmax_ce.py

import numpy as np

def softmax(z):
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([2.0, -0.5, 0.2])
y = np.array([1.0, 0.0, 0.0])

def cross_entropy_from_logits(z_):
    p_ = softmax(z_)
    return -np.sum(y * np.log(p_))

p = softmax(z)
grad = p - y

eps = 1e-6
grad_fd = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    grad_fd[i] = (
        cross_entropy_from_logits(z + step)
        - cross_entropy_from_logits(z - step)
    ) / (2 * eps)

np.testing.assert_allclose(grad, grad_fd, rtol=1e-5, atol=1e-5)

A Cheat Sheet I Actually Remember

Local maps

s = Σᵢxᵢ: gradient is all ones.
s = wᵀx: differentiate one side, keep the other.
z = Wx + b: Jacobian with respect to x is W.
a = φ(z): Jacobian is diagonal.

Backprop shortcuts

∇xL = Jf(x)ᵀ ∇yL: scalar loss through a vector map.
∇zL = ∇aL ⊙ φ′(z): element-wise nonlinearity backward pass.
∂L/∂z = a - y: sigmoid + binary cross-entropy.
∇zL = p - y: softmax + multiclass cross-entropy.

If I had to compress the whole page into one sentence, it would be this:

Deep learning derivatives look hard only until you see them as local linear maps plus probability corrections.

That is what a Jacobian is. That is what the chain rule composes. That is what the output-layer gradients are doing.

The One Convention That Matters

A Jacobian Is a Local Matrix

Four Local Maps Worth Memorizing

1. Sum Reduction

2. Dot Product

3. Affine Map

4. Element-Wise Nonlinearity

The Chain Rules That Power Backprop

One Neuron, End to End

Softmax + Cross-Entropy: The Famous Collapse

A Cheat Sheet I Actually Remember

Further Reading