Matrix Calculus for Deep Learning, Without the Fog
The formulas I keep re-learning, rewritten as a visual memory palace.
TL;DR
Treat a Jacobian as the local matrix of a nonlinear map. Memorize a small toolbox: sum, dot product, affine map, and element-wise nonlinearity. Then backprop is just local linear maps composed together, with two famous simplifications to remember: binary cross-entropy gives $a - y$, and softmax-cross-entropy gives $p - y$.
On this page
Every few months I need the same deep learning derivatives again. Not all of them. The same small cluster: affine maps, element-wise activations, Jacobians, chain rules, and the softmax-cross-entropy shortcut. I keep ending up back at Parr and Howard’s excellent article1. This post is my own rewrite of a large chunk of that material, organized the way I wish my notes looked when I am half-awake and trying to remember why a gradient has the shape it has.
The thread running through all of it is simple:
- Geometry: a Jacobian is the best local linear map near the current point.
- Algebra: backprop is repeated composition of those local maps.
- Probability: the famous output-layer gradients are just “predicted distribution minus what reality said.”
Geometry
Columns tell you where the basis vectors go. That is true for ordinary matrices and for Jacobians viewed locally.
Algebra
Once you keep shapes straight, the chain rule stops feeling magical. It becomes matrix multiplication with a few diagonals.
Probability
Cross-entropy gradients are not random symbols. They are the correction that moves probability mass toward the event that actually happened.
The One Convention That Matters
Convention
I will use column vectors. For a scalar loss , the gradient has the same shape as . For a vector-valued function , the Jacobian is the matrix
If your favorite notes use row-vector gradients, transpose everything consistently and the substance does not change.
With that convention, there are only a few cases to remember:
| mapping | object you get | shape |
|---|---|---|
| ordinary derivative | scalar | |
| gradient | ||
| Jacobian |
Most confusion in matrix calculus is not real math confusion. It is convention confusion. Once you settle the orientation, the rest is bookkeeping.
A Jacobian Is a Local Matrix
The Jacobian is not a bag of partial derivatives. It is the matrix that best describes how a nonlinear function behaves near the current point:
That is the entire geometric story. The Jacobian is the local linear approximation. If you zoom in far enough, a smooth nonlinear function looks linear, and the Jacobian is the matrix of that local linear map.
The columns of are especially useful to keep in mind. The -th column tells you what happens to the -th basis direction . If you know where the basis goes, you know where any vector goes, because any vector is a weighted sum of basis vectors.
The cleanest way to read a Jacobian is as a local linear map. Its columns tell you where the basis vectors go. Once you know that, the output of the map is just a weighted mix of those columns.
What to notice
Columns tell the whole story: one column tilts right, the other lifts upward.
1.88
Orientation is preserved.
Input Space
Your vector x and the standard basis.
Output Space
auto-fitThe same map after the grid and basis have been transformed.
Read the columns
Wx = 1.75 · W e₁ +1.10 · W e₂ = [1.16, 2.72]
Controls
Why this matters for deep learning
For a nonlinear function, the Jacobian J_f(x) plays the role of this matrix only near the current point. Backprop works because local linear maps compose cleanly.
That widget is literally the mental model I use for Jacobians. A matrix is a story about basis vectors. A Jacobian is the same story, but only locally.
Finite differences are a good way to keep yourself honest2. If you ever derive a Jacobian on paper and want a quick check, approximate each column numerically:
import numpy as np
def jacobian_fd(f, x, eps=1e-6):
x = np.asarray(x, dtype=float)
y = np.asarray(f(x), dtype=float)
J = np.zeros((y.size, x.size))
for j in range(x.size):
step = np.zeros_like(x)
step[j] = eps
J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
return J
def f(x):
x1, x2 = x
return np.array([
x1 * x2,
np.sin(x1) + x2**2,
])
x = np.array([1.2, -0.7])
J_analytic = np.array([
[x[1], x[0]],
[np.cos(x[0]), 2 * x[1]],
])
np.testing.assert_allclose(
jacobian_fd(f, x),
J_analytic,
rtol=1e-5,
atol=1e-5,
) Four Local Maps Worth Memorizing
Backprop is easier when you stop memorizing giant end-to-end formulas and memorize the local pieces instead.
1. Sum Reduction
If
then the gradient with respect to is just the all-ones vector:
Each coordinate contributes equally to the sum. Nothing subtle is happening.
2. Dot Product
If
then
This is one of the most reused facts in neural networks. A neuron pre-activation is just a dot product plus a bias term.
3. Affine Map
If
then the Jacobian of with respect to is simply
That is unsurprising: affine maps are already linear, up to a constant shift. The bias disappears in the derivative because constants do not change under infinitesimal perturbations.
4. Element-Wise Nonlinearity
If
where acts coordinate-wise, then the Jacobian is diagonal:
Why diagonal? Because each output coordinate depends only on the matching input coordinate. There is no cross-talk.
Put together, those four pieces cover a surprising amount of deep learning practice.
| Primitive | Rule | Memory hook |
|---|---|---|
| s = Σᵢ xᵢ | ∇x s = 1 | The sum listens equally to every coordinate. |
| s = wᵀx | ∇x s = w, ∇w s = x | Differentiate with respect to one side, keep the other. |
| z = Wx + b | Jz(x) = W | Affine maps already are linear maps. |
| a = φ(z) | Jφ(z) = diag(φ′(z)) | Element-wise operations only touch the diagonal. |
The Chain Rules That Power Backprop
Here is the version of the chain rule that shows up most often in neural nets.
Suppose
where is a scalar loss. Then
This is the backprop pattern in one line. A gradient flowing backward through a vector-valued function gets multiplied by the transpose of the local Jacobian.
Why the transpose? Because the Jacobian maps forward perturbations into forward perturbations . Gradients live in the dual direction: they tell you how a scalar changes with respect to those perturbations. Backward flow is the transpose action.
For two vector-valued functions,
the forward Jacobian chain rule is
So the whole picture is:
- forward pass: multiply Jacobians in forward order
- backward pass: multiply transposed Jacobians in reverse order
If the function in the middle is element-wise, its Jacobian is diagonal, so the backward step becomes an element-wise multiplication:
where is the Hadamard product.
That is why backprop code is full of expressions like grad *= relu_prime(z) or dz = da * sigmoid_prime(z).
The easiest way to make that pattern concrete is to run one tiny network forward and then walk the scalar loss backward through each local map:
This is the scalar-after-vector chain rule as a concrete machine. A two-layer network runs forward, then one scalar loss walks backward through transposed local maps. Step through the reverse pass and keep the shapes straight.
What to notice
Both hidden units stay active, so every local map participates in the backward chain.
ŷ
0.934
target
0.350
loss
0.171
|∇xL|
0.141
[0.900, -1.100]
forward
z = W₁x + b₁ (2×2)
backward
←
[1.480, -0.690]
forward
a = σ(z) (2×2)
backward
←
[0.815, 0.334]
forward
ŷ = w₂ᵀa + b₂ (1×2)
backward
←
0.934
forward
L = 1/2 (ŷ - y)²
backward
← (ŷ - y)
0.171
Selected backward step
A scalar loss starts the backward pass with one scalar slope. Everything else is just that number being remapped.
local map shape
1×1
outgoing shape
1×1
Collapsed gradient
∇xL = W₁ᵀ diag(σ'(z)) w₂ (ŷ - y)
Final input gradient: [0.071, -0.121]
The stronger pull lands on x₂, because that coordinate receives the larger mixed signal after the transpose of W₁.
Hidden sensitivities: σ'(z) = [0.151, 0.222]
One Neuron, End to End
Take the smallest nontrivial neural building block:
with sigmoid
Assume a binary label and binary cross-entropy loss
The derivative with respect to the activation is
The sigmoid derivative is
Multiply them:
That simplification is worth remembering. It is the binary cousin of the softmax-cross-entropy shortcut.
Once you have , the rest is immediate:
This is both algebraically useful and probabilistically intuitive.
- If but the neuron predicts , then . Gradient descent subtracts a negative quantity, pushing upward and making the positive class more likely.
- If but the neuron predicts , then . Gradient descent pushes downward.
The sign is not arbitrary. It is the correction needed to move the model’s probability toward the observed outcome.
import numpy as np
x = np.array([1.5, -2.0, 0.7])
w = np.array([0.4, -0.3, 0.2])
b = -0.1
y = 1.0
z = w @ x + b
a = 1.0 / (1.0 + np.exp(-z))
loss = -(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))
dz = a - y
dw = dz * x
db = dz
def loss_from_w(w_):
z_ = w_ @ x + b
a_ = 1.0 / (1.0 + np.exp(-z_))
return -(y * np.log(a_) + (1.0 - y) * np.log(1.0 - a_))
eps = 1e-6
dw_fd = np.zeros_like(w)
for i in range(w.size):
step = np.zeros_like(w)
step[i] = eps
dw_fd[i] = (loss_from_w(w + step) - loss_from_w(w - step)) / (2 * eps)
np.testing.assert_allclose(dw, dw_fd, rtol=1e-5, atol=1e-5) Softmax + Cross-Entropy: The Famous Collapse
For multiclass classification, the output layer is usually
followed by cross-entropy with one-hot target vector :
If the true class is , this becomes
That sentence already tells you the probabilistic meaning. Cross-entropy is the surprisal of the probability assigned to the event that actually happened. Low probability on the true class means high loss.
The full softmax Jacobian is
Component-wise, that means
The diagonal terms are positive because raising a class’s own logit raises its probability. The off-diagonal terms are negative because probability mass is conserved: if one class goes up, others must come down.
The miracle is what happens after you compose that Jacobian with cross-entropy. The gradient with respect to logits collapses to
No giant expression. No explicit Jacobian matrix in your implementation. Just predicted probabilities minus target probabilities.
Softmax converts raw scores into probabilities. Cross-entropy then asks a blunt question: how much probability did the model assign to the true class? The famous gradient p - y is the whole story in one line.
What to notice
The true class is underweighted, so the loss sends a sharp corrective signal.
loss
2.968
predicted
otter
p(target)
0.051
margin
0.883
class
otter
class
lynx
class
falcon
Read the sign
Negative gradient means gradient descent will raise that logit, because the update is z ← z - η∇zL. Positive gradient means the optimizer pushes that logit down.
The shortcut
Softmax Jacobian
Before cross-entropy collapses the expression, the local linear map is diag(p) - ppᵀ. Diagonal terms are positive; off-diagonal terms are negative, because boosting one class steals probability mass from the others.
| otter | lynx | falcon | Σ row | |
|---|---|---|---|---|
| otter | 0.061 | -0.048 | -0.013 | 0.000 |
| lynx | -0.048 | 0.049 | -0.001 | 0.000 |
| falcon | -0.013 | -0.001 | 0.014 | 0.000 |
Each row sums to zero because softmax only redistributes probability mass. The heatmap makes that conservation law visible: green lives on the diagonal, red fills the trade-offs.
This formula is easier to remember if you read it probabilistically:
- is what the model currently believes.
- is what reality said.
- is the discrepancy.
Gradient descent acts on that discrepancy.
If the model gave too much mass to the wrong classes, those coordinates are positive and get pushed down. If it gave too little mass to the true class, that coordinate is negative and gets pushed up.
One clean derivation starts from
where the division is element-wise, and then applies the scalar-after-vector chain rule:
Carrying out the algebra gives the compact result . In practice, most libraries fuse softmax and cross-entropy into one numerically stable operation because that compact form is both cheaper and more stable.
import numpy as np
def softmax(z):
shifted = z - np.max(z)
exp_z = np.exp(shifted)
return exp_z / exp_z.sum()
z = np.array([2.0, -0.5, 0.2])
y = np.array([1.0, 0.0, 0.0])
def cross_entropy_from_logits(z_):
p_ = softmax(z_)
return -np.sum(y * np.log(p_))
p = softmax(z)
grad = p - y
eps = 1e-6
grad_fd = np.zeros_like(z)
for i in range(z.size):
step = np.zeros_like(z)
step[i] = eps
grad_fd[i] = (
cross_entropy_from_logits(z + step)
- cross_entropy_from_logits(z - step)
) / (2 * eps)
np.testing.assert_allclose(grad, grad_fd, rtol=1e-5, atol=1e-5) A Cheat Sheet I Actually Remember
Local maps
- s = Σᵢxᵢ: gradient is all ones.
- s = wᵀx: differentiate one side, keep the other.
- z = Wx + b: Jacobian with respect to x is W.
- a = φ(z): Jacobian is diagonal.
Backprop shortcuts
- ∇xL = Jf(x)ᵀ ∇yL: scalar loss through a vector map.
- ∇zL = ∇aL ⊙ φ′(z): element-wise nonlinearity backward pass.
- ∂L/∂z = a - y: sigmoid + binary cross-entropy.
- ∇zL = p - y: softmax + multiclass cross-entropy.
If I had to compress the whole page into one sentence, it would be this:
Deep learning derivatives look hard only until you see them as local linear maps plus probability corrections.
That is what a Jacobian is. That is what the chain rule composes. That is what the output-layer gradients are doing.
Further Reading
If you want the denser reference version, go straight to Parr and Howard’s original write-up3. If you want a practical habit that pays off immediately, keep finite-difference checks around in small NumPy scratch files. They are fast, they are humbling, and they save a lot of algebraic overconfidence.