This is just matrix multiplication.
It’s good to understand how to derive gradients for your neural network. It gets a little hairy when you have matrix-matrix multiplication, such as $WX + b$. When I was reviewing Backpropagation in CS231n, they handwaved over this derivation of the gradient of the loss $L$ with respect to the weight matrix $W$:
\[\frac{\partial L}{\partial W} = \frac{\partial L}{\partial D} \frac{\partial D}{\partial W} \\ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial D} X^{T}\]Let:
- $X$: $(m, n)$ input matrix with $m$ features and $n$ samples (each column is one sample)
- $W$: $(H, m)$ weight matrix with $H$ neurons (one row of weights per neuron)
- $D = WX$: an $(H, n)$ matrix
- $L = f(D)$: a scalar, where $f$ is an arbitrary loss function
Note that others may use $D = XW$, where $X$’s rows are samples and its columns are feature dimensions. That’s fine: you can follow the same derivation with the indices swapped and end up with the analogous result, $\frac{\partial L}{\partial W} = X^{T}\frac{\partial L}{\partial D}$.
The canonical neuron computes $\mathrm{ReLU}(D + b)$, but to make things simpler we’ll ignore the nonlinearity and the bias and say that $L$ takes in $D$ directly instead of $\mathrm{ReLU}(D + b)$. We want to find the gradient of $L$ with respect to $W$ so we can do gradient descent.
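To make the shapes concrete, here is a minimal NumPy sketch of this setup. The sizes and the squared-error choice of $f$ are arbitrary stand-ins, not anything specific to CS231n:

```python
import numpy as np

m, n, H = 4, 3, 5            # feature dimension, number of samples, number of neurons

X = np.random.randn(m, n)    # (m, n): each column is one sample
W = np.random.randn(H, m)    # (H, m): one row of weights per neuron
D = W @ X                    # (H, n)

L = 0.5 * np.sum(D ** 2)     # a scalar loss f(D); squared error picked only for concreteness
dL_dD = D                    # for this particular f, dL/dD = D, with shape (H, n)

print(X.shape, W.shape, D.shape)   # (4, 3) (5, 4) (5, 3)
```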
We want to find $\frac{\partial L}{\partial W}$, so let’s start by looking at a single weight $W_{dc}$. That way we can reason about the gradient of $L$ for one weight and then generalize to all of $W$.
\[\frac{\partial L}{\partial W_{dc}}=\sum_{i,j} \frac{\partial L}{\partial D_{ij}}\frac{\partial D_{ij}}{\partial W_{dc}}\]Let’s look more closely at the partial of $D_{ij}$ with respect to $W_{dc}$. We know that $\frac{\partial D_{ij}}{\partial W_{dc}} = 0$ whenever $i\neq d$, because $D_{ij}$ is the dot product of row $i$ of $W$ and column $j$ of $X$, and $W_{dc}$ lives in row $d$, so it never appears in that dot product. This means the summation simplifies to only the terms where $\frac{\partial D_{ij}}{\partial W_{dc}} \neq 0$, i.e. where $i = d$.
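You can see this sparsity numerically: bumping a single weight $W_{dc}$ only moves row $d$ of $D$. A small self-contained check (the particular $d$, $c$, and step size here are arbitrary):

```python
import numpy as np

m, n, H = 4, 3, 5
X = np.random.randn(m, n)
W = np.random.randn(H, m)
D = W @ X

# Bump a single weight W[d, c] and see which rows of D move.
d, c, eps = 2, 1, 1e-6
W_bumped = W.copy()
W_bumped[d, c] += eps

row_changes = np.abs(W_bumped @ X - D).sum(axis=1)
print(np.flatnonzero(row_changes))   # [2] -- only row d is affected
```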
\[\sum_{i,j} \frac{\partial L}{\partial D_{ij}}\frac{\partial D_{ij}}{\partial W_{dc}}=\sum_j \frac{\partial L}{\partial D_{dj}}\frac{\partial D_{dj}}{\partial W_{dc}}\]Finally, what is $\frac{\partial D_{dj}}{\partial W_{dc}}$?
\[D_{dj}=\sum_{k=1}^m W_{dk}X_{kj} \\ \frac{\partial D_{dj}}{\partial W_{dc}}=\frac{\partial}{\partial W_{dc}}\sum_{k=1}^m W_{dk}X_{kj} = \sum_{k=1}^m \frac{\partial}{\partial W_{dc}} W_{dk}X_{kj} \\ \therefore \frac{\partial D_{dj}}{\partial W_{dc}}=X_{cj}\]Only the $k = c$ term depends on $W_{dc}$, so every other term differentiates to zero and we’re left with $X_{cj}$. Putting it all together, we have:
\[\frac{\partial L}{\partial W_{dc}}=\sum_j \frac{\partial L}{\partial D_{dj}}X_{cj}\]Now how can we simplify this? One quick way is to see that the sum over $j$ is a dot product between row $d$ of $\frac{\partial L}{\partial D}$ and column $c$ of $X^T$ once we rewrite $X_{cj}$ as $X^T_{jc}$.
\[\frac{\partial L}{\partial W_{dc}}=\sum_j \frac{\partial L}{\partial D_{dj}}X^T_{jc}\]This holds for every weight in $W$, so we can collect all the entries into the full matrix equation:
\[\frac{\partial L}{\partial W} = \frac{\partial L}{\partial D} X^{T}\]
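To convince yourself the formula is right, you can compare it against a numerical gradient computed by finite differences. This is a standard gradient check; the squared-error $f$ and the sizes are again arbitrary stand-ins:

```python
import numpy as np

m, n, H = 4, 3, 5
X = np.random.randn(m, n)
W = np.random.randn(H, m)

def loss(W):
    # An arbitrary scalar f(D) with D = W @ X; squared error keeps dL/dD simple.
    return 0.5 * np.sum((W @ X) ** 2)

dL_dD = W @ X                 # for this f, dL/dD = D
analytic = dL_dD @ X.T        # the derived result: (H, m), same shape as W

# Numerical gradient: finite difference on every entry of W.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(H):
    for j in range(m):
        W_bumped = W.copy()
        W_bumped[i, j] += eps
        numeric[i, j] = (loss(W_bumped) - loss(W)) / eps

print(np.max(np.abs(analytic - numeric)))   # small, e.g. ~1e-5 or below
```

If the two disagree by much more than the finite-difference error, something is off in either the derivation or the code.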