- #1
Zap
- TL;DR Summary
- Deriving the gradient of the cost function ##\nabla_{a}C##.
This is an issue I've been stuck on for about two weeks. No matter how many times I take this derivative, I keep getting the same answer. However, this answer is apparently wrong. Please help me understand why it is incorrect.
To start, I will define an input matrix ##X##, where ##n## is the number of features or independent variables in the neural network model, and ##m## is the number of examples.
$$
X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \dots & x_{1,n} \\
x_{2,1} & x_{2,2} & \dots & x_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m,1} & x_{m,2} & \dots & x_{m,n}
\end{bmatrix}
$$
Next, I will define an output matrix, which is the output of a feedforward neural network after forward propagation of the input matrix ##X##. In other words, this is the model's output ##\hat{Y}##, where ##m## is the number of examples in the input matrix, and ##k## is the number of perceptrons in the output layer.
$$
\hat{Y} = \begin{bmatrix}
\hat{y}_{1,1} & \hat{y}_{1,2} & \dots & \hat{y}_{1,k} \\
\hat{y}_{2,1} & \hat{y}_{2,2} & \dots & \hat{y}_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
\hat{y}_{m,1} & \hat{y}_{m,2} & \dots & \hat{y}_{m,k} \\
\end{bmatrix}
$$
The output matrix can also be written as a matrix of the activations (outputs) of the perceptrons in the output layer, with each element ##\hat{y}_{m,k}## replaced by ##a^{(ℓ)}_{m,k}##, the activation value of the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example, as seen below.
$$
\hat{Y} = \begin{bmatrix}
a^{(ℓ)}_{1,1} & a^{(ℓ)}_{1,2} & \dots & a^{(ℓ)}_{1,k} \\
a^{(ℓ)}_{2,1} & a^{(ℓ)}_{2,2} & \dots & a^{(ℓ)}_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
a^{(ℓ)}_{m,1} & a^{(ℓ)}_{m,2} & \dots & a^{(ℓ)}_{m,k}
\end{bmatrix}
$$
There is also a target matrix ##Y## of the same form as ##\hat{Y}##, where ##m## is the number of examples, and ##k## is the number of dependent variables in the neural network model.
$$
Y = \begin{bmatrix}
y_{1,1} & y_{1,2} & \dots & y_{1,k} \\
y_{2,1} & y_{2,2} & \dots & y_{2,k} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m,1} & y_{m,2} & \dots & y_{m,k} \\
\end{bmatrix}
$$
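To keep the shapes straight, here is a minimal NumPy sketch of the three matrices defined above; the sizes ##m##, ##n##, ##k## and the random values are placeholders, not part of the original setup.
```python
import numpy as np

# Placeholder sizes: m examples, n input features, k output perceptrons.
m, n, k = 4, 3, 2

X = np.random.rand(m, n)      # input matrix, one example per row
Y = np.random.rand(m, k)      # target matrix, same layout as Y_hat
Y_hat = np.random.rand(m, k)  # model output (activations of the output layer)

print(X.shape, Y.shape, Y_hat.shape)  # (4, 3) (4, 2) (4, 2)
```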
I will now define a cost function ##C(Y;\hat{Y})##, which uses the elements in the target matrix ##Y## as constant parameters and is a function of the output matrix ##\hat{Y}##.
$$C(Y;\hat{Y})$$
Let's now focus on the cost function's argument, the output matrix ##\hat{Y}##, and make it more explicit that the cost function is a function of every element of ##\hat{Y}##, as shown below.
$$C(\hat{y}_{1,1},\dots,\hat{y}_{1,k},\dots,\hat{y}_{m,1},\dots,\hat{y}_{m,k})$$
$$C(a^{(ℓ)}_{1,1},\dots,a^{(ℓ)}_{1,k},\dots,a^{(ℓ)}_{m,1},\dots,a^{(ℓ)}_{m,k})$$
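For concreteness (the derivation does not fix a particular cost), one common choice is the mean squared error over all examples and output perceptrons, which makes the dependence on every element explicit; this specific ##C## is an assumption used only for illustration.
$$C=\frac{1}{2m}\sum_{h=1}^{m}\sum_{q=1}^{k}\left(a^{(ℓ)}_{h,q}-y_{h,q}\right)^{2},\qquad \frac{\partial C}{\partial a^{(ℓ)}_{h,q}}=\frac{1}{m}\left(a^{(ℓ)}_{h,q}-y_{h,q}\right)$$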
Let's now define ##a^{(ℓ)}_{m,k}##. This is nothing but the value of the activation function ##f## given a weighted input ##z^{(ℓ)}_{m,k}## at the ##k^{th}## perceptron in the output layer ##ℓ## for the ##m^{th}## example.
$$a^{(ℓ)}_{m,k}=f(z^{(ℓ)}_{m,k})$$
Let's now define ##z^{(ℓ)}_{m,k}##, the weighted input, where ##a^{(ℓ-1)}_{m,i}## is the activation value of the ##i^{th}## perceptron in the hidden layer ##ℓ-1## for the ##m^{th}## example, ##j## is the number of perceptrons in the hidden layer ##ℓ-1##, and ##\omega^{(ℓ)}_{i,k}## is the ##i^{th}## weight of the ##k^{th}## perceptron in the output layer ##ℓ##.
$$z^{(ℓ)}_{m,k}=\sum_{i=1}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}$$
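In matrix form, this weighted sum for all examples and all output perceptrons at once is a single matrix product. Below is a minimal sketch, assuming no bias term and a sigmoid activation (both assumptions for illustration; the variable names are placeholders).
```python
import numpy as np

m, j, k = 4, 5, 2              # examples, hidden perceptrons, output perceptrons
A_prev = np.random.rand(m, j)  # activations of the hidden layer (l-1), one row per example
W = np.random.rand(j, k)       # W[i, q] is the i-th weight of the q-th output perceptron

Z = A_prev @ W                 # Z[h, q] = sum_i A_prev[h, i] * W[i, q], shape (m, k)
A = 1.0 / (1.0 + np.exp(-Z))   # assumed activation f: the logistic sigmoid
```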
So, we want to find the value of the derivative of the cost function with respect to a weight ##\omega^{(ℓ)}_{j,k}##, which is the ##j^{th}## weight of the ##k^{th}## perceptron in the output layer ##ℓ##, denoted below.
$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=?$$
So, the first thing we can do is treat every activation whose second subscript is not ##k## as a constant, since ##\omega^{(ℓ)}_{j,k}## only feeds into the ##k^{th}## perceptron in the output layer ##ℓ##. What remains is the ##k^{th}## column of activations across all ##m## examples:
$$C(a^{(ℓ)}_{1,k},\dots,a^{(ℓ)}_{m,k})$$
Next, we can note that ##a^{(ℓ)}_{m,k}=f(z^{(ℓ)}_{m,k})## and ##z^{(ℓ)}_{m,k}=\sum_{i=1}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}##, so that
$$a^{(ℓ)}_{m,k}=f\left(\sum_{i=1}^{j}a^{(ℓ-1)}_{m,i}\omega^{(ℓ)}_{i,k}\right)$$
and we can write each ##a^{(ℓ)}_{m,k}## as a function of ##\omega^{(ℓ)}_{j,k}##, like ##a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k})##. Therefore
$$C(a^{(ℓ)}_{1,k}(\omega^{(ℓ)}_{j,k}),\dots,a^{(ℓ)}_{m,k}(\omega^{(ℓ)}_{j,k}))$$
and we can invoke the generalized chain rule.
$$\frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial a^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}} $$
Now, applying the chain rule once more, we can separate this value into two parts.
$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} \frac{\partial z^{(ℓ)}_{h,k}}{\partial \omega^{(ℓ)}_{j,k}} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} $$
$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = \frac{\partial a^{(ℓ)}_{h,k}}{\partial z^{(ℓ)}_{h,k}} \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$
These simplify to
$$(1) \frac{\partial C}{\partial \omega^{(ℓ)}_{j,k}}=\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} $$
$$(2) \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} = f^{'}(z^{(ℓ)}_{h,k}) \frac{\partial C}{\partial a^{(ℓ)}_{h,k}}$$
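Results ##(1)## and ##(2)## can be sanity-checked numerically. The sketch below assumes a sigmoid activation and the mean-squared-error cost used for illustration above (neither is fixed by the derivation itself) and compares the analytic ##\partial C/\partial \omega^{(ℓ)}_{j,k}## with a finite-difference estimate.
```python
import numpy as np

def f(z):                # assumed activation: logistic sigmoid
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def cost(A, Y):          # assumed cost: mean squared error, C = (1/2m) * sum (a - y)^2
    return 0.5 * np.sum((A - Y) ** 2) / A.shape[0]

rng = np.random.default_rng(0)
m, j, k = 4, 5, 3
A_prev = rng.random((m, j))   # hidden-layer activations
W = rng.random((j, k))        # output-layer weights
Y = rng.random((m, k))        # targets

Z = A_prev @ W
A = f(Z)

dA = (A - Y) / m              # dC/da for the assumed MSE cost
dZ = f_prime(Z) * dA          # result (2), applied elementwise
grad_W = A_prev.T @ dZ        # result (1): the sum over examples h

# Finite-difference check of one entry (0-based indices below).
eps = 1e-6
W_pert = W.copy()
W_pert[2, 1] += eps
numeric = (cost(f(A_prev @ W_pert), Y) - cost(A, Y)) / eps
print(grad_W[2, 1], numeric)  # the two values should agree to about 1e-6
```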
Now, let us apply this result to all of the weights in the output layer ##ℓ## for all ##k## perceptrons, defined as ##\Omega^{(ℓ)}## below, and try to find the gradient ##\nabla^{(ℓ)}_{\Omega}C##.
$$
\Omega^{(ℓ)}=
\begin{bmatrix}
\omega^{(ℓ)}_{1,1} & \dots & \omega^{(ℓ)}_{1,k} \\
\vdots & \ddots & \vdots \\
\omega^{(ℓ)}_{j,1} & \dots & \omega^{(ℓ)}_{j,k}
\end{bmatrix}
$$
$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\frac{\partial }{\partial \omega^{(ℓ)}_{1,1}} & \dots & \frac{\partial }{\partial \omega^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial }{\partial \omega^{(ℓ)}_{j,1}} & \dots & \frac{\partial }{\partial \omega^{(ℓ)}_{j,k}}
\end{bmatrix}
$$
First, we simply plug result ##(1)## into this matrix.
$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
\sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,1} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}} \\
\vdots & \ddots & \vdots \\
\sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,1}} & \dots & \sum_{h=1}^{m} a^{(ℓ-1)}_{h,j} \frac{\partial C}{\partial z^{(ℓ)}_{h,k}}
\end{bmatrix}
$$
This is nothing but the matrix multiplication operation
$$
\nabla^{(ℓ)}_{\Omega}C=
\begin{bmatrix}
a^{(ℓ-1)}_{1,1} & \dots & a^{(ℓ-1)}_{m,1} \\
\vdots & \ddots & \vdots \\
a^{(ℓ-1)}_{1,j} & \dots & a^{(ℓ-1)}_{m,j} \\
\end{bmatrix}
\begin{bmatrix}
\frac{\partial C}{\partial z^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial z^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial z^{(ℓ)}_{m,k}} \\
\end{bmatrix}
$$
This simplifies to the expression below, where ##L^{(ℓ-1)}## is the matrix of inputs to the output layer ##ℓ## (the activations of layer ##ℓ-1##) and ##\nabla^{(ℓ)}_{z}C## is the gradient of the cost function with respect to the weighted inputs to the output layer.
$$
\nabla^{(ℓ)}_{\Omega}C=(L^{(ℓ-1)})^{T}\nabla^{(ℓ)}_{z}C
$$
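In NumPy terms, this transpose-and-multiply is a single line, and it is equivalent to summing one outer product per example, which is where the sum over ##h## ends up. A self-contained sketch with placeholder values (##dZ## here simply stands in for ##\nabla^{(ℓ)}_{z}C##):
```python
import numpy as np

rng = np.random.default_rng(0)
m, j, k = 4, 5, 3
A_prev = rng.random((m, j))  # L^(l-1): hidden-layer activations, one row per example
dZ = rng.random((m, k))      # stand-in for the gradient with respect to the weighted inputs

grad_W = A_prev.T @ dZ       # (j, m) @ (m, k) -> (j, k), matching Omega^(l)

# Equivalent view: a sum of per-example outer products, i.e. every example h
# contributes its own full (j, k) gradient term to the total.
grad_W_sum = sum(np.outer(A_prev[h], dZ[h]) for h in range(m))
assert np.allclose(grad_W, grad_W_sum)
```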
Now, let's just focus on ##\nabla^{(ℓ)}_{z}C##. We can expand ##\nabla^{(ℓ)}_{z}C## by plugging in result ##(2)##.
$$
\nabla^{(ℓ)}_{z}C= \begin{bmatrix}
f^{'}(z^{(ℓ)}_{1,1}) \frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & f^{'}(z^{(ℓ)}_{1,k}) \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
f^{'}(z^{(ℓ)}_{m,1}) \frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & f^{'}(z^{(ℓ)}_{m,k}) \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}
\end{bmatrix}
$$
We can split this into the Hadamard (elementwise) product of two matrices.
$$
\nabla^{(ℓ)}_{z}C= \begin{bmatrix}
f^{'}(z^{(ℓ)}_{1,1}) & \dots & f^{'}(z^{(ℓ)}_{1,k}) \\
\vdots & \ddots & \vdots \\
f^{'}(z^{(ℓ)}_{m,1}) & \dots & f^{'}(z^{(ℓ)}_{m,k})
\end{bmatrix}
⊙
\begin{bmatrix}
\frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}
\end{bmatrix}
$$
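Elementwise, the Hadamard product is just the * operator on NumPy arrays. A minimal self-contained sketch (the sigmoid derivative and the random stand-in for ##\nabla^{(ℓ)}_{a}C## are assumptions for illustration):
```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 4, 3
Z = rng.standard_normal((m, k))   # weighted inputs to the output layer
dA = rng.standard_normal((m, k))  # stand-in for dC/da (its entries depend on the chosen cost)

def f_prime(z):                   # derivative of an assumed logistic-sigmoid activation
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

dZ = f_prime(Z) * dA              # Hadamard product: elementwise multiplication
print(dZ.shape)                   # (4, 3) -- one row per example, one column per perceptron
```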
And, lo and behold, the gradient ##\nabla^{(ℓ)}_{a}C## is a matrix, as shown.
$$
\nabla^{(ℓ)}_{a}C=
\begin{bmatrix}
\frac{\partial C}{\partial a^{(ℓ)}_{1,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{1,k}} \\
\vdots & \ddots & \vdots \\
\frac{\partial C}{\partial a^{(ℓ)}_{m,1}} & \dots & \frac{\partial C}{\partial a^{(ℓ)}_{m,k}}
\end{bmatrix}
$$
So, why is this incorrect? In the backpropagation algorithm, ##\nabla^{(ℓ)}_{a}C## is always either treated as a vector or some kind of average of vectors. Why is this? Why isn't ##\nabla^{(ℓ)}_{a}C## a matrix? Where did I go wrong in the math??