I'm following the guide as outlined at this link: http://neuralnetworksanddeeplearning.com/chap2.html
For the purposes of this question, I've written a basic network with 2 layers: one with 2 neurons and one with a single neuron. For a very basic task, the network will learn to compute an OR logic gate, so the training data is:
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]
For this example, the weights and biases are:
w = [[0.3, 0.4], [0.1]]
b = [[1, 1], [1]]
The feedforward part was pretty easy to implement, so I don't think I need to post that here. The tutorial I've been following summarises the error calculation and the gradient descent algorithm with the following equations:
For each training example $x$, compute the output error $\delta^{x, L}$ where $L =$ Final layer (Layer 1 in this case). $\delta^{x, L} = \nabla_aC_x \circ \sigma'(z^{x, L})$ where $\nabla_aC_x$ is the differential of the cost function (basic MSE) with respect to the Layer 1 activation output, and $\sigma'(z^{x, L})$ is the derivative of the sigmoid function of the Layer 1 output i.e. $\sigma(z^{x, L})(1-\sigma(z^{x, L}))$.
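For a single training example $x$, I'm computing this step roughly like the following (the names and numbers below are just placeholders for the final-layer weighted input, activation and target, not my actual values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1.0 - sigmoid(z))

# Placeholder values for one sample x:
z_L = np.array([[0.8]])   # weighted input to the single Layer 1 neuron
a_L = sigmoid(z_L)        # Layer 1 activation
y = np.array([[1.0]])     # target output for this sample

# delta^{x,L} = (a_L - y) * sigma'(z_L), with nabla_a C_x = (a_L - y) for MSE
delta_L = (a_L - y) * sigmoid_prime(z_L)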
That's all good so far and I can calculate that quite straightforwardly. Now for $l = L-1, L-2, ...$, the error for each previous layer can be calculated as
$\delta^{x, l} = ((w^{l+1})^T \delta^{x, l+1}) \circ \sigma'(z^{x, l})$
Again, this is pretty straightforward to implement.
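Roughly, in the same numpy style (again with placeholder names and values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Placeholder values for one sample x:
w_next = np.array([[0.1, 0.2]])   # weights of layer l+1, shape (1, 2)
delta_next = np.array([[0.05]])   # delta^{x, l+1}, shape (1, 1)
z_l = np.array([[0.7], [1.3]])    # weighted inputs of layer l, shape (2, 1)

# delta^{x,l} = ((w^{l+1})^T delta^{x,l+1}) * sigma'(z^{x,l})
delta_l = np.dot(w_next.T, delta_next) * sigmoid_prime(z_l)   # shape (2, 1)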
Finally, to update the weights (and biases) for $l = L, L-1, ...$, the equations are:
$w^l \rightarrow w^l - \frac{\eta}{m}\sum_x\delta^{x,l}(a^{x, l-1})^T$
$b^l \rightarrow b^l - \frac{\eta}{m}\sum_x\delta^{x,l}$
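Taken at face value, I read those two updates as something like this for a single layer $l$ (eta, m, the current parameters and the accumulated sums below are all placeholders):

import numpy as np

eta = 3.0   # learning rate (placeholder)
m = 4       # number of training examples in the batch

w_l = np.array([[0.3, 0.4]])     # current weights of layer l (placeholder)
b_l = np.array([[1.0]])          # current bias of layer l (placeholder)
sum_delta_aT = np.zeros((1, 2))  # placeholder for the sum over x of delta^{x,l} (a^{x,l-1})^T
sum_delta = np.zeros((1, 1))     # placeholder for the sum over x of delta^{x,l}

# Gradient descent step for this layer:
w_l = w_l - (eta / m) * sum_delta_aT
b_l = b_l - (eta / m) * sum_delta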
What I don't understand is how this works with vectors of different numbers of elements (I think the lack of vector notation here confuses me).
For example, Layer 1 has one neuron, so $\delta^{x, 1}$ will be a scalar value since it only outputs one value. However, $a^{x, 0}$ is a vector with two elements since layer 0 has two neurons. Which means that $\delta^{x, l}(a^{x, l-1})^T$ will be a vector even if I sum over all training samples $x$. What am I supposed to do here? Am I just supposed to sum the components of the vector as well?
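To make the shapes concrete for my network: $\delta^{x,1}$ is a single number and $a^{x,0}$ has two components, so $\delta^{x,1}(a^{x,0})^T = (\delta^{x,1}a^{x,0}_1,\ \delta^{x,1}a^{x,0}_2)$, which still has two components after summing over $x$.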
Hopefully my question makes sense; I feel I'm very close to implementing this entirely and I'm just stuck here.
Thank you
[edit] Okay, so I realised that I've been misrepresenting the weights of the neurons and have corrected for that.
sizes = [2, 2, 1]  # 2 inputs, 2 neurons in layer 0, 1 neuron in layer 1
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
which gives the output:
[array([[0.27660583, 1.00106314],
        [0.34017727, 0.74990392]]),
 array([[ 1.095244  , -0.22719165]])]
This means that layer 0 has a weight matrix with shape 2x2, representing the 2 weights on neuron01 and the 2 weights on neuron02.
My understanding then is that $\delta^{x,l}$ has the same shape as the weights array because each weight gets updated independently. That's also fine.
But the biases (according to the link I sourced) have one term per neuron, which means layer 0 will have two bias terms (b00 and b01) and layer 1 has one bias term (b10).
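Following the same list-comprehension style as the weights, I'd store the biases as something like this (a sketch based on the conventions in the linked guide):

import numpy as np

sizes = [2, 2, 1]   # 2 inputs, 2 neurons in layer 0, 1 neuron in layer 1

# One bias per neuron, stored as a column vector per layer:
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# biases[0] has shape (2, 1) -> b00 and b01 for layer 0
# biases[1] has shape (1, 1) -> b10 for layer 1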
However, to calculate the update for the bias terms, you sum the deltas over $x$, i.e. $\sum_x \delta^{x, l}$; if delta has the same shape as the weight matrix, then there are too many terms to update the bias terms. What have I missed here?
Many thanks