
Implementing Gradient Descent Algorithm in Python, bit confused regarding equations


I'm following the guide as outlined at this link: http://neuralnetworksanddeeplearning.com/chap2.html

For the purposes of this question, I've written a basic network with 2 layers, one with 2 neurons and one with 1 neuron. For a very basic task, the network will learn how to compute an OR logic gate, so the training data will be:

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]

And the diagram: [network diagram image]

For this example, the weights and biases are:

w = [[0.3, 0.4], [0.1]]
b = [[1, 1], [1]]

The feedforward part was pretty easy to implement, so I don't think I need to post that here. The tutorial I've been following summarises the error calculation and the gradient descent algorithm with the following equations:

For each training example $x$, compute the output error $\delta^{x, L}$, where $L$ is the final layer (layer 1 in this case): $\delta^{x, L} = \nabla_a C_x \circ \sigma'(z^{x, L})$, where $\nabla_a C_x$ is the derivative of the cost function (basic MSE) with respect to the layer-1 activation output, and $\sigma'(z^{x, L})$ is the derivative of the sigmoid evaluated at the layer-1 weighted input, i.e. $\sigma(z^{x, L})(1-\sigma(z^{x, L}))$.
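For reference, here's roughly how I compute that step (a minimal sketch with made-up values; sigmoid_prime and the variable names are my own):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# One training sample; shapes match my network (one output neuron)
z_L = np.array([[0.9]])   # weighted input to the layer-1 neuron
a_L = sigmoid(z_L)        # its activation
y = np.array([[1.0]])     # target output for this sample

# MSE cost, so grad_a C_x = (a_L - y); "*" is the elementwise (Hadamard) product
delta_L = (a_L - y) * sigmoid_prime(z_L)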

That's all good so far, and I can calculate that quite straightforwardly. Now, for $l = L-1, L-2, \ldots$, the error for each previous layer can be calculated as

$\delta^{x, l} = ((w^{l+1})^T \delta^{x, l+1}) \circ \sigma'(z^{x, l})$

which, again, is pretty straightforward to implement.
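My implementation of that step looks something like this (again just a sketch; the numbers are made up, and w_next is my name for the layer $l+1$ weight matrix):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

w_next = np.array([[1.095244, -0.22719165]])  # layer-1 weights, shape (1, 2)
delta_next = np.array([[0.05]])               # layer-1 error, shape (1, 1)
z_l = np.array([[0.3], [0.4]])                # layer-0 weighted inputs, shape (2, 1)

# (w^{l+1})^T delta^{x,l+1}, then Hadamard product with sigma'(z^{x,l})
delta_l = np.dot(w_next.T, delta_next) * sigmoid_prime(z_l)  # shape (2, 1)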

Finally, to update the weights (and biases), the equations, for $l = L, L-1, \ldots$, are:

$w^l \rightarrow w^l - \frac{\eta}{m}\sum_x\delta^{x,l}(a^{x, l-1})^T$

$b^l \rightarrow b^l - \frac{\eta}{m}\sum_x\delta^{x,l}$
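Translating those updates literally, and treating every delta and activation as a column vector, I'd expect something like this (a sketch; the per-sample values are made up):

import numpy as np

eta = 0.5  # learning rate

# Hypothetical per-sample pairs (delta^{x,l}, a^{x,l-1}) for layer 1
samples = [
    (np.array([[0.05]]), np.array([[0.7], [0.6]])),
    (np.array([[-0.02]]), np.array([[0.5], [0.8]])),
]
m = len(samples)  # number of training samples in the batch

w = np.array([[1.095244, -0.22719165]])  # layer-1 weights, shape (1, 2)
b = np.array([[1.0]])                    # layer-1 bias, shape (1, 1)

nabla_w = sum(np.dot(delta, a.T) for delta, a in samples)  # sum_x delta (a)^T
nabla_b = sum(delta for delta, _ in samples)               # sum_x delta

w = w - (eta / m) * nabla_w
b = b - (eta / m) * nabla_b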

What I don't understand is how this works with vectors that have different numbers of elements (I think the lack of vector notation here confuses me).

For example, layer 1 has one neuron, so $\delta^{x, 1}$ will be a scalar value, since the layer only outputs one value. However, $a^{x, 0}$ is a vector with two elements, since layer 0 has two neurons, which means that $\delta^{x, 1}(a^{x, 0})^T$ will be a vector even if I sum over all training samples $x$. What am I supposed to do here? Am I just supposed to sum the components of the vector as well?
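To make the shape mismatch I mean concrete (a toy check with made-up numbers, storing the values the way I currently do):

import numpy as np

delta_1 = np.array([0.05])      # layer 1 has one neuron, so one error value
a_0 = np.array([0.7, 0.6])      # layer 0 has two neurons, so two activations

print((delta_1 * a_0).shape)    # (2,) -- a vector, not a scalar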

Hopefully my question makes sense; I feel I'm very close to implementing this entirely and I'm just stuck here.

Thank you

[edit] Okay, so I realised that I've been misrepresenting the weights of the neurons and have corrected for that.

weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

Which has the output

[array([[0.27660583, 1.00106314],
        [0.34017727, 0.74990392]]),
 array([[ 1.095244  , -0.22719165]])]

This means that layer 0 has a weight matrix with shape 2x2, representing the 2 weights on neuron01 and the 2 weights on neuron02.
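In other words (a minimal reproduction, assuming sizes = [2, 2, 1] for my network, which matches the shapes above):

import numpy as np

sizes = [2, 2, 1]  # two inputs, two neurons in layer 0, one neuron in layer 1
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
print([w.shape for w in weights])  # [(2, 2), (1, 2)]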

My understanding then is that $\delta^{x,l}$ has the same shape as the weights array, because each weight gets updated independently. That's also fine.

But there is one bias term for each neuron (according to the link I sourced), which means layer 0 will have two bias terms (b00 and b01) and layer 1 will have one bias term (b10).
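For reference, I initialise the biases the way the book's code does, one per neuron (assuming the same sizes = [2, 2, 1] as above):

import numpy as np

sizes = [2, 2, 1]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
print([b.shape for b in biases])  # [(2, 1), (1, 1)]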

However, to calculate the update for the bias terms, you sum the deltas over $x$, i.e. $\sum_x \delta^{x, l}$; if delta has the size of the weight matrix, then there are too many terms to update the bias terms. What have I missed here?

Many thanks

