
Gradient Descent and Backpropagation Walkthrough

peter.washington · Oct 05, 2021

Stanford applied ML PhD



Let’s train the following neural network architecture with initial edge weights all set to 1:

[Figure: the neural network architecture, with every edge weight initialized to 1]

While the weights are all initialized to 1 here, this is not always the case: neural networks are often initialized with random weights. There are other strategies for initializing the weights, but we will not cover those here.
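As a quick illustration, here is a minimal sketch in NumPy contrasting constant initialization (as used in this walkthrough) with random initialization; the 3-2-1 layer sizes are an assumption made purely for illustration:

```python
import numpy as np

# Assumed layer sizes for illustration only: 3 inputs, 2 hidden nodes, 1 output.
n_in, n_hidden, n_out = 3, 2, 1

# Constant initialization, as in this walkthrough: every edge weight starts at 1.
W1_const = np.ones((n_in, n_hidden))
W2_const = np.ones((n_hidden, n_out))

# Random initialization, the more common choice in practice.
rng = np.random.default_rng(seed=0)
W1_rand = rng.normal(loc=0.0, scale=0.1, size=(n_in, n_hidden))
W2_rand = rng.normal(loc=0.0, scale=0.1, size=(n_hidden, n_out))

print(W1_const)
print(W1_rand)
```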

We will use the following data points to train the network:

[Table: the six training points, each with an input vector x of three numbers and a label y]

The rule is that the y value is 1 if the third number in the x vector is positive and 0 if it is negative. Of course, the neural network does not know this in advance. We will see if it can learn this through the gradient descent process.
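In code, the labeling rule is simply a check on the sign of the third entry of x; a tiny (hypothetical) helper:

```python
def label(x):
    # y is 1 if the third number of the x vector is positive, 0 if it is negative.
    return 1 if x[2] > 0 else 0

print(label([1, 2, 3]))   # 1
print(label([1, 2, -3]))  # 0
```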

The first thing we need to do is calculate the neural network’s output for each training point. We will work this out for the first input [1, 2, 3] and provide the answers for the rest (you should work them out yourself to verify your understanding of how a neural network makes its predictions):

[Figure: the forward-pass calculation of each node’s value for the input [1, 2, 3]]

Remember that we are setting the bias (the value of b in y = mx + b) to 0 for simplicity in this example. We apply the ReLU activation to each node, but since each node’s value is positive anyway, this activation does not change any of the values. Since we are doing binary classification, we take the sigmoid activation of the output node:

F = sigmoid(z) = 1 / (1 + e^(−z)), applied to the weighted sum z entering the output node, which gives F ≈ 0.999
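Here is a minimal sketch of this forward pass in NumPy. The exact shape of the network in the figure is not reproduced here, so treat the 3-2-1 architecture, the all-ones weights, and the zero biases as illustrative assumptions:

```python
import numpy as np

def relu(z):
    # ReLU: negative values become 0, positive values pass through unchanged.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Assumed 3-2-1 architecture with every edge weight set to 1 and biases set to 0.
W1 = np.ones((3, 2))  # input layer -> hidden layer
W2 = np.ones((2, 1))  # hidden layer -> output node

x = np.array([1.0, 2.0, 3.0])
hidden = relu(x @ W1)            # ReLU leaves these positive values unchanged
y_hat = sigmoid(hidden @ W2)[0]  # sigmoid of the output node is the prediction
print(hidden, y_hat)
```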

Here is the table for the network’s predictions for all 6 training points:

[Table: the six training points with an added column for the network’s prediction y_hat]

Since we are performing binary classification, we will use binary cross entropy loss:

Cross Entropy Loss = −[y_true × log(y_hat) + (1 − y_true) × log(1 − y_hat)]

The total loss is the average of each individual cross entropy calculation. We add a column for the cross entropy calculations below:

[Table: the six training points with an added column for each point’s cross entropy loss]

The final cross entropy is the average of each individual cross entropy value:

Total Cross Entropy Loss = (CE_1 + CE_2 + CE_3 + CE_4 + CE_5 + CE_6) / 6, where CE_i is the cross entropy of the i-th training point
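To check this kind of arithmetic, the per-point cross entropies and their average can be computed directly; a minimal NumPy sketch (the labels and predictions below are made-up placeholders, not the values from the table above):

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat):
    # Per-point loss: -[y_true * log(y_hat) + (1 - y_true) * log(1 - y_hat)]
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# Placeholder labels and predictions for illustration only.
y_true = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.999, 0.999, 0.999, 0.999, 0.999, 0.999])

per_point = binary_cross_entropy(y_true, y_hat)
total = per_point.mean()  # the total loss is the average over the six points
print(per_point, total)
```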

Now, we can start the backpropagation. Let’s assign a name to each node and edge to help with understanding the calculations:

[Figure: the neural network with a letter assigned to each node (the output node is F, and node E feeds into it) and each edge named by its two endpoint nodes (e.g., edge EF)]

Let’s start by updating the weight for edge EF, which is currently 1. Using standard gradient descent, we know that we update edge EF with the following formula:

EF_new = EF_old − (learning rate) × ∂(Cross Entropy Loss) / ∂EF

We will use a learning rate of 0.1 for our example. We need to calculate the derivative of the cross entropy loss with respect to edge EF. Using the chain rule, we know that:

∂(Cross Entropy Loss) / ∂EF = [∂(Cross Entropy Loss) / ∂F] × [∂F / ∂EF]

To calculate ∂(Cross Entropy Loss) / ∂F, let’s first write the formula for cross entropy loss:

Cross Entropy Loss = −[y_true × log(y_hat) + (1 − y_true) × log(1 − y_hat)]

In our case, the output F is the prediction y_hat, so we can rewrite the cross entropy loss in terms of the neural network as:

Cross Entropy Loss = −[y_true × log(F) + (1 − y_true) × log(1 − F)]

To calculate ∂(Cross Entropy Loss) / ∂F, we apply some derivative rules to get the result. Again, don’t worry if you haven’t taken Calculus; just know that it is possible, using Calculus rules, to calculate the derivative of a function. Using these rules, we get:

∂(Cross Entropy Loss) / ∂F = −y_true / F + (1 − y_true) / (1 − F)
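For readers who do want to see the intermediate step, the result follows from the fact that the derivative of log(x) is 1/x, applied to each of the two terms separately:

∂/∂F [−y_true × log(F)] = −y_true / F

∂/∂F [−(1 − y_true) × log(1 − F)] = (1 − y_true) / (1 − F)

Adding the two pieces gives the expression above.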

To review where we are right now, let’s remember that according to gradient descent, we update edge EF as follows:

EF_new = EF_old − (learning rate) × ∂(Cross Entropy Loss) / ∂EF

We need to calculate ∂(Cross Entropy Loss) / ∂EF to apply this formula, and this derivative can be broken apart into:

∂(Cross Entropy Loss) / ∂EF = [∂(Cross Entropy Loss) / ∂F] × [∂F / ∂EF]

We found the formula for ∂(Cross Entropy Loss) / ∂F above:

∂(Cross Entropy Loss) / ∂F = −y_true / F + (1 − y_true) / (1 − F)

Now, all that’s left is to calculate ∂F / ∂EF. To do this, we can write the formula for F as:

F = EF × E

We see this from the picture of the neural network: the value of node F is the value of edge weight EF times the value of node E. Using Calculus rules, we can calculate the derivative of F with respect to EF as:

∂F / ∂EF = E

Now, we have the final formula for the derivative:

∂(Cross Entropy Loss) / ∂EF = [−y_true / F + (1 − y_true) / (1 − F)] × E

Therefore, our formula for updating edge EF can be filled in:

EF_new = EF_old − (learning rate) × [−y_true / F + (1 − y_true) / (1 − F)] × E
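Putting the pieces together, here is a minimal sketch of this single-weight update in Python, following the formulas above; the function name and arguments are just for illustration:

```python
def update_edge_EF(EF_old, E, F, y_true, learning_rate=0.1):
    # dLoss/dF for binary cross entropy, as derived above.
    dloss_dF = -y_true / F + (1 - y_true) / (1 - F)
    # dF/dEF = E, since this walkthrough treats F as EF times E.
    dF_dEF = E
    # Chain rule: dLoss/dEF = dLoss/dF * dF/dEF.
    dloss_dEF = dloss_dF * dF_dEF
    # Gradient descent step on the single edge weight EF.
    return EF_old - learning_rate * dloss_dEF
```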

Recall that we are using a pre-specified learning rate of 0.1.

Going back to our neural network, into which we fed the vector [1, 2, 3], we got the following values for the nodes:

[Figure: the node values from the forward pass on the input [1, 2, 3], with E = 6 and F = 0.999]

We see above that the current value of edge EF is 1, the current value of E is 6, and the current value of F is 0.999. We also know from our data table that y_true is 0, and we have pre-specified a learning rate of 0.1. We can plug all of these values in to calculate the updated edge weight EF:

EF_new = 1 − 0.1 × [−(0 / 0.999) + (1 − 0) / (1 − 0.999)] × 6 = 1 − 0.1 × 1000 × 6 = −599

That’s a very different weight!
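Plugging the same numbers into the sketch above reproduces this result (up to floating-point rounding):

```python
new_EF = update_edge_EF(EF_old=1.0, E=6.0, F=0.999, y_true=0.0, learning_rate=0.1)
print(new_EF)  # approximately -599
```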

We do this for all other edges in the network, moving from the rightmost edges towards the leftmost edges. We use the backpropagation formulae to calculate the needed derivatives as we move towards the front of the network.

After updating all of the edges in the same way that edge EF was updated, we will have new model weights for the whole neural network. We have now completed one iteration of machine learning. We rinse and repeat, cycling between moving forward through the network to calculate the value of each node and backpropagating through the network to calculate the derivatives needed to update each edge weight.
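For a bird’s-eye view of this rinse-and-repeat cycle, here is a compact, runnable NumPy sketch of the whole loop on data generated by the same rule (y = 1 when the third entry of x is positive). The architecture, data points, and random initialization are illustrative assumptions, and the backward pass uses the standard combined sigmoid-and-cross-entropy gradient (y_hat − y_true) at the output rather than the simplified hand calculation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative data following the same rule: y = 1 if the third entry is positive.
X = np.array([[1, 2, 3], [4, 5, -6], [-1, 0.5, 2],
              [2, 2, -1], [0.5, 1, 1], [3, 1, -2]], dtype=float)
y = (X[:, 2] > 0).astype(float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 2))  # input -> hidden (2 hidden nodes assumed)
W2 = rng.normal(scale=0.5, size=(2, 1))  # hidden -> output
lr = 0.1

for _ in range(1000):                    # repeated iterations of learning
    for x_i, y_i in zip(X, y):
        # Forward pass: compute every node value from left to right.
        h_pre = x_i @ W1
        h = np.maximum(0.0, h_pre)               # ReLU on the hidden nodes
        y_hat = sigmoid(h @ W2)[0]               # sigmoid on the output node

        # Backward pass: compute derivatives from right to left.
        d_out = y_hat - y_i                      # dLoss/d(pre-sigmoid output)
        grad_W2 = h[:, None] * d_out             # gradients for hidden -> output edges
        d_hidden = (W2[:, 0] * d_out) * (h_pre > 0)  # ReLU passes gradient only where active
        grad_W1 = np.outer(x_i, d_hidden)        # gradients for input -> hidden edges

        # Gradient descent update for every edge weight.
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1

# Predictions after training; compare against the labels y.
print(np.round(sigmoid(np.maximum(0.0, X @ W1) @ W2)[:, 0], 3), y)
```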

And that’s how gradient descent works for neural networks!

 

Learn and practice this concept here: 

https://mlpro.io/problems/exploding-gradients/