Let’s train the following neural network architecture with initial edge weights all set to 1:
Setting every weight to 1 is a simplification for this example. In practice, neural networks are usually initialized with random weights. Other initialization schemes exist, but we will not cover them here.
We will use the following data points to train the network:
The rule is that the y value is 1 if the third number in the x vector is positive and 0 if it is negative. Of course, the neural network does not know this in advance. We will see if it can learn this through the gradient descent process.
The first thing we need to do is calculate the neural network’s output for each input. We will work this out for the first input [1, 2, 3] and provide the answers for the rest (you should work them out yourself to verify your understanding of how a neural network makes its predictions):
Remember that we are setting the bias (the value of b in y = mx + b) to 0 for simplicity in this example. We apply ReLU activation to each node, but since every node value is already positive, the activation leaves them unchanged. Since we are doing binary classification, we then apply the sigmoid activation to the output node:
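The two activations mentioned above can be sketched in a few lines of Python. This is only an illustration of the functions themselves, not of the full network; the names relu and sigmoid are chosen here for the example, and the single node summing the three inputs is an assumption made for demonstration.

```python
import math

def relu(z):
    """ReLU: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)

def sigmoid(z):
    """Sigmoid: squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Assumption for illustration: a node with zero bias and all incoming
# weights equal to 1 simply sums its inputs.
node_value = 1 * 1 + 1 * 2 + 1 * 3

print(relu(node_value))  # positive input, so ReLU leaves it unchanged
print(relu(-2.0))        # a negative value would be clamped to 0
print(sigmoid(0.0))      # the sigmoid's midpoint
```

Because ReLU only changes negative values, it has no visible effect in this example, while the sigmoid always maps the output node into a value that can be read as a probability.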
Here is the table for the network’s predictions for all 6 training points:
Since we are performing binary classification, we will use binary cross entropy loss:
The total loss is the average of each individual cross entropy calculation. We add a column for the cross entropy calculations below:
The final cross entropy is the average of each individual cross entropy value:
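As a rough sketch of this calculation, the per-example cross entropy and its average could be computed as follows. The function names and the labels and predictions are hypothetical, chosen for illustration; they are not the values from the table above.

```python
import math

def binary_cross_entropy(y_true, y_hat):
    """Cross entropy for one example; y_hat must lie strictly between 0 and 1."""
    return -(y_true * math.log(y_hat) + (1 - y_true) * math.log(1 - y_hat))

def mean_binary_cross_entropy(y_trues, y_hats):
    """Average the per-example cross entropy values."""
    losses = [binary_cross_entropy(y, p) for y, p in zip(y_trues, y_hats)]
    return sum(losses) / len(losses)

# Hypothetical labels and predictions (assumed for this sketch).
example_labels = [1, 0, 1]
example_predictions = [0.9, 0.2, 0.6]

print(mean_binary_cross_entropy(example_labels, example_predictions))
```

A confident wrong prediction (for example, predicting 0.999 when the true label is 0) produces a much larger loss than a confident correct one, which is what drives the large weight updates later on.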
Now, we can start the backpropagation. Let’s assign a name to each node and edge to help with understanding the calculations:
Let’s start by updating the weight for edge EF, which is currently 1. Using standard gradient descent, we know that we update edge EF with the following formula:
We will use a learning rate of 0.1 for our example. We need to calculate the derivative of the cross entropy loss with respect to edge EF. Using the chain rule, we know that:
To calculate ∂(Cross Entropy Loss) / ∂F, let’s first write the formula for cross entropy loss:
In our case, the output F is the prediction y_hat, so we can rewrite the cross entropy loss in terms of the neural network as:
To calculate ∂(Cross Entropy Loss) / ∂F, we apply some derivative rules to get the result. Again, don’t worry if you haven’t taken Calculus; just know that it is possible, using Calculus rules, to calculate the derivative of a function. Using these rules, we get:
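For readers who would rather check the derivative than derive it, here is a small numerical sanity check. It assumes the standard result for binary cross entropy, ∂Loss/∂F = -(y/F) + (1 - y)/(1 - F), and compares it with a finite-difference estimate. The function names and test points are made up for this sketch.

```python
import math

def bce(y_true, f):
    """Binary cross entropy for a single prediction f."""
    return -(y_true * math.log(f) + (1 - y_true) * math.log(1 - f))

def bce_derivative(y_true, f):
    """Analytic derivative of the loss with respect to the prediction f."""
    return -(y_true / f) + (1 - y_true) / (1 - f)

def finite_difference(y_true, f, h=1e-6):
    """Estimate the same derivative by nudging f slightly in both directions."""
    return (bce(y_true, f + h) - bce(y_true, f - h)) / (2 * h)

# Arbitrary test points (assumed for illustration).
for y_true, f in [(0, 0.3), (1, 0.3), (0, 0.8), (1, 0.8)]:
    print(y_true, f, bce_derivative(y_true, f), finite_difference(y_true, f))
```

The two columns of derivatives should agree to several decimal places, which is a quick way to gain confidence in a formula without working through the calculus.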
To review where we are right now, let’s remember that according to gradient descent, we update edge EF as follows:
We need to calculate ∂(Cross Entropy Loss) / ∂EF to apply this formula, and this derivative can be broken apart into:
We found the formula for ∂(Cross Entropy Loss) / ∂F above:
Now, all that’s left is to calculate ∂F / ∂EF. To do this, we can write the formula for F as:
We see this from the picture of the neural network: the value of the node F is the value of edge weight EF times the value of node E. Using Calculus rules, we can calculate the derivative of F with respect to EF as:
Now, we have the final formula for the derivative:
Therefore, our formula for updating edge EF can be filled in:
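Collected in one place, the chain of formulas described above presumably takes the following form. The symbols L (the cross entropy loss), y (the true label), and η (the learning rate) are notation introduced here for compactness; the expression F = EF · E follows the text as written.

```latex
\begin{align*}
L &= -\bigl[\, y \log F + (1 - y)\log(1 - F) \,\bigr] \\
\frac{\partial L}{\partial F} &= -\frac{y}{F} + \frac{1 - y}{1 - F} \\
F &= \mathit{EF} \cdot E
  \quad\Longrightarrow\quad
  \frac{\partial F}{\partial \mathit{EF}} = E \\
\frac{\partial L}{\partial \mathit{EF}}
  &= \frac{\partial L}{\partial F} \cdot \frac{\partial F}{\partial \mathit{EF}}
   = \left( -\frac{y}{F} + \frac{1 - y}{1 - F} \right) E \\
\mathit{EF}_{\text{new}} &= \mathit{EF}_{\text{old}} - \eta \, \frac{\partial L}{\partial \mathit{EF}}
\end{align*}
```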
As stated earlier, the learning rate is pre-specified at 0.1.
Going back to our neural network, into which we fed the vector [1, 2, 3], we got the following values for the nodes:
We see above that the current value of edge EF is 1, the current value of E is 6, and the current value of F is 0.999. We also know from our data table that y_true is 0, and we have pre-specified a learning rate of 0.1. We can plug in all of these values to calculate the updated edge weight EF:
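The plug-in step can be reproduced with a short sketch. It uses the values quoted in the text (EF = 1, E = 6, F = 0.999, true label 0) and assumes the learning rate of 0.1 introduced earlier. It also follows the text's expression F = EF × E as written, so the sigmoid's own derivative is not included. The function name updated_weight is hypothetical.

```python
def updated_weight(weight, node_e, output_f, y_true, learning_rate):
    """One gradient descent step for edge EF under the text's simplification."""
    # Derivative of the cross entropy loss with respect to the output F.
    d_loss_d_f = -(y_true / output_f) + (1 - y_true) / (1 - output_f)
    # Derivative of F with respect to the edge weight EF, taking F = EF * E.
    d_f_d_weight = node_e
    # Chain rule: multiply the two pieces together.
    gradient = d_loss_d_f * d_f_d_weight
    return weight - learning_rate * gradient

new_ef = updated_weight(
    weight=1.0, node_e=6.0, output_f=0.999, y_true=0, learning_rate=0.1
)
print(round(new_ef, 3))  # approximately -599
```

The gradient is large because the network predicted almost exactly 1 for a point whose label is 0, so 1 - F in the denominator is tiny. A different learning rate scales the size of the step proportionally.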
That’s a very different weight!
We do this for all other edges in the network, moving from the rightmost edges towards the leftmost edges. We use the backpropagation formulae to calculate the needed derivatives as we move towards the front of the network.
Once every edge has been updated in the same way as edge EF, we have new model weights for the whole neural network. We have now completed one iteration of training. We then rinse and repeat: a forward pass through the network to calculate the value of each node, followed by backpropagation through the network to calculate the derivatives needed to update each edge weight.
And that’s how gradient descent works for neural networks!