How does linear regression find the values of m and b that minimize the mean squared error? The process is called gradient descent, and it is used across nearly all of supervised machine learning, so we will spend some time on it.
At a high level, gradient descent follows these steps:

1. Choose random starting values for the model parameters (here, m and b).
2. Calculate the mean squared error with the current parameter values.
3. Update the parameters based on the derivatives of the mean squared error.
4. Repeat steps 2 and 3 until the error is small enough.
The best way to understand the gradient descent process for the first time is with an example. Let’s look at an example of predicting performance on an exam from hours spent studying (using linear regression):
Step 1 is to choose random values for m and b, so let’s start with m = 0 and b = 10. That would give us the following starting line:
Step 2 is to calculate the mean squared error. Let’s say the below table represents the dataset. We have a column for the x value, a column for the y value, a column for the regression prediction so far, and a column for the calculation of the squared error for that point:
The mean squared error is therefore:
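Since the table’s exact values are not reproduced here, the calculation can be sketched with a hypothetical dataset (hours studied vs. exam score — illustrative numbers, not the actual table):

```python
# Hypothetical data: hours studied (x) and exam score (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 61.0, 68.0, 77.0, 84.0]

m, b = 0.0, 10.0  # the random starting guesses from step 1

def mse(m, b, xs, ys):
    """Mean squared error of the line y = m*x + b over the dataset."""
    n = len(xs)
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

print(mse(m, b, xs, ys))  # a large number, since the starting line is far off
```

With the starting line y = 0x + 10, every prediction is 10, so each squared error is large and the mean squared error comes out in the thousands.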
That’s quite a big error! Step 3 is to update the values of m and b of the prediction line based on that error. The update is done with some calculus, in particular the derivative. For those who have not taken calculus, the derivative measures the rate of change of a function as one of its variables changes.
In the case of gradient descent, we want the derivative, or rate of change, of the mean squared error with respect to each parameter of the model. For linear regression, the parameters are m and b, so we are calculating exactly these two derivatives:
Using our discussion of the relevant calculus from section 1.4, we can calculate the partial derivative of the mean squared error with respect to both of the model parameters m and b:
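The equations themselves are not reproduced here, but for the standard mean squared error over n points, $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(mx_i + b - y_i)^2$, the two partial derivatives work out to:

```latex
\frac{\partial\,\text{MSE}}{\partial m} = \frac{2}{n} \sum_{i=1}^{n} x_i \left( m x_i + b - y_i \right)
\qquad
\frac{\partial\,\text{MSE}}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \left( m x_i + b - y_i \right)
```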
Again, if you have not taken calculus, then don’t worry about how the derivative was calculated. Just know that we can calculate these derivatives. The specifics of the equations above are not important.
We update the values of m and b according to these calculated derivatives and a pre-specified learning rate:
The learning rate tells us how quickly we want to update the line.
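As a sketch in code, one update of m and b might look like this (assuming the standard MSE gradients; the function name and dataset shape are illustrative):

```python
# One gradient descent step for linear regression.
def gradient_step(m, b, xs, ys, learning_rate):
    n = len(xs)
    # Partial derivatives of the mean squared error with respect to m and b.
    dm = (2 / n) * sum(x * (m * x + b - y) for x, y in zip(xs, ys))
    db = (2 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
    # Move each parameter a small step in the direction that reduces the error.
    return m - learning_rate * dm, b - learning_rate * db
```

Called once with the starting values m = 0 and b = 10, this nudges both parameters toward values that fit the data better; a larger learning rate would take bigger steps.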
Now we understand the details of step 3 of our 4-step algorithm for gradient descent.
Let’s return to our example and remember that we had come up with the following initial line of best fit (which is not very good to start with):
Using our equations for the partial derivatives of the mean squared error, let’s update the values of m and b. To get a number for each of the two partial derivatives, we first calculate each part of their equations:
Using our formula for the partial derivative of m, we get:
And same for b:
As mentioned, the learning rate controls how big each parameter update is. Here we will use a learning rate of 0.01:
The updated estimate for the line of best fit is then:
Let’s draw this updated line:
That line is much closer to the data! We continue the gradient descent process by again calculating the mean squared error (step 2), then calculating the partial derivatives of the mean squared error with respect to each model parameter and updating the parameters accordingly (step 3). We repeat these two steps until the mean squared error is small enough for our purposes.
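Putting the whole loop together, a minimal sketch of the full process might look like this (the dataset, starting values, and iteration count are illustrative, not taken from the book's table):

```python
# Hypothetical data: hours studied (x) and exam score (y).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 61.0, 68.0, 77.0, 84.0]

m, b = 0.0, 10.0       # step 1: random starting values
learning_rate = 0.01
n = len(xs)

for step in range(5000):
    # Step 3: partial derivatives of the mean squared error.
    dm = (2 / n) * sum(x * (m * x + b - y) for x, y in zip(xs, ys))
    db = (2 / n) * sum((m * x + b - y) for x, y in zip(xs, ys))
    # Update each parameter against its gradient.
    m -= learning_rate * dm
    b -= learning_rate * db

# Step 2, one last time: check the final error.
mse = sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
print(round(m, 2), round(b, 2), round(mse, 2))
```

On this toy dataset the loop settles on a line very close to the least-squares fit; in practice one would stop early once the error (or the size of the updates) drops below a chosen threshold rather than running a fixed number of iterations.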
After several iterations, we should eventually end up with a pretty good line.
And that's fundamentally how machine learning works!