How does linear regression find this line of best fit given the data points? First, let’s get more formal about how we define the line of best fit.

For any given line that we can draw to fit the data, we can draw vertical lines from each data point to the fitted line:

The better the line fits the data, the smaller the average distance from each point to the line. For example, this line is a worse fit than the one above:

These vertical lines are called *residuals*. The goal of learning is to *minimize* the residual distance, because we want a line with the minimal possible distance to all of the data points. Such a line will fit the data better than any other possible line. If we call \(Y_i\) the true value of the \(i^{th}\) data point and \(\widehat{Y_i}\) the value the line predicts for it, then the length of that point's residual is \(|Y_i - \widehat{Y_i}|\).

We can calculate the mean error as the sum of all errors divided by the total number of points:

\(\frac{\sum_{i=1}^N |Y_i - \widehat{Y_i}| }{N}\)

Above, we see our first example of a fancy formula for a pretty simple concept. The \(\sum\) symbol is the “summation” operator, and we specify that we want to sum over all points from i=1 up to i=N, where N is the number of data points. This formula is simply the average length of those dotted lines, \(|Y_i - \widehat{Y_i}|\), over all data points.
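To make the formula concrete, here is a minimal Python sketch of the same calculation. The function name `mean_absolute_error` and the four sample values are made up for illustration; they are not from any particular dataset or library.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |Y_i - Y_hat_i| over all N data points."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    # Sum the residual lengths, then divide by the number of points.
    return sum(abs(y - y_hat) for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical data: true values and one line's predictions for four points.
y_true = [1.0, 3.0, 5.0, 7.0]
y_pred = [2.0, 3.0, 4.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # residual lengths are 1, 0, 1, 2, so the mean is 1.0
```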

The line of best fit is the line with the slope value m and y-intercept b that minimizes this mean absolute error:

\(\min_{m,b} \frac{\sum_{i=1}^N |Y_i - (m x_i + b)| }{N}\)

All we did to get to the equation above is change two things: (1) we replaced \(\widehat{Y_i}\) with \(m x_i + b\), since \(\widehat{Y_i} = m x_i + b\) is the value the line predicts for the \(i^{th}\) point, and (2) we stated that we want to find the values of m and b that minimize this mean absolute error. In fancy math terminology, we call this an *optimization problem*.
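One way to get a feel for what "optimization problem" means is to try many candidate lines and keep the one with the lowest error. The sketch below does exactly that with a brute-force grid search. This is only an illustration of the idea, not how linear regression is normally solved, and the helper names, the toy data, and the candidate grid are all assumptions chosen for the example.

```python
def mae_for_line(m, b, xs, ys):
    """Mean absolute error of the line y = m*x + b on the points (xs, ys)."""
    return sum(abs(y - (m * x + b)) for x, y in zip(xs, ys)) / len(xs)


def grid_search_line(xs, ys, candidates):
    """Try every (m, b) pair from `candidates` and return the best one found."""
    best = None
    for m in candidates:
        for b in candidates:
            err = mae_for_line(m, b, xs, ys)
            if best is None or err < best[0]:
                best = (err, m, b)
    return best


# Toy data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Candidate slopes and intercepts: -5.0 to 5.0 in steps of 0.5.
candidates = [i * 0.5 for i in range(-10, 11)]

best_error, best_m, best_b = grid_search_line(xs, ys, candidates)
print(best_m, best_b, best_error)  # 2.0 1.0 0.0
```

A grid search can only find values that happen to be on the grid, which is why it recovers the line exactly here but would only approximate it on messier data.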

In practice, we minimize the mean *squared* error instead of the mean *absolute* error; the mean squared error is:

\(\frac{\sum_{i=1}^N (Y_i - \widehat{Y_i})^2 }{N}\)
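As a quick sketch, the mean squared error can be computed the same way as the mean absolute error, with the absolute value swapped for a square. The function name and sample values below are illustrative assumptions.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (Y_i - Y_hat_i)^2 over all N data points."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    # Square each residual, sum them, then divide by the number of points.
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


# Hypothetical data: true values and one line's predictions for four points.
y_true = [1.0, 3.0, 5.0, 7.0]
y_pred = [2.0, 3.0, 4.0, 9.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # squared residuals are 1, 0, 1, 4, so the mean is 1.5
```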

All we did was square the length of each dotted line, \(|Y_i - \widehat{Y_i}|\). We do this for two reasons. First and most importantly, squaring penalizes larger errors more heavily. We know that 2 squared is 4, 3 squared is 9, and 4 squared is 16. As the number we are squaring increases, the result grows quadratically, not linearly. Squaring the error therefore increases the penalty on points that lie far from the line of best fit:
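A small numeric comparison shows the effect. In this sketch, two made-up sets of residuals have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss. The names and numbers are hypothetical.

```python
def mean_abs(residuals):
    """Mean absolute error computed directly from a list of residuals."""
    return sum(abs(r) for r in residuals) / len(residuals)


def mean_sq(residuals):
    """Mean squared error computed directly from a list of residuals."""
    return sum(r ** 2 for r in residuals) / len(residuals)


even_misses = [2.0, 2.0, 2.0, 2.0]   # every point is off by 2
one_big_miss = [0.0, 0.0, 0.0, 8.0]  # three perfect points, one off by 8

# The mean absolute error cannot tell the two cases apart.
print(mean_abs(even_misses), mean_abs(one_big_miss))  # 2.0 2.0

# The mean squared error is four times larger for the single big miss.
print(mean_sq(even_misses), mean_sq(one_big_miss))    # 4.0 16.0
```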