peter.washington

# What is a Line of Best Fit?

How does linear regression find this line of best fit given the data points? First, let’s get more formal about how we define the line of best fit.

For any given line that we can draw to fit the data, we can draw vertical lines from each data point to the fitted line:

The better the line fits the data, the smaller the average distance from each point to the line. For example, this line is a worse fit than the line above:

These vertical lines are called residuals. The goal of learning is to minimize the residual distance, because we want a line with the minimal possible distance to all of the data points. Such a line will fit the data better than any other possible line. If we call $$Y_i$$ the true value of the ith data point and $$\widehat{Y_i}$$ the value predicted by the linear regression model for the ith data point, then the error of our prediction is the absolute value of the difference between these two values: $$|Y_i - \widehat{Y_i}|$$. This $$|Y_i - \widehat{Y_i}|$$ error term can be visualized as the length of the dotted vertical lines from the line to the point:

We can calculate the mean error as the sum of all errors divided by the total number of points:

$$\frac{\sum_{i=1}^N |Y_i - \widehat{Y_i}| }{N}$$

Above, we see our first example of a fancy formula for a pretty simple concept. The $$\sum$$ symbol is the “summation” operator, and we specify that we want to sum over all points from i=1 up to i=N, where N is the number of data points. This formula is simply the average length of those dotted lines, $$|Y_i - \widehat{Y_i}|$$, across all data points.
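To make the formula concrete, here is a minimal Python sketch of the mean absolute error. The function name `mean_absolute_error` and the sample values are made up for illustration; they do not come from any particular library or dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |Y_i - Y_hat_i| over all N data points."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    total = 0.0
    # This loop plays the role of the summation symbol: i goes from 1 to N.
    for actual, predicted in zip(y_true, y_pred):
        total += abs(actual - predicted)
    return total / n


# Hypothetical data: four true values and four model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.0, 9.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # the absolute errors are 0.5, 0.5, 1.0, 0.0, so this prints 0.5
```

Each pass through the loop adds the length of one dotted line, and the final division by `n` turns the sum into an average.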

The line of best fit is the line with the slope value m and y-intercept b that minimizes this mean absolute error:

$$\min_{m,b} \frac{\sum_{i=1}^N |Y_i - (m x_i + b)| }{N}$$

All we did to get to the equation above is change two things: (1) we replaced $$\widehat{Y_i}$$ with $$m x_i + b$$, since the model's prediction is $$\widehat{Y_i} = m x_i + b$$, and (2) we stated that we want to find the values of m and b that minimize this mean absolute error. In fancy math terminology, we call this an optimization problem.
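One way to get a feel for this optimization problem is to try many candidate values of m and b and keep the pair with the lowest mean absolute error. The sketch below does exactly that with a brute-force grid search. This is only an illustration of what "minimize over m and b" means, not how linear regression is usually solved in practice, and the function names, data, and candidate grid are all hypothetical.

```python
def mean_absolute_error_of_line(m, b, xs, ys):
    """Mean absolute error of the line y = m*x + b on the given points."""
    n = len(xs)
    return sum(abs(y - (m * x + b)) for x, y in zip(xs, ys)) / n


def grid_search_line(xs, ys, candidates):
    """Try every (m, b) pair from `candidates` and return the best one.

    Returns a tuple (m, b, error) for the pair with the lowest error.
    """
    best = None
    for m in candidates:
        for b in candidates:
            err = mean_absolute_error_of_line(m, b, xs, ys)
            if best is None or err < best[2]:
                best = (m, b, err)
    return best


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# Candidate values 0.0, 0.5, 1.0, ..., 4.0 for both the slope and the intercept.
candidates = [i * 0.5 for i in range(9)]

best_m, best_b, best_err = grid_search_line(xs, ys, candidates)
print(best_m, best_b, best_err)  # 2.0 1.0 0.0
```

A grid search only finds the best pair among the candidates it is given, so it is a teaching aid rather than a practical solver.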

In practice, we minimize the mean squared error instead of the mean absolute error. The mean squared error is:

$$\frac{\sum_{i=1}^N (Y_i - \widehat{Y_i})^2 }{N}$$
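As a minimal sketch, the mean squared error formula can be computed in Python as follows. The function name `mean_squared_error` and the sample values are assumptions made for illustration only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (Y_i - Y_hat_i)^2 over all N data points."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        total += (actual - predicted) ** 2
    return total / n


# Hypothetical data: four true values and four model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.0, 9.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # the squared errors are 0.25, 0.25, 1.0, 0.0, so this prints 0.375
```

No absolute value is needed here, because squaring a number already gives a non-negative result.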

All we did was square the length of each dotted line, $$|Y_i - \widehat{Y_i}|$$. We do this for two reasons. First and most importantly, squaring penalizes larger errors more heavily. We know that 2 squared is 4, 3 squared is 9, and 4 squared is 16. As we increase the number we are squaring, the result grows quadratically: doubling an error quadruples its penalty. Squaring the error therefore results in increased penalization of larger distances from the line of best fit: