The previous example was for regression. A popular method to prevent overfitting in regression is called regularization. Recall that a standard loss function for regression is the mean squared error:

L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
In the above, N is the number of data points and M is the number of input variables to the regression model.
All we do in regularization is add the following term to this loss function:

\lambda \sum_{j=1}^{M} |m_j|
Conceptually, this term sums the absolute values of the model's weights. Since it is added to the loss function, and since the aim of gradient descent is to minimize the loss, larger coefficients and more non-zero coefficients now mean higher loss. To minimize the loss, many of the coefficients are therefore forced towards 0. λ is a coefficient that controls how strongly we regularize: larger values of λ result in stronger regularization (more of the m weights driven to 0).
The full loss function with regularization is as follows:

L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{M} |m_j|
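As a quick numeric sketch of this loss, the snippet below computes the MSE plus the L1 penalty for a few values of λ. The predictions, targets, and weights are made-up illustrative numbers, not taken from the text:

```python
import numpy as np

# Hypothetical targets, predictions, and model weights (for illustration only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
weights = np.array([0.8, -1.5, 0.0, 2.1])  # the m_j coefficients

mse = np.mean((y_true - y_pred) ** 2)   # standard regression loss (0.375 here)
l1_penalty = np.sum(np.abs(weights))    # sum of absolute weights (4.4 here)

for lam in (0.0, 0.1, 1.0):
    loss = mse + lam * l1_penalty
    print(f"lambda={lam}: loss={loss:.3f}")
```

Increasing λ raises the loss contributed by the same set of weights, which is exactly why gradient descent is pushed towards smaller (and eventually zero) coefficients.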
This type of regularization is called lasso regression or L1 regularization. Another type of regularization is called ridge regression or L2 regularization. Ridge regression simply squares the model weights in the regularization term instead of taking their absolute values:

\lambda \sum_{j=1}^{M} m_j^2
This gives us the revised loss function:

L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{M} m_j^2
L1 regularization tends to force the m coefficients exactly to 0, while L2 regularization shrinks them towards 0 without making them exactly 0. L1 is therefore the method of choice for feature selection, where we aim to select a small subset of the input variables to train the model. This is useful when it is not feasible to collect many input variables once the model is deployed in the real world. If feature selection is not required, L2 regularization is a good default.
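The sparsity difference can be seen directly by fitting both penalties on synthetic data where only two of ten features actually matter. This is a minimal numpy sketch, not the text's own experiment: the L1 fit uses a soft-thresholding (proximal) step, which is one standard way to handle the non-differentiable |m_j| term, while the L2 fit just adds the penalty's gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: only features 0 and 1 carry signal.
N, M = 200, 10
X = rng.normal(size=(N, M))
true_w = np.zeros(M)
true_w[0], true_w[1] = 3.0, -2.0
y = X @ true_w + 0.1 * rng.normal(size=N)

def fit(lam, use_l1, steps=2000, lr=0.01):
    """Gradient descent on the MSE; the penalty is applied per step."""
    w = np.zeros(M)
    for _ in range(steps):
        grad = (2.0 / N) * X.T @ (X @ w - y)  # gradient of the MSE term
        if use_l1:
            w = w - lr * grad
            # Soft-threshold: pulls small weights exactly to 0 (lasso).
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
        else:
            # Ridge: the penalty's gradient 2*lam*w shrinks weights smoothly.
            w = w - lr * (grad + 2.0 * lam * w)
    return w

w_l1 = fit(lam=1.0, use_l1=True)
w_l2 = fit(lam=1.0, use_l1=False)

print("L1 exact zeros:", int(np.sum(np.abs(w_l1) < 1e-6)))
print("L2 exact zeros:", int(np.sum(np.abs(w_l2) < 1e-6)))
```

The L1 fit zeroes out the irrelevant coefficients while keeping the two informative ones, whereas the L2 fit leaves every coefficient small but non-zero, matching the feature-selection behaviour described above.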