# Logistic Regression

peterwashington, Nov 10 2021
Supervised

Logistic regression is one of many methods for doing classification and is usually the first one that you learn about. For simplicity, we will focus on logistic regression for predicting between two categories (binary classification). Logistic regression has two big differences from linear regression:

First, the output is not just any number – it is a number between 0 and 1. This number represents a probability. A probability of 1 corresponds to 100% chance of something happening. A probability of 0 corresponds to 0% chance of something happening. We will talk about probabilities in more detail later on in this chapter.

In our case, a logistic regression output of 0.3 for a prediction between two categories means the model is predicting that there is a 30% chance that the input falls under one of the classes (by convention, the class labeled 1) and a 70% chance that it falls under the other class (labeled 0).

Second, logistic regression involves applying the sigmoid function. This is the sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

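Plotted, the sigmoid function is an S-shaped curve: it rises from values near 0 for very negative x, passes through 0.5 at x = 0, and levels off near 1 for very positive x.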

The sigmoid function has two nice properties. First, it outputs a number between 0 and 1. This is useful because probabilities are between 0 and 1. Second, it saturates: the output approaches 1 more and more slowly as x gets larger and approaches 0 more and more slowly as x gets smaller. This is useful because data points far away from the majority of the training data (outliers) won’t be assigned disproportionately large probabilities.
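Both properties are easy to check numerically. The following is a minimal sketch using only the Python standard library; the function name `sigmoid` and the sample inputs are illustrative choices, not something prescribed by this tutorial.

```python
import math

def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    # For negative x, use an algebraically equivalent form so exp() never overflows.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# The output stays between 0 and 1 and changes less and less at the extremes.
for x in (-10, -2, 0, 2, 10):
    print(x, round(sigmoid(x), 5))
```

Moving from x = 0 to x = 2 changes the output far more than moving from x = 8 to x = 10, which is the flattening described above.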

Recall that in linear regression, the central learning task is to learn the parameters m and b of the equation y = mx + b to find the line of best fit. In logistic regression, we do the exact same thing: during the gradient descent process, we find the m and b values that maximize the model’s performance.

One crucial difference from linear regression is that instead of a linear equation, we have a sigmoid equation, where we plug mx + b into the sigmoid function we just discussed to get this final equation:

$$\hat{y} = \frac{1}{1 + e^{-(mx + b)}}$$

All we are doing is applying the sigmoid equation to mx + b.
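As a rough illustration, a prediction is just the sigmoid evaluated at mx + b. In the sketch below, the function name `predict_probability` and the values of `m` and `b` are hypothetical, picked only to show how the output moves with the input.

```python
import math

def predict_probability(x, m, b):
    """Logistic regression prediction: the sigmoid applied to mx + b."""
    return 1.0 / (1.0 + math.exp(-(m * x + b)))

# Hypothetical parameters, chosen only for illustration (not learned from data).
m, b = 1.5, -4.0
for x in (1, 2, 3, 4):
    print(x, round(predict_probability(x, m, b), 3))
```

With a positive m, larger inputs give larger probabilities, and the output is exactly 0.5 where mx + b = 0.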

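A second crucial difference from linear regression is the loss function. Instead of measuring the mean squared error from each training point to the line of best fit, we use a different approach called binary cross entropy, also known as log loss. This involves the following formula for each data point (for binary classification):

$$\text{loss} = \begin{cases} -\log(\text{predicted probability}) & \text{if the true label is 1} \\ -\log(1 - \text{predicted probability}) & \text{if the true label is 0} \end{cases}$$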

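Using the symbol $$\hat{y}$$ to denote the predicted y value and $$y_{true}$$ to denote the true y value, we can start to rewrite this more mathematically as:

$$\text{loss} = \begin{cases} -\log(\hat{y}) & \text{if } y_{true} = 1 \\ -\log(1 - \hat{y}) & \text{if } y_{true} = 0 \end{cases}$$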

An arguably simpler way to communicate this same formula is:

$$\text{loss} = -\left[ y_{true} \log(\hat{y}) + (1 - y_{true}) \log(1 - \hat{y}) \right]$$

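We sum this log loss up for all N data points, giving us this final formula for the loss:

$$L = -\sum_{i=1}^{N} \left[ y_{true,i} \log(\hat{y}_i) + (1 - y_{true,i}) \log(1 - \hat{y}_i) \right]$$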
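To make the summed log loss concrete, here is a small sketch. The function name, the sample labels and predictions, and the `eps` clipping (which keeps the logarithm finite when a prediction is exactly 0 or 1) are all assumptions added for illustration.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Log loss summed over all data points (labels are 0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Assumption: clip predictions away from exactly 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]
mostly_wrong = [0.1, 0.9, 0.2, 0.8]
print(binary_cross_entropy(labels, mostly_right))  # smaller loss
print(binary_cross_entropy(labels, mostly_wrong))  # larger loss
```

Confident predictions that agree with the labels give a small loss, while confident predictions that disagree are penalized heavily.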

Coming back to the calculus side of things, in order to determine how much we update the parameters, we once again have to calculate the derivative of the loss function L (binary cross entropy) with respect to the parameter m and the derivative of the loss function with respect to the parameter b:

$$\frac{\partial L}{\partial m}$$

And:

$$\frac{\partial L}{\partial b}$$

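No need to do that math here, as the purpose of this tutorial series is not working out calculus problems. Just know that we can solve those derivatives and update m and b accordingly, just like we did in the first Introduction to Machine Learning course:

$$m \leftarrow m - \alpha \frac{\partial L}{\partial m}, \qquad b \leftarrow b - \alpha \frac{\partial L}{\partial b}$$

Here $$\alpha$$ is the learning rate, which controls the size of each update.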
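For readers who want to see the update loop in action, here is a minimal sketch of gradient descent for a single input feature. It relies on the standard result that, for the summed binary cross entropy, the derivative with respect to m is the sum of (predicted minus true) times x, and the derivative with respect to b is the sum of (predicted minus true). The data, the learning rate, the number of steps, and the function names are all made up for illustration.

```python
import math

def sigmoid(z):
    # Overflow-safe sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(xs, ys, learning_rate=0.05, steps=5000):
    """Fit m and b by gradient descent on the summed binary cross entropy."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        grad_m, grad_b = 0.0, 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(m * x + b) - y  # predicted minus true
            grad_m += error * x             # contribution to dL/dm
            grad_b += error                 # contribution to dL/db
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

# Made-up data: hours studied and whether the student passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 1, 0, 1, 1, 1]
m, b = train(hours, passed)
print("m =", m, "b =", b)
print("P(pass | 4 hours) =", sigmoid(m * 4.0 + b))
```

The classes overlap slightly in this toy data on purpose: with perfectly separable data, the best m grows without bound and plain gradient descent never settles.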

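At the end of all of that, we have learned a logistic curve. In the example plot, the x-axis is the number of hours a student spent studying, the y-axis is the model's output, and a dotted horizontal line marks 0.5.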

If the corresponding y-value for the hours spent studying is above 0.5 (above that middle dotted line), then we might interpret that as predicting that the student passed the exam.

In practice, 0.5 might not be the best cutoff for the classifier. We will discuss later how we can comprehensively evaluate the classifier while taking this into account.
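Turning a probability into a class label is a one-line comparison, and the cutoff is just a parameter. A small sketch follows; the function name `classify` and the sample probabilities are hypothetical.

```python
def classify(probability, threshold=0.5):
    """Turn a predicted probability into a class label (1 or 0)."""
    return 1 if probability > threshold else 0

# Made-up model outputs, for illustration only.
probabilities = [0.15, 0.45, 0.55, 0.92]
print([classify(p) for p in probabilities])                 # cutoff of 0.5
print([classify(p, threshold=0.4) for p in probabilities])  # lower cutoff
```

Lowering the cutoff labels more inputs as class 1, which trades fewer missed positives for more false alarms.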

When predicting between more than two categories, we replace the sigmoid activation with the softmax activation, which turns one score per category into a set of probabilities that sum to 1.

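The general formula for softmax activation, for a vector z containing one score for each of K categories, is:

$$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$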
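A short sketch of softmax, assuming the scores arrive as a plain Python list. Subtracting the largest score before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an addition here, not part of the formula itself.

```python
import math

def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    # Subtracting the max score does not change the result but avoids overflow in exp().
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three categories.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

With only two categories and scores [z, 0], the first softmax output equals the sigmoid of z, which is why softmax is the natural multi-category replacement.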

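In this case, we could use a different kind of cross entropy loss called categorical cross entropy loss. With N data points, K categories, $$y_{true,i,k}$$ equal to 1 if data point i belongs to category k (and 0 otherwise), and $$\hat{y}_{i,k}$$ the predicted probability of category k for data point i, the formula is:

$$L = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{true,i,k} \log(\hat{y}_{i,k})$$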
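As a sketch of how categorical cross entropy might be computed, the example below assumes the true labels are one-hot encoded (a 1 in the position of the correct category and 0 elsewhere). The function name, the sample data, and the `eps` guard against taking the log of 0 are illustrative assumptions.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy summed over data points, with one-hot encoded true labels."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y, p in zip(true_row, pred_row):
            # Only the true category (y == 1) contributes to the sum.
            total += -y * math.log(max(p, eps))
    return total

# Made-up example: two data points, three categories.
y_true = [[0, 1, 0], [1, 0, 0]]
y_pred = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]
print(categorical_cross_entropy(y_true, y_pred))
```

Because the labels are one-hot, the loss for each data point reduces to the negative log of the probability assigned to its correct category.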