
# Evaluating Classifiers

peterwashington Nov 01 2021
Supervised

It is important to properly measure how well our machine learning classifiers do. While we may think that we can just measure the model’s accuracy, it turns out that only looking at accuracy can be incredibly misleading. Let’s see why with an example.

Let’s say we are trying to predict whether someone is sick with coronavirus or not, and we measure their cough intensity and their temperature. We plot the testing data points, with cough intensity on the x1 axis and temperature on the x2 axis.

Notice that there are 9 data points for “no coronavirus” and 1 data point for “coronavirus”. Now let’s say we use the following function to predict the diagnosis:

def has_coronavirus(x1, x2):
    # Ignores both features and always predicts the majority class.
    return "no coronavirus"

The above “classifier” predicts that the person does not have coronavirus regardless of their cough intensity or temperature. Pretty horrible strategy, right? Well, let’s measure its accuracy. Accuracy is defined as the percentage of data points correctly predicted. For the above dataset, the classifier gets the 9 “no coronavirus” points right and the 1 “coronavirus” point wrong, so the accuracy is:

$$\text{accuracy} = \frac{9}{10} = 90\%$$

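90% accurate?! This number is clearly misleading, since the classifier just predicts “no coronavirus” no matter what and never detects a single sick patient. It only scores well because 9 of the 10 test points happen to be negative. This is why machine learning scientists are not huge fans of accuracy on its own: on imbalanced data, it can be an incredibly misleading metric.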
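The accuracy calculation can be reproduced in a few lines of Python. This is only a minimal sketch: the `labels` list encodes nothing more than the 9-to-1 class split described above, and the `features` values are placeholders, since the individual measurements are not listed in this post.

```python
def has_coronavirus(x1, x2):
    # Ignores both features and always predicts the majority class.
    return "no coronavirus"

# True labels with the 9-to-1 split described in the text.
labels = ["no coronavirus"] * 9 + ["coronavirus"]

# Placeholder (cough intensity, temperature) pairs. These are assumed values;
# they do not matter here because the classifier ignores its inputs.
features = [(0.0, 0.0)] * len(labels)

predictions = [has_coronavirus(x1, x2) for x1, x2 in features]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(pred == label for pred, label in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"accuracy = {accuracy:.0%}")  # accuracy = 90%
```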

To arrive at more robust machine learning metrics, we must first discuss four important sub-metrics:

• True positive (TP): A true positive occurs when the data point is actually a “positive” point (in this case, coronavirus) and the classifier correctly predicts that it is “positive”.
• True negative (TN): A true negative occurs when the data point is actually a “negative” point (in this case, not coronavirus) and the classifier correctly predicts that it is “negative”.
• False positive (FP): A false positive occurs when the data point is actually a “negative” point (in this case, not coronavirus) but the classifier falsely predicts that it is “positive”.
• False negative (FN): A false negative occurs when the data point is actually a “positive” point (in this case, coronavirus) but the classifier falsely predicts that it is “negative”.
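The four definitions above translate directly into a counting loop. The sketch below uses a hypothetical helper, `confusion_counts`, and applies it to a classifier that always predicts “no coronavirus” on a test set with 9 negative points and 1 positive point.

```python
def confusion_counts(y_true, y_pred, positive="coronavirus"):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

y_true = ["no coronavirus"] * 9 + ["coronavirus"]
y_pred = ["no coronavirus"] * 10  # always predicts the negative class

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 0 9 0 1
```

Note that the four counts always add up to the total number of test points, which is a handy sanity check.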

Using these metrics, we can derive the formulas for commonly used evaluation metrics in machine learning: specificity and sensitivity.

Specificity has the following formula:

$$\text{specificity} = \frac{TN}{TN + FP}$$

Conceptually, this is the fraction of the data points that were actually negative (TN + FP) that were correctly predicted to be negative (TN).

Sensitivity has the following formula:

$$\text{sensitivity} = \frac{TP}{TP + FN}$$

Conceptually, this is the fraction of the data points that were actually positive (TP + FN) that were correctly predicted to be positive (TP).

Let’s see how these metrics compare to accuracy for the coronavirus patient dataset from above. Because the classifier always predicts “no coronavirus”, it produces TP = 0, TN = 9, FP = 0, and FN = 1.

The specificity is:

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{9}{9 + 0} = 100\%$$

The sensitivity is:

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{0}{0 + 1} = 0\%$$

So in sum, we have an accuracy of 90%, a specificity of 100% (because the classifier correctly identifies all negative examples), and a sensitivity of 0% (because the classifier misses the only positive example). These 3 metrics, when reported together, give us a much more comprehensive understanding of the classifier’s performance. In the real world, we do not only report accuracy: we report all 3 metrics (and potentially other metrics).
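As a rough sketch, the three metrics can be computed side by side from the four counts. The counts below are the ones a classifier that always predicts “no coronavirus” produces on a test set with 9 negative points and 1 positive point; the function names are illustrative and not taken from any library.

```python
def specificity(tn, fp):
    # Fraction of actual negatives that were predicted negative.
    return tn / (tn + fp)

def sensitivity(tp, fn):
    # Fraction of actual positives that were predicted positive.
    return tp / (tp + fn)

# Counts for the always-"no coronavirus" classifier on the 9-to-1 test set.
tp, tn, fp, fn = 0, 9, 0, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"accuracy    = {accuracy:.0%}")             # accuracy    = 90%
print(f"specificity = {specificity(tn, fp):.0%}")  # specificity = 100%
print(f"sensitivity = {sensitivity(tp, fn):.0%}")  # sensitivity = 0%
```

Seen together, the high accuracy and specificity are explained entirely by the class imbalance, while the sensitivity exposes that no sick patient is ever detected.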

Another common pair of evaluation metrics is precision and recall. Recall is just another name for sensitivity as described above, and the two have exactly the same formula. Precision has the following formula:

$$\text{precision} = \frac{TP}{TP + FP}$$

Conceptually, precision is the fraction of the data points predicted to be positive (TP + FP) that were actually positive (TP).
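One practical wrinkle is worth a quick sketch: a classifier that never predicts “positive” has TP + FP = 0, so its precision is undefined. The hypothetical `precision` helper below returns `None` in that case, which is just one possible convention assumed here. For contrast, it is also applied to the opposite extreme, a classifier that always predicts “coronavirus” on the same 9-to-1 test set (TP = 1, FP = 9, FN = 0).

```python
def precision(tp, fp):
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Nothing was predicted positive, so precision is undefined.
        # Returning None is an assumed convention for this sketch.
        return None
    return tp / predicted_positive

def recall(tp, fn):
    # Same formula as sensitivity.
    return tp / (tp + fn)

# Always predicts "no coronavirus": TP = 0, FP = 0, FN = 1.
print(precision(0, 0), recall(0, 1))  # None 0.0

# Always predicts "coronavirus": TP = 1, FP = 9, FN = 0.
print(precision(1, 9), recall(1, 0))  # 0.1 1.0
```

The second classifier catches every sick patient (recall of 100%), but only 1 in 10 of its positive predictions is correct, which is exactly the weakness that precision is meant to reveal.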