Balanced Datasets
peterwashington, Nov 01 2021
Bias and Fairness
It is crucial that datasets are balanced across classes. A balanced dataset contains the same number of data points for each output category. In Section 2.7, we saw that if a dataset contains examples from only one category, then a classifier can trivially predict that category every time and still appear to perform well on the training data.
The same principle applies to building unbiased models. If 99% of the data points in a dataset come from white males, then the model is likely to fail for everyone else. Given that the data collection process may yield biased samples, how can we algorithmically correct for this lack of equal representation in the dataset?
The first approach is called undersampling. To undersample, first find the class with the fewest data points (samples). Then randomly sample data points from each of the other classes so that every class ends up with the same number of data points as that smallest class. For example:
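A minimal sketch of undersampling in plain Python (the `undersample` helper, the list-based data layout, and the toy data are illustrative, not from the original):

```python
import random
from collections import defaultdict

def undersample(X, y, seed=0):
    """Randomly drop data points so every class matches the smallest class size."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    # Size of the class with the fewest examples.
    min_count = min(len(samples) for samples in by_class.values())
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label, samples in by_class.items():
        # Randomly keep exactly min_count data points from each class.
        for features in rng.sample(samples, min_count):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out

# Toy dataset: class "a" has 4 data points, class "b" has 2.
X = [[0], [1], [2], [3], [4], [5]]
y = ["a", "a", "a", "a", "b", "b"]
X_bal, y_bal = undersample(X, y)
# After undersampling, each class contributes 2 data points.
```

Fixing the random seed makes the subsampling reproducible, which matters when comparing models trained on the same rebalanced dataset.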
The opposite approach is called oversampling. To oversample, find the class with the most data points and repeat data points in the other classes until every class has that same number of data points. For example:
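A corresponding sketch of oversampling, under the same illustrative assumptions (the `oversample` helper and toy data are not from the original):

```python
import random
from collections import defaultdict

def oversample(X, y, seed=0):
    """Repeat data points so every class matches the largest class size."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    # Size of the class with the most examples.
    max_count = max(len(samples) for samples in by_class.values())
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label, samples in by_class.items():
        # Keep all original data points for this class...
        X_out.extend(samples)
        y_out.extend([label] * len(samples))
        # ...then repeat randomly chosen ones until the class reaches max_count.
        for _ in range(max_count - len(samples)):
            X_out.append(rng.choice(samples))
            y_out.append(label)
    return X_out, y_out

# Toy dataset: class "a" has 4 data points, class "b" has 2.
X = [[0], [1], [2], [3], [4], [5]]
y = ["a", "a", "a", "a", "b", "b"]
X_bal, y_bal = oversample(X, y)
# After oversampling, each class contributes 4 data points.
```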
There are advantages and disadvantages to both undersampling and oversampling. Oversampling keeps every data point in the dataset, but any biases in the smaller classes are magnified because their data points are repeated many times. Undersampling, on the other hand, does not magnify the biases of the underrepresented classes, but it throws away data points, which is far from ideal given how difficult it is to collect good data. It is also possible to combine the two by selecting a cutoff class size: classes above the cutoff are undersampled, and classes below it are oversampled.
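The combined cutoff strategy can be sketched the same way (the `resample_to_cutoff` helper and toy data are illustrative assumptions):

```python
import random
from collections import defaultdict

def resample_to_cutoff(X, y, cutoff, seed=0):
    """Undersample classes above `cutoff` and oversample classes below it."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    rng = random.Random(seed)
    X_out, y_out = [], []
    for label, samples in by_class.items():
        if len(samples) >= cutoff:
            # Above the cutoff: randomly keep exactly `cutoff` data points.
            chosen = rng.sample(samples, cutoff)
        else:
            # Below the cutoff: repeat random data points up to `cutoff`.
            chosen = samples + [rng.choice(samples)
                                for _ in range(cutoff - len(samples))]
        X_out.extend(chosen)
        y_out.extend([label] * cutoff)
    return X_out, y_out

# Toy dataset: class "a" has 4 data points, class "b" has 2; cutoff is 3,
# so "a" is undersampled to 3 and "b" is oversampled to 3.
X = [[0], [1], [2], [3], [4], [5]]
y = ["a", "a", "a", "a", "b", "b"]
X_bal, y_bal = resample_to_cutoff(X, y, cutoff=3)
```

Choosing the cutoff trades off the two downsides: a higher cutoff discards fewer data points but repeats the small classes more, and vice versa.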