Measures of Central Tendency
Jun 07 2021
Hi, I am a data scientist who believes that 'Torture the data and It'll confess everything'. I am working on image processing and Natural Language Processing.
Prob. and Stats
A measure of central tendency is a single number that summarizes a dataset by identifying its central position. The three most common measures are the mean, the median and the mode. They are also called summary statistics.
Mean or Average is the most common measure of central tendency. It is equal to the sum of the values in the dataset divided by the number of values in the dataset and is denoted by \(\bar x\). Mean can be used for both discrete and continuous data. Mathematically,
\(\bar x = \frac{x_1 + x_2 + \cdots + x_n}{n}\)
It can also be written more compactly using the Greek letter \(\Sigma\) (sigma), which stands for "sum of".
\(\bar x = \frac{\Sigma x}{n}\)
Both of these representations refer to the same thing. By convention, however, \(\bar x\) denotes the sample mean and not the population mean. A sample is a subset of the population that is meant to represent it and is used to gain insights about it. The population is the whole of the data; its mean is computed in the same way but is denoted by mu (\(\mu\)).
\(\mu = \frac{\Sigma x}{n}\)
The mean is highly useful because it takes every value of the dataset into account. It is possible that the mean you get is not itself a value present in your dataset. The mean is also sensitive to outliers: a single extreme value can pull it far away from where most of the data lies. So, care should be taken when interpreting the mean of a dataset that contains outliers.
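To make the formula and the outlier caveat concrete, here is a small sketch in Python. The numbers are made up for illustration, and `statistics.mean` from the standard library is just one convenient way to compute the result.

```python
from statistics import mean

# Made-up sample data, used only for illustration.
values = [2, 4, 4, 6, 9]

# Sum of the values divided by the number of values.
manual_mean = sum(values) / len(values)   # 25 / 5 = 5.0

# The standard library gives the same result.
library_mean = mean(values)

# The mean (5) is not one of the values in the dataset.
mean_is_in_data = manual_mean in values   # False

# Adding a single outlier pulls the mean far from the bulk of the data.
with_outlier = values + [95]
mean_with_outlier = mean(with_outlier)    # 120 / 6 = 20

print(manual_mean, library_mean, mean_is_in_data, mean_with_outlier)
```

With the outlier included, the mean is larger than five of the six values, which is why it stops being a good description of a "typical" value here.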
The median is the middle value when the data points are arranged in increasing order. Its main benefit is that it is hardly affected by outliers. To calculate it, arrange the data points in increasing order and take the middle value. If there is an even number of values, take the mean of the two values in the middle.
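The procedure described above can be sketched in a few lines of Python. The helper name `median_by_hand` and the sample numbers are assumptions made for this example; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_hand(data):
    """Sort the data and return the middle value (hypothetical helper)."""
    if not data:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[mid]
    # Even number of values: mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Made-up data, used only for illustration.
odd_values = [9, 2, 6, 4, 4]        # sorted: 2, 4, 4, 6, 9
even_values = [9, 2, 6, 4, 4, 95]   # sorted: 2, 4, 4, 6, 9, 95 (95 is an outlier)

print(median_by_hand(odd_values))   # 4
print(median_by_hand(even_values))  # (4 + 6) / 2 = 5.0
```

Note that adding the outlier 95 moves the median only from 4 to 5, whereas the mean of the same data jumps from 5 to 20.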
The mode is the most frequent value in your dataset. It is best suited to categorical data, for example when we want to find which category is the most affected or the most used. In a histogram of the data, the highest bar marks the mode. The mode has some disadvantages. It cannot be used effectively with continuous data, because exact values rarely repeat. It often does not represent the dataset as a whole. Another problem arises when the most frequent point is far away from the rest of the data points. The mode should be avoided in these cases.
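As a rough illustration, the mode can be found by counting how often each value occurs. The sketch below uses `collections.Counter` and `statistics.multimode` (available from Python 3.8) from the standard library; the data is made up for the example.

```python
from collections import Counter
from statistics import multimode

# Categorical data (made up): the mode is the most frequent category.
colors = ["red", "blue", "red", "green", "blue", "red"]
counts = Counter(colors)
mode_value, mode_count = counts.most_common(1)[0]   # ("red", 3)

# A dataset can have more than one mode when values tie.
tied_modes = multimode([1, 1, 2, 2, 3])             # [1, 2]

# Continuous measurements rarely repeat, so every value ties for
# "most frequent" and the mode tells us nothing useful.
measurements = [1.01, 1.02, 1.03, 1.04]
continuous_modes = multimode(measurements)          # all four values

print(mode_value, mode_count, tied_modes, continuous_modes)
```

For continuous data, a common workaround is to group the values into bins first and report the most populated bin instead of a single value.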