Chebyshev’s inequality and its application in Data Science


Chebyshev’s inequality in Data Science

For any set of numbers, the fraction of values lying within K standard deviations of the mean is at least 1 - 1/K², whatever the shape of the distribution.

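Formula: 1 - 1/K²

where:

  •        K = the "within number" (the distance from the mean) / the standard deviation

and K must be greater than 1. For K of 1 or less, the formula gives zero or a negative number, which tells us nothing.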
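As a minimal sketch, the formula can be written as a small Python function. The name chebyshev_lower_bound is hypothetical and chosen only for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of values within k standard deviations of the mean."""
    # The bound is only informative when k is greater than 1.
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


print(chebyshev_lower_bound(2))            # 0.75
print(round(chebyshev_lower_bound(3), 3))  # 0.889
```

Note that the result is a lower bound: the actual fraction of values inside the range can be higher, but never lower.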

Let us understand this using an example:

Suppose we have a data set with a minimum value of 200, a maximum value of 1500, a mean of 600 and a standard deviation of 80. We want to know what proportion of the data points fall between 440 and 760.

  •  We first subtract 600 - 440 and get 160, i.e. 440 is 160 units below the mean.
  •  Subtract 760 - 600 and get 160, i.e. 760 is 160 units above the mean.
  •  From the above, we can say that every value between 440 and 760 is within 160 units of the mean.

Using the definition above, we can get the value of K, i.e.

K = within number / standard deviation = 160 / 80 = 2

  • Since K = 2 is greater than 1, the formula applies: 1 - 1/K² = 1 - 1/4 = 3/4. So at least 3/4 of the data lie between 440 and 760, which means that at least 75% of the data values are between 440 and 760. Similarly, with K = 3 (3 × 80 = 240 units either side of the mean), 1 - 1/9 = 8/9, so at least 89% of all data points in this series lie between 360 and 840.
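The worked example can be checked with a short Python sketch. The function name bound_for_interval is hypothetical, and the sketch assumes the range is centred on the mean, as it is in this example.

```python
def bound_for_interval(mean, std, low, high):
    """Return (K, lower bound) for a range centred on the mean."""
    # This form of Chebyshev's inequality needs a range symmetric about the mean.
    if abs((mean - low) - (high - mean)) > 1e-9:
        raise ValueError("range must be symmetric about the mean")
    k = (high - mean) / std  # the "within number" divided by the standard deviation
    if k <= 1:
        raise ValueError("K must be greater than 1")
    return k, 1 - 1 / k**2


k, bound = bound_for_interval(600, 80, 440, 760)
print(k, bound)             # 2.0 0.75

k, bound = bound_for_interval(600, 80, 360, 840)
print(k, round(bound, 3))   # 3.0 0.889
```

The minimum (200) and maximum (1500) of the data set are not used: the bound depends only on the mean and the standard deviation.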

Application:

Suppose there are two clothing shops on a street: one is a branded store, Roxy, and the other is an unbranded store, Palika. The average spend per customer at Roxy is $150 with a standard deviation of $35, whereas the average spend per customer at Palika is $145 with a standard deviation of $15. We want to know which store has the higher sales. The average for Roxy is certainly higher than the average for Palika, but the standard deviation for Roxy is also much higher. What does this tell us? It tells us that there is less variation in the spend per customer at Palika than at Roxy. Taking K = 2, Chebyshev's inequality guarantees that at least 75% of Palika's customers spend between $115 and $175, while for Roxy the same guarantee only covers the much wider range of $80 to $220. So Roxy's higher average does not by itself mean that most of its customers spend more: a larger share of Roxy's customers may be spending well below its average, while Palika's customers spend consistently close to $145.
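To make the comparison concrete, the following Python sketch computes, for each store, the range within K = 2 standard deviations of the mean and the share of customers that Chebyshev's inequality guarantees inside it. The function name chebyshev_interval and the choice of K = 2 are assumptions made for illustration.

```python
def chebyshev_interval(mean, std, k):
    """Return (low, high, bound): the range mean ± k*std and the minimum share inside it."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return mean - k * std, mean + k * std, 1 - 1 / k**2


# (average spend, standard deviation) per customer, taken from the example
stores = {"Roxy": (150, 35), "Palika": (145, 15)}

for name, (mean, std) in stores.items():
    low, high, bound = chebyshev_interval(mean, std, 2)
    print(f"{name}: at least {bound:.0%} of customers spend between ${low} and ${high}")
```

The narrower range for Palika reflects its smaller standard deviation. The sketch says nothing about total sales, which would also depend on the number of customers at each store.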
