What are Percentiles and how can they be used to handle outliers in numerical data?

Rohan Paris
3 min read · Apr 26, 2022

image credits: https://www.cartoonistgroup.com/subject/The-Behind+Bars-Comics-and-Cartoons-by-Chip+Bok%27s+Editorial+Cartoons.php

Percentage vs Percentiles:

A percentage expresses a value as a fraction of 100 and is used to compare quantities. A percentile is the value below which a given percentage of observations fall, and it is used to express the rank of an observation within a group.

For example, student ‘A’ has scored 70% on a science test and student ‘B’ is at the 70th percentile on the same test.

Here student ‘A’ scoring 70% means he has earned 70 out of 100 marks on the test.

Student ‘B’ being at the 70th percentile means that she has scored more marks than 70% of the students who took the same test.

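Formulas to calculate percentage and percentile:

Percentage = (value obtained / total value) × 100

Percentile of x = (number of values below x / total number of values) × 100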

Solved example of Percentiles:

Consider X = {2, 2, 3, 4, 6, 7, 9, 10, 10, 13, 15, 19, 20, 21, 22}

Here the sample size ‘n’ is 15.

The value 13 lies at the 60th percentile: 9 of the 15 values are smaller than 13, and (9 / 15) × 100 = 60.
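The same calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming the "count of values strictly below x" definition of a percentile rank; the function name `percentile_rank` is made up for this example and is not part of any library.

```python
def percentile_rank(data, x):
    """Percentage of observations in `data` that are strictly smaller than `x`."""
    below = sum(1 for value in data if value < x)  # values below x
    return 100 * below / len(data)


X = [2, 2, 3, 4, 6, 7, 9, 10, 10, 13, 15, 19, 20, 21, 22]
print(percentile_rank(X, 13))  # 60.0
```

Other definitions (for example, counting values less than or equal to x) exist and give slightly different ranks, so it is worth checking which one a given tool uses.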

To find out what value lies at the kth percentile, the formula is modified as follows
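One common way to do this is to turn the percentile into a position in the sorted data. The sketch below assumes the (n + 1) rank convention, where the position is (k / 100) × (n + 1), with linear interpolation between neighbouring values. This convention is an assumption made for illustration: several conventions are in use and they can return slightly different values. The function name `value_at_percentile` is hypothetical.

```python
def value_at_percentile(data, k):
    """Value at the k-th percentile, using the (n + 1) rank convention (an assumption)."""
    values = sorted(data)
    n = len(values)
    rank = k / 100 * (n + 1)      # 1-based position, may be fractional
    rank = min(max(rank, 1), n)   # clamp to the valid positions 1..n
    lower = int(rank)             # whole part of the position
    fraction = rank - lower       # fractional part used for interpolation
    if lower >= n:
        return values[-1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])


X = [2, 2, 3, 4, 6, 7, 9, 10, 10, 13, 15, 19, 20, 21, 22]
print(value_at_percentile(X, 25))  # 4.0
print(value_at_percentile(X, 50))  # 10.0
print(value_at_percentile(X, 75))  # 19.0
```

Because of interpolation, the value returned for a percentile does not always coincide with an observation in the set.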

Five Number Summary:

The five-number summary consists of the minimum value, 1st quartile value, median value, 3rd quartile value, and the maximum value of a set of observations.

Quartile 1 or Q1 is the 25th percentile of a set and Quartile 3 or Q3 is the 75th percentile.

Interquartile range (IQR) = Q3 - Q1

image source: https://www.w3schools.com/statistics/statistics_box_plots.php

The ‘lower fence’, the lowest boundary for regular observations within a set, is calculated as Q1 - 1.5 * IQR

The ‘upper fence’, the highest boundary for regular observations within a set, is calculated as Q3 + 1.5 * IQR

Data points or observations that fall below the lower fence or above the upper fence are known as outliers. An outlier is an observation that lies at an abnormal distance from other values in a random sample of a population.
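The five-number summary, the fences, and the outlier check can be sketched with Python's standard library alone. The helper names below (`five_number_summary`, `iqr_fences`, `find_outliers`) are made up for this illustration. The quartiles come from `statistics.quantiles`, whose default "exclusive" method is only one of several quartile conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for `data`."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return min(data), q1, median, q3, max(data)


def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data):
    """Return the observations that fall outside the two fences."""
    _, q1, _, q3, _ = five_number_summary(data)
    lower_fence, upper_fence = iqr_fences(q1, q3)
    return [x for x in data if x < lower_fence or x > upper_fence]


X = [2, 2, 3, 4, 6, 7, 9, 10, 10, 13, 15, 19, 20, 21, 22]
print(five_number_summary(X))    # (2, 4.0, 10.0, 19.0, 22)
print(iqr_fences(4.0, 19.0))     # (-18.5, 41.5)
print(find_outliers(X))          # []
print(find_outliers(X + [100]))  # [100]
```

For the set X every value lies inside the fences, so no outliers are reported. Appending an extreme value such as 100 pushes that value above the upper fence.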

Handle outliers using Python:

Here we will take the example of the ‘sepal_width’ feature from the iris dataset. The iris dataset is loaded using the seaborn library.

import seaborn as sns
df = sns.load_dataset('iris')

The boxplot is plotted using the following code:

sns.boxplot(x=df['sepal_width'])
boxplot showing outliers

In the above boxplot, we can see that there is one outlier below the lower whisker and there are three outliers above the upper whisker.

q1 = df['sepal_width'].quantile(0.25)
q3 = df['sepal_width'].quantile(0.75)
IQR = q3 - q1
lower_fence = q1 - 1.5 * IQR
upper_fence = q3 + 1.5 * IQR

Rounded to two decimals, the results are:

IQR: 0.5
lower fence: 2.05
upper fence: 4.05

We will handle the outliers by replacing the values above the upper fence with the upper_fence value and the values below the lower fence with the lower_fence value.

df.loc[df['sepal_width'] > upper_fence, 'sepal_width'] = upper_fence
df.loc[df['sepal_width'] < lower_fence, 'sepal_width'] = lower_fence

Now, if we plot the boxplot again for the ‘sepal_width’ feature, we can see that the outliers have been handled.

sns.boxplot(x=df['sepal_width'])
boxplot with no outliers
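The capping step itself does not depend on pandas. As a minimal sketch, the same idea in plain Python looks like this. The function name `cap_to_fences`, the sample values, and the fence values are all made up for illustration.

```python
def cap_to_fences(values, lower_fence, upper_fence):
    """Replace values outside [lower_fence, upper_fence] with the nearest fence."""
    return [min(max(v, lower_fence), upper_fence) for v in values]


sample = [1, 5, 7, 9, 30]               # hypothetical observations
capped = cap_to_fences(sample, 2, 12)   # hypothetical fences
print(capped)  # [2, 5, 7, 9, 12]
```

For a pandas Series, `Series.clip(lower=..., upper=...)` performs the same capping in a single call. Capping keeps every row in the dataset but changes the extreme values, so whether it is preferable to dropping the rows depends on the analysis.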
