What do you mean by skewed data?

Rohan Paris
2 min read · Apr 2, 2022

image credit: theeconomyofmeaning.com

Skewness measures the asymmetry of a set of data, that is, how far its distribution departs from the symmetrical bell curve of the normal distribution. If one tail of the curve stretches further out than the other, to the left or to the right, the data is said to be skewed. A perfectly symmetrical distribution has a skewness of 0.
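To make "a skewness of 0" concrete, here is one common definition, the Fisher-Pearson coefficient of skewness, which is what scipy.stats.skew computes by default. The symbols n (sample size), x_i (observations) and m_k (k-th central moment) are notation introduced here for illustration:

```latex
g_1 = \frac{m_3}{m_2^{3/2}},
\qquad
m_k = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^k,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

The third central moment m_3 is positive when large deviations above the mean dominate (a right tail), negative when large deviations below the mean dominate (a left tail), and 0 when the two sides balance.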

A histogram is a convenient way to plot the distribution of numerical data and check it for skew.
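As a rough sketch of what a histogram does, the snippet below bins a small made-up list of values with NumPy and prints one text bar per bin. The values and the choice of 4 bins are assumptions for illustration only:

```python
import numpy as np

# Made-up values for illustration; one large value (9) forms a right tail
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

# Split the range [1, 9] into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

# Print a simple text histogram: one '#' per value in the bin
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:4.1f} to {right:4.1f} | {'#' * int(count)}")
```

Plotting libraries draw their bars from the same kind of bin counts; with matplotlib, for example, plt.hist(values, bins=4) would show these four bins as a chart.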

Right-skewed data:

A histogram whose long tail extends to the right of the graph’s peak, with most of the data clustered on the left, is known as a right-skewed histogram. It is also known as a positively skewed histogram.

For right-skewed data, typically mean > median > mode

Wealth distribution is a classic example of this type of data: most people sit near the low end, while the very rich form the long tail on the far right of the graph.
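The ordering of mean, median and mode can be checked on a small example. The incomes below are made-up numbers, chosen only so that one very large value creates a right tail:

```python
import statistics
import scipy.stats

# Hypothetical annual incomes in thousands (made-up numbers for illustration)
incomes = [20, 25, 25, 25, 30, 30, 35, 40, 60, 150]

mean_income = statistics.mean(incomes)      # pulled upward by the single 150
median_income = statistics.median(incomes)  # middle of the sorted values
mode_income = statistics.mode(incomes)      # most frequent value
skewness = scipy.stats.skew(incomes)        # sign shows the direction of the tail

print("mean:", mean_income, "median:", median_income, "mode:", mode_income)
print("positively skewed:", skewness > 0)
```

The single large income drags the mean well above the median, while the mode stays at the most common low value.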

Left-skewed data:

A histogram whose long tail extends to the left of the graph’s peak, with most of the data clustered on the right, is known as a left-skewed histogram. It is also known as a negatively skewed histogram.

For left-skewed data, typically mode > median > mean

Human lifespan (age at death) can be considered an example where the curve is left-skewed: most people die at an older age, while the relatively few early deaths form the tail on the left.
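A left-skewed sample can be simulated to see the same effect in reverse. This sketch draws synthetic "lifespans" from a Beta(8, 2) distribution scaled to 0–100; the distribution, its parameters and the seed are assumptions chosen for illustration, not real mortality data:

```python
import numpy as np
import scipy.stats

# Synthetic data, not real lifespans: Beta(8, 2) piles up near the upper end
rng = np.random.default_rng(0)
lifespans = 100 * rng.beta(8, 2, size=10_000)

mean_age = lifespans.mean()
median_age = np.median(lifespans)
skewness = scipy.stats.skew(lifespans)

print(f"mean={mean_age:.1f}, median={median_age:.1f}, skew={skewness:.2f}")
print("negatively skewed:", skewness < 0)
```

Here the tail of low values pulls the mean below the median, and the skewness comes out negative.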

Check for skewness using Python’s SciPy library:

import pandas as pd
import scipy.stats
# df_merge_clean is the cleaned DataFrame prepared beforehand; 'SalePrice' is its target column
# Get all the numerical features except the target feature
numerical_cols = [cname for cname in df_merge_clean.select_dtypes(include='number').columns if cname != 'SalePrice']
# Create a dataframe named skew_df with one row per numerical feature
skew_df = pd.DataFrame(numerical_cols, columns=['Feature'])
# Create a feature named Skew in skew_df (scipy.stats.skew returns NaN if the column has missing values)
skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: scipy.stats.skew(df_merge_clean[feature]))
# Create a feature named Absolute Skew in skew_df
skew_df['Absolute Skew'] = skew_df['Skew'].abs()
# Create a feature named Skewed in skew_df that flags features with absolute skewness of at least 0.5
skew_df['Skewed'] = skew_df['Absolute Skew'] >= 0.5
skew_df
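Because the snippet above depends on a DataFrame that is not shown here, the following is a minimal self-contained sketch of the same check on a toy dataset. The function name skew_report, the toy column names and the synthetic values are all hypothetical; the 0.5 threshold is the one used above:

```python
import numpy as np
import pandas as pd
import scipy.stats


def skew_report(df, target, threshold=0.5):
    """Return one row per numerical feature with its skewness (hypothetical helper)."""
    numerical_cols = [c for c in df.select_dtypes(include="number").columns if c != target]
    report = pd.DataFrame({"Feature": numerical_cols})
    # Drop missing values first so a single NaN does not turn the result into NaN
    report["Skew"] = [scipy.stats.skew(df[c].dropna()) for c in numerical_cols]
    report["Absolute Skew"] = report["Skew"].abs()
    report["Skewed"] = report["Absolute Skew"] >= threshold
    return report


# Toy data for illustration only (synthetic values, hypothetical column names)
rng = np.random.default_rng(42)
toy_df = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.8, size=1000),  # right-skewed
    "OverallQual": rng.normal(loc=6.0, scale=1.0, size=1000),  # roughly symmetric
    "Neighborhood": ["A", "B"] * 500,                          # non-numerical, ignored
    "SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=1000),  # target, excluded
})

report = skew_report(toy_df, target="SalePrice")
print(report)
```

In this sketch the log-normal column should be flagged as skewed and the normal column should not, while the text column and the target are left out of the report.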
