What do you mean by skewed data?
Skewness measures the asymmetry of a distribution: how far a set of data deviates from a symmetrical shape such as the bell curve of the normal distribution. If one tail of the curve stretches further to the left or to the right than the other, the data is said to be skewed. A perfectly symmetrical distribution has a skewness of 0.
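One common way to put a number on this asymmetry is the moment-based coefficient: the third central moment divided by the second central moment raised to the power 1.5. The sketch below is a plain-Python illustration of that formula; the helper name sample_skewness and the two small samples are made up for the example.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]        # mirror image around its mean
right_tailed = [1, 2, 2, 3, 10]    # one large value stretches the right tail

print(sample_skewness(symmetric))      # zero for a symmetric sample
print(sample_skewness(right_tailed))   # positive for a right tail
```

This is the same quantity that scipy.stats.skew returns with its default settings (bias=True), so the sign convention here carries over to the SciPy check later in the article.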
We can plot the distribution of numerical data using a histogram.
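A histogram is just a count of how many values fall into each bin. As a rough sketch with no plotting dependency, the snippet below bins a synthetic right-skewed sample and prints the counts as text bars. The helper text_histogram, the bin count, and the exponential sample are all assumptions made for illustration; a plotting library would draw the same counts as a chart.

```python
import random


def text_histogram(values, bins=8, width=40):
    """Count values per equal-width bin and render each count as a text bar."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # avoid a zero-width bin if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin index `bins`; clamp it into the last bin
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(c / peak * width)
        lines.append(f"{lo + i * step:8.2f} | {bar} {c}")
    return counts, lines


random.seed(0)
# Synthetic sample from an exponential distribution (long tail on the right)
sample = [random.expovariate(1.0) for _ in range(1000)]
counts, lines = text_histogram(sample)
print("\n".join(lines))
```

The tallest bar sits at the left edge and the bars shrink toward the right, which is the shape described as right-skewed in the next section.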
Right-skewed data:
A histogram whose long tail extends to the right of the graph’s peak, with most of the data clustered on the left, is known as a right-skewed histogram. It is also known as a positively skewed histogram.
For right-skewed data, typically mean > median > mode
Wealth distribution is a classic example of this type of data: most people hold modest wealth, while the very rich sit in the long tail at the far right of the graph.
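The ordering of mean, median, and mode for right-skewed data can be checked on a small sample with Python's standard statistics module. The wealth figures below are invented purely for illustration (net worth in thousands); they are not real data.

```python
import statistics

# Hypothetical net worth of ten people, in thousands (made-up values)
wealth = [20, 30, 30, 30, 40, 40, 50, 60, 120, 900]

mean = statistics.mean(wealth)
median = statistics.median(wealth)
mode = statistics.mode(wealth)

print("mean:", mean, "median:", median, "mode:", mode)
# The single very large value pulls the mean far above the median and mode
print("mean > median > mode:", mean > median > mode)
```

Only the mean reacts strongly to the one extreme value, which is why it ends up furthest to the right.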
Left-skewed data:
A histogram whose long tail extends to the left of the graph’s peak, with most of the data clustered on the right, is known as a left-skewed histogram. It is also known as a negatively skewed histogram.
For left-skewed data, typically mode > median > mean
Human lifespan (age at death) is an example of a left-skewed curve: most people die at older ages, while the relatively few early deaths form a long tail on the left.
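The reverse ordering for left-skewed data can be checked the same way. The ages below are invented for illustration only, with one early death forming the left tail.

```python
import statistics

# Hypothetical ages at death for ten people (made-up values)
ages = [35, 60, 68, 72, 75, 78, 80, 80, 80, 82]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

print("mean:", mean_age, "median:", median_age, "mode:", mode_age)
# The low outlier drags the mean below the median and mode
print("mode > median > mean:", mode_age > median_age > mean_age)
```

Here the mean is the smallest of the three, because it is the statistic most affected by the low value in the left tail.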
Check for skewness using Python’s SciPy library:
import pandas as pd
import scipy.stats

# Get all the numerical features except the target feature
numerical_cols = [cname for cname in df_merge_clean.columns if df_merge_clean[cname].dtype != 'object' and cname != 'SalePrice']
# Create a dataframe named skew_df
skew_df = pd.DataFrame(numerical_cols, columns=['Feature'])
# Create a feature named Skew in skew_df
skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: scipy.stats.skew(df_merge_clean[feature]))
# Create a feature named Absolute Skew in skew_df
skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
# Create a feature named Skewed in skew_df, classifying features with absolute skewness of at least 0.5 as skewed
skew_df['Skewed'] = skew_df['Absolute Skew'] >= 0.5
# Display the result
skew_df
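To try the check without the original dataset, the same steps can be wrapped in a function and run on a toy table. This is a sketch under assumptions: the function name skew_report, the frame df_demo, and its column values are hypothetical. It also swaps the dtype filter for select_dtypes(include="number"), which is stricter than comparing against 'object', and drops missing values before calling scipy.stats.skew so that a NaN does not turn the whole result into NaN.

```python
import pandas as pd
import scipy.stats


def skew_report(df, target, threshold=0.5):
    """Return one row per numeric feature with its skewness and a skewed flag."""
    numeric_cols = [c for c in df.select_dtypes(include="number").columns if c != target]
    rows = []
    for col in numeric_cols:
        # Drop missing values first; skew() propagates NaN by default
        s = float(scipy.stats.skew(df[col].dropna()))
        rows.append({
            "Feature": col,
            "Skew": s,
            "Absolute Skew": abs(s),
            "Skewed": abs(s) >= threshold,
        })
    return pd.DataFrame(rows)


# Hypothetical toy data (made-up values)
df_demo = pd.DataFrame({
    "LotArea": [1, 2, 2, 3, 3, 3, 4, 50],        # long right tail
    "Quality": [1, 2, 3, 4, 5, 6, 7, 8],         # symmetric
    "Street": list("abababab"),                  # non-numeric, ignored
    "SalePrice": [10, 12, 11, 13, 12, 14, 15, 90],  # target, excluded
})

report = skew_report(df_demo, target="SalePrice")
print(report)
```

In this toy table only LotArea is flagged as skewed. The 0.5 cut-off is the one used above and can be adjusted through the threshold argument.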