Feature Transformation Techniques using Python
Table of contents:
1. Log transformation
2. Reciprocal transformation
3. Square root transformation
4. Exponential transformation
5. Box-Cox transformation
Feature transformation is an important step in the feature engineering of numeric data, and it is commonly used to handle skewed data. Machine learning and deep learning algorithms depend heavily on the quality of their input data. If the data quality is poor, even high-performance algorithms give poor results.
Please refer to the following blog to get a better understanding of data skewness.
To understand feature transformation, let us consider UCI Machine Learning’s Pima Indians Diabetes dataset. This dataset has eight independent numeric features and one dependent feature. The target feature, ‘Outcome’, is distributed as follows:
The distribution of the numeric features is as follows:
From the above histograms, we can observe that a few of the features are right-skewed.
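Skewness can also be checked numerically rather than only by eye. The following is a minimal sketch that uses synthetic data as a stand-in for the diabetes dataset; the DataFrame `df_demo` and its column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the Pima dataset): one right-skewed
# column and one roughly symmetric column.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "skewed": rng.lognormal(mean=3.0, sigma=0.6, size=1000),
    "symmetric": rng.normal(loc=50.0, scale=10.0, size=1000),
})

# DataFrame.skew() returns the sample skewness of each column.
# Values near 0 suggest symmetry; positive values indicate a longer right tail.
skewness = df_demo.skew()
print(skewness)
```

On the real data, the same call on the feature columns would give one skewness value per feature, which helps decide which features to transform first.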
Now we will apply some feature transformation techniques and try to bring the distribution of the feature ‘Age’ closer to a normal distribution.
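We will use the following function to plot a histogram and a normal Q-Q (probability) plot for a feature:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stat

def plot_data(df, feature):
    # Left panel: histogram; right panel: Q-Q plot against a normal distribution
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    df[feature].hist()
    plt.subplot(1, 2, 2)
    stat.probplot(df[feature], dist='norm', plot=plt)
    plt.show()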
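The Q-Q plot can be complemented by a single number. As a rough sketch on synthetic data (the arrays `normal_like` and `right_skewed` are made up for illustration), `probplot` also returns the correlation coefficient of the straight-line fit through the Q-Q points; the closer it is to 1, the closer the points lie to the line.

```python
import numpy as np
import scipy.stats as stat

rng = np.random.default_rng(42)
normal_like = rng.normal(loc=0.0, scale=1.0, size=500)
right_skewed = rng.exponential(scale=1.0, size=500)

# Without a `plot` argument, probplot draws nothing and returns the Q-Q
# points together with a least-squares fit (slope, intercept, r).
_, (_, _, r_normal) = stat.probplot(normal_like, dist="norm")
_, (_, _, r_skewed) = stat.probplot(right_skewed, dist="norm")

print(f"r (normal-like):  {r_normal:.3f}")
print(f"r (right-skewed): {r_skewed:.3f}")
```

This is only a quick indicator, not a formal normality test, but it makes it easier to compare transformed features without inspecting every plot.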
1. Logarithmic Transformation
# Natural logarithm; requires strictly positive values
df_data['Age_log'] = np.log(df_data['Age'])
plot_data(df_data, 'Age_log')
2. Reciprocal Transformation
# Reciprocal; requires non-zero values
df_data['Age_reciprocal'] = 1 / df_data['Age']
plot_data(df_data, 'Age_reciprocal')
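One property of the reciprocal transformation worth keeping in mind is that it reverses the ordering of positive values. A small sketch with made-up ages (the values below are illustrative and not taken from the dataset):

```python
import pandas as pd

ages = pd.Series([21, 30, 45, 81], name="Age")  # illustrative values only
reciprocal = 1 / ages

# For positive inputs, 1/x is strictly decreasing, so the ranking flips:
# the largest age gets the smallest transformed value.
print(reciprocal.round(4).tolist())
```

If the original ordering matters for interpretation, this flip should be taken into account when reading coefficients or plots of the transformed feature.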
3. Square Root Transformation
df_data['Age_sqrt'] = df_data['Age'] ** (1 / 2)
plot_data(df_data, 'Age_sqrt')
4. Exponential Transformation
# Raises Age to the power 1/1.2 (strictly speaking a power transformation, not e**x)
df_data['Age_exponential'] = df_data['Age'] ** (1 / 1.2)
plot_data(df_data, 'Age_exponential')
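5. Box-Cox Transformation
# boxcox returns the transformed values and the fitted lambda; input must be strictly positive
df_data['Age_Boxcox'], parameters = stat.boxcox(df_data['Age'])
plot_data(df_data, 'Age_Boxcox')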
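The second value returned by `boxcox` is the fitted lambda, which is needed to reverse the transformation later. The sketch below uses a synthetic positive feature `x` (an assumption for illustration, not the real ‘Age’ column) to show the fitted lambda, the log special case and the inverse transform.

```python
import numpy as np
import scipy.stats as stat
from scipy.special import inv_boxcox

rng = np.random.default_rng(7)
# Hypothetical strictly positive feature; Box-Cox requires all values > 0.
x = rng.lognormal(mean=3.4, sigma=0.35, size=768)

# With lmbda omitted, boxcox returns the transformed data and the lambda
# that maximizes the log-likelihood.
x_bc, fitted_lambda = stat.boxcox(x)

# Passing lmbda=0 applies the log transform, a special case of Box-Cox.
x_bc_zero = stat.boxcox(x, lmbda=0)

# inv_boxcox maps transformed values back to the original scale.
x_back = inv_boxcox(x_bc, fitted_lambda)
print(f"fitted lambda: {fitted_lambda:.3f}")
```

Storing the fitted lambda alongside the model is a sensible habit, since the same value has to be reused for new data and for converting results back to the original scale.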
Each feature needs to be experimented with individually to find the transformation that works best for it.
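One possible way to organize this experimentation is to rank the candidate transformations of a feature by how close their skewness is to zero. The helper `compare_transformations` below is a hypothetical function written for this illustration, and it is run on synthetic data; skewness is only one criterion, so the plots remain worth checking.

```python
import numpy as np
import pandas as pd
import scipy.stats as stat

def compare_transformations(series):
    """Return the skewness of several transformations of a positive Series."""
    candidates = {
        "original": series,
        "log": np.log(series),
        "reciprocal": 1 / series,
        "sqrt": np.sqrt(series),
        "power_1_over_1.2": series ** (1 / 1.2),
        "boxcox": pd.Series(stat.boxcox(series.to_numpy())[0], index=series.index),
    }
    skews = pd.Series({name: values.skew() for name, values in candidates.items()})
    # Sort by distance from zero so the most symmetric result comes first.
    return skews.reindex(skews.abs().sort_values().index)

rng = np.random.default_rng(3)
# Synthetic right-skewed, strictly positive feature (illustrative only).
feature = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=1000) + 1.0)
ranking = compare_transformations(feature)
print(ranking)
```

The helper assumes strictly positive input, because the log, reciprocal and Box-Cox transformations are not defined at zero.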
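If we apply a log or reciprocal transformation to a feature that contains 0, the result is not usable.
This happens because log(0) and 1/0 are not defined: NumPy and pandas return -inf or inf for those entries (np.log also emits a divide-by-zero warning), and the infinite values typically cause the subsequent plotting to fail.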
To prevent this, it is better to use ‘DataFrame.describe()’ to check the minimum and maximum values of each feature, and to add 1 to the feature values before transforming when the minimum is 0.
df_data['Pregnancies_log'] = np.log(df_data['Pregnancies'] + 1)
plot_data(df_data, 'Pregnancies_log')
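NumPy provides `np.log1p`, which computes log(1 + x) directly and is equivalent to the ‘+1’ approach. A short sketch on made-up counts (the Series `counts` is illustrative, not the real ‘Pregnancies’ column):

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 1, 3, 10], name="Pregnancies")  # illustrative values

# Check the minimum first: a log or reciprocal needs strictly positive input.
has_zero = counts.min() == 0

# np.log1p(x) computes log(1 + x), so zeros map to 0.0 instead of -inf.
shifted_log = np.log1p(counts)

# np.expm1 is the matching inverse: exp(y) - 1.
restored = np.expm1(shifted_log)
```

Using `np.expm1` for the inverse keeps the round trip consistent with the shift that was applied before taking the log.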