Feature Transformation Techniques using Python

Apr 2, 2022

Table of contents:
1. Log transformation
2. Reciprocal transformation
3. Square root transformation
4. Exponential transformation
5. Box-Cox transformation

Feature transformation is an important step in the feature engineering of numeric data, and it is commonly used to handle skewed data. Machine learning and deep learning algorithms depend heavily on the quality of their input data: if the data quality is poor, even high-performance algorithms will not give useful results.

For a better understanding of data skewness, please refer to the following blog post:

What do you mean by skewed data? (parisrohan.medium.com)

To understand feature transformation, let us consider the Pima Indians Diabetes dataset from UCI Machine Learning. This dataset has eight independent numeric features and one dependent feature. The target feature, ‘Outcome’, is distributed as follows:

The distribution of the numeric features is as follows:

From the above histograms, we can observe that a few of the features are right-skewed.
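Skewness can also be quantified instead of only being read off the histograms. The following is a minimal sketch that assumes synthetic stand-in data (the column names and values are generated for illustration and are not taken from the Pima dataset). pandas' `DataFrame.skew()` returns a value near 0 for a symmetric column and a positive value for a right-skewed one:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption for illustration, not the Pima dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=1000),
    "right_skewed": rng.lognormal(mean=3, sigma=0.6, size=1000),
})

# Sample skewness per column: near 0 for symmetric data, > 0 for a long right tail
skewness = demo.skew()
print(skewness)
```

On the real data, the same call on the DataFrame would give one skewness value per feature, which makes it easier to decide which features to transform first.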

Now we will apply some feature transformation techniques and try to bring the distribution of the ‘Age’ feature closer to a normal distribution.

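We will use the following function to plot a histogram of a feature next to its Q-Q plot against a normal distribution:

import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stat

def plot_data(df, feature):
    # Left: histogram; right: Q-Q plot against a normal distribution
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)
    df[feature].hist()
    plt.subplot(1, 2, 2)
    stat.probplot(df[feature], dist='norm', plot=plt)
    plt.show()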
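The Q-Q plot can also be summarized with a single number. As a rough sketch, assuming synthetic samples rather than the Pima data, `probplot` called without a plot target returns the Q-Q points together with a least-squares fit, and the correlation coefficient `r` of that fit is closer to 1 when the points lie on a straight line, that is, when the data looks more normal:

```python
import numpy as np
import scipy.stats as stat

# Synthetic samples (assumed for illustration, not the Pima data)
rng = np.random.default_rng(7)
normal_sample = rng.normal(loc=0, scale=1, size=500)
skewed_sample = rng.lognormal(mean=0, sigma=0.8, size=500)

# Without a plot argument, probplot only computes the Q-Q points and a
# least-squares fit; r is the correlation coefficient of that fit
(_, _), (_, _, r_normal) = stat.probplot(normal_sample, dist='norm')
(_, _), (_, _, r_skewed) = stat.probplot(skewed_sample, dist='norm')

print("r for normal sample:", r_normal)
print("r for skewed sample:", r_skewed)
```

This is only a convenience for comparing transformations. It does not replace looking at the plots themselves.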

‘Age’ Feature before applying any transformation technique

1. Logarithmic Transformation

df_data['Age_log'] = np.log(df_data['Age'])
plot_data(df_data, 'Age_log')

2. Reciprocal Transformation

df_data['Age_reciprocal'] = 1 / df_data.Age
plot_data(df_data, 'Age_reciprocal')

3. Square Root Transformation

df_data['Age_sqrt'] = df_data.Age**(1/2)
plot_data(df_data, 'Age_sqrt')

4. Exponential Transformation

# Note: this raises Age to the power 1/1.2, i.e. a power transformation
df_data['Age_exponential'] = df_data.Age**(1/1.2)
plot_data(df_data, 'Age_exponential')

5. Box-Cox Transformation

df_data['Age_Boxcox'], parameters = stat.boxcox(df_data['Age'])
plot_data(df_data, 'Age_Boxcox')
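To make the two return values of `boxcox` clearer, here is a small sketch on synthetic data (the values are generated for illustration and are an assumption, not the ‘Age’ column). When no lambda is passed, SciPy's `boxcox` returns the transformed values together with the fitted lambda. It also requires strictly positive input:

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed values (assumed for illustration)
rng = np.random.default_rng(42)
values = rng.lognormal(mean=3.5, sigma=0.4, size=500)

# With no lmbda given, boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(values)

print("fitted lambda:", fitted_lambda)
print("skew before:", stats.skew(values))
print("skew after:", stats.skew(transformed))
```

The fitted lambda is worth keeping, since the same value has to be reused if the transformation is later applied to new data.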

Each feature needs to be experimented with separately to find the transformation that works best for it.
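One way to run such an experiment is to compute the skewness of every candidate transformation and compare them side by side. The sketch below assumes a synthetic, strictly positive feature (generated for illustration, not a column of the Pima dataset), and the dictionary keys are hypothetical labels:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, strictly positive, right-skewed feature (assumed for illustration)
rng = np.random.default_rng(1)
feature = pd.Series(rng.lognormal(mean=3.4, sigma=0.35, size=768), name="demo")

candidates = {
    "original": feature,
    "log": np.log(feature),
    "reciprocal": 1 / feature,
    "sqrt": feature ** 0.5,
    "power_1/1.2": feature ** (1 / 1.2),
    "boxcox": pd.Series(stats.boxcox(feature.to_numpy())[0]),
}

# Skewness closer to 0 suggests a more symmetric distribution
skew_table = pd.Series({name: s.skew() for name, s in candidates.items()})
print(skew_table.sort_values(key=lambda s: s.abs()))
```

Skewness is only one criterion, so it is still worth checking the histogram and Q-Q plot of the best candidates before settling on one.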

If we apply the log or reciprocal transformation to a feature that contains 0 as a value, we get an error, because log(0) and 1/0 are not defined.

To prevent such errors, it is better to first use ‘DataFrame.describe()’ to check the min and max values of each feature, and then add 1 to the feature values before transforming them so that zeros no longer give an undefined result.

df_data['Pregnancies_log'] = np.log(df_data['Pregnancies'] + 1)
plot_data(df_data, 'Pregnancies_log')
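The effect of the ‘+1’ shift can be checked on a tiny example. This sketch uses a small hypothetical column containing zeros as a stand-in for ‘Pregnancies’. It also shows `np.log1p`, NumPy's built-in function for log(1 + x), which gives the same result as adding 1 by hand:

```python
import numpy as np
import pandas as pd

# Small hypothetical column containing zeros, standing in for 'Pregnancies'
demo = pd.DataFrame({"count": [0, 1, 2, 0, 5, 10]})

# The summary shows min == 0, so a plain log would hit log(0)
print(demo["count"].describe())

with np.errstate(divide="ignore"):
    plain_log = np.log(demo["count"])      # -inf where the value is 0
shifted_log = np.log(demo["count"] + 1)    # finite everywhere
log1p = np.log1p(demo["count"])            # NumPy's built-in log(1 + x)

print(np.isinf(plain_log).sum(), np.isfinite(shifted_log).all())
```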

Link to the notebook:

DataScience_Projects/rp-pima-diabetes-1-eda-and-feature-transformation.ipynb at main ·… (github.com)

References:

Types-Of-Trnasformation/All types Of Feature Transformation.ipynb at main ·… (krishnaik06/Types-Of-Trnasformation on GitHub)