Feature Transformation Techniques using Python

Rohan Paris
Apr 2, 2022

Table of contents:
1. Log transformation
2. Reciprocal transformation
3. Square root transformation
4. Exponential transformation
5. Box-Cox transformation

Feature transformation is an important step in the feature engineering of numeric data, and it is commonly used to handle skewed data. Machine learning and deep learning algorithms depend heavily on the quality of their input data. If the data quality is poor, even high-performance algorithms are of little use.

Please refer to the following blog to get a better understanding of data skewness.

To understand feature transformation, let us consider the Pima Indians Diabetes dataset from UCI Machine Learning. This dataset has eight independent numeric features and one dependent feature. The target feature, ‘Outcome’, is distributed as follows:

The distribution of the numeric features is as follows:

From the above histograms, we can observe that several of the features are right-skewed.
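Skewness can also be read off as a number instead of judged by eye. The sketch below uses pandas' `DataFrame.skew()` on a small synthetic frame; the column names and the generated values are assumptions for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the diabetes dataset): one right-skewed
# column and one roughly symmetric column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=3.0, sigma=0.6, size=1000),
    "symmetric": rng.normal(loc=50.0, scale=10.0, size=1000),
})

# DataFrame.skew() returns the sample skewness of each numeric column.
# Values well above 0 indicate a long right tail; values near 0 indicate
# a roughly symmetric distribution.
skewness = demo.skew()
print(skewness)
```

Calling `skew()` on the real DataFrame in the same way would give one value per feature, which makes it easier to decide which features to transform first.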

Now we will apply some feature transformation techniques and try to bring the distribution of the ‘Age’ feature closer to a normal distribution.

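We will use the following function to plot a histogram and a normal Q-Q plot for a feature:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stat

def plot_data(df, feature):
    plt.figure(figsize=(10, 6))
    plt.subplot(1, 2, 1)  # left panel: histogram
    df[feature].hist()
    plt.subplot(1, 2, 2)  # right panel: Q-Q plot against a normal distribution
    stat.probplot(df[feature], dist='norm', plot=plt)
    plt.show()

‘Age’ feature before applying any transformation technique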
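The Q-Q plot can be complemented with a single number. When `probplot` is called without a plot target, it also returns the correlation coefficient `r` of the line fitted through the Q-Q points, and values closer to 1 mean the points lie closer to a straight line. The sketch below is a minimal illustration on synthetic data; the helper name `qq_correlation` is hypothetical and not part of the original notebook.

```python
import numpy as np
import scipy.stats as stat

rng = np.random.default_rng(42)
# Synthetic samples for illustration only
skewed = rng.lognormal(mean=3.0, sigma=0.6, size=1000)
normal_like = rng.normal(loc=0.0, scale=1.0, size=1000)

def qq_correlation(values):
    """Return r for the least-squares line fitted to the normal Q-Q plot."""
    # probplot returns ((theoretical quantiles, ordered values),
    #                   (slope, intercept, r)) when no plot target is given.
    (_, _), (_, _, r) = stat.probplot(values, dist="norm")
    return r

r_skewed = qq_correlation(skewed)
r_normal = qq_correlation(normal_like)
print(f"skewed sample: r = {r_skewed:.4f}")
print(f"normal sample: r = {r_normal:.4f}")
```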

1. Logarithmic Transformation

df_data['Age_log'] = np.log(df_data['Age'])
plot_data(df_data, 'Age_log')

2. Reciprocal Transformation

df_data['Age_reciprocal'] = 1 / df_data.Age
plot_data(df_data, 'Age_reciprocal')
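One property of the reciprocal transformation worth keeping in mind is that it reverses the ordering of positive values: the largest original value becomes the smallest transformed value. A small sketch with made-up ages (not taken from the dataset) illustrates this:

```python
import numpy as np
import pandas as pd

# Illustrative, strictly positive values only
ages = pd.Series([21, 30, 45, 60, 81])
reciprocal = 1 / ages

# Increasing input produces decreasing output.
print(reciprocal.is_monotonic_decreasing)

# Applying the reciprocal again recovers the original values.
recovered = 1 / reciprocal
print(np.allclose(recovered, ages))
```

This matters when interpreting a model afterwards, because the sign of the relationship between the transformed feature and the target is flipped.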

3. Square Root Transformation

df_data['Age_sqrt'] = df_data.Age ** (1 / 2)
plot_data(df_data, 'Age_sqrt')

4. Exponential Transformation

# Raises the feature to a fractional power (1/1.2); this is a power transform, not e**x
df_data['Age_exponential'] = df_data.Age ** (1 / 1.2)
plot_data(df_data, 'Age_exponential')

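5. Box-Cox Transformation

# boxcox returns the transformed values and the fitted lambda; the input must be strictly positive
df_data['Age_Boxcox'], parameters = stat.boxcox(df_data['Age'])
plot_data(df_data, 'Age_Boxcox')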
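To make the second return value of `boxcox` less opaque, the sketch below fits Box-Cox on synthetic positive data, reproduces the transform by hand from the fitted lambda, and then inverts it with `scipy.special.inv_boxcox`. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

rng = np.random.default_rng(7)
# Synthetic, strictly positive sample (Box-Cox rejects values <= 0)
positive = rng.lognormal(mean=3.0, sigma=0.5, size=1000)

# With no lambda supplied, boxcox estimates it by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(positive)

# Box-Cox definition: (x**lambda - 1) / lambda, or log(x) when lambda == 0
if fitted_lambda != 0:
    manual = (positive ** fitted_lambda - 1) / fitted_lambda
else:
    manual = np.log(positive)

# inv_boxcox maps transformed values back to the original scale.
restored = inv_boxcox(transformed, fitted_lambda)

print(f"fitted lambda: {fitted_lambda:.4f}")
```

Keeping the fitted lambda around is useful because the same value has to be reused to transform new data or to map values back to the original scale.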

Each feature should be tried with several of these transformations to find the one that works best for it.
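One way to organise that experimentation is to apply every candidate transformation to a feature and compare the resulting skewness. The sketch below does this for a single synthetic, strictly positive feature; the data, the dictionary keys, and the choice of absolute skewness as the selection criterion are assumptions for illustration, not the method used in the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic, strictly positive, right-skewed feature
feature = pd.Series(rng.lognormal(mean=3.4, sigma=0.35, size=1000))

# The same five transformations covered above, plus the untransformed data
candidates = {
    "original": feature,
    "log": np.log(feature),
    "reciprocal": 1 / feature,
    "square_root": feature ** (1 / 2),
    "power_1_over_1.2": feature ** (1 / 1.2),
    "boxcox": pd.Series(stats.boxcox(feature.to_numpy())[0]),
}

# Sample skewness of each candidate; closer to 0 means more symmetric
skew_table = pd.Series({name: s.skew() for name, s in candidates.items()})
best = skew_table.abs().idxmin()

print(skew_table.abs().sort_values())
print("least skewed:", best)
```

Skewness is only one criterion. Looking at the histogram and Q-Q plot for the top candidates is still worthwhile before settling on a transformation.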

If we apply the log or reciprocal transformation to a feature that contains 0, we run into an error.

This happens because log(0) and 1/0 are not defined. NumPy and pandas return -inf or inf for these values (with a warning in the case of np.log), and the error typically surfaces when those infinite values reach the plotting step.

To avoid this, use ‘DataFrame.describe()’ to check the minimum and maximum values of each feature. If the minimum is 0, add 1 to the feature before transforming it:

df_data['Pregnancies_log'] = np.log(df_data['Pregnancies'] + 1)
plot_data(df_data, 'Pregnancies_log')
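NumPy also provides `np.log1p`, which computes log(1 + x) directly, and `np.expm1`, which reverses it. The sketch below checks for zeros first and then shows that `np.log1p` matches the manual '+1' approach; the values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative count-like values that include a zero
counts = pd.Series([0, 1, 2, 5, 17])

# Check whether a plain log would hit an undefined value.
has_non_positive = (counts <= 0).any()
print("contains values <= 0:", has_non_positive)

# Manual shift and the built-in equivalent
shifted_log = np.log(counts + 1)
log1p_values = np.log1p(counts)

# expm1 computes exp(x) - 1, which undoes log1p.
restored = np.expm1(log1p_values)

print(log1p_values)
```

Note that adding 1 only helps when the minimum is 0 or greater. A feature with values of -1 or below would still produce undefined results.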

Link to the notebook:
