Customer Churn Analysis and Prediction Using Python
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers. A high churn rate means that more customers no longer want to purchase goods and services from the business. The churn rate can be calculated by dividing the number of customers who left during a period by the number of active customers at the start of that period. For example, if you had 1000 customers at the start of last month and lost 50 of them, your monthly churn rate is 5 percent.
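As a quick illustration, the example above can be worked out in a couple of lines of Python (the variable names are purely illustrative):
#Monthly churn rate = customers lost during the period / customers at the start of the period
customers_at_start = 1000
customers_lost = 50
churn_rate = customers_lost / customers_at_start
print(f'Monthly churn rate: {churn_rate:.1%}')  #Monthly churn rate: 5.0%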
The primary objective of building a customer churn predictive model is to retain the customers at the highest risk of churn by proactively engaging with them. For example, offering a gift voucher or promotional pricing can lock them in for an additional year or two and extend their lifetime value to the company.
To get a better understanding of this concept, we will use the Telco customer churn dataset from Kaggle.
1. Dataset description
The dataset contains 7043 rows and 21 features. Each row represents a customer, and each column contains one of the customer's attributes described in the column metadata. The dataset includes information about:
- Customers who left within the last month — the column is called Churn — this is the target feature
- Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers — gender, age range, and if they have partners and dependents
2. Data preprocessing
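Assuming the Kaggle CSV has been loaded into a DataFrame called df_data (the file name below is an assumption), df_data.info() summarizes the data types and non-null counts:
import pandas as pd
import numpy as np

#File name is an assumption; adjust it to wherever the Kaggle CSV is stored
df_data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

#Data types and non-null counts for every feature
df_data.info()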
From the output of df_data.info() we can see that there appear to be no null values in this dataset, as the 'Non-Null Count' for each feature is 7043, equal to the total number of rows available.
2.1. TotalCharges feature:
The data type of the 'TotalCharges' feature is object, so we will try to convert it to float using the code below:
#Convert to float type
df_data['TotalCharges'] = df_data['TotalCharges'].astype('float64')
Trying this raises a conversion error.
This means there must be blank values in this feature. We will find the blank entries using the isspace() string method:
#Index of rows that have a blank space i.e. it is a null value
na_index = df_data[df_data['TotalCharges'].apply(lambda x: x.isspace())].index
print(na_index)
The blank entries can be handled by first replacing the spaces with np.nan and then casting the feature to float.
#Fill the 11 blank values with np.nan
df_data['TotalCharges'] = df_data['TotalCharges'].replace(' ', np.nan)

#Convert to float type
df_data['TotalCharges'] = df_data['TotalCharges'].astype('float64')
The null values can now be handled by imputing them with the median value of the feature.
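A minimal sketch of the median imputation:
#Impute the missing TotalCharges values with the feature median
median_total_charges = df_data['TotalCharges'].median()
df_data['TotalCharges'] = df_data['TotalCharges'].fillna(median_total_charges)

#Verify that no nulls remain
print(df_data['TotalCharges'].isnull().sum())  #0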
2.2. SeniorCitizen feature:
Apart from 'SeniorCitizen', the other categorical features have values like Yes/No, so we will map 0 to 'No' and 1 to 'Yes' for the 'SeniorCitizen' feature.
df_data['SeniorCitizen'] = df_data['SeniorCitizen'].map({0: 'No', 1: 'Yes'})
3. EDA
3.1. Target feature distribution:
The dataset is imbalanced, as the number of non-churning customers far exceeds the number of churning customers. Out of 7043 records, 5174 belong to the 'No' class and the remaining 1869 to the 'Yes' class. A machine-learning algorithm trained on this data will mostly predict the majority 'No' class, which defeats our main objective of identifying the customers who are likely to churn (i.e. the 'Yes' class).
We will later apply resampling techniques to handle this imbalanced data. Please refer to the following blog on how to handle imbalanced datasets.
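The class distribution can be checked with a simple value count:
#Distribution of the target feature
print(df_data['Churn'].value_counts())                 #No: 5174, Yes: 1869
print(df_data['Churn'].value_counts(normalize=True))   #roughly 73% No, 27% Yes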
3.2. Univariate analysis:
Next, we will perform a univariate analysis of the categorical features with respect to the target feature:
import matplotlib.pyplot as plt
import seaborn as sns

#Categorical columns to plot (assumed: all object-type features except the ID and the target)
cat_cols = df_data.select_dtypes(include='object').columns.drop(['customerID', 'Churn'])

#Plotting the impact of categorical features on 'Churn'
plt.figure(figsize=(25,25))
for i, cat in enumerate(cat_cols):
    plt.subplot(6, 3, i+1)
    sns.countplot(data=df_data, x=cat, hue='Churn')
plt.show()
From figure 3 we can draw the following insights:
- Gender seems to play no role in churn.
- Customers with no dependents are more likely to churn than those who have dependents.
From figure 4 we can draw the following insights:
- Customers with fiber optic internet are more likely to churn, while those with DSL are less likely to churn.
- Customers who do not have online security, online backup, device protection, or tech support services are more likely to churn.
From figure 5 we can draw the following insight:
- Customers on a month-to-month contract are more likely to churn as they are not bound by any policy.
3.3. Create a new feature from tenure:
Next, we will create a new feature named 'tenure_grp' by grouping the 'tenure' feature into bins:
df_data['tenure_grp'] = pd.cut(df_data['tenure'], bins=[0, 12, 24, 36, 48, 60, np.inf],
                               labels=['0-12', '13-24', '25-36', '37-48', '49-60', '60+'])
From figure 6 we can infer that customers are most likely to churn within the first 12 months.
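A sketch of the plot behind figure 6, assuming it is a countplot of the tenure groups against the target:
#Churn distribution across the tenure groups
plt.figure(figsize=(10, 5))
sns.countplot(data=df_data, x='tenure_grp', hue='Churn')
plt.show()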
#Relationship between MonthlyCharges and TotalCharges, coloured by Churn
sns.scatterplot(x='MonthlyCharges', y='TotalCharges', data=df_data_dummy, hue='Churn')
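The scatter plot above and the modelling code below use encoded copies of the data, df_data_dummy and df_data_model, whose construction is not shown in the snippets here. A minimal sketch, assuming the identifier column is dropped, the target is mapped to 0/1, and the remaining categorical features are one-hot encoded with pd.get_dummies:
#Drop the identifier column; it carries no predictive signal
df_data_dummy = df_data.drop(columns=['customerID'])

#Encode the target as 0/1 and one-hot encode the remaining categorical features
df_data_dummy['Churn'] = df_data_dummy['Churn'].map({'No': 0, 'Yes': 1})
df_data_model = pd.get_dummies(df_data_dummy)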
4. Model Building
4.1. Compare classification algorithms using PyCaret:
Different classification models were compared using the PyCaret library. PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.
from pycaret.classification import *
clf = setup(data=df_data_model,target='Churn')
best = compare_models()
print(best)
It was found that the AdaBoostClassifier algorithm gives the best results for this data.
4.2. Baseline model:
Now that the algorithm has been finalized, we will separate the independent features and the target into two variables.
X = df_data_model.loc[:, df_data_model.columns != 'Churn']
y = df_data_model['Churn']
The SMOTE technique is used to handle the imbalanced dataset.
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE()
#Resample the data so that both classes are equally represented
X_smote, y_smote = smote.fit_resample(X, y)

print('Original dataset shape', Counter(y))
print('Resample dataset shape', Counter(y_smote))
The resampled data is then split into training and validation sets:
from sklearn.model_selection import train_test_split

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X_smote, y_smote, train_size=0.7,
                                                      test_size=0.3, random_state=0)

# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_valid.shape, y_valid.shape)
#adaboost model training
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, cross_val_score

ada_clf = AdaBoostClassifier(random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=5)  #shuffle=True is required when random_state is set
results = cross_val_score(ada_clf, X_train, y_train, cv=kfold)
print(results.mean())
The cross-validation results give a mean accuracy of 0.8224557058487332.
#train model
ada_clf.fit(X_train, y_train)

#make predictions
y_pred = ada_clf.predict(X_valid)

#metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print('Model accuracy score: ', accuracy_score(y_valid, y_pred))
print('Confusion matrix: ')
print(confusion_matrix(y_valid, y_pred))
print(classification_report(y_valid, y_pred))
4.3. Hyperparameter tuning:
We will now try hyperparameter tuning to improve the model’s performance
#Hyper parameter optimization
from sklearn.model_selection import GridSearchCV

params = {
    "learning_rate": [0.5, 0.7, 1, 1.2, 1.5],
    "n_estimators": [50, 100, 150, 200],
    "algorithm": ['SAMME', 'SAMME.R']
}

ada_clf = AdaBoostClassifier(random_state=1347)
grid_search = GridSearchCV(estimator=ada_clf, param_grid=params, cv=kfold, n_jobs=-1, verbose=0)
grid_search.fit(X_train, y_train)

#Evaluate the tuned model on the validation set
y_pred = grid_search.best_estimator_.predict(X_valid)
print('Model accuracy score: ', accuracy_score(y_valid, y_pred))
print('Confusion matrix: ')
print(confusion_matrix(y_valid, y_pred))
print(classification_report(y_valid, y_pred))
4.4. Deployment:
- The application is dockerized in image rparis97/customer-churn-docker
- Please refer to the following blog to better understand Docker -
Github link to notebook:-