Top 3% rank: Kaggle House Prices - Advanced Regression Techniques (Using a Bagging Ensemble)

Rohan Paris
9 min read · Mar 15, 2022


Kaggle Dataset link: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Github Solution link:

Problem statement: Predict the final sale price of each home using the dataset provided.

Approach:
1. Load the required libraries
2. Load the provided train and test datasets
3. Exploratory Data Analysis
4. Merge train and test data for data preprocessing
5. Handle categorical and numerical missing values
6. Feature Engineering
7. Feature Transformation
8. Encode categorical features
9. Split train and test data
10. Target feature encoding
11. Build a Baseline model and evaluate it using K Fold cross-validation
12. Bagging Ensemble
13. Submit test data prediction

Load the required libraries

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
%matplotlib inline
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

Load the provided train and test datasets

#Train data
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_train.head()

#Test data
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
df_test.head()

df_train.shape, df_test.shape

Observation:

The train data has 1460 rows and 81 columns
The test data has 1459 rows and 80 columns

Exploratory Data Analysis

#distribution of values in target feature
sns.distplot(df_train.get("SalePrice"), kde=False)
plt.show()

Observation: The dependent feature 'SalePrice' is right-skewed. We will later have to apply a log transformation to this feature (see the quick check below).
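As a quick, illustrative check (a minimal sketch, not part of the original notebook), we can quantify the skew with scipy.stats.skew and see how much a log transform reduces it:

# Illustrative only: compare the skewness of SalePrice before and after a log transform
raw_skew = scipy.stats.skew(df_train['SalePrice'])
log_skew = scipy.stats.skew(np.log(df_train['SalePrice']))
print(f"Skewness of SalePrice: {raw_skew:.2f}")        # strongly positive (right-skewed)
print(f"Skewness of log(SalePrice): {log_skew:.2f}")   # much closer to 0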

# find outliers in all the numerical features (before handling missing values)
numerical_df = df_train.select_dtypes(exclude=['object'])
numerical_df = numerical_df.drop(["Id"], axis=1)

for column in numerical_df:
    plt.figure(figsize=(16, 4))
    sns.set_theme(style="whitegrid")
    sns.boxplot(x=numerical_df[column])

Observation: From the above plots it can be inferred that there are outliers in our data. We will not delete the rows containing the outliers, as that might discard important information. We will mitigate their impact later using K-Fold cross-validation.

#features present in train data that are not present in test data
feature_train_not_test = [col for col in df_train.columns if col not in df_test.columns and col != 'SalePrice']

print(feature_train_not_test)

Observation: The test data has all the features that are present in the train data (excluding the target feature 'SalePrice').

#features present in test data that are not present in train data

feature_test_not_train = [col for col in df_test.columns if col not in df_train.columns]

print(feature_test_not_train)

Observation: The train data has all the features that are present in the test data.

Merge train and test data for data preprocessing

Instead of applying the data-preprocessing steps to the train data and then repeating the same steps on the test data, we will merge the two datasets, perform the data preprocessing and feature encoding once, and then split them again before building a model.

We will combine the two datasets row-wise and add an extra feature named 'ind' which can be used later when splitting the two datasets apart.

#combine train and test data for data preprocessing

df_merge=pd.concat([df_test.assign(ind="test"), df_train.assign(ind="train")])

df_merge.head()

The merged data now has 2919 rows and 82 columns.

Handle categorical and numerical missing values

#Function to get count of missing values in each column
def get_cols_with_missing_values(DataFrame):
    missing_na_columns = (DataFrame.isnull().sum())
    return missing_na_columns[missing_na_columns > 0]

print(get_cols_with_missing_values(df_merge))

i. Impute missing categorical features

On exploring 'data_description.txt' (provided with the dataset), we can see that there are a few categorical features that represent a rating or quality.

#Get a list of all the categorical features that have the keyword 'Qual' OR 'Cond' OR 'Qu' OR 'QC' in the feature name
feature_rating_Qual = [col for col in df_merge.columns if 'Qual' in col and df_merge[col].dtypes=='object']
feature_rating_Cond = [col for col in df_merge.columns if 'Cond' in col and col not in ['Condition1', 'Condition2', 'SaleCondition'] and df_merge[col].dtypes=='object']
feature_rating_Qu = [col for col in df_merge.columns if 'Qu' in col and df_merge[col].dtypes=='object' and col not in feature_rating_Qual]
feature_rating_QC = [col for col in df_merge.columns if 'QC' in col and df_merge[col].dtypes=='object']

cat_feature_with_rating = feature_rating_Qual + feature_rating_Cond + feature_rating_Qu + feature_rating_QC
for x in cat_feature_with_rating:
    print(x)

Categorical features with ratings

#Categorical features that have NA as a valid value

cat_feature_with_legit_na = ['Alley', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageType', 'GarageFinish', 'Fence', 'MiscFeature']
df_merge[cat_feature_with_legit_na].head()
ordinal_cat_features = cat_feature_with_rating + cat_feature_with_legit_na
df_merge[ordinal_cat_features].head()

We will handle the missing values in the 'ordinal_cat_features' by replacing the NaN values with the keyword 'Missing'. We do this because a missing value in these features is meaningful. For example, a NaN value in the feature 'GarageType' means that the house has no garage.

#On checking the data description, a missing value (NA) is valid in some categorical features

#Handling Missing Values in Ordinal Categorical features by replacing them with 'Missing' keyword

df_merge[ordinal_cat_features] = df_merge[ordinal_cat_features].fillna("Missing")
#Making sure the missing values have been handled
print(get_cols_with_missing_values(df_merge[ordinal_cat_features]))

We will replace the missing values in the remaining categorical features with the mode.

categorical_cols = [cname for cname in df_merge.columns if df_merge[cname].dtypes=='object' and cname!='ind']
remaining_cat_cols = [cname for cname in categorical_cols if cname not in ordinal_cat_features]

#Handling Missing Values in Categorical features by replacing them with the feature mode value
for col in remaining_cat_cols:
    df_merge[col] = df_merge[col].fillna(df_merge[col].mode()[0])

ii. Impute missing numerical features

#Handling Missing Values in Numerical features by replacing them with the Mean value
numerical_cols = [cname for cname in df_merge.columns if df_merge[cname].dtypes!='object']
df_merge[numerical_cols] = df_merge[numerical_cols].fillna(df_merge[numerical_cols].mean())

We will have to one-hot encode the non-ordinal categorical features, so we will keep only the features with cardinality (number of unique values) less than 10.

#Select categorical columns with low cardinality
categorical_cols=[cname for cname in df_merge.columns if df_merge[cname].dtypes=='object' and df_merge[cname].nunique()<10]
numerical_cols=[cname for cname in df_merge.columns if df_merge[cname].dtypes!='object']

# Keep selected columns only
my_cols = numerical_cols + categorical_cols

df_merge_clean = df_merge[my_cols].copy()

Feature Engineering

#Drop 'Id' feature
df_merge_clean.drop('Id', axis=1, inplace=True)

We have dropped the 'Id' feature as it is not required.

Handling features with Year in it:

df_merge_clean['GarageYrBlt'] = df_merge_clean['GarageYrBlt'].astype('int')
df_merge_clean['GarageYrBlt'] = df_merge_clean['YrSold'] - df_merge_clean['GarageYrBlt']
df_merge_clean['YearBuilt'] = df_merge_clean['YrSold'] - df_merge_clean['YearBuilt']
df_merge_clean['YearRemodAdd'] = df_merge_clean['YrSold'] - df_merge_clean['YearRemodAdd']

df_merge_clean.drop(["YrSold"], axis=1, inplace=True)
df_merge_clean.drop(["MoSold"], axis=1, inplace=True)

Each year feature is converted into an age (years elapsed until the sale), which is more informative than the raw year.

Handling features with square footage:

#TotalBsmtSF (Total square feet of basement area) = BsmtFinSF1 (Type 1 finished square feet) + BsmtFinSF2 (Type 2 finished square feet) + BsmtUnfSF (Unfinished square feet of basement area)
df_merge_clean.drop(["TotalBsmtSF"], axis=1, inplace=True)

#Basement finished area
df_merge_clean['BsmtFinSF'] = df_merge_clean['BsmtFinSF1'] + df_merge_clean['BsmtFinSF2']
df_merge_clean.drop(["BsmtFinSF1"], axis=1, inplace=True)
df_merge_clean.drop(["BsmtFinSF2"], axis=1, inplace=True)
#Total floor square feet
df_merge_clean['TotalFlrSF'] = df_merge_clean['1stFlrSF'] + df_merge_clean['2ndFlrSF']

df_merge_clean.drop(["1stFlrSF"], axis=1, inplace=True)
df_merge_clean.drop(["2ndFlrSF"], axis=1, inplace=True)

Handling Bathroom features:

df_merge_clean['Total_Bath'] = (df_merge_clean['FullBath']
                                + (0.5 * df_merge_clean['HalfBath'])
                                + df_merge_clean['BsmtFullBath']
                                + (0.5 * df_merge_clean['BsmtHalfBath']))

df_merge_clean.drop(["FullBath"], axis=1, inplace=True)
df_merge_clean.drop(["HalfBath"], axis=1, inplace=True)
df_merge_clean.drop(["BsmtFullBath"], axis=1, inplace=True)
df_merge_clean.drop(["BsmtHalfBath"], axis=1, inplace=True)

Feature Transformation

numerical_cols = [cname for cname in df_merge_clean.columns if df_merge_clean[cname].dtypes!='object' and cname!='SalePrice']

skew_df = pd.DataFrame(numerical_cols, columns=['Feature'])
skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: scipy.stats.skew(df_merge_clean[feature]))
skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
skew_df['Skewed'] = skew_df['Absolute Skew'].apply(lambda x: True if x >= 0.5 else False)
skew_df
df_merge_clean[numerical_cols].describe()

There are a few numerical features whose minimum value is 0. We cannot apply a plain log transformation here because log(0) is undefined (it tends to negative infinity). So we will apply a log1p transformation, i.e. log(1 + x), which maps 0 to 0.
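To illustrate the difference (a small check only, not part of the original notebook):

# Illustrative comparison of log vs. log1p at zero
print(np.log1p(0))      # 0.0 — safe for features whose minimum is 0
print(np.log1p(100))    # ~4.615, close to np.log(100) for larger values
# np.log(0) would evaluate to -inf and raise a runtime warning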

#Apply log1p transformation to the skewed features
for column in skew_df.query("Skewed == True")['Feature'].values:
    df_merge_clean[column] = np.log1p(df_merge_clean[column])

Encode categorical features

#Before encoding - features with rating
df_merge_clean[cat_feature_with_rating]
for col in cat_feature_with_rating:
    if 'Missing' in df_merge_clean[col].value_counts().index:
        df_merge_clean[col] = df_merge_clean[col].map({"Missing":0,"Po":1,"Fa":2,"TA":3,"Gd":4,"Ex":5})
    else:
        df_merge_clean[col] = df_merge_clean[col].map({"Po":1,"Fa":2,"TA":3,"Gd":4,"Ex":5})

In features where a missing value is applicable, we map 'Missing' to 0.
The rating values are mapped as:
'Po' (poor): 1, 'Fa' (fair): 2, 'TA' (average): 3, 'Gd' (good): 4, 'Ex' (excellent): 5

#After encoding - features with rating
df_merge_clean[cat_feature_with_rating]

Now, we will encode the categorical features that have a valid missing value and follow an order.

#features with legit na
df_merge_clean[cat_feature_with_legit_na]
#Exclude 'Alley', 'MiscFeature' and 'GarageType' features as they are not ordinal
df_merge_clean['BsmtExposure'] = df_merge_clean['BsmtExposure'].map({"Missing":0,"No":1,"Mn":2,"Av":3,"Gd":4}).astype('int')
df_merge_clean['BsmtFinType1'] = df_merge_clean['BsmtFinType1'].map({"Missing":0,"Unf":1,"LwQ":2,"Rec":3,"BLQ":4,"ALQ":5,"GLQ":6}).astype('int')
df_merge_clean['BsmtFinType2'] = df_merge_clean['BsmtFinType2'].map({"Missing":0,"Unf":1,"LwQ":2,"Rec":3,"BLQ":4,"ALQ":5,"GLQ":6}).astype('int')
df_merge_clean['GarageFinish'] = df_merge_clean['GarageFinish'].map({"Missing":0,"Unf":1,"RFn":2,"Fin":3}).astype('int')
df_merge_clean['Fence'] = df_merge_clean['Fence'].map({"Missing":0,"MnWw":1,"GdWo":2,"MnPrv":3,"GdPrv":4}).astype('int')

There are other ordinal features where a rank can be applied:

df_merge_clean['LotShape'] = df_merge_clean['LotShape'].map({"IR3":1,"IR2":2,"IR1":3,"Reg":4}).astype('int')
df_merge_clean['LandContour'] = df_merge_clean['LandContour'].map({"Low":1,"Bnk":2,"HLS":3,"Lvl":4}).astype('int')
df_merge_clean['Utilities'] = df_merge_clean['Utilities'].map({"ELO":1,"NoSeWa":2,"NoSewr":3,"AllPub":4}).astype('int')
df_merge_clean['LandSlope'] = df_merge_clean['LandSlope'].map({"Sev":1,"Mod":2,"Gtl":3}).astype('int')
df_merge_clean['CentralAir'] = df_merge_clean['CentralAir'].map({"N":0,"Y":1}).astype('int')
df_merge_clean['PavedDrive'] = df_merge_clean['PavedDrive'].map({"N":0,"P":1,"Y":2}).astype('int')

Now we will apply encoding on the remaining categorical features

cat_remaining_to_encode = [col for col in df_merge_clean.columns if df_merge_clean[col].dtypes=='object' and col!='ind']
print(cat_remaining_to_encode)

df_merge_clean_dummies = pd.get_dummies(df_merge_clean[cat_remaining_to_encode], drop_first=True)
df_merge_clean.drop(cat_remaining_to_encode, axis=1, inplace=True)
df_merge_clean = pd.concat([df_merge_clean, df_merge_clean_dummies], axis=1)

Split train and test data

#Split the merged data back into test and train (copies avoid chained-assignment warnings)
test = df_merge_clean[df_merge_clean["ind"].eq("test")].copy()
train = df_merge_clean[df_merge_clean["ind"].eq("train")].copy()

test.drop(["SalePrice", "ind"], axis=1, inplace=True)
train.drop(["ind"], axis=1, inplace=True)

Target feature encoding

log_target = np.log(train['SalePrice'])

train.drop(["SalePrice"], axis=1, inplace=True)

Baseline Model

Once our data is ready, we can fit a model on it and generate test predictions. This gives us a baseline score, which we then have to improve in subsequent versions. A minimal baseline sketch is shown below.
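A minimal baseline sketch, assuming a plain XGBRegressor with default settings and the same 10-fold RMSE evaluation used later (illustrative only, not the exact model behind the final score):

# Baseline sketch: default XGBRegressor evaluated with 10-fold cross-validation
baseline_model = XGBRegressor(random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
baseline_rmse = np.sqrt(-cross_val_score(baseline_model, train, log_target,
                                         scoring='neg_mean_squared_error', cv=kf))
print("Baseline RMSE (log scale): mean = %.4f, std = %.4f"
      % (baseline_rmse.mean(), baseline_rmse.std()))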

Bagging Ensemble

catboost_params = {
    'iterations': 5000,
    'learning_rate': 0.02,
    'depth': 4,
    'eval_metric': 'RMSE',
    'early_stopping_rounds': 20
}

xgboost_params = {
    'n_estimators': 5000,
    'learning_rate': 0.02,
    'colsample_bytree': 0.5,
    'subsample': 0.5,
    'min_child_weight': 2,
    'early_stopping_rounds': 20
}

I have used the CatBoost and XGBoost algorithms to get the predictions, and I tuned the parameters above manually to get my best score.

Instead of manually tuning the hyperparameters, we can first use 'RandomizedSearchCV' to narrow down a range of parameters and then use 'GridSearchCV' to find the best values within that range (a sketch is shown below).

Using this technique causes the program to run for a much longer time.
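As an illustration only (not part of the original notebook), a two-stage search over XGBoost parameters could look like the sketch below; the parameter ranges are hypothetical:

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Stage 1: broad randomized search to narrow down the ranges (hypothetical grid)
random_grid = {
    'n_estimators': [1000, 3000, 5000],
    'learning_rate': [0.01, 0.02, 0.05],
    'subsample': [0.5, 0.7, 1.0],
    'colsample_bytree': [0.5, 0.7, 1.0],
}
random_search = RandomizedSearchCV(XGBRegressor(), random_grid, n_iter=20,
                                   scoring='neg_mean_squared_error', cv=5, random_state=0)
random_search.fit(train, log_target)

# Stage 2: finer grid search around the best randomized-search values
best = random_search.best_params_
fine_grid = {
    'n_estimators': [best['n_estimators']],
    'learning_rate': [best['learning_rate'] / 2, best['learning_rate'], best['learning_rate'] * 2],
    'subsample': [best['subsample']],
    'colsample_bytree': [best['colsample_bytree']],
}
grid_search = GridSearchCV(XGBRegressor(), fine_grid,
                           scoring='neg_mean_squared_error', cv=5)
grid_search.fit(train, log_target)
print(grid_search.best_params_)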

Bag the model using the two algorithms:

models = {
    "catboost": CatBoostRegressor(**catboost_params, verbose=0),
    "xgb": XGBRegressor(**xgboost_params, verbose=0)
}

The '**' operator unpacks the parameter dictionaries defined above into keyword arguments.
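For example, with a small hypothetical dictionary:

sample_params = {'depth': 4, 'learning_rate': 0.02}
CatBoostRegressor(**sample_params)  # equivalent to CatBoostRegressor(depth=4, learning_rate=0.02)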

for name, model in models.items():
    model.fit(train, log_target)
    print(name + " trained.")

Instead of applying a single train-test split to our data and then fitting the model, we will use K-Fold cross-validation on the train data (this will also help us deal with the outliers in the numerical features that we saw during the EDA).

results = {}
kf = KFold(n_splits=10)

for name, model in models.items():
    result = np.exp(np.sqrt(-cross_val_score(model, train, log_target, scoring='neg_mean_squared_error', cv=kf)))
    results[name] = result

Evaluate the model:

for name, result in results.items():
    print("----------\n" + name)
    print(np.mean(result))
    print(np.std(result))

#Combine predictions
final_predictions = (
    0.5 * np.exp(models['catboost'].predict(test)) +
    0.5 * np.exp(models['xgb'].predict(test))
)

Submit test data prediction

# Save test predictions to file
output = pd.DataFrame({'Id': test.index + 1461,
                       'SalePrice': final_predictions})
output.to_csv('submission.csv', index=False)

We have used test.index + 1461 because the Ids in the sample_submission file start from 1461 (the test set continues the train set's Id numbering).
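As an alternative (a small variation on the code above, not the author's original), the Ids could be taken directly from the raw test file instead of relying on the index offset:

# Alternative: use the Ids from the raw test file rather than an index offset
output = pd.DataFrame({'Id': df_test['Id'],
                       'SalePrice': final_predictions})
output.to_csv('submission.csv', index=False)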

Leaderboard ranking -

References:

1. Gabriel Atkin YouTube channel (https://www.youtube.com/watch?v=zwYHloLXH0c&t=2376s)

2. Krish Naik YouTube channel (https://www.youtube.com/watch?v=ioN1jcWxbv8)
