Exploratory Data Analysis on Zomato Dataset

Rohan Paris
4 min read · Feb 26, 2022


Problem Statement: Perform exploratory data analysis (EDA) on the Zomato dataset.

Github Project Link:

Solution:

#import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#load the dataset
df_dataset=pd.read_csv("C:/Users/Admin/Documents/Coding/Jupyter_Projects/Zomato DataSet/zomato.csv")

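While trying to load this dataset, we get an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte. pandas reads files as UTF-8 by default and this file is not valid UTF-8, so we pass the encoding explicitly.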
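The failure can be reproduced on a few bytes without the CSV file. The sketch below uses a made-up byte string (an assumption, not taken from zomato.csv) in which 0xed is a single Latin-1 character. UTF-8 instead treats 0xed as the start of a multi-byte sequence and rejects the byte that follows.

```python
# Hypothetical sample bytes: "Brasília" encoded as Latin-1, where í is the single byte 0xed.
raw = b"Bras\xedlia"

try:
    raw.decode("utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # UTF-8 expects continuation bytes after 0xed but finds the ASCII letter "l".
    error_name = type(exc).__name__
    print(exc)

# Latin-1 maps every byte to exactly one character, so decoding never raises.
decoded = raw.decode("latin1")
print(decoded)
```

Because Latin-1 never fails, it can also silently produce wrong characters when a file is really in some other encoding, so it is worth spot-checking a few non-ASCII values after loading.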

#load the dataset with encoding parameter
df_dataset=pd.read_csv("C:/Users/Admin/Documents/Coding/Jupyter_Projects/Zomato DataSet/zomato.csv", encoding='latin1')
#Check first five rows of the data
df_dataset.head()
#Function to get count of missing values in each column
def get_cols_with_missing_values(DataFrame):
    missing_na_columns=(DataFrame.isnull().sum())
    return missing_na_columns[missing_na_columns > 0]
#Get count of missing values in each column of 'df_dataset'
print(get_cols_with_missing_values(df_dataset))
#Fill missing values in each column with that column's most frequent value
df_dataset_clean= df_dataset.apply(lambda x: x.fillna(x.value_counts().index[0]))
#Check that no columns with missing values remain
print(get_cols_with_missing_values(df_dataset_clean))

Now we have no null values in our dataset.
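To see what the most-common-value fill does, here is a minimal sketch on a toy frame. The column values are invented for illustration, and the `toy.mode()` variant is an alternative spelling rather than the approach used in this article.

```python
import pandas as pd

# Toy data (invented values) with one missing entry per column.
toy = pd.DataFrame({
    "Cuisines": ["Indian", None, "Indian", "Chinese"],
    "City": ["Pune", "Pune", None, "Delhi"],
})

# Same pattern as above: value_counts() sorts by frequency, so index[0] is the most common value.
filled = toy.apply(lambda col: col.fillna(col.value_counts().index[0]))

# Alternative: mode() returns the most frequent value(s) per column; iloc[0] takes the first row.
alt = toy.fillna(toy.mode().iloc[0])

print(filled)
```

Note that `value_counts().index[0]` raises an IndexError for a column that is entirely null, so this pattern assumes every column has at least one value.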

#Get name of all the columns
df_dataset_clean.columns
df_dataset_clean.info()
#load the country code dataset
df_countrycode= pd.read_excel("C:/Users/Admin/Documents/Coding/Jupyter_Projects/Zomato DataSet/Country-Code.xlsx")
df_countrycode.head()
#Join the two loaded dataset on 'Country Code' and save the result in a new dataframe
df_final= pd.merge(df_dataset_clean, df_countrycode, on='Country Code', how='left')
df_final.head()
The merge adds a new column named ‘Country’ to each row.
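The left join can be illustrated on two tiny frames. The rows and code-to-country pairs below are placeholders chosen for the sketch, and `validate` and `indicator` are optional `pd.merge` parameters that were not used in the article.

```python
import pandas as pd

# Placeholder rows standing in for the restaurant data and the country-code lookup.
restaurants = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C"],
    "Country Code": [1, 216, 1],
})
codes = pd.DataFrame({
    "Country Code": [1, 216],
    "Country": ["India", "United States"],
})

# how='left' keeps every restaurant row; validate raises if a code appears twice in the lookup.
merged = pd.merge(restaurants, codes, on="Country Code", how="left",
                  validate="many_to_one", indicator=True)

# Rows marked 'left_only' would be restaurants whose code has no match in the lookup.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```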
#countries in which zomato is present
df_final.Country.value_counts()
#Store the country name
country_names=df_final.Country.value_counts().index
#Store the count value
country_count=df_final.Country.value_counts().values
#Plot a pie chart showing the countries that use Zomato
plt.pie(country_count, labels=country_names, autopct='%1.1f%%', shadow=True)

The above pie chart looks crowded because India covers most of the area, which squeezes the remaining countries into thin slices.

#Plot a pie chart showing the top 3 countries that use Zomato
plt.pie(country_count[:3], labels=country_names[:3], autopct='%1.2f%%', shadow=True)

The ‘autopct’ parameter takes a format string. Setting it to '%1.2f%%' displays each percentage rounded to two decimal places. Setting it to '%1.1f%%' rounds the percentage to one decimal place.
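The format string is ordinary Python %-formatting, so its effect can be checked without drawing a chart. The percentages below are made-up numbers for illustration.

```python
share = 94.3923  # hypothetical percentage value

# '%1.2f' formats a float with two decimal places; '%%' prints a literal percent sign.
two_decimals = "%1.2f%%" % share
one_decimal = "%1.1f%%" % share

# Rounding goes to the nearest value, not always upward.
rounded_down = "%1.1f%%" % 94.34

print(two_decimals, one_decimal, rounded_down)
```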

From the above pie chart, we observe that the top three countries by number of restaurants in the dataset are India, United States and United Kingdom.
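One thing to keep in mind when reading these percentages: plt.pie normalizes whatever values it receives, so a chart of only the top three shows each country's share of the top-three total, not of the whole dataset. A small sketch with invented counts shows the difference.

```python
# Invented counts for five countries, sorted in descending order.
counts = [900, 60, 40, 30, 20]

top3 = counts[:3]

# Share of each top-three country within the top three (what the sliced pie displays).
share_within_top3 = [100 * c / sum(top3) for c in top3]

# Share of the same countries within the full list.
share_overall = [100 * c / sum(counts) for c in top3]

print([round(s, 2) for s in share_within_top3])
print([round(s, 2) for s in share_overall])
```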

#Grouping data 
df_final.groupby(['Aggregate rating', 'Rating color', 'Rating text']).size()
df_final.groupby(['Aggregate rating', 'Rating color', 'Rating text']).size().reset_index()
reset_index() moves the group keys out of the index into regular columns and restores the default integer index. The count column is then named 0.
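A minimal sketch on a toy frame (invented values, fewer columns than the real data) shows the shape before and after reset_index(). Passing `name=` to reset_index() is an alternative that names the count column in one step.

```python
import pandas as pd

# Toy ratings (invented values).
toy = pd.DataFrame({
    "Aggregate rating": [0.0, 0.0, 3.1, 4.2],
    "Rating color": ["White", "White", "Orange", "Green"],
})

# size() returns a Series whose index holds the group keys.
sizes = toy.groupby(["Aggregate rating", "Rating color"]).size()

# reset_index() turns the keys into columns; the unnamed counts land in a column called 0.
flat = sizes.reset_index()

# Alternative: name the count column directly instead of renaming afterwards.
named = sizes.reset_index(name="Rating Count")

print(flat.columns.tolist())
print(named)
```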
#Rename the last count column from '0' to 'Rating Count' 
df_final_ratings=df_final.groupby(['Aggregate rating', 'Rating color', 'Rating text']).size().reset_index().rename(columns={0: 'Rating Count'})

df_final_ratings
import matplotlib
matplotlib.rcParams['figure.figsize']=(12,8)
sns.barplot(x='Aggregate rating', y='Rating Count', data=df_final_ratings, hue='Rating color', palette=['Blue', 'Red', 'Orange', 'Yellow', 'Green', 'Green'])

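rcParams['figure.figsize'] sets the default figure size as (width, height) in inches. hue colors the bars according to the values in the ‘Rating color’ column. palette assigns its colors to those hue values in the given order.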
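Since a list palette depends on the order in which the hue values appear, one option is to pass a dictionary that ties each hue value to a color explicitly. This is a sketch on toy data: the rating-color names and counts below are assumptions for illustration, and the color choices simply mirror the list used above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy summary table (invented counts; assumed rating-color names).
toy_ratings = pd.DataFrame({
    "Aggregate rating": [0.0, 2.1, 3.0, 3.7, 4.2, 4.7],
    "Rating color": ["White", "Red", "Orange", "Yellow", "Green", "Dark Green"],
    "Rating Count": [5, 2, 4, 6, 3, 1],
})

# Explicit mapping from hue value to bar color, independent of row order.
color_for_level = {
    "White": "blue",
    "Red": "red",
    "Orange": "orange",
    "Yellow": "yellow",
    "Green": "green",
    "Dark Green": "darkgreen",
}

fig, ax = plt.subplots(figsize=(12, 8))
sns.barplot(x="Aggregate rating", y="Rating Count", hue="Rating color",
            palette=color_for_level, data=toy_ratings, ax=ax)
```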

#Countries with restaurants that have no rating (rating color 'White')
df_final[df_final['Rating color']=='White'].Country.value_counts()
#Availability of online delivery
df_online_delivery=df_final[['Has Online delivery','Country']].groupby(['Has Online delivery','Country']).size()
print(df_online_delivery)
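The grouped counts can also be laid out as a table with one row per country, which makes it easier to compare the two delivery options side by side. The sketch below uses invented rows and pd.crosstab, which is an alternative presentation and not part of the original analysis.

```python
import pandas as pd

# Invented rows standing in for df_final.
toy = pd.DataFrame({
    "Country": ["India", "India", "India", "UAE", "United States"],
    "Has Online delivery": ["Yes", "No", "Yes", "Yes", "No"],
})

# Same grouping as above: one count per (delivery option, country) pair.
grouped = toy.groupby(["Has Online delivery", "Country"]).size()

# Cross-tabulation: countries as rows, delivery options as columns, zeros where no rows exist.
table = pd.crosstab(toy["Country"], toy["Has Online delivery"])

print(grouped)
print(table)
```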
