Twitter sentiment analysis and classification

Rohan Paris
16 min read · Jul 24, 2022


This is a multiclass classification problem; to solve it, I performed text cleaning, exploratory analysis, and sentiment classification of the tweets.

Table of Contents:
1. Business Problem
2. Dataset Information
3. Evaluation Metrics
4. EDA
5. Data Preprocessing
6. Model Building
7. Conclusion and Future scope
8. Code links

1. Business Problem

This case study is based on Kaggle’s Twitter Sentiment Analysis dataset. The project applies different text cleaning techniques to extract relevant information from the tweets. Given a message and an entity (topic), the task is to judge the sentiment of the message about that entity.

2. Dataset Information

  • The dataset can be downloaded from this Kaggle link.
  • The dataset contains 74682 rows and 4 columns.
  • I have renamed the dataset’s columns to {0:’Tweet_ID’, 1:’Topic’, 2:’Sentiment’, 3:’Tweet’} to get a better sense of the data (a loading sketch appears after the imports below).
  • As the name suggests, the ‘Tweet_ID’ feature contains the tweet’s ID.
    The ‘Topic’ feature gives the topic of the tweet.
    The ‘Tweet’ feature contains the tweet’s contents.
  • The target feature ‘Sentiment’ has four classes namely ‘Positive’, ‘Negative’, ‘Neutral’, and ‘Irrelevant’.
  • The target classes are evenly distributed, so this is a balanced dataset.
  • The following libraries have been used to complete this project:
#Data-preprocessing libraries
import pandas as pd
import numpy as np

#Text processing libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer #feature extraction

#Load data-visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

#model building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

#evaluation metrics
from sklearn.metrics import confusion_matrix,f1_score,accuracy_score

nltk.download('stopwords')
nltk.download('wordnet') #required by WordNetLemmatizer
stop_words = set(stopwords.words('english'))

lemmatizer = WordNetLemmatizer()

pd.set_option('display.max_columns',None)
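  • For completeness, below is a minimal sketch of loading the data and renaming the columns as described above. The file name and the header=None argument are assumptions based on the Kaggle dataset layout, not taken from the original notebook.
#Load the raw tweets (file name is an assumption; the Kaggle CSV ships without a header row)
df_data = pd.read_csv('twitter_training.csv', header=None)

#Rename the positional columns as described in the Dataset Information section
df_data = df_data.rename(columns={0:'Tweet_ID', 1:'Topic', 2:'Sentiment', 3:'Tweet'})

#Quick checks: dataset shape and class balance of the target feature
print(df_data.shape)
print(df_data['Sentiment'].value_counts())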

3. Evaluation Metrics

As this is a multiclass classification problem, we will evaluate the model’s performance based on the weighted F1 score:

Weighted F1 = Σᵢ Wᵢ · F1ᵢ

In the above equation, Wᵢ stands for the normalized weight of class i (the number of samples in class i divided by the total number of samples).

The F1 score of each class is the harmonic mean of Precision and Recall and is calculated as follows:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

formula credits: https://datascience103579984.wordpress.com/2019/04/30/balanced-accuracy-and-f1-score/
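  • As a quick illustration of the metric, scikit-learn can compute the weighted F1 score directly. The labels below are made up for demonstration only and are not taken from the project’s data.
from sklearn.metrics import f1_score

#Toy labels for illustration only
y_true = ['Positive','Negative','Neutral','Positive','Irrelevant','Negative']
y_pred = ['Positive','Neutral','Neutral','Positive','Irrelevant','Negative']

#'weighted' averages the per-class F1 scores using each class's support as its weight
print(f1_score(y_true, y_pred, average='weighted'))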

4. EDA

  • About 0.9% of the values in the ‘Tweet’ feature were null, so those rows have been dropped.
#code to check the percentage of missing data
(df_data.isnull().sum()/len(df_data))*100
#drop the null values
df_data.dropna(axis=0,inplace=True)
  • The features already had the correct data types, so there was no need to change them.
  • The distinct values in the ‘Topic’ feature are the entities that the tweets refer to.
  • As mentioned earlier, this is an example of a balanced dataset as there is a balanced distribution of classes in the target feature ‘Sentiment’.
  • Next, we will try to analyze the distribution of tokens in each tweet. To do this, a new feature named ‘Tweet_word_count’ has been created which contains the total number of tokens/words in each tweet.
#Get the count of words in each tweet
df_data['Tweet_word_count'] = df_data['Tweet'].apply(lambda x: len(x.split()))

plt.figure(figsize=(15,10))
#code to plot a boxplot
plt.subplot(2,1,1)
sns.boxplot(x=df_data['Tweet_word_count'])
plt.title('Distribution of number of tokens in tweets')
#code to plot a histogram
plt.subplot(2,1,2)
sns.histplot(df_data['Tweet_word_count'], kde=True) #histplot is the current replacement for the deprecated distplot
plt.show()
  • From the above plots, it can be observed that the mean number of tokens is around 23 in each tweet. The boxplot shows that there are some extreme outliers and the histogram shows that the data is positively skewed.
  • The following code is used to print the tweets that contain the extreme outliers (token count>125)
#Extreme outliers (token count > 125)
extreme_outliers = df_data['Tweet'][df_data['Tweet_word_count'] > 125]

for i in extreme_outliers.index:
    print(i, 'Tweet Sentiment: ', df_data['Sentiment'][i])
    print(extreme_outliers[i])
    print('\n')

1826 Tweet Sentiment: Neutral
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

8546 Tweet Sentiment: Positive
I REALLY HAVE THE OVERWATCH RN GAME. SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEA SEE E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E E

10454 Tweet Sentiment: Positive
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

32186 Tweet Sentiment: Neutral
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

43712 Tweet Sentiment: Negative
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

52136 Tweet Sentiment: Neutral
There was a meeting with the interns on their upcoming projects, and my background was a TV showing my ped Red Dead Reduction 2 after the pillaging of the dead body, so I’m just trying to hide it from everyone, as I am: / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /

61388 Tweet Sentiment: Irrelevant
(PC) Come Vibe With Me. Messing Around in GTA!!!!!!!!!!!!!!!!!!!!!!!!!!!!! _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

68078 Tweet Sentiment: Neutral
@ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

68576 Tweet Sentiment: Negative
When fear is raised that “punk-2077” may be postponed for several years, which I hope will not happen, but here is another powerful blow to the Bollocks in 2020: / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / / /

68624 Tweet Sentiment: Neutral
I’m a little disappointed, but my schedule approves this decision entirely. ^ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

70940 Tweet Sentiment: Neutral
The event dedicated to Victory Day in the Great Patriotic War was held as part of the celebration of the 70th anniversary of Victory in the Great Patriotic War of 1941–1945, which was attended by veterans of the Great Patriotic War, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home front workers, home

  • We can observe that tweets 8546 and 10454 mostly contain ‘E’ and ‘_’ characters, yet they have been classified as ‘Positive’.
  • On the other hand, tweets 43712 and 68576 mostly contain ‘#’ and ‘/’ characters, yet they have been classified as ‘Negative’.

5. Data preprocessing

  • In this section, the main task is to clean the ‘Tweet’ feature using regular expressions in order to extract the important information from it.
  • The ‘sub()’ function from the ‘re’ module will be used to achieve this task. This function takes the following parameters (a small standalone example follows this list):
  • re.sub(pattern, replacement, original_text)
    where,
    pattern = regular expression matching the text that is to be replaced
    replacement = string that the matched text will be replaced with
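  • A tiny standalone example of ‘sub()’ in action (illustrative strings only):
import re

text = 'I want to join @google #goals'

#Replace user mentions (@ followed by alphanumerics) with an empty string
print(re.sub(r'@[A-Za-z0-9]+', '', text))   #-> 'I want to join  #goals'

#Remove only the '#' symbol, keeping the hashtag word
print(re.sub('#', '', text))                #-> 'I want to join @google goals'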

5.1. Remove user mentions

  • User mentions start with ‘@’. We will remove every token that starts with the ‘@’ symbol followed by alphanumeric characters (it is replaced with an empty string).
  • Eg: ‘I want to join @google’ will become ‘I want to join ’
df_data['Tweet_clean'] = df_data['Tweet'].apply(lambda x: re.sub(r'@[A-Za-z0-9]+','',x))
  • In the above code, the regular expression ‘@[A-Za-z0-9]+’ matches any string starting with the @ symbol followed by one or more letters or digits.

5.2. Remove hashtags

  • Tweets contain ‘#’ hashtags to better categorize the content. We will remove the ‘#’ symbol while keeping the hashtag word itself.
  • Eg: ‘I want to join google #goals’ will become ‘I want to join google goals’
df_data['Tweet_clean']=df_data['Tweet_clean'].apply(lambda x: re.sub('#','',x))

5.3. Remove contractions

  • “Don’t” and “could’ve” are examples of contractions. Such contractions will be replaced with their expanded forms. Eg: “Don’t” will become “Do not” and “could’ve” will become “could have”.
  • A dictionary named ‘contraction_mapping’ has been prepared that contains a list of common contractions (a small illustrative subset is shown after the code below). You can find the full dictionary in the Github link available at the end of this post.
df_data['Tweet_clean'] = df_data['Tweet_clean'].apply(lambda x: ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in x.split(" ")]))
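  • For illustration, a minimal sketch of what such a mapping might look like. These few entries are only an assumption about its shape; the full dictionary lives in the project repository.
#Illustrative subset only; the complete contraction_mapping is in the Github repository
contraction_mapping = {
    "don't": "do not",
    "can't": "cannot",
    "could've": "could have",
    "i'm": "i am",
    "won't": "will not",
}

sample = "don't worry, i'm sure we could've won"
print(' '.join([contraction_mapping.get(t, t) for t in sample.split(' ')]))
#-> 'do not worry, i am sure we could have won'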

5.4. Remove hyperlinks

  • Hyperlinks or URLs starting with ‘http’ will be removed from the text.
df_data['Tweet_clean']=df_data['Tweet_clean'].apply(lambda x: re.sub(r'http\S+','',x))
  • In the above code, the regular expression ‘http\S+’ matches ‘http’ followed by one or more non-whitespace characters.

5.5. Other text processing

  • Here we have performed some other text processing tasks: keeping only letters (dropping digits and special characters), converting the text to lowercase, and removing extra whitespace.
#function to perform text conversion
def txt_conversion(sentence):
    #Getting only the letters from the tweets
    sentence = re.sub(r'[^a-zA-Z ]', '', sentence)
    #Converting them to lowercase
    sentence = sentence.lower()
    #split based on space to remove multiple spaces
    words = sentence.split()
    #combining to form sentence
    return (" ".join(words)).strip()

df_data['Tweet_clean'] = df_data['Tweet_clean'].apply(lambda x: txt_conversion(x))

5.6. Stop words removal and text normalization

  • Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, and “are”. Such words are eliminated as they carry very little useful information.
  • Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Lemmatization is similar to stemming but it takes the word’s context into account, linking inflected forms to a single lemma. Eg: ‘corpora’ is converted to the lemma ‘corpus’ and ‘cars’ to ‘car’.
def stop_wrds_lemma_convert(sentence):
    #stopwords removal
    tokens = [w for w in sentence.split() if not w in stop_words]
    newString = ''
    #converting words to lemma
    for i in tokens:
        newString = newString + lemmatizer.lemmatize(i) + ' '
    return newString.strip()

df_data['Tweet_clean'] = df_data['Tweet_clean'].apply(lambda x: stop_wrds_lemma_convert(x))
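  • To see what this step does, the function can be run on a single illustrative sentence (the exact output depends on the NLTK data downloaded earlier):
#Illustrative sentence only
print(stop_wrds_lemma_convert('the cars are parked near the offices'))
#Expected output (roughly): 'car parked near office'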

5.7. Generate word clouds

  • A ‘word cloud’ is a visual representation of word frequency. The more commonly the term appears within the text being analyzed, the larger the word appears in the image generated.
  • The below code is used to generate word clouds for each sentiment based on the ‘Tweet_clean’ feature.
plt.figure(figsize=(15,10))
plt.subplot(2,2,1)
all_words=' '.join([text for text in df_data[df_data['Sentiment']=='Positive']['Tweet_clean']])
wordcloud=WordCloud(width=800,height=500,random_state=21,max_font_size=110).generate(all_words)
plt.title('Sentiment: Positive')
plt.imshow(wordcloud)
plt.axis('off')
plt.subplot(2,2,2)
all_words=' '.join([text for text in df_data[df_data['Sentiment']=='Negative']['Tweet_clean']])
wordcloud=WordCloud(width=800,height=500,random_state=21,max_font_size=110).generate(all_words)
plt.title('Sentiment: Negative')
plt.imshow(wordcloud)
plt.axis('off')
plt.subplot(2,2,3)
all_words=' '.join([text for text in df_data[df_data['Sentiment']=='Neutral']['Tweet_clean']])
wordcloud=WordCloud(width=800,height=500,random_state=21,max_font_size=110).generate(all_words)
plt.title('Sentiment: Neutral')
plt.imshow(wordcloud)
plt.axis('off')
plt.subplot(2,2,4)
all_words=' '.join([text for text in df_data[df_data['Sentiment']=='Irrelevant']['Tweet_clean']])
wordcloud=WordCloud(width=800,height=500,random_state=21,max_font_size=110).generate(all_words)
plt.title('Sentiment: Irrelevant')
plt.imshow(wordcloud)
plt.axis('off')

6. Model Building

  • In this part, we will first separate the dependent and independent features and then perform a train-test split to generate the training and validation sets.
#Separate dependent and independent features
X=df_data.loc[:,df_data.columns!='Sentiment']
y=df_data['Sentiment']

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=0)
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_valid.shape, y_valid.shape)
Output: the shapes of the train and validation sets.
  • A TF-IDF vectorizer has been used to generate bag-of-words features from the cleaned tweets.
  • TF stands for ‘Term Frequency’ and IDF stands for ‘Inverse Document Frequency’
  • TF-IDF score is higher for terms that occur quite often in a document but are not present in most of the other documents. Similarly, the score is lower for terms that occur frequently in most of the documents.
#TF-IDF
vectorizer = TfidfVectorizer(stop_words='english',ngram_range=(1,3),min_df=10,max_features=10000)
#Train on train data
features_train= vectorizer.fit_transform(X_train['Tweet_clean'])
#Apply on test data
features_valid= vectorizer.transform(X_valid['Tweet_clean'])
#check shape
features_train.shape, features_valid.shape
Output: the shapes of the train and validation feature matrices after TF-IDF vectorization.
  • It must be noted that we use ‘fit_transform’ only on the training set and ‘transform’ on the validation set. This is to prevent data leakage as the model must only learn data from the training set.
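  • As a quick sanity check of the fitted vectorizer, a few of the learned n-grams can be inspected. This is only a sketch; ‘get_feature_names_out()’ assumes a recent scikit-learn version (older versions use ‘get_feature_names()’).
#Peek at a handful of the n-grams learned from the training tweets
print(vectorizer.get_feature_names_out()[:20])

#Each row of features_train is a sparse TF-IDF vector over this vocabulary
print(features_train[0])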
  • Now, we will use the below function to train different classification algorithms on the available dataset.
#Function to fit and apply a model
def model_apply(model):
    #train the model
    model.fit(features_train, y_train)
    #make predictions on the validation set
    pred = model.predict(features_valid)
    #model evaluation
    print(model)
    print('Accuracy score: ', accuracy_score(y_valid, pred))
    print('Weighted F1 score: ',
          f1_score(y_true=y_valid, y_pred=pred, average='weighted'))
    print('Confusion Matrix: \n', confusion_matrix(y_valid, pred)) #confusion_matrix expects (y_true, y_pred)
  • Model 1: Multinomial Naive Bayes
nb = MultinomialNB()
model_apply(nb)
  • Model 2: Logistic Regression
lr = LogisticRegression(random_state=10,max_iter=500)
model_apply(lr)
  • Model 3: Decision Tree
dtc = DecisionTreeClassifier(random_state=10)
model_apply(dtc)
  • Model 4: Random Forest
rf = RandomForestClassifier(random_state=101,n_jobs=-1)
model_apply(rf)

7. Conclusion and Future scope

  • The Random Forest Classifier generates the best result on the validation data.
  • Future work includes applying hyperparameter tuning to these models to further improve the results (a small sketch follows below).
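  • As one possible starting point for that tuning, here is a small sketch using GridSearchCV on the Random Forest model. The parameter grid is an illustrative assumption, not a set of tuned values from this project.
from sklearn.model_selection import GridSearchCV

#Illustrative parameter grid; the real search space would need experimentation
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 20, 50],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=101, n_jobs=-1),
    param_grid,
    scoring='f1_weighted', #matches the project's evaluation metric
    cv=3,
)
grid.fit(features_train, y_train)

print(grid.best_params_)
print('Best CV weighted F1:', grid.best_score_)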

8. Code Links

Github:

Kaggle:
