
Volume 8, Issue 5, May 2023 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

A Sentiment Analysis Case Study to Understand How a YouTuber can Derive Decision Insights from Comments

Chandreyi Chowdhury
School of Advanced Sciences
Vellore Institute of Technology
Vellore, India

Baibhav Pathy
School of Electrical Engineering
Vellore Institute of Technology
Vellore, India

Abstract:- YouTube is considered the biggest platform for content creators to share their content with the world. Usually, a YouTuber aims to give his/her viewers the best content possible by going through the comments on past videos. The comments on a single video can number up to 10 thousand; hence, it becomes practically impossible to go through every comment and get an idea of what the viewers want or expect.

Our work provides a model based on Python that extracts the comments of a YouTube video, which then become our dataset. A Machine Learning technique known as Sentiment Analysis (Classification Model) is applied to the extracted dataset to provide the YouTuber with a better understanding of the distribution of the sentiment of his/her viewers, which in turn helps them get an idea of the thoughts of the viewers and also what the viewers expect from their future videos.

Keywords:- YouTube, Sentiment Analysis, Classification, Decision Insights, Case Study.

I. INTRODUCTION

Whether one's material is intended for the entire public or is tailored to a community, age group, etc., social media platforms are thought to be the simplest and fastest method to simultaneously reach millions of individuals. Through social media, one may communicate with someone who is thousands of miles away. Non-textual information, such as videos, photos, and animations, is shared on many websites using systems that let people leave comments on individual items. With millions of videos submitted by its users and billions of comments posted on them, YouTube is the most well-known of these platforms for sharing material in the form of videos [4]. Some of these clips include material that might significantly harm a person's or a company's reputation. By counting the likes and dislikes of a video, it is easy to determine its reputation. It is good material if there are significantly more likes than dislikes, while typically terrible content has many more dislikes than likes [1].

There are, however, only a few approaches to quantifying reputation. This highlights how crucial it is to automatically extract thoughts and views shared on social media [5]. In order to ascertain how YouTube viewers feel about video content, sentiment analysis may be applied to text data from the comments area of a video. The most effective option for deciphering each comment's significance is a text-mining strategy. For YouTube viewers to assess the significance of the uploaded video based on user opinion comments, the categorization of positive and negative content becomes extremely crucial [7].

This study focuses on utilising several classifiers from Python's Scikit-learn module to do sentiment analysis using a machine learning method on two case study datasets scraped from YouTube. Functions from the Python libraries Selenium and BeautifulSoup are used in the text mining portion. On the scraped data, stemming, lemmatization, and text pre-processing are carried out utilising a variety of Natural Language Toolkit methods.

II. RELATED WORKS

Numerous studies have explored the use of various sentiment analysis approaches, including machine learning, Naive Bayes, Lexicon-based techniques, and mBERT. Abbi Nizar Muhammad, Saiful Bukhori, and Priza Pandunata developed a sentiment analysis strategy by combining Naive Bayes-Support Vector Machine (NBSVM) with a binary classification approach, achieving 91% accuracy [7]. Sudhanshu Ranjan, Dheeraj Mekala, and Jingbo Shang applied the mBERT approach to code-switching data, improving the performance across multiple datasets [20]. Tanvi Mehta and Ganesh Deshmukh compared various methods such as linear regression, SVM, decision trees, random forests, and artificial neural networks, and found decision trees and ANN had the lowest root mean square errors [23]. Ritika Singh and Ayushka Tiwari used six machine learning techniques, including Gaussian Naive Bayes, SVM, logistic regression, decision trees, KNN, and random forest, to develop a sentiment analysis system, evaluating accuracy and F-score [19].

Hanif Bhuiyan, Rajon Bardhan, Jinat Ara, and M. Rashedul Islam studied retrieving YouTube videos using sentiment analysis on user comments. Their approach utilized natural language processing (NLP) and sentiment analysis to find relevant and well-liked videos on YouTube that match the search criteria [2]. Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, and Congkai Sun aimed to analyze voice sentiment in internet cinema review recordings, taking into account the speaker's positive valence content and speech-based emotional audio elements [3]. Fazal Masud Kundi, Afsana Marwat, Shakeel Ahmad, and Muhammad Zubair Asghar employed sentiment lexicons, such as WordNet and SentiWordNet, to detect the polarity of

IJISRT23MAY963 www.ijisrt.com 919


feeling in their research paper [4]. Alessandro Moschitti, Aliaksei Severyn, Barbara Plank, Agata Rotondi, and Olga Uryupina discussed the "SenTube" dataset, which comprises user-generated comments on YouTube videos that have been annotated for informativeness and sentiment polarity, to create classifiers for several critical NLP tasks [5].

S. Nawaz, M. Rafiq, and M. Rizwan proposed an approach for recommending the effectiveness of YouTube video content by using the Google API to determine the overall sentiment of comments and replies through text analytics [9]. Philipp A. Toussaint, Sebastian Lins, Maximilian Renner, Ali Sunyaev, and Scott Thiebes used Bing (binary), National Research Council Canada (NRC) emotion, and 9-level sentiment analysis to determine user attitudes toward videos about direct-to-consumer (DTC) genetic testing, as expressed in the comment sections [13]. However, there is still no specific method designed for YouTubers/Influencers to understand the sentiment distribution of their video's comments to make informed decisions based on user feedback. This work presents an initial attempt to automatically extract or scrape the comments from YouTube, apply text processing techniques, and fit them into a machine learning model to provide a comprehensive study of how content creators can leverage sentiment analysis to their advantage and understand the users' overall viewpoint.

III. METHODOLOGY

This section elaborates on the step-by-step procedure we have used in order to arrive at the final model. The proposed procedure operates through a sequence of four steps, which are presented visually in Figure 1. The initial step is to select the video one wants to perform sentiment analysis on and pass it to the extraction code so that the comments can be downloaded and stored as a CSV (Comma Separated Values) file. The comments themselves are extracted using Python packages like Selenium and Beautiful Soup. Next, one can perform Sentiment Analysis using our proposed model to get almost accurate classifications of those comments. The accuracy metric also has to be decided, and the validation score needs to be over ninety per cent for the classification to be called almost accurate. The steps are described in more detail in the following sub-sections.

Fig. 1: Sentiment Analysis Procedure

A. Comments Extraction

Comments extraction is a vital process in our research that involves extracting comments from selected YouTube videos and storing them in a CSV file. We utilize two web-scraping libraries in Python, Selenium and BeautifulSoup, in our model for this task. Selenium is used to automate web browser interaction through Python scripts, while BeautifulSoup simplifies the process of extracting data from websites by providing Python-based idioms for parsing HTML or XML code. Our model code, which carries out the comments extraction process, is available online, along with Figures 1, 2, and 3, which illustrate the steps involved in the process and some relevant code snippets. However, we note that we only extract a random subset of comments from each video, as scraping the entire comments section violates YouTube's policies.
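The parse-and-store part of this step can be illustrated with a minimal sketch. It is not the authors' extraction code: to stay dependency-free it uses the standard-library `html.parser` as a stand-in for BeautifulSoup, and it works on a hard-coded HTML string where a real run would use the page source obtained through Selenium. The element id `content-text` and the helper names are assumptions made for illustration only.

```python
import csv
import io
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class CommentTextParser(HTMLParser):
    """Collect the text of every element whose id equals `target_id`."""

    def __init__(self, target_id="content-text"):
        super().__init__()
        self.target_id = target_id  # assumed marker, not taken from the paper
        self.comments = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.comments.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def comments_to_csv(html, out):
    """Parse `html`, write one comment per CSV row to `out`, return the list."""
    parser = CommentTextParser()
    parser.feed(html)
    writer = csv.writer(out)
    writer.writerow(["comment"])
    for text in parser.comments:
        writer.writerow([text])
    return parser.comments


# Hard-coded sample standing in for a rendered comments page.
sample_html = """
<div><span id="content-text">Loved the <b>review</b>!</span></div>
<div><span id="content-text">Too long, but informative.</span></div>
"""
buffer = io.StringIO()
extracted = comments_to_csv(sample_html, buffer)
```

Writing to an in-memory buffer keeps the sketch side-effect free; passing an open file object instead would produce the CSV file described above.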

Fig. 2: User Comments on YouTube


Fig. 3: Scraped Comments

Fig. 4: Extraction Code Snip

B. Text Preprocessing and Model Building

Once the comments were extracted, we imported the CSV file into Python and performed the necessary pre-processing steps. Non-English comments and symbolic comments were removed from the dataset. Although emoji-based comments could have been valuable in conveying users' perspectives, they were not included due to the limitations of our model's scope and context. We transformed the dataset to classify sentiments as Positive, Negative, or Neutral.

To perform text sentiment analysis, we used the VADER (Valence Aware Dictionary and sEntiment Reasoner) model, which considers both the polarity (positive/negative) and the intensity (strength) of emotions. This model can be applied to unlabelled text data immediately and is included in the NLTK package. VADER uses a dictionary that assigns sentiment scores to lexical features, allowing us to determine the intensity of emotion in a text by summing up the intensity of each word. For example, words like "love," "enjoy," "glad," and "like" all convey positive feelings. VADER also accounts for the context of these words, such as the negative connotation of "did not love" and how capitalization and punctuation can emphasize words, such as "ENJOY."

The polarity_scores() method of VADER's SentimentIntensityAnalyzer class takes a string as input and returns a dictionary containing scores for the negative, neutral, and positive categories, together with a compound score (a normalized aggregate of the word-level scores, ranging from -1 to +1). The implementation of this function is shown in Figure 4.
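To make the step from VADER scores to a Positive, Negative, or Neutral label concrete, the sketch below maps a score dictionary to a class. The cut-off of 0.05 on the compound score is the commonly used VADER convention and is an assumption here, since the text does not state which threshold was applied. The score dictionaries are made-up illustrative values in the shape returned by `polarity_scores()`, and the 0/1/2 encoding follows the one used in the case-study figures.

```python
NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2


def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a sentiment class.

    `threshold` is an assumed cut-off on the compound score.
    """
    compound = scores["compound"]
    if compound >= threshold:
        return POSITIVE
    if compound <= -threshold:
        return NEGATIVE
    return NEUTRAL


# Illustrative values only; real scores would come from
# SentimentIntensityAnalyzer().polarity_scores(comment_text).
example_scores = [
    {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.8},
    {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_from_scores(s) for s in example_scores]
```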


Fig. 4: SentimentIntensityAnalyzer() Usage

Finally, we get a classification of every comment as positive, negative, or neutral. Figure 5 shows the result.

Fig. 5: Sentiments Classification

In the Sentiments Classification stage, the text normalization techniques of stemming and lemmatization are applied. These techniques are widely used by natural language processing experts to prepare text, words, and documents for processing. Stemming is the process of reducing the morphological variants of a word to a common root or base form, which helps improve search accuracy when looking up information in the text. Stemming algorithms, also known as stemmers, are frequently used for this purpose. Our model employs three standard stemming methods: PorterStemmer(), LancasterStemmer(), and SnowballStemmer(). Lemmatization, on the other hand, applies a morphological analysis to words by considering the entire lexicon of a language and reducing words to their base form. Lemmatization does not classify sentences; instead, it uses a word's part of speech and context to determine its dictionary form (lemma). In our model, the standard WordNetLemmatizer() method is used for lemmatization.
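The difference between the two normalization techniques can be seen in a deliberately simplified sketch. It does not reproduce the Porter, Lancaster, or Snowball algorithms, nor the WordNet lemmatizer: the suffix list and the tiny lemma dictionary below are hypothetical stand-ins chosen only to show that stemming strips suffixes by rule while lemmatization looks words up in a lexicon.

```python
SUFFIXES = ("ing", "ed", "es", "s")          # toy rule set, assumed
LEMMAS = {"better": "good", "went": "go"}    # toy lexicon, assumed


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in the lexicon; fall back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)


stems = [toy_stem(w) for w in ["Comments", "commented", "commenting", "comment"]]
irregular = (toy_stem("better"), toy_lemma("better"))
```

All four variants of "comment" collapse to the same stem, whereas the irregular form "better" is left untouched by the rule-based stemmer and only resolved by the lexicon lookup.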

Upon completing the necessary data transformations, we employed a range of machine learning models to perform sentiment analysis on our dataset. Specifically, we utilized the Gaussian Naïve Bayes, Logistic Regression, AdaBoost Classifier, Random Forest Classifier, K-Nearest Neighbors (K-NN), and Decision Tree Classifier algorithms.

• The Gaussian Naïve Bayes Classifier was our first choice due to its simplicity and effectiveness in classification tasks. It is capable of producing quick, accurate predictions by assuming that the features are independent of one another, which makes it well suited to high-dimensional datasets.
• Our second model, Logistic Regression, is a widely used supervised learning algorithm that predicts categorical outcomes based on a predefined set of independent variables. It is often used in machine learning due to its simplicity, interpretability, and ability to handle linearly separable data.
• The AdaBoost Classifier, our third model, employs a boosting technique called the AdaBoost algorithm to create an ensemble model. It operates by reassigning weights to each instance, with higher weights given to misclassified instances, to produce a series of learners that gradually improve on one another. This approach can reduce bias and variance in the predictions.
• We also employed the Random Forest Classifier, which uses a large number of decision trees to improve the accuracy of predictions. It mitigates overfitting by training each tree on a different subset of the input dataset and combining their results to obtain the final prediction.
• The K-NN algorithm, our fifth model, relies on the assumption that new and existing cases are comparable, and places the new instance in the category most similar to existing ones. It is commonly used for classification tasks but can also be used for regression.
• Finally, our last model was the Decision Tree Classifier, which creates a tree structure that represents the features of a dataset, decision-making branches, and classification results at each leaf node. It is a widely used algorithm in machine learning for classification and regression problems.

We will discuss the application of each of these models in detail in the case studies section (Section IV).

C. Validation Steps
The process of ensuring that a model truly serves its intended function is known as model validation. This usually entails verifying that the model is accurate in the circumstances of its intended application. Validation of a model is done using an accuracy metric. Such metrics include accuracy score, logarithmic loss, confusion matrix, root mean squared error, the area under the curve, etc. Our model used the accuracy score and the root mean squared error as its accuracy metrics. The basic intuitive notion of the formula for accuracy score is:

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \tag{1}$$

The method that gives the accuracy score in Python is accuracy_score(), found in the package sklearn. Moreover, for the root mean squared error, the intuitive formula is:

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (\text{Predicted value}_i - \text{Actual value}_i)^2}{N}} \tag{2}$$

The root mean squared error can be obtained in Python by taking the square root of the value returned by mean_squared_error(), found in the package sklearn. We used a scoring function to give out the accuracy score and the root mean squared error value for any model inserted. Figure 6 shows the body of the function used.

Fig. 6: Scoring Function Used In Our Model

D. Final Model
The final model is selected based on the accuracy metric's scores. Whichever model gives the highest score is chosen as the final model. The selection and the scores will vary for every dataset and hence cannot be predicted beforehand. So, the final model selection will be discussed further in the case studies section (Section IV).

IV. CASE STUDIES
A case study of two YouTube videos is presented to understand how a YouTuber can derive decision insights from the comments on his/her videos to improve the content or to know what his/her viewers want. The analyses and inferences of the sentiment distributions of the respective videos are presented below.
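The accuracy score, the root mean squared error, and the highest-accuracy selection rule described under Validation Steps and Final Model can be sketched in plain Python as follows. This is not the scoring function of Figure 6, which relies on sklearn; the function names, the model names, and the label lists below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared difference between labels."""
    squared = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def score_models(actual, predictions_by_model):
    """Return {model name: (accuracy, RMSE)} for each model's predictions."""
    return {
        name: (accuracy(actual, preds), rmse(actual, preds))
        for name, preds in predictions_by_model.items()
    }


def select_final_model(scores):
    """Pick the model with the highest accuracy score."""
    return max(scores, key=lambda name: scores[name][0])


# Made-up labels (0 = negative, 1 = neutral, 2 = positive).
actual = [2, 2, 1, 0, 2, 1, 0, 2]
predictions_by_model = {
    "model_a": [2, 2, 1, 0, 2, 1, 0, 1],  # one error, off by 1
    "model_b": [2, 0, 1, 0, 2, 1, 2, 2],  # two errors, each off by 2
}
scores = score_models(actual, predictions_by_model)
final_model = select_final_model(scores)
```

Because RMSE treats the class codes as numbers, confusing negative (0) with positive (2) is penalised more heavily than confusing either with neutral (1), which is why the second hypothetical model scores worse on both metrics.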

A. Movie Reviews by Sucharita -> Darlings (the 5th of August, 2022)

Fig. 8: Video 1 Snip

We chose the account "Sucharita" on YouTube, shown in Figure 8, for our first case study. She is a full-time social media influencer whose content mainly concerns reviewing recent movies, series and online streamable entertainment. We chose her because, being an influencer, she has to create content regularly and keep her audience engaged. It is highly beneficial for her to analyse and see the distribution of her viewers' opinions: what they want more of, what they detest, and so forth. For her account, we proceeded to analyse the video for the movie review of "Darlings", which is among the most viewed videos on her YouTube channel. The video was published on YouTube on the 5th of August, 2022. In the roughly three months since then, 193 comments have been posted on this video, and we have scraped 119 comments to work with. The exact caption for the video is: "Darlings movie REVIEW | Sucharita Tyagi | Alia Bhatt, Shefali Shah, Vijay Varma | Netflix India".

Table 1: First Case Study Video Details


| Field | Value |
| --- | --- |
| Account | Sucharita Tyagi |
| Subscribers | 62.5K |
| Video Caption | Darlings movie REVIEW \| Sucharita Tyagi \| Alia Bhatt, Shefali Shah, Vijay Varma \| Netflix India |
| Published Date | The 5th of August, 2022 |
| Video URL | https://www.youtube.com/watch?v=lPvXZz7m9sI&t=2s |
| Total Comments | 193 (as of the 2nd of November 2022) |
| Scraped Comments | 119 |

Proceeding towards the model description and analysis of this dataset, we found that the processed data contained 20 negative, 63 positive and 36 neutral comments. Figure 7 shows the same distribution as obtained in the code output.

Fig. 7: Processed Data Distribution For Video 1

Note that 0 denotes the 'negative' class, 1 denotes the 'neutral' class, and 2 denotes the 'positive' class in the Figure. After this step, we upsample the minority classes, the 'neutral' and 'negative' classes. Upsampling is a technique used to balance imbalanced datasets. It involves generating new data points for the minority class to increase its representation in the dataset. This results in a more balanced distribution of data between different classes. Thanks to this equalisation, the model does not tend to favour the class with the most members. The upsampled dataframes are then concatenated with the majority class. So finally, our data at this stage contains 205 negative, 205 neutral and 63 positive samples.
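A minimal sketch of upsampling by random sampling with replacement is given below. It assumes the simplest variant, in which existing minority-class rows are duplicated at random until a target count is reached; the text does not say which resampling routine was used. The target of 205 is taken from the counts reported above, while the function name, the seed, and the placeholder comment strings are hypothetical.

```python
import random
from collections import Counter


def upsample(rows, label, target, rng):
    """Return the rows of `label`, padded to `target` by resampling."""
    members = [row for row in rows if row[1] == label]
    missing = max(0, target - len(members))
    # Draw extra rows with replacement from the existing members.
    return members + rng.choices(members, k=missing)


# Placeholder dataset with the class counts reported for Video 1
# (0 = negative, 1 = neutral, 2 = positive).
rows = (
    [(f"negative comment {i}", 0) for i in range(20)]
    + [(f"neutral comment {i}", 1) for i in range(36)]
    + [(f"positive comment {i}", 2) for i in range(63)]
)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
balanced = (
    upsample(rows, 0, 205, rng)
    + upsample(rows, 1, 205, rng)
    + [row for row in rows if row[1] == 2]  # majority class kept as-is
)
class_counts = Counter(label for _, label in balanced)
```

Since the added rows are copies, upsampling before a train/test split can place duplicates of the same comment in both partitions, which is worth keeping in mind when reading the resulting accuracy scores.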

Applying several machine learning classifiers to the data, training the models with 70 per cent of the data and saving the remaining 30 per cent for testing, we get the best accuracy score of 0.9577 from the Gaussian Naïve Bayes Classifier. Thus, we conclude that the distribution we initially got is almost correct. It means that approximately 53 per cent of her viewers have a favourable opinion, 30 per cent have a neutral opinion, and the remaining 17 per cent have a negative opinion about her video. Figure 9 shows the same diagrammatically.
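The percentages quoted here follow directly from the class counts of the processed data for Video 1 (63 positive, 36 neutral, 20 negative). The short sketch below reproduces that arithmetic; the function name is hypothetical.

```python
def sentiment_shares(counts):
    """Convert class counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


video_1_counts = {"positive": 63, "neutral": 36, "negative": 20}
video_1_shares = sentiment_shares(video_1_counts)
```

With 119 comments in total, the shares come out as 63/119, 36/119 and 20/119, which round to the 53, 30 and 17 per cent reported.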

Fig. 9: Video 1 Sentiments Distribution

B. Nutshell -> Black Money (the 21st of May, 2021)

Fig. 10: Video 2 Snip

Secondly, we proceeded to analyse and draw inferences from one of the videos of the channel "Nutshell", shown in Figure 10. This channel's main aim is to present long-lost and unexplored historical stories, awareness videos about current affairs, fact-finding videos and the like in a relatable way. It can be called an encyclopaedia of knowledge of current and past news. We again chose a top-viewed video on the channel, which was about "Black Money". The exact caption of the video is "What is black money? How does it circulate? | Ft. Andre Borges | Nutshell". The video was published on YouTube on the 21st of May, 2021. In the roughly seventeen months since then, 228 comments have been posted on the video, and our algorithm has scraped 93 comments to work with.

Table 2: Second Case Study Video Details


| Field | Value |
| --- | --- |
| Account | Nutshell |
| Subscribers | 243K |
| Video Caption | What is black money? How does it circulate? \| Ft. Andre Borges \| Nutshell |
| Published Date | The 21st of May, 2021 |
| Video URL | https://www.youtube.com/watch?v=uEawmeO2gOY&t=7s |
| Total Comments | 228 (as of the 2nd of November 2022) |
| Scraped Comments | 93 |

For this video, proceeding towards the model description and analysis, we found that the processed data contained 11 negative, 45 positive and 35 neutral comments. Figure 11 shows the same distribution as obtained in the code output.

Fig. 11: Processed Data Distribution for Video 2

Again, note that 0 denotes the 'negative' class, 1 denotes the 'neutral' class, and 2 denotes the 'positive' class in the Figure. Upsampling the minority classes, which are again the 'neutral' and 'negative' classes, we get a dataset containing 205 negative, 205 neutral and 45 positive samples.

Finally, applying several machine learning classifiers to this dataset, again training the models with 70 per cent of the data and saving the remaining 30 per cent for testing, we get the best accuracy score of 0.9781 from the Gaussian Naïve Bayes Classifier. Hence, we can also conclude that the distribution we initially got is almost correct. It means that approximately 48 per cent of the channel's viewers have a favourable opinion, 37 per cent have a neutral opinion and the remaining 15 per cent have a negative opinion about the video. Figure 12 shows the same diagrammatically.

Fig. 12: Video 2 Sentiments Distribution

V. CONCLUSION

After performing these two case studies, we can say that most of the comments on both videos were positive. Moreover, the respective viewers liked the contents of the two videos and want more of the same kind in future. The share of negative comments was small, i.e., less than 20 per cent for both videos, which is an ideal scenario for the content creators, who ideally want as few negative comments as possible. Hence, through this project, we have concluded that by performing sentiment analysis on the comments on their videos, YouTubers can understand the gist of their viewers' opinions. Further, if they want, they can go through the positive, negative and neutral opinions separately according to their purpose. When they would like to thank the viewers for their appreciation, they can reply to the positive comments. They can go through the negative comments when they want to improve their content. Furthermore, when they want suggestions for other content, they can go through the neutral comments, which generally contain suggestions from the viewers.

REFERENCES

[1.] Alexandre Ashade Lassance Cunha, Melissa Carvalho Costa, Marco Aurélio C. Pacheco, "Sentiment Analysis of YouTube Video Comments Using Deep Neural Networks," 24 May 2019.
[2.] Hanif Bhuiyan, Jinat Ara, Rajon Bardhan, Md. Rashedul Islam, "Retrieving YouTube video by sentiment analysis on user comment," 14 September 2017.
[3.] Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, "YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context," 27 March 2013.
[4.] Muhammad Zubair Asghar, Shakeel Ahmad, Afsana Marwat, Fazal Masud Kundi, "Sentiment Analysis on YouTube: A Brief Survey," 30 November 2015.
[5.] Olga Uryupina, Barbara Plank, Aliaksei Severyn, Agata Rotondi, Alessandro Moschitti, "SenTube: A Corpus for Sentiment Analysis on YouTube Social Media," May 2014.

[6.] Rawan Fahad Alhujaili, "Sentiment Analysis for Youtube Videos with user comments," 12 April 2021.
[7.] Abbi Nizar Muhammad, Saiful Bukhori, Priza Pandunata, "Sentiment Analysis of Positive and Negative of YouTube Comments Using Naïve Bayes," 5 December 2019.
[8.] Annamaria Porreca, Francesca Scozzari, Marta Di Nicola, "Using text mining and sentiment analysis to analyse YouTube Italian videos concerning vaccination," 19 February 2020.
[9.] S. Nawaz, M. Rizwan, M. Rafiq, "Recommendation Of Effectiveness Of YouTube Video Contents By Qualitative Sentiment Analysis Of Its Comments And Replies," December 2019.
[10.] Shanta Rangaswamy, Shubham Ghosh, Srishti Jha, Soodamani Ramalingam, "Metadata extraction and classification of YouTube videos using sentiment analysis," 16 January 2017.
[11.] Lakshmish Kaushik, Abhijeet Sangwan, John H. L. Hansen, "Metadata extraction and classification of YouTube videos using sentiment analysis," 9 January 2014.
[12.] Amar Krishna, Joseph Zambreno, Sandeep Krishnan, "Polarity Trend Analysis of Public Sentiment on YouTube," 1 January 2014.
[13.] Philipp A. Toussaint, Maximilian Renner, Sebastian Lins, Scott Thiebes, Ali Sunyaev, "Direct-to-Consumer Genetic Testing on Social Media: Topic Modeling and Sentiment Analysis of YouTube Users' Comments," 15 September 2022.
[14.] Rahul Pradhan, "Extracting Sentiments from YouTube Comments," 10 February 2022.
[15.] F. Peng, A. McCallum, "Accurate Information Extraction from Research Papers using Conditional Random Fields," January 2006.
[16.] Mike Thelwall, "Social media analytics for YouTube comments: potential and limitations," September 2017.
[17.] Rhitabrat Pokharel, Dixit Bhatta, "Classifying YouTube Comments Based on Sentiment and Type of Sentence," 31 October 2021.
[18.] M. Viny Christanti, Walda, Tri Sutrisno, "Comments Scraping Application For Review Youtube Content," 2019.
[19.] Ritika Singh, "Youtube comment analysis," May 2021.
[20.] Sudhanshu Ranjan, Dheeraj Mekala, Jingbo Shang, "Progressive Sentiment Analysis for Code-Switched Text Data," 25 October 2022.
[21.] Arpit Khare, Amisha Gangwar, Sudhakar Singh, Shiv Prakash, "Sentiment Analysis and Sarcasm Detection of Indian General Election Tweets," 3 January 2022.
[22.] Rhitabrat Pokharel, Dixit Bhatta, "Classifying YouTube Comments Based on Sentiment and Type of Sentence," 31 October 2021.
[23.] Tanvi Mehta, Ganesh Deshmukh, "YouTube Ad View Sentiment Analysis using Deep Learning and Machine Learning," 11 May 2022.
[24.] https://www.youtube.com/watch?v=lPvXZz7m9sI&t=2s
[25.] https://www.youtube.com/watch?v=uEawmeO2gOY&t=7s
