ISSN No:-2456-2165
Abstract:- YouTube is considered the biggest platform for content creators to share their content with the world. Usually, a YouTuber aims to give his/her viewers the best content possible by going through the comments of their past videos. A single video can draw as many as 10 thousand comments; hence, it becomes practically impossible to go through every comment and get an idea of what the viewers want or expect.

Our work provides a Python-based model that extracts the comments of a YouTube video, which then become our dataset. A Machine Learning technique known as Sentiment Analysis (Classification Model) is applied to the extracted dataset to give the YouTuber a better understanding of the distribution of sentiment among his/her viewers, which in turn helps them get an idea of the viewers' thoughts and of what the viewers expect from future videos.

Keywords:- YouTube, Sentiment Analysis, Classification, Decision Insights, Case Study.

I. INTRODUCTION

Whether one's material is intended for the entire public or is tailored to a community, age group, etc., social media platforms are thought to be the simplest and fastest way to reach millions of individuals simultaneously. Through social media, one may communicate with someone who is thousands of miles away. Non-textual information, such as videos, photos, and animations, is shared on many websites using systems that let people leave comments on individual items. With millions of videos submitted by its users and billions of comments on them, YouTube is the most well-known of these platforms for sharing material in the form of videos [4]. These clips include material that might significantly harm a person's or a company's reputation. By counting the likes and dislikes of a video, it is easy to determine its reputation. It is good material if there are significantly more likes than dislikes, while typically terrible content has many more dislikes than likes [1].

There are, however, a few approaches to quantify reputation. This highlights how crucial it is to automatically extract thoughts and views shared on social media [5]. In order to ascertain how YouTube viewers feel about video content, sentiment analysis may be applied to text data from the comments area of a video. The most effective option for deciphering each comment's significance is a text-mining strategy. For YouTube viewers to assess the significance of the uploaded video based on user opinion comments, the categorization of positive and negative content becomes extremely crucial [7].

This study focuses on utilising several classifiers from Python's Scikit-learn module to do sentiment analysis using a machine learning method on two case study datasets scraped from YouTube. Functions from the Python libraries Selenium and BeautifulSoup are used in the text mining portion. On the scraped data, stemming, lemmatization, and text pre-processing are carried out utilising a variety of Natural Language Toolkit methods.

II. RELATED WORKS

Numerous studies have explored the use of various sentiment analysis approaches, including machine learning, Naive Bayes, Lexicon-based techniques, and mBERT. Abbi Nizar Muhammad, Saiful Bukhori, and Priza Pandunata developed a sentiment analysis strategy by combining Naive Bayes-Support Vector Machine (NBSVM) with a binary classification approach, achieving 91% accuracy [7]. Sudhanshu Ranjan, Dheeraj Mekala, and Jingbo Shang applied the mBERT approach to code-switching data, improving the performance across multiple datasets [20]. Tanvi Mehta and Ganesh Deshmukh compared various methods such as linear regression, SVM, decision trees, random forests, and artificial neural networks, and found that decision trees and ANN had the lowest root mean square errors [23]. Ritika Singh and Ayushka Tiwari used six machine learning techniques, namely Gaussian Naive Bayes, SVM, logistic regression, decision trees, KNN, and random forest, to develop a sentiment analysis system, evaluating accuracy and F-score [19].

Hanif Bhuiyan, Rajon Bardhan, Jinat Ara, and Md. Rashedul Islam studied retrieving YouTube videos using sentiment analysis on user comments. Their approach utilized natural language processing (NLP) and sentiment analysis to find relevant and well-liked videos on YouTube that match the search criteria [2]. Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, and Congkai Sun aimed to analyze spoken sentiment in online movie review recordings, taking into account the speaker's positive valence content and speech-based emotional audio elements [3]. Fazal Masud Kundi, Afsana Marwat, Shakeel Ahmad, and Muhammad Zubair Asghar employed sentiment lexicons, such as WordNet and SentiWordNet, to detect the polarity of
S. Nawaz, M. Rafiq, and M. Rizwan proposed a unique approach for estimating the efficacy of a YouTube video's content by using the Google API to compute the overall sentiment of its comments and replies [9]. Philipp A. Toussaint, Sebastian Lins, Maximilian Renner, Ali Sunyaev, and Scott Thiebes used Bing (binary), National Research Council Canada (NRC) emotion, and 9-level sentiment analysis to determine user attitudes toward videos about DTC genetic testing, as expressed in the videos' comment sections [13]. However, there is still no specific method designed for YouTubers/Influencers to understand the sentiment distribution of their video's comments and make informed decisions based on user feedback. This work presents an initial attempt to automatically extract or scrape the comments from YouTube, apply text processing techniques,

This section elaborates on the step-by-step procedure we have used to arrive at the final model. The proposed procedure operates through a sequence of four steps, which are presented visually in Figure 1. The initial step is to select the video one wants to perform sentiment analysis on and pass it to the extraction code so that the comments can be downloaded and stored as a CSV (Comma Separated Values) file. The comments are extracted using Python packages such as Selenium and Beautiful Soup. Next, one can perform Sentiment Analysis using our proposed model to get almost accurate classifications of those comments. The accuracy metric also has to be decided, and the validation score needs to be over ninety per cent to be called almost accurate. The steps are described in more detail in the following sub-section.
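The extract-and-store step can be sketched as follows. This is a minimal illustration under stated assumptions, not the extraction code used in this work: it uses the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies, it assumes the rendered page source has already been obtained (for example from a Selenium-driven browser), and the element id `content-text` and the helper names `CommentTextParser` and `comments_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CommentTextParser(HTMLParser):
    """Collect the text of every element whose id equals `target_id`.

    The id "content-text" is an assumption about the page markup,
    not something specified in the paper.
    """

    def __init__(self, target_id="content-text"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while inside a comment element
        self.buffer = []      # text fragments of the current comment
        self.comments = []    # finished comments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1   # nested tag inside a comment
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.comments.append(text)

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def comments_to_csv(comments, handle):
    """Write one comment per row under a single 'comment' header."""
    writer = csv.writer(handle)
    writer.writerow(["comment"])
    writer.writerows([c] for c in comments)


# Hypothetical page fragment standing in for a rendered comment section.
sample_html = """
<div id="comments">
  <span id="content-text">Loved this review!</span>
  <span id="content-text">Too long, <b>did not</b> like it.</span>
</div>
"""

parser = CommentTextParser()
parser.feed(sample_html)
csv_buffer = io.StringIO()
comments_to_csv(parser.comments, csv_buffer)
```

In practice the `io.StringIO` buffer would be replaced by a file opened with `newline=""`, and the sample markup by the real page source.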
B. Text Preprocessing and Model Building
Once the comments were extracted, we imported the CSV file into Python and performed the necessary pre-processing steps. Non-English comments and symbolic comments were removed from the dataset. Although emoji-based comments could have been valuable in conveying users' perspectives, they were not included due to the limitations of our model's scope and context. We transformed the dataset to classify sentiments as Positive, Negative, or Neutral.

To perform text sentiment analysis, we used the VADER (Valence Aware Dictionary and sEntiment Reasoner) model, which considers both the polarity (positive/negative) and the intensity (strength) of emotions. This model can be applied to unlabelled text data immediately and is included in the NLTK package. VADER uses a dictionary that assigns sentiment scores to lexical features, allowing us to determine the intensity of emotion in a text by summing up the intensity of each word. For example, words like "love," "enjoy," "glad," and "like" all convey positive feelings. VADER also accounts for the context of these words, such as the negative connotation of "did not love" and how capitalization and punctuation can emphasize words, such as "ENJOY."

The polarity_scores() method of VADER's SentimentIntensityAnalyzer class takes a string as input and returns a dictionary containing scores for the negative, neutral, and positive categories, together with a compound score (a normalized aggregate of the lexicon scores, ranging from -1 to +1). The implementation of this function is shown in Figure 4.
Finally, every comment is classified as positive, negative, or neutral. Figure 5 shows the result clearly.
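The labelling step can be sketched as below. This is a hedged illustration, not the implementation shown in Figure 4: the cut-off of 0.05 on the compound score is the conventional VADER threshold and is an assumption here, since the text does not state which threshold was used, and the helper names are hypothetical. The 0/1/2 class encoding follows the one used in the case study figures. The NLTK import is kept inside `vader_scores` because it needs the `nltk` package and its `vader_lexicon` resource.

```python
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a class id.

    threshold=0.05 is an assumed, conventional cut-off.
    """
    if compound >= threshold:
        return 2
    if compound <= -threshold:
        return 0
    return 1


def vader_scores(text):
    """Return VADER's score dictionary for one comment.

    Requires nltk and a prior nltk.download("vader_lexicon").
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer().polarity_scores(text)


def classify(scores):
    """Turn a score dictionary into a readable label."""
    return LABEL_NAMES[label_from_compound(scores["compound"])]


# Hand-written score dictionaries (hypothetical values) in the same shape
# that polarity_scores() returns, so the sketch runs without NLTK.
example_scores = [
    {"neg": 0.0, "neu": 0.3, "pos": 0.7, "compound": 0.64},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.42},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
example_labels = [classify(s) for s in example_scores]
```

With NLTK installed, `classify(vader_scores(comment))` would label a single scraped comment.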
In the Sentiments Classification stage, the text normalization techniques of stemming and lemmatization are applied. These techniques are widely used by natural language processing practitioners to prepare text, words, and documents for processing. Stemming is the process of reducing the morphological variants of a word to a common root or base form, which helps improve search accuracy when looking up information in the text. Stemming algorithms, also known as stemmers, are frequently used for this purpose. Our model employs three standard stemming methods: PorterStemmer(), LancasterStemmer(), and SnowballStemmer(). Lemmatization, on the other hand, applies a morphological analysis to words by considering the entire lexicon of a language and reducing words to their base form. Lemmatization does not classify sentences; instead, it uses a word's part of speech to determine its dictionary form (lemma). In our model, the standard WordNetLemmatizer() method is used for lemmatization.
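A small sketch of how the three stemmers and the lemmatizer could be compared side by side is given below. The helper names `nltk_normalizers` and `normalize_tokens` are hypothetical, and the choice of the English Snowball stemmer is an assumption. The NLTK imports sit inside the factory function because they need the `nltk` package (and the `wordnet` resource for the lemmatizer); the toy rule at the end is only a stand-in so that the sketch runs without NLTK, and it is not one of the algorithms named above.

```python
def nltk_normalizers():
    """Build the four normalizers named in the text.

    Requires nltk; WordNetLemmatizer also needs nltk.download("wordnet").
    """
    from nltk.stem import (
        LancasterStemmer,
        PorterStemmer,
        SnowballStemmer,
        WordNetLemmatizer,
    )

    return {
        "porter": PorterStemmer().stem,
        "lancaster": LancasterStemmer().stem,
        "snowball": SnowballStemmer("english").stem,
        # lemmatize() treats words as nouns unless a pos argument is given.
        "wordnet": WordNetLemmatizer().lemmatize,
    }


def normalize_tokens(tokens, normalizers):
    """Apply every normalizer to every token, keyed by normalizer name."""
    return {
        name: [normalize(token) for token in tokens]
        for name, normalize in normalizers.items()
    }


# Toy stand-in rule (strip a trailing "s"), used only to exercise the helper.
toy_normalizers = {
    "strip_plural_s": lambda word: word[:-1] if word.endswith("s") else word,
}
toy_result = normalize_tokens(["videos", "comments", "review"], toy_normalizers)
```

With NLTK available, `normalize_tokens(tokens, nltk_normalizers())` would show how the Porter, Lancaster, and Snowball stems differ from the WordNet lemma for the same tokens.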
C. Validation Steps
The process of ensuring that a model truly serves its intended function is known as model validation. This usually entails verifying that the model is accurate in the circumstances of its intended application. A model is validated using an evaluation metric. Common metrics include accuracy score, logarithmic loss, confusion matrix, root mean squared error, and the area under the curve. Our model used accuracy score and root mean squared error as its metrics. The basic intuitive notion of the accuracy score is the number of correct predictions divided by the total number of predictions made.
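The two metrics used here can be written out in a few lines of plain Python. This is an illustrative sketch with made-up labels, not output from the case studies: accuracy is the share of predictions that match the true labels, and root mean squared error is the square root of the mean squared difference between true and predicted values. Note that applying RMSE to class ids treats the encoding 0 (negative), 1 (neutral), 2 (positive) as ordered, which is an assumption of this sketch.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels: 0 = negative, 1 = neutral, 2 = positive.
y_true = [2, 2, 1, 0, 2, 1, 0, 2, 1, 2]
y_pred = [2, 2, 1, 0, 1, 1, 0, 2, 2, 2]

example_accuracy = accuracy(y_true, y_pred)  # 8 of 10 correct
example_rmse = rmse(y_true, y_pred)          # sqrt(2 / 10)
```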
We chose the account "Sucharita" on YouTube, shown in Figure 8, for our first case study. She is a full-time social media influencer whose content mainly concerns reviewing recent movies, series and online streamable entertainment. We chose her because, being an influencer, she has to create content regularly and keep her audience engaged. It is highly beneficial for her to analyse and see the distribution of her viewers' opinions: what they want more of, what they detest, and so forth. For her account, we proceeded to analyse the video for the movie review "Darlings", which is among the most viewed videos on her YouTube channel. The video was published on YouTube on the 5th of August, 2022. The video collected 193 comments over three months, and we scraped 119 comments to work with. The exact caption for the video is: "Darlings movie REVIEW | Sucharita Tyagi | Alia Bhatt, Shefali Shah, Vijay Varma | Netflix India".
Proceeding to the model description and analysis of this dataset, we found that the processed data contained 20 negative, 63 positive and 36 neutral comments. Figure 7 shows the same distribution as obtained in the code output. Note that 0 denotes the 'negative' class, 1 denotes the 'neutral' class, and 2 denotes the 'positive' class in the Figure. After this step, we upsample the minority classes, the 'neutral' and 'negative' classes. Upsampling is a technique used to balance imbalanced datasets. It involves generating new data points for the minority class to increase its representation in the dataset. This results in a more balanced distribution of data between different classes. Thanks to this equalisation process, the model does not tend to favour the class with the most members. The procedure concatenates the new dataframes to the majority class. So finally, our data at this stage contains 205 negative samples, 205 neutral samples and 63 positive samples.
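Upsampling by random resampling with replacement can be sketched as follows. This is a minimal illustration on plain Python lists rather than the dataframe-based procedure described above; the function name `upsample`, the `(text, label)` row layout, and the fixed seed are assumptions. The starting class sizes (20 negative, 36 neutral, 63 positive) are those reported for the first case study. The target size is left as a parameter: the example below uses the majority class size, whereas the text reports 205 samples for each upsampled class.

```python
import random
from collections import Counter


def upsample(rows, label_of, target_size, seed=0):
    """Resample each class with replacement up to `target_size` rows.

    Classes that already have at least `target_size` rows are kept as-is.
    """
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)

    balanced = []
    for group in groups.values():
        balanced.extend(group)
        shortfall = target_size - len(group)
        if shortfall > 0:
            # Draw extra copies of existing rows from the same class.
            balanced.extend(rng.choices(group, k=shortfall))
    rng.shuffle(balanced)
    return balanced


# Placeholder rows with the class sizes reported for the first case study:
# 0 = negative, 1 = neutral, 2 = positive.
rows = (
    [(f"negative comment {i}", 0) for i in range(20)]
    + [(f"neutral comment {i}", 1) for i in range(36)]
    + [(f"positive comment {i}", 2) for i in range(63)]
)

balanced_rows = upsample(rows, label_of=lambda row: row[1], target_size=63)
class_counts = Counter(label for _, label in balanced_rows)
```

Because the extra rows are copies of existing comments, upsampling should be applied with care around the train/test split so that duplicates of one comment do not end up on both sides.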
Secondly, we proceeded to analyse and draw inferences from one of the videos of the channel "Nutshell", shown in Figure 10. This channel's main aim is to present long-lost and unexplored stories from history, awareness videos about current affairs, and fact-based survey videos in a relatable way. It can be called an encyclopaedia of knowledge of current and past news. We again chose a top-viewed video on the channel, which was about "Black Money". The exact caption of the video is "What is black money? How does it circulate? | Ft. Andre Borges | Nutshell". The video was published on YouTube on the 21st of May, 2021. The video collected 228 comments over seventeen months, and our algorithm scraped 93 comments to work with.
Again, note that 0 denotes the 'negative' class, 1 denotes the 'neutral' class, and 2 denotes the 'positive' class in the Figure. Upsampling the minority classes, which are again the 'neutral' and 'negative' classes, we get a dataset containing 205 negative samples, 205 neutral samples and 45 positive samples.

Finally, we applied several machine learning classifiers to this dataset, similarly training the models on 70 per cent of the data and keeping the remaining 30 per cent for testing. The best accuracy score, 0.9781, came from the Gaussian Naïve Bayes Classifier. Hence, we can also conclude that the distribution we initially got is almost correct. This means that approximately 48 per cent of the video's viewers have a favourable opinion, 37 per cent have a neutral opinion and the remaining 15 per cent have a negative opinion about the video. Figure 12 shows the same diagrammatically.
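The 70/30 hold-out split can be sketched in plain Python as below. This is an illustrative stand-in, not the code used in the case studies: the function name `split_70_30`, the seed, and the placeholder samples are assumptions. Since the study uses Scikit-learn classifiers, its `train_test_split` function with `test_size=0.3` would be the usual library equivalent.

```python
import random


def split_70_30(samples, test_fraction=0.3, seed=42):
    """Shuffle the samples and hold out `test_fraction` of them for testing.

    Returns (train, test). The input sequence is left unchanged.
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder (text, label) pairs; the labels are arbitrary.
samples = [(f"comment {i}", i % 3) for i in range(10)]
train_set, test_set = split_70_30(samples)
```

Each classifier is then fitted on the training part only and scored on the held-out part, which is how an accuracy figure such as the one reported above is obtained.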
V. CONCLUSION

After performing these two case studies, we can say that positive comments formed the largest share for both videos. Moreover, the respective viewers liked the contents of the two videos and want more of the same kind in future. The ratio of negative comments was meagre, i.e., less than 20 per cent for both videos, which is an ideal scenario for the content creators. Ideally, these creators or YouTubers want as few negative comments as possible. Hence, through this project, we have concluded that by performing sentiment analysis on the comments of their videos, creators can get the gist of their viewers' opinions. Further, if they want, they can go through the positive, negative and neutral opinions separately according to their purpose. When they would like to thank the viewers for their appreciation, they can reply to the positive comments. They can go through the negative comments when they want to improve their content. Furthermore, when they want suggestions for other content creation, they can go through the neutral comments, which generally contain suggestions from the viewers.

REFERENCES

[1] Alexandre Ashade Lassance Cunha, Melissa Carvalho Costa, and Marco Aurélio C. Pacheco, "Sentiment Analysis of YouTube Video Comments Using Deep Neural Networks," 24 May 2019.
[2] Hanif Bhuiyan, Jinat Ara, Rajon Bardhan, and Md. Rashedul Islam, "Retrieving YouTube video by sentiment analysis on user comment," 14 September 2017.
[3] Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, and Congkai Sun, "YouTube Movie Reviews: Sentiment Analysis in an Audio-Visual Context," 27 March 2013.
[4] Muhammad Zubair Asghar, Shakeel Ahmad, Afsana Marwat, and Fazal Masud Kundi, "Sentiment Analysis on YouTube: A Brief Survey," 30 November 2015.
[5] Olga Uryupina, Barbara Plank, Aliaksei Severyn, Agata Rotondi, and Alessandro Moschitti, "SenTube: A Corpus for Sentiment Analysis on YouTube Social Media," May 2014.