Movie Reviews Sentiment Analysis Using BERT
A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Science
by
Gibson Nkhata
Mzuzu University
Bachelor of Science in Information and Communication Technology, 2018
December 2022
University of Arkansas
This thesis is approved for recommendation to the Graduate Council.
Susan Gauch, Ph.D.
Thesis Director
Justin Zhan, Ph.D.
Committee member
Ukash Nakarmi, Ph.D.
Committee member
Yanjun Pan, Ph.D.
Committee member
ABSTRACT
Sentiment analysis (SA), or opinion mining, is the analysis of emotions and opinions expressed in text. It is one of the active research areas in Natural Language Processing (NLP). Various approaches have been deployed in the literature to address the problem. These techniques devise complex and sophisticated frameworks to attain optimal accuracy, focusing on polarity (binary) classification. In this thesis, we aim to fine-tune BERT in a simple but robust approach to movie review sentiment analysis that provides better accuracy than state-of-the-art (SOTA) methods. We start by conducting sentiment classification for every review, followed by computing the overall sentiment polarity across all the reviews. Both polarity classification and fine-grained classification, i.e., multi-scale sentiment distribution, are implemented and tested on benchmark datasets in our work. To optimally adapt BERT for sentiment classification, we concatenate it with a Bidirectional LSTM (BiLSTM) layer. We also implemented and evaluated accuracy improvement techniques, including the Synthetic Minority Over-sampling TEchnique (SMOTE) and the NLP Augmenter (NLPAUG), to improve the model's prediction of multi-scale sentiment distributions. We found that including NLPAUG improved accuracy, whereas SMOTE did not work well. Lastly, a heuristic algorithm is applied to compute the overall polarity of the predicted reviews from the model's output vector. We call our model BERT+BiLSTM-SA, where SA stands for Sentiment Analysis. Our best-performing approach comprises BERT and BiLSTM on binary, three-class, and four-class sentiment classification, with SMOTE augmentation added, on top of BERT and BiLSTM, for five-class sentiment classification. Our approach performs on par with SOTA techniques on both classification types. For example, on binary classification we obtain 97.67% accuracy, while the best-performing SOTA model, NB-weighted-BON+dv-cosine, achieves 97.40% accuracy on the popular IMDb dataset. The baseline, Entailment as Few-Shot Learners (EFL), is outperformed on this task by 1.30%. On the other hand, for five-class classification on SST-5, the best SOTA model, RoBERTa+large+Self-explaining, achieves 55.5% accuracy, while we obtain 59.48%. We outperform the baseline on this task, BERT-large, by 3.6%.
DEDICATION
This thesis is dedicated to my late mother, Lincy Pyera Nyavizala Mphande. May her soul
continue resting in peace.
ACKNOWLEDGEMENTS
Firstly, I am grateful to my advisor, Dr. S. Gauch, for her valuable help towards the completion of this work. I am also thankful to my former advisor, Dr. J. Zhan, and to Dr. Usman Anjum for their initial support towards this thesis.
I am grateful to the Institute of International Education/Agricultural Transformation Initiative (IIE/ATI) scholarship programme for making it possible for me to study at the University of Arkansas. I also thank the Data Analytics that are Robust and Trusted (DART) project for supporting this work through the University of Arkansas CSCE department.
Last but not least, I thank all who in one way or another contributed to the completion of this thesis; your efforts are not taken for granted.
TABLE OF CONTENTS
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Deep learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Deep Learning on Sentiment Analysis . . . . . . . . . . . . . . . . . . 5
2.2.2 Deep Learning on Movie Reviews Sentiment Analysis . . . . . . . . . 6
2.3 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 BERT and Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 BERT and Movie Reviews Sentiment Analysis . . . . . . . . . . . . . 9
3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.2 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.3 Fine-tuning BERT with BiLSTM . . . . . . . . . . . . . . . . . . . . 13
3.1.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.5 Accuracy Improvement Approaches . . . . . . . . . . . . . . . . . . . 17
3.1.6 Overall polarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.7 Overview of Our Work . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.1 IMDb movie reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1.2 SST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.1.3 MR Movie Reviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.4 Amazon Product Data dataset . . . . . . . . . . . . . . . . . . . . . . 27
4.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Experimental settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5.1 Evaluation of Goal 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.5.2 Evaluation of Goal 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
A All Publications Submitted . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
LIST OF FIGURES
Figure 3.1: Simplified diagram of BERT . . . . . . . . . . . . . . . . . . . . . . . . . 13
Figure 3.2: Fine-tuning part of BERT with BiLSTM . . . . . . . . . . . . . . . . . . 15
Figure 3.3: Binary Tree Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 3.4: Overview of our work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
LIST OF TABLES
Table 4.1: Accuracy (%) Comparisons of Models on Benchmark Datasets for Binary
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Table 4.2: Accuracy (%) Comparisons for Three and Four Class Classification on IMDb 31
Table 4.3: Accuracy (%) Comparisons of Models on Benchmark Datasets for Five
Class Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Table 4.4: Accuracy (%) of Our Model with Accuracy Improvement Techniques on
SST-5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Table 4.5: Overall Polarity Computation on All the Datasets . . . . . . . . . . . . . 33
1 Introduction
Sentiment analysis aims to determine the polarity of emotions, such as happiness, sorrow, grief, hatred, anger, and affection, and of opinions expressed in text, reviews, and posts, which are available on many media platforms [1]. Sentiment analysis helps in tracking people's viewpoints. For example, it is a powerful marketing tool that enables product managers to understand customer emotions in their various marketing campaigns. It is an important factor in social media monitoring, product and brand recognition, customer satisfaction, customer loyalty, advertising and promotional success, and product acceptance. It is among the most popular and valuable tasks in the field of NLP [2]. Sentiment analysis can be conducted as polarity (binary) classification or as fine-grained classification, i.e., multi-scale sentiment distribution.
Movie reviews are an important means of assessing the performance of a particular movie. Whereas a numerical or star rating quantitatively tells us about the success or failure of a movie, a collection of movie reviews gives us deeper qualitative insight into its different aspects. A textual movie review describes the strengths and weaknesses of the movie, and deeper analysis of a review tells us whether the movie generally satisfied the reviewer. We work on movie review sentiment analysis in this study because movie reviews have standard benchmark datasets on which salient, high-quality work has been published, for example in [3].
BERT is a popular pre-trained language representation model that has proven to perform well on many NLP tasks such as named entity recognition, question answering, and text classification [4]. It has been used in information retrieval in [5] to build an efficient ranking model for industry use cases. The pre-trained language model was also successfully utilised in [6] for extractive summarization of text, and was used for question answering with satisfactory results in [7]. Yang et al. [8] efficiently applied the model to data augmentation, yielding optimal results. BERT has been used in [9] primarily for sentiment analysis, but the accuracy is not satisfactory.
In this thesis, we fine-tune BERT for sentiment analysis on movie reviews, comparing both binary and fine-grained classification, and achieve, with our best method, accuracy that surpasses state-of-the-art (SOTA) models. Our fine-tuning couples BERT with a Bidirectional LSTM (BiLSTM), and we use the resulting model for binary and fine-grained sentiment classification tasks. To deal with the class imbalance problem in fine-grained classification, we also implement oversampling and data augmentation techniques.
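To illustrate the oversampling idea, the following is a minimal SMOTE-style sketch, not the implementation used in this work: the function name and the two-dimensional toy vectors are hypothetical, and real SMOTE interpolates towards k-nearest neighbors rather than random pairs.

```python
import random

def smote_like_oversample(minority, n_new, seed=0):
    """Create n_new synthetic vectors by interpolating between
    random pairs of minority-class feature vectors (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # pick two real minority examples
        lam = rng.random()              # interpolation factor in [0, 1]
        synthetic.append([x + lam * (y - x) for x, y in zip(a, b)])
    return synthetic

# Toy 2-D feature vectors for an under-represented sentiment class.
minority = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
new_points = smote_like_oversample(minority, n_new=4)
print(len(new_points))  # 4 synthetic examples added to the class
```

Because each synthetic point lies on the segment between two real examples, the augmented class keeps the shape of the original distribution instead of merely duplicating samples.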
Fine-tuning is a common technique for transfer learning. The target model copies all model designs, with their parameters, from the source model except the output layer, and fine-tunes these parameters on the target dataset. The main benefit of fine-tuning is that the entire model need not be trained from scratch. Hence, we fine-tune BERT by adding a BiLSTM layer and train the model on movie review sentiment analysis benchmark datasets.
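The copy-everything-but-the-output-layer step can be sketched as follows. This is a toy illustration with hypothetical layer names and plain Python lists standing in for weight tensors, not the actual BERT fine-tuning code.

```python
import random

def init_target_model(source_params, num_target_classes, seed=0):
    """Transfer-learning initialisation: reuse every pretrained layer
    except the task-specific output layer, which is re-created."""
    rng = random.Random(seed)
    target = {name: list(weights)  # copy pretrained weights verbatim
              for name, weights in source_params.items()
              if name != "output_layer"}
    # Fresh, randomly initialised output layer sized for the new task.
    target["output_layer"] = [rng.uniform(-0.1, 0.1)
                              for _ in range(num_target_classes)]
    return target

# Pretend these are pretrained weights from the source model.
source = {"embedding": [0.1, 0.2], "encoder": [0.3, 0.4], "output_layer": [9.9]}
target = init_target_model(source, num_target_classes=5)
print(target["encoder"])            # [0.3, 0.4] — reused from the source model
print(len(target["output_layer"]))  # 5 — re-initialised for the target task
```

The copied layers are then updated only gently during training on the target dataset, which is what makes fine-tuning far cheaper than training from scratch.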
BERT processes input features bidirectionally [4], and so does BiLSTM [10]. The primary idea behind bidirectional processing is to present each training sequence forwards and backwards to two separate recurrent networks, both of which are connected to the same output layer [10]. That is, neither BERT nor BiLSTM processes inputs in strictly temporal order; their outputs are based on both the preceding and the following context.
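The forwards-and-backwards idea can be sketched conceptually as follows. A trivial leaky-accumulator step stands in for an LSTM cell here; the function names are hypothetical and this is not the thesis's model code.

```python
def simple_recurrent(seq, step):
    """Run a recurrent step left-to-right, returning one state per position."""
    state, states = 0.0, []
    for x in seq:
        state = step(state, x)
        states.append(state)
    return states

def bidirectional(seq, step):
    """Pair each position's forward state with its backward state, so every
    position sees context from both the past and the future."""
    fwd = simple_recurrent(seq, step)
    bwd = list(reversed(simple_recurrent(list(reversed(seq)), step)))
    return list(zip(fwd, bwd))

# A leaky accumulator stands in for an LSTM cell.
step = lambda state, x: 0.5 * state + x
out = bidirectional([1.0, 2.0, 3.0], step)
print(out[0])  # (1.0, 2.75): forward has seen only x1; backward has seen x3, x2, x1
```

In a real BiLSTM the two state vectors at each position are concatenated and fed to the shared output layer, which is exactly why the prediction at any token depends on both directions of context.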
Following that, we compute an overall polarity on the output vector from BERT+BiLSTM-