Fake Review Detector
Summer Training Report
submitted in partial fulfilment of the requirements for the
degree of Bachelor of Technology
in
Computer Science & Engineering
Submitted By
Aman Kumar (01415002720)
To
Computer Science and Engineering Department
Maharaja Surajmal Institute of Technology, Affiliated to Guru
Gobind Singh Indraprastha University, Janakpuri, New Delhi-58
2020-24
CERTIFICATE
I, Aman Kumar, hereby declare that I have completed my six-week summer training
program on ‘Deep Learning for Robotic Arm Application’ provided by AI-Shala from
5th August 2022 to 16th September 2022. As part of the course, I was given an
opportunity to do a classification project. I have completed this project and have
written this report to present my findings. I declare that all the code for my project was written by me, and I confirm that this project report was prepared solely for academic purposes.
Aman Kumar
(01415002720)
CERTIFICATE FROM ORGANIZATION
ACKNOWLEDGEMENT
I gratefully acknowledge the expert guidance provided by Mr. Anil Sharma, the instructor of the Deep Learning course offered by AI-Shala, for helping me learn the concepts so well. I would also like to thank Mrs. Rinki Dwiwedi, Head of the Computer Science Department, for her support and guidance.
I am also very grateful to the entire computer science department of Maharaja Surajmal
Institute of Technology for their constant guidance.
ABSTRACT
Consumers’ reviews on e-commerce websites and online services, along with ratings and experience stories, are useful to both the user and the vendor. Reviewers can strengthen a brand’s loyalty and help other customers understand their experience with a product.
Similarly, reviews help vendors increase their profits by boosting product sales when consumers leave positive feedback. Unfortunately, these review mechanisms can be misused by vendors: for example, one may create fake positive reviews to promote a brand’s reputation, or demote a competitor’s products by leaving fake negative reviews on them.
Existing supervised solutions include the application of different machine learning algorithms and tools such as Weka.
Unlike existing work, instead of using a constrained dataset I chose to work with a wide variety of vocabulary, combining datasets on different subjects into one large dataset. Sentiment analysis is incorporated based on the emojis and text content in the reviews. Fake reviews are detected and categorized. The testing results are obtained by applying the Naïve Bayes, Linear SVC, Support Vector Machine and Random Forest algorithms.
The implemented solution classifies these reviews as fake or genuine. The highest accuracy is obtained using Naïve Bayes with the sentiment classifier included.
TABLE OF CONTENTS
Certificate (student)……………………………………………………………….
Certificate (From the organization where the training is completed)…………….
Acknowledgement…………………………………………………………………
Abstract……………………………………………………………………………
Chapter 1 : Introduction……………………………………………………………
1.1 Need and Objective……………………………………………………
1.2 Methodology…………………………………………………………..
1.3 Software Used…………………………………………………………
1.4 About Organization…………………………………………………….
Chapter 2 : Project Design……………………………………………………….....
2.1 Methodology..........................................................................................
2.2 Software Development Life Cycle...........................................................
2.3 Feasibility Study.........................................................................................
2.4 Requirement Analysis................................................................................
2.5 Software/ Hardware Requirements............................................................
Chapter 3 : Implementation............................................................................................
Chapter 4 : Result and Discussion..................................................................................
Chapter 5 : Future Scope and Conclusion......................................................................
References.......................................................................................................................
CHAPTER 1
INTRODUCTION
1.1 NEED AND OBJECTIVE
Everyone can freely express his or her views and opinions anonymously and without fear of consequences. Social media and online posting have made it even easier to post confidently and openly. These opinions have both pros and cons: they provide the right feedback to the right person, which can help fix an issue, but they can also be manipulated. Because these opinions are regarded as valuable, people with malicious intentions can easily game the system, giving an impression of genuineness while posting opinions to promote their own product or to discredit competitor products and services, without revealing their own identity or the organization they work for. Such people are called opinion spammers, and these activities can be termed opinion spamming.
There are a few different types of opinion spamming. One type is giving undeserved positive opinions to promote some products, or untrue negative reviews to damage the reputation of others. A second type consists of advertisements with no opinion of the product.
A lot of research has been done in the field of sentiment analysis, with models built using different sentiment analysis techniques on data from various sources, but the primary focus has been on the algorithms and not on actual fake review detection. In one such work, E. I. Elmurngi and A. Gherbi [1] used machine learning algorithms to classify product reviews in an Amazon.com dataset, including customer usage of the product and buying experiences. Opinion Mining, a type of natural language processing, tracks the emotions and thought processes of users about a product, which can in turn aid such research.
Opinion mining, also called sentiment analysis, involves building a system to collect and examine opinions about a product made in social media posts, comments, online product and service reviews, or even tweets. Automated opinion mining uses machine learning, a component of artificial intelligence. An opinion mining system can be built using software that extracts knowledge from a dataset and incorporates other data to improve its performance.
One of the biggest applications of opinion mining is in online and e-commerce reviews of consumer products, feedback and services. Because these opinions are so helpful for both the user and the seller, e-commerce websites encourage their customers to leave feedback and a review of the product or service they purchased. These reviews provide valuable information that potential customers use to learn the opinions of previous or current users before deciding to purchase a product from a given seller. Similarly, sellers and service providers use this information to identify defects or problems users face with their products, and to gather competitive information about similar products of their competitors.
There is a lot of scope for using opinion mining, with many applications for different usages:
Individual consumers: A consumer can compare summaries of competing products before taking a decision, without missing out on better products available in the market.
Businesses/Sellers: Opinion mining helps sellers reach their audience and understand how their product, as well as its competitors, is perceived. Such reviews also help sellers understand issues or defects so that they can improve later versions of their product. Today, encouraging consumers to write a review of a product has become a good marketing strategy, because the product is promoted through a real audience’s voice. This precious information, however, has been spammed and manipulated; among many studies, one fascinating piece of research set out to identify deceptive opinion spam.
1.2 PROPOSED METHODOLOGY
We provide a global overview of the various features that can be employed to detect fake reviews. Since the most effective approaches in the literature are generally supervised and consider review-centric and reviewer-centric features, these two classes will be taken into consideration.
A. Review-centric Features
The first class of features is constituted by those related to a review. They can be extracted both from the text constituting the review, i.e., textual features, and from meta-data connected to the review, i.e., meta-data features. A large portion of reviews are singletons, i.e., only one review is written by a given reviewer in a certain period of time; for this kind of review, specific features must be designed.
1) Textual Features
It is possible to use Natural Language Processing techniques to extract simple features from the text, and to use as features statistics and sentiment estimations connected to the words used. Several approaches employ as textual features both unigrams and bigrams extracted from the text of reviews.
Statistical data like
Number of words,
Ratio of capital letters,
Ratio of capital words,
Ratio of first person pronouns,
Ratio of ‘exclamation’ sentences,
A number representing the proportion of subjective words.
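As an illustration, these statistics could be computed along the following lines (a minimal sketch; the function name and the first-person word list are illustrative choices, not the project's actual code, and the subjective-word proportion would additionally need a subjectivity lexicon):

```python
import re

FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

def textual_stats(review):
    """Simple statistical features extracted from a review's text."""
    words = re.findall(r"[A-Za-z']+", review)
    n_words = max(len(words), 1)  # avoid division by zero
    n_sentences = max(len(re.split(r"[.!?]+", review.strip())) - 1, 1)
    return {
        "num_words": len(words),
        "capital_letter_ratio": sum(c.isupper() for c in review) / max(len(review), 1),
        "capital_word_ratio": sum(w.isupper() for w in words) / n_words,
        "first_person_ratio": sum(w.lower() in FIRST_PERSON for w in words) / n_words,
        "exclamation_ratio": review.count("!") / n_sentences,
    }

print(textual_stats("I LOVED it! Best purchase I ever made."))
```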
2) Meta-data Features
They can be generated by reasoning on the review’s cardinality with respect to the reviewer
and the entity reviewed.
These features include:
Basic features like
Rating of review
Rating deviation, i.e., the deviation of the evaluation provided in the review with
respect to the entity’s average rating
Singleton feature
Burst features which can be either due to sudden popularity of the entities reviewed or
to spam attacks.
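A hedged sketch of how the rating deviation and singleton features could be derived with pandas (the column names reviewer_id, entity_id and rating are assumptions for illustration):

```python
import pandas as pd

def metadata_features(df):
    """Add rating-deviation and singleton columns to a reviews table."""
    out = df.copy()
    # Deviation of each rating from the reviewed entity's average rating
    entity_avg = df.groupby("entity_id")["rating"].transform("mean")
    out["rating_deviation"] = (df["rating"] - entity_avg).abs()
    # Singleton: the only review written by its reviewer
    per_user = df.groupby("reviewer_id")["rating"].transform("count")
    out["is_singleton"] = (per_user == 1).astype(int)
    return out
```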
B. Reviewer-centric features
This group of features is composed of features related to the reviewer’s behavior. In this way it is possible to go beyond the content and meta-data associated with a single review, which are limited for classification, and to consider the behavior of users in writing reviews in general.
1) Textual features
The textual features are employed to address the problem of review duplication. The following textual features have been considered:
Maximum Content Similarity (MCS), i.e., the evaluation of the maximum similarity over the user’s reviews.
Average Content Similarity (ACS), i.e., the evaluation of the average similarity over the user’s reviews.
Word number average, i.e., the average number of words that the user utilizes in his/her reviews.
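For illustration, MCS and ACS could be computed as the maximum and average pairwise cosine similarity between TF-IDF vectors of one user's reviews (a sketch, not the project's exact code):

```python
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def content_similarity(user_reviews):
    """Return (MCS, ACS) over the given list of one user's review texts."""
    if len(user_reviews) < 2:
        return 0.0, 0.0  # a singleton has no pairs to compare
    tfidf = TfidfVectorizer().fit_transform(user_reviews)
    sims = [cosine_similarity(tfidf[i], tfidf[j])[0, 0]
            for i, j in combinations(range(len(user_reviews)), 2)]
    return max(sims), sum(sims) / len(sims)

mcs, acs = content_similarity(["Great phone, love it", "Great phone, love it!"])
print(mcs, acs)
```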
2) Rating features
They are based on some aggregation, for each considered reviewer, of the information concerning the ratings:
Total number of reviews.
Ratios, i.e., the ratio of negative, positive and ‘extreme’ reviews.
Average deviation from entity’s average.
3) Temporal features
They are based on temporal information that further describes how the ratings are distributed over time:
Activity time of the user, i.e., the difference between the timestamps of the last and first reviews of a given reviewer.
Maximum number of ratings per day.
Data entropy, i.e., the temporal gap in days between consecutive pairs of reviews.
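These temporal features might be computed as follows (a sketch; the reviewer_id and date column names are assumed):

```python
import pandas as pd

def temporal_features(df):
    """Per-reviewer activity time and maximum reviews per day."""
    df = df.assign(date=pd.to_datetime(df["date"]))
    by_user = df.groupby("reviewer_id")["date"]
    # Activity time: days between a reviewer's first and last review
    activity = (by_user.max() - by_user.min()).dt.days
    # Maximum number of reviews posted by the reviewer on one day
    per_day = df.groupby(["reviewer_id", df["date"].dt.date]).size()
    max_per_day = per_day.groupby(level=0).max()
    return pd.DataFrame({"activity_days": activity,
                         "max_reviews_per_day": max_per_day})
```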
The following techniques are used to implement supervised machine learning for classification, to balance the data, and to test the classifier.
C. Choice of the classifier and implementation
The majority of supervised classifiers used to tackle the issue of opinion spam detection are based on Naïve Bayes or Support Vector Machines (SVM). To implement the classifiers, the Python programming language has been employed, as it is used by a large community of developers and thus offers a vast set of tools and libraries for different aims.
D. Choice of the dataset
The classification provided by Yelp has been used as ground truth, where recommended reviews correspond to ‘genuine’ reviews and not-recommended reviews correspond to ‘fake’ ones. The strengths of these datasets are:
The high number of reviews per user, which makes it possible to consider the behavioral features of each user.
The diversified kinds of entities reviewed, i.e., restaurants and hotels.
The datasets contain only basic information, such as the content, label, rating, and date of each review, connected to the user who generated them.
E. Balancing data
Imbalanced data represents one of the major issues that must be tackled when performing supervised classification. In the training phase, if the imbalance of the training data is not considered, there is a risk that the classifier learns mainly from the largest class of labeled data, neglecting the minority class. The oversampling method is adopted here: it consists in augmenting the minority class to balance it with the largest one.
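A minimal sketch of random oversampling with scikit-learn's resample utility (assuming a DataFrame with a label column; dedicated libraries such as imbalanced-learn implement the same idea):

```python
import pandas as pd
from sklearn.utils import resample

def oversample_minority(df, label_col="label"):
    """Duplicate minority-class rows at random until classes are balanced."""
    counts = df[label_col].value_counts()
    minority = df[df[label_col] == counts.idxmin()]
    upsampled = resample(minority, replace=True,
                         n_samples=int(counts.max()), random_state=42)
    majority = df[df[label_col] == counts.idxmax()]
    return pd.concat([majority, upsampled]).reset_index(drop=True)
```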
1.3 Software Used
Windows 10
Python 3.5.2
Different libraries are available in Python that help in machine learning and classification projects. Several of those libraries have improved the performance of this project; a few of them are mentioned in this section.
First, “NumPy” provides a collection of high-level math functions and support for multi-dimensional matrices and arrays. It is used for faster computation over weights (gradients) in neural networks.
Second, “scikit-learn” is a machine learning library for Python which features different algorithms and machine learning function packages.
Third, NLTK, the Natural Language Toolkit, is helpful in word processing and tokenization.
The project makes use of the Anaconda environment, an open-source Python distribution which simplifies package management and deployment and is well suited to large-scale data processing.
1.4 About Organization
AI-Shala
Figure 1.1: Organisation logo
AI-Shala is an ed-tech platform where students and working professionals can learn the skills essential to start a career in Artificial Intelligence.
Their aim is to provide training programs that make students job-ready. Their hiring partners trust these training programs and consistently hire from them.
Their features
Live lectures with coding
Job preparation
Expert Mentorship
Developer and discussion forum
Certificate on completion
Internship opportunity
Their Vision
With this program, students will learn how to approach and solve problems using machine learning. It also gives them complete guidance for their interviews and a recommendation letter from their mentor. The mentor gives regular performance evaluations, which help students improve upon their weak areas.
Chapter 2- Project Design
To solve the major problem faced by online websites due to opinion spamming, this project proposes to identify spammed fake reviews by classifying them as fake or genuine. The method attempts to classify, with greater accuracy, reviews obtained from freely available datasets from various sources and categories, including service-based, product-based, customer feedback, experience-based and crawled Amazon datasets, using the Naïve Bayes [7], Linear SVC, SVM, Random Forest and Decision Tree algorithms. To improve the accuracy, additional features such as the sentiment of the review, verified purchases, ratings, emoji count and product category are used alongside the review details and the overall score.
A classifier is built based on the identified features, and those features are assigned a probability factor or a weight depending on the classified training sets. This is a supervised learning technique applying different machine learning algorithms to detect fake or genuine reviews. The high-level architecture of the implementation can be seen in Figure 1, and the problem is solved in the following six steps:
2.1 Data Collection
Consumer review data collection: raw review data was collected from different sources, such as Amazon, airline, hotel and restaurant booking websites, CarGurus, etc., to increase the diversity of the review data. A dataset of 21,000 reviews was created.
2.2 Data Preprocessing
The data is processed and refined by removing irrelevant and redundant information, as well as noisy and unreliable data, from the review dataset.
Step 1: Sentence tokenization. The entire review is given as input and tokenized into sentences using the NLTK package.
Step 2: Removal of punctuation marks. Punctuation marks used at the start and end of reviews are removed, along with additional white spaces.
Step 3: Word tokenization. Each individual review is tokenized into words, which are stored in a list for easier retrieval.
Step 4: Stop-word removal and stemming. Stop words are removed, and affixes are stripped to reduce each word to its stem. For example, the stem of "cooking" is "cook", and the stemming algorithm knows that the "ing" suffix can be removed. A few words from the frequent word list are shown in Figure 2.
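Steps 1-4 could look roughly like this with NLTK (a sketch under the assumption that the punkt tokenizer and stop-word list are available; the project's actual snippets are in the figures):

```python
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")        # one-time download of the tokenizer model
nltk.download("stopwords")    # one-time download of the stop-word list

STOP = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(review):
    """Sentence-tokenize, clean, word-tokenize, drop stop words, stem."""
    words = []
    for sentence in nltk.sent_tokenize(review):              # Step 1
        for token in nltk.word_tokenize(sentence):           # Step 3
            token = token.strip(string.punctuation).lower()  # Step 2
            if token and token not in STOP:                  # Step 4
                words.append(STEMMER.stem(token))
    return words

print(preprocess("I was cooking dinner... Absolutely loved this pan!"))
```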
2.3 Feature extraction
The preprocessed data is converted into a set of features by applying certain parameters. The following features are extracted:
Normalized length of the review: fake reviews tend to be shorter.
Reviewer ID: a reviewer may post multiple reviews under the same Reviewer ID.
Rating: fake reviews in most scenarios have 5 out of 5 stars, to entice the customer, or the lowest rating, to hurt competing products; rating therefore plays an important role in fake review detection.
Verified Purchase: fake reviews are less likely to be marked as verified purchases than genuine reviews.
This combination of features is selected for identifying fake reviews, which in turn improves the performance of the prediction models.
2.4 Sentiment Analysis
The reviews are classified according to their emotion factor or sentiment: positive, negative or neutral. This includes predicting whether a review is positive or negative according to the words used in the text, the emojis used, the rating given, and so on. Related research shows that fake reviews carry stronger positive or negative emotions than true reviews. The reason is that fake reviews are written to sway people’s opinions, so conveying opinions matters more than plainly describing the facts.
The subjective vs. objective ratio matters: spammers post fake reviews with more subjective content, expressing emotions such as how happy the product made them rather than conveying what the product is or does. Positive vs. negative sentiment: the sentiment of the review is analyzed, which in turn helps in deciding whether it is a fake or genuine review.
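As a simple illustration of how text sentiment and the star rating could be combined into a label (the blending rule and thresholds are illustrative assumptions, not the project's exact method):

```python
from textblob import TextBlob

def sentiment_label(text, rating=None):
    """Label a review positive/negative/neutral from text polarity,
    optionally blended with the star rating mapped to [-1, 1]."""
    polarity = TextBlob(text).sentiment.polarity  # in [-1.0, 1.0]
    if rating is not None:
        polarity = 0.5 * polarity + 0.5 * (rating - 3) / 2.0
    if polarity > 0.1:
        return "positive"
    if polarity < -0.1:
        return "negative"
    return "neutral"

print(sentiment_label("Best purchase ever!!! Absolutely amazing.", rating=5))
```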
2.5 Fake Review Detection
Classification assigns the items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Each record in the review file is assigned a weight, based on which it is classified into one of the two classes: Fake or Genuine.
2.6 Performance Evaluation and Results
Comparison of the accuracies of the various models and classifiers, with enhancements for better results.
Chapter 3- Implementation
The implementation of this project uses a supervised learning technique on the datasets; the fake and genuine labels allow us to cross-validate the classification results.
Data collection is done by choosing appropriate datasets. Labeled review datasets from different sources, such as hotel reviews, Amazon product reviews, and other freely available review datasets, are combined into the Reviews.txt file.
First, the dataset is explored by loading it in CSV format, as shown in Figure 3. Then, to make it readable, the entries are clearly labelled as fake or genuine, as shown in Figure 4.
The dataset, created from multiple sources of information, contains many forms of redundant and unclean values. Such data is neither useful nor easy to model.
Preprocessing: the data is cleaned by removing all null values, white spaces and punctuation. The raw dataset is loaded in the form of tuples using the code shown in Figure 5, allowing us to focus only on the textual review content.
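A hedged sketch of the loading and labelling steps just described (the separator and column layout of Reviews.txt are assumptions for illustration):

```python
import pandas as pd

# Load the combined review dataset (schema assumed for illustration)
df = pd.read_csv("Reviews.txt", sep="\t",
                 names=["id", "label", "rating", "verified", "category", "text"])

# Make the labels readable: F -> fake, T -> genuine
df["label"] = df["label"].map({"F": "fake", "T": "genuine"})

# Drop null values and keep (id, text, label) tuples for the text experiments
df = df.dropna()
tuples = list(df[["id", "text", "label"]].itertuples(index=False, name=None))
print(df["label"].value_counts())
```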
Then the raw data is preprocessed by applying tokenization, stop-word removal and lemmatization. The code snippet used is shown in Figure 6.
Feature Extraction: the text reviews have different features or peculiarities that can help solve the classification problem, e.g. the length of reviews (fake reviews tend to be shorter, with fewer facts revealed about the product) and repetitive words (fake reviews have a smaller vocabulary, with words repeated). Apart from the review text itself, there are other features that can contribute towards classifying reviews as fake; the significant ones used as additional features are ratings, verified purchase and product category. The code snippet used to extract them is shown in Figure 7.
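Continuing the loading sketch above, the additional feature columns might be derived like this (column names are illustrative):

```python
# Normalized review length: fake reviews tend to be shorter
df["length"] = df["text"].str.split().str.len()
df["norm_length"] = df["length"] / df["length"].max()

# Vocabulary richness: unique words / total words (fake reviews repeat words)
df["vocab_ratio"] = df["text"].apply(
    lambda t: len(set(t.lower().split())) / max(len(t.split()), 1))

features = df[["norm_length", "vocab_ratio", "rating", "verified", "category"]]
```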
Figure 8 and Figure 9 show the count of the reviews for each feature.
Figure 8: Verified Purchase Review Count
Figure 9: Rating Review count
Sentiment Analysis: the processed data is now analyzed for emotion or sentiment, i.e., whether the review is positive or negative. The significant factors for the sentiment analysis of the reviews are the sentiment scores of emoticons and the rating of the review. Note that while removing punctuation marks, a list of emoticons is treated as an exception, so that they are not accidentally discarded while cleaning the dataset. This is explained in more detail under Enhancement 4 in Chapter 4. Sentiment analysis is performed with different classification algorithms, namely Naïve Bayes, Linear SVC, non-linear SVM and Random Forest, to obtain better results and compare the accuracies.
Fake Review Detection: the final goal of the project is to classify the reviews as fake or genuine. The preprocessed dataset is therefore classified using different classification algorithms over a variety of data.
3.1 NLP-based TextBlob Classifier:
The two classifiers used in this configuration are:
a. Naive Bayes classifier
b. Decision Tree classifier
The experimental configuration for both classifiers was kept the same; this section describes the configurations used to set up the models for training with the Python client. Naïve Bayes and Decision Tree classifiers are used for detecting genuine (T) and fake (F) reviews across a wide range of data. The probability for each word is given by the ratio of the frequency of the word in a class to the total number of words in that class. The dataset is split into 80% training and 20% testing: 16,800 reviews for training and 4,200 for testing. Finally, the model is tested on a test set where the probability of each review is calculated for each class, and the review is assigned the label of the class with the highest probability, i.e. true/genuine (T) or fake (F). The datasets used for training are F-train.txt and T-train.txt. They include the Review ID (e.g. ID-1100) as well as the review text ("Great product"), as shown in Figure 10 and Figure 11 respectively.
Figure 10: F-train.txt: (Fake review training dataset)
Figure 11: T-train.txt: (True review training dataset)
Figure 12: Testing Data.txt: (Fake review testing dataset)
Figure 12 contains the testing dataset, which has only the ID and text of each review. The output after running the model is stored in output.txt, which contains the prediction for each review: fake or true, i.e. F/T.
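TextBlob ships ready-made Naïve Bayes and Decision Tree classifiers that fit this setup; below is a minimal sketch with toy training tuples (the real project trains on F-train.txt and T-train.txt):

```python
from textblob.classifiers import NaiveBayesClassifier, DecisionTreeClassifier

# Toy (text, label) tuples; the project builds these from F-train.txt / T-train.txt
train = [("Great product, exactly as described", "T"),
         ("Arrived on time and works well", "T"),
         ("Best thing ever buy now amazing deal!!!", "F"),
         ("Unbelievable!!! Life changing, five stars!!!", "F")]
test = [("Works as expected, happy with it", "T"),
        ("AMAZING AMAZING must buy today!!!", "F")]

nb = NaiveBayesClassifier(train)
dt = DecisionTreeClassifier(train)
print(nb.classify("Incredible!!! Best purchase of my life!!!"))  # -> 'F' or 'T'
print("NB accuracy:", nb.accuracy(test))
print("DT accuracy:", dt.accuracy(test))
```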
3.2 Sklearn-based Classifiers:
Sklearn-based classifiers were also used for classification, and the algorithms were compared to obtain better, more accurate results.
a. Multinomial Naïve Bayes: the Naive Bayes classifier is used in natural language processing problems to predict the tag of a text: it calculates the probability of each tag for the text and outputs the one with the highest probability.
b. LinearSVC: this classifier classifies data by providing the best-fit hyperplane that divides the data into categories.
c. SVC: studies have shown that using the default kernel in SVC(), the Radial Basis Function (RBF) kernel, yields a more nonlinear decision boundary, which on this dataset can vastly outperform a linear decision boundary.
d. Random Forest: this algorithm, provided by the sklearn library, classifies by creating multiple decision trees on random subsets of the training data.
For these classifiers the Reviews.txt dataset is used; Figure 13 shows the dataset.
Figure 13: Reviews.txt file
After applying all these classifiers, their accuracies are compared and their performance on fake review classification is evaluated. Some further enhancements to the models, discussed in Chapter 4, provided even better accuracy results for classifying these fake reviews.
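A sketch of how the four sklearn classifiers could be trained and compared on a common TF-IDF representation (texts and labels are assumed to come from the Reviews.txt loading step; parameters are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, SVC

# texts = df["text"]; labels = df["label"]   (from the loading sketch)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)  # 16800 / 4200 split

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVC": LinearSVC(),
    "SVC (RBF kernel)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}
for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), clf).fit(X_train, y_train)
    print(name, "accuracy:", pipeline.score(X_test, y_test))
```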
Chapter 4- Result and Discussion
Data visualization:
The following visualizations show the kind of data that was used; each depicts how many reviews there are per product category for each label (fake or genuine) in Reviews.txt. For example, for the category Instruments there are 350 reviews with the label fake, as seen in the code snippet in Figure 21.
Figure 21: Label vs Product Category code snippet
Observing the number of occurrences of reviews by rating vs. the label they have: for example, the number of reviews with a fake label rated 5 out of 5 is greater than the number of fake-labelled reviews rated 3. Figure 22 shows the Label vs Rating code snippet, and the Label vs Rating comparison is shown in Figure 23.
Figure 22: Label vs Rating code snippet
Figure 23: Label vs Rating
Observing the number of occurrences of reviews with emojis vs. the label they have: for example, fewer fake-labelled reviews contain emojis than genuine-labelled ones. Figure 24 shows the Label vs Emoji Count code snippet, and the Label vs Emoji Count comparison is shown in Figure 25.
Figure 24: Label vs Emoji Count code snippet
Figure 25: Label vs Emoji Count
Observing the stop-word counts of reviews vs. the label they have: for example, fake-labelled reviews contain fewer stop words than genuine-labelled ones. Figure 26 shows the Label vs Stopwords Count code snippet, and the Label vs Stopwords Count comparison is shown in Figure 27.
Figure 26: Label vs Stopwords Count code snippet
Figure 27: Label vs Stopwords Count
Observing verified vs. non-verified purchases against the label: for example, far fewer fake-labelled reviews are verified purchases than genuine-labelled ones. Figure 28 shows the Label vs Verified Purchase code snippet, and the Label vs Verified Purchase comparison is shown in Figure 29.
Figure 28: Label vs Verified Purchase code snippet
Figure 29: Label vs Verified Purchase
These code snippets can be found in the DataVisualization.ipynb file for further reference. The output.txt file shown in Figure 30 is the result generated by the TextBlob Naïve Bayes classifier.
Figure 30: Output.txt (classified testing output dataset)
The accuracy scores obtained for this dataset are as follows:
Accuracy: 80.542
F1 score: 77.888
Precision: 80.612
Recall: 79.001
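These scores can be reproduced with scikit-learn's metric functions (a sketch; treating the fake label as the positive class is an assumption):

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def report(y_true, y_pred, pos_label="fake"):
    """Print the four scores as percentages for the test predictions."""
    print("Accuracy :", 100 * accuracy_score(y_true, y_pred))
    print("F1 score :", 100 * f1_score(y_true, y_pred, pos_label=pos_label))
    print("Precision:", 100 * precision_score(y_true, y_pred, pos_label=pos_label))
    print("Recall   :", 100 * recall_score(y_true, y_pred, pos_label=pos_label))
```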
The following results were observed for each of the previously described experimental setups. Table 1 shows how the accuracy improved after each enhancement to the model:
Classifier              | Raw data w/ Tokenization | Preprocessing & Lemmatization | Feature inclusion | Testing data | Sentiment classifier
Multinomial Naïve Bayes | 72%                      | 77%                           | 81%               | 80%          | 84%
Linear SVC              | 67%                      | 70%                           | 74%               | 73%          | 83%
SVM                     | 69%                      | 75%                           | 77%               | 81%          | 81%
Random Forest           | 68%                      | 70%                           | 72%               | 71%          | 79%
Table 1: Results
Another plot of the results is shown in Figure 31, which depicts a bar chart for each classifier, in a different color, over the full dataset of 21,000 reviews.
Raw data is loaded from the Reviews.txt file, and after just parsing and tokenizing it, the accuracy of each model at predicting whether reviews are fake or genuine is calculated. The best results were obtained using the Naïve Bayes classifier, as evident in the figure.
After preprocessing and lemmatization of the review text, the accuracy of each model is calculated again; the best results were once more obtained using the Naïve Bayes classifier.
Additional feature inclusion brings in features such as verified purchase, rating and product category of the review. Previously, only the (ID, Text, Label) tuple of each review was used. After utilizing these other features, the accuracy of the models increased, as can be seen in the improved results for each of the classifiers.
Testing dataset covers the classification accuracy for the reviews in the testing dataset. Here, as observed, the non-linear SVM classifier performed best and reached 81% accuracy. This shows it could generalize and predict fake reviews more accurately than the Naïve Bayes classifier, which outperformed the others in pretty much every other scenario.
The sentiment classifier predicts whether reviews are positive or negative according to the emojis used, the ratio of positive to negative words, and the rating given to the review. This sentiment classification is in turn used in predicting whether reviews are fake or genuine. The accuracy results show how each model performed on sentiment prediction for the reviews in the dataset.
Enhancement 1 predicts the sentiment of the reviews using lists of positive and negative words found in the review.
Enhancement 2 compares the number of verbs and nouns in each review and is included in the preprocessing and lemmatization step.
Enhancement 3 discounts reviews already predicted as deceptive; it increased the accuracy, but by an amount too small to be included in the results.
Enhancement 4 uses emojis, which contributed the most to the overall performance and gave the most accurate measure. It improved the sentiment analysis of the reviews and in turn helped the models predict whether a review is fake or genuine.
Chapter 5
5.1- Future Scope
1. To use real-time/time-based datasets, which would allow us to compare the timestamps of a user's reviews and find whether a certain user is posting too many reviews in a short period of time.
2. To use and compare other machine learning algorithms, such as logistic regression, and to extend the research to deep learning techniques.
3. To develop a similar process using unsupervised learning on unlabeled data to detect fake reviews.
5.2- Conclusion
The fake review detector is designed for filtering out fake reviews. In this research work, SVM classification provided better accuracy than the Naïve Bayes classifier on the testing dataset, while the Naïve Bayes classifier performed better than the other algorithms on the training data; this reveals that SVM can generalize better and predict fake reviews efficiently. This method can be applied to other sampled instances of the dataset. The data visualization helped in exploring the dataset, and the features identified contributed to the accuracy of the classification. The various algorithms used, and their accuracies, show how each of them performed.
The approach also provides the user with functionality to recommend the most truthful reviews, enabling purchasers to make better decisions about a product. Adding new feature vectors such as ratings, emojis and verified purchase improved the accuracy of classifying the data.
References
1. E. I. Elmurngi and A. Gherbi, “Unfair Reviews Detection on Amazon
Reviews using Sentiment Analysis with Supervised Learning Techniques,”
Journal of Computer Science, vol. 14, no. 5, pp. 714–726, June 2018.
2. J. Leskovec, “WebData Amazon reviews,” [Online]. Available:
http://snap.stanford.edu/data/web-Amazon-links.html [Accessed:
October 2018].
3. J. Li, M. Ott, C. Cardie and E. Hovy, “Towards a General Rule for
Identifying Deceptive Opinion Spam,” in Proceedings of the 52nd Annual
Meeting of the Association for Computational Linguistics, Baltimore, MD,
USA, vol. 1, no. 11, pp. 1566-1576, November 2014.
4. N. O’Brien, “Machine Learning for Detection of Fake News,” [Online].
Available:
https://dspace.mit.edu/bitstream/handle/1721.1/119727/1078649610-
MIT.pdf [Accessed: November 2018].
5. J. C. S. Reis, A. Correia, F. Murai, A. Veloso, and F. Benevenuto,
“Supervised Learning for Fake News Detection,” IEEE Intelligent Systems,
vol. 34, no. 2, pp. 76-81, May 2019.
6. B. Wagh, J. V. Shinde and P. A. Kale, “A Twitter Sentiment Analysis Using
NLTK and Machine Learning Techniques,” International Journal of
Emerging Research in Management and Technology, vol. 6, no. 12, pp.
37-44, December 2017.
7. A. McCallum and K. Nigam, “A Comparison of Event Models for Naive
Bayes Text Classification,” in Proceedings of AAAI-98 Workshop on
Learning for Text Categorization, Pittsburgh, PA, USA, vol. 752, no. 1, pp.
41-48, July 1998.
8. B. Liu and M. Hu, “Opinion Mining, Sentiment Analysis and Opinion Spam
Detection,” [Online]. Available:
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
[Accessed: January 2019].
9. C. Hill, “10 Secrets to Uncovering which Online Reviews are Fake,”
[Online]. Available: https://www.marketwatch.com/story/10-secrets-to-
uncovering-which-online-reviews- are-fake-2018-09-21 [Accessed: March
2019].
10. J. Novak, “List archive Emojis,” [Online].
11. P. K. Novak, J. Smailović, B. Sluban and I. Mozetič, “Sentiment of Emojis,” Journal of Computation and Language, vol. 10, no. 12, pp. 1-4, December 2015.
12. P. K. Novak, “Emoji Sentiment Ranking,” [Online]. Available: http://kt.ijs.si/data/Emoji_sentiment_ranking/ [Accessed: July 2019].