Sentiment Analysis using Feature Selection
and Machine Learning Algorithms
Presented By
Shruti Pant
Under Guidance Of:
Ms. Kalpana Jain : Major Advisor
Dr. Naveen Choudhary : Co-Advisor
Dr. Naveen Jain : Advisor
Dr. Chitranjan Agarwal : DRI Nominee
Contents:
1) Introduction of Sentiment Analysis
2) Literature Survey
3) Research Gap and Motivation
4) Objective of thesis
5) Design Issue
6) Design Flow
7) Experimental Results
8) List of Publication
9) Scope of Future Enhancement
10) References
11) Changes
Introduction to
Sentiment Analysis
What is Machine Learning?
Machine Learning
Study of algorithms that improve their performance at
some task with experience
Optimize a performance criterion using example
data or past experience.
Role of Statistics: Inference from a sample
Role of Computer science: Efficient algorithms to
Solve the optimization problem
Representing and evaluating the model for
inference
Types
Supervised Learning
Classification
Regression
Unsupervised Learning
Reinforcement Learning
What people think?
What others think has always been an important piece of information
Which car should I buy?
Which schools should I
apply to?
Which Professor to work for?
Whom should I vote for?
So whom shall I ask?
Pre Web
Friends and relatives
Acquaintances
Consumer Reports
Post Web
I don't know who... but apparently it's a good phone. It has good battery life and
Blogs (google blogs, livejournal)
E-commerce sites (amazon, ebay)
Review sites (CNET, PC Magazine)
Discussion forums (forums.craigslist.org,
forums.macrumors.com)
Friends and Relatives (occasionally)
Basics Of Sentiments
Holder (source) of attitude
Target (aspect) of attitude
Type of attitude
From a set of types
Like, love, hate, value, desire, etc.
Or (more commonly) simple weighted
Polarity: positive or negative
Text containing the attitude
Sentence or entire document
Sentiment
A thought, view, or attitude, especially one based
mainly on emotion instead of reason
Sentiment Analysis
aka opinion mining
use of natural language processing (NLP)
and computational techniques to automate
the extraction or classification of
sentiment from typically unstructured text
Identify the orientation of opinion in a piece of text
The movie was fabulous!
The movie was horrible!
Approaches:
Classifier Based
Lexicon Based
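As a rough illustration of the lexicon-based approach, the sketch below scores a sentence by counting words from small positive and negative word lists. The word lists and the function name lexicon_polarity are hypothetical and chosen only for this example; a real system would rely on a full sentiment lexicon.

```python
# Tiny, hypothetical sentiment lexicon (illustrative only).
POSITIVE = {"fabulous", "good", "great", "beautiful"}
NEGATIVE = {"horrible", "bad", "awful", "disappointing"}


def lexicon_polarity(text):
    """Return 'positive', 'negative', or 'neutral' from word-list counts."""
    # Lowercase and strip simple punctuation before matching.
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("The movie was fabulous!"))
print(lexicon_polarity("The movie was horrible!"))
```

A classifier-based approach replaces the fixed word lists with weights learned from labelled reviews.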
A. Sentence Level Classification
Assumption: a sentence contains only one
opinion
Task 1: identify if sentence is opinionated
classes: objective and subjective
Task 2: determine polarity of sentence
classes: positive and negative
Quiz:
This is a beautiful bracelet..
Is this sentence subjective/objective?
Is it positive or negative ?
B. Document(post/review) Level
Classification
Assumption:
each document focuses on a single object
contains opinion from a single opinion holder
Task: determine overall sentiment orientation
in document
classes: positive and negative
C. Feature Level Classification
Goal: produce a feature-based opinion summary of
multiple reviews
Task 1: Identify and extract object features that
have been commented on by an opinion holder (e.g.
picture, battery life).
Task 2: Determine polarity of opinions on features
classes: positive and negative
Task 3: Group feature synonyms
Need of Sentiment Analysis
Consumer information
Product reviews
Marketing
Consumer attitudes
Trends
Politics
Politicians want to know voters' views
Voters want to know politicians' stances and who
else supports them
Social
Find like-minded individuals or communities
Literature Review
[1] presented an unsupervised document-level method that uses
pointwise mutual information (PMI) to classify reviews as
recommended or not recommended. This algorithm achieved
74% accuracy.
[2] extended the work using PMI and Latent Semantic Analysis
(LSA) and achieved an accuracy of 82.2%.
[3] analysed several supervised machine learning algorithms
(SVM, NB and MaxEnt) and different feature selection
techniques on a movie review dataset. They applied various
pre-processing techniques such as stemming or lemmatization.
They used NB and MaxEnt with POS tagging to increase the
performance more than SVM.
[4] proposed an approach using the IMDB movie database in which
the document is labelled into objective and subjective parts by
finding a minimum s-t cut in a graph, achieving an accuracy of 85%.
[5] used machine learning techniques for multilingual (English,
Dutch and French) studies. Feature selection along with unigrams,
negation and stemming is performed to obtain relevant features, and
then multinomial naive Bayes, SVM and maximum entropy are
compared on overall performance.
[6] performed sentiment analysis using various feature selection
schemes such as tf-idf and term occurrence, and classified the
dataset using SVM and Naive Bayes to compare their
performance.
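The PMI-based scoring mentioned for [1] can be sketched as follows. PMI compares how often two terms co-occur with how often they would co-occur if they were independent, and a phrase's semantic orientation is its PMI with a positive reference word minus its PMI with a negative one ([1] used "excellent" and "poor"). The counts below are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from co-occurrence counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def semantic_orientation(pmi_with_positive, pmi_with_negative):
    """Positive result suggests a favorable phrase, negative an unfavorable one."""
    return pmi_with_positive - pmi_with_negative


# Made-up counts: a phrase seen 100 times in 10,000 text windows.
pos = pmi(count_xy=40, count_x=100, count_y=1000, total=10000)  # with "excellent"
neg = pmi(count_xy=5, count_x=100, count_y=1000, total=10000)   # with "poor"
so = semantic_orientation(pos, neg)
print(round(so, 3))
```

A review would then be labelled recommended when the average orientation of its phrases is positive.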
Research Gap and
motivation
Sentiment analysis is an active research topic.
Use of electronic media is increasing day by
day.
Time is as valuable as money, or more so.
Instead of spending time reading text and
working out whether it is positive or
negative, we can use automated
techniques for sentiment analysis.
Sentiment analysis is used in opinion mining.
Example: analyzing a product based on its
reviews and comments.
The key is to find the best
classifier for the
dataset used and the space and
time available
Earlier work has used
lexicon analysis to find the
sentiment of a word
Earlier work has used
different feature selection
techniques and
classifiers to build an automatic
model
Objective
A heuristic based on the Naive Bayes classifier
is designed to automatically predict the class
of an incoming review, with the aim of maximizing
the performance of the model.
Design Issue
Supervised / Unsupervised / Hybrid
Classification Algorithm
Feature Selection Techniques
Lexicon or Machine Based
Design Flow
Dataset (Phase 1)
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002.
Thumbs up? Sentiment Classification using Machine Learning
Techniques. EMNLP-2002, 79-86.
Bo Pang and Lillian Lee. 2004. A Sentimental Education:
Sentiment Analysis Using Subjectivity Summarization Based
on Minimum Cuts. ACL, 271-278
Polarity detection:
Is an IMDB movie review positive or
negative?
Data: Polarity Data 2.0:
http://www.cs.cornell.edu/people/pabo/movie-review-data
IMDB data in the Pang and Lee
database
Review excerpt 1:
when _star wars_ came out some twenty years ago , the image of
traveling throughout the stars has become a commonplace image .
[...] when han solo goes light speed , the stars change to bright
lines , going towards the viewer in lines that converge at an
invisible point cool . _october sky_ offers a much simpler
image -- that of a single white dot , traveling horizontally across
the night sky . [. . . ]

Review excerpt 2:
snake eyes is the most aggravating kind of movie : the kind that
shows so much potential then becomes unbelievably disappointing .
it's not just because this is a brian depalma film , and since
he's a great director and one who's films are always greeted with
at least some fanfare . and it's not even because this was a film
starring nicolas cage and since he gives a brauvara performance ,
this film is hardly worth his talents
Pre-Processing (Phase 2)
Pre-processing is done in our proposed methodology to remove
words that impede sentiment analysis by
increasing the number of false positives or false negatives.
In our model, stop words are removed using tf-idf. Term
Frequency-Inverse Document Frequency (tf-idf) is used to
distinguish the important words in a document from the less
important ones. NLTK also comes with a built-in list of 128 stop
words, which is included in our model to identify irrelevant words.
We do this by importing stopwords from the NLTK
corpus.
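A minimal sketch of the stop-word step, assuming tokenized reviews: terms that appear in nearly every document get an inverse document frequency close to zero and are dropped together with a fixed stop list. The three-word stop list, the toy documents, and the idf_threshold parameter are placeholders for this example; in the setup described above the list would come from nltk.corpus.stopwords.words("english").

```python
import math


def inverse_document_frequency(docs):
    """Map each term to log(N / df), where df is the number of docs containing it."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def remove_stop_words(docs, stop_list, idf_threshold=0.1):
    """Drop terms in stop_list or with IDF below idf_threshold (assumed cutoff)."""
    idf = inverse_document_frequency(docs)
    return [
        [t for t in doc if t not in stop_list and idf[t] >= idf_threshold]
        for doc in docs
    ]


# Placeholder stop list and hypothetical toy documents.
STOP_LIST = {"the", "a", "is"}
docs = [
    ["the", "movie", "was", "fabulous"],
    ["the", "movie", "was", "horrible"],
    ["a", "movie", "is", "fabulous"],
]
filtered = remove_stop_words(docs, STOP_LIST)
print(filtered)
```

Here "movie" occurs in every document, so its IDF is zero and it is removed even though it is not in the stop list.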
Stemming algorithms attempt to automatically remove
suffixes (and in some cases prefixes) in order to find the root
word or stem of a given word. NLTK provides several
stemmer interfaces. In our proposed method we use
the Porter stemmer to find the root words.
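To show what suffix stripping does, here is a deliberately simplified sketch. It is not the Porter algorithm, which applies several ordered rule phases; the suffix list and the min_stem_len parameter are assumptions made for this example. With NLTK installed, the equivalent call would be nltk.stem.PorterStemmer().stem(word).

```python
# Simplified suffix stripping for illustration; NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stems = [crude_stem(w) for w in ["traveling", "greeted", "stars", "cool"]]
print(stems)
```

Mapping inflected forms to one stem reduces the vocabulary that later feature selection has to score.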
Feature Selection (Phase 3)
Feature selection is used to increase the
effectiveness of the model. Features which are
important are selected and fed to the classifier.
In our proposed methodology we used chi square
as a scoring function with which we can find if
two terms are associated with each other
(collocation: two words that are correlated, or
more likely to occur together).
It helps us in understanding if a word is
informative or not. If a word mainly occurs in
positive reviews and rarely in negative reviews it
can mean that the word is important. So we find
how common a word is in a particular class
compared to other classes.
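The chi-square score for a word and a class can be computed from a 2x2 table of document counts, as sketched below. The counts are hypothetical (1000 positive and 1000 negative reviews are assumed), and the function and argument names are chosen for this example rather than taken from the thesis code.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score for a term/class 2x2 contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


# Hypothetical counts over 1000 positive and 1000 negative reviews.
# "fabulous": in 80 positive and 10 negative reviews.
informative = chi_square(n11=80, n10=10, n01=920, n00=990)
# "movie": in 500 positive and 500 negative reviews.
uninformative = chi_square(n11=500, n10=500, n01=500, n00=500)
print(round(informative, 2), uninformative)
```

Words are then ranked by score and only the top-scoring ones are passed to the classifier; a word spread evenly across both classes scores zero.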
Classification (Phase 4)
In machine learning, naive Bayes
classifiers are a family of simple,
baseline probabilistic classifiers based
on Bayes' theorem with strong (naive)
independence assumptions.
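The independence assumption means a review's score for a class is the class prior multiplied by one probability per word. The sketch below is a minimal multinomial naive Bayes with add-one smoothing; it is an illustration under assumed toy data, not the heuristic developed in the thesis, and all names in it are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """Estimate log priors and add-one-smoothed log likelihoods per class."""
    class_counts = Counter(label for _, label in labeled_docs)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = len(labeled_docs)
    model = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        model[label] = {
            "log_prior": math.log(class_counts[label] / total_docs),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
                for w in vocab
            },
        }
    return model


def predict(model, tokens):
    """Pick the class with the highest log score; unseen words are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][w] for w in tokens if w in params["log_likelihood"]
        )
    return max(model, key=score)


# Toy training data (hypothetical).
train = [
    (["fabulous", "acting", "great", "story"], "pos"),
    (["great", "fun"], "pos"),
    (["horrible", "plot", "boring", "story"], "neg"),
    (["boring", "acting"], "neg"),
]
model = train_naive_bayes(train)
print(predict(model, ["great", "story"]))
print(predict(model, ["boring", "plot"]))
```

Working in log space avoids numerical underflow when a review contains many words.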
Experimental Results
Accuracy
[Bar chart; y-axis: Accuracy, 0 to 100. Bar values: 93, 84.75, 84, 81.6, 75.25, 76.5]
Figure 4.2: Accuracy comparison of chi square and
information gain applied on our proposed methodology
with G. Tripathi et al
Precision
[Bar chart; y-axis: Precision, 0 to 100. Bar values: 93, 90.4, 90.15, 86.17, 83.6, 82.63]
Figure 4.3: Precision comparison of chi square and
information gain applied on our proposed methodology
with G. Tripathi et al
Recall
[Bar chart; y-axis: Recall, 0 to 100. Bar values: 94, 88, 81.6, 81, 59.5, 56.5]
Figure 4.3: Recall comparison of chi square and
information gain applied on our proposed methodology
with G. Tripathi et al
F-MEASURE
[Bar chart; y-axis: F-measure, 0 to 100. Bar values: 93.49, 83.96, 83.5, 83.23, 69.53, 71.68]
Figure 4.4: F-measure comparison of chi square and
information gain applied on our proposed methodology
with G. Tripathi et al
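The four measures reported above are related through the confusion counts, as the following sketch shows. The counts in the example are invented. Pairing the precision of 93 with the recall of 94 is an assumption (the charts do not say they belong to the same configuration), but under that assumption the F-measure comes out close to the 93.49 shown.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F-measure from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f_measure(precision, recall)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts for a 200-review test set.
print(evaluation_metrics(tp=90, fp=10, fn=6, tn=94))

# Assumed pairing of chart values (percentages).
print(round(f_measure(93.0, 94.0), 2))
```

Because the F-measure is a harmonic mean, a low recall such as 56.5 pulls it down sharply even when precision is high.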
List of Publications
Pant, S., & Jain, K. (2017). Sentiment Analysis
using Feature Selection and Classification
Algorithms: A survey. IJIERT, 4(3), 109-113.
Pant, S., & Jain, K. (2017). Sentiment Analysis
using Feature Selection and Classification
Algorithms. IJIERT, 4(5), 5-11.
Scope of Future
Enhancement
We would like to extend this technique to other
domains of opinion mining such as newspaper
articles, product reviews and political discussion
forums. We would like to apply in-depth
concepts of NLP for improved prediction of the
polarity of the document.
We are planning to build an automatic sentiment
classifier for more than one language, starting
with Hindi. As multilingual messages are now
commonly posted on social websites, the aim is
to predict the sentiment of text in any
language.
It is worth extending the research using hybrid
techniques for sentiment analysis.
References
1. Turney, P. D. 2002, July. Thumbs up or thumbs down?
Semantic orientation applied to unsupervised
classification of reviews. In Proceedings of the 40th
Annual Meeting on Association for Computational
Linguistics, pp. 417-424.
2. Turney, P. D., & Littman, M. L. 2003. Measuring praise
and criticism: Inference of semantic orientation from
association. ACM Transactions on Information Systems
(TOIS), 21: 315-346.
3. Pang, B., Lee, L., & Vaithyanathan, S. 2002, July.
Thumbs up? Sentiment classification using machine
learning techniques. In Proceedings of the ACL-02
Conference on Empirical Methods in Natural Language
Processing, Volume 10, pp. 79-86.
4. Pang, B., & Lee, L. 2004, July. A sentimental
education: Sentiment analysis using subjectivity
summarization based on minimum cuts. In Proceedings
of the 42nd Annual Meeting on Association for
Computational Linguistics, pp. 271-278.
5. Boiy, E., & Moens, M. F. 2009. A machine learning
approach to sentiment analysis in multilingual Web
texts. Information Retrieval, 12: 526-558.
6. Tripathi, G., & Naganna, S. 2015. Feature selection and
classification approach for sentiment analysis. Machine
Learning and Applications: An International
Journal, 2: 1-16.
Changes
1
4
5
8
9
10
11
12
13
Thank You