Techniques For Sentiment Analysis of Twitter Data: A Comprehensive Survey
All content following this page was uploaded by Mitali Desai on 23 March 2018.
Abstract: The World Wide Web has given rise to a novel way for people to express their views and opinions about different topics, trends and issues. The user-generated content present on different media such as internet forums, discussion groups and blogs serves as a concrete and substantial base for decision making in various fields such as advertising, political polls, scientific surveys, market prediction and business intelligence. Sentiment analysis relates to the problem of mining sentiments from data available online and categorizing the opinion expressed by an author towards a particular entity into at most three preset categories: positive, negative and neutral. In this paper, we first present the sentiment analysis process used to classify highly unstructured data on Twitter. Secondly, we discuss in detail various techniques to carry out sentiment analysis on Twitter data. Moreover, we present a parametric comparison of the discussed techniques based on our identified parameters.

Keywords: Sentiment analysis; machine learning; opinion mining; Twitter

I. INTRODUCTION

Social Computing is an innovative and growing computing paradigm for the analysis and modeling of social activities taking place on various platforms. It is used to produce intelligent and interactive applications that derive efficient results [1]. The wide availability of social media sites allows individuals to share their sentiments or opinions about a particular event, product or issue. Mining such informal and homogeneous data is highly useful for drawing conclusions in various fields. However, the highly unstructured format of the opinion data available on the web makes the mining process challenging [2].

Textual information present on the web is broadly classified into one of two categories: fact data and sentiment data [3]. Fact data are objective expressions concerning different entities, issues or events, whereas sentiment data are subjective expressions that describe an individual’s opinions or beliefs about a particular entity, product or event. Sentiment analysis is the process of recognizing and classifying the different sentiments conveyed online by individuals to determine whether the writer's attitude towards a specific product, topic or event is positive, negative or neutral. Sentiment analysis has three major components of study: the sentiment holder, i.e. the subject; the sentiment itself, i.e. the belief; and the object, i.e. the topic about which the subject has shared the sentiment. An object is an entity that represents a definite person, item, product, issue, event, topic or organization [3-7]. Sentiment analysis is carried out at different levels ranging from coarse to fine. Coarse-level sentiment analysis determines the sentiment of a whole manuscript or document, whereas fine-level sentiment analysis focuses on the attributes.

Sentiment analysis of Twitter data is carried out at the sentence level, which lies between the coarse level and the fine level. In the sentiment analysis process, the sentiments present in the text are of two types: direct and comparative. Direct sentiments are independent of other objects in the same sentence [7], for example “the picture quality of this camera is great.” Comparative sentiments, in contrast, denote a comparison of different objects within the same sentence, for example “car x is cheaper than car y.”

Existing sentiment analysis techniques are useful in various applications such as disaster relief and humanitarian assistance, marketing and trade predictions, checking political polls, the advertising market, scientific surveys, checking customer loyalty, finding job opportunities, population health care and understanding students’ learning experiences [1-7].

In this paper, we present a sentiment analysis process for Twitter data. Twitter is a micro-blogging site that is growing rapidly in terms of number of users [8-9]. Moreover, Tweets are mostly public and limited to 140 characters, which simplifies the identification of emotions in text [9-12]. However, the abundance of data, the use of short forms, the timing of different posts and the diversity of language make the sentiment analysis process difficult for Twitter data.

The rest of the paper is organized as follows. In Section II, we discuss the existing work in the field of sentiment analysis. Section III describes the methodology to carry out sentiment analysis. Section IV presents several supervised machine learning algorithms used to conduct sentiment analysis and their comparison based on the identified parameters. Finally, Section V presents the conclusion and future directions.

II. RELATED WORK

In recent years, a voluminous amount of research has been conducted in the sentiment analysis domain. In [7],
authors have proposed a technique to classify students’ data generated on Twitter into various categories in order to identify the various problems students encounter. In [13], authors have presented a logical approach to mine the sentiments shared on different social media platforms. They have analysed the sentiments of the text using combinatory categorial grammar, annotation, lexicon acquisition and semantic networks. The basic techniques of sentiment classification and the methods for data collection are presented in [14]. The accuracy of the classification process with a selected feature vector is verified for the electronic products domain using various classifiers such as Naïve Bayes, Maximum Entropy, Support Vector Machine and Ensemble classifiers in [15]. In [16], authors have introduced a hybrid method that combines sentiment lexicons with a machine learning classifier for polarity detection of subjective texts in the consumer-products domain. In [17], authors have proposed a set of machine learning methods with semantic analysis to classify sentences and reviews of different products based on Twitter data, using WordNet for better accuracy. In [18], authors have examined the performance of different classifiers such as Naïve Bayes, SMO, SVM and Random Forest to classify Twitter data. In [19], authors have presented a technique to normalize noisy or irrelevant tweets and classify them according to their polarity, i.e. positive or negative. Moreover, they have employed a mixture model approach to generate different sentiment words. The generated words were later used as feature indicators in the classification model. In [20], authors have introduced a novel method to predict sentiments about stocks using various financial message boards and performed an automatic prediction for the stock market using web sentiments. In [21], authors have examined the performance of sentiment analysis in the e-learning domain using various methods of feature selection, i.e. CHI statistics, Mutual Information (MI) and Information Gain (IG). In [22], authors have proposed an automatic sentiment classifier to classify reviews of Brazilian TV shows into positive or negative categories, which achieved 90% accuracy. In [23], authors have demonstrated a system to extract Tweets and classify them using a domain oriented, seed based enrichment technique to reduce the information loss in the knowledge domain. In [24], authors have investigated numerous combinations of different preprocessing levels, machine learning techniques and features, including a neutral class, to analyze real-time students’ feedback. In [25], authors have developed an enhanced sentiment classification method that can detect and remove anomalies from Twitter data in addition to the classification.

III. METHODOLOGY FOR SENTIMENT ANALYSIS

The sentiment analysis of Twitter data is an emerging field that needs much more attention. Fig. 1 shows the steps to carry out the process of sentiment analysis on Twitter data.

Firstly, the collected Twitter data is pre-processed to clean it. Secondly, the important features are extracted from the clean text by applying any of the feature selection methods. Thirdly, a portion of the data is manually labeled as positive or negative Tweets to prepare a training set. Finally, the extracted features and the labeled training set are provided as input to the built classifier to classify the remaining data, i.e. the test set. Each of the processing steps is discussed in the following sub-sections.

Fig. 1. Sentiment analysis process of Twitter data

A. Data Sources

The selection of the data source plays a significant role in sentiment analysis. Social media platforms used as data sources are broadly grouped into three general categories: blogs, micro-blogging sites and review sites [13-16]. Among these categories, micro-blogging sites such as Twitter have gained higher popularity due to the limited length of the content and the public availability of the data. The following statistics on the Twitter growth rate make Twitter an evident choice as the data source for sentiment analysis.

• Twitter Growth Rate Statistics

  Approximately 6,000 tweets are posted on Twitter every second. This corresponds to roughly 350,000 tweets per minute and 500 million tweets per day, or around 200 billion tweets per year. In Twitter's history, the number of Tweets increased from 5,000 tweets per day in 2007 [8] to 500,000,000 tweets per day in 2013, an increase of approximately five orders of magnitude [8]. At the intermediate stages, the figures were 300,000 tweets per day in 2008 [9], 2.5 million tweets per day in 2009 [9], 35 million tweets per day in 2010 [8], 200 million tweets per day in 2011 [10] and 340 million tweets per day on March 21, 2012, six years after the emergence of Twitter [12]. These statistics motivate the use of Twitter for our research.

• Twitter Studies

  According to recent work, the studies carried out on Twitter data are in the fields of health care, marketing, politics, the advertising market, athletics, etc. Analysis techniques used in these studies include qualitative content analysis, network or graph analysis, linguistic or psycholinguistic analysis, word clouds and histograms [5]. In addition, Twitter has been regarded as the most promising source for studies such as community or influence detection, topic discovery, market and business predictions, recommendation systems and tweet classification.

• Tweets

  A message posted on Twitter is called a Tweet and is limited to 140 characters. Tweets are generally composed of the following components [10] [13] [14]: text, links,
                     International Conference on Computing, Communication and Automation (ICCCA2016)
  emoticons, and images. Six-second videos were also added as a Tweet component in 2012 [8-12]. Based on these components, mining is applied to classify text, links, images, emoji or emoticons and even videos. Tweets contain three notations: hashtags (#), retweets (RT) and account Ids (@).

B. Twitter Data Collection Methods

The three possible ways to collect Tweets for research are as follows [11]:

• Data repositories such as UCI, Friendster, Kdnuggets and SNAP.

• APIs: Twitter provides two types of APIs, the search API and the stream API. The search API is used to collect Twitter data on the basis of hashtags, and the stream API is used to stream real-time data from Twitter.

• Automated tools, which are further classified into premium tools such as Radian6 [18], Sysmos, Simplify360 and Lithium, and non-premium tools such as Keyhole, Topsy, Tagboard and SocialMention.

C. Data Preprocessing

Mining Twitter data is a challenging task because the collected data is raw. Before a classifier can be applied, it is essential to pre-process, or clean, the raw data. The pre-processing task involves uniform casing; removal of hashtags and other Twitter notations (@, RT), emoticons, URLs and stop words; decompression of slang words; and compression of elongated words. The following steps show the pre-processing procedure.

• Remove the Twitter notations such as hashtags (#), retweets (RT) and account Ids (@).

• Remove the URLs, hyperlinks and emoticons. It is necessary to remove non-letter data and symbols as we are dealing with text data only.

• Remove the stop words such as are, is, am, etc. Stop words do not convey any emotion, so they are removed to compress the dataset.

• Compress elongated words, such as happyyy into happy.

• Decompress slang words such as g8 and f9. Slang words are generally adjectives or nouns and carry an extreme level of sentiment, so it is necessary to decompress them.

D. Feature Extraction

The pre-processed dataset has various discrete properties. In feature extraction, we extract different aspects such as adjectives, verbs and nouns; these aspects are later identified as positive or negative to detect the polarity of the whole sentence. The following are the widely used feature extraction methods.

• Term Frequency and Term Presence: These features denote individual, distinct words and their occurrence counts.

• Negative Phrases: The presence of negative words can change the meaning or orientation of the opinion, so negation must be taken into account.

• Parts Of Speech (POS): Finding nouns, verbs, adjectives, etc., as they are significant indicators of opinions.

E. Sentiment Classification Techniques

There are typically two techniques to identify the sentiment of a text [7] [13] [26-32]: the knowledge based technique and machine learning techniques.

The knowledge based technique is also called the lexicon based technique. It focuses on deriving the opinion based lexicons from the text and then identifying the polarity of those lexicons. Lexicons are collections of known and precompiled sentiment terms. This approach is further classified into the dictionary-based approach and the corpus-based approach. In the dictionary-based approach, we find the opinion oriented words and then examine the dictionary to collect their synonyms and antonyms. In the corpus-based approach, we create a list of opinion words and then, based on their context specific orientations, find additional related opinion words in a vast corpus. To apply the lexicon approach, a small seed set of words describing opinions is collected manually with their known orientations as a pre-processing step. The set is then grown gradually by searching a widely used lexical resource such as WordNet or SentiFul for their synonyms and antonyms [17-18].

The main objective of machine learning techniques, in contrast, is to develop an algorithm that optimizes the performance of the system using training data such as examples and/or past knowledge and experience. Machine learning solves the sentiment classification problem in two sequential steps:

1) Develop and train the model using training set data, i.e. already labeled data.

2) Classify the unlabeled or unclassified data based on the trained model.

Machine learning techniques are further classified into supervised and unsupervised techniques [13] [15] [26-30]. Sentiment analysis is typically carried out with supervised machine learning techniques, as we are dealing with subjective data. Supervised machine learning techniques depend heavily on training data that are already labeled, unlike unsupervised machine learning techniques. Based on the provided training data, the classifier classifies the rest of the data, i.e. the test data. A large number of supervised machine learning algorithms such as Logistic Regression, Naïve Bayes, Decision Tree, Support Vector Machine (SVM), Random Forest, Maximum Entropy and Bayesian Network are used
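As a minimal sketch of the dictionary-style lexicon technique described in this sub-section, the following Python fragment scores a sentence against a small seed lexicon and flips the polarity of an opinion word that follows a negation (the "Negative Phrases" feature). The lexicon entries, the negation list and the function name lexicon_polarity are illustrative assumptions and are not taken from any of the surveyed systems; a practical lexicon would be grown from a resource such as WordNet.

```python
import re

# Hypothetical seed lexicon with known orientations (+1 positive, -1 negative).
# A real system would expand it with synonyms and antonyms from a dictionary.
LEXICON = {"good": 1, "great": 1, "happy": 1, "love": 1,
           "bad": -1, "poor": -1, "sad": -1, "hate": -1}
NEGATIONS = {"not", "no", "never"}  # illustrative negation words


def lexicon_polarity(text):
    """Return 'positive', 'negative' or 'neutral' for a piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True  # remember the negation for the next opinion word
            continue
        value = LEXICON.get(token, 0)
        if value:
            score += -value if negate else value
            negate = False  # a negation flips only the next opinion word
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("the picture quality of this camera is great"))
print(lexicon_polarity("the battery is not good"))
```

The comparative example from the introduction, "car x is cheaper than car y", contains no word from this seed lexicon and is therefore reported as neutral, which illustrates why a small manually collected set has to be grown before it is useful.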
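The process of Fig. 1 and the two machine learning steps of Section III-E (train on labeled data, then classify unlabeled data) might be sketched end to end as follows. This is a hedged illustration only: it uses a multinomial Naïve Bayes classifier with add-one smoothing over term-frequency features as one possible choice, and the stop-word subset, the slang table, the training Tweets and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

STOP_WORDS = {"are", "is", "am", "the", "a", "an", "this"}  # illustrative subset
SLANG = {"g8": "great"}  # hypothetical slang table


def preprocess(tweet):
    """Clean a raw Tweet and return its list of tokens (Section III-C)."""
    text = tweet.lower()                        # uniform casing
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"\brt\b", " ", text)         # remove the retweet marker
    text = re.sub(r"@\w+", " ", text)           # remove account Ids
    text = text.replace("#", " ")               # drop '#', keep the word (assumption)
    tokens = re.findall(r"[a-z0-9]+", text)     # discards emoticons and symbols
    tokens = [SLANG.get(t, t) for t in tokens]  # decompress slang words
    # Collapse runs of 3+ identical letters (happyyy -> happy). This is crude:
    # a word such as "goood" would need a dictionary check.
    tokens = [re.sub(r"(.)\1{2,}", r"\1", t) for t in tokens]
    return [t for t in tokens
            if t.isalpha() and len(t) > 1 and t not in STOP_WORDS]


def train(labeled_tweets):
    """Step 1: estimate class counts and per-class term frequencies."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tweet, label in labeled_tweets:
        tokens = preprocess(tweet)
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab


def classify(model, tweet):
    """Step 2: return the label with the highest log-posterior."""
    class_counts, term_counts, vocab = model
    total = sum(class_counts.values())
    tokens = [t for t in preprocess(tweet) if t in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(term_counts[label].values()) + len(vocab)
        for token in tokens:
            # add-one (Laplace) smoothing for terms unseen in this class
            score += math.log((term_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical manually labeled training set.
TRAIN = [
    ("I love this camera, picture quality is great :)", "positive"),
    ("Happyyy with the new phone #awesome", "positive"),
    ("RT @shop: terrible service, I hate waiting", "negative"),
    ("Battery is bad and the screen is poor http://t.co/x", "negative"),
]
model = train(TRAIN)
print(classify(model, "the camera is g8"))  # prints: positive
```

Any of the other supervised algorithms named in Section III-E could replace the classifier in this sketch; the pre-processing and feature extraction steps would stay the same.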
Moreover, we presented the parametric comparison of the discussed supervised machine learning techniques based on our identified parameters. It has been found that various techniques applied for sentiment analysis are domain specific and language specific.

Hence, the future opportunities in the domain of sentiment analysis include developing a sentiment classification technique that is applicable to any data regardless of domain. In addition, language diversity in social media data is a key issue that needs to be addressed in future work. Moreover, some of the more crucial challenges of Natural Language Processing (NLP) can also be pursued as further developments in sentiment analysis, such as hidden or veiled sentiment detection, satire detection, comparison or association handling and emoticon detection.

REFERENCES

[1] I. King, J. Li and K. T. Chan, “A Brief Survey of Computational Approaches in Social Computing”, in Proc. of Int. Joint Conf. on Neural Network, 2009, pp. 2699-2706.
[2] S. R. Barahate and V. M. Shelake, “A Survey and Future Vision of Data mining in Educational Field”, in Proc. 2nd Int. Conf. on Advanced Computing & Communication Technology, 2012, pp. 96-100.
[3] Bing Liu, N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing, Second Edition, 2010, pp. 1-3860-68.
[4] M. Dredze, “How Social Media Will Change Public Health”, IEEE Intelligent Systems, 2012, pp. 1541-1672.
[5] G. Siemens and P. Long, “Penetrating the fog: Analytics in learning and education”, Educause Review, 2011, vol. 46, no. 5, pp. 30-32.
[6] C. Romero and S. Ventura, “Educational Data Mining: A Review of the State of the Art”, in Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions, 2010, vol. 40, no. 6, pp. 601-618.
[7] X. Chen, M. Vorvoreanu and K. Madhavan, “Mining Social Media Data to Understand Students’ Learning Experiences”, IEEE Transaction, 2014, vol. 7, no. 3, pp. 246-259.
[8] Weil, Kevin (VP of Product for Revenue and former Big Data engineer, Twitter Inc.), “Measuring Tweets.” Twitter Official Blog, February 22, 2010. [Online]. Available: http://www.internetlivestats.com/twitter-statistics. [Accessed: 19-Oct-2015].
[9] Krikorian, Raffi (VP, Platform Engineering, Twitter Inc.), “New Tweets per second record, and how!” Twitter Official Blog. August 16, 2013. [Online]. Available: https://blog.twitter.com/2013/new-tweets-per-second-record-and-how. [Accessed: 19-Oct-2015].
[10] Twitter Engineering, “200 million Tweets per day.” Twitter Official Blog. June 30, 2011. [Online]. Available: https://blog.twitter.com/2011/200-million-tweets-per-day. [Accessed: 19-Oct-2015].
[11] “Three Cool and Inexpensive Tools to Track Twitter Hashtags”, June 11, 2013. [Online]. Available: http://dannybrown.me/2013/06/11/three-cool-toolstwitterhashtags/ [Accessed: 19-Oct-2015].
[12] “Twitter turns six.” Twitter Official Blog. March 21, 2012. [Online]. Available: https://blog.twitter.com/2012/twitter-turns-six. [Accessed: 19-Oct-2015].
[13] N. Kasture and P. Bhilare, “An Approach for Sentiment analysis on social networking sites”, Computing Communication Control and Automation (ICCUBEA), 2015, pp. 390-395.
[14] S. Bhuta, A. Doshi, U. Doshi and M. Narvekar, “A review of techniques for sentiment analysis Of Twitter data”, Issues and Challenges in Intelligent Computing Techniques (ICICT), 2014, pp. 583-591.
[15] M. S. Neethu and R. Rajasree, “Sentiment Analysis in Twitter using Machine Learning Techniques”, in 4th Int. Conf. of Computing, Communications and Networking Technologies (ICCCNT), 2013, pp. 1-5.
[16] S. Bahrainian and A. Dangel, “Sentiment Analysis using Sentiment Features”, in Int. joint Conf. of Web Intelligence and Intelligent Agent Technologies, 2013, pp. 26-29.
[17] G. Gautam and D. Yadav, “Sentiment analysis of twitter data using machine learning approaches and semantic analysis”, in 7th Int. Conf. on Contemporary Computing, 2014, pp. 437-442.
[18] B. Gokulakrishnan, P. Plavnathan, R. Thiruchittampalam, A. Perera and N. Prasath, “Opinion Mining and Sentiment Analysis on a Twitter Data Stream”, in Int. Conf. on Advances in ICT for Engineering Regions, 2012, pp. 182-188.
[19] A. Celikyilmaz, D. Hakkani-Tur and J. Feng, “Probabilistic model-based sentiment analysis of twitter messages”, IEEE Spoken Language Technology Workshop (SLT), 2010, pp. 79-84.
[20] V. Sehgal and C. Song, “SOPS: Stock Prediction Using Web Sentiment”, in 7th IEEE Int. Conf. on Data Mining Workshop, 2007, pp. 21-26.
[21] Z. Kechaou, B. M. Ammar and A. M. Alimi, “Improving e-learning with sentiment analysis of users' opinions”, in Global Engineering Education Conference (EDUCON), 2011, pp. 1032-1038.
[22] A. C. E. S. Lima and L. N. de Castro, “Automatic sentiment analysis of Twitter messages”, in 4th Int. Conf. on Computational Aspects of Social Networks (CASoN), 2012, pp. 52-57.
[23] R. Batool, A. M. Khattak, J. Maqbool and S. Lee, “Precise tweet classification and sentiment analysis”, in 12th Int. Conf. on Computer and Information Science (ICIS), 2013, pp. 461-466.
[24] N. Altrabsheh, M. Cocea and S. Fallahkhair, “Sentiment analysis: towards a tool for analysing real-time students feedback”, in 26th International Conference on Tools with Artificial Intelligence, 2014, pp. 420-423.
[25] Z. Wang, V. J. Chuan Tong, X. Xin and H. C. Chin, “Anomaly Detection through Enhanced Sentiment Analysis on Social Media Data”, in 6th International Conference on Cloud Computing Technology and Science, 2014, pp. 918-922.
[26] V. Singh and S. K. Dubey, “Opinion mining and analysis: A literature review”, in 5th Int. Conf. on Confluence The Next Generation Information Technology Summit (Confluence), 2014, pp. 232-239.
[27] K. Khan, B. Baharudin, A. Khan and F. Malik, “Mining Opinion from Text Documents: A Survey”, Digital Ecosystems and Technologies, 2009, pp. 217-222.
[28] K. Ghag and K. Shah, “Comparative analysis of the techniques for Sentiment Analysis”, in Int. Conf. on Advances in Technology and Engineering, 2013, pp. 1-7.
[29] W. Medhat, A. Hassan and H. Korashy, “Sentiment analysis algorithms and applications: A survey”, Ain Shams Engineering Journal, vol. 5, no. 4, 2014, pp. 1093-1113.
[30] J. Khairnar and M. Kinikar, “Machine Learning Algorithms for Opinion Mining and Sentiment Classification”, in International Journal of Scientific and Research Publications, vol. 3, no. 6, June 2013.
[31] A. Sarlan, C. Nadam and S. Basri, “Twitter Sentiment Analysis”, in Int. Conf. on Information Technology and Multimedia, 2014, pp. 213-216.
[32] P. Saloun, M. Hruzik and I. Zelinka, “Sentiment Analysis – e-Bussines and e-Learning Common Issue”, in 11th IEEE Int. Conf. on Emerging eLearning Technologies and Applications, 2013, pp. 339-34.