Twitter Data Preprocessing For Spam Detection

Myungsook Klassen
Computer Science Dept.
California Lutheran University
Thousand Oaks, USA
e-mail: mklassen@clunet.edu
Abstract - Detecting Twitter spammer accounts with various machine learning classification algorithms was explored from the perspective of data preprocessing techniques. Data normalization, discretization, and transformation were the preprocessing methods used in our study. Additionally, attribute reduction was performed by computing correlation coefficients among attributes and by other attribute selection methods to obtain high classification rates with classifiers such as Support Vector Machine, neural networks, J4.8, and Random Forests. When the top 24 attributes were selected and used with these classifiers, the overall classification rates obtained were very close, in the range of 84.30% to 89%. There was no unique subset of attributes which performed best; various different sets of attributes played important roles.

Keywords - data preprocessing; spam detection; social network; classification.

I. INTRODUCTION

Twitter [17] was started in 2006 by Jack Dorsey as an online social networking and microblogging service for users to send and receive short messages (called tweets) of up to 140 characters. Twitter's 140-character limit on a message suits modern-day busy people's way of acquiring information in a short and quick way. There is much less mindless minutia to read through in short tweets, and people can spend 5 to 10 minutes on Twitter to find out fast what is happening in the world.

As a result, within the last few years Twitter has grown to be one of the most popular social network sites, with half a billion daily tweets as of October 2012, up from 140 million per day in early 2011. Along with Twitter's growth, spam activities on Twitter have increased and become a problem. Spamming has been around since the birth of the internet and email, and is not a problem unique to Twitter, but Twitter introduces new kinds of spam behavior. Unlike the popular social networking services Facebook and MySpace, anyone can read tweets without a Twitter account, though one must register to post tweets. The fact that most accounts are public and can be followed without the user's consent provides spammers with opportunities to easily follow legitimate users.

A recent spamming campaign took place during the Russian parliamentary election of December 4, 2011 [8]. For two days after the election, Twitter users posted over 800,000 tweets containing a hashtag related to the elections. It turned out nearly half the tweets were spam with unrelated contents, and the spam tweets were sent out through fraudulent accounts purchased by a single person in an attempt to disrupt political conversations following the announcement of the election results.

Twitter currently blocks malware by in-house-built heuristic rules using Google's Safe Browsing application programming interface (API) [18] to filter the spam activities described in the Twitter Rules posted on its web site. Some of the spam definitions in the rules are excessive account creation in a short time period, excessive requests to befriend other users, posting misleading links, and posting updates unrelated to a topic using a hashtag #. Twitter also checks tweet contents containing uniform resource locators (URLs) to see if they are in its blacklist database of known harmful sites. Harmful sites can be phishing sites, sites that download malicious software onto users' computers, or spam sites that request personal information. However, C. Grier et al. [2] show that it takes a few weeks for URLs posted on Twitter to appear on its blacklist. In addition to what Twitter itself does to prevent spamming, Twitter relies on users to report spam. Once a report is filed, Twitter investigates it to decide whether or not to suspend an account. Currently, much research is going on to find a method to detect Twitter spamming in an efficient and automated way. After all, it is not very reliable for the Twitter community to depend on users identifying spam manually based on previous spam activities.

An example of a tweet is shown in Figure 1. It shows the tweet content of a Twitter user "CLU Career Services" with the Twitter ID "CLUCareer". When a Twitter user name or a Twitter user ID is clicked, its public profile page shows a full name, a location, a web page, and a short bio, along with tweet contents, the number of tweets, the total number of followers and their Twitter user names, the total number of people the user is following, and the Twitter user names of the people the user is following. The tweet content in Figure 1 contains a shortened link URL bloom.bg/11QHmLM which points to an article page at [19]. Such shortened URLs allow users to post a message within 140 characters, but hide the source URLs, thus providing an easy opportunity for malicious users to phish and spam. This tweet message also contains a topic "Payrolls", identified by the hashtag # in front of it, and a mention of the BloombergNews user, marked with the @ symbol in front of it.

All this information can be gathered using the Twitter API by crawling the Twitter web site. From these collected raw data, different attributes, either content attributes or user behavior attributes [13], can be created.
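As a concrete illustration of content-attribute creation (a minimal sketch, not the extraction code used in any of the cited studies; the regular expressions and the sample tweet are assumptions), a few per-tweet content attributes can be derived from a raw tweet string like this:

```python
import re

def content_attributes(tweet: str) -> dict:
    """Derive a few per-tweet content attributes from raw tweet text."""
    words = tweet.split()
    n_words = len(words) or 1  # guard against an empty tweet
    n_hashtags = len(re.findall(r"#\w+", tweet))  # topics such as #Payrolls
    n_mentions = len(re.findall(r"@\w+", tweet))  # mentions such as @BloombergNews
    # Rough URL pattern; shortened links such as bloom.bg/11QHmLM also match.
    n_urls = len(re.findall(r"(?:https?://|\w+\.\w+/)\S+", tweet))
    return {
        "chars_per_tweet": len(tweet),
        "words_per_tweet": len(words),
        "hashtags_per_word": n_hashtags / n_words,
        "mentions_per_tweet": n_mentions,
        "urls_per_tweet": n_urls,
    }

# Example modeled loosely on the tweet in Figure 1.
tweet = "U.S. #Payrolls Rose 146,000 in November @BloombergNews bloom.bg/11QHmLM"
attrs = content_attributes(tweet)
```

Per-account attributes such as "number of hashtags per word on each tweet (mean)" would then be means, medians, minimums, and maximums of such per-tweet values over all of a user's tweets.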
II. RELATED WORK

Alex Wang [13] crawled Twitter and collected 29,847 users with around 500K tweets and 49M follower/friend relationships. He manually labeled each tweet as either spam or non-spam and found that only about 1% of accounts are spam. A graph-based attribute, reputation, and content-based attributes such as the existence of duplicate tweets, the number of HTTP links, the number of replies/mentions, and the number of tweets with trending topics were used with a Bayesian classifier for spam detection, and an 89% overall classification rate was reported.

McCord and Chuah [14] collected 1000 Twitter user accounts and extracted the following attributes: the distribution of tweets over 24 hours, the number of friends, the number of followers, the number of URLs, the number of replies/mentions, weighted keywords, the number of retweets, and the number of hashtags, and ran four different classifiers with these attribute values. They reported that Random Forest performs best among the four classifiers, with an overall precision value of 0.957.

Benevenuto et al. [10] gathered a large Twitter data set related to three trending topics and extracted 39 content attributes and 23 user behavior attributes, which were used with a Support Vector Machine (SVM) classifier to detect Twitter spammers. A further description of the data can be found in Section III, since our study was conducted using this data set. They reported classification rates of 70.1% and 96.4% for the spam class and non-spam class, respectively.

Twitter spammers are known to employ automation to publish tweets. Zhang et al. [16] presented a technique to detect automated Twitter content updates. They tested 19,436 accounts and reported that 16% exhibit highly automated behavior, while verified accounts, most-followed accounts, and followers of the most-followed account all have lower automation rates of 6.9%, 12% and 4.2%, respectively.

III. EXPERIMENTAL SETUP

A. Data set

Data from Benevenuto et al. [10] was used as a basis for this study. In that work, Twitter was crawled to collect tweets on the three most trendy topics at the time, in August 2009, and 1065 legitimate accounts and 355 spam accounts were used. The data contains thirty-nine content attributes and twenty-three user behavior attributes, all numeric values, derived from the raw tweet information. Content attributes are the fractions of the following: tweets replied, tweets with spam words, and tweets with URLs, along with the mean, median, min, and max of the following: the number of hashtags per word on each tweet, URLs per word on each tweet, characters per tweet, hashtags per tweet, mentions per tweet, numeric characters per tweet, URLs per tweet, words per tweet, and times a tweet is retweeted.

Two additional attributes computed from existing attributes were added to the data for our study. The first is reputation, defined by Wang [13] as the ratio of the number of followers to the sum of the number of followers and the number of followees. The second is the influence factor, which is defined for this study as the ratio of the sum of the number of times a user mentioned others and the number of times a user was mentioned to the sum of the number of times a user mentioned others, the number of times a user was mentioned, and the number of times a user replied.

B. Methods

Four classifiers implemented in the open source data mining suite WEKA were used in our experiments: SVM, random forest (RF), a multi-layer back-propagation neural network, and the J4.8 decision tree. WEKA [20] is data mining software developed at the University of Waikato, New Zealand. For SVM, the program grid.py from the libSVM implementation site [6] was used to select two important parameters: C, the penalty parameter of the error term, and gamma, the RBF kernel function coefficient. These values were then used for SVM in WEKA.

C. Evaluations

Ten-fold cross validation is used to measure the generalization performance of the classifiers used in this research. The method first partitions the data into 10 equal-sized segments; in each iteration, 9 segments are used for training and the 1 remaining segment is used for testing. This repeats 10 times, and the average of the 10 results from the testing segments is computed.

Classifier performance results are discussed using values derived from a confusion matrix. TABLE I shows a confusion matrix for two classes.

TABLE I: CONFUSION MATRIX

                  Predicted Class1    Predicted Class2
Actual Class1            a                   b
Actual Class2            c                   d

The True Positive (TP) rate for class 1 is a/(a+b) and the False Positive (FP) rate for class 1 is c/(c+d). Precision for class 1, P = a/(a+c), is the ratio of the number of data predicted correctly to the total predicted as class 1. Recall for class 1, R = a/(a+b), is the ratio of the number of data correctly predicted to the number of data in class 1. TP, FP, P, and R for class 2 are similarly defined. The classification rate, or average weighted TP rate in WEKA, is defined as the ratio of the number of correctly predicted data to the total number of data in both classes, (a+d)/(a+b+c+d). The F-measure is a weighted average of the precision P and recall R used to measure a test's accuracy, and is defined as 2*P*R/(P+R).

IV. EXPERIMENTS AND RESULTS

A. Normalization

The original data set has a vast range of attribute values. Seventeen attributes, such as the fraction of tweets replied, are
in a range between 0 and 1, while most content-based features, such as the number of followers, are in a range from 0 to over 40,000. The age of an account and the elapsed time between tweets are measured in seconds, so their values range between 0 and 87,000,000. We investigated whether attributes with greater numeric ranges dominate in learning and produce inaccurate classification. Experiments without normalization and with normalization into the range -1 to 1 were performed with libSVM, and the results are shown in TABLE II. Without normalizing the data, the spam class was predicted very poorly, with only a 6.8% TP rate and an overall classification rate of 68.9%. With data normalization, not only did the classification rate go up significantly to 88.3%, but, more importantly, the spam class TP value went up to 75.2%. When data is normalized between 0 and 1, results similar to those with normalization between -1 and 1 are obtained.

TABLE II. BETTER PERFORMANCE OF NORMALIZED DATA WITH SVM

Without scaling:
                      TP rate   FP rate   Precision   Recall
No spam class            1       0.932      0.682        1
Spam class             0.068       0          1        0.068
Classification rate    0.689     0.622      0.788      0.689

Data scaled [-1,1]:
                      TP rate   FP rate   Precision   Recall
No spam class          0.947     0.248      0.885      0.947
Spam class             0.752     0.053      0.876      0.752
Classification rate    0.883     0.183      0.882      0.883

Our next experiment evaluates the sensitivity of the classifiers to data normalization. Classification rates were obtained with the four classifiers using both the original data set and the normalized data set, and the results are presented in TABLE III. Without data normalization, both the multi-layer neural network and libSVM show low performance, with 68.68% and 70.42% classification rates respectively, while J48 and Random Forests consistently perform well regardless of data normalization.

TABLE III. PAIRED T-TEST OF F-MEASURE OF CLASSIFICATION RATES

                   libSVM   MultiLayer NN    J48    Random Forest
Original data       68.68       70.42       83.22       88.55
Normalized data     87.59       87.47       85.54       88.16

B. Manual attribute selection

The manual attribute selection process used for this study is based on the notion that an attribute with a high correlation with the class attribute, but a low correlation with other attributes, is a good attribute. The correlation between the class attribute and the attribute being reviewed is computed for all attributes. For instance, the "existence of spam words in the screen name" attribute has a correlation value of -0.01085 with the class attribute, so this attribute is considered not very useful and is eliminated from the attribute list. The attributes "min of the number of URLs per tweet" and "max of the number of URLs per tweet" showed a similar trend, and as a result they were also eliminated. Similar steps were taken with all other attributes to select good attributes.

The redundancy of attributes is evaluated by their pairwise correlation values, some of which are shown in TABLE IV. When there is high correlation between two attributes, one attribute is eliminated; if there is no strong correlation between two attributes, both are kept. In the case of the number of words per tweet and the number of characters per tweet, there is little correlation, so both are kept. After this manual process, 24 attributes were selected.

TABLE IV. CORRELATION COEFFICIENTS OF SOME PAIRED ATTRIBUTES

Two attributes selected for correlation inspection          Correlation coefficient
number of hashtags per word on each tweet (mean),
number of hashtags per tweet (mean)                         0.886909
number of posted tweets per day (mean),
number of posted tweets per week (mean)                     0.742563
number of followees of a user's followers,
number of followees                                         0.903385
number of followees of a user's followers,
number of followers                                         0.862586
number of words per tweet (mean),
number of characters per tweet (mean)                       0.349

C. Using WEKA attribute selection methods

Chi Squared Attribute Evaluation, Filtered Attribute Evaluation, Info Gain Attribute Evaluation, Gain Ratio Attribute Evaluation, and OneR Attribute Evaluation with the Ranker selection method were used to rank the original 62 attributes. The top 10 rankings from Filtered Attribute Evaluation and Info Gain Attribute Evaluation are quite similar, but are quite different from those obtained by the other 3 methods, and the top 10 rankings from these 3 methods have only a few attributes in common. Simply put, there is no group of top 10 attributes common to all selection methods. TABLE V shows that three attributes, the number of followers per followees, the fraction of tweets replied, and the number of times the user replied, are in the top 10 ranked attributes for all 5 selection methods. When the comparison is made for the top 24 attributes, the situation is worse: there is a smaller percentage of common attributes selected by all 5 methods.

However, despite the different rankings of attributes, when the top 24 attributes were used with the classifiers, all of them showed comparable classification rates, slightly lower than the results obtained when the manually selected 24 attributes were used, as shown in TABLE V.
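The manual two-step filter of Section IV.B (keep attributes that correlate with the class attribute, then drop one attribute of each highly correlated pair) can be sketched as follows. This is a minimal illustration under assumed fixed thresholds; the study applied these judgments attribute by attribute rather than with hard cutoffs, so `class_thresh` and `redundancy_thresh` are illustrative choices:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attribute columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def manual_select(columns, labels, class_thresh=0.05, redundancy_thresh=0.85):
    """columns: {attribute name: list of values}; labels: numeric class values."""
    # Step 1: drop attributes with near-zero correlation to the class
    # (e.g., "spam words in screen name" with coefficient -0.01085).
    kept = [name for name, col in columns.items()
            if abs(pearson(col, labels)) >= class_thresh]
    # Step 2: among the survivors, keep only one attribute of any highly
    # correlated pair (e.g., hashtags per word vs. hashtags per tweet, 0.886909).
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) < redundancy_thresh
               for s in selected):
            selected.append(name)
    return selected
```

With toy columns where "b" duplicates "a" and "c" is uncorrelated with the class, `manual_select({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}, [0, 0, 1, 1])` keeps only `["a"]`.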
McCord and Chuah [14] collected data at random without considering trendy topics, while Benevenuto did so with three specific popular topics at the time the Twitter data was gathered. One of the two unique attributes used in McCord's work is the word weight metric, which is based on the difference between the weight of spam words and the weight of legitimate words in tweets; the sum of all weights is used as the word weight metric attribute. This weight parameter reflects the probability that a word is in a list of spam words versus the probability that it is in the regular word list. The other unique attribute used in their study is the time of day a tweet was posted; the rationale for this attribute is that spammers work at night.

Wang [13] reported a TP value of 0.89 for the spam class, an overall classification rate of 91.7%, and a recall value R of 0.917 with the Naïve Bayesian classifier. With SVM, neural networks, and the J48 decision tree, overall classification rates of 100%, 100%, and 66.7% respectively were reported. But the reported recall rates for these three classifiers are very low: 0.333, 0.417, and 0.25 for the decision tree, neural network, and SVM respectively. In his work, as in McCord's work, random Twitter accounts were collected without considering trendy topics. The unique attributes used in his work are reputation, which we used for our study, and the total number of duplicate tweets, which is computed using the Levenshtein distance; the rationale is that spammers use different user names to post the same contents. An observation worth mentioning is that the number of hashtags for the spam class in his study is lower than for the non-spam class. Many spam accounts have at most 2 hashtags on average, while non-spam accounts have anywhere from 0 to 20 (an estimate of the average number of hashtags by a quick visual inspection of Figure 7d in his work is about 7).

This is in stark contrast to what Benevenuto's data shows regarding hashtags: there, spammers post a much higher fraction of hashtags per tweet. This contrast may come from the fact that the data crawled from Twitter by Benevenuto et al. used trendy topics, while Wang's was gathered at random without using any trendy topic. The comparison suggests that spammers use more hashtags to capture legitimate users' attention when hot trendy topics are being discussed, but when few hot trendy topics are being discussed among Twitter users, hashtag usage among spammers and non-spammers is not much different.

In conclusion, our study shows that normalization, transformation, and discretization improve Twitter spam/non-spam classification rates, especially for the spam class, when libSVM and back-propagation multi-layer neural networks are used. Our study also demonstrates that when using a smaller number of attributes selected manually using correlation coefficients, equally high classification rates were obtained. This is an important finding for real-time detection of spam. Further investigation of Twitter characteristics is needed to understand why the spam class TP value is not as high as we hoped.

REFERENCES

[1] D. Terdiman, "Report: Twitter hits half a billion tweets per day," http://news.cnet.com, December 2012. Retrieved: January, 2013.
[2] C. Grier, K. Thomas, V. Paxson, and M. Zhang, "@spam: The underground on 140 characters or less," Proceedings of CCS '10 Conference, pp. 27-37, October 2010.
[3] K. Thomas, C. Grier, V. Paxson, and D. Song, "Suspended Accounts in Retrospect: An analysis of Twitter spam," Proceedings of IMC Conference, pp. 243-256, November 2011.
[4] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?" Proceedings of WWW2010 Conference, pp. 591-600, April 2010.
[5] S. Ghosh, B. Viswanath, F. Kooti, N. Sharma, G. Korlam, F. Benevenuto, N. Ganguly, and K. Gummadi, "Understanding and combating link farming in the Twitter social network," Proceedings of WWW2012 Conference, pp. 61-70, April 2012.
[6] C. Hsu, C. Chuan, and C. Lin, "A Practical guide to support vector classification," http://www.csie.ntu.edu.tw/~cjlin. Retrieved: January, 2013.
[7] C. Grier, K. Thomas, V. Paxson, and M. Zhang, "@spam: The underground on 140 characters or less," Proceedings of CCS '10 Conference, pp. 27-37, October 2010.
[8] http://www.icsi.berkeley.edu/icsi/gazette/2012/05/twitter-spam. Retrieved: December, 2012.
[9] W. S. Sarle, "Neural Network FAQ," 1997, ftp://ftp.sas.com/pub/neural/FAQ.html. Retrieved: December, 2012.
[10] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, "Detecting Spammers on Twitter," Proceedings of CEAS 2010 Conference, pp. 21-30, July 2010.
[11] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi, "Measuring user influence in Twitter: The million follower fallacy," Proceedings of ICWSM 2010 Conference, pp. 10-17, May 2010.
[12] J. Rabelo, R. Prudencio, and F. Barros, "Using link structure to infer opinions in social networks," Proceedings of SMC Conference, pp. 681-685, October 2012.
[13] A. Wang, "Don't follow me: spam detection in Twitter," Proceedings of the 5th International Conference on Security and Cryptography, pp. 142-151, July 2010.
[14] M. McCord and M. Chuah, "Spam detection on Twitter using traditional classifiers," Proceedings of the International Conference on Autonomic and Trusted Computing (ATC), pp. 175-186, September 2011.
[15] U. C. Berkeley, "Twitter Spam," International Computer Science Institute (ICSI) ICSI Gazette, May 2012.
[16] C. Zhang and V. Paxson, "Detecting and analyzing automated activity on Twitter," Proceedings of PAM Conference, pp. 102-111, March 2011.
[17] http://www.twitter.com. Retrieved: December, 2012.
[18] https://developers.google.com/safe-browsing/. Retrieved: January, 2013.
[19] http://www.bloomberg.com/news/2012-12-07/. Retrieved: December, 2012.
[20] http://www.cs.waikato.ac.nz/ml/weka/. Retrieved: December, 2012.