Twitter Data Preprocessing For Spam Detection

Myungsook Klassen
Computer Science Dept.
California Lutheran University
Thousand Oaks, USA
e-mail: mklassen@clunet.edu
Abstract - Detecting Twitter spammer accounts with various machine learning classification algorithms was explored from the perspective of data preprocessing techniques. Data normalization, discretization, and transformation were the preprocessing methods used in our study. Additionally, attribute reduction was performed by computing correlation coefficients among attributes and by other attribute selection methods to obtain high classification rates with classifiers such as Support Vector Machine, neural networks, J4.8, and Random Forests. When the top 24 attributes were selected and used with these classifiers, the overall classification rates obtained were very close, in the range of 84.30% to 89%. There was no unique subset of attributes which performed best; various different sets of attributes played important roles.

Keywords - data preprocessing; spam detection; social network; classification.

I. INTRODUCTION

Twitter [17] was started in 2006 by Jack Dorsey as an online social networking and microblogging service for users to send and receive short messages (called tweets) of up to 140 characters. Twitter's 140-character limit on a message suits modern-day busy people's way of acquiring information in a short and quick way. There is much less mindless minutia to read through in short tweets, and people can spend 5 to 10 minutes on Twitter to find out fast what is happening in the world.

As a result, within the last few years Twitter has grown to be one of the most popular social network sites, with half a billion daily tweets as of October 2012, up from 140 million per day in early 2011. Along with Twitter's growth, spam activities on Twitter have increased and become a problem. Spamming has been around since the birth of the internet and email, and is not a problem unique to Twitter, but Twitter introduces new kinds of spam behavior. Unlike the popular social networking services Facebook and MySpace, anyone can read tweets without a Twitter account, though one must register to post tweets. The fact that most accounts are public and can be followed without the user's consent provides spammers with opportunities to easily follow legitimate users.

A recent spamming campaign took place during the Russian parliamentary election of December 4, 2011 [8]. For two days after the election, Twitter users posted over 800,000 tweets containing a hashtag related to the elections. It turned out nearly half the tweets were spam with unrelated contents, and the spam tweets were sent out through fraudulent accounts purchased by a single person in an attempt to disrupt political conversations following the announcement of the election results.

Twitter currently blocks malware by in-house-built heuristic rules using Google's Safe Browsing application programming interface (API) [18] to filter the spam activities described in the Twitter Rules posted on its web site. Some of the spam definitions in the rules are excessive account creation in a short time period, excessive requests to befriend other users, posting misleading links, and posting updates unrelated to a topic using a hashtag #. Twitter also checks tweet contents containing uniform resource locators (URLs) to see if they are in its blacklist database of known harmful sites. Harmful sites can be phishing sites, sites that download malicious software onto users' computers, or spam sites that request personal information. However, C. Grier et al. [2] show that it takes a few weeks for URLs posted on Twitter to appear on its blacklist. In addition to what Twitter itself does to prevent spamming, Twitter relies on users to report spam. Once a report is filed, Twitter investigates it to decide whether or not to suspend an account. Currently, much research is going on to find a method to detect Twitter spamming in an efficient and automated way. After all, it is not very reliable for the Twitter community to depend on users identifying spam manually based on previous spam activities.

An example of a tweet is shown in Figure 1. It shows the tweet content of a Twitter user "CLU Career Services" with the Twitter ID "CLUCareer". When a Twitter user name or a Twitter user ID is clicked, its public profile page shows a full name, a location, a web page, and a short bio, along with tweet contents, the number of tweets, the total number of followers and their Twitter user names, the total number of people the user is following, and the Twitter user names of the people the user is following. The tweet content in Figure 1 contains a shortened link URL bloom.bg/11QHmLM which points to an article page at [19]. Such shortened URLs allow users to post a message within 140 characters, but hide the source URLs, thus providing an easy opportunity for malicious users to phish and spam. This tweet message also contains a topic "Payrolls", identified by the hashtag # in front of it, and a mention of the BloombergNews user, marked with the @ symbol in front of it.

All this information can be gathered using the Twitter API by crawling the Twitter web site. From these collected raw data, different attributes, either content attributes or user behavior attributes [13], can be created.
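As a concrete illustration of content-attribute creation (a minimal sketch, not the extraction code used in any of the cited studies; the regular expressions and the sample tweet are assumptions), a few per-tweet content attributes can be derived from a raw tweet string like this:

```python
import re

def content_attributes(tweet: str) -> dict:
    """Derive a few per-tweet content attributes from raw tweet text."""
    words = tweet.split()
    n_words = len(words) or 1  # guard against an empty tweet
    n_hashtags = len(re.findall(r"#\w+", tweet))  # topics such as #Payrolls
    n_mentions = len(re.findall(r"@\w+", tweet))  # mentions such as @BloombergNews
    # Rough URL pattern; shortened links such as bloom.bg/11QHmLM also match.
    n_urls = len(re.findall(r"(?:https?://|\w+\.\w+/)\S+", tweet))
    return {
        "chars_per_tweet": len(tweet),
        "words_per_tweet": len(words),
        "hashtags_per_word": n_hashtags / n_words,
        "mentions_per_tweet": n_mentions,
        "urls_per_tweet": n_urls,
    }

# Example modeled loosely on the tweet in Figure 1.
tweet = "U.S. #Payrolls Rose 146,000 in November @BloombergNews bloom.bg/11QHmLM"
attrs = content_attributes(tweet)
```

Per-account attributes such as "number of hashtags per word on each tweet (mean)" would then be means, medians, minimums, and maximums of such per-tweet values over all of a user's tweets.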
II. RELATED WORK

Alex Wang [13] crawled Twitter and collected 29,847 users with around 500K tweets and 49M follower/friend relationships. He manually labeled each tweet as either spam or non-spam and found that only about 1% of accounts are spam. A graph-based attribute, reputation, and content-based attributes such as the existence of duplicate tweets, the number of HTTP links, the number of replies/mentions, and the number of tweets with trending topics were used with a Bayesian classifier for spam detection, and an 89% overall classification rate was reported.

McCord and Chuah [14] collected 1000 Twitter user accounts and extracted the following attributes: the distribution of tweets over 24 hours, the number of friends, the number of followers, the number of URLs, the number of replies/mentions, weighted keywords, the number of retweets, and the number of hashtags, and ran four different classifiers with these attribute values. They reported that Random Forest performs best among the four classifiers, with an overall precision value of 0.957.

Benevenuto et al. [10] gathered a large Twitter data set related to three trending topics and extracted 39 content attributes and 23 user behavior attributes, which were used with a Support Vector Machine (SVM) classifier to detect Twitter spammers. A further description of the data can be found in Section III, since our study was conducted using this data set. They reported classification rates of 70.1% and 96.4% for the spam class and non-spam class, respectively.

Twitter spammers are known to employ automation to publish tweets. Zhang et al. [16] presented a technique to detect automated Twitter content updates. They tested 19,436 accounts and reported that 16% exhibit highly automated behavior, while verified accounts, most-followed accounts, and followers of the most-followed account all have lower automation rates of 6.9%, 12% and 4.2%, respectively.

III. EXPERIMENTAL SETUP

A. Data set

Data from Benevenuto et al. [10] was used as a basis for this study. In that work, Twitter was crawled to collect tweets on the three most trendy topics at the time, in August 2009, and 1065 legitimate accounts and 355 spam accounts were used. The data contains thirty-nine content attributes and twenty-three user behavior attributes, all numeric values, derived from the raw tweet information. Content attributes are the fractions of the following: tweets replied, tweets with spam words, and tweets with URLs, along with the mean, median, min, and max of the following: the number of hashtags per word on each tweet, URLs per word on each tweet, characters per tweet, hashtags per tweet, mentions per tweet, numeric characters per tweet, URLs per tweet, words per tweet, and times a tweet is retweeted.

Two additional attributes computed from existing attributes were added to the data for our study. The first is reputation, defined by Wang [13] as the ratio of the number of followers to the sum of the number of followers and the number of followees. The second is the influence factor, which is defined for this study as the ratio of the sum of the number of times a user mentioned others and the number of times a user was mentioned to the sum of the number of times a user mentioned others, the number of times a user was mentioned, and the number of times a user replied.

B. Methods

Four classifiers implemented in the open source data mining suite WEKA were used in our experiments: SVM, random forest (RF), a multi-layer back-propagation neural network, and the J4.8 decision tree. WEKA [20] is data mining software developed at the University of Waikato, New Zealand. For SVM, the program grid.py from the libSVM implementation site [6] was used to select two important parameters: C, the penalty parameter of the error term, and gamma, the RBF kernel function coefficient. These values were then used for SVM in WEKA.

C. Evaluations

Ten-fold cross validation is used to measure the generalization performance of the classifiers used in this research. The method first partitions the data into 10 equal-sized segments; in each iteration, 9 segments are used for training and the 1 remaining segment is used for testing. This repeats 10 times, and the average of the 10 results from the testing segments is computed.

Classifier performance results are discussed using values derived from a confusion matrix. TABLE I shows a confusion matrix for two classes.

TABLE I: CONFUSION MATRIX

                  Predicted Class1    Predicted Class2
Actual Class1            a                   b
Actual Class2            c                   d

The True Positive (TP) rate for class 1 is a/(a+b) and the False Positive (FP) rate for class 1 is c/(c+d). Precision for class 1, P = a/(a+c), is the ratio of the number of data predicted correctly to the total predicted as class 1. Recall for class 1, R = a/(a+b), is the ratio of the number of data correctly predicted to the number of data in class 1. TP, FP, P, and R for class 2 are similarly defined. The classification rate, or average weighted TP rate in WEKA, is defined as the ratio of the number of correctly predicted data to the total number of data in both classes, (a+d)/(a+b+c+d). The F-measure is a weighted average of the precision P and recall R used to measure a test's accuracy, and is defined as 2*P*R/(P+R).

IV. EXPERIMENTS AND RESULTS

A. Normalization

The original data set has a vast range of attribute values. Seventeen attributes, such as the fraction of tweets replied, are
in a range between 0 and 1, while most content-based features, such as the number of followers, are in a range from 0 to over 40,000. The age of an account and the elapsed time between tweets are measured in seconds, so their values range between 0 and 87,000,000. We investigated whether attributes with greater numeric ranges dominate in learning and produce inaccurate classification. Experiments without normalization and with normalization into the range -1 to 1 were performed with libSVM, and the results are shown in TABLE II. Without normalizing the data, the spam class was predicted very poorly, with only a 6.8% TP rate and an overall classification rate of 68.9%. With data normalization, not only did the classification rate go up significantly to 88.3%, but, more importantly, the spam class TP value went up to 75.2%. When data is normalized between 0 and 1, results similar to those with normalization between -1 and 1 are obtained.

TABLE II. BETTER PERFORMANCE OF NORMALIZED DATA WITH SVM

Without scaling:
                      TP rate   FP rate   Precision   Recall
No spam class            1       0.932      0.682        1
Spam class             0.068       0          1        0.068
Classification rate    0.689     0.622      0.788      0.689

Data scaled [-1,1]:
                      TP rate   FP rate   Precision   Recall
No spam class          0.947     0.248      0.885      0.947
Spam class             0.752     0.053      0.876      0.752
Classification rate    0.883     0.183      0.882      0.883

Our next experiment evaluates the sensitivity of the classifiers to data normalization. Classification rates were obtained with the four classifiers using both the original data set and the normalized data set, and the results are presented in TABLE III. Without data normalization, both the multi-layer neural network and libSVM show low performance, with 68.68% and 70.42% classification rates respectively, while J48 and Random Forests consistently perform well regardless of data normalization.

TABLE III. PAIRED T-TEST OF F-MEASURE OF CLASSIFICATION RATES

                   libSVM   MultiLayer NN    J48    Random Forest
Original data       68.68       70.42       83.22       88.55
Normalized data     87.59       87.47       85.54       88.16

B. Manual attribute selection

The manual attribute selection process used for this study is based on the notion that an attribute with a high correlation with the class attribute, but a low correlation with other attributes, is a good attribute. The correlation between the class attribute and the attribute being reviewed is computed for all attributes. For instance, the "existence of spam words in the screen name" attribute has a correlation value of -0.01085 with the class attribute, so this attribute is considered not very useful and is eliminated from the attribute list. The attributes "min of the number of URLs per tweet" and "max of the number of URLs per tweet" showed a similar trend, and as a result they were also eliminated. Similar steps were taken with all other attributes to select good attributes.

The redundancy of attributes is evaluated by their pairwise correlation values, some of which are shown in TABLE IV. When there is high correlation between two attributes, one attribute is eliminated; if there is no strong correlation between two attributes, both are kept. In the case of the number of words per tweet and the number of characters per tweet, there is little correlation, so both are kept. After this manual process, 24 attributes were selected.

TABLE IV. CORRELATION COEFFICIENTS OF SOME PAIRED ATTRIBUTES

Two attributes selected for correlation inspection          Correlation coefficient
number of hashtags per word on each tweet (mean),
number of hashtags per tweet (mean)                         0.886909
number of posted tweets per day (mean),
number of posted tweets per week (mean)                     0.742563
number of followees of a user's followers,
number of followees                                         0.903385
number of followees of a user's followers,
number of followers                                         0.862586
number of words per tweet (mean),
number of characters per tweet (mean)                       0.349

C. Using WEKA attribute selection methods

Chi Squared Attribute Evaluation, Filtered Attribute Evaluation, Info Gain Attribute Evaluation, Gain Ratio Attribute Evaluation, and OneR Attribute Evaluation with the Ranker selection method were used to rank the original 62 attributes. The top 10 rankings from Filtered Attribute Evaluation and Info Gain Attribute Evaluation are quite similar, but are quite different from those obtained by the other 3 methods, and the top 10 rankings from these 3 methods have only a few attributes in common. Simply put, there is no group of top 10 attributes common to all selection methods. TABLE V shows that three attributes, the number of followers per followees, the fraction of tweets replied, and the number of times the user replied, are in the top 10 ranked attributes for all 5 selection methods. When the comparison is made for the top 24 attributes, the situation is worse: there is a smaller percentage of common attributes selected by all 5 methods.

However, despite the different rankings of attributes, when the top 24 attributes were used with the classifiers, all of them showed comparable classification rates, slightly lower than the results obtained when the manually selected 24 attributes were used, as shown in TABLE V.
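The manual two-step filter of Section IV.B (keep attributes that correlate with the class attribute, then drop one attribute of each highly correlated pair) can be sketched as follows. This is a minimal illustration under assumed fixed thresholds; the study applied these judgments attribute by attribute rather than with hard cutoffs, so `class_thresh` and `redundancy_thresh` are illustrative choices:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attribute columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def manual_select(columns, labels, class_thresh=0.05, redundancy_thresh=0.85):
    """columns: {attribute name: list of values}; labels: numeric class values."""
    # Step 1: drop attributes with near-zero correlation to the class
    # (e.g., "spam words in screen name" with coefficient -0.01085).
    kept = [name for name, col in columns.items()
            if abs(pearson(col, labels)) >= class_thresh]
    # Step 2: among the survivors, keep only one attribute of any highly
    # correlated pair (e.g., hashtags per word vs. hashtags per tweet, 0.886909).
    selected = []
    for name in kept:
        if all(abs(pearson(columns[name], columns[s])) < redundancy_thresh
               for s in selected):
            selected.append(name)
    return selected
```

With toy columns where "b" duplicates "a" and "c" is uncorrelated with the class, `manual_select({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [1, -1, 1, -1]}, [0, 0, 1, 1])` keeps only `["a"]`.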
McCord and Chuah [14] collected data at random without considering trendy topics, while Benevenuto did so with three specific popular topics at the time the Twitter data was gathered. One of the two unique attributes used in McCord's work is the word weight metric, which is based on the difference between the weight of spam words and the weight of legitimate words in tweets; the sum of all weights is used as the word weight metric attribute. This weight parameter reflects the probability that a word is in a list of spam words versus the probability that it is in the regular word list. The other unique attribute used in their study is the time of day a tweet was posted; the rationale for this attribute is that spammers work at night.

Wang [13] reported a TP value of 0.89 for the spam class, an overall classification rate of 91.7%, and a recall value R of 0.917 with the Naïve Bayesian classifier. With SVM, neural networks, and the J48 decision tree, overall classification rates of 100%, 100%, and 66.7% respectively were reported. But the reported recall rates for these three classifiers are very low: 0.333, 0.417, and 0.25 for the decision tree, neural network, and SVM respectively. In his work, as in McCord's work, random Twitter accounts were collected without considering trendy topics. The unique attributes used in his work are reputation, which we used for our study, and the total number of duplicate tweets, which is computed using the Levenshtein distance; the rationale is that spammers use different user names to post the same contents. An observation worth mentioning is that the number of hashtags for the spam class in his study is lower than for the non-spam class. Many spam accounts have at most 2 hashtags on average, while non-spam accounts have anywhere from 0 to 20 (an estimate of the average number of hashtags by a quick visual inspection of Figure 7d in his work is about 7).

This is in stark contrast to what Benevenuto's data shows regarding hashtags: there, spammers post a much higher fraction of hashtags per tweet. This contrast may come from the fact that the data crawled from Twitter by Benevenuto et al. used trendy topics, while Wang's was gathered at random without using any trendy topic. The comparison suggests that spammers use more hashtags to capture legitimate users' attention when hot trendy topics are being discussed, but when few hot trendy topics are being discussed among Twitter users, hashtag usage among spammers and non-spammers is not much different.

In conclusion, our study shows that normalization, transformation, and discretization improve Twitter spam/non-spam classification rates, especially for the spam class, when libSVM and back-propagation multi-layer neural networks are used. Our study also demonstrates that when using a smaller number of attributes selected manually using correlation coefficients, equally high classification rates were obtained. This is an important finding for real-time detection of spam. Further investigation of Twitter characteristics is needed to understand why the spam class TP value is not as high as we hoped.

REFERENCES

[1] D. Terdiman, "Report: Twitter hits half a billion tweets per day," http://news.cnet.com, December 2012. Retrieved: January, 2013.
[2] C. Grier, K. Thomas, V. Paxson, and M. Zhang, "@spam: The underground on 140 characters or less," Proceedings of CCS '10 Conference, pp. 27-37, October 2010.
[3] K. Thomas, C. Grier, V. Paxson, and D. Song, "Suspended Accounts in Retrospect: An analysis of Twitter spam," Proceedings of IMC Conference, pp. 243-256, November 2011.
[4] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?" Proceedings of WWW2010 Conference, pp. 591-600, April 2010.
[5] S. Ghosh, B. Viswanath, F. Kooti, N. Sharma, G. Korlam, F. Benevenuto, N. Ganguly, and K. Gummadi, "Understanding and combating link farming in the Twitter social network," Proceedings of WWW2012 Conference, pp. 61-70, April 2012.
[6] C. Hsu, C. Chuan, and C. Lin, "A Practical guide to support vector classification," http://www.csie.ntu.edu.tw/~cjlin. Retrieved: January, 2013.
[7] C. Grier, K. Thomas, V. Paxson, and M. Zhang, "@spam: The underground on 140 characters or less," Proceedings of CCS '10 Conference, pp. 27-37, October 2010.
[8] http://www.icsi.berkeley.edu/icsi/gazette/2012/05/twitter-spam. Retrieved: December, 2012.
[9] W. S. Sarle, "Neural Network FAQ," 1997, ftp://ftp.sas.com/pub/neural/FAQ.html. Retrieved: December, 2012.
[10] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida, "Detecting Spammers on Twitter," Proceedings of CEAS 2010 Conference, pp. 21-30, July 2010.
[11] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi, "Measuring user influence in Twitter: The million follower fallacy," Proceedings of ICWSM 2010 Conference, pp. 10-17, May 2010.
[12] J. Rabelo, R. Prudencio, and F. Barros, "Using link structure to infer opinions in social networks," Proceedings of SMC Conference, pp. 681-685, October 2012.
[13] A. Wang, "Don't follow me: spam detection in Twitter," Proceedings of the 5th International Conference on Security and Cryptography, pp. 142-151, July 2010.
[14] M. McCord and M. Chuah, "Spam detection on Twitter using traditional classifiers," Proceedings of the International Conference on Autonomic and Trusted Computing (ATC), pp. 175-186, September 2011.
[15] U. C. Berkeley, "Twitter Spam," International Computer Science Institute (ICSI) ICSI Gazette, May 2012.
[16] C. Zhang and V. Paxson, "Detecting and analyzing automated activity on Twitter," Proceedings of PAM Conference, pp. 102-111, March 2011.
[17] http://www.twitter.com. Retrieved: December, 2012.
[18] https://developers.google.com/safe-browsing/. Retrieved: January, 2013.
[19] http://www.bloomberg.com/news/2012-12-07/. Retrieved: December, 2012.
[20] http://www.cs.waikato.ac.nz/ml/weka/. Retrieved: December, 2012.