
Neurocomputing 159 (2015) 27–34

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Detecting spammers on social networks


Xianghan Zheng a,b, Zhipeng Zeng a,b, Zheyi Chen c, Yuanlong Yu a,b,*, Chunming Rong d
a College of Mathematics and Computer Science, Fuzhou University, Fuzhou, China
b Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou, China
c Department of Computer Science, QingHua University, Beijing, China
d Department of Computer Science and Electronic Engineering, University of Stavanger, Stavanger, Norway

Article info

Article history:
Received 10 September 2014
Received in revised form 26 November 2014
Accepted 8 February 2015
Communicated by Huaping Liu
Available online 24 February 2015

Keywords:
Social network
Spammer
Machine learning
Support vector machine

Abstract

Social networks have become a very popular way for internet users to communicate and interact online. Users spend plenty of time on well-known social networks (e.g., Facebook, Twitter, Sina Weibo), reading news, discussing events and posting messages. Unfortunately, this popularity also attracts a significant number of spammers who continuously exhibit malicious behavior (e.g., posting messages containing commercial URLs, following a large number of users), causing great misunderstanding and inconvenience in users' social activities. In this paper, a supervised machine learning based solution is proposed for effective spammer detection. The main procedure of the work is as follows: first, collect a dataset from Sina Weibo including 30,116 users and more than 16 million messages. Then, construct a labeled dataset of users by manually classifying users into spammers and non-spammers. Afterwards, extract a set of features from message content and users' social behavior, and apply them to an SVM (Support Vector Machine) based spammer detection algorithm. The experiment shows that the proposed solution is capable of providing excellent performance, with the true positive rates of spammers and non-spammers reaching 99.1% and 99.9% respectively.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Corresponding author: Yuanlong Yu.
http://dx.doi.org/10.1016/j.neucom.2015.02.047
0925-2312/© 2015 The Authors. Published by Elsevier B.V.

1. Introduction

Within the past few years, online social networks such as Facebook, Twitter and Weibo have become one of the major ways for internet users to keep in touch with their friends [1–3]. According to a Statista report [4], the number of social network users reached 1.61 billion by late 2013, and is estimated to reach around 2.33 billion users globally by the end of 2017.

However, along with great technical and commercial success, social network platforms also provide a large number of opportunities for spammers, who spread malicious messages and behavior. According to Nexgate's report [5], social spam grew by 355% during the first half of 2013, much faster than the growth rate of accounts and messages on most branded social networks.

The impact of social spam is already significant. A social spam message is potentially seen by all the followers and the recipients' friends. Even worse, it might cause misdirection and misunderstanding in public and trending topic discussions. For example, trending topics are frequently abused by spammers to publish comments with URLs, misdirecting all kinds of users to completely unrelated websites. Because most social networks shorten the URLs inside messages, it is difficult to identify the content without visiting the site.

There have been a few proposals from industry and academia discussing possible solutions for spam detection and filtering (described in Section 2). However, they are either ineffective or based on too many conditions (e.g., a large number of content and behavior features). This paper investigates social spammer content and behavior issues, and proposes an effective machine learning model for spammer detection. The paper contains the following four main contributions:

• The paper adopts spammer features to detect spammers and tests the results on Sina Weibo, the biggest social network site in China. Under the Weibo API, a specific dataset crawler is developed to extract any unauthorized users' public messages inside the Weibo platform. This is the first step for data analysis.
• The major novelty of the paper is to study a set of the most important features related to message content and user behavior and apply them to the SVM based classification algorithm for spammer detection. The experiment and comparison work shows that the proposed solution provides higher accuracy.
• Through feature selection algorithms and experimental testing, the ten most important features and the weights of these features are identified. The experiment results further validate the selected
spammer features (manually classified) and also explain why the proposed solution could achieve excellent performance.
• The paper also develops prototype software that is capable of classifying any Weibo user (spammer or non-spammer). With a friendly user interface and efficient, accurate classification results, ordinary users can classify any Weibo user with a simple operation. The software has been published on Sourceforge [6].

It should be mentioned that although the proposed approach is currently tested specifically on the Sina Weibo social network, it is applicable to other existing social sites (e.g., Twitter, Facebook) with few revisions. The rest of the paper is organized as follows. Section 2 presents the background of the Weibo social network and reviews related work on spammer detection. Section 3 introduces how we collect the dataset and extract features. Section 4 describes the spammer detection model, the experiments and the corresponding evaluation. Finally, the conclusion and future work are given in Section 5.

2. Related works

2.1. The Sina Weibo social network

According to [3], the number of Sina Weibo users has reached over 500 million. Statistics show that Weibo has consistently been among the top 25 most frequently visited websites during the past few years [7]. As one of the largest social networks in China, Weibo attracts millions of users online every day.

Weibo is similar to Twitter: users post messages, interact with friends, talk about news and share interesting topics via social network services. It is designed as a microblogging website where users post short messages of no more than 140 characters. The posted messages are delivered to followers immediately. Each user is identified by a unique username and can start following another user in order to receive that user's latest messages on the homepage. The user who is followed can either follow back or not. Fig. 1 describes a simple following graph, in which user A is following user B, and user B and user C are following each other.

Fig. 1. A simple following graph (nodes A, B and C).

There are a number of expressions in Weibo that allow users to interact with others in a better way, including mention, repost and hashtag.

2.1.1. Mention
A Weibo message containing a keyword of the form @username means that the message sender is willing to share something with the user mentioned. As a consequence, Weibo automatically notifies the mentioned user of the message on his/her homepage.
Example: @Bob wanna go for a coffee?

2.1.2. Repost
Repost is another way to send a message. Users often repost other users' messages that they find interesting. The reposted message is also received by the user's followers.

2.1.3. Hashtag
Weibo users can post messages containing hashtags (#) to identify a specific topic. If enough users pick up this topic, it will appear in the list of trending topics.
Example: #happy birthday to Alice# hello Alice.

2.2. Existing research

In the past ten years, email spam detection and filtering mechanisms have been widely implemented. The main work can be summarized into two categories: the content-based model and the identity-based model. In the first model, a series of machine learning approaches [8,9] are implemented for content parsing according to keywords and patterns that indicate potential spam. In the identity-based model, the most commonly used approach is that each user maintains a whitelist and a blacklist of email addresses that should not and should be blocked by the anti-spam mechanism, respectively [10,11]. More recent work leverages social networks for email spam identification according to Bayesian probability [12]. The concept is to use the social relationship between sender and receiver to decide closeness and trust values, and then increase or decrease the Bayesian probability according to these values.

With the rapid development of social networks, social spam has attracted a lot of attention from both industry and academia. In industry, Facebook proposes an EdgeRank algorithm [13] that assigns each post a score generated from a few features (e.g., number of likes, number of comments, number of reposts). The higher the EdgeRank score, the lower the possibility that the poster is a spammer. The disadvantage of this approach is that spammers could join each other's networks and continuously like and comment on each other's posts in order to achieve a high EdgeRank score.

In academia, Yardi et al. [14] study the behavior of a small group of spammers on Twitter, and find that the behavior of spammers differs from that of legitimate users in terms of posting tweets, followers, following friends and so on. Stringhini et al. [15] further investigate spammer features by creating a number of honey-profiles on three large social network sites (Facebook, Twitter and Myspace) and identify five common features (followee-to-follower, URL ratio, message similarity, message sent, friend number, etc.) with potential for spammer detection. However, although both approaches introduce convincing frameworks for spammer detection, they lack a detailed specification of the approach and a prototype evaluation.

Wang [16] proposes a naïve Bayesian based spammer classification algorithm to distinguish suspicious behavior from normal behavior on Twitter, with a precision result (F-measure value) of 89%. Gao et al. [17] adopt a set of novel features for effectively reconstructing spam messages into campaigns rather than examining them individually (with a precision value over 80%). The disadvantage of these two approaches is that they are not precise enough.

Benevenuto et al. [18] collect a large dataset from Twitter and identify 62 features related to tweet content and user social behavior. These characteristics are regarded as attributes in a machine learning process for classifying users as either spammers or non-spammers. Zhu et al. [19] propose a matrix factorization based spam classification model to collaboratively induce a succinct set of latent features (over 1000 items) learned through social relationships for each user on the RenRen site (www.renren.com). However, these two approaches are based on a large number of selected features, which might consume heavy computing capacity and require much time for model training.
For Sina Weibo, the work in [20] investigates three types of spammer behavior (aggressive advertisement, duplicate reposting and aggressive following) and extracts three separate sets of features. Unlike the mainstream approach, in which all features are used by one spammer classifier, this proposal is based on a group of classifiers built on the three generated feature sets that work jointly as a spammer classifier. Combining several spam classifiers is expected to improve detection performance. However, because each separate feature set might not contain enough feature items (8 at most), the computation result might be inaccurate (the precision rate reaches only 82.06%).

Generally, this paper follows a similar concept to previous works, with a few distinguishing points:

1. Our proposed SVM-based classification model considers only 18 feature items and achieves the best performance result, with an F-measure value reaching over 99%. To our knowledge, this is the best result achieved so far (although differently collected datasets with different contents might cause some deviation in the computed results, such a large improvement is still comparable and significant).
2. The importance of each selected feature is studied and verified through Weka [21], a Java-based data mining tool. The combined usage of these features also explains why the proposed approach is capable of achieving a much higher precision rate than other existing works.
3. Instead of a pure experiment on a specific dataset, prototype software is developed and opened for public usage, helping any user to identify spammers in the Sina Weibo network environment. The accuracy of the prototype further demonstrates the efficiency of the proposed solution.

Fig. 2. Dataset and feature collection procedure. (Diagram: data source of 100 non-spammers and 50 spammers; web crawler with followee crawler and message crawler; user list of non-spammers and spammers; feature vector converter, which obtains the number of followees, number of followers, number of created days, etc. from the username via the Weibo API, and the number of reposts, number of comments, number of likes, etc. from the message IDs via the message crawler and the Weibo API; output feature vectors for non-spammers and spammers.)

3. Dataset collection and analysis

3.1. Dataset and feature collection

As on most social media platforms, the public Weibo developer API (specifically, the user_timeline API) only allows downloading the recent messages of authorized users. This is an obstacle to the process of data collection. To solve this problem, a specific data crawler and feature collection mechanism are developed, as described in the following steps (see Fig. 2):

1. 100 normal users (celebrities, companies and government accounts that post/repost/comment frequently) and 50 spammer users (who exhibit malicious behavior frequently) are manually selected as the data source.
2. Two types of data crawlers are developed, for ordinary users and spammers respectively. The ordinary user crawler extracts a normal user's list of followees, which are also considered normal users because most normal users are unlikely to follow spammers in reality (also validated through the analysis in Section 3.2.2); the spammer crawler extracts the list of spammers behind spammers' specific reposted messages. Finally, 30,116 Weibo users are extracted.
3. For each user, we crawl the corresponding information inside the 500 most recent messages. Step 1: the basic user information (e.g., the number of followees, number of followers, created days) is obtained via the Weibo API. Step 2: through the username, a set of message IDs is crawled, through which the message attributes (e.g., the number of reposts, the number of comments, the number of likes) are obtained with the help of the Weibo API. Finally, more than 16 million messages are crawled from 25 February 2014 to 1 May 2014.
4. For each user, a feature vector is constructed according to the crawled user and message information described above.

After that, our work labels the collected users as spammers or non-spammers. We develop a mechanism to help three volunteers analyze each collected user manually and independently based on the recent messages. Majority voting is used to decide which class a user belongs to if the user is labeled with different classes. However, the user labeling process depends on human judgment, and might lead to inevitable human error. Therefore, we discard the users whose class is difficult to decide. In total, 11,488 spammers and 17,646 non-spammers are labeled. Finally, 80% of the spammers and non-spammers from the labeled dataset are randomly selected as the training data, leaving the rest as testing data.

3.2. Feature analysis

Unlike normal Weibo users, spammers usually have commercial intent such as spreading advertisements. In this section, we analyze the difference between spammers and non-spammers from both the content and the behavior point of view according to the dataset collected.

3.2.1. Content-based feature
From the dataset, we randomly select 300 spam and 300 non-spam messages, each of which is assigned a random integer identity value ranging from 1 to 300. The maximum number of reposts, comments and likes is capped at 100.

From a statistical point of view, the three most obvious and important features of spam messages can be observed. Fig. 3(a) shows the repost number distribution, in which more than 90% of spam messages have a repost count lower than 10. Similarly, the number of
comments and likes is also quite small, as shown in Fig. 3(b) and (c). This may be explained by the fact that most normal users pay little attention to spam messages.

Fig. 3. Distribution of content-based feature. (Six panels comparing spam and non-spam messages over the 300 sampled message identities: (a) the number of reposts, (b) the number of comments, (c) the number of likes, (d) the number of mentions, (e) the number of URLs, (f) the number of hashtags.)

Fig. 3(d) indicates the number of mentions in each message. As expected, most spam messages do not contain any mention because most spammers only aim at advertising, and spend little time interacting with other users. Fig. 3(e) indicates that most spam messages contain at least one URL linking to advertisement pages. The number of hashtags analyzed in Fig. 3(f) shows that spammers sometimes post messages with hashtags so as to be retrieved by search engines.

3.2.2. User-based feature
In the following, the cumulative distribution function (CDF) is used to study the features of spammers, as shown in Fig. 4.

Fig. 4(a) analyzes the number of followees for each user. Normally, spammers try to follow a multitude of legitimate users so as to be followed back. However, this does not work most of the time, as shown in Fig. 4(b). This type of behavior makes the fraction of followees per followers very large in comparison with non-spammers, as illustrated in Fig. 4(c).

Analysis of the number of created days (see Fig. 4(d)) indicates that spammers have to create new accounts frequently. This might be because of anti-spam mechanisms that eventually detect and automatically clean spammer accounts.

After that, the fraction of messages per day is illustrated in Fig. 4(e). Spammer accounts usually act as a "robot" that posts messages automatically. After calculating the average number of messages per day for both spammers and non-spammers, it is found that the number of messages posted by spammers per day is roughly four times that of non-spammers (with mean values of 15.19 and 3.62 for spammers and non-spammers respectively).

Finally, Fig. 4(f) analyzes the average number of URLs in each user's recent messages. It shows that most spammers have at least one URL in each message. However, the result indicates that some normal users also include URLs in many of their messages. After manual checking, the reason is that some companies create official accounts to promote their products with URLs linking to specific websites.

4. Spammer detection

Based on the dataset and feature collection described in the previous section, a supervised machine learning model is introduced for spammer identification. Supervised learning [22] is the machine learning task of inferring a function from labeled training data that consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). Through analysis of the training data, a supervised learning solution produces a classification model for predicting new examples.

4.1. SVM based spammer detection model

Fig. 5 illustrates the basic concept of the proposed spammer detection model. In this solution, training data is converted to a series of feature vectors that consist of a set of values for attributes. These vectors form the input of the supervised machine learning algorithm. After training, a classification model is applied to decide whether a specific user is a normal user or a spammer.

Because spammers and non-spammers have different social behavior, it is possible to distinguish abnormal behavior from legitimate behavior by analyzing content features and user behavior. In this paper, we use the following 18 features: the number of followees, the number of followers, the number of messages, the number of friends following each other, the number of favorites, the number of created days, the fraction of followees per followers, the fraction of original messages, the number of messages per day, the average number of reposts, the average number of comments, the average number of likes, the average number of URLs, the average number of pictures, the average number of hashtags, the average number of users mentioned, the fraction of messages containing URLs, and the fraction of messages containing pictures.

Fig. 4. Cumulative distribution function of user-based feature. (Six panels with CDF curves for spammers and non-spammers: (a) the number of followees, (b) the number of followers, (c) the fraction of followees per followers, (d) the number of created days, (e) the fraction of messages per day, (f) the number of average URLs.)

Fig. 5. Overview of spammer detection model. (Diagram components: social network, web crawler, feature extraction, feature vectors, classifier (support vector machine), detection model, results.)

4.2. SVM classifier

The spammer detection solution is based on a non-linear support vector machine (SVM) classifier [23] with the Radial Basis Function (RBF) kernel. It uses the implementation provided by libSVM [24], an integrated software package for support vector classification, regression and distribution estimation.

The SVM with the RBF kernel function has two training parameters: C controls overfitting of the model, and gamma controls the degree of nonlinearity. In the experiment, we apply a parameter selection tool provided by libSVM to select the parameters automatically with 5-fold cross-validation. This tool uses a grid search policy to find the highest classification accuracy over different pairs of C and gamma values. Finally, the most suitable pair, C = 128 and gamma = 0.03125, is selected for our specific training dataset.

4.3. Evaluation metrics

In the evaluation, we consider the confusion matrix illustrated in Table 1, where a represents the number of spammers correctly classified, b refers to the number of spammers misclassified as non-spammers, c expresses the number of non-spammers misclassified as spammers, and d is the number of non-spammers correctly classified. According to the confusion matrix, a set of metrics commonly evaluated in the machine learning field are introduced: precision, recall and F-measure.

Table 1
Example of confusion matrix.

                        Predicted
                        Spammer    Non-spammer
True   Spammer          a          b
       Non-spammer      c          d

Precision (P) is the ratio of the number of instances correctly classified to the total number of instances predicted as that class, and is expressed by the formula P = a/(a + c). Recall (R) is the ratio of the number of instances correctly classified to the total number of instances that actually belong to that class, and is expressed by the formula R = a/(a + b). F-measure is the harmonic mean of precision and recall, and is defined as F = 2PR/(P + R). For evaluating classifier performance, the F-measure is more informative because it summarizes both precision and recall in a single value.

4.4. Ratio of spammer to non-spammer

First, we run the test with the complete training dataset and obtain F-measure values of 91.6% and 93.2% for spammers and non-spammers respectively. This might not be the optimal result. In order to achieve higher spammer detection accuracy, the ratio of spammers to non-spammers in the training dataset is changed as follows: 10:1, 8:1, 6:1, 4:1, 2:1, 1:2, 1:4, 1:6, 1:8 and 1:10, with the corresponding classification accuracy illustrated in Fig. 6. It shows that the F-measure values of both spammers and non-spammers grow simultaneously when the ratio of spammers decreases, and reach the highest accuracy of about 99.5% and 99.9% when the ratio is set to 1:2. After that, the accuracy drops quickly while the ratio of non-spammers rises. On the other hand, it is obvious that an
32 X. Zheng et al. / Neurocomputing 159 (2015) 27–34

1 Table 3
Classification evaluation.

Precision Recall F-measure


0.995
Spammer 0.999 0.991 0.995
Non-spammer 0.995 0.999 0.997
F-Measure

0.99

Table 4
0.985
Comparison between SVM and other classifiers.

Classifier Precision Recall F-measure


0.98
Spammer Spammer Non- Spammer Non- Spammer Non-
spammer spammer spammer
Non-Spammer
0.975 SVM 0.999 0.995 0.991 0.999 0.995 0.997
10:1 8:1 6:1 4:1 2:1 1:2 1:4 1:6 1:8 1:10 Decision 0.942 0.95 0.953 0.958 0.947 0.954
Ratio of Spammer to Non-Spammer Tree
Naïve 0.939 0.96 0.922 0.966 0.93 0.963
Fig. 6. Classification accuracy with different ratios of spammer to non-spammer in Bayes
training dataset. Bayes Ne-
twork 0.946 0.915 0.907 0.956 0.926 0.935

Table 2 ranking of importance of these selected attributes. Specifically, we


Confusion matrix. evaluate the relative power of each selected attribute and distin-
guish one user class from the others by applying these two methods
Predicted
respectively. The result listed in Table 5 indicates that the most 10
Spammer Non-spammer
important attributes taken from the two methods are quite similar.
Additionally, we notice that the top two most important attr-
True Spammer 99.1% 0.9% ibutes are the number of created days and the average number of
Non-spammer 0.1% 99.9% comments, which are also easy to be identified from the normal
user point of view. These two attributes also highlight the behavior
feature that spammers usually create new accounts to avoid being
appropriate ratio of spammer to non-spammer is important since a detected, and receive little feedback from legitimate users. There-
large quantitative difference (i.e. 10:1 or 1:10) would result in lower fore, for normal users, ignore Weibo messages from very new acc-
accuracy. This is because that a large ratio of spammer indicates a ount with little comment could be a good strategy to avoid spam.
large probability to misclassify normal user to spammer, and vice Furthermore, we verify the importance of the top 10 attributes
versa. Therefore, in the following experiment, the ratio of spammer via dividing 18 attributes into 10 subsets (each of which represents
to non-spammer is set to be 1:2. all attributes minus i-th attribute). We calculate F-measure value of
both spammer and non-spammer inside each subset according to
4.5. Classification result and comparison approaches described in Section 4. In Fig. 7, the result indicates
that (1) the accuracy result indeed decreases slightly when any
Table 2 illustrates confusion matrix obtained by SVM classifier. attribute is removed; (2) generally (ignore the column of All-10),
It shows that our proposed solution is quite efficient, with 99.1% the more importance of the specific attribute, the less accuracy
spammers and 99.9% non-spammers correctly classified, leaving result will be (See All-1 column for example); (3) the miss of a
only a small fraction of spammers and non-spammers misclassi- single attribute does not influence much on result (with the worst
fied. Table 3 describes the value of evaluation metrics, in which accuracy value reaching 98.6%). This could be explained that most
precision, recall and F-measure are calculated for spammer and spammers are related to multi-feature and could be clearly
non-spammer respectively. classified even one important feature is missing.
Besides, we also compare the proposed approach with other
classifiers: Decision Tree, Naïve Bayes and Bayes Network, with
implementation provided by Weka. For each classifier, the same 4.7. Prototype implementation
evaluation metrics (precision, recall and F-measure) are calculated
for both spammers and non-spammers, with the result illustrated Instead of relying only on the experiment of specific training
in Table 4. It is obvious that SVM classifier is capable to achieve and testing dataset, we further develop a prototype software for
best accuracy. This indicates that the hyperplane calculated by the purpose of distinguishing Weibo users in real environment.
SVM could separate training data into two parts with a maximum The work is described in the following steps:
margin. Besides, it is shown that the other three classifiers also
achieve good accuracy. This is because that the suitable feature 1. Based on developed data crawler, the prototype software
(including content and user behavior) selected are capable to contains an user interface that accepts an username or trending
distinguish spammers from non-spammers effectively. topic as input.
2. We randomly select a trending topic, called Jeremy Lin joined
4.6. Importance of the attributes and user suggestions the Lakers (initiated in July 25, 2014), which has attracted 2562
participating users by Aug 25, 2014.
After that, two well-known feature selection methods (informa- 3. Each participating user in this topic is analyzed according to
tion gain and Chi Squared available on Weka) are applied to find the content and behavior feature, and classified as spammer or
Table 5
Attribute ranking list (top 10).

Rank  Information gain                        Chi squared
1     Number of created days                  Number of created days
2     Average number of comments              Average number of comments
3     Fraction of followees per followers     Average number of URLs
4     Average number of URLs                  Fraction of followees per followers
5     Fraction of messages containing URLs    Average number of user mentioned
6     Average number of user mentioned        Fraction of messages containing URLs
7     Fraction of original messages           Average number of pictures
8     Average number of pictures              Fraction of original messages
9     Number of messages per day              Number of messages per day
10    Number of followees                     Number of followees
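
As a rough illustration of how rankings like those in Table 5 can be produced, the following Python sketch scores discretized features by information gain and by the chi-squared statistic, then sorts them. It is not Weka's implementation: the feature names, bins and labels are hypothetical toy data, and real numeric features would first need to be discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def chi_squared(values, labels):
    """Pearson chi-squared statistic of the feature/class contingency table."""
    total = len(labels)
    observed = Counter(zip(values, labels))
    value_counts = Counter(values)
    label_counts = Counter(labels)
    stat = 0.0
    for value, n_value in value_counts.items():
        for label, n_label in label_counts.items():
            expected = n_value * n_label / total
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def rank_features(columns, labels, score):
    """Return feature names sorted from highest to lowest score."""
    return sorted(columns, key=lambda name: score(columns[name], labels),
                  reverse=True)


# Hypothetical toy data: six users, two already-binned features.
labels = ["spammer", "spammer", "spammer",
          "non-spammer", "non-spammer", "non-spammer"]
columns = {
    "account_age_bin": ["new", "new", "new", "old", "old", "old"],
    "url_rate_bin": ["high", "high", "low", "low", "low", "high"],
}

if __name__ == "__main__":
    for score in (information_gain, chi_squared):
        print(score.__name__, rank_features(columns, labels, score))
```

Both scores measure how strongly a feature's values depend on the class label, which is why the two columns of Table 5 largely agree and differ only in the order of neighboring attributes.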

non-spammer based on the proposed classification model. Finally, 14 users are labeled as spammers.
4. We manually analyze the recent messages of these 14 users and find that 13 are spammer accounts, with only one user misclassified (as illustrated in Table 6). This test result further demonstrates the feasibility, efficiency and reliability of our proposed solution. Note that the developed software is open for public use on the Sourceforge site.

Fig. 7. Classification results with different feature subsets (F-measure for the Spammer and Non-Spammer classes over the feature subsets All, All-1, ..., All-10).

Table 6
Spammer detection in trending topic.

Trending topic          Jeremy Lin joined the Lakers
Total users             2562
Detected spammers       14
Real spammer accounts   13
False alarms            1
Accuracy                92.9%

5. Conclusion and future works

In this paper, we have introduced a machine learning based spammer detection solution for social networks. The solution considers the user's content and behavior features, and applies them to an SVM based algorithm for spammer classification. Through analysis, experiment, evaluation and prototype implementation, we have shown that the proposed solution is feasible and is capable of reaching much better classification results than the other existing approaches.

However, two open issues still need to be addressed. On one hand, although the proposed approach achieves precise classification results, model training takes over one hour. Therefore, one open issue is online spammer detection that combines real-time data and feature collection with lower training time and high accuracy. Extreme Learning Machine (ELM) [25,26], a new learning scheme for feedforward neural networks that provides much lower training time and similar accuracy, could be one possible solution.

On the other hand, the features extracted in our proposed solution (and in existing approaches) are based on statistical analysis and manual selection. However, in the era of big data with huge data volume and convenient access [27], the feature extraction mechanism in our solution might be poorly adaptive and costly. Therefore, how to bring artificial intelligence technology (e.g. deep learning algorithms [28–30]) into automatic feature learning and extraction has become an important question.

Acknowledgments

The authors would like to acknowledge the support of the Technology Innovation Platform Project of Fujian Province under Grant no. 2009J1007, the Program of Fujian Key Project under Grant no. 2013H6011, and the Natural Science Foundation of Fujian Province under Grant no. 2013J01228.

The authors would like to thank Prof. Yuanlong Yu from Fuzhou University for his invaluable expert advice, which helped bring this paper to completion.

References

[1] Facebook, 〈http://www.facebook.com/〉.
[2] Welcome to Twitter, 〈http://twitter.com/〉.
[3] Weibo – SINA, 〈http://english.sina.com/weibo/〉.
[4] Statista, 〈http://www.statista.com/〉.
[5] Nexgate. 2013 State of Social Media Spam, 〈http://nexgate.com/wp-content/uploads/2013/09/Nexgate-2013-State-of-Social-Media-Spam-Research-Report.pdf〉, 2013.
[6] Weibocrawler, 〈http://weibocrawler.sourceforge.net/〉.
[7] Alexa Top 500 Global Sites, 〈http://www.alexa.com/topsites〉.
[8] M. Uemura, T. Tabata, Design and evaluation of a Bayesian-filter-based image spam filtering method, in: Proceedings of the International Conference on Information Security and Assurance (ISA), IEEE, 2008, pp. 46–51.
[9] B. Zhou, Y. Yao, J. Luo, Cost-sensitive three-way email spam filtering, J. Intell. Inf. Syst. 42 (1) (2013) 19–45.
[10] J. Jung, E. Sit, An empirical study of spam traffic and the use of DNS black lists, in: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, ACM, 2004, pp. 370–375.
[11] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, N. Feamster, Building a dynamic reputation system for DNS, in: Proceedings of the Third USENIX Workshop on Large-scale Exploits and Emergent Threats (LEET), 2010.
[12] Trust evaluation based content filtering in social interactive data, in: Proceedings of the 2013 International Conference on Cloud Computing and Big Data (CloudCom-Asia), IEEE, 2013, pp. 538–542.
[13] J. Kincaird, Edgerank: the secret sauce that makes Facebook's news feed tick, TechCrunch, 2010, 〈http://techcrunch.com/2010/04/22/facebook-edgeran〉.
[14] S. Yardi, D. Romero, G. Schoenebeck, Detecting spam in a Twitter network, First Monday 15 (1) (2009).
[15] G. Stringhini, C. Kruegel, G. Vigna, Detecting spammers on social networks, in: Proceedings of the 26th Annual Computer Security Applications Conference, ACM, 2010, pp. 1–9.
[16] A.H. Wang, Don't follow me: spam detection in Twitter, in: Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), IEEE, 2010, pp. 1–10.
[17] H. Gao, Y. Chen, K. Lee, D. Palsetia, A. Choudhary, Towards online spam filtering in social networks, in: Proceedings of the Symposium on Network and Distributed System Security (NDSS), 2012.
[18] F. Benevenuto, G. Magno, T. Rodrigues, V. Almeida, Detecting spammers on Twitter, in: Proceedings of the Seventh Annual Collaboration, Electronic messaging, Anti-abuse and Spam Conference (CEAS), 2010.
[19] Y. Zhu, X. Wang, E. Zhong, N.N. Liu, H. Li, Q. Yang, Discovering spammers in social networks, in: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.
[20] X. Hu, J. Tang, Y. Zhang, H. Liu, Social spammer detection in microblogging, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, ACM, 2013, pp. 2633–2639.
[21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[22] F. Wang, C. Zhang, Robust self-tuning semi-supervised learning, Neurocomputing 70 (16) (2007) 2931–2939.
[23] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[24] LIBSVM – A Library for Support Vector Machines, 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm/〉.
[25] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[26] G.-B. Huang, H. Zhou, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst., Man, Cybern. 42 (2) (2012) 513–529.
[27] X. Zheng, N. Chen, Z. Chen, C. Rong, G. Chen, W. Guo, Mobile cloud based framework for remote-resident multimedia discovery and access, J. Internet Technol. 15 (6) (2014) 1043–1050.
[28] G.E. Hinton, Learning multiple layers of representation, Trends Cogn. Sci. 11 (10) (2007) 428–434.
[29] Y. Bengio, Scaling up deep learning, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, p. 1966.
[30] S. Zhou, Q. Chen, X. Wang, Active deep learning method for semi-supervised sentiment classification, Neurocomputing 120 (2013) 536–546.

Xianghan Zheng is an associate professor in the College of Mathematics and Computer Sciences, Fuzhou University, China. He received his MSc of Distributed System (2007) and Ph.D. of Information Communication Technology (2011) from University of Agder, Norway. His current research interests include New Generation Network with special focus on Cloud Computing Services and Applications, Big Data Processing and Security.

Zhipeng Zeng (zhipeng.zeng@fzu.edu.cn) is currently working toward his M.S. degree in the College of Mathematics and Computer Science at Fuzhou University. His current research interests mainly focus on Big Data Analysis, especially on Social Network Analysis, Machine Learning, etc.

Zheyi Chen (zheyi.chen@yahoo.cn) is currently working toward his M.S. degree in the College of Information Science and Technology at QingHua University. His current research interests mainly focus on New Generation Network, especially on Cloud Computing and Applications.

Yuanlong Yu (yu.yuanlong@fzu.edu.cn) is currently a professor in the College of Mathematics and Computer Sciences, Fuzhou University, China. He received the B.Eng. degree in automatic control from the Beijing Institute of Technology, Beijing, China (2000), the M.Eng. degree in computer applied technology from Tsinghua University, Beijing, China (2003), and the Ph.D. degree in electrical engineering from the Memorial University of Newfoundland, St. John's, NL, Canada (2010). His current research interests mainly focus on machine learning, computer vision and cognitive robotics.

Chunming Rong (chunming.rong@uis.no) is a Professor at the University of Stavanger (UiS) in Norway and head of its Center for IP-based Service Innovation (CIPSI). The CIPSI has the mission to promote cross-fertilization between several research fields to facilitate design and delivery of large-scale and complex IP-based services required by many application areas. Chunming's research interests include cloud computing, big data analysis, security and privacy.