Neurocomputing: Xianghan Zheng, Zhipeng Zeng, Zheyi Chen, Yuanlong Yu, Chunming Rong
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history:
Received 10 September 2014
Received in revised form 26 November 2014
Accepted 8 February 2015
Communicated by Huaping Liu
Available online 24 February 2015

Keywords:
Social network
Spammer
Machine learning
Support vector machine

Abstract

Social networks have become a very popular way for internet users to communicate and interact online. Users spend plenty of time on popular social networks (e.g., Facebook, Twitter, Sina Weibo), reading news, discussing events and posting messages. Unfortunately, this popularity also attracts a significant number of spammers who continuously exhibit malicious behavior (e.g., posting messages containing commercial URLs, following a large number of users), causing great misunderstanding and inconvenience in users' social activities. In this paper, a supervised machine learning based solution is proposed for effective spammer detection. The main procedure of the work is as follows. First, collect a dataset from Sina Weibo including 30,116 users and more than 16 million messages. Then, construct a labeled dataset of users and manually classify users into spammers and non-spammers. Afterwards, extract a set of features from message content and users' social behavior, and apply them to an SVM (Support Vector Machine) based spammer detection algorithm. The experiment shows that the proposed solution is capable of providing excellent performance, with the true positive rates of spammers and non-spammers reaching 99.1% and 99.9% respectively.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
http://dx.doi.org/10.1016/j.neucom.2015.02.047
0925-2312/© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
28 X. Zheng et al. / Neurocomputing 159 (2015) 27–34
spammer features (manually classified) and also explain why the proposed solution could achieve excellent performance.
The paper also develops a prototype software that is capable of distinguishing any Weibo user (spammer or non-spammer). With a friendly user interface and efficient and accurate classification results, ordinary users are able to classify any Weibo user with simple operations. The software has been published on SourceForge [6].

It should be mentioned that although the proposed approach is currently tested specifically on the Sina Weibo social network, it is applicable to other existing social sites (e.g., Twitter, Facebook) with few revisions. The rest of the paper is organized as follows. Section 2 presents the background of the Weibo social network and reviews related work on spammer detection. Section 3 introduces how we collect the dataset and extract features. Section 4 describes the spammer detection model, experiments and corresponding evaluation. Finally, the conclusion and future work are given in Section 5.

2. Related works

2.1. The Sina Weibo social network

According to [3], the number of Sina Weibo users has reached over 500 million. Statistics show that Weibo has consistently been among the top 25 most frequently visited websites during the past few years [7]. As one of the largest social networks in China, Weibo attracts millions of users online every day.
The Weibo application is similar to Twitter: users post messages, interact with friends, talk about news and share interesting topics via social network services. It is designed as a microblogging website where users post short messages of no more than 140 characters. Posted messages are delivered to followers immediately. Each user is identified by a unique username and can start following another user in order to receive friends' latest messages on the homepage. The user who is followed can either accept the request and follow back, or simply decline. Fig. 1 describes a simple following graph, in which user A is following user B, and user B and user C are following each other.

Fig. 1. A simple following graph.

There are a number of expressions in Weibo that allow users to interact with others in a better way, including mention, repost and hashtag.

2.1.1. Mention
A Weibo message can contain keywords of the form @username, meaning that the message sender wants to share something with the mentioned user. As a consequence, Weibo automatically notifies the mentioned user of the message on his/her homepage.
Example: @Bob wanna go for a coffee?

2.1.2. Repost
Reposting is another way to send a message. Users often repost other users' messages that they find interesting. The reposted message is also received by the reposting user's followers.

2.1.3. Hashtag
Weibo users can post messages containing hashtags () to identify a specific topic. If enough users pick up this topic, it will appear in the list of trending topics.
Example: happy birthday to Alice hello Alice.

2.2. Existing research

In the past ten years, email spam detection and filtering mechanisms have been widely implemented. The main work can be summarized into two categories: the content-based model and the identity-based model. In the first model, a series of machine learning approaches [8,9] are implemented for content parsing according to keywords and patterns that indicate potential spam. In the identity-based model, the most commonly used approach is that each user maintains a whitelist and a blacklist of email addresses that should not and should be blocked, respectively, by the anti-spam mechanism [10,11]. More recent work leverages social networks for email spam identification based on Bayesian probability [12]. The concept is to use the social relationship between sender and receiver to decide closeness and trust values, and then increase or decrease the Bayesian probability according to these values.
With the rapid development of social networks, social spam has attracted a lot of attention from both industry and academia. In industry, Facebook proposes an EdgeRank algorithm [13] that assigns each post a score generated from a few features (e.g., number of likes, number of comments, number of reposts). The higher the EdgeRank score, the lower the likelihood that the poster is a spammer. The disadvantage of this approach is that spammers could join each other's networks and continuously like and comment on each other's posts in order to achieve a high EdgeRank score.
In academia, Yardi et al. [14] study the behavior of a small group of spammers on Twitter, and find that the behavior of spammers differs from that of legitimate users in terms of posting tweets, followers, following friends and so on. Stringhini et al. [15] further investigate spammer features by creating a number of honey-profiles on three large social network sites (Facebook, Twitter and Myspace) and identify five common features (followee-to-follower, URL ratio, message similarity, message sent, friend number) with potential for spammer detection. However, although both approaches introduce a convincing framework for spammer detection, they lack a detailed specification of the approach and a prototype evaluation.
Wang [16] proposes a naïve Bayesian based spammer classification algorithm to distinguish suspicious behavior from normal behavior on Twitter, with a precision result (F-measure value) of 89%. Gao et al. [17] adopt a set of novel features for effectively reconstructing spam messages into campaigns rather than examining them individually (with a precision value over 80%). The disadvantage of these two approaches is that they are not precise enough.
Benevenuto et al. [18] collect a large dataset from Twitter and identify 62 features related to tweet content and user social behavior. These characteristics are regarded as attributes in a machine learning process for classifying users as either spammers or non-spammers. Zhu et al. [19] propose a matrix factorization based spam classification model to collaboratively induce a succinct set of latent features (over 1000 items) learned through social relationships for each user on the RenRen site (www.renren.com). However, these two approaches are based on a large number of selected features, which might consume heavy computing resources and require a long model training time.
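The approaches above are compared by precision and F-measure. As a reminder of how these quantities relate, the following minimal sketch computes precision, recall and F-measure from confusion counts for the positive (spammer) class. The function name and the counts are hypothetical and are not taken from any of the cited papers.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for the positive class.

    tp: spammers correctly flagged
    fp: non-spammers wrongly flagged as spammers
    fn: spammers that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
```

Because the F-measure is a harmonic mean, a classifier cannot reach a high value by excelling at only one of precision or recall.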
1. Our proposed SVM-based classification model considers only 18 feature items and achieves the best performance result, with the F-measure value reaching over 99%. To our knowledge, this is the best result reported so far (although datasets with different contents might cause some deviation in the computed results, such a large improvement is still comparable and significant).
2. The importance of each selected feature is studied and verified through Weka [21], a Java-based data mining software tool. The combined use of these features also explains why the proposed approach is capable of achieving a much higher precision rate than other existing works.
3. Instead of a pure experiment on a specific dataset, a prototype software is specifically developed and opened for public use, helping any user to identify spammers in the Sina Weibo network environment. The accuracy of the prototype further demonstrates the effectiveness of the proposed solution.

3.1. Dataset and feature collection

Similar to most social media platforms, the public Weibo developer API (specifically, the user_timeline API) only supports downloading the recent messages of authorized users. This is an obstacle to data collection. To solve this problem, a specific data crawler and feature collection mechanism are developed, as described in the following steps (see Fig. 2):

1. 100 normal users (celebrity, company, and government accounts that post/repost/comment frequently) and 50 spammer users (who exhibit malicious behavior frequently) are manually selected as data sources.
2. Two types of data crawlers are developed, for ordinary users and spammers respectively. The ordinary user crawler extracts a normal user's list of followees, who are also considered normal users because most normal users are unlikely to follow spammers in reality (also validated through the analysis in Section 3.2.2); the spammer crawler extracts the list of spammers behind spammers' specific reposted messages. In total, 30,116 Weibo users are extracted.
3. For each user, we crawl the corresponding information from the 500 most recent messages. Step 1: the basic user information (e.g., the number of followees, number of followers, created days) is obtained via the Weibo API. Step 2: through the username, a set of message IDs is crawled, through which the message attributes (e.g., the number of reposts, the number of comments, the number of likes) are obtained with the help of the Weibo API. In total, more than 16 million messages are crawled from 25 February 2014 to 1 May 2014.
4. For each user, a feature vector is constructed from the crawled user and message information described above.

[Fig. 2: data collection flow. The username is passed to the Weibo API to obtain user attributes (number of followees, number of followers, number of created days, …). The message crawler obtains message IDs, which are passed to the Weibo API to obtain message attributes (number of reposts, number of comments, number of likes, …). A feature vector converter turns these attributes into feature vectors for non-spammers and spammers.]

After that, our work labels collected users as spammers or non-spammers. We develop a mechanism to help three volunteers analyze each collected user manually and independently based on the recent messages. Majority voting decides which class a user belongs to if the user is assigned to different classes. However, the user labeling process depends on human judgment and might lead to inevitable human error. Therefore, we discard the users whose class is difficult to decide. In total, 11,488 spammers and 17,646 non-spammers are labeled. Finally, 80% of the spammers and non-spammers in the labeled dataset are randomly selected as the training data, leaving the rest as testing data.

3.2. Feature analysis

Unlike normal Weibo users, spammers usually have a commercial intent such as spreading advertisements. In this section, we analyze the differences between spammers and non-spammers from both the content and the behavior point of view, according to the dataset collected.

3.2.1. Content-based feature
From the dataset, we randomly select 300 spam and 300 non-spam messages, each of which is assigned a random integer identity value ranging from 1 to 300. In addition, the maximum number of reposts, comments and likes is capped at 100.
From a statistical point of view, the three most obvious and important features of spam messages can be observed. Fig. 3(a) shows the repost number distribution, in which more than 90% of spam messages have a repost count lower than 10. Similarly, the number of
comments and likes is also quite small, as shown in Fig. 3(b) and (c). This may be explained by the fact that most normal users pay little attention to spam messages.
Fig. 3(d) indicates the number of mentions in each message. As expected, most spam messages do not contain any mention because most spammers only aim at advertising and spend little time interacting with other users. Fig. 3(e) indicates that most spam messages contain at least one URL linking to advertisement pages. The number of hashtags analyzed in Fig. 3(f) shows that spammers sometimes include hashtags in messages so as to be retrieved by search engines.

Fig. 3. Distribution of content-based feature for spam and non-spam messages: (a) the number of reposts, (b) the number of comments, (c) the number of likes, (d) the number of mentions, (e) the number of URLs, (f) the number of hashtags.

3.2.2. User-based feature
In the following, the cumulative distribution function (CDF) is introduced to study the features of spammers, as shown in Fig. 4.
Fig. 4(a) analyzes the number of followees for each user. Normally, spammers try to follow a multitude of legitimate users so as to be followed back. However, this does not work most of the time, as shown in Fig. 4(b). This type of behavior makes the fraction of followees per follower very large in comparison with non-spammers, as illustrated in Fig. 4(c).
Analysis of the number of created days (see Fig. 4(d)) indicates that spammers have to create new accounts frequently. This might be because anti-spam mechanisms eventually detect and automatically clean spammer accounts.
After that, the fraction of messages per day is illustrated in Fig. 4(e). Spammer accounts usually act as a "robot" to post messages automatically. After calculating the average number of messages per day for both spammers and non-spammers, it is found that the number of messages posted by spammers per day is about 4.2 times that of non-spammers (with mean values of 15.19 and 3.62 for spammers and non-spammers respectively).
Finally, Fig. 4(f) analyzes the average number of URLs in each user's recent messages. It shows that most spammers have at least one URL in each message. However, the result indicates that some normal users also include URLs in many of their messages. Manual checking shows that the reason is that some companies create official accounts to promote their products with URLs linking to specific websites.

4. Spammer detection

Based on the dataset and feature collection described in the previous section, a supervised machine learning model is introduced for spammer identification. Supervised learning [22] is the machine learning task of inferring a function from labeled training data that consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). Through analysis of the training data, a supervised learning solution produces a classification model for predicting new examples.

4.1. SVM based spammer detection model

Fig. 5 illustrates the basic concept of the proposed spammer detection model. In this solution, training data is converted to a series of feature vectors that consist of a set of attribute values. These vectors form the input of the supervised machine learning algorithm. After training, a classification model is applied to determine whether a specific user is a normal user or a spammer.
Because spammers and non-spammers have different social behavior, analyzing content features and user behavior makes it possible to distinguish abnormal behavior from legitimate behavior. In this paper, we use the following 18 features: the number of followees, the number of followers, the number of messages, the number of friends following each other, the number of favorites, the number of created days, the fraction of followees per followers, the fraction of original messages, the number of messages per day, the average number of reposts, the average number of comments, the average number of likes, the average number of URLs, the average number of pictures, the average number of hashtags, the average number of users mentioned, the fraction of messages containing URLs, and the fraction of messages containing pictures.
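To make the 18-feature list concrete, the following minimal sketch assembles one feature vector per user from crawled profile and message attributes, in the order listed above. The class names, field names and the exact formulas for the derived features (for example, dividing the total message count by the account age to obtain messages per day, and guarding against division by zero) are assumptions made for illustration; the paper does not specify them.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Message:
    # Hypothetical per-message attributes obtained by the crawler.
    reposts: int
    comments: int
    likes: int
    urls: int
    pictures: int
    hashtags: int
    mentions: int
    is_original: bool


@dataclass
class UserProfile:
    # Hypothetical per-user attributes obtained from the API.
    followees: int
    followers: int
    messages_total: int
    mutual_friends: int
    favorites: int
    created_days: int


def _mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def feature_vector(user: UserProfile, recent: List[Message]) -> List[float]:
    """Build the 18-element vector in the order given in the text."""
    return [
        float(user.followees),
        float(user.followers),
        float(user.messages_total),
        float(user.mutual_friends),
        float(user.favorites),
        float(user.created_days),
        # Assumed guards: avoid division by zero for new or unfollowed accounts.
        user.followees / max(user.followers, 1),
        _mean(1.0 if m.is_original else 0.0 for m in recent),
        user.messages_total / max(user.created_days, 1),
        _mean(m.reposts for m in recent),
        _mean(m.comments for m in recent),
        _mean(m.likes for m in recent),
        _mean(m.urls for m in recent),
        _mean(m.pictures for m in recent),
        _mean(m.hashtags for m in recent),
        _mean(m.mentions for m in recent),
        _mean(1.0 if m.urls > 0 else 0.0 for m in recent),
        _mean(1.0 if m.pictures > 0 else 0.0 for m in recent),
    ]


# Hypothetical example user with two recent messages.
user = UserProfile(followees=200, followers=50, messages_total=300,
                   mutual_friends=10, favorites=5, created_days=100)
recent = [
    Message(reposts=0, comments=0, likes=0, urls=1, pictures=1,
            hashtags=0, mentions=0, is_original=False),
    Message(reposts=2, comments=4, likes=6, urls=1, pictures=0,
            hashtags=1, mentions=1, is_original=True),
]
vec = feature_vector(user, recent)
```

Vectors of this form, paired with the spammer or non-spammer label, could then serve as the training examples for any SVM implementation.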
Fig. 4. Cumulative distribution function of user-based feature for spammers and non-spammers. Recoverable panel titles: the number of created days, the fraction of messages per day, the number of average URLs.

Table 3. Classification evaluation.

Table 4. Comparison between SVM and other classifiers.

Table 5. Attributes ranking list, top 10.
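The evaluation relies on the labeled dataset and the 80/20 split described in Section 3.1. The following minimal sketch illustrates majority voting among three annotators followed by a random 80% training split. The label strings, the "unsure" vote, the function names, and the choice to split each class separately with a fixed seed are assumptions made for illustration, not details confirmed by the paper.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Tuple

CLASSES = ("spammer", "non-spammer")


def majority_label(votes: List[str]) -> Optional[str]:
    """Return the class chosen by a strict majority of annotators.

    Returns None when no class wins a strict majority, which models
    users whose class is difficult to decide and who are discarded.
    """
    counts = Counter(v for v in votes if v in CLASSES)
    if not counts:
        return None
    label, top = counts.most_common(1)[0]
    return label if top * 2 > len(votes) else None


def split_per_class(labels: Dict[str, str], train_fraction: float = 0.8,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly assign train_fraction of each class to the training set."""
    rng = random.Random(seed)
    train: List[str] = []
    test: List[str] = []
    for cls in sorted(set(labels.values())):
        users = sorted(u for u, lab in labels.items() if lab == cls)
        rng.shuffle(users)
        cut = int(len(users) * train_fraction)
        train.extend(users[:cut])
        test.extend(users[cut:])
    return train, test


# Hypothetical toy data: 10 spammers and 20 non-spammers.
labels = {f"u{i}": ("spammer" if i < 10 else "non-spammer") for i in range(30)}
train, test = split_per_class(labels)
```

With three annotators and two classes, a strict majority always exists unless an annotator abstains, so the discard path only triggers for votes such as the assumed "unsure".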
[16] A.H. Wang, Don't follow me: spam detection in Twitter, in: Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), IEEE, 2010, pp. 1–10.
[17] H. Gao, Y. Chen, K. Lee, D. Palsetia, A. Choudhary, Towards online spam filtering in social networks, in: Proceedings of the Symposium on Network and Distributed System Security (NDSS), 2012.
[18] F. Benevenuto, G. Magno, T. Rodrigues, V. Almeida, Detecting spammers on Twitter, in: Proceedings of the Seventh Annual Collaboration, Electronic messaging, Anti-abuse and Spam Conference (CEAS), 2010.
[19] Y. Zhu, X. Wang, E. Zhong, N.N. Liu, H. Li, Q. Yang, Discovering spammers in social networks, in: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.
[20] X. Hu, J. Tang, Y. Zhang, H. Liu, Social spammer detection in microblogging, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, ACM, 2013, pp. 2633–2639.
[21] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[22] F. Wang, C. Zhang, Robust self-tuning semi-supervised learning, Neurocomputing 70 (16) (2007) 2931–2939.
[23] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[24] LIBSVM – A Library for Support Vector Machines, 〈http://www.csie.ntu.edu.tw/~cjlin/libsvm/〉.
[25] G.-B. Huang, Q.-Y. Zhu, C.-K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1) (2006) 489–501.
[26] G.-B. Huang, H. Zhou, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst., Man, Cybern. 42 (2) (2012) 513–529.
[27] X. Zheng, N. Chen, Z. Chen, C. Rong, G. Chen, W. Guo, Mobile cloud based framework for remote-resident multimedia discovery and access, J. Internet Technol. 15 (6) (2014) 1043–1050.
[28] G.E. Hinton, Learning multiple layers of representation, Trends Cogn. Sci. 11 (10) (2007) 428–434.
[29] Y. Bengio, Scaling up deep learning, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, p. 1966.
[30] S. Zhou, Q. Chen, X. Wang, Active deep learning method for semi-supervised sentiment classification, Neurocomputing 120 (2013) 536–546.

Zheyi Chen (zheyi.chen@yahoo.cn) is currently working toward his M.S. degree in the College of Information Science and Technology at QingHua University. His current research interests mainly focus on New Generation Network, especially on Cloud Computing and Applications.

Yuanlong Yu (yu.yuanlong@fzu.edu.cn) is currently a professor in the College of Mathematics and Computer Sciences, Fuzhou University, China. He received the B.Eng. degree in automatic control from the Beijing Institute of Technology, Beijing, China (2000), the M.Eng. degree in computer applied technology from Tsinghua University, Beijing, China (2003), and the Ph.D. degree in electrical engineering from the Memorial University of Newfoundland, St. John's, NL, Canada (2010). His current research interests mainly focus on machine learning, computer vision and cognitive robotics.

Chunming Rong (chunming.rong@uis.no) is Professor of the University of Stavanger and head of the Center for IP-based Service Innovation (CIPSI) at the University of Stavanger (UiS) in Norway. The CIPSI has the mission to promote cross-fertilization between several research fields to facilitate design and delivery of large-scale and complex IP-based services required by many application areas. Chunming's research interests include cloud computing, big data analysis, security and privacy.