
Rumor has it: Identifying Misinformation in Microblogs

Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, Qiaozhu Mei


University of Michigan
Ann Arbor, MI
{vahed,emirose,radev,qmei}@umich.edu

Abstract

A rumor is commonly defined as a statement whose truth value is unverifiable. Rumors may spread misinformation (false information) or disinformation (deliberately false information) on a network of people. Identifying rumors is crucial in online social media, where large amounts of information are easily spread across a large network by sources with unverified authority. In this paper, we address the problem of rumor detection in microblogs and explore the effectiveness of 3 categories of features for correctly identifying rumors: content-based, network-based, and microblog-specific memes. Moreover, we show how these features are also effective in identifying disinformers, users who endorse a rumor and further help it to spread. We perform our experiments on more than 10,000 manually annotated tweets collected from Twitter and show that our retrieval model achieves more than 0.95 in Mean Average Precision (MAP). Finally, we believe that our dataset is the first large-scale dataset on rumor detection. It can open new dimensions in analyzing online misinformation and other aspects of microblog conversations.

1 Introduction

A rumor is an unverified and instrumentally relevant statement of information spread among people (DiFonzo and Bordia, 2007). Social psychologists argue that rumors arise in contexts of ambiguity, when the meaning of a situation is not readily apparent, or of potential threat, when people feel an acute need for security. For instance, a rumor about 'office renovation in a company' is an example of an ambiguous context, and the rumor that 'underarm deodorants cause breast cancer' is an example of a context in which one's well-being is at risk (DiFonzo et al., 1994).

The rapid growth of online social media has made it possible for rumors to spread more quickly. Online social media enable unreliable sources to spread large amounts of unverified information among people (Herman and Chomsky, 2002). Therefore, it is crucial to design systems that automatically detect misinformation and disinformation (the former often seen as simply false information and the latter as deliberately false information).

Our definition of a rumor is based on social psychology, where a rumor is defined as a statement whose truth value is unverifiable or deliberately false. In-depth rumor analysis, such as determining the intent and impact behind the spread of a rumor, is a very challenging task and is not possible without first retrieving the complete set of social conversations (e.g., tweets) that are actually about the rumor. In our work, we take this first step and retrieve a complete set of tweets that discuss a specific rumor. In our approach, we address two basic problems. The first problem concerns retrieving online microblogs that are rumor-related. In the second problem, we try to identify tweets in which the rumor is endorsed (the posters show that they believe the rumor).

2 Related Work

We review related work in 3 main areas: analyzing rumors, mining microblogs, and sentiment analysis and subjectivity detection.

2.1 Rumor Identification and Analysis

Though understanding rumors has been the subject of research in psychology for some time (Allport and Lepkin, 1945; Allport and Postman, 1947; DiFonzo and Bordia, 2007), research has only recently begun to investigate how rumors are manifested and spread differently online.

Microblogging services like Twitter allow small pieces of information to spread quickly to large audiences, allowing rumors to be created and spread in new ways (Ratkiewicz et al., 2010).

Related research has used different methods to study the spread of memes and false information on the web. Leskovec et al. use the evolution of quotes reproduced online to identify memes and track their spread over time (Leskovec et al., 2009). Ratkiewicz et al. (2010) created the "Truthy" system, which identifies misleading political memes on Twitter using tweet features, including hashtags, links, and mentions. Other projects focus on highlighting disputed claims on the Internet using pattern matching techniques (Ennals et al., 2010). Though our project builds on previous work, it differs in its general focus on identifying rumors from a corpus of relevant phrases and in our attempts to further discriminate between phrases that confirm, refute, question, and simply talk about rumors of interest.

Mendoza et al. explore Twitter data to analyze the behavior of Twitter users during the emergency situation of the 2010 earthquake in Chile (Mendoza et al.). They analyze the re-tweet network topology and find that the patterns of propagation of rumors differ from those of news because rumors tend to be questioned more than news by the Twitter community.

2.2 Sentiment Analysis

The automated detection of rumors is similar to traditional NLP sentiment analysis tasks. Previous work has used machine learning techniques to identify positive and negative movie reviews (Pang et al., 2002). Hassan et al. use a supervised Markov model, part-of-speech, and dependency patterns to identify attitudinal polarities in threads posted to Usenet discussion groups (Hassan et al., 2010). Others have assigned sentiment scores to news stories and blog posts based on algorithmically generated lexicons of positive and negative words (Godbole et al., 2007). Pang and Lee provide a detailed overview of current techniques and practices in sentiment analysis and opinion mining (Pang and Lee, 2008; Pang and Lee, 2004).

Though rumor classification is closely related to opinion mining and sentiment analysis, it presents a different class of problem because we are concerned not just with the opinion of the person posting a tweet, but with whether the statements they post appear controversial. The automatic identification of rumors from a corpus is most closely related to the identification of memes in (Leskovec et al., 2009), but presents new challenges since we seek to highlight a certain type of recurring phrases. Our work presents one of the first attempts at automatic rumor analysis.

2.3 Mining Twitter Data

With its nearly constant stream of new posts and its public API, Twitter can be a useful source of data for exploring a number of problems related to natural language processing and information diffusion (Bifet and Frank, 2010). Pak and Paroubek demonstrated experimentally that despite frequent occurrences of irregular speech patterns in tweets, Twitter can provide a useful corpus for sentiment analysis (Pak and Paroubek, 2010). The diversity of Twitter users makes this corpus especially valuable. Ratkiewicz et al. also use Twitter to detect and track misleading political memes (Ratkiewicz et al., 2010).

Along with many advantages, using Twitter as a corpus does present unusual challenges. Because posts are limited to 140 characters, tweets often contain information in an unusually compressed form and, as a result, the grammar used may be unconventional. Instances of sarcasm and humor are also prevalent (Bifet and Frank, 2010). The procedures we used for the collection and analysis of tweets are similar to those described in previous work. However, our goal of developing computational methods to identify rumors being transmitted through tweets differentiates our project.

3 Problem Definition

Assume we have a set of tweets about the same topic, and that this topic has some controversial aspects. Our objective in this work is two-fold: (1) extract tweets that are about the controversial aspects of the story and spread misinformation (Rumor retrieval); (2) identify users who believe the misinformation versus users who refute or question the rumor (Belief classification).

Name        Rumor                                   Regular Expression Query                        Status         #tweets
obama       Is Barack Obama muslim?                 Obama & (muslim|islam)                          false          4975
airfrance   Air France mid-air crash photos?        (air.france|air france) & (photo|pic|pix)       false          505
cellphone   Cell phone numbers going public?        (cell|cellphone|cell phone)                     mostly false   215
michelle    Michelle Obama hired too many staff?    staff & (michelle obama|first lady|1st lady)    partly true    299
palin       Sarah Palin getting divorced?           palin & divorce                                 false          4423

Table 1: List of rumor examples and their corresponding queries used to collect data from Twitter.
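The following minimal Python sketch (ours, not part of the original system) illustrates how conjunctive queries of the kind listed in Table 1 can be matched against tweet text. The query fragments are paraphrased from Table 1, and the example tweets are drawn from examples discussed later in the paper.

```python
import re

# Conjunctive queries paraphrased from Table 1: every component pattern must match.
QUERIES = {
    "obama": [r"obama", r"muslim|islam"],
    "airfrance": [r"air.?france", r"photo|pic|pix"],
    "palin": [r"palin", r"divorce"],
}

def matching_rumors(tweet_text):
    """Names of the rumor queries whose components all occur in the tweet."""
    text = tweet_text.lower()
    return [name for name, parts in QUERIES.items()
            if all(re.search(p, text) for p in parts)]

# The first tweet is a false positive for the 'obama' query; the second matches 'palin'.
print(matching_rumors("Obama meets muslim leaders"))
print(matching_rumors("Sarah and Todd Palin to divorce, according to local Alaska paper."))
```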

The following two tweets are two instances of tweets written about President Obama and the Muslim world. The first tweet below is about President Obama and the Muslim world, while the second spreads the misinformation that President Obama is Muslim.

(non-rumor) "As Obama bows to Muslim leaders Americans are less safe not only at home but also overseas. Note: The terror alert in Europe..."

(rumor) "RT @johnnyA99 Ann Coulter Tells Larry King Why People Think Obama Is A Muslim http://bit.ly/9rs6pa #Hussein via @NewsBusters #tcot .."

The goal of the retrieval task is to discriminate between such tweets. In the second task, we use the tweets that are flagged as rumorous and identify users who endorse (believe) the rumor versus users who deny or question it. The following three tweets are about the same story. The first user is a believer; the second and third are not.

(confirm) "RT @moronwatch: Obama's a Muslim. Or if he's not, he sure looks like one #whyimvotingrepublican."

(deny) "Barack Obama is a Christian man who had a Christian wedding with 2 kids baptised in Jesus name. Tea Party clowns call that muslim #p2 #gop"

(doubtful) "President Barack Obama's Religion: Christian, Muslim, or Agnostic? - The News of Today (Google): Share With Friend... http://bit.ly/bk42ZQ"

The first task is substantially more challenging than a standard IR task because of the requirement of both high precision (every result should actually discuss the rumor) and high recall (the set should be complete). To achieve this, we submit a handcrafted regexp (extracted from About.com) to Twitter and retrieve a large primitive set of tweets that is supposed to have high recall. This set, however, contains many false positives, tweets that match the regexp but are not about the rumor (e.g., "Obama meets muslim leaders"). Moreover, a rumor is usually stated in various forms (e.g., "Barack HUSSEIN Obama" versus "Obama is muslim"). Our goal is then to design a learning framework that filters out all such false positives and retrieves the various instances of the same rumor.

Although our second task, belief classification, can be viewed as an opinion mining task, it is substantially different from opinion mining in nature. The difference from a standard opinion mining task is that here we are looking for attitudes about a subtle statement (e.g., "Palin is getting divorce") instead of the overall sentiment of the text or the opinion towards an explicit object or person (e.g., "Sarah Palin").

4 Data

As of September 2010, Twitter reports that its users publish nearly 95 million tweets per day (http://twitter.com/about). This makes Twitter an excellent case for analyzing misinformation in social media.

Our goal in this work was to collect and annotate a large dataset that includes all the tweets written about a rumor in a certain period of time. To collect such a complete and self-contained dataset about a rumor, we used the Twitter search API and retrieved all the tweets that matched a given regular expression. This API is the only API that returns results from the entire public Twitter stream and not a small randomly selected sample. To overcome the rate limit enforced by Twitter, we collected matching tweets once per hour and removed any duplicates.
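As a rough illustration of this collection loop, here is a minimal sketch; `search_matching_tweets` is a hypothetical stand-in for a Twitter search API request and is not part of the paper.

```python
import time

def collect_rumor_tweets(search_matching_tweets, hours=24):
    """Collect tweets matching a rumor query once per hour, dropping duplicates by tweet id.

    search_matching_tweets is a hypothetical callable standing in for a Twitter search
    API request; it is assumed to return (tweet_id, text, user) tuples.
    """
    seen_ids, collected = set(), []
    for _ in range(hours):
        for tweet_id, text, user in search_matching_tweets():
            if tweet_id not in seen_ids:      # de-duplicate across hourly batches
                seen_ids.add(tweet_id)
                collected.append((tweet_id, text, user))
        time.sleep(3600)                      # one query batch per hour to respect the rate limit
    return collected
```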

To use the search API, we carefully designed regular expression queries to be broad enough to match all the tweets that are about a rumor. Each query represents a popular rumor that is listed as "false" or only "partly true" on About.com's Urban Legends reference site (http://urbanlegends.about.com) between 2009 and 2010. Table 1 lists the rumor examples that we used to collect our dataset, along with their corresponding regular expression queries and the number of tweets collected.

4.1 Annotation

We asked two annotators to go over all the tweets in the dataset and mark each tweet with a "1" if it is about any of the rumors from Table 1, and with a "0" otherwise. This annotation scheme will be used in our first task to detect false positives, tweets that match the broad regular expressions and are retrieved but are not about the rumor. For instance, both of the following tweets match the regular expression for the palin example, but only the second one is rumorous.

(0) "McCain Divorces Palin over her 'untruths and out right lies' in the book written for her. McCain's team says Palin is a petty liar and phony"

(1) "Sarah and Todd Palin to divorce, according to local Alaska paper. http://ow.ly/iNxF"

We also asked the annotators to mark each previously annotated rumorous tweet with "11" if the tweet poster endorses the rumor and with "12" if the user refutes the rumor, questions its credibility, or is neutral.

(12) "Sarah Palin Divorce Rumor Debunked on Facebook http://ff.im/62Evd"

(11) "Todd and Sarah Palin to divorce http://bit.ly/15StNc"

Our annotation of more than 10,400 tweets shows that 35% of all the instances that matched the regular expressions are false positives, tweets that are not rumor-related but match the initial queries. Moreover, among tweets that are about particular rumors, nearly 43% show that the poster believes the rumor, demonstrating the importance of identifying misinformation and those who are misinformed. Table 2 shows the basic statistics extracted from the annotations for each story.

Rumor       non-rumor (0)   believe (11)   deny/doubtful/neutral (12)   total
obama       3,036           926            1,013                        4,975
airfrance   306             71             128                          505
cellphone   132             74             9                            215
michelle    83              191            25                           299
palin       86              1,709          2,628                        4,423
total       3,643           2,971          3,803                        10,417

Table 2: Number of instances in each class from the annotated data.

4.2 Inter-Judge Agreement

To calculate the annotation accuracy, we annotated 500 instances twice. These annotations were compared with each other, and the Kappa coefficient (κ) was calculated. The κ statistic is formulated as

\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}

where Pr(a) is the relative observed agreement among raters, and Pr(e) is the probability that the annotators agree by chance if each annotator is randomly assigning categories (Krippendorff, 1980; Carletta, 1996). Table 3 shows that the annotators reach a high agreement in both extracting rumors (κ = 0.95) and identifying believers (κ = 0.85).

task                     κ
rumor retrieval          0.954
belief classification    0.853

Table 3: Inter-judge agreement in the two annotation tasks in terms of the κ statistic.
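A minimal sketch of the κ computation used in Section 4.2, assuming two parallel lists of labels; this is our illustration, not the authors' annotation tooling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(labels_a)
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n          # observed agreement
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # chance agreement if each annotator assigned labels at random with their own marginals
    pr_e = sum((counts_a[c] / n) * (counts_b[c] / n)
               for c in set(labels_a) | set(labels_b))
    return (pr_a - pr_e) / (1 - pr_e)

# Toy example with the annotation scheme above (1 = rumor-related, 0 = not).
print(cohens_kappa([1, 0, 1, 1, 0, 0], [1, 0, 1, 0, 0, 0]))
```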

5 Approach

In this section, we describe a general framework which, given a tweet, predicts (1) whether it is a rumor-related statement and, if so, (2) whether the user believes the rumor or not. We describe 3 sets of features and explain why they are intuitive to use for the identification of rumors.

We process the tweets as they appear in the user timeline and do not perform any pre-processing. In particular, we think that capitalization might be an important property, so we do not lower-case the tweet texts either.

Our approach is based on building different Bayes classifiers as high-level features and then learning a linear function of these classifiers for retrieval in the first task and for classification in the second. Each Bayes classifier, which corresponds to a feature f_i, calculates the likelihood ratio for a given tweet t, as shown in Equation 1.

\frac{P(\theta_i^+ \mid t)}{P(\theta_i^- \mid t)} = \frac{P(\theta_i^+)}{P(\theta_i^-)} \cdot \frac{P(t \mid \theta_i^+)}{P(t \mid \theta_i^-)}    (1)

Here θi+ and θi− are two probabilistic models built based on feature f_i using a set of positive (+) and negative (−) training data. The likelihood ratio expresses how many times more likely the tweet t is under the positive model than under the negative model with respect to f_i.

For computational reasons, and to avoid dealing with very small numbers, we use the log of the likelihood ratio to build each classifier:

LL_i = \log\frac{P(\theta_i^+ \mid t)}{P(\theta_i^- \mid t)} = \log\frac{P(\theta_i^+)}{P(\theta_i^-)} + \log\frac{P(t \mid \theta_i^+)}{P(t \mid \theta_i^-)}    (2)

The first term, P(θi+)/P(θi−), can be easily calculated using the maximum likelihood estimates of the probabilities (i.e., the estimate of each probability is the corresponding relative frequency). The second term is calculated using the various features that we explain below.

5.1 Content-based Features

The first set of features is extracted from the text of the tweets. We propose 4 content-based features. We follow (Hassan et al., 2010) and represent the tweet with 2 different patterns:

• Lexical patterns: All the words and segments in the tweet are represented as they appear and are tokenized using the space character.

• Part-of-speech patterns: All words are replaced with their part-of-speech tags. To find the part of speech of a hashtag we treat it as a word (since hashtags can have semantic roles in the sentence) by omitting the hash sign, and we precede the tag with the label TAG/. We also introduce a new tag, URL, for URLs that appear in a tweet.

From each tweet we extract 4 (2 × 2) features, corresponding to unigrams and bigrams of each representation. Each feature is the log-likelihood ratio calculated using Equation 2. More formally, we represent each tweet t, of length n, lexically as (w_1 w_2 · · · w_n) and with part-of-speech tags as (p_1 p_2 · · · p_n). After building the positive and negative models (θ+, θ−) for each feature using the training data, we calculate the likelihood ratio as defined in Equation 2, where

\log\frac{P(t \mid \theta^+)}{P(t \mid \theta^-)} = \sum_{j=1}^{n} \log\frac{P(w_j \mid \theta^+)}{P(w_j \mid \theta^-)}    (3)

for unigram lexical features (TXT1) and

\log\frac{P(t \mid \theta^+)}{P(t \mid \theta^-)} = \sum_{j=1}^{n-1} \log\frac{P(w_j w_{j+1} \mid \theta^+)}{P(w_j w_{j+1} \mid \theta^-)}    (4)

for bigram-based lexical features (TXT2). Similarly, we define the unigram- and bigram-based part-of-speech features (POS1 and POS2) as the log-likelihood ratio with respect to the positive and negative part-of-speech models.
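A minimal sketch of the content-based log-likelihood features of Equations 3 and 4, assuming whitespace tokenization and add-one smoothing; the paper builds its language models with the CMU toolkit, so this is only an approximation of the idea.

```python
import math
from collections import Counter

def ngram_counts(tweets, n):
    """Count n-grams (tuples of tokens) over whitespace-tokenized tweets."""
    counts = Counter()
    for text in tweets:
        tokens = text.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def content_ll_ratio(tweet, pos_counts, neg_counts, n, vocab_size):
    """log P(t|theta+) - log P(t|theta-) with add-one smoothing (Equations 3 and 4)."""
    tokens = tweet.split()
    pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())
    score = 0.0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        p_pos = (pos_counts[gram] + 1) / (pos_total + vocab_size)
        p_neg = (neg_counts[gram] + 1) / (neg_total + vocab_size)
        score += math.log(p_pos / p_neg)
    return score

# Hypothetical training tweets; in the paper the models come from annotated rumor data.
pos_tweets = ["obama is a muslim", "rt obama is a muslim"]
neg_tweets = ["obama meets muslim leaders today"]
pos1, neg1 = ngram_counts(pos_tweets, 1), ngram_counts(neg_tweets, 1)
vocab = len(set(pos1) | set(neg1))
print(content_ll_ratio("obama is a muslim", pos1, neg1, 1, vocab))  # TXT1; use n=2 for TXT2
```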

5.2 Network-based Features

The features that we have proposed so far are all based on the content of individual tweets. In the second set of features we focus on user behavior on Twitter. We observe 4 types of network-based properties and build 2 features that capture them.

Twitter enables users to re-tweet messages from other people. This interaction is usually easy to detect because re-tweeted messages generally start with the specific pattern 'RT @user'. We use this property to make inferences about the re-tweeted message.

Suppose a user u_i re-tweets a message t from the user u_j (u_i: "RT @u_j t"). Intuitively, t is more likely to be a rumor if (1) u_j has a history of posting or re-tweeting rumors, or (2) u_i has posted or re-tweeted rumors in the past.

Given a set of training instances, we build a positive (θ+) and a negative (θ−) user model. The first model is a probability distribution over all users that have posted a positive instance or have been re-tweeted in a positive instance. Similarly, the second model is a probability distribution over users that have posted (or been re-tweeted in) a negative instance. After building the models, for a given tweet we calculate two log-likelihood ratios as two network-based features.

The first feature is the log-likelihood ratio that u_i is under the positive user model (USR1), and the second feature is the log-likelihood ratio that the tweet is re-tweeted from a user (u_j) who is under the positive rather than the negative user model (USR2).

The distinction between the posting user and the re-tweeted user is important, since sometimes users modify the re-tweeted message in a way that changes its meaning and intent. In the following example, the original user is quoting President Obama. The second user is re-tweeting the first user, but has added more content to the tweet and made it sound rumorous.

original message (non-rumor) "Obama says he's doing 'Christ's work'."

re-tweeted (rumor) "Obama says he's doing 'Christ's work.' Oh my God, CHRIST IS A MUSLIM."

5.3 Twitter Specific Memes

Our final set of features is extracted from memes that are specific to Twitter: hashtags and URLs. Previous work has shown the usefulness of these memes (Ratkiewicz et al., 2010).

5.3.1 Hashtags

One emergent phenomenon in the Twitter ecosystem is the use of hashtags: words or phrases prefixed with a hash symbol (#). These hashtags are created by users and are widely used for a few days, then disappear when the topic is outdated (Huang et al., 2010).

In our approach, we investigate whether the hashtags used in rumor-related tweets are different from those in other tweets. Moreover, we examine whether people who believe and spread rumors use hashtags that are different from those seen in tweets that deny or question a rumor.

Given a set of training tweets of positive and negative examples, we build two statistical models (θ+, θ−), each showing the usage probability distribution of the various hashtags. For a given tweet t with a set of m hashtags (#h_1 · · · #h_m), we calculate the log-likelihood ratio using Equation 2, where

\log\frac{P(t \mid \theta^+)}{P(t \mid \theta^-)} = \sum_{j=1}^{m} \log\frac{P(\#h_j \mid \theta^+)}{P(\#h_j \mid \theta^-)}    (5)

5.3.2 URLs

Previous work has discussed the role of URLs in information diffusion on Twitter (Honeycutt and Herring, 2009). Twitter users share URLs in their tweets to refer to external sources or to overcome the length limit enforced by Twitter. Intuitively, if a tweet is a positive instance, then it is likely to be similar to the content of the URLs shared by other positive tweets. By the same reasoning, if a tweet is a negative instance, then it should be more similar to the web pages shared by other negative instances.

Given a set of training tweets, we fetch all the URLs in these tweets and build θ+ and θ− once for unigrams and once for bigrams. These models are built solely on the content of the URLs and ignore the tweet content. Similar to the previous features, we calculate the log-likelihood ratio of the content of each tweet with respect to θ+ and θ− for unigrams (URL1) and bigrams (URL2).

Table 4 summarizes the set of features used in our proposed framework, where each feature is a log-likelihood ratio calculated against positive (+) and negative (−) training models. To build these language models, we use the CMU Language Modeling toolkit (Clarkson and Rosenfeld, 1997).

          Feature   LL-ratio           model
Content   TXT1      content unigram    content unigram
          TXT2      content bigram     content unigram
          POS1      content pos        content pos unigram
          POS2      content pos        content pos bigram
Twitter   URL1      content unigram    target URL unigram
          URL2      content bigram     target URL bigram
          TAG       hashtag            hashtag
Network   USR1      tweeting user      all users in the data
          USR2      re-tweeted user    all users in the data

Table 4: List of features used in our optimization framework. Each feature is a log-likelihood ratio calculated against a positive (+) and a negative (−) training model.
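A minimal sketch of extracting the raw signals behind the network-based and Twitter-specific features (the re-tweeted user for USR2, hashtags for TAG, and URLs for URL1/URL2). The regular expressions are ours and only approximate Twitter's conventions.

```python
import re

RT_PATTERN = re.compile(r"\bRT\s+@(\w+)")     # 'RT @user ...' marks a re-tweet
HASHTAG_PATTERN = re.compile(r"#(\w+)")
URL_PATTERN = re.compile(r"https?://\S+")

def tweet_signals(text):
    """Return (re-tweeted users, hashtags, URLs) found in a tweet's text."""
    return (RT_PATTERN.findall(text),
            HASHTAG_PATTERN.findall(text),
            URL_PATTERN.findall(text))

# Example tweet from Section 3 of the paper.
rts, tags, urls = tweet_signals(
    "RT @johnnyA99 Ann Coulter Tells Larry King Why People Think Obama Is A Muslim "
    "http://bit.ly/9rs6pa #Hussein via @NewsBusters #tcot")
print(rts)   # ['johnnyA99']
print(tags)  # ['Hussein', 'tcot']
print(urls)  # ['http://bit.ly/9rs6pa']
```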

5.4 Optimization

We build an L1-regularized log-linear model (Andrew and Gao, 2007) on the various features discussed above to predict each tweet. Suppose a procedure generates a set of candidates for an input x. Also, suppose Φ : X × Y → R^D is a function that maps each (x, y) to a vector of feature values. Here, the feature vector is the vector of coefficients corresponding to the different network, content, and Twitter-based properties, and the parameter vector θ ∈ R^D (D ≤ 9 in our experiments) assigns a real-valued weight to each feature. The estimator chooses θ to minimize the sum of least squares and a regularization term R:

\hat{\theta} = \arg\min_{\theta} \Big\{ \tfrac{1}{2} \sum_i \| \langle \theta, x_i \rangle - y_i \|_2^2 + R(\theta) \Big\}    (6)

where the regularizer R(θ) is the weighted L1 norm of the parameters:

R(\theta) = \alpha \sum_j |\theta_j|    (7)

Here, α is a parameter that controls the amount of regularization (set to 0.1 in our experiments).

Gao et al. (2007) argue that optimizing an L1-regularized objective function is challenging since its gradient is discontinuous whenever some parameters equal zero. In this work, we use the orthant-wise limited-memory quasi-Newton algorithm (OWL-QN), a modification of L-BFGS that allows it to effectively handle the discontinuity of the gradient (Andrew and Gao, 2007). OWL-QN is based on the fact that, when restricted to a single orthant, the L1 regularizer is differentiable and is in fact a linear function of θ. Thus, as long as each coordinate of any two consecutive search points does not pass through zero, R(θ) does not contribute at all to the curvature of the function on the segment joining them. Therefore, we can use L-BFGS to approximate the Hessian of L(θ) alone and use it to build an approximation to the full regularized objective that is valid on a given orthant. This algorithm works quite well in practice and typically reaches convergence in even fewer iterations than standard L-BFGS (Gao et al., 2007).

6 Experiments

We design 2 sets of experiments to evaluate our approach. In the first experiment, we assess the effectiveness of the proposed method when employed in an Information Retrieval (IR) framework for rumor retrieval; in the second experiment, we employ the various features to detect users' beliefs in rumors.

6.1 Rumor Retrieval

In this experiment, we view the different stories as queries and build a relevance set for each query. Each relevance set is an annotation of the entire 10,417 tweets, where a tweet is marked as relevant if it matches the regular expression query and is marked as a rumor-related tweet by the annotators. For instance, according to Table 2 the cellphone dataset has only 83 relevant documents out of the entire 10,417 documents.

For each query we use 5-fold cross-validation and predict the relevance of tweets as a function of their features. We use these predictions to rank all the tweets with respect to the query. To evaluate the performance of our ranking model for a single query Q with the set of relevant documents {d_1, · · · , d_m}, we calculate Average Precision as

AP(Q) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{Precision}(R_k)    (8)

where R_k is the set of ranked retrieval results from the top result down to the k-th relevant document, d_k (Manning et al., 2008).

6.1.1 Baselines

We compare our proposed ranking model with a number of other retrieval models. The first two simple baselines, which indicate a difficulty lower bound for the problem, are the Random and Uniform methods. In the Random baseline, documents are ranked based on a random number assigned to them. In the Uniform model, we use 5-fold cross-validation, and in each fold the label of the test documents is determined by the majority vote of the training set. The main baseline that we use in this work is the regular expression that was submitted to Twitter to collect the data (regexp). Using the same regular expression to mark the relevance of the documents yields a recall of 1.00 (since it retrieves all the relevant documents), but it also retrieves false positives, tweets that match the regular expression but are not rumor-related. We would like to investigate whether using training data will help us decrease the rate of false positives in retrieval.

Finally, using the Lemur Toolkit software (http://www.lemurproject.org/), we employ a KL-divergence retrieval model with Dirichlet smoothing (KL). In this model, documents are ranked according to the negation of the divergence between the query and document language models. More formally, given the query language model θ_Q and the document language model θ_D, the documents are ranked by −D(θ_Q || θ_D), where D is the KL-divergence between the two models:

D(\theta_Q \,\|\, \theta_D) = \sum_{w} p(w \mid \theta_Q) \log\frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}    (9)

To estimate p(w|θ_D), we use Bayesian smoothing with Dirichlet priors (Berger, 1985):

p_s(w \mid \theta_D) = \frac{C(w, D) + \mu \, p(w \mid \theta_S)}{\mu + \sum_{w} C(w, D)}    (10)

where µ is a parameter, C is the count function, and θ_S is the collection language model. Higher values of µ put more emphasis on the collection model. Here, we try two variants of the model, one using the default parameter value in Lemur (µ = 2000) and one in which µ is tuned on the data (µ = 10). Using the test data to tune the parameter value µ helps us find an upper-bound estimate of the effectiveness of this method.
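A minimal sketch of the Dirichlet-smoothed KL scoring of Equations 9 and 10 (ours; the experiments use the Lemur Toolkit implementation). The add-one guard on the collection model is a simplification to avoid zero probabilities.

```python
import math
from collections import Counter

def dirichlet_kl_score(query_tokens, doc_tokens, collection_counts, mu=2000):
    """Ranking score -D(theta_Q || theta_D) with Dirichlet smoothing (Equations 9 and 10)."""
    q_counts, q_len = Counter(query_tokens), len(query_tokens)
    d_counts, d_len = Counter(doc_tokens), len(doc_tokens)
    coll_total = sum(collection_counts.values())
    score = 0.0
    for w, qc in q_counts.items():
        p_q = qc / q_len                                     # maximum-likelihood query model
        # add-one guard on the collection model so unseen words keep p_d > 0 (a simplification)
        p_coll = (collection_counts[w] + 1) / (coll_total + len(collection_counts) + 1)
        p_d = (d_counts[w] + mu * p_coll) / (d_len + mu)     # Dirichlet-smoothed document model
        score -= p_q * math.log(p_q / p_d)
    return score

# Toy 'documents' (tokenized tweets); a higher score means a better match to the query.
docs = [["palin", "divorce", "rumor"], ["obama", "muslim", "rumor"], ["air", "france", "photos"]]
collection = Counter(w for d in docs for w in d)
ranked = sorted(range(len(docs)),
                key=lambda i: dirichlet_kl_score(["palin", "divorce"], docs[i], collection, mu=10),
                reverse=True)
print(ranked)
```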

Table 5 shows the Mean Average Precision (MAP) and Fβ=1 of each method in the rumor retrieval task. The table shows that a method that employs training data to re-rank documents with respect to rumors makes significant improvements over the baselines and outperforms other strong retrieval systems.

6.1.2 Feature Analysis

To investigate the effectiveness of individual features in retrieving rumors, we perform 5-fold cross-validation for each query, using a different feature set each time. Figure 1 shows the average precision and recall of our proposed optimization system when the content-based (TXT1+TXT2+POS1+POS2), network-based (USR1+USR2), and Twitter-specific (TAG+URL1+URL2) features are employed individually.

[Figure 1: Average precision and recall of the proposed method employing each set of features: content-based, network-based, and twitter specific.]

Figure 1 shows that the features calculated using the content language models are very effective in achieving high precision and recall. Twitter-specific features, especially hashtags, can result in high precision but lead to a low recall value because many tweets do not share hashtags or are not written based on the contents of external URLs.

Finally, we find that user history can be a good indicator of rumors. However, we believe that this feature could be more helpful with a complete user set and a more comprehensive history of their activities.

6.1.3 Domain Training Data

As our last experiment with rumor retrieval, we investigate how much new labeled data from an emergent rumor is required to effectively retrieve instances of that particular rumor. This experiment helps us understand how our proposed framework could be generalized to other stories.

To do this experiment, we use the obama story, which is a large dataset with a significant number of false positive instances. We extract 400 randomly selected tweets from this dataset and keep them for testing. We also build an initial training dataset from the other 4 rumors and label those tweets as not relevant. We then assess the performance of the retrieval model as we gradually add the rest of the obama tweets. Figure 2 shows both Average Precision and labeling accuracy versus the amount of labeled data used from the obama dataset. The plot shows that both measures exhibit fast growth and reach 80% when the number of labeled instances reaches 2000.
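Putting the retrieval pipeline of Sections 5.4 and 6.1 together, the following end-to-end sketch uses scikit-learn's Lasso (coordinate descent) as a stand-in for the OWL-QN optimizer of Equations 6 and 7, and computes the Average Precision of Equation 8 on toy feature vectors; it is ours, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def average_precision(relevance_ranked):
    """Average Precision over a ranked list of 0/1 relevance labels (Equation 8)."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance_ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

# Toy data: rows are tweets, columns are the 9 log-likelihood-ratio features of Table 4.
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(200, 9), rng.randint(0, 2, 200)
X_test, y_test = rng.randn(50, 9), rng.randint(0, 2, 50)

# L1-regularized least squares (Equations 6 and 7); alpha corresponds to the weight in Equation 7.
model = Lasso(alpha=0.1).fit(X_train, y_train)
scores = model.predict(X_test)
ranking = np.argsort(-scores)                 # rank test tweets by predicted relevance
print(average_precision(y_test[ranking].tolist()))
```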

Method MAP 95% C.I. Fβ=1 95% C.I.
Random 0.129 [-0.065, 0.323] 0.164 [-0.051, 0.379]
Uniform 0.129 [-0.066, 0.324] 0.198 [-0.080, 0.476]
regexp 0.587 [0.305, 0.869] 0.702 [0.479, 0.925]
KL (µ = 2000) 0.678 [0.458, 0.898] 0.538 [0.248, 0.828]
KL (µ = 10) 0.803 [0.641, 0.965] 0.681 [0.614, 0.748]
LL (all 9 features) 0.965 [0.936, 0.994] 0.897 [0.828, 0.966]

Table 5: Mean Average Precision (MAP) and Fβ=1 of each method in the rumor retrieval task. (C.I.: Confidence
Interval)

Method Accuracy Precision Recall Fβ=1 Win/Loss Ratio


random 0.501 0.441 0.513 0.474 1.004
uniform 0.439 0.439 1.000 0.610 0.781
TXT 0.934 0.925 0.924 0.924 14.087
POS 0.742 0.706 0.706 0.706 2.873
content (TXT+POS) 0.941 0.934 0.930 0.932 15.892
network (USR) 0.848 0.873 0.765 0.815 5.583
TAG 0.589 0.734 0.099 0.175 1.434
URL 0.664 0.630 0.570 0.598 1.978
twitter (TAG+URL) 0.683 0.658 0.579 0.616 2.155
all 0.935 0.944 0.906 0.925 14.395

Table 6: Accuracy, precision, recall, Fβ=1 , and win/loss ratio of belief classification using different features.

[Figure 2: Average Precision and Accuracy learning curve for the proposed method employing all 9 features.]

6.2 Belief Classification

In the previous experiments we showed that maximizing a linear function of log-likelihood ratios is an effective method for retrieving rumors. Here, we investigate whether this method, and in particular the proposed features, are useful in detecting users' beliefs in a rumor that they post about. Unlike retrieval, detecting whether a user endorses a rumor or refutes it may be possible using similar methods regardless of the rumor. Intuitively, linguistic features such as negation (e.g., "obama is not a muslim"), capitalization (e.g., "barack HUSSEIN obama ..."), user history (e.g., liberal tweeter vs. conservative tweeter), hashtags (e.g., #tcot vs. #tdot), and URLs (e.g., links to fake airfrance crash photos) should help to identify endorsements.

We perform this experiment by making a pool of all the tweets that were marked as "rumorous" in the annotation task. Table 2 shows that there are 6,774 such tweets, of which 2,971 show belief and 3,803 show that the user is doubtful, denies, or questions the rumor.

Using various feature settings, we perform 5-fold cross-validation on these 6,774 rumorous tweets. Table 6 shows the results of this experiment in terms of F-score, classification accuracy, and win/loss ratio, the ratio of correct to incorrect classifications.
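A minimal sketch of the metrics reported in Table 6, assuming binary gold and predicted labels (1 = believes the rumor, 0 = denies or questions it); the win/loss ratio is the count of correct classifications divided by the count of incorrect ones.

```python
def belief_metrics(gold, predicted):
    """Accuracy, precision, recall, F1, and win/loss ratio for binary belief labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 0)
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    win_loss = (tp + tn) / max(fp + fn, 1)      # correct versus incorrect classifications
    return accuracy, precision, recall, f1, win_loss

# Toy example with hypothetical gold and predicted belief labels.
print(belief_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```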

7 Conclusion

In this paper we tackle the largely unaddressed problem of identifying misinformation and disinformers in microblogs. Our contributions are two-fold: (1) we propose a general framework that employs statistical models and maximizes a linear function of log-likelihood ratios to retrieve rumorous tweets that match a more general query; (2) we show the effectiveness of the proposed features in capturing tweets that show user endorsement. This will help us identify disinformers, users who spread false information in online social media.

Our work has resulted in a manually annotated dataset of 10,000 tweets from 5 different controversial topics. To the knowledge of the authors this is the first large-scale publicly available rumor dataset, and it can open many new dimensions in studying the effects of misinformation and other aspects of information diffusion in online social media.

In this paper we effectively retrieve instances of rumors that have already been identified and evaluated by an external source such as About.com's Urban Legends reference. Identifying new emergent rumors directly from the Twitter data is a more challenging task. As future work, we aim to build a system that employs our findings in this paper and the emergent patterns in the re-tweet network topology to identify whether a new trending topic is a rumor or not.

8 Acknowledgments

The authors would like to thank Paul Resnick, Rahul Sami, and Brendan Nyhan for helpful discussions. This work is supported by the National Science Foundation grant "SoCS: Assessing Information Credibility Without Authoritative Sources" as IIS-0968489. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the supporters.

References

Floyd H. Allport and Milton Lepkin. 1945. Wartime rumors of waste and special privilege: why some people believe them. Journal of Abnormal and Social Psychology, 40(1):3-36.

Gordon Allport and Leo Postman. 1947. The Psychology of Rumor. Holt, Rinehart, and Winston, New York.

Galen Andrew and Jianfeng Gao. 2007. Scalable training of L1-regularized log-linear models. In ICML '07, pages 33-40.

James Berger. 1985. Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer-Verlag, New York.

Albert Bifet and Eibe Frank. 2010. Sentiment knowledge discovery in Twitter streaming data. In Bernhard Pfahringer, Geoff Holmes, and Achim Hoffmann, editors, Discovery Science, volume 6332 of Lecture Notes in Computer Science, pages 1-15. Springer Berlin/Heidelberg.

Jean Carletta. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249-254.

Philip Clarkson and Roni Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. Proceedings ESCA Eurospeech, 47:45-148.

Nicholas DiFonzo and Prashant Bordia. 2007. Rumor, gossip, and urban legend. Diogenes, 54:19-35, February.

Nicholas DiFonzo, P. Prashant Bordia, and Ralph L. Rosnow. 1994. Reining in rumors. Organizational Dynamics, 23(1):47-62.

Rob Ennals, Dan Byler, John Mark Agosta, and Barbara Rosario. 2010. What is disputed on the web? In Proceedings of the 4th Workshop on Information Credibility, WICOW '10, pages 67-74.

Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. 2007. A comparative study of parameter estimation methods for statistical natural language processing. In ACL '07.

Namrata Godbole, Manjunath Srinivasaiah, and Steven Skiena. 2007. Large-scale sentiment analysis for news and blogs. In Proceedings of the International Conference on Weblogs and Social Media (ICWSM), Boulder, CO, USA.

Ahmed Hassan, Vahed Qazvinian, and Dragomir Radev. 2010. What's with the attitude? Identifying sentences with attitude in online discussions. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1245-1255, Cambridge, MA, October. Association for Computational Linguistics.

Edward S. Herman and Noam Chomsky. 2002. Manufacturing Consent: The Political Economy of the Mass Media. Pantheon.

Courtenay Honeycutt and Susan C. Herring. 2009. Beyond microblogging: Conversation and collaboration via Twitter. In Hawaii International Conference on System Sciences, pages 1-10.

Jeff Huang, Katherine M. Thornton, and Efthimis N. Efthimiadis. 2010. Conversational tagging in Twitter. In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia, HT '10, pages 173-178.

Klaus Krippendorff. 1980. Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills.

Jure Leskovec, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497-506.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Marcelo Mendoza, Barbara Poblete, and Carlos Castillo. Twitter under crisis: Can we trust what we RT?

Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May. European Language Resources Association (ELRA).

Bo Pang and Lillian Lee. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In ACL '04, Morristown, NJ, USA.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2:1-135.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '02, pages 79-86.

Jacob Ratkiewicz, Michael Conover, Mark Meiss, Bruno Gonçalves, Snehal Patil, Alessandro Flammini, and Filippo Menczer. 2010. Detecting and tracking the spread of astroturf memes in microblog streams. CoRR, abs/1011.3768.

