
Interpretable Rumor Detection in Microblogs by Attending to User Interactions

Ling Min Serena Khoo, Hai Leong Chieu (DSO National Laboratories, 12 Science Park Drive, Singapore 118225) {klingmin,chaileon}@dso.org.sg
Zhong Qian, Jing Jiang (Singapore Management University, 80 Stamford Road, Singapore 178902) qianzhongqz@163.com, jingjiang@smu.edu.sg
arXiv:2001.10667v1 [cs.CL] 29 Jan 2020

Abstract

We address rumor detection by learning to differentiate between the community's response to real and fake claims in microblogs. Existing state-of-the-art models are based on tree models that model conversational trees. However, in social media, a user posting a reply might be replying to the entire thread rather than to a specific user. We propose a post-level attention model (PLAN) to model long distance interactions between tweets with the multi-head attention mechanism in a transformer network. We investigated variants of this model: (1) a structure aware self-attention model (StA-PLAN) that incorporates tree structure information in the transformer network, and (2) a hierarchical token and post-level attention model (StA-HiTPLAN) that learns a sentence representation with token-level self-attention. To the best of our knowledge, we are the first to evaluate our models on two rumor detection data sets: the PHEME data set as well as the Twitter15 and Twitter16 data sets. We show that our best models outperform current state-of-the-art models for both data sets. Moreover, the attention mechanism allows us to explain rumor detection predictions at both token-level and post-level.

[Figure 1: Sample of a thread from Twitter15 resulting from a fake claim. The source claim ("walmart donates $10,000 to support darren wilson and the on going racist police murders #ferguson #boycottwalmart URL") receives direct replies R_1 to R_4, which in turn receive nested replies R_1_1, R_2_1 and R_3_1. Several replies question the claim's source, and others state that it has been debunked.]
1 Introduction

The spread of fake news can have far reaching and devastating effects. $130 billion in stock value was wiped out in minutes after a false Associated Press tweet claimed that Barack Obama was injured following an explosion in 2013 (Rapoza 2017). Some researchers even believe that fake news affected the outcome of the 2016 United States presidential election (Gunther, Beck, and Nisbet 2018). The severity of the impact of fake news warrants an effective and automated means of detecting it, and has hence spurred much research in this area.

Our work focuses on the detection of fake claims using community response to such claims. This area of research exploits the collective wisdom of the community by applying natural language processing to the comments directed towards a claim (see Figure 1). The key principle behind such work is that users on social media share opinions, conjectures and evidence about inaccurate information. Hence, both the interactions between users and the content shared can be captured for fake news detection. We briefly discuss two state-of-the-art models (Ma, Gao, and Wong 2018b; Kumar and Carley 2019).

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Ma et al. (2018b) organized the source claim and its responding tweets in a tree structure as shown in Figure 1. Each tweet is represented as a node; the top node is the source claim, and the children of a node are the tweets that have responded to it directly. They modeled the spread of information in a propagation tree using recursive neural networks. Signals from different nodes are aggregated recursively in either a bottom-up or a top-down manner: information is propagated from the child node to the parent node in the bottom-up model, and vice versa in the top-down model. Similarly, Kumar and Carley (2019) organized conversation threads with a tree structure and explored several variants of branch and tree LSTMs for rumour detection.

Both papers used tree models with the intention of modelling the structural information present in the conversation thread. In tree models, information is propagated either from the parent to the child or vice versa. However, the thread structure in social media conversations might be unclear. Each user is often able to observe all the replies in different branches of the conversation, so a user debunking fake news may not be addressing solely the person he is replying to: the content created could also be applicable to other tweets in the thread. Tree models do not explicitly model interactions between nodes in different branches, and this is a key limitation when modelling social media conversations. For example, in Figure 1, tweet R_1 and its replying tweet, R_1_1, have expressed doubt regarding the factuality of the source claim. Tweets R_2_1 and R_3_1 have provided conclusive evidence that debunks the source claim as fake. Though R_2_1 and R_3_1 are child nodes of R_2 and R_3 respectively, they could provide important information to all the other nodes in the tree, such as R_1_1 and R_1. Therefore, we should consider interactions between all tweets, not just those between parents and their children, for better aggregation of information from the replying tweets.

In this paper, we propose to overcome these limitations of tree models in modelling social media conversations. More specifically, we flattened the tree structure and arranged all tweets in chronological order. We propose a post-level attention model (PLAN) that allows all possible pairwise interactions between tweets through the self-attention mechanism. To combine the strengths of the self-attention mechanism and tree models, we incorporated structural information and performed structure aware self-attention at the post level (StA-PLAN). Lastly, we designed a structure aware hierarchical token and post-level attention network (StA-HiTPLAN) that learns more complex sentence representations for each tweet.

The contributions of this paper are the following:

• We utilize the attention weights from our model to provide both token-level and post-level explanations behind the model's prediction. To the best of our knowledge, we are the first paper to do this.

• We compare against previous works on two data sets: PHEME 5 events (Kochkina, Liakata, and Zubiaga 2018) and Twitter15 and Twitter16 (Ma, Gao, and Wong 2017). Previous works only evaluated on one of the two data sets.

• Our proposed models outperform current state-of-the-art models for both data sets.

The rest of the paper is organized as follows. In Section 2, we examine related work. We define our problem statement in Section 3. We present our models in Section 4 and results in Section 5. We then conclude with future work in Section 6.

2 Related Work

Existing approaches to automatically differentiating fake from real claims leverage a variety of features: (i) the content of the claim, (ii) the bias and social network of the source of the claim, (iii) fact checking against trustworthy sources (e.g., Wikipedia), and (iv) community response to the claims. Our work falls into the last class, detecting rumors or fake claims from community response. In this section, we give a brief review of each class of work, focusing on works that detect fake claims using community response. For a more detailed survey, we refer the reader to (Sharma et al. 2019).

Content Information: Early work on deceptive content detection studied the use of linguistic cues such as the percentage of pronouns, word length, verb quantity, and word classes (Fuller, Biros, and Wilson 2009; Mihalcea and Strapparava 2009; Ott et al. 2011; Rubin et al. 2016) on fake reviews, witness accounts, and satire. The detection of fake news using linguistic features has also been studied in (O'Brien et al. 2018; Wang 2017). Such analysis of deceptive content relies on linguistic features which might be unique to particular domains or topics.

Source and Social Network: Another group of work studied the source of fake news and its social network. Wang et al. (2017) found that adding source information to the content improves fake news classification accuracy. Hu et al. (2013) found that accounts created to spread fake news tend to have different social network characteristics.

Fact Checking: Fact checking websites such as politifact.com and snopes.com rely on manual verification to debunk fake news, but are unable to match the rate at which fake news is generated (Funke 2019). Automated fact checking aims to check claims against trustworthy sources such as Wikipedia (Ciampaglia et al. 2015). More recently, Thorne et al. (2018) proposed the FEVER shared task to verify an input claim against a database of 5 million Wikipedia documents and classify each claim into one of three classes: Supports, Refutes or Not Enough Info. Fact checking is a more principled approach to fake news detection. However, it requires an established corpus of verified facts and may not work for new claims with little evidence.

Community Response: Detecting fake news from community response is most closely related to our paper. Researchers have worked on automatically predicting the veracity of a claim by building classifiers that leverage the comments and replies to social media posts, as well as the propagation pattern (Enayet and El-Beltagy 2017; Castillo, Mendoza, and Poblete 2011; Ma, Gao, and Wong 2017; 2018b). Ma et al. (2018a) adopted a multi-task learning approach to build a classifier that learns stance-aware features for rumour detection. Similarly, Li et al. (2019) adopted a multi-task learning approach and included user information in their models. Chen et al. (2017) proposed to pool distinctive features that capture contextual variations of posts over time. Beyond linguistic features, other researchers have also looked at the demographics of users (Yang et al. 2012; Li, Zhang, and Si 2019) or interaction periodicity (Kwon et al. 2013) to determine the credibility of users. Yang et al. (2012) collected the characteristics of users engaged in spreading fake news and used only user characteristics to build a fake news detector by classifying the propagation path. Li et al. (2019) used a combination of user information and content features to train an LSTM with a multi-task learning objective.

In this paper, we work on detecting rumors and fake news solely from the post and its comments. Unlike Li, Zhang, and Si (2019) and Yang et al. (2012), we do not use the identities of user accounts. Most closely related to our work are Ma et al. (2016; 2017; 2018b) and Kumar and Carley (2019). Ma et al. (2016; 2017; 2018b) used recurrent neural networks, propagation tree kernels and recursive neural trees to model posts in the same thread, either sequentially or with a tree structure, to predict whether the sequence pertains to rumour or real news.
Kumar and Carley (2019) used several variants of LSTMs (branch LSTM, tree LSTM and binarized constituency tree LSTM). In this paper, instead of recursive tree models, we propose a transformer network for rumor detection.

3 Rumor Detection

In this section, we first define our problem statement. We address the problem of predicting the veracity of a claim given all of its responding tweets and the interactions between these tweets. We define each thread to be:

X = {x_1, x_2, x_3, ..., x_n},

where x_1 is the source tweet, x_i the i-th tweet in chronological order, and n the number of tweets in the thread.

Besides the textual information, there is also structural information in the conversation tree which could be exploited (Ma, Gao, and Wong 2018b; Wu, Yang, and Zhu 2015). In tree-structured models, a pair of tweets x_i and x_j are only related if either x_i is replying to x_j or vice versa. In all of our proposed models, we allow any post to attend to any other post in the same thread. In our structure aware models, we label the relation between any pair of tweets x_i and x_j with a relation label R(i, j) ∈ {parent, child, before, after, self}. The value of R(i, j) is obtained by applying the following rules in sequence: parent, if x_i is directly replying to x_j; child, if x_j is directly replying to x_i; before, if x_i comes before x_j; after, if x_i comes after x_j; self, if i = j.

The rumor detection task reduces to learning to map each (X, R) to its rumor class y. We conducted experiments on two rumor detection data sets, namely the Twitter15 and Twitter16 data and the PHEME 5 data. The classes differ across the data sets:

• for Twitter15 and Twitter16: y ∈ {non-rumor, false-rumor, true-rumor, unverified}, and

• for PHEME: y ∈ {false-rumor, true-rumor, unverified}.

4 Approaches

For completeness, we first provide a brief description of recursive neural networks and of the attention mechanism in transformer networks. We then describe our proposed models, which are primarily based on the attention mechanism.

4.1 Recursive Neural Networks

Ma et al. (2018b) applied tree-structured recursive neural networks (RvNN) to the rumor detection problem: each thread is represented as a tree where the root is the claim, and each comment is a child of the post it is responding to. In RvNN, the input to each node in the tree is a vector of tf-idf values of the words in the post. Ma et al. (2018b) studied two models, a bottom-up and a top-down tree model. Kumar and Carley (2019) applied a similar mechanism. The formulation is appealing, as we expect information from denying and questioning comments to be propagated and used for the classification of rumors.

However, as we will see in the data statistics in Table 1, the trees in the data sets are very shallow, with the majority of comments replying directly to the source tweet instead of to other tweets. We found that in social media, as the entire thread is often visible, a user replying to the root post might be continuing the conversation of earlier users rather than specifically writing a reply to the root post. Tree models that do not explicitly model every possible pairwise interaction between tweets are therefore sub-optimal for modelling social media conversations. We propose to overcome these limitations with the models described in the subsequent sections.

4.2 Transformer Networks

The attention mechanism in transformer networks enables effective modeling of long-range dependencies (Vaswani et al. 2017). This is an advantage for rumor detection, since there can be many replying tweets in a conversation and it is vital to model the interactions between the tweets effectively. We briefly discuss the multi-head attention (MHA) layer in the transformer network and refer the reader to (Vaswani et al. 2017) for more details. Each MHA layer in the transformer network is made up of a self-attention sublayer and a fully connected feed-forward sublayer. In the self-attention sublayer, a query and a set of key-value pairs are mapped to an output. The output is a weighted sum of the values, where the weight for each value is determined by the compatibility between the query and the corresponding key. The compatibility α_ij is the attention weight between a query from position i and a key from position j, and is calculated with simple scaled dot-product attention:

α_ij = Compatibility(q_i, k_j) = softmax(q_i k_j^T / √d_k)    (1)

The output at each position, z_i, is then a weighted sum of the values at the other positions:

z_i = Σ_{j=1}^{n} α_ij v_j,    (2)

where α_ij ∈ [0, 1] is the attention weight from Equation 1. A higher value indicates higher compatibility.

In order to allow the model to jointly attend to information from different representation subspaces at different positions, Vaswani et al. (2017) introduced the concept of multi-head attention. Queries, keys and values are projected h times with different learned linear projections. Each projected version of the queries, keys and values performs the attention function (Equation 2) in parallel, yielding h output values. These are concatenated and once again projected, generating the final values. The final values are then passed through a fully connected feed-forward sublayer consisting of two linear layers with a ReLU activation between them.

4.3 Post-Level Attention Network (PLAN)

The architecture of our post-level attention network (PLAN) is shown in Figure 2a. We flattened the structure of the conversation tree and arranged the tweets chronologically in a linear structure with the source tweet as the first tweet.
For our PLAN model, we applied max pooling to each tweet x_i in the linear structure to obtain its sentence representation x′_i. We then pass the sequence of sentence embeddings X′ = (x′_1, x′_2, ..., x′_n) through s multi-head attention (MHA) layers to model the interactions between the tweets. We refer to these MHA layers as post-level attention layers. This transforms X′ = (x′_1, x′_2, ..., x′_n) into U = (u_1, u_2, ..., u_n). Lastly, we use the attention mechanism to interpolate the tweets before passing the result through a fully connected layer for prediction:

α_k = softmax(γ^T u_k),    (3)

v = Σ_{k=1}^{n} α_k u_k,    (4)

p = softmax(W_p^T v + b_p),    (5)

where γ ∈ R^{d_model}, α_k ∈ R, W_p ∈ R^{d_model × K}, b_p ∈ R^K, u_k is the output after passing through the s MHA layers, K is the number of output classes, and v and p are the representation vector and prediction vector for X respectively.

[Figure 2: Proposed models. (a) The PLAN model: max-pooled tweet encodings pass through stacked post-level self-attention layers (each a self-attention sublayer and feed-forward sublayer with Add & Norm), are interpolated by attention, and classified with a softmax. (b) The StA-HiTPLAN model: token-level self-attention produces a sentence representation for each tweet, a time delay embedding is added, and tweet-level self-attention is applied before classification.]

4.4 Structure Aware Post-Level Attention Network (StA-PLAN)

One possible limitation of our model is that we lose structural information by organising tweets in a linear structure. Structural information that is inherently present in a conversation tree might still be useful for fake news detection. Tree models are superior in this respect, since the structural information is modelled explicitly. To combine the strengths of tree models and the self-attention mechanism, we extended our PLAN model to include structural information explicitly. We adopted the formulation of Shaw, Uszkoreit, and Vaswani (2018) to perform structure aware self-attention:

α_ij = softmax((q_i k_j^T + q_i (a^K_ij)^T) / √d_k),    (6)

z_i = Σ_{j=1}^{n} α_ij (v_j + a^V_ij)    (7)

Equation 6 extends Equation 1 by explicitly adding a^K_ij when computing compatibility. Likewise, Equation 7 extends Equation 2 by adding a^V_ij when computing the output vector. Both a^V_ij and a^K_ij are vectors that represent one of the five possible structural relationships between the pair of tweets (i.e. parent, child, before, after and self) as described in Section 3. These vectors are learned parameters in our model. We learn two distinct vectors to ensure their suitability for the two different equations. Intuitively, a^K_ij gives the compatibility function more information: compatibility is now determined by both the textual content and the structural relationship of a pair of tweets instead of solely the textual content. The addition of a^V_ij allows both structural and content information to be propagated to other tweets.

4.5 Structure Aware Hierarchical Token and Post-Level Attention Network (StA-HiTPLAN)

Our PLAN model performs max-pooling to get the sentence representation of each tweet. However, it could be more ideal to allow the model to learn the importance of the word vectors instead. Hence, we propose a hierarchical attention model: attention at the token level, then at the post level. An overview of the hierarchical model is shown in Figure 2b. Instead of using max-pooling to obtain the sentence representation, we perform token-level self-attention before using the attention mechanism to interpolate the output. This approach also learns a more complex sentence representation for each tweet. More formally, each tweet can be represented as a sequence of word tokens x_i = (x_{i,1}, x_{i,2}, ..., x_{i,|x_i|}). We pass the sequence of word tokens in a tweet through s_word MHA layers. This allows interactions between the tokens in a tweet, and we refer to these layers as token-level attention layers. After that, we use the attention mechanism to interpolate the output from the MHA layers to obtain a sentence representation for each tweet. The sentence embedding for each tweet is then used to perform structure aware post-level self-attention as described in Section 4.4.

4.6 Time Delay Embedding

Tweets created at different time intervals can be interpreted differently. Tweets expressing disbelief when a source claim is first created could be common, as the claim may not yet have been verified. However, doubtful tweets at a later stage of propagation could indicate a high likelihood that the source claim is fake. Therefore, we investigated the usefulness of encoding tweets with time delay information in all three of our proposed models: PLAN, StA-PLAN and StA-HiTPLAN.

To include time delay information for each tweet, we bin the tweets based on their latency from the time the source tweet was created. We set the total number of time bins to 100, and each bin represents a 10 minute interval. Tweets with a latency of more than 1,000 minutes fall into the last time bin. We used the positional encoding formula introduced with the transformer network in Vaswani et al. (2017) to encode each time bin.
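The binning scheme just described, together with the sinusoidal encoding it feeds into, can be sketched as follows (a minimal sketch; the function names are illustrative):

```python
import numpy as np

def time_bin(latency_minutes, bin_width=10, num_bins=100):
    """Map a tweet's latency (minutes since the source tweet) to a bin.

    Bins are 10-minute intervals; latencies beyond the last interval
    (1,000 minutes) all fall into the final bin, index 99.
    """
    return min(int(latency_minutes // bin_width), num_bins - 1)

def time_delay_embedding(pos, d_model):
    """Sinusoidal encoding of a time bin, following the positional
    encoding of Vaswani et al. (2017): sine on even dimensions,
    cosine on odd dimensions."""
    tde = np.zeros(d_model)
    i = np.arange(0, d_model, 2)            # even dimension indices
    angle = pos / (10000 ** (i / d_model))
    tde[0::2] = np.sin(angle)
    tde[1::2] = np.cos(angle)
    return tde

# A tweet posted 25 minutes after the source claim falls in bin 2;
# its encoding is added to the tweet's sentence embedding.
bin_idx = time_bin(25)
tde = time_delay_embedding(bin_idx, d_model=300)
```

The resulting vector has the same dimensionality as the sentence embedding, so the two can simply be summed.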
The time delay embedding is added to the sentence embedding of the tweet. The time delay embedding, TDE, for each tweet is:

TDE_(pos,2i) = sin(pos / 10000^(2i/d_model)),    (8)

TDE_(pos,2i+1) = cos(pos / 10000^(2i/d_model)),    (9)

where pos represents the time bin each tweet falls into, pos ∈ [0, 100), i refers to the dimension, and d_model refers to the total number of dimensions of the model.

5 Experiments and Results

We evaluate our models on two rumour detection data sets: (i) Twitter15 and Twitter16, and (ii) the PHEME 5 events data set. We show the statistics of the data sets in Table 1.

Table 1: Average tree depths, number of leaves and tweets.

Data set       Twitter15  Twitter16  PHEME
Tree depth     2.80       2.77       3.12
Num leaves     34.2       31.9       10.3
Num tweets     40.6       37.9       14.9
False          334        172        393
True           350        189        1008
Unverified     358        190        571
Non-rumor      371        205        -
Total trees    1413       756        1972
Total tweets   57,368     27,652     29,383

5.1 Data and Pre-processing

For the PHEME 5 data set, we follow the experimental setting of Kumar and Carley (2019) in using event-wise cross-validation. For the Twitter15 and Twitter16 data sets, there is a large proportion of retweets in each claim: 89% for Twitter15 and 90% for Twitter16. As we assume that retweets do not contribute new information to the model, we removed all retweets for Twitter15 and Twitter16. After the removal of retweets, we observed that a small number of claims were left with only the source tweet. Since the principle behind our methodology is that we exploit crowd signals for rumor detection, claims without any replies should be "unverified". Hence, we amended the labels of such claims to "unverified" in the training data (1.49% of Twitter15: 8 False, 10 True and 3 Non-Rumour; 1.46% of Twitter16: 9 False, 2 True and 0 Non-Rumour). In order for our results to be comparable with previous work, we excluded such claims from our test set. We used the original splits released by (Ma, Gao, and Wong 2018b) to split our data. We show the statistics of the data sets after pre-processing in Table 1.

5.2 Experimental Setup

In all experiments, we used GLOVE 300d embeddings (Pennington, Socher, and Manning 2014) to represent each token in a tweet. (Preliminary experiments with BERT (Devlin et al. 2018) did not improve results, and were computationally far more expensive than GLOVE.) We used the same set of hyper-parameters in all of our experiments for both PHEME and Twitter15 and Twitter16. Our model dimension is 300 and the dimension of the intermediate output is 600. We used 12 post-level MHA layers and 2 token-level MHA layers. For training, we used the ADAM optimizer with 6,000 warm start-up steps, an initial learning rate of 0.01 and 0.3 dropout. We used a batch size of 32 for PLAN and StA-PLAN, and 16 for StA-HiTPLAN due to memory limitations of the GPUs used. We compare the following models:

• PLAN: the post-level attention network of Section 4.3.

• StA-PLAN: the structure aware post-level attention network of Section 4.4.

• StA-HiTPLAN: the structure aware hierarchical attention model of Section 4.5.

We also investigate these models with and without the time-delay embedding described in Section 4.6. However, as time delay information did not improve upon PLAN and StA-PLAN for Twitter15 and Twitter16, we did not run experiments for StA-HiTPLAN + time delay on those data sets. For Twitter15 and Twitter16, we compared our proposed models with the RvNN models proposed by Ma et al. (2018b). As we use different preprocessing steps (described in Section 5.1) for Twitter15 and Twitter16, we retrained the RvNN models with our own implementation, and we report both the original results and the results from our implementation. For PHEME, we compare against the LSTM models proposed by Kumar and Carley (2019). As there were several combinations of embeddings and models in (Kumar and Carley 2019), we compare our results with the best results reported for each variant of LSTM model proposed.

We summarize our experimental results in Tables 2 and 3. For Twitter15 and Twitter16, all our models outperform tree recursive neural networks, and our best models improve on them by 14.2% for Twitter15 and 6.8% for Twitter16. For PHEME, our best model outperforms previous work by 1.6% F1-score.

5.3 Discussion of results

In the rest of this section, we analyze the performance of our proposed models on Twitter15, Twitter16 and PHEME. We found conflicting conclusions for the different data sets. In particular, although all of the data sets are similar in nature, the state-of-the-art performance on PHEME is far worse than that on Twitter15 and Twitter16. As such, we also provide analysis suggesting possible explanations for this disparity in results.

Structural Information: We proposed two methods, StA-PLAN and time-delay embeddings, that aim to capture structural and chronological information among posts. StA-PLAN is the best performer for Twitter15 but did not outperform PLAN for Twitter16, though the results were not substantially different. The reason the structure aware model only works for Twitter15 might be that the Twitter15 data set is substantially bigger (in terms of number of tweets; see Table 1) than both PHEME and Twitter16. A big data set might be necessary to exploit the complicated structural information in the structure aware model. Time-delay information was useful for PHEME, where all proposed models performed better with time delay.
Twitter15 Twitter16
Method Accuracy F T U NR Accuracy F T U NR
BU-RvNN (Original) 70.8 72.8 75.9 65.3 69.5 71.8 71.2 77.9 65.9 72.3
TD-RvNN (Original) 72.3 75.8 82.1 65.4 68.2 73.7 74.3 83.5 70.8 66.2
BU-RvNN (Ours) 70.5 71.0 72.1 73.0 65.5 80.6 75.5 89.3 83.0 73.4
TD-RvNN (Ours) 65.9 66.1 68.9 71.4 55.9 76.7 69.8 87.2 81.3 66.1
PLAN 84.5 85.8 89.5 80.2 82.3 87.4 83.9 91.7 88.8 85.3
StA-PLAN 85.2 84.6 88.4 83.7 84.0 86.8 83.3 92.7 88.8 82.6
StA-HiTPLAN 80.8 80.2 85.1 76.0 81.7 80.7 76.5 88.8 82.0 74.9
PLAN + time-delay 84.1 84.2 87.3 80.3 84.2 84.8 77.6 89.7 85.6 84.9
StA-PLAN + time-delay 85.0 85.7 88.3 81.4 84.4 86.6 83.3 92.3 86.6 84.2

Table 2: Accuracy on Twitter15 and Twitter16, where F, T, U and NR stands for False, True, Unverified and Non-rumor
respectively. We report the F1-Score for each individual class. The results of rows with (Original) were referenced from (Ma,
Gao, and Wong 2017), while the remaining rows are based on our own implementation of the models.

Method | Macro F-Score
Branch LSTM - Multitask | 35.9
Tree LSTM - Multitask | 37.9
BCTree LSTM - Multitask | 37.1
PLAN | 36.0
StA-PLAN | 34.9
StA-HiTPLAN | 37.9
PLAN + Time Delay | 38.6
StA-PLAN + Time Delay | 36.9
StA-HiTPLAN + Time Delay | 39.5
StA-HiTPLAN + Time Delay (Random split) | 77.4

Table 3: F1-score on PHEME. We used the same train-test splits as (Kumar and Carley 2019), except for the last row, and the results of the first three rows are taken from that paper.

Words | Twitter15 | Twitter16 | PHEME
"true" | 20.4 | 17.6 | 9.5
"real" | 34.4 | 25.4 | 11.7
"fake" | 14.1 | 10.9 | 2.2
"lies" | 5.9 | 17.6 | 2.1

Table 4: Percentage of claims containing each word.

performed better with time delay. On the other hand, time delay information was not useful for Twitter15 and Twitter16. Overall, it is unclear whether structural information is useful for these data sets, and we leave further investigation to future work.

Token level self-attention: We proposed a token-level self-attention mechanism to model the relationships between the tokens in a tweet with our StA-HiTPLAN model. StA-HiTPLAN was the best performer on PHEME but did not outperform our baseline model on Twitter15 and Twitter16. To investigate why StA-HiTPLAN performed best on PHEME, we hypothesize that the signals for rumor detection in PHEME are much weaker or expressed more implicitly, so it is necessary to study the interactions between words to better capture the meaning of the whole sentence. To this end, we identified individual words that could be useful in determining the veracity of a claim. We computed the statistics of the words "true", "real", "fake" and "lies" in the three data sets, as shown in Table 4. These words act as a proxy for crowd signals for the model to learn from, and we observe that the usage of such words is lowest in PHEME. It was also pointed out in (Kumar and Carley 2019) that most of the replying tweets in PHEME are largely neutral comments. These observations suggest that the crowd signal is weaker in PHEME, so token-level attention may have been necessary to do well on it. We provide an example where token-level attention is useful for PHEME in Figure 3. The important tweet shown in that example is, however, not straightforward: it requires inferring that "doesn't need facts to write stories" implies that the claim is fake. Token-level self-attention is therefore needed to accurately capture the meaning of the phrase.

We further analyze the performance of our model on the PHEME dataset in the section below.

5.4 Analyzing results for PHEME

In this section, we provide some error analysis and intuition on the disparity between the performance on Twitter15, Twitter16 and PHEME.

Out-of-domain classification One key difference between Twitter15, Twitter16 and PHEME is that the train-test split for PHEME was done at the event level, whereas Twitter15 and Twitter16 were split randomly. As there are no overlapping events in the training and testing data for PHEME, the model essentially has to learn to perform cross-domain classification. Since the model naturally learns event-specific features from the training set, those features hurt performance on a test set drawn from other events. To verify this hypothesis, we trained and tested the StA-HiTPLAN + time-delay model with train and test sets split randomly. As seen in Table 3, we obtained an F-score of 77.4 (37.9 higher than with the event split). Because splitting by events is the more realistic setting, we will explore methods to make our model event agnostic in future work.
(UNVERIFIED) "Surprising number of vegetarians secretly eat meat" (33 tweets in thread)
1. @HuffingtonPost ........ then they aren't vegetarians.
2. @HuffingtonPost this article is stupid. If they ever eat meat, they are not vegetarian.
3. @HuffingtonPost @laurenisaslayer LOL this could be a The Onion article

(TRUE) "Officials took away this Halloween decoration after reports of it being a real suicide victim. It is still unknown. URL" (46 tweets in thread)
1. @NotExplained how can it be unknown if the officials took it down...... They have to touch it and examine it
2. @NotExplained did anyone try walking up to it to see if it was real or fake? this one seems like an easy case to solve
3. @NotExplained thats from neighbours

(FALSE) "CTV News confirms that Canadian authorities have provided US authorities with the name Michael Zehaf-Bibeau in connection to Ottawa shooting" (5 tweets in thread)
1. @inky mark @CP24 as part of a co-op criminal investigation one would URL doesn't need facts to write stories it appears.
2. @CP24 I think that soldiers should be armed and wear protective vests when they are on guard any where.
3. @CP24 That name should not be mentioned again.

Table 5: Samples of tweet-level explanations for each claim. Tweets are sorted by the number of times each was identified as the most relevant tweet to the most important tweet, and we show the top three. The count after each claim gives the number of tweets in the thread.
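The sorting rule used to produce Table 5 can be sketched as follows. This is a simplified, hypothetical illustration of the procedure detailed in Section 5.5: the attention weights below are invented, and the function names are ours.

```python
# Hypothetical sketch of ranking the top explanation tweets in Table 5.
from collections import Counter

def explanation_tweets(final_attn, layer_self_attn, top_k=3):
    """final_attn[i]: attention on tweet i at the final layer.
    layer_self_attn[l][i][j]: self-attention from tweet i to tweet j
    at multi-head-attention (MHA) layer l.
    Returns indices of the top_k tweets most often identified as
    most relevant to the most important tweet."""
    # Most important tweet: highest attention weight at the final layer.
    impt = max(range(len(final_attn)), key=final_attn.__getitem__)
    # Most relevant tweet to the important one at each MHA layer.
    relevant = [max(range(len(layer[impt])), key=layer[impt].__getitem__)
                for layer in layer_self_attn]
    # Rank tweets by how often they were identified as most relevant.
    counts = Counter(relevant)
    return [t for t, _ in counts.most_common(top_k)]

final_attn = [0.1, 0.6, 0.2, 0.1]  # tweet 1 is the most important
layer_self_attn = [
    [[.4, .1, .3, .2], [.1, .2, .6, .1], [.3, .3, .2, .2], [.25] * 4],
    [[.4, .1, .3, .2], [.1, .1, .7, .1], [.3, .3, .2, .2], [.25] * 4],
    [[.4, .1, .3, .2], [.5, .1, .3, .1], [.3, .3, .2, .2], [.25] * 4],
]
print(explanation_tweets(final_attn, layer_self_attn))  # prints [2, 0]
```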
5.5 Explaining the predictions
One key advantage of our model is that we can examine its attention weights to interpret its predictions. We illustrate how to generate both token-level and tweet-level explanations for our predictions.
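As a minimal illustration of the attention weights we inspect, the sketch below computes single-head scaled dot-product self-attention weights over invented post encodings. The actual model uses multi-head attention with learned query/key projections; this simplified version only shows where the weight matrix we interpret comes from.

```python
# Single-head, unprojected scaled dot-product self-attention sketch.
# The 3-post, 4-dimensional encodings are invented for illustration.
import math

def attention_weights(X):
    """Rows of X are post encodings; entry (i, j) of the result is how
    much post i attends to post j (rows are softmax distributions)."""
    d = len(X[0])
    scores = [[sum(a * b for a, b in zip(xi, xj)) / math.sqrt(d)
               for xj in X] for xi in X]
    weights = []
    for row in scores:
        m = max(row)                       # subtract max for stability
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        weights.append([v / z for v in e])
    return weights

X = [[1.0, 0.0, 1.0, 0.0],   # source claim
     [0.9, 0.1, 0.8, 0.0],   # reply similar to the claim
     [0.0, 1.0, 0.0, 1.0]]   # unrelated reply
A = attention_weights(X)
assert A[0][1] > A[0][2]  # the claim attends more to the similar reply
```

Inspecting rows of such a matrix is what lets us read off which posts (or tokens) the model considered related when forming its prediction.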
Post-Level Explanations As described in Section 4, we used the self-attention mechanism to aggregate information among tweets. The amount of information a tweet propagates to another tweet is weighted by the relatedness between the pair, measured by the self-attention weight between them: a higher self-attention weight implies higher relatedness. We then use the attention mechanism to interpolate tweets in the final layer before the prediction layer. This generates a representation vector for the claim that is used for prediction. The attention weight for each tweet indicates the level of importance the model has placed on that tweet for prediction: a higher attention weight implies higher importance. Therefore, to generate the explanations for each claim, we obtain the tweet with the highest attention weight at the final layer; this gives us the most important tweet, tweet_impt, for the prediction. After that, we obtain the most relevant tweet, tweet_rel,i, to tweet_impt at the i-th MHA layer. We do so by finding the tweet with the highest self-attention weight with tweet_impt at each MHA layer. The same tweet can be identified as the most relevant tweet multiple times, so we rank each tweet by the number of times it was identified as a relevant tweet. The top three tweets form the explanation for that claim. Table 5 shows examples of the top three tweets: we see replies that cast doubt on, or confirm, the claim. These replies were accurately identified and used by our model in debunking or confirming the factuality of claims despite the large number of other tweets present in the conversation.

Token-Level Explanation As described in Section 5.5, we can extract the tweets important to a prediction; these tweets provide tweet-level explanations for our model. Following the steps described, we obtained "@inky mark @CP24 as part of a co-op criminal investigation one would URL doesn't need facts to write stories it appears." as the most important tweet for correctly predicting the fake claim "CTV News confirms that Canadian authorities have provided US authorities with the name Michael Zehaf-Bibeau in connection to Ottawa shooting". This claim was obtained from the PHEME dataset. As shown in Figure 3, most tokens in the tweet have equally high attention weights with the tokens at the end of the tweet. A further examination of the attention weights shows that high weights were placed on the phrase "facts to write stories it appears". As such, "facts to write stories it appears" can be deemed an important phrase that explains the prediction of our model for this claim.

Figure 3: Heatmap showing the important tokens with token-level self-attention for a fake claim in PHEME. A lighter colour means higher importance.

6 Conclusion

We have proposed three models that outperform state-of-the-art models on three data sets. Our model utilizes the self-attention mechanism to model pairwise interactions between posts. We utilized the attention mechanism to provide possible explanations of the predictions by extracting the important posts that led to each prediction. We also investigated mechanisms to capture structure and time information.
In this paper, we focused only on data with community response. Recent papers that perform rumor detection with user identity information have shown superior results. Lastly, another direction in fake news detection is fact checking against reliable sources. Fact checking and rumor detection could provide complementary information and could be done jointly. We will consider this in future work.

Acknowledgement

We would like to thank all the anonymous reviewers for their help and insightful comments. This piece of work was done when Qian Zhong was at SMU.

References

Castillo, C.; Mendoza, M.; and Poblete, B. 2011. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web.
Chen, T.; Wu, L.; Li, X.; Zhang, J.; Yin, H.; and Wang, Y. 2017. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. CoRR abs/1704.05973.
Ciampaglia, G. L.; Shiralkar, P.; Rocha, L. M.; Bollen, J.; Menczer, F.; and Flammini, A. 2015. Computational fact checking from knowledge networks. CoRR abs/1501.03471.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Enayet, O., and El-Beltagy, S. R. 2017. NileTMRG at SemEval-2017 task 8: Determining rumour and veracity support for rumours on twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).
Fuller, C. M.; Biros, D. P.; and Wilson, R. L. 2009. Decision support for determining veracity via linguistic-based cues. Decision Support Systems 46(3).
Funke, D. 2019. Snopes pulls out of its fact-checking partnership with Facebook.
Gunther, R.; Beck, P. A.; and Nisbet, E. C. 2018. Fake news did have a significant impact on the vote in the 2016 election: Original full-length version with methodological appendix. Columbus, OH: Ohio State University.
Hu, X.; Tang, J.; Zhang, Y.; and Liu, H. 2013. Social spammer detection in microblogging. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence.
Kochkina, E.; Liakata, M.; and Zubiaga, A. 2018. PHEME dataset for rumour detection and veracity classification.
Kumar, S., and Carley, K. M. 2019. Tree LSTMs with convolution units to predict stance and rumor veracity in social media conversations. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, ACL 2019.
Kwon, S.; Cha, M.; Jung, K.; Chen, W.; and Wang, Y. 2013. Prominent features of rumor propagation in online social media. In ICDM, 1103-1108. IEEE Computer Society.
Li, Q.; Zhang, Q.; and Si, L. 2019. Rumor detection by exploiting user credibility information, attention and multi-task learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B. J.; Wong, K.-F.; and Cha, M. 2016. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16.
Ma, J.; Gao, W.; and Wong, K.-F. 2017. Detect rumors in microblog posts using propagation structure via kernel learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Ma, J.; Gao, W.; and Wong, K.-F. 2018a. Detect rumor and stance jointly by neural multi-task learning. In Companion Proceedings of The Web Conference 2018.
Ma, J.; Gao, W.; and Wong, K.-F. 2018b. Rumor detection on twitter with tree-structured recursive neural networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Mihalcea, R., and Strapparava, C. 2009. The lie detector: Explorations in the automatic recognition of deceptive language. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers.
O'Brien, N.; Latessa, S.; Evangelopoulos, G.; and Boix, X. 2018. The language of fake news: Opening the black-box of deep learning based detectors. In Proceedings of the NeurIPS 2018 Workshop on AI for Social Good.
Ott, M.; Choi, Y.; Cardie, C.; and Hancock, J. T. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).
Rapoza, K. 2017. Can 'fake news' impact the stock market?
Rubin, V.; Conroy, N.; Chen, Y.; and Cornwell, S. 2016. Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection.
Sharma, K.; Qian, F.; Jiang, H.; Ruchansky, N.; Zhang, M.; and Liu, Y. 2019. Combating fake news: A survey on identification and mitigation techniques. ACM Transactions on Intelligent Systems and Technology (TIST) 10(3):21.
Shaw, P.; Uszkoreit, J.; and Vaswani, A. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers).
Thorne, J.; Vlachos, A.; Cocarascu, O.; Christodoulopoulos, C.; and Mittal, A. 2018. The fact extraction and verification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. CoRR abs/1706.03762.
Wang, W. Y. 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, 422-426.
Wu, K.; Yang, S.; and Zhu, K. Q. 2015. False rumors detection on sina weibo by propagation structures. 2015 IEEE 31st International Conference on Data Engineering.
Yang, F.; Liu, Y.; Yu, X.; and Yang, M. 2012. Automatic detection of rumor on sina weibo. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics.