Interpretable Rumor Detection in Microblogs by Attending To User Interactions

Ling Min Serena Khoo, Hai Leong Chieu
DSO National Laboratories
12 Science Park Drive, Singapore 118225
{klingmin,chaileon}@dso.org.sg

Zhong Qian, Jing Jiang
Singapore Management University
80 Stamford Road, Singapore 178902
qianzhongqz@163.com, jingjiang@smu.edu.sg
[Figure 2: Proposed models. (a) PLAN Model: the source tweet and replies X0, ..., Xn are max-pooled into sentence encodings, combined with time delay embeddings, and passed through stacked self-attention blocks (Self-Attention, Add & Norm, Feed Forward). (b) StA-HiTPLAN Model: token-level self-attention encodes each tweet before the tweet-level self-attention layers.]
... linear structure with the source tweet as the first tweet. For our PLAN model, we apply max-pooling to each tweet x_i in the linear structure to obtain its sentence representation x'_i. We then pass the sequence of sentence embeddings X' = (x'_1, x'_2, ..., x'_n) through s multi-head attention (MHA) layers to model the interactions between the tweets; we refer to these MHA layers as post-level attention layers. This transforms X' = (x'_1, x'_2, ..., x'_n) into U = (u_1, u_2, ..., u_n). Lastly, we use the attention mechanism to interpolate the tweets before passing the result through a fully connected layer for prediction:

α_k = softmax(γ^T u_k),   (3)

v = Σ_k α_k u_k,   (4)

p = softmax(W_p^T v + b_p),   (5)

where γ ∈ R^{d_model}, α_k ∈ R, W_p ∈ R^{d_model × K}, b_p ∈ R^K, u_k is the output after passing through the s MHA layers, K is the number of output classes, and v and p are the representation vector and the prediction vector for X respectively.
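As a concrete illustration of Equations 3-5, the following Python sketch (PyTorch-style; the class and variable names are ours, not from the paper's released code) computes one attention weight per tweet over the post-level MHA outputs, interpolates them into a thread representation, and produces class probabilities.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Sketch of Eqs. 3-5: interpolate post-level MHA outputs and classify."""
    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.randn(d_model))    # γ in Eq. 3
        self.classifier = nn.Linear(d_model, num_classes)  # W_p, b_p in Eq. 5

    def forward(self, U: torch.Tensor) -> torch.Tensor:
        # U: (n_tweets, d_model), the outputs u_1, ..., u_n of the s MHA layers
        alpha = F.softmax(U @ self.gamma, dim=0)            # Eq. 3: one weight per tweet
        v = (alpha.unsqueeze(-1) * U).sum(dim=0)            # Eq. 4: thread representation
        return F.softmax(self.classifier(v), dim=-1)        # Eq. 5: prediction vector p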
4.4 Structure Aware Post-Level Attention Network (StA-PLAN)

One possible limitation of our model is that we lose structural information by organising tweets in a linear structure. Structural information that is inherently present in a conversation tree might still be useful for fake news detection. Tree models are superior in this aspect since the structural information is modelled explicitly. To combine the strengths of tree models and the self-attention mechanism, we extended our PLAN model to include structural information explicitly. We adopted the formulation of Shaw, Uszkoreit, and Vaswani (2018) to perform structure-aware self-attention:

α_ij = softmax((q_i k_j^T + a^K_ij) / √d_k),   (6)

z_i = Σ_{j=1}^{n} α_ij (v_j + a^V_ij),   (7)

Equation 6 extends Equation 1 by adding a^K_ij when computing the attention weights, and Equation 7 extends Equation 2 by adding a^V_ij when computing the output vector. Both a^V_ij and a^K_ij are vectors that encode the structural relationship between tweets x_i and x_j, as described in Section 3. These vectors are learned parameters in our model. We learn two distinct vectors to ensure that each is suitable for its use in the two different equations. Intuitively, a^K_ij gives the compatibility function more information to better determine compatibility: compatibility is now determined by both the textual content and the structural relationship of a pair of tweets, instead of solely textual content. The addition of a^V_ij allows both structural and content information to be propagated to other tweets.
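A minimal single-head sketch of Equations 6 and 7 follows. The number of relation types and the use of the dot product q_i · a^K_ij as the structural score (following Shaw, Uszkoreit, and Vaswani 2018) are our assumptions, and the names are illustrative rather than the authors' implementation.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureAwareAttention(nn.Module):
    """Single-head sketch of Eqs. 6-7 with learned relation vectors a^K, a^V."""
    def __init__(self, d_model: int, num_relations: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One a^K and one a^V vector per pairwise structural relation type
        self.a_k = nn.Embedding(num_relations, d_model)
        self.a_v = nn.Embedding(num_relations, d_model)

    def forward(self, X: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        # X: (n, d_model) tweet embeddings; rel: (n, n) relation ids for each tweet pair
        q, k, v = self.q(X), self.k(X), self.v(X)
        n, d_k = X.size(0), q.size(-1)
        content = q @ k.transpose(0, 1)                            # q_i k_j^T term of Eq. 6
        structure = torch.einsum('id,ijd->ij', q, self.a_k(rel))   # structural term for a^K_ij
        alpha = F.softmax((content + structure) / math.sqrt(d_k), dim=-1)
        values = v.unsqueeze(0).expand(n, -1, -1) + self.a_v(rel)  # v_j + a^V_ij of Eq. 7
        return torch.einsum('ij,ijd->id', alpha, values)           # z_i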
4.5 Structure Aware Hierarchical Token and Post-Level Attention Network (StA-HiTPLAN)

Our PLAN model performs max-pooling to get the sentence representation of each tweet. However, it could be more ideal to allow the model to learn the importance of the word vectors instead. Hence, we propose a hierarchical attention model: attention at a token level and then at a post level. An overview of the hierarchical model is shown in Figure 2b. Instead of using max-pooling to obtain the sentence representation, we perform token-level self-attention before using the attention mechanism to interpolate the output. This approach also lets the model learn a more complex sentence representation for each tweet. More formally, each tweet can be represented as a sequence of word tokens x_i = (x_{i,1}, x_{i,2}, ..., x_{i,|x_i|}). We pass the sequence of word tokens in a tweet through s_word MHA layers; this allows interactions between the tokens in a tweet, and we refer to these layers as token-level attention layers. We then use the attention mechanism to interpolate the output of the MHA layers and obtain a sentence representation for each tweet. The sentence embedding of each tweet is then used to perform structure-aware post-level self-attention as described in Section 4.4.
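The token-level stage can be sketched as follows (a PyTorch-style illustration under our own naming; padding masks and dropout are omitted for brevity). It produces one sentence embedding per tweet, which would then feed the post-level layers of Section 4.4.

import torch
import torch.nn as nn

class TokenLevelEncoder(nn.Module):
    """Sketch of StA-HiTPLAN's token-level stage: MHA over word vectors,
    then attention pooling over tokens (reusing the Eq. 3-4 mechanism)."""
    def __init__(self, d_model: int, n_heads: int = 4, s_word: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2 * d_model,
                                           batch_first=True)
        self.token_mha = nn.TransformerEncoder(layer, num_layers=s_word)
        self.gamma = nn.Parameter(torch.randn(d_model))  # token-pooling vector

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (n_tweets, max_len, d_model) padded word embeddings of one thread
        h = self.token_mha(tokens)                       # token-level self-attention
        alpha = torch.softmax(h @ self.gamma, dim=1)     # importance of each token
        return (alpha.unsqueeze(-1) * h).sum(dim=1)      # (n_tweets, d_model) sentence embeddings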
4.6 Time Delay Embedding

Tweets created at different time intervals could be interpreted differently. Tweets expressing disbelief when a source claim is first created could be common, as the claim may not have been verified yet. However, doubtful tweets at a later stage of propagation could indicate a high tendency that the source claim is fake. Therefore, we investigated the usefulness of encoding tweets with time delay information in all three of our proposed models: PLAN, StA-PLAN and StA-HiTPLAN.

To include time delay information for each tweet, we bin the tweets based on their latency from the time the source tweet was created. We set the total number of time bins to 100, and each bin represents a 10-minute interval; tweets with a latency of more than 1,000 minutes fall into the last time bin. We used the positional encoding formula introduced with the transformer network (Vaswani et al. 2017) to encode each time bin. The time delay embedding is added to the sentence embedding of each tweet. The time delay embedding, TDE, for each tweet is:

TDE_{pos,2i} = sin(pos / 10000^{2i/d_model}),   (8)

TDE_{pos,2i+1} = cos(pos / 10000^{2i/d_model}),   (9)

where pos represents the time bin each tweet falls into, pos ∈ [0, 100), i refers to the dimension, and d_model refers to the total number of dimensions of the model.
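The binning and sinusoidal encoding of Equations 8 and 9 amount to the following short sketch (the function name and defaults are ours).

import numpy as np

def time_delay_embedding(latency_minutes: float, d_model: int = 300,
                         bin_width: float = 10.0, num_bins: int = 100) -> np.ndarray:
    """Sketch of Eqs. 8-9: map a tweet's latency to one of 100 ten-minute bins,
    then encode the bin index with the transformer's sinusoidal formula."""
    pos = min(int(latency_minutes // bin_width), num_bins - 1)  # latency > 1000 min -> last bin
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    tde = np.empty(d_model)
    tde[0::2] = np.sin(angles)   # Eq. 8: even dimensions
    tde[1::2] = np.cos(angles)   # Eq. 9: odd dimensions
    return tde                   # added to the tweet's sentence embedding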
Data set        Twitter15   Twitter16   PHEME
Tree depth      2.80        2.77        3.12
Num leaves      34.2        31.9        10.3
Num tweets      40.6        37.9        14.9
False           334         172         393
True            350         189         1008
Unverified      358         190         571
Non-rumor       371         205         -
Total trees     1413        756         1972
Total tweets    57,368      27,652      29,383

Table 1: Data set statistics: average tree depth, average number of leaves and tweets per claim, class distribution, and totals.
5 Experiments and Results

We evaluate our model on two rumour detection data sets: (i) Twitter15 and Twitter16, and (ii) the PHEME 5-events data set. We show the statistics of the data sets in Table 1.

5.1 Data and Pre-processing

For the PHEME 5-events data set, we follow the experimental setting of Kumar and Carley (2019) in using event-wise cross-validation. For the Twitter15 and Twitter16 data sets, there is a large proportion of retweets in each claim: 89% for Twitter15 and 90% for Twitter16. As we assume that retweets do not contribute new information to the model, we removed all retweets for Twitter15 and Twitter16. After the removal of retweets, we observed that a small number of claims were left with only the source tweet. Since the principle behind our methodology is that we exploit crowd signals for rumor detection, claims without any replies should be “unverified”. Hence, we amended the labels of such claims to “unverified” in the training data (1.49% of Twitter15: 8 False, 10 True and 3 Non-Rumour; 1.46% of Twitter16: 9 False, 2 True and 0 Non-Rumour). In order for our results to be comparable with previous work, we excluded such claims from the test set. We used the original splits released by Ma, Gao, and Wong (2018b) to split our data. We show the statistics of the data sets after pre-processing in Table 1.
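The retweet filtering and relabelling described above can be sketched as follows (the data layout is an assumption of ours, not the authors' released pre-processing code).

def preprocess_claim(claim: dict, is_training: bool):
    """Sketch of the Twitter15/16 pre-processing in Section 5.1: drop retweets,
    then handle claims that are left with only the source tweet."""
    replies = [t for t in claim["replies"] if not t.get("is_retweet", False)]
    cleaned = {**claim, "replies": replies}
    if not replies:                          # only the source tweet remains
        if is_training:
            cleaned["label"] = "unverified"  # amended label in the training data
        else:
            return None                      # excluded from the test set
    return cleaned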
5.2 Experimental Setup

In all experiments, we used GloVe 300d embeddings (Pennington, Socher, and Manning 2014) to represent each token in a tweet. (Preliminary experiments with BERT (Devlin et al. 2018) did not improve results and were computationally far more expensive than GloVe.) We used the same set of hyper-parameters in all of our experiments for both PHEME and Twitter15/Twitter16. Our model dimension is 300 and the dimension of the intermediate output is 600. We used 12 post-level MHA layers and 2 token-level MHA layers. For training, we used the Adam optimizer with 6,000 warm-up steps, an initial learning rate of 0.01 and 0.3 dropout. We used a batch size of 32 for PLAN and StA-PLAN, and 16 for StA-HiTPLAN due to memory limitations of the GPUs used. We compare the following models:

• PLAN: The post-level attention network in Section 4.3.
• StA-PLAN: The structure-aware post-level attention network in Section 4.4.
• StA-HiTPLAN: The structure-aware hierarchical attention model in Section 4.5.

We also investigate these models with and without the time-delay embedding described in Section 4.6. However, as time delay information did not improve upon PLAN and StA-PLAN for Twitter15 and Twitter16, we did not run experiments for StA-HiTPLAN + time delay on Twitter15 and Twitter16. For Twitter15 and Twitter16, we compared our proposed models with the RvNN models of Ma, Gao, and Wong (2018b). As we use different pre-processing steps for Twitter15 and Twitter16 (described in Section 5.1), we retrained the RvNN models with our own implementation and report both the original results and the results from our implementation. For PHEME, we compare against the LSTM models proposed by Kumar and Carley (2019); as several combinations of embeddings and models were reported there, we compare our results with the best result reported for each LSTM variant.

We summarize our experimental results in Tables 2 and 3. For Twitter15 and Twitter16, all our models outperform the tree recursive neural networks, and our best models outperform them by 14.2% for Twitter15 and 6.8% for Twitter16. For PHEME, our best model outperforms previous work by 1.6% F1-score.
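For reference, the stated setup corresponds roughly to the following configuration (the dictionary keys are ours; the values are taken from the text above).

# Hyper-parameters reported in Section 5.2, shared across all data sets.
CONFIG = {
    "word_embedding": "GloVe-300d",
    "d_model": 300,
    "d_intermediate": 600,
    "post_level_mha_layers": 12,
    "token_level_mha_layers": 2,
    "optimizer": "Adam",
    "warmup_steps": 6000,
    "initial_learning_rate": 0.01,
    "dropout": 0.3,
    "batch_size": {"PLAN": 32, "StA-PLAN": 32, "StA-HiTPLAN": 16},
}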
Method    Twitter15: Accuracy F T U NR    Twitter16: Accuracy F T U NR
BU-RvNN (Original) 70.8 72.8 75.9 65.3 69.5 71.8 71.2 77.9 65.9 72.3
TD-RvNN (Original) 72.3 75.8 82.1 65.4 68.2 73.7 74.3 83.5 70.8 66.2
BU-RvNN (Ours) 70.5 71.0 72.1 73.0 65.5 80.6 75.5 89.3 83.0 73.4
TD-RvNN (Ours) 65.9 66.1 68.9 71.4 55.9 76.7 69.8 87.2 81.3 66.1
PLAN 84.5 85.8 89.5 80.2 82.3 87.4 83.9 91.7 88.8 85.3
StA-PLAN 85.2 84.6 88.4 83.7 84.0 86.8 83.3 92.7 88.8 82.6
StA-HiTPLAN 80.8 80.2 85.1 76.0 81.7 80.7 76.5 88.8 82.0 74.9
PLAN + time-delay 84.1 84.2 87.3 80.3 84.2 84.8 77.6 89.7 85.6 84.9
StA-PLAN + time-delay 85.0 85.7 88.3 81.4 84.4 86.6 83.3 92.3 86.6 84.2
Table 2: Accuracy on Twitter15 and Twitter16, where F, T, U and NR stand for False, True, Unverified and Non-rumor respectively. We report the F1-score for each individual class. The results of rows marked (Original) are taken from Ma, Gao, and Wong (2018b), while the remaining rows are based on our own implementation of the models.
Method                                     Macro F-score
Branch LSTM - Multitask                    35.9
Tree LSTM - Multitask                      37.9
BCTree LSTM - Multitask                    37.1
PLAN                                       36.0
StA-PLAN                                   34.9
StA-HiTPLAN                                37.9
PLAN + Time Delay                          38.6
StA-PLAN + Time Delay                      36.9
StA-HiTPLAN + Time Delay                   39.5
StA-HiTPLAN + Time Delay (Random split)    77.4

Table 3: Macro F1-score on PHEME. We used the same train-test splits as Kumar and Carley (2019), except for the last row, and the results of the first three rows are taken from that paper.

5.3 Discussion of results

In the rest of this section, we analyze the performance of our proposed models on Twitter15, Twitter16 and PHEME. We found conflicting conclusions for the different data sets. In particular, although all of the data sets are similar in nature, the state-of-the-art performance on PHEME is far worse than that on Twitter15 and Twitter16. As such, we also provide analysis suggesting possible explanations for this disparity in results.

Structural Information: We proposed two methods, StA-PLAN and time-delay embeddings, that aim to capture structural and chronological information among posts. StA-PLAN is the best performer for Twitter15 but did not outperform PLAN for Twitter16, though the results were not substantially different. The reason the structure-aware model only works for Twitter15 might be that the Twitter15 data set is substantially bigger (in terms of number of tweets, see Table 1) than both PHEME and Twitter16; a big data set might be necessary to exploit the complicated structural information in the structure-aware model. Time-delay information was useful for PHEME, where all proposed models performed better with time delay. On the other hand, time-delay information was not useful for Twitter15 and Twitter16. Overall, it is unclear if structural information is useful for these data sets, and we leave further investigation to future work.

Token-level self-attention: We proposed a token-level self-attention mechanism to model the relationship between tokens in a tweet with our StA-HiTPLAN model. StA-HiTPLAN was the best performer for PHEME but did not outperform our baseline model for Twitter15 and Twitter16. To investigate why StA-HiTPLAN was the best performer for PHEME, we hypothesize that the signals for rumor detection in PHEME could be much weaker or expressed in a more implicit manner, so it is necessary to study the interactions between words to better capture the meaning of the whole sentence. To this end, we identified individual words that could be useful in determining the veracity of a claim and computed the statistics of the words “true”, “real”, “fake” and “lies” in the three data sets, as shown in Table 4. These words act as a proxy for crowd signals for the model to learn from, and we observe that the usage of such words is lowest in PHEME. It was also pointed out in (Kumar and Carley 2019) that most of the replying tweets in PHEME were largely neutral comments. These observations suggest that there is a weaker crowd signal in PHEME; hence, token-level attention might have been necessary to do well on PHEME. We provide an example where token-level attention is useful for PHEME in Figure 3. The important tweet shown in the example is, however, not straightforward: it requires inferring that “doesn’t need facts to write stories” implies that the claim is fake. Therefore, token-level self-attention is required to accurately capture the meaning of the phrase.

We further analyze the performance of our model on the PHEME data set in the section below.

Words     Twitter15   Twitter16   PHEME
“true”    20.4        17.6        9.5
“real”    34.4        25.4        11.7
“fake”    14.1        10.9        2.2
“lies”    5.9         17.6        2.1

Table 4: Percentage of claims containing each word.
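The statistics in Table 4 amount to counting, for each data set, the fraction of claims whose thread mentions a given word; a minimal sketch (the exact tokenisation is our assumption) is shown here.

def keyword_claim_percentage(claims, keyword: str) -> float:
    """Percentage of claims whose thread contains `keyword` (cf. Table 4)."""
    def mentions(claim) -> bool:
        texts = [claim["source_text"]] + [r["text"] for r in claim["replies"]]
        return any(keyword in text.lower().split() for text in texts)
    return 100.0 * sum(mentions(c) for c in claims) / len(claims)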
5.4 Analyzing results for PHEME

In this section, we provide some error analysis and intuition on the disparity between the performance for Twitter15, Twitter16 and PHEME.

Out-of-domain classification: One key difference between Twitter15/Twitter16 and PHEME is that the train-test split for PHEME was done at the event level, whereas Twitter15 and Twitter16 were split randomly. As there are no overlapping events in the training and testing data for PHEME, the model essentially has to learn to perform cross-domain classification. As the model would naturally have learnt event-specific features from the training set, these event-specific features would result in poorer performance on a test set from another event. To verify our hypothesis, we trained and tested the StA-HiTPLAN + time-delay model by splitting the train and test sets randomly. As seen in Table 3, we were able to obtain an F-score of 77.4, which is 37.9 points higher than with the event split. Because splitting by events is a more realistic setting, we will explore methods to make our model event-agnostic in future work.
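Table 5 below illustrates the model's tweet-level explanations. One plausible way to extract them from the post-level attention weights, under our reading of the table caption (this is our sketch, not the authors' exact procedure), is to count how often each tweet is the most-attended tweet and keep the top three.

import numpy as np

def top_explanatory_tweets(attention: np.ndarray, k: int = 3):
    """Count how often each tweet is the most-attended ("most relevant") tweet
    across the thread's post-level attention rows; return the top-k indices."""
    scores = attention.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)          # ignore self-attention
    most_relevant = scores.argmax(axis=1)      # most relevant tweet for each tweet
    counts = np.bincount(most_relevant, minlength=scores.shape[0])
    return list(np.argsort(-counts)[:k])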
(UNVERIFIED) Claim: Surprising number of vegetarians secretly eat meat (33 tweets in thread)
1. @HuffingtonPost ........ then they aren't vegetarians.
2. @HuffingtonPost this article is stupid. If they ever eat meat, they are not vegetarian.
3. @HuffingtonPost @laurenisaslayer LOL this could be a The Onion article

(TRUE) Claim: Officials took away this Halloween decoration after reports of it being a real suicide victim. It is still unknown. URL (46 tweets in thread)
1. @NotExplained how can it be unknown if the officials took it down...... They have to touch it and examine it
2. @NotExplained did anyone try walking up to it to see if it was real or fake? this one seems like an easy case to solve
3. @NotExplained thats from neighbours

(FALSE) Claim: CTV News confirms that Canadian authorities have provided US authorities with the name Michael Zehaf-Bibeau in connection to Ottawa shooting (5 tweets in thread)
1. @inky mark @CP24 as part of a co-op criminal investigation one would URL doesn't need facts to write stories it appears.
2. @CP24 I think that soldiers should be armed and wear protective vests when they are on guard any where.
3. @CP24 That name should not be mentioned again.

Table 5: Samples of tweet-level explanation for each claim. For each claim, tweets are sorted by the number of times they were identified as the most relevant tweet to the most important tweet, and the top three are shown; the number of tweets in each thread is given in parentheses.