How Not To Predict Elections
Abstract—Using social media for political discourse is increasingly becoming common practice, especially around election time. Arguably, one of the most interesting aspects of this trend is the possibility of "pulsing" the public's opinion in near real-time and, thus, it has attracted the interest of many researchers as well as news organizations. Recently, it has been reported that predicting electoral outcomes from social media data is feasible and, in fact, quite simple to compute. Positive results have been reported on a few occasions, but without an analysis of what principle enables them. This, however, should be surprising given the significant differences in the demographics between likely voters and users of online social networks.

This work aims to test the predictive power of social media metrics against several Senate races of the two recent US Congressional elections. We review the findings of other researchers and try to duplicate them, both in terms of data volume and sentiment analysis. Our research aim is to shed light on why predictions of electoral (or other social) events using social media might or might not be feasible. In this paper, we offer two conclusions and a proposal: First, we find that electoral predictions using the published research methods on Twitter data are not better than chance. Second, we reveal some major challenges that limit the predictability of election results through data from social media. We propose a set of standards that any theory aiming to predict elections (or other social events) using social media should follow.

I. INTRODUCTION

In recent years, the use of social media for communication has dramatically increased. Research has shown that 22% of adult internet users were engaged with the political campaign on Twitter, Facebook and Myspace in the months leading up to the November 2010 US elections [1]. Empowered by the APIs that many social media companies make available, researchers are engaged in an effort to analyze and make sense of the data collected through these social communication channels. Theoretically, social media data, if used correctly, can lead to predictions of events in the near future influenced by human behavior. To describe this phenomenon, [2] talk about "predicting the future" while [3] have coined the term "predicting the present". In fact, researchers have reported that the volume of Twitter chat over time can be used to predict several kinds of consumer metrics, such as the likelihood of success of new movies before their release [2] and the marketability of consumer goods [4]. These predictions are explained by the perceived ability of Twitter chat volume and Google Search Trends to monitor and record general social trends as they occur.

Being able to make predictions based on publicly available data would have numerous benefits in areas such as health (e.g., predictions of flu epidemics [5], [6]), business (e.g., prediction of box-office success of movies [7] and product marketability [4]), economics (e.g., predictions on stock market trends and housing market trends [3], [8], [9]), and politics (e.g., trends in public opinion [10]), to name a few.

However, there have also been reports on Twitter's ability to predict with amazing accuracy the voting results in the recent 2009 German elections [11] and in the 2010 US Congressional elections [12]. Given the significant differences in the demographics between likely voters and users of social networks [1], questions arise as to the underlying operating principle enabling these predictions. Could it be simply a matter of coincidence, or is there a reason why general trends are as accurate as specific demographics? Should we expect these methods to be accurate again in future elections? These are the questions we seek to address with our work.

The rest of this paper is organized as follows: The next section II reviews past research on electoral predictions using social media data. Section III describes a number of new experiments we conducted testing the predictability of the last two rounds of US elections based on Twitter volume and sentiment analysis. Section IV describes a set of standards that any methodology of electoral predictions should follow in order to be consistently competent against the statistical sampling methods employed by professional pollsters. The final section V presents our conclusions and proposes new lines of research.

II. PREDICTING PAST ELECTIONS

In the previous section we mentioned some of the attempts to use Twitter and Google Trends for predictions of real world outcomes and external market events. What about the important area of elections? One would expect that, following the previous research literature (e.g., [11], [12]), and given the high utilization that the Web and online social networks have in the US [1], Twitter volume should have been able to predict consistently the outcomes of the US Congressional elections. Let us examine the instances and methods behind past claims of electoral prediction and discuss their predictive power.
A. Claims that Social Media Data predicted elections

The word "prediction" means foreseeing the outcome of events that have not yet occurred. In this sense, the authors are not aware of any publications or claims that, using social media data, someone was able to propose a method that would predict correctly and consistently the results of elections before the elections happened. What has happened, however, is that on several occasions, post processing of social media data has resulted in claims that they might have been able to make correct electoral predictions. Such claims are discussed in the following subsection.

B. Claims that Social Media Data could have predicted elections

Probably due to the promising results achieved by many of the projects and studies discussed in section I, there is a relatively high amount of hype surrounding the feasibility of predicting electoral results using social media. It must be noted that most of that hype is fueled by traditional media and blogs, usually bursting prior to and after electoral events. For example, shortly after the recent 2010 elections in the US, flamboyant statements made it to the news media headlines, ranging from those arguing that Twitter is not a reliable predictor (e.g., [13]) to those claiming just the opposite, that Twitter (and Facebook) was remarkably accurate (e.g., [14]). Moreover, the degree of accuracy of these "predictions" was usually assessed in terms of the percentage of correctly guessed electoral races – e.g., the winners of 74% of the US House and 81% of the US Senate races were predicted [15] – without further qualification. Such qualifications are important since a few US races are won by very tight margins, while most of them are won with comfortable margins. These predictions were not compared against traditional ways of prediction, such as professional polling methods, or even trivial prediction methods based on incumbency (the fact that those who are already in office are far more likely to be re-elected in the US).

Compared to the media coverage, the number of scholarly works on the feasibility of predicting popular opinion and elections from social media is relatively small. Nevertheless, it does tend to support a positive opinion on the predictive power of social media as a promising line of research, while exposing some caveats of the methods. Thus, according to [16], the number of Facebook fans for election candidates had a measurable influence on their respective vote shares. These researchers assert that "social network support, on Facebook specifically, constitutes an indicator of candidate viability of significant importance [...] for both the general electorate and even more so for the youngest age demographic."

A study of a different kind was conducted by [10]. They analyzed the way in which simple sentiment analysis methods could be applied to tweets as a tool for automatically pulsing public opinion. These researchers correlated the output of such a tool with the temporal evolution of different indices such as the index of Consumer Sentiment, the index of Presidential Job Approval, and several pre-electoral polls for the US 2008 Presidential Race. The correlation with the first two indices was rather high but it was not significant for the pre-electoral polls, and they conclude that sentiment analysis on Twitter data seems to be a promising field of research to replace traditional polls although, they find, it is not quite there yet.

The work by [11] focuses directly on whether Twitter can serve as a predictor of electoral results. In that paper, a strong statement is made about predictability, namely that "the mere number of tweets mentioning a political party can be considered a plausible reflection of the vote share and its predictive power even comes close to traditional election polls." In fact, they report a mean average error (MAE) of only 1.65%. Moreover, these researchers found that co-occurrence of political party mentions accurately reflected close political positions between political parties and plausible coalitions.

More recently, [12] used the tweets sent by the electoral candidates, not the general public, and reported success in "building a model that predicts whether a candidate will win or lose with accuracy of 88.0%". While this concluding statement seems strong, a closer look at the claims reveals that they found their model to be less successful, as they admit that "applying this technique, we correctly predict 49 out of 63 (77.7%) of the races".

C. Claims that Social Media Data did not predict the elections

The previous subsection reveals some inconsistencies with the electoral predictions in scholarly publications. While candidate counts of Twitter messages predicted with remarkable accuracy the electoral results in Germany in 2009 [11], a more elaborate method did not correlate well with pre-electoral polls in the US 2008 Presidential elections [10]. Could it be that some of those results were just a matter of chance or the side-effect of technical problems? Who is right?

The work by [17] focuses on the use of Google search volume (not Twitter) as a predictor for the 2008 and 2010 US Congressional elections. They divided the electoral races into groups depending on the degree to which they were contested by the candidates, and they find that only a few groups of races were "predicted" above chance using Google Trends – in one case achieving 81% of correct results. However, they report that those promising results were achieved by chance: while the best group's predictions were good in 2008 (81%), for the same group the predictions were very poor in 2010 (34%). Importantly, even when the predictions were better than chance, they were not competent compared to the trivial method of predicting through incumbency. For example, in 2008, 91.6% of the races were won by incumbents. Even in 2010, in elections with major public discontent, 84.5% of the races were won by incumbents. Given that, historically, the incumbent candidate gets re-elected about 9 out of 10 times, the baseline for any competent predictor should be the incumbent re-election rate. According to such a baseline, Google search volume proves to be a poor electoral predictor. Compared to professional pollsters (e.g., The New York Times), the predictions were far worse; and, in some groups of races the predictions were even worse than chance!
In [18], the sentiment analysis methods of [10] and [11] are applied to tweets obtained during the US 2008 Presidential elections (Obama vs. McCain). [18] assigned a voting intention to every individual user in the dataset, along with the user's geographical location. Thus, electoral predictions were computed for different states instead of simply the whole of the US, and it was found that every method examined would have largely overestimated Obama's victory, predicting (incorrectly) that Obama would have won even in Texas. In addition, [18] provides some suggestions on the way in which such data could be filtered to improve prediction accuracy. In this sense, it points out that demographic bias in the user base of Twitter and other social media services is an important electoral factor and, therefore, bias in data should be corrected according to user demographic profiles.

Recently, [19] provided a thorough response to the work of [11], arguing that those authors relied on a number of arbitrary choices which make their method virtually useless for future elections. They point out that, by taking into account all of the parties running for the elections, the method by [11] would actually have predicted a victory for the Piratenpartei (Pirate Party), which received 2% of the votes but no seats in the German parliament.

In this paper we decided to examine more closely the claims of electoral predictions described in the previous subsection. Since we had collected Twitter data from the US Congressional elections in 2010, we were in a position to examine whether the methods proposed were as successful in instances other than the ones they were developed for. Moreover, we wanted to analyze why electoral predictions using social media may (or may not) be possible. In the next section III we describe our computational experiments and in section IV we analyze the operating models behind electoral predictions.

III. NEW EXPERIMENTS ON TWITTER AND ELECTIONS

For our study, we used two data sets related to elections that took place in the US during 2010. Predictions were calculated based on Twitter chatter volume, as in [11], and then based on sentiment analysis of tweets, in ways similar to [10]. While we did not have comparable data to examine the methods of [12], we discuss some of its findings in the next section.

The first data set we used belongs to the 2010 US Senate special election in Massachusetts ("MAsen10"), a highly contested race between Martha Coakley (D) and Scott Brown (R). The data set contains 234,697 tweets contributed by 56,165 different Twitter accounts, collected with the use of the Twitter streaming API, configured to retrieve near real-time tweets containing the name of either of the two candidates. The collection took place from January 13 to January 20, 2010, the day after the election.

The second data set contains all the tweets provided by the Twitter "gardenhose" in the week from October 26 to November 1, the day before the general US Congressional elections on November 2, 2010 ("USsen10"). The gardenhose provides a uniform sampling of the Twitter data. The daily snapshots contained between 5.6 and 7.7 million tweets. Using the names of candidates for five highly contested races for the US Senate, 13,019 tweets were collected, contributed by 6,970 different Twitter accounts.

These two datasets are different. MAsen10 is an almost complete set of tweets, while USsen10 provides a random sample; but because of its randomness, it should accurately represent the volume and nature of tweets during that pre-election week.

The first prediction method we examined is the one described by [11], which consists of counting the number of tweets mentioning each candidate. According to that study, the proportion of tweets mentioning each candidate should closely reflect the actual vote share in the election. Tweets containing the names of both candidates were not included; only tweets mentioning one candidate at a time were considered.

The second prediction method extends the ideas from [10], which described a way to compute a sentiment score for a topic being discussed on Twitter. To that end, [10] relied on the subjectivity lexicon collected by [20] and labeled tweets containing any positive word as positive tweets, and the ones containing any negative word as negative tweets. Then, the sentiment score is defined to be the ratio between the number of positive and negative tweets. It must be noted that, according to [10], the number of polarized words in the tweet is not important, and tweets can simultaneously be considered as positive and negative. In addition, sentiment scores for topics with very different volumes of tweets are not easily comparable. Because of these issues, some changes had to be made to [10]'s approach in order to compute predicted vote shares. In our study, the lexicon employed is also that of [20], but tweets are considered either positive or negative, not both. Every tweet is labeled as positive, negative, or neutral, based on the sum of such labeled words (positive words contribute +1, while negative words contribute -1). A tweet is labeled neutral when the sum of polarized words is 0, or when no contributing words appear in it. Given the two-party nature of the races, the vote share is calculated with this formula:

vote share(c1) = (pos(c1) + neg(c2)) / (pos(c1) + neg(c1) + pos(c2) + neg(c2))    (1)

where c1 is the candidate for whom support is being computed while c2 is the opposing candidate; pos(c) and neg(c) are, respectively, the number of positive and negative tweets mentioning candidate c.
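To make the two prediction methods concrete, the following Python sketch shows one way they could be computed from a collection of tweets. It is only an illustration of the procedures described above, not the code used in our experiments: tokenization is deliberately naive, and the pos_words and neg_words arguments stand in for the subjectivity lexicon of [20].

from collections import Counter

def label_tweet(text, pos_words, neg_words):
    # Sum of polarized words: each positive word contributes +1, each negative word -1.
    tokens = text.lower().split()          # naive tokenization
    score = sum((t in pos_words) - (t in neg_words) for t in tokens)
    return "pos" if score > 0 else ("neg" if score < 0 else "neut")

def predict_race(tweets, cand1, cand2, pos_words, neg_words):
    # tweets: iterable of tweet texts; cand1, cand2: candidate name strings.
    volume = Counter()      # tweets mentioning exactly one candidate
    polarity = Counter()    # (candidate, label) counts
    for text in tweets:
        m1 = cand1.lower() in text.lower()
        m2 = cand2.lower() in text.lower()
        if m1 == m2:        # skip tweets mentioning both candidates or neither
            continue
        cand = cand1 if m1 else cand2
        volume[cand] += 1
        polarity[(cand, label_tweet(text, pos_words, neg_words))] += 1

    # Method 1: share of tweet volume, as in [11].
    volume_share = 100.0 * volume[cand1] / (volume[cand1] + volume[cand2])

    # Method 2: sentiment-based vote share, Equation (1).
    pos1, neg1 = polarity[(cand1, "pos")], polarity[(cand1, "neg")]
    pos2, neg2 = polarity[(cand2, "pos")], polarity[(cand2, "neg")]
    sentiment_share = 100.0 * (pos1 + neg2) / (pos1 + neg1 + pos2 + neg2)
    return volume_share, sentiment_share

Applied to the MAsen10 pre-election tweets, these two values would correspond to the pre-election rows of Tables I and II below.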
A. Results of Applying the Prediction Methods

For the MAsen10 data it was possible to make a more detailed analysis, since the data contained tweets before the election day (6 days of data), the election day (20 hours of data), and post-election (10 hours of data). The 47,368 tweets that mentioned both candidates were not used.

Table I shows the number of tweets mentioning each candidate and the election results predicted from the volume. The total count of tweets we collected (53.25% - 46.75% in favor of Brown) reflects closely the election outcome (Brown 51.9% - Coakley 47.1%). Correct prediction?
                        Coakley              Brown
                        #tweets      %       #tweets      %
 Pre-elec. (6 days)      52,116    53.86      44,654    46.14
 Elec. day (20 hrs)      21,076    49.94      21,123    50.06
 Post-elec. (10 hrs)     14,381    29.74      33,979    70.26
 Total                   87,573    46.75      99,756    53.25

TABLE I
THE SHARE OF TWEETS FOR EACH CANDIDATE IN THE MASEN10 DATA SET. NOTICE THAT THE PRE-ELECTION SHARE DIDN'T PREDICT THE FINAL RESULT (BROWN WON 51.9% OF THE VOTES).

                  Coakley    Brown
 Pre-election      46.5%     53.5%
 Election-day      44.25%    55.8%
 Post-election     27.2%     72.8%
 All               41.0%     59.0%

TABLE II
PREDICTIONS BASED ON VOTE SHARE FOR MASEN10 DATA SET BASED ON SENTIMENT ANALYSIS. THE PRE-ELECTION PREDICTION CORRECTLY PREDICTS BROWN AS THE WINNER WITH A SMALL ERROR (1.1% FOR CORRECTED ELECTION RESULTS, ALSO SEE TABLE III).
We refrained from declaring victory in the predictive power of Twitter when we realized that the volume share for the pre-election period actually predicted a win for Coakley, not Brown. Table I also shows how the number of tweets was affected by electoral events. Brown received 1/3 of all his mentions in the 10 hours post-election, when everyone started talking about his win, an important win that would have repercussions for the health care reform, a major issue at the time. Brown's win broke the filibuster-proof power of Democrats in the US Senate and produced a lot of tweets.

While the simple Twitter share of pre-election tweets couldn't predict the result of the MAsen10 election, applying sentiment analysis to tweets and calculating the vote share with Equation (1) comes close to the electoral results, as shown in Table II. For a second time in our research effort we refrained from declaring victory in Twitter's power in predicting elections, and decided to take a closer look at our data.

The two prediction methods were further applied to five other highly contested Senate races from the USsen10 data set. The results of the 6 races are summarized in Table III. The actual results of the election do not always sum up to 100% because in a few races more than two candidates participated. So, in order to calculate the mean average error (MAE), the results were normalized to sum up to 100%. Using the values of the corrected election results, MAE values were calculated for both methods. The Twitter volume method had an error of 17.1%, while the sentiment analysis had an error of 7.6%. In other words, both MAE values are unacceptably high. Each method was able to correctly predict the winner in only half of the races.
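To illustrate this calculation, the sketch below (our illustration, not the original analysis script) normalizes the official two-candidate results and recomputes the MAE of both methods from the figures transcribed from Table III; up to rounding, it reproduces the reported 17.1% and 7.6%.

# (Democrat, Republican) official percentages per race, transcribed from Table III.
official  = {"MA": (47.1, 51.9), "CO": (48.1, 46.4), "NV": (50.3, 44.5),
             "CA": (52.2, 44.2), "KY": (44.3, 55.7), "DE": (56.6, 40.0)}
# Predicted share of the Democratic candidate, also from Table III.
volume    = {"MA": 53.9, "CO": 26.3, "NV": 51.2, "CA": 57.9, "KY": 4.7, "DE": 32.1}
sentiment = {"MA": 46.5, "CO": 63.3, "NV": 48.4, "CA": 47.8, "KY": 43.1, "DE": 38.8}

def normalize(dem, rep):
    # Rescale the two-candidate result so that the two shares sum to 100%.
    return 100.0 * dem / (dem + rep)

def mae(predicted):
    # Mean absolute error of the predicted Democratic share vs. the normalized result.
    errors = [abs(predicted[race] - normalize(*official[race])) for race in official]
    return sum(errors) / len(errors)

print(round(mae(volume), 1), round(mae(sentiment), 1))   # 17.1 7.6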
B. Sentiment Analysis Accuracy

The results in Table III show that while both prediction methods are correct only half of the time, MAE is smaller for the sentiment analysis method. This difference was intriguing and we decided to study it closer. While a thorough evaluation of the accuracy of sentiment analysis regarding political conversation is out of the scope of this paper, some evidence on the issues affecting simple methods based on polarity lexicons is provided from three different angles:

1) Compared against manually labeled tweets: To evaluate the accuracy of the above described sentiment analysis method, a set of tweets was manually assigned to one of the following labels: opposing Brown, opposing Coakley, supporting Brown, supporting Coakley, or neutral. This set of tweets was chosen to reflect "one tweet, one vote": from the set of Twitter users that had indicated their location in the state of Massachusetts, we chose users with a single tweet in the corpus. This set contains 2,259 tweets. We read the tweets and manually assigned labels to them. Our labels were compared against those assigned by the automatic method, producing the confusion matrix in Table IV.

                        POS    NEG    NEUT    Accuracy
 opposing Brown         124     76     150      21.71%
 opposing Coakley        70     67     105      27.68%
 supporting Brown       216     45     254      41.94%
 supporting Coakley     240     72     213      45.71%
 neutral                249     82     296      47.20%
                                                36.85%

TABLE IV
CONFUSION MATRIX FOR THE EVALUATION OF THE AUTOMATIC SENTIMENT ANALYSIS COMPUTED AGAINST A MANUALLY LABELED SET OF TWEETS.

The results show that the accuracy of the sentiment analysis is only 36.85%, slightly better than a classifier randomly assigning the same three labels (positive, negative, and neutral).
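The figures in Table IV can be checked with a few lines of code. The sketch below is our reconstruction and assumes that a positive automatic label counts as correct for the "supporting" classes, a negative label for the "opposing" classes, and a neutral label for the "neutral" class; under that assumption it reproduces the Accuracy column (up to rounding) and shows that the overall 36.85% figure corresponds to the unweighted mean of the five per-class accuracies.

# Rows of Table IV: manual label -> counts for the automatic labels (POS, NEG, NEUT)
# and the automatic label taken as correct for that manual label.
table_iv = {
    "opposing Brown":     ((124, 76, 150), "NEG"),
    "opposing Coakley":   ((70,  67, 105), "NEG"),
    "supporting Brown":   ((216, 45, 254), "POS"),
    "supporting Coakley": ((240, 72, 213), "POS"),
    "neutral":            ((249, 82, 296), "NEUT"),
}
columns = ("POS", "NEG", "NEUT")

per_class = {label: 100.0 * counts[columns.index(correct)] / sum(counts)
             for label, (counts, correct) in table_iv.items()}

print({label: round(acc, 2) for label, acc in per_class.items()})
print(round(sum(per_class.values()) / len(per_class), 2))   # 36.85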
2) Effect of misleading propaganda: A second evaluation was performed on a particular set of tweets, namely those included in a "Twitter bomb" targeted at Coakley [21], containing a series of tweets spreading misleading information about her. The corpus used in this study contained 925 tweets that were part of the Twitter bomb. According to the automatic sentiment analysis, 369 of them were positive messages, 212 were neutral, and only 344 were negative. While all of these tweets were part of an orchestrated smearing campaign against Coakley, most of them were characterized as neutral or even positive by the automatic sentiment analysis.

Therefore, we conclude that by just relying on polarity lexicons the subtleties of propaganda and disinformation are not only missed but even wrongly interpreted.

3) Relation to presumed political leaning: Finally, an additional experiment was conducted to test the assumption underlying this application of sentiment analysis, namely, that the political preference of users can be derived from their tweets. To derive the political preference from the tweets, for every user, the corresponding tweets were grouped together and their accumulated polarity score was attributed to the user. The presumed political orientation of a user was calculated following the approach described by [22].
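A minimal sketch of this last experiment is given below, under stated assumptions: labeled_tweets pairs each tweet's author with the "pos"/"neg"/"neut" label produced by the lexicon-based classifier sketched earlier, and ada_by_user maps each user to the averaged ADA score attributed to that user following [22] (the ADA scores are described below); Pearson's r, as reported in Table V, is computed with numpy.

import numpy as np
from collections import defaultdict

def accumulate_user_polarity(labeled_tweets):
    # labeled_tweets: iterable of (user, label) pairs, where label is the
    # "pos"/"neg"/"neut" output of the lexicon-based classifier sketched earlier.
    contribution = {"pos": 1, "neg": -1, "neut": 0}
    scores = defaultdict(int)
    for user, label in labeled_tweets:
        scores[user] += contribution[label]
    return scores

def leaning_correlation(polarity_by_user, ada_by_user):
    # Pearson's r between a user's accumulated tweet polarity and the averaged
    # ADA score (0 = most conservative, 100 = most liberal) attributed to that
    # user, as reported in Table V.
    users = sorted(set(polarity_by_user) & set(ada_by_user))
    x = np.array([polarity_by_user[u] for u in users], dtype=float)
    y = np.array([ada_by_user[u] for u in users], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

Table V reports this correlation computed separately for the opinion on each candidate and for the overall voting orientation.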
 State   Senate Race                     Election Result   Normalized Result   Twitter Volume   Sentiment Analysis
 MA      Coakley [D] vs. Brown [R]       47.1% - 51.9%     47.6% - 52.4%       53.9% - 46.1%    46.5% - 53.5%
 CO      Bennet [D] vs. Buck [R]         48.1% - 46.4%     50.9% - 49.1%       26.3% - 73.7%    63.3% - 36.7%
 NV      Reid [D] vs. Angle [R]          50.3% - 44.5%     53.1% - 46.9%       51.2% - 48.8%    48.4% - 51.6%
 CA      Boxer [D] vs. Fiorina [R]       52.2% - 44.2%     54.1% - 45.9%       57.9% - 42.1%    47.8% - 52.2%
 KY      Conway [D] vs. Paul [R]         44.3% - 55.7%     44.3% - 55.7%       4.7% - 95.3%     43.1% - 56.9%
 DE      Coons [D] vs. O'Donnell [R]     56.6% - 40.0%     58.6% - 41.4%       32.1% - 67.9%    38.8% - 61.2%

TABLE III
THE SUMMARY OF ELECTORAL AND PREDICTED RESULTS FOR 6 HIGHLY CONTESTED SENATE RACES. NUMBERS IN BOLD SHOW RACES WHERE THE WINNER WAS PREDICTED CORRECTLY BY THE TECHNIQUE. BOTH TWITTER VOLUME AND SENTIMENT ANALYSIS METHODS WERE ABLE TO PREDICT CORRECTLY 50% OF THE RACES. IN THIS SAMPLE, INCUMBENTS WON IN ALL THE RACES THEY RAN (NV, CA, CO), AND 84.5% OF ALL 2010 RACES.

The approach of [22] makes use of the ADA scores, which range from 0 (most conservative) to 100 (most liberal). ADA (Americans for Democratic Action) is a liberal political think-tank that publishes scores for each member of the US Congress according to their voting record on key progressive issues. Official Twitter accounts for 210 members of the House and 68 members of the Senate were collected. Then, the Twitter followers of all these accounts were collected, and every user received the average

                                            Pearson's r
 Opinion on Brown vs Avg. ADA scores        -0.150799848
 Opinion on Coakley vs Avg. ADA scores      +0.09304417
 Voting orientation vs Avg. ADA scores      -0.178902764

TABLE V
CORRELATION BETWEEN AVERAGED ADA SCORES (WHICH PURPORTEDLY REFLECT USERS' POLITICAL PREFERENCE) AND THE OPINIONS ON THE TWO CANDIDATES AND THE VOTING ORIENTATION. THE CORRELATIONS FOUND ARE CONSISTENT WITH THE INITIAL HYPOTHESES BUT TOO WEAK TO BE USEFUL.