Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM 2016)
Fusing Audio, Textual, and Visual Features for Sentiment Analysis of News Videos

Moises H. R. Pereira (1), Flavio L. C. Padua (2), Adriano C. M. Pereira (3), Fabrício Benevenuto (4), Daniel H. Dalip (5)

(1, 5) Engineering and Technology Institute, University Center of Belo Horizonte (UNI-BH), Belo Horizonte, MG, Brazil
(2) Department of Computing, CEFET-MG, Belo Horizonte, MG, Brazil
(3, 4) Department of Computer Science, Federal University of Minas Gerais (UFMG), Belo Horizonte, MG, Brazil

moises.ramos@prof.unibh.br, cardeal@decom.cefetmg.br, {adrianoc, fabricio}@dcc.ufmg.br, dhasan@prof.unibh.br
Abstract

This paper presents a novel approach to perform sentiment analysis of news videos, based on the fusion of audio, textual and visual clues extracted from their contents. The proposed approach aims at contributing to the semiodiscursive study of the construction of the ethos (identity) of this media universe, which has become a central part of the modern-day lives of millions of people. To achieve this goal, we apply state-of-the-art computational methods for (1) automatic emotion recognition from facial expressions, (2) extraction of modulations in the participants' speech and (3) sentiment analysis of the closed captions associated with the videos of interest. More specifically, we compute features such as the visual intensities of recognized emotions, the field sizes of participants, voicing probability, sound loudness, speech fundamental frequencies and the sentiment scores (polarities) of the text sentences in the closed captions. Experimental results on a dataset containing 520 annotated news videos from three Brazilian and one American popular TV newscasts show that our approach achieves an accuracy of up to 84% in the sentiment (tension level) classification task, demonstrating its high potential to be used by media analysts in several applications, especially in the journalistic domain.

Introduction

The newscast is the backbone of all television networks in the world. In order to transmit credibility during a TV newscast, a spectacularization of the information by the journalists can often be observed, as a way to prepare the mental model of the viewers (Goffman 1981).

As the enunciative communication process is singular (i.e. usually conducted with just one interlocutor), the information subject chooses the stylistic resources that he or she will use in a specific communication situation, such as vocal modulations and corporal expressivity, allowing the newscast to take a stand on what is being reported and to establish an identification with the viewer (Eisenstein, Barzilay, and Davis 2008). In addition, TV newscasts show the story tags followed by news items with high or moderate emotional content, varying the emotional tension as more news is presented during the program. Generally, the last news items have content with low emotional tension, except in moments of great social upheaval (Pereira, Padua, and David-Silva 2015).

Newscast analysis is essential to media analysts in several domains, especially in the journalism field (Stegmeier 2012). Since the newscast is a specific type of discourse and a sociocultural practice, discourse analysis techniques (Charaudeau 2002) have been applied to analyse the newscast structure at many levels of description, concerning properties such as its general thematics, enunciation schemes and discourse style as newscast production dimensions (Cheng 2012).

Normally, speeches are analysed without the support of computational tools such as automated annotation software and video analytics programs. Only recently, with the development of areas such as sentiment analysis, computational linguistics, multimedia systems and computer vision, have new methods been proposed to support discourse analysis, especially for multimedia content such as TV newscasts (Pantti 2010). However, to the best of our knowledge, no previous effort has attempted to use multimodal features (e.g. audio, textual and visual features) to measure the tension of the news. Moreover, approaches that infer tension are also applicable to inferring the importance of news items, and hence to organizing and summarizing the news.

As a first step towards this goal, we present here a computational approach to support the study of tension levels in news from the multimodal features available in their videos. The proposed approach allows researchers to perform a semiodiscursive analysis of the verbal and non-verbal languages that are manifested through the facial expressions and gestural movements of the journalists in front of the camera, which are a visual way of expressing their ideas (Eisenstein, Barzilay, and Davis 2008). In our experiments, we show the effectiveness of the proposed approach, as well as the importance of sentiment analysis for inferring high-tension news.

Copyright (c) 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

Tension and Sentiment Analysis in News

Many people read online news on the websites of the major communication portals. These news websites need to create effective strategies to draw people's attention to their contents. In this context, Reis et al. (2015) investigate the strategies used by online news organizations in designing their headlines. They analysed the content of 69,907 headlines produced by four major media companies over a minimum of eight consecutive months in 2014. They find that news with a negative sentiment tends to generate many views as well as negative comments. In their analysis of the results, the authors point out that the greater the negative tension of the news, the greater the need the user feels to give their opinion. On controversial subjects, such an opinion is very likely to be a comment that contradicts the opinion of another user, promoting discussions on the posts.

Pantti (2010) studies the emotional value of the journalists' expressions in newscasts, as opposed to public emotions. This work provides evidence of how journalists evaluate the role and position of emotions in media coverage and the empathy induced by the news. Furthermore, it studies how the journalists' emotional discourse is linked to their idea of good journalism and to their professional image. To accomplish this, the author used a set of interviews with television professionals who work on utility programs and advertising news in Finland and the Netherlands.

Multimodal Sentiment Analysis

For multimedia files, it is not enough to process only one information modality in order to perform sentiment analysis assertively on this content. In this context, Poria et al. (2016) presented an innovative approach to multimodal sentiment analysis which consists of collecting the sentiment of videos on the Web through a model that fuses the audio, visual and textual modalities as information resources. A feature vector was created in a fusion approach at the feature and decision levels, obtaining a precision of around 80%, an increase of more than 20% in precision compared with state-of-the-art systems.

Maynard, Dupplaw, and Hare (2013) describe an approach for sentiment analysis of social media content, combining opinion mining on text and on multimedia resources (e.g. images, videos) and focusing on entity and event recognition to help archivists select material for inclusion in social media archives, in order to resolve ambiguity and provide more contextual information. They use Natural Language Processing (NLP) tools and a rule-based approach for the text, addressing issues inherent to social media, such as grammatically incorrect text and the use of profanity and sarcasm.

Audio and Video Emotion Recognition

Ekman and Friesen (1978) showed evidence that facial expressions of emotions can be inferred from rapid face signs. These signals are characterized by changes in facial appearance that last seconds or fractions of a second. The authors formulated a model of basic emotions, called the Facial Action Coding System (FACS), based on six facial expressions (happiness, surprise, disgust, anger, fear and sadness) which are found in many cultures and presented in the same way, from children to seniors. The singularity points were mapped to each type of facial expression through tests on a vast image database.

Regarding the recognition of prosodic features in the sound modulations of audio signals, Eyben et al. (2013) presented the development of openSMILE, a framework for extracting features of emotional speech, music and sounds in general from videos and audio signals. Activity detection, voice monitoring and face detection are also resources offered by this framework.

We can observe that there is a demand for analyzing the emotional content of videos and news in many types of media. In this context, this paper applies robust techniques that make it possible to automatically determine emotional tension levels, using sentiment analysis techniques on news videos for the content-based analysis of news programs.

The Proposed Approach

This section presents the multimodal features used and our approach to combining textual, visual and audio information in order to compute the tension levels in the narrative content of the events shown in the videos.

Multimodal Features

In this work, the multimodal features are organized into two groups: (1) the audiovisual clues and (2) the sentiment scores of the textual sentences obtained from the closed caption (textual information). The audiovisual clues are represented by the visual intensity of the recognized emotion, the participant's field size and the prosodic features of the audio signal, which correspond to aspects of speech that go beyond the phonemes and deal with sound quality: voicing probability, loudness and fundamental frequency.

Visual Intensity: This feature was measured based on the output margin, in other words, the distance to the separating hyperplane of the classifiers used in Bartlett et al. (2006) under a one-vs-all approach, where a linear SVM classifier was used for each of the modeled emotions (happiness, surprise, aversion, contempt, anger, fear and sadness). The visual intensity of the emotion is computed for each frame of the news video under analysis. In frames where no face was detected, the system records the emotion as Nonexistent.

Participants' Field Size: We calculate the ratio between the recognition area of the largest face detected in the frame and the current frame resolution to extract the proportion of the participants' faces in the camera placement. In the TV newscast universe, when there are several individuals at different field sizes of camera placement, the focus falls on the individual with the closest field size, that is, the one whose face occupies the largest area of the frame. Furthermore, the larger the area that the face occupies in the frame, the greater the emotional intensity intended to be conveyed as a communication strategy of the program in its varied use of field sizes during the exhibition (Gutmann 2012).

Voicing Probability: Indicates the probability of a tone differentiation in the speech at the next moment.

Sound Loudness: Reflects the perception of the loudness of the sound wave by the human ear, measured in decibels (dB).

Fundamental Frequency: Corresponds to the first harmonic of a sound wave; it is the most influential frequency for the perception of a particular sound and one of the main elements for characterizing the voice (Eyben et al. 2013).
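The two face-based clues above reduce to small computations once a face detector and the per-emotion linear SVMs of Bartlett et al. (2006) have produced bounding boxes and hyperplane margins. The sketch below is illustrative only; the function names and input layout are our own assumptions, not part of the original system.

```python
def visual_intensity(margins):
    """One-vs-all reading of emotion intensity: `margins` maps each
    modeled emotion to its signed distance from that emotion's linear
    SVM hyperplane. The recognized emotion is the one with the largest
    margin, and that margin is taken as the intensity. An empty dict
    means no face was detected in the frame."""
    if not margins:
        return ("Nonexistent", 0.0)
    emotion = max(margins, key=margins.get)
    return (emotion, margins[emotion])


def field_size(face_boxes, frame_w, frame_h):
    """Participant's field size: ratio between the area of the largest
    detected face (boxes given as (x, y, w, h)) and the frame area."""
    if not face_boxes:
        return 0.0
    largest = max(w * h for (_x, _y, w, h) in face_boxes)
    return largest / (frame_w * frame_h)
```

For example, a 1280x720 frame containing a 200x240-pixel face yields a field-size ratio of about 0.052; a close-up would yield a much larger ratio.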
Sentiment Scores: Extracted from the closed caption. Each sentence is formed from a subtitle text and is analyzed by 18 state-of-the-art methods (Araujo et al. 2014) for textual sentiment analysis, generating a vector of 18 scores (-1, 0 or +1), one for each method. We sum these values to obtain the sentiment score of the sentence.

Tension Levels Computation

In this paper, we extract data from (1) the prosodic features of the audio, (2) the visual features of the videos and (3) the sentiment scores extracted from the closed caption. After that, we determine the corresponding tension levels. Figure 1 presents an overview of the proposed approach.

In Step 1, the developed system extracts the multimodal resources from the TV news videos, where each video contains a different news item. By doing this, we obtain a list of images from the video (frames), the audio signal in WAV format and the text obtained from the closed caption. These visual, audio and textual modalities are organized into three processing lines, one for each modality.

In Step 2, using the frames obtained in the previous step, we apply the facial expression and emotion recognition methods proposed in Bartlett et al. (2006), obtaining the values of the visual features (visual intensity and field size) for the emotional expressiveness of the face.

In Step 3, the audio signal is processed by extracting the corresponding acoustic data, also considering the speech instants of the individuals, that is, when speech occurs. To accomplish this, we use the openSMILE framework (Eyben et al. 2013) to extract the audio component of the spectrum and obtain the prosodic features of loudness, voicing probability and the fundamental frequency of the speech modulations for each hundredth of a second of these audio signals.

In Step 4, the closed caption is processed in order to extract only the textual content corresponding to the transcribed speech (subtitle text). After that, each subtitle text is transformed into a single sentence. Each sentence is recorded in a text file and then submitted to the sentiment analysis process. To achieve this goal, we used iFeel to extract the sentiment polarity of each sentence, classifying it as positive, neutral or negative, for each of the 18 state-of-the-art methods (Araujo et al. 2014). Note that the emoticons method (Park et al. 2013) was not used, since closed captions do not contain the specific character sets used to represent emoticons.

In Step 5, shown in Figure 1, the system obtains the tension level of a specific video based on whichever of the two mapped tension levels, Low Tension and High Tension, has the highest sum. The sums of the emotions recognized from the facial expressions and of the sentiment scores of the closed captions for each tension level are weighted by the audiovisual clues calculated over the video. A particular news video is thus classified with the tension level whose emotions totaled the highest values of the multimodal features calculated throughout the video during the automatic recognition.

Preliminary Results

In this section we describe the dataset and the annotation process, report the experiments performed on the implemented system and discuss the preliminary results, in order to evaluate the performance of the proposed approach.
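The sentence scoring and the Step 5 decision rule described above can be sketched as follows. This is a simplification under our own assumptions: the paper does not give the exact weighting formula, so the per-record layout (level, evidence, weight) and the additive fusion below are hypothetical.

```python
def sentence_score(polarities):
    """Sentiment score of a sentence: the sum of the 18 polarity
    votes (-1, 0 or +1), one per sentiment analysis method."""
    return sum(polarities)


def classify_tension(records):
    """Step 5 (sketch): each record carries the tension level its
    recognized emotion or sentiment evidence maps to, the evidence
    value, and an audiovisual weight; the level with the larger
    weighted total over the whole video wins."""
    totals = {"Low Tension": 0.0, "High Tension": 0.0}
    for level, evidence, weight in records:
        totals[level] += weight * evidence
    return max(totals, key=totals.get)
```

A video whose high-tension evidence dominates after weighting, e.g. `[("High Tension", 1.0, 0.8), ("Low Tension", 1.0, 0.3)]`, is labeled High Tension.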
[Figure 1: Overview of the proposed approach to sentiment analysis for tension level computation.]

Dataset and Annotation Process

The dataset constructed to evaluate the proposed approach contains 520 news videos obtained from 27 exhibitions of four TV newscasts: three Brazilian news programs (namely Jornal da Record, Band News and Jornal Nacional) and an American television news program (CNN) (1). More specifically, we used 226 news videos from Band News shown from the 10th to the 14th of June, 2013, from the 23rd to the 28th of April, 2015, and from the 7th to the 12th of December, 2015; 237 videos of Jornal da Record (JR) shown on May 24th, 2013, January 10th, 2015, February 5th, 2015, and the 2nd, 10th, 16th and 18th of March, 2015; 47 videos from Jornal Nacional (JN) shown on January 20th and December 17th, 2015; and 10 videos of CNN News shown on April 6th, 2015.

(1) Among the 520 videos of the dataset, 264 have closed captions. Videos without closed captions received zero for this feature.

In order to evaluate the proposed model, four professionals from several fields of knowledge (Journalism and Languages, Mathematical Modeling, Law and Pedagogy) made manual annotations of the tension levels of the news videos. The tension levels considered for annotation, explained below, were Low Tension and High Tension. Among the 520 videos analyzed, all contributors annotated the same
tension level to 381 videos, thus reaching 73.27% agreement; in a further 18.46% of the annotations (96 videos), agreement among 75% of the contributors was reached.

Processing Approach          All Annotations   100% Concordance
Participants' Field Sizes         0.79              0.85
Sentiment Scores                  0.72              0.83
Proposed Method                   0.78              0.84

Table 1: Accuracy of each processing approach, for the newly proposed multimodal features and for our proposed approach.

Field Sizes x Sentiment Scores x Our Approach

We analyze the performance of our newly proposed features for inferring the tension level of newscasts, as well as that of our proposed approach. We performed the comparison in two manners: (1) considering all the videos and (2) considering only those videos on which all the annotators agreed (100% annotation concordance).

Table 1 shows the accuracy for these two settings when using just specific features and when using our approach. As expected, the performance was better when there was 100% agreement among the annotators. Note that, to compare the approaches, we performed a paired t-test between the processing approaches; the differences are significant at the 95% confidence level.

In the experiments, we observed that the sentiment scores were the best feature for inferring High Tension videos. Thus, we can conclude that sentiment scores are good for inferring High Tension videos; however, sentiment scores alone are not the best for inferring Low Tension videos. Therefore, if we want better performance on Low Tension videos, we can use our proposed approach.

Concluding Remarks

Facial expressions are forms of non-verbal communication that externalize our responses to the stimuli to which we are subjected and are, therefore, elements that form part of media news. In order to legitimize the reported fact as a communicative strategy, newscasters express their emotions, providing evidence about the tension of the speech generated by the news and about the pattern in the sequencing of the news items contained in the program. In this sense, this work presented an approach to infer the tension of news videos taking into account multiple sources of evidence.

Through experiments, we showed that sentiment analysis of the text, used for the first time in this domain, can be important for inferring high-tension news. We also showed that our approach can perform well in inferring High Tension news. However, for the identification of Low Tension news, our approach is only outperformed by using just sentiment analysis.

Acknowledgments

The authors would like to thank the support of CNPq-Brazil under Procs. 468042/2014-8 and 313163/2014-6, FAPEMIG-Brazil under Proc. APQ-01180-10, CEFET-MG under Proc. PROPESQ-023-076/09, and of CAPES-Brazil.

References

Araujo, M.; Goncalves, P.; Cha, M.; and Benevenuto, F. 2014. iFeel: A System that Compares and Combines Sentiment Analysis Methods. In Proc. of WWW'14.
Bartlett, M. S.; Littlewort, G.; Frank, M.; Lainscsek, C.; Fasel, I.; and Movellan, J. 2006. Fully Automatic Facial Action Recognition in Spontaneous Behavior. In Proc. of FGR'06, 223–230. Southampton: IEEE.
Charaudeau, P. 2002. A Communicative Conception of Discourse. Discourse Studies 4(3):301–318.
Cheng, F. 2012. Connection between News Narrative Discourse and Ideology based on Narrative Perspective Analysis of News Probe. Asian Social Science 8:75–79.
Eisenstein, J.; Barzilay, R.; and Davis, R. 2008. Discourse Topic and Gestural Form. In Proc. of AAAI'08, 836–841.
Ekman, P., and Friesen, W. 1978. Facial Action Coding System (FACS): Manual. Consulting Psychologists Press.
Eyben, F.; Weninger, F.; Groß, F.; and Schuller, B. 2013. Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. In Proc. of ACM MM'13, 835–838.
Goffman, E. 1981. The Lecture. In Forms of Talk. Philadelphia: University of Pennsylvania Press. 162–195.
Gutmann, J. 2012. What Does Video-Camera Framing Say during the News? A Look at Contemporary Forms of Visual Journalism. Brazilian Journalism Research 8(2):64–79.
Maynard, D.; Dupplaw, D.; and Hare, J. 2013. Multimodal Sentiment Analysis of Social Media. In Proc. of BCS SGAI SMA'13, 44–55.
Pantti, M. 2010. The Value of Emotion: An Examination of Television Journalists' Notions on Emotionality. European Journal of Communication 25(2):168–181.
Park, J.; Barash, V.; Fink, C.; and Cha, M. 2013. Emoticon Style: Interpreting Differences in Emoticons Across Cultures. In Proc. of ICWSM'13.
Pereira, M. H. R.; Padua, F. L. C.; and David-Silva, G. 2015. Multimodal Approach for Automatic Emotion Recognition Applied to the Tension Levels Study in TV Newscasts. Brazilian Journalism Research 11(2):146–167.
Poria, S.; Cambria, E.; Howard, N.; Huang, G. B.; and Hussain, A. 2016. Fusing Audio, Visual and Textual Clues for Sentiment Analysis from Multimodal Content. Neurocomputing 174:50–59.
Reis, J.; Benevenuto, F.; Vaz de Melo, P.; Prates, R.; Kwak, H.; and An, J. 2015. Breaking the News: First Impressions Matter on Online News. In Proc. of ICWSM'15.
Stegmeier, J. 2012. Toward a Computer-aided Methodology for Discourse Analysis. Stellenbosch Papers in Linguistics 41:91–114.