A Methodology For Social BI

ABSTRACT

Social BI (SBI) is the emerging discipline that aims at combining corporate data with textual user-generated content (UGC) to let decision-makers analyze their business based on the trends perceived from the environment. Despite the increasing diffusion of SBI applications, no specific and organic design methodology is available yet. In this paper we propose an iterative methodology for designing and maintaining SBI applications that reorganizes the activities and tasks normally carried out by practitioners. Effective support to quick maintenance iterations is a key feature in this context, due to the huge dynamism of the UGC and to the pressing need of immediately perceiving and timely reacting to changes in the environment. The paper is completed by two case studies of real SBI projects, related to Italian politics and to the consumer goods area respectively, aimed at proving that the adoption of a structured methodology positively impacts on the project success.

Categories and Subject Descriptors

H.4.2 [Information Systems Applications]: Types of Systems—Decision Support; D.2.10 [Software Engineering]: Design—methodologies

Keywords

Business Intelligence, Sentiment Analysis, Design methodologies, User-Generated Content

1. INTRODUCTION AND MOTIVATION

Social networks and portable devices have enabled simplified and ubiquitous forms of communication which significantly contributed, during the last decade, to a boost in the voluntary sharing of personal information. Most of us can connect to the Internet anywhere, anytime, and continuously send messages to a virtual community centered around blogs, forums, social networks, and the like. As a result, an enormous amount of user-generated content (UGC) related to people's tastes, thoughts, and actions has been made available in the form of preferences, opinions, geolocation, etc. This huge wealth of information is raising an increasing interest among decision makers because it can give them a timely perception of the market mood and help them explain the phenomena of business and society.

Social Business Intelligence (SBI) is the emerging discipline that aims at effectively and efficiently combining corporate data with UGC to let decision-makers analyze and improve their business based on the trends and moods perceived from the environment [6]. As in traditional business intelligence, the goal of SBI is to enable powerful and flexible analyses for decision makers (simply called users from now on) with a limited expertise in databases and ICT. In the context of SBI, the most widely used category of UGC is the one coming in the form of textual clips. Clips can either be messages posted on social media (such as Twitter, Facebook, blogs, and forums) or articles taken from on-line newspapers and magazines. Digging information useful for users out of textual UGC requires first crawling the web to extract the clips related to a subject area, then enriching them in order to let as much information as possible emerge from the raw text. The subject area defines the project scope and extent, and can for instance be related to a brand or a specific market. Enrichment activities may simply identify the structured parts of a clip, such as its author, or even use sentiment analysis techniques [13] to interpret each sentence and, if possible, assign a sentiment (also called polarity, i.e., positive, negative, or neutral) to it. We will call SBI process the one whose phases range from web crawling to users' analyses of the results.

SBI has emerged as an application and research field in the last few years. Though a wide literature is available about the different phases of the SBI process, no methodology is available yet to organize the different design activities. Indeed, in real SBI projects, practitioners typically carry out a wide set of tasks but they lack an organic and structured view of the design process. In particular, a distinctive and nonnegotiable feature of these projects is that they call for an effective and efficient support to maintenance iterations, because of the huge dynamism of the UGC and of the pressing need of immediately perceiving and timely reacting to changes in the environment. In the direction of achieving the required responsiveness, in this paper we propose an iterative methodology that reorganizes the activities and tasks for developing and maintaining SBI processes. To evaluate the impact of an engineered approach on the project success (in terms of both correctness and productivity) we present and discuss two case studies of real SBI projects, related to Italian politics and to the consumer goods area respectively. While the first one was fully supported by our methodology, the second one was mainly guided by the previous experience of the design team.
The paper is structured as follows. After discussing the related literature in Section 2, in Section 3 we describe an architecture for SBI and in Section 4 we introduce our methodology and its activities. Then, in Section 5 we discuss two case studies, while in Section 6 we draw the conclusions.

2. RELATED WORK

SBI is at the crossroads between several research areas that differently contribute to making the resulting analyses effective and helpful to users. As shown in Figure 1, the SBI process first of all requires capturing and storing large sets of unstructured or semi-structured data available on the web, social networks, and other textual repositories. Web crawling is a central issue in information retrieval, in whose context powerful languages to automatically and precisely capture the relevant data to be extracted have been studied [4, 20, 2, 5]. Storing the crawled data for post-analysis obviously poses a big data problem due to the cardinality of the clips and to the heterogeneity of the related metadata [27].

Semantic enrichment of raw clips and text understanding have been studied in several areas of computer science. Enrichment activities range from the simple identification of relevant parts (e.g., author, title, language) if the clip is semi-structured, to the use of either natural language processing (NLP) or text analysis techniques to interpret each sentence and, if possible, assign a sentiment to it (i.e., sentiment analysis or opinion mining [13]). While NLP approaches try to obtain a full text understanding [29], text mining approaches rely on different techniques (e.g., n-grams) either to find interesting patterns in the text (e.g., named entities [21], relationships between topics [23], or clip sentiment [19]) or to classify/cluster clips [28]. The effectiveness of the different approaches largely depends on the quality of the raw text to be analyzed; in general, NLP is effective on syntactically-correct texts (such as on-line newspapers and blogs) while it falls short on ill-formed sentences or when Internet dialects are used (e.g., on social networks). Hybrid approaches between classical NLP and statistical techniques have also been tried, either mainly user-guided, like in [12], or automated and unsupervised, like in [7].

In the area of BI, most efforts for the social content field have been focused on identifying data representations that enable powerful and flexible analyses of data. For example, the topic cube approach [30] extends traditional cubes to cope with a topic hierarchy and to store probabilistic content measures of text documents learned through a probabilistic topic model. In [3], the authors model the topic hierarchy as a directed acyclic graph of topics where each topic can have several parents. In [6] we proposed meta-stars, whose basic idea is to use meta-modeling coupled with navigation tables and with traditional dimension tables to cope with the dynamism of topic hierarchies; to the best of our knowledge, this is currently the only proposal that enables full OLAP analyses on social data. Finally, in [7] a multidimensional data model is proposed to integrate sentiment data extracted from opinion posts into a corporate data warehouse.

As to the methods for designing classical BI applications, the available literature mainly focuses on traditional, linear approaches such as waterfall, with specific reference to data warehouse design. A waterfall approach was first proposed in [8]; a distinguishing feature was the inclusion of a conceptual design phase aimed at better formalizing the data schema. Later on, the same authors proposed Four-Wheel-Drive [9], an agile methodology that specializes recent findings in software engineering to the peculiarities of BI projects. Similarly, the work in [11] breaks with strictly sequential approaches by applying two agile development techniques, namely Scrum and eXtreme Programming. A different approach to tackling data warehouse design complexity is the MDA methodology proposed in [14] to better separate the system functionality from its implementation; in practice, strictly applying this methodology may be hard due to the poor aptitude of users to reading formal models and investing resources in low-value activities. A pragmatic comparison between data warehouse design methodologies is offered in [25], where 15 different solutions proposed by BI vendors are examined. The authors emphasize the lack of software-independent approaches, and point out that all the proposed solutions can hardly deal with changes and market evolution, which creates a robustness problem. This is the first reason why methods for designing classical BI applications cannot be directly applied to the SBI domain. One further reason is that in a BI project most of the attention is dedicated to static (multidimensional) modeling, which largely determines the overall effectiveness, while in SBI a satisfying level of effectiveness can only be achieved through a coordinated design of crawling and semantic enrichment.

3. AN ARCHITECTURE FOR THE SBI PROCESS

In [6] we proposed an architecture for the SBI process where the information resulting from clip analysis is stored into a data mart in the form of multidimensional cubes to be accessed through OLAP techniques. This allows for overcoming the limitations of traditional approaches to the analysis of textual UGC, where only static or poorly flexible reports are provided and historical data are not made available. The core of our architecture is shown in Figure 1, and it features:

• An ODS (Operational Data Store) that stores all the relevant data about clips, their topics, their authors, and their source channels; to this end, a relational database is coupled with a document-oriented database that can efficiently store and search the text of the clips, and with a triple store to represent the topic ontology.

• A data mart that stores clip and topic information in the form of a set of multidimensional cubes to be used for decision making.

• A crawling component that runs a set of keyword-based queries to retrieve the clips (and the related meta-data) that lie within the subject area.

• An ETL (Extraction, Transformation, and Loading) component that turns the semi-structured output of the crawler into a structured form and loads it into the ODS, and then periodically extracts data about clips and topics from the ODS to load them into the data mart.
Figure 1: An architecture for SBI (three sample clips in Italian on the left, an excerpt from a domain ontology
on the right)
• A semantic enrichment component that works on the ODS to extract the semantic information hidden in the clips, such as the topic(s) related to the clip, the syntactic and semantic relationships between words, and the sentiment related to a whole sentence or to each single topic it contains.

• An OLAP front-end to enable interactive and flexible analysis sessions on the multidimensional cubes.

In the implementation adopted for both case studies discussed in Section 5 we used Brandwatch (a well-known media monitoring commercial tool, www.brandwatch.com) for keyword-based crawling, Talend (www.talend.com) for ETL, SyN Semantic Center (www.synthema.it) for semantic enrichment, Oracle for storing the ODS, the domain ontology, and the data mart, and MongoDB (www.mongodb.org) as the document database; for OLAP analyses we developed an ad-hoc interface using JavaScript.
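As an illustration of the role of the document database, the following minimal sketch (assuming a local MongoDB instance and the pymongo driver; collection and field names are illustrative, not those of our actual implementation) stores a crawled clip and retrieves it by keyword, the kind of late filtering discussed in Section 4.4:

    # Minimal sketch: storing and searching a clip in the document side of
    # the ODS. Assumes a local MongoDB instance; names are illustrative only.
    from datetime import datetime, timezone
    from pymongo import MongoClient, TEXT

    client = MongoClient("mongodb://localhost:27017")
    clips = client["sbi_ods"]["clips"]
    clips.create_index([("text", TEXT)])  # full-text index used for late filtering

    clip = {
        "source": "www.example-newspaper.it",  # hypothetical source
        "channel": "online newspaper",
        "author": "anonymous",
        "crawled_at": datetime.now(timezone.utc),
        "text": "Il premier ha discusso il deficit con la cancelliera...",
        "language": "it",
    }
    clips.insert_one(clip)

    # Late filtering: retrieve only clips mentioning a given keyword.
    for doc in clips.find({"$text": {"$search": "deficit"}}):
        print(doc["source"], doc["crawled_at"])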
The components mentioned above are normally present, though with different levels of sophistication, in most current commercial solutions for SBI. However, as we will show in Table 1, the roles in charge of designing, tuning, and maintaining each component may vary from project to project. In this regard, SBI projects can be classified as follows:

• Level 1: Best-of-Breed. In this type of project, a best-of-breed policy is followed to acquire tools specialized in one of the steps necessary to transform raw clips into semantically-rich information. This approach is often followed by those who run a medium to long-term project to get full control of the SBI process by finely tuning all its critical parameters, typically aiming at implementing ad-hoc reports and dashboards to enable sophisticated analyses of the UGC. For example, SAS provides a set of modular components to support the different process phases (e.g., crawling and text mining) that can be separately tuned and used in combination with components provided by other vendors.

• Level 2: End-to-End. Here, an end-to-end software/service is acquired and tuned. Customers only need to carry out a limited set of tuning activities that are typically related to the subject area, while a service provider or a system integrator ensures the effectiveness of the technical (and domain-independent) phases of the SBI process. Examples of tools in this category are Brandwatch and Tracx (www.tracx.com), both offered in a software-as-a-service fashion and able to manage most phases of the SBI process.

• Level 3: Off-the-Shelf. This type of project consists in adopting, typically in an as-a-service manner, an off-the-shelf solution supporting a set of reports and dashboards that can satisfy the most frequent user needs in the SBI area (e.g., average sentiment, top topics, trending topics, and their breakdown by source/author/sex). With this approach the customer has a very limited view of the single activities that constitute the SBI process, so she has little or no chance of positively impacting on activities that are not directly related to the analysis of the final results. The service provider, for instance Lexalytics or Verint, is in charge of ensuring the effectiveness of the process.

Moving from level 1 to 3, projects require less technical capabilities from customers and ensure a shorter set-up time, but they also allow less control of the overall effectiveness and less flexibility in analyzing the results. Noticeably, our architecture fits projects of all three levels, though only in a best-of-breed project would customers have a direct and complete view of all the components.

4. METHODOLOGICAL FRAMEWORK

The iterative methodology we propose is aimed at letting all the activities involved in an SBI project coexist harmoniously. These activities are to be carried out in tight connection with each other, always keeping in mind that each of them heavily affects the overall system performance and that a single problem can easily neutralize all other optimization efforts.
The activities that make up our methodology are shown in Figure 2; they were conceived on the one hand to support and speed up the initial design of an SBI process, on the other to maximize the effectiveness of the user analyses by continuously optimizing and refining all its phases. These maintenance activities are necessary in SBI projects because of the continuous (and often quite fast) environment variability, mainly related to the volatile nature of web data sources, which asks for high responsiveness. This variability impacts every single activity, from crawling design to semantic enrichment design, and leads to constantly having to cope with changes in requirements. Note that three tracks are depicted in Figure 2: a crawling track centered on the crawling component, a semantic track centered on semantic enrichment, and a data track centered on ETL and OLAP. These tracks, whose activities require different technical skills and may be executed by different team members, are partially concurrent in our methodology but they still require a high degree of coordination.

Figure 2: Functional view of our methodology for SBI design

Table 1 shows which activities and single tasks are carried out for each track by the customer depending on the project level as defined in Section 3. The remaining activities are either in charge of the service provider/system integrator or they are not carried out at all, thus reducing the effectiveness of the SBI process. On the other hand, Table 2 shows, with reference to a level-1 project, the team roles mainly involved in each activity and task, distinguishing between designer, programmer, and end-user. The underlying idea is that a programmer, besides showing database and BI skills, should be competent in information retrieval, text mining, and NLP; the designer is a 360° SBI expert and must be able to guide the customer through all the crucial decisions required by the project, ranging from properly picking the crawling keywords to correctly organizing the topic ontology.

4.1 Macro-Analysis

During this activity, users are interviewed to define the project scope and the set of inquiries the system will answer. An inquiry captures an informative need of a user; from a conceptual point of view it is specified by three components: what, i.e., one or more topics on which the inquiry is focused (e.g., the Prime Minister); how, i.e., the type of analysis the user is interested in (e.g., top related topics); where, i.e., the data sources to be searched (e.g., the Wall Street Journal website).

Inquiries drive the definition of subject area, themes, and topics. The subject area of a project is the domain of interest for the users (e.g., Italian national politics), meant as the set of themes about which information is to be collected. A theme (e.g., education) includes a set of specific topics (e.g., school reform). Laying down themes and topics at this early stage is useful as a foundation for designing a core taxonomy of topics during the first iteration of ontology design; themes can also be used to enforce an incremental decomposition of the project. In practice, this activity should also produce a first assessment of which sources cannot be excluded from the source selection activity since they are considered extremely relevant (e.g., the corporate website and Facebook pages).

Two examples of inquiries in the subject area of Italian politics are:

• What are the reactions to the Job Act proposed by the Prime Minister on newspapers belonging to different political areas?

• To what extent do European Election themes influence the Prime Minister's behavior?

These inquiries determine two different themes, namely labor policy and European policy, and several topics such as welfare, minimum wage, Maastricht Treaty, and Eurosceptic.
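To make the structure of inquiries concrete, the following sketch (illustrative only; field values are paraphrased from the examples above) shows how they could be catalogued as what/how/where triples during macro-analysis:

    from dataclasses import dataclass

    @dataclass
    class Inquiry:
        what: list[str]   # topics the inquiry focuses on
        how: str          # type of analysis requested
        where: list[str]  # data sources to be searched

    inquiries = [
        Inquiry(what=["Job Act", "Prime Minister"],
                how="reactions by political area",
                where=["newspapers of different political areas"]),
        Inquiry(what=["European Election themes", "Prime Minister"],
                how="influence on the Prime Minister's behavior",
                where=["social media", "online newspapers"]),
    ]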
4.2 Ontology Design

During this activity, customers work on themes and topics to build and refine the domain ontology that models the subject area (see [17] for a survey of techniques for ontology design). Noticeably, the domain ontology is not just a list of keywords; indeed, it can also model relationships (e.g., hasKind, isMemberOf) between topics. An excerpt from the domain ontology for the Italian politics subject area, designed using Protégé, is shown in Figure 1. Once designed, this ontology becomes a key input for almost all process phases: semantic enrichment relies on the domain ontology to better understand UGC meaning; crawling design benefits from topics in the ontology to develop better crawling queries and establish the content relevance; ETL and OLAP design heavily rely on the ontology to develop more expressive, comprehensive, and intuitive dashboards.
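Such topic relationships are exactly what the triple store mentioned in Section 3 is meant to hold; a minimal sketch follows, using rdflib and a purely hypothetical namespace (the real ontology was designed in Protégé and is only excerpted in Figure 1):

    # Minimal sketch: representing topic relationships as triples.
    # The namespace and resource names are hypothetical.
    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    POL = Namespace("http://example.org/politics#")
    g = Graph()
    g.bind("pol", POL)

    # Relationships mirroring the isMemberOf example in the text.
    g.add((POL.Renzi, RDF.type, POL.Politician))
    g.add((POL.Renzi, POL.isMemberOf, POL.PartitoDemocratico))
    g.add((POL.PartitoDemocratico, POL.isMemberOf, POL.CenterLeftWing))

    # Navigating upward in the topic hierarchy via isMemberOf.
    for parent in g.objects(POL.Renzi, POL.isMemberOf):
        print(parent)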
Table 1: Activities (in italics) and tasks in charge of the customer for different project types, grouped by
track; tasks executed in projects of higher levels are carried out in projects of lower levels too
With reference to Figure 1, organizing the politicians in a hierarchy makes it possible to roll up from a politician to his or her party and wing, which means for instance that the opinions about a wing can be obtained as an average of the opinions about all the politicians belonging to the parties of that wing.
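As an illustration of this roll-up, a minimal sketch (assuming, purely for illustration, clip-level sentiment scores already joined with the politician-party-wing hierarchy; all data are hypothetical):

    import pandas as pd

    # Hypothetical clip-level sentiment (+1 positive, 0 neutral, -1 negative)
    # joined with the topic hierarchy politician -> party -> wing.
    clips = pd.DataFrame({
        "politician": ["Renzi", "Renzi", "Alfano", "Brunetta"],
        "party":      ["PD", "PD", "NCD", "FI"],
        "wing":       ["center-left", "center-left", "center-right", "center-right"],
        "sentiment":  [1, -1, 0, -1],
    })

    # Rolling up: average sentiment per politician, per party, and per wing.
    print(clips.groupby("politician")["sentiment"].mean())
    print(clips.groupby("party")["sentiment"].mean())
    print(clips.groupby("wing")["sentiment"].mean())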
The complexity of ontology design, maintenance, and evolution may vary according to the adopted tools and techniques [18, 26, 10]. In practice, the main task of this activity consists in detecting as many domain-relevant topics and themes as possible and organizing them into a classification hierarchy. In most cases this entails distinguishing all the existing relationships between topics and expressing them in a categorization framework with a fixed number of predefined levels that supports the types of analyses users are expected to carry out. This fixed-depth limitation can actually be overcome; for instance, in the architecture proposed in [6], a meta-star solution enables topics to be arranged in a dynamic and recursive hierarchy so as to support more sophisticated OLAP queries.

An effective way to measure the ontology maturity level is to use as an indicator the coverage that the ontology achieves over the retrieved clips (i.e., the percentage of clips that include at least one ontology topic). Obviously, the goal is to achieve 100% coverage, meaning that all the clips retrieved are relevant to the subject area. This gives rise to an important task of ontology design, which we call ontology coverage analysis. Unfortunately, the coverage tends to decrease over time due to the dynamism of the UGC. Indeed, new potentially relevant keywords are continuously brought to the users' attention by an analysis of the retrieved clips; if these keywords are confirmed to be relevant (so-called emerging topics), they must be timely included in the crawling queries so as to avoid that some interesting UGC is missed and some critical trend or phenomenon goes undetected. This leads to enlarging the scope of retrieved clips, and inevitably to reducing the coverage. Once it is confirmed that an emerging topic is really pertinent to the project scope, it must be added to the domain ontology and related to the existing topics so as to increase the coverage again. Note that assessing the ontology coverage is made harder by off-topic clips (see Section 4.4) that negatively impact on the coverage; this induces a strong connection between ontology design and crawling design.
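The coverage indicator itself is simple to compute; the sketch below is illustrative only, checking topic occurrence by naive case-insensitive matching, whereas a real implementation would rely on the entities extracted by the semantic engine:

    def ontology_coverage(clips, topics):
        """Fraction of clips that mention at least one ontology topic."""
        topics = [t.lower() for t in topics]
        covered = sum(
            1 for clip in clips
            if any(t in clip.lower() for t in topics)
        )
        return covered / len(clips) if clips else 0.0

    clips = [
        "The school reform was discussed by the Prime Minister",
        "A new recipe for fried cutlets",  # off-topic clip
    ]
    print(ontology_coverage(clips, ["school reform", "Prime Minister"]))  # 0.5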
4.3 Source Selection

Source selection is aimed at identifying as many web domains as possible for crawling. The set of potentially relevant sources can be split into two families: primary sources and minor sources. The first set includes all the sources mentioned during the first macro-analysis iteration, namely:

1. All the corporate communication channels (the corporate website, Facebook page, Twitter account, and any other official brand profile on any platform). Every interaction recorded on these sources and every opinion expressed on these media could be critical and has to be brought to the company's attention as soon as possible.

2. So-called generalist sources, such as the online versions of the major publications. Though these sources publish information dealing with several areas, not only with the project one, they must be monitored because of their wide user-base and of their quality and credibility.

The user-base of minor sources is smaller but not less relevant to the project scope. Minor sources include lots of small platforms which produce valuable content with high informative value because of their strong focus on themes related to the subject area: in short, a small group of users who generate a high rate of pertinent clips.
There are several ways of identifying the set of potentially relevant sources: (i) conducting interviews with domain experts, who usually are marketing operators; (ii) analyzing back-links and third-party references to the corporate communication channels; (iii) searching the web using themes and topics as keywords, which can be done through search engines ranging from generalist ones such as Google to domain-specific and platform-specific ones such as Openpolis and Social Mention; (iv) considering all the local editions of major newspapers.

Once a set of candidate sources has been established, deciding which of them are to be actually crawled is the result of a trade-off between achieving a satisfying coverage of the subject area on the one hand, and optimizing the effort for analyzing the retrieved clips on the other. To evaluate this trade-off, tools such as web directories (e.g., Alexa) can be used to estimate the number of accesses and the traffic level of candidate web sites. Of course, even the set of selected sources must be maintained, so the web must be periodically monitored to evaluate and dynamically include new relevant sources.
4.4 Crawling Design

A relevant source that produces in-topic clips normally also generates lots of valueless content (off-topic clips) that lies outside the project scope and slows down the whole process while possibly hiding relevant content. Crawling design, perhaps one of the most complex and time-consuming activities, aims at retrieving in-topic clips by filtering off-topic clips out. Starting from the topics in the domain ontology and from the additional keywords possibly discovered during source selection, a set of queries is created to search for relevant clips across the selected sources. Three subsequent tasks are involved in this activity:

1. Template design consists in an analysis of the code structure of the source website to enable the crawler to detect and extract only the informative UGC (e.g., by excluding external links, advertising, multimedia, and so on).

2. Based on the templates designed, query design develops a set of queries to extract the relevant clips. Normally, these are complex Boolean queries that explicitly mention both relevant keywords to extract in-topic clips and irrelevant keywords to exclude off-topic clips.

3. Content relevance analysis aims at evaluating the effectiveness of crawling by measuring the percentage of in-topic clips (see the sketch after this list). At this stage, the analysis must be carried out by manually labeling a sample of the retrieved clips. Besides distinguishing between in-topic and off-topic clips, users should also try to classify the causes of errors to speed up the following iterations. Identifying clips that have been retrieved due to an incorrect query template, and keywords that led to extracting off-topic clips, is typically very useful, because it enables the team to trigger a new iteration where crawling queries are refined to more effectively cut off-topic clips out.
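A minimal sketch of the measurement behind content relevance analysis (illustrative; it assumes the manual in-topic/off-topic labels for a random sample are already available, and adds a normal-approximation confidence interval):

    import math
    import random

    def content_relevance(labels, confidence_z=1.96):
        """Estimate the in-topic rate from a manually labeled sample;
        labels is a list of booleans (True = in-topic)."""
        n = len(labels)
        p = sum(labels) / n
        # Normal-approximation 95% confidence interval for the proportion.
        half_width = confidence_z * math.sqrt(p * (1 - p) / n)
        return p, half_width

    # Hypothetical: a random sample of 200 clips labeled by hand;
    # here the labels are simulated for illustration.
    sample_labels = [random.random() < 0.8 for _ in range(200)]
    p, hw = content_relevance(sample_labels)
    print(f"in-topic rate: {p:.1%} +/- {hw:.1%}")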
2. Based on the templates designed, query design de- • Dictionary enrichment, that requires including new en-
velops a set of queries to extract the relevant clips. tities missing from the dictionary and changing the
Normally, these are complex Boolean queries that ex- sentiment of entities (polarization) according to the
plicitly mention both relevant keywords to extract on- specific subject area (e.g., in “I always eat fried cutlet”,
topic clips and irrelevant keywords to exclude off-topic the word “fried” has a positive sentiment, but in the
clips. food market area a sentence like “These cutlets taste
like fried” should be tagged with a negative sentiment
3. Content relevance analysis aims at evaluating the ef- because fried food is not considered to be healthy).
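Queries of this kind can also be assembled programmatically from the ontology topics and the off-topic keywords collected so far; the sketch below is illustrative only and emits a Brandwatch-like syntax rather than the exact grammar of the tool:

    def build_crawling_query(include, exclude, country=None):
        """Assemble a Boolean crawling query in a Brandwatch-like syntax
        from in-topic and off-topic keyword lists (illustrative only)."""
        q = "(" + " OR ".join(f'"{k}"' for k in include) + ")"
        if exclude:
            q += " NOT (" + " OR ".join(f'"{k}"' for k in exclude) + ")"
        if country:
            q += f" AND country:{country}"
        return q

    print(build_crawling_query(["Renzi", "Job Act"], ["football"], "it"))
    # ("Renzi" OR "Job Act") NOT ("football") AND country:it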
4.5 Semantic Enrichment Design

This activity involves several tasks whose purpose is to increase the accuracy of text analytics so as to maximize the process effectiveness in terms of extracted entities and of the sentiment assigned to clips; entities are concepts that emerge from semantic enrichment but are not part of the domain ontology yet (for instance, they could be emerging topics). The specific tasks to be performed depend on the semantic engine adopted and on how semantic enrichment is carried out. For instance, SyN Semantic Center (used for both case studies presented in this paper) executes a two-step process [16]: first, relevant knowledge is identified in the clips through lexical analysis, i.e., by detecting semantic relations and facts based on the slot grammar method [15] and adopting morphological, syntactic, semantic, semiometric, and statistical criteria; then, clips are classified according to their topics using both supervised and unsupervised clustering criteria.

In general, two main tasks that enrich and improve the engine's linguistic resources can be distinguished (a sketch of the first one follows the list):

• Dictionary enrichment, which requires including new entities missing from the dictionary and changing the sentiment of entities (polarization) according to the specific subject area (e.g., in "I always eat fried cutlet", the word "fried" has a positive sentiment, but in the food market area a sentence like "These cutlets taste like fried" should be tagged with a negative sentiment because fried food is not considered to be healthy). Here, a typical error is the failure to connect an entity to its different synonyms or aliases, which dramatically distorts all the figures based on counting topic occurrences. To avoid this problem, a layer of aliases can be added between topics and entities. Aliases are useful to associate to a single topic entities that differ from the given topic due to typos or due to the use of synonyms. For example, in the Italian politics domain, "PD" and "PD – L" are both synonyms of "Partito Democratico". Such knowledge can be hosted either in the ontology (see [6]) or within the semantic engine.

• Inter-word relation definition, which establishes or modifies the existing semantic, and sometimes also
syntactic, relations between words. Relations are linguistically relevant because they can deeply modify the meaning of a word or even the sentiment of an entire sentence, determining the difference between a right and a wrong interpretation (e.g., "a Pyrrhic victory" has negative sentiment though "victory" is positive).
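The alias layer mentioned in the first task can be as simple as a normalization map applied before counting topic occurrences; a minimal, illustrative sketch:

    from collections import Counter

    # Illustrative alias layer: every known alias is mapped to its canonical topic.
    ALIASES = {
        "pd": "Partito Democratico",
        "pd - l": "Partito Democratico",
        "partito democratico": "Partito Democratico",
    }

    def count_topic_occurrences(entities):
        """Count occurrences of canonical topics, folding aliases together."""
        canonical = [ALIASES.get(e.strip().lower(), e) for e in entities]
        return Counter(canonical)

    # Without the alias layer, these three mentions would be counted as three
    # distinct topics, distorting all occurrence-based figures.
    print(count_topic_occurrences(["PD", "PD - L", "Partito Democratico"]))
    # Counter({'Partito Democratico': 3})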
Modifications in the linguistic resources may produce undesired side effects; so, after completing these tasks, a correctness analysis should be executed, aimed at measuring the actual improvements introduced and the overall ability of the process to understand a text and assign the right sentiment to it. This is normally done, using regression test techniques, by manually tagging an incrementally-built sample set of clips with a sentiment; it is always recommended to ask different users to tag the clips, and then use a voting system to determine a majority group that will be considered as an oracle. The overall correctness of semantic enrichment strongly depends on the selected sources and on how specific the subject area is. Reaching a correctness level of 70% can be seen as a very good result considering that, according to the literature, a realistic upper bound to the inter-tagger agreement among three or more users when manually tagging clips is around 70% [1].
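A minimal sketch of this voting-based correctness analysis (illustrative; ties are simply discarded here, whereas a real protocol would resolve them by involving further taggers):

    from collections import Counter

    def majority_oracle(tags):
        """Return the majority sentiment among human taggers, or None on a tie."""
        (label1, n1), *rest = Counter(tags).most_common()
        if rest and rest[0][1] == n1:
            return None  # no majority group
        return label1

    def correctness(human_tags_per_clip, engine_tags):
        """Fraction of clips where the engine agrees with the majority oracle."""
        hits, total = 0, 0
        for human_tags, engine_tag in zip(human_tags_per_clip, engine_tags):
            oracle = majority_oracle(human_tags)
            if oracle is None:
                continue  # discard clips without a majority group
            total += 1
            hits += (engine_tag == oracle)
        return hits / total if total else 0.0

    # Three taggers per clip, sentiment in {"pos", "neg", "neu"}.
    human = [["pos", "pos", "neg"], ["neg", "neg", "neu"], ["pos", "neg", "neu"]]
    engine = ["pos", "neu", "pos"]
    print(correctness(human, engine))  # 0.5: the third clip has no majority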
4.6 ETL & OLAP Design

The main tasks in this activity are:

• ETL design and implementation, which strongly depends on the features of the semantic engine, on the richness of the meta-data retrieved by the crawler (e.g., URLs, author, source type, platform type), and on the possible presence of specific data acquisition channels like CRM, enterprise databases, etc.

• KPI design; different kinds of KPIs can be designed and calculated depending on which kinds of meta-data the crawler fetches.

• Dashboard design, during which a set of reports is built that captures the user needs expressed by inquiries during macro-analysis. Note that, in some cases, the specific tool adopted for reporting may be unable to satisfactorily meet the users' requirements; in this case a totally custom interface should be implemented from scratch.

4.7 Execution and Test

This activity has a basic role in the methodology, as it triggers a new iteration in the design process. Crawling queries are executed, the resulting clips are processed, and the reports are launched over the enriched clips. The specific tests related to each single activity, described in the preceding subsections, can be executed separately though they are obviously inter-related. The first test executed is normally the one of crawling; even after a first round, the semantic enrichment tests can be run on the resulting clips. Similarly, when the first enriched clips are available, the test of ETL and OLAP can be triggered.

The test results are inspected with users, which may easily lead to:

1. going back to crawling design to better tune the crawling queries or templates to improve precision and recall;

2. going back to semantic enrichment design to solve problems related for instance to misunderstandings of sentences or wrong polarization of clips;

3. going back to ETL & OLAP design to fix ETL errors and improve reports and dashboards;

4. going back to ontology design to further enrich and extend the domain ontology with new relevant entities that emerged from an analysis of the clips retrieved;

5. going back to source selection to include new sources or exclude sources that are no longer relevant;

6. going back to macro-analysis to enlarge the subject area or refine inquiries.

5. CASE STUDIES

In this section we describe our experience with two real SBI projects, which helped us in tailoring our methodology and demonstrating that an engineered approach positively impacts on the project success, meant in terms of both correctness and productivity. In particular we analyze two projects: a level-1 project in the subject area of Italian politics (PR-Pol) and a level-2 project in the subject area of a large consumer goods company (PR-CG). Both projects adopted an iterative approach and the tasks carried out were approximately the same, but while in PR-Pol our methodology was enforced, in PR-CG the team was mainly guided by its previous experience. As shown later, this led to some inefficiencies in PR-CG.

The PR-CG working group was led by a system integrator with significant skills in SBI, featuring one project manager, one chief of consulting services, and six developers. The team was completed by an external scientific supervisor and by the innovation chief of the customer company. Though PR-CG was a level-2 project, we had a chance to monitor the activities of both the customer and the system integrator. The PR-Pol working group was considerably smaller: it only included one project manager, one scientific supervisor, two developers, and the customer (the mayor of a large Italian city in this case). Overall, though the two projects are not fully comparable in terms of size and working group composition, they cover most of the critical issues related to SBI projects, so they provide good support for discussing the features of our methodology.

According to the classification proposed by [24], our case studies can be described as explanatory/exploratory (they aim at confirming the effectiveness of our methodology in real contexts, but also at finding new insights and at better tuning the approach), positivist (they use effort and correctness measurements), quantitative and qualitative (they quantitatively assess the validity of the approach, but they also collect qualitative judgments by the team), and flexible (due to the inherent dynamics of an SBI project, the requirements continuously changed during the case studies). A more complete description can be given by answering the basic questions proposed by [22]:

• Objective—What to achieve?: the case studies aim at proving that the adoption of our methodology has a positive impact on the productivity and correctness of SBI projects.
• The case—What is studied?: we study two real projects with different characteristics and in different areas; both projects were carried out by skilled teams, but with different compositions and sizes.

• Theory—Frame of reference: the theoretical framework we adopted is the one defined by the activities and tasks our methodology builds upon.

• Research questions—What to know?: we study how the two projects differ in terms of required effort and delivered utility.

• Methods—How to collect data?: for PR-CG, the effort for the different activities and tasks was derived a posteriori from an analysis of the time-sheets recorded by the system integrator, while for PR-Pol it was measured at project time; as to correctness, it was estimated by asking some domain experts to manually tag a set of clips and comparing the resulting tags with those automatically obtained by semantic enrichment.

• Selection strategy—Where to seek data?: we selected two projects of different levels to achieve a wide coverage of the aspects involved in SBI design. PR-Pol was a level-1 project on a very wide and dynamic domain, led by a small team; PR-CG was a level-2 project on a narrower domain, led by a system integrator.

In Table 3 we show the time spent on each task, distinguishing the first iterations from the maintenance ones; missing items in the maintenance column denote activities made on demand, i.e., only at some iterations. Some comments on the values reported are necessary:

• Even if macro-analysis poses no particular problems, it usually requires a large amount of time because it is carried out during non-technical meetings that involve several different corporate departments.

• Maintaining the domain ontology requires more time in PR-Pol than in PR-CG. The reason is that the Italian politics subject area is quite wider than the consumer goods one, which implies a larger amount of dynamic content to be analyzed in order to verify which new topics are to be added to the ontology.

• The time saving in semantic enrichment design for PR-Pol is mainly due to the adoption of a structured set of tests that led the team to easily obtain the desired level of performance. This time saving is not apparent in maintenance iterations due to the higher complexity of the politics subject area.

• In query design and content relevance analysis, the amount of time needed to test how the developed queries work largely depends on the project level. In a level-2 project, the customer usually delegates crawling to an external service provider, who normally is capable of estimating the volume of clips retrieved by each specified query. Conversely, in a level-1 project, crawling has to be managed in every aspect, so that the effectiveness of a query can only be assessed after a whole clip acquisition session, which usually lasts 24 hours; as a result, the execution of this activity can be significantly longer.

• The customer's effort is clearly reduced in a level-2 project. In particular, if no external provider is used for crawling, template design may end up being very time consuming, which results in the largest time overhead.

As to semantic enrichment design, we focus on sentiment analysis, one of the most complex and important phases of the SBI process, which consists in determining the sentiment associated to a specific clip. Though the correctness of this analysis is obviously related to the capabilities of the semantic enrichment engine, a fine tuning can lead to dramatic improvements. Both our projects shared the same engine: SyN Semantic Center, a well-known commercial suite that enables a linguistic and semantic analysis of any piece of textual information based on its morphology, syntax, and semantics using logical-functional rules. So we investigated how the correctness of sentiment analysis was affected by the adoption of our methodology by asking five domain experts to manually tag a large set of clips (about 1,500) with their sentiment and then submitting them to the tuned/non-tuned engine. Tuning had a similar duration in the two projects (about two months) and led to a similar number of changes in the engine (about 330). Table 4 shows the results: clips are classified according to three criteria (media type, difficulty of a human expert in defining the sentiment, sentiment); the correct sentiment is assumed to be the one chosen by the majority of the domain experts. The semantic engine initially performed worse for PR-Pol than for PR-CG because the politics subject area uses a wider terminology and is probably more complex than the consumer goods one. However, the improvements obtained for PR-Pol are clearly larger than those for PR-CG. An in-depth analysis of the approach adopted by the PR-CG team evidenced a lack of attention to the side effects of word polarization, which often introduced as many errors as it solved. Conversely, a more structured approach (see Section 4.5) and a continuous and iterative check of the side effects made the PR-Pol team's effort more effective.

Our case studies confirmed that ontology design and crawling design are the two most strictly-coupled activities and that their synchronization is a key factor to increase the overall performance. On the one hand, within crawling design, the query design and content relevance analysis tasks are based on the topics determined by ontology design; on the other, the coverage achieved by the domain ontology mostly depends on how effectively crawling is able to exclude off-topic clips. In PR-Pol, at each iteration of ontology design, coverage analysis of the available clips was always made twice: once before adding new topics and once afterwards. The clips that remain uncovered are then handed on, together with the updated ontology, to crawling design and signaled as off-topic clips (i.e., crawling queries must be updated to discard these clips). This simple but effective protocol was applied every two days; in about 8 calendar weeks the topics in the ontology increased from 139 to 225, and its coverage rose from 93% to 98%.

6. DISCUSSION AND FINAL REMARKS

Responsiveness in an SBI project is not a choice but rather a necessity, since the frequency of changes requires a tight involvement of domain experts to detect these changes and rapid iterations to keep the process well-tuned. Such a fran-
Table 3: Time spent on tasks, expressed in man-days for first iterations and in man-days per week in maintenance iterations (n.a. stands for not available because the task has been outsourced)

    Activity/Task                                        PR-CG                    PR-Pol
                                                  1st Iter.  Maint. Iter.  1st Iter.  Maint. Iter.
    Macro-Analysis                                   10          —            9          —
    Ontology Design                                   4          0.6          7          1.5
      Topics Definition                               2          0.5          2          1
      Inter-Topic Relation Definition                 2          0.1          5          0.5
    Source Selection                                  3          1            5          1
    Semantic Enrichment Design                        7          0.75         5          1
    Crawling Design                                  10          1           29          1.5
      Template Design                                n.a.       n.a.         15          —
      Query Design & Content Relevance Analysis      10          1           14          1.5
    ETL & OLAP Design                                15          —           24          —
      ETL Design & Implementation                     5          —           10          —
      KPI Design                                      5          —            7          —
      Dashboard Design                                5          —            7          —
    Execution & Test                                  3          —            5          —
    Total                                            52          3.35        84          5
    In charge of the customer                        15          0.85        84          5
Information Systems Frontiers, 15(3):331–349, 2013.
[8] M. Golfarelli and S. Rizzi. Data Warehouse design: Modern principles and methodologies. McGraw-Hill, 2009.
[9] M. Golfarelli, S. Rizzi, and E. Turricchia. Modern software engineering methodologies meet data warehouse design: 4WD. In Proc. DaWaK, pages 66–79, Toulouse, France, 2011.
[10] M. Hepp, P. D. Leenheer, A. de Moor, and Y. Sure, editors. Ontology Management, volume 7 of Semantic Web And Beyond Computing for Human Experience. Springer, 2008.
[11] R. Hughes. Agile Data Warehousing: Delivering world-class business intelligence systems using Scrum and XP. IUniverse, 2008.
[12] J. Kahan and M.-R. Koivunen. Annotea: an open RDF infrastructure for shared web annotations. In Proc. WWW, pages 623–632, Hong Kong, China, 2001.
[13] B. Liu and L. Zhang. A survey of opinion mining and sentiment analysis. In Mining Text Data, pages 415–463. Springer, 2012.
[14] J.-N. Mazón and J. Trujillo. An MDA approach for the development of data warehouses. In Proc. JISBD, pages 208–208, 2009.
[15] M. McCord. Slot grammar: A system for simpler construction of practical natural language grammars. In R. Studer, editor, Natural Language and Logic, volume 459 of Lecture Notes in Computer Science, pages 118–145. Springer, 1989.
[16] F. Neri, C. Aliprandi, and F. Camillo. Mining the web to monitor the political consensus. In U. K. Wiil, editor, Counterterrorism and Open Source Intelligence, volume 2 of Lecture Notes in Social Networks, pages 391–412. Springer, 2011.
[17] N. F. Noy and C. Hafner. The state of the art in ontology design: A survey and comparative review. AI Magazine, 18(3):53–74, 1997.
[18] N. F. Noy and M. C. A. Klein. Ontology evolution: Not the same as schema evolution. Knowl. Inf. Syst., 6(4):428–440, 2004.
[19] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: Sentiment classification using machine learning techniques. In Proc. EMNLP, pages 79–86, 2002.
[20] S. Raghavan and H. Garcia-Molina. Crawling the hidden web. In Proc. VLDB, pages 129–138, Rome, Italy, 2001.
[21] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proc. EMNLP, pages 1524–1534, Edinburgh, Scotland, 2011.
[22] C. Robson. Real World Research. Blackwell, 2002.
[23] B. Rosenfeld and R. Feldman. Clustering for unsupervised relation identification. In Proc. CIKM, pages 411–418, Lisbon, Portugal, 2007.
[24] P. Runeson and M. Höst. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164, 2009.
[25] A. Sen and A. P. Sinha. A comparison of data warehousing methodologies. Commun. ACM, 48(3):79–84, 2005.
[26] L. Stojanovic. Methods and tools for ontology evolution. PhD thesis, Forschungszentrum Informatik, Karlsruhe, 2004.
[27] A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. S. Sarma, R. Murthy, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In Proc. SIGMOD Conference, pages 1013–1020, 2010.
[28] X. Wang, A. McCallum, and X. Wei. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proc. ICDM, pages 697–702, Washington, DC, USA, 2007.
[29] J. Yi, T. Nasukawa, R. C. Bunescu, and W. Niblack. Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Proc. ICDM, pages 427–434, Melbourne, Florida, 2003.
[30] D. Zhang, C. Zhai, and J. Han. Topic Cube: Topic modeling for OLAP on multidimensional text databases. In Proc. SDM, pages 1123–1134, 2009.