Introduction to business
intelligence and data
analysis
IT300
TBS 2020-2021
1
What is Business intelligence ?
▪ Zeng et al. (2006) define Business Intelligence (BI)
as the process of collection, treatment and
diffusion of information that has an objective, the
reduction of uncertainty in the making of all
strategic decisions.
▪ Stackowiak et al. (2007) define BI as the process of
taking large amounts of data, analyzing that data,
and presenting a high-level set of reports that
condense the essence of that data into the basis
of business actions, enabling management to
make fundamental daily business decisions. 2
What is Business intelligence ?
▪ To learn from the past and forecast the future,
many companies are adopting BI tools and
systems.
▪ In such rough and competitive environment,
strategic decision making is extremely complex
and often requires the consideration of several
objectives while satisfying hard constraints.
As a result, the main concern of the managers is to
deliver sophisticated solutions strategies in an
attempt to reach the predefined objectives and
fulfill all system constraints
3
Business intelligence
▪ The institutions and firms operate in an open
system (that is impacted by external variables)
▪ There is a big need to make business analysis and
strategic decision making
▪ Is to find a match between the opportunities in
the environment and the strengths and
weaknesses of the firm
4
Decision-making support systems
▪ Decision making support systems are
information systems (SI) which are designed
to interactively support all phases of an end-
user’s decision making process in
organizations.
▪ Many organizations are turning to decision
support systems to improve decision
making.
▪ Turning a challenge to a learning curve.
5
Decision Making Process
1. Define the problem: the manager must identify the
problem.
2. Establish decision criteria and goals : obtaining
necessary information and data.
3. Analyze Data : Formulate a model between goals and
the important variables (physical models as a scale
model of a building):
The approaches to decision making include:
▪ the use of models,
▪ the use quantitative methods,
▪ the analysis of trade-offs,
▪ establishing priorities,
▪ the systems approach 6
Decision Making Process
Models are often a key tool used by all decision makers.
A model is an abstraction of reality, a simplified
representation of something. For example, a child’s toy
car is a model of a real automobile.
4. Identify and evaluate various alternatives or
solutions.
5. Select the best alternative (decision theory : an
analytical approach to select the best alternative,
it is closely related to the field of game theory).
6. Implement your decision.
7
Decision Making process
Define the
1 Problem
Obtaining
2 Data
Decision making
Analyze Data
3
Create
4 alternatives
Select the
best
5 alternative
Implement
the decision
6 alternative 8
Data analysis
Information processing is the analysis of a large
quantity of data or other forms of information to
support decision making and to discover
knowledge in data.
This is indeed the biggest challenge posed by big
and often unstructured data: how to analyze it in a
useful way.
Objectives of Data analysis:
• Increase the effectiveness of the manager’s
decision making process,
• Support the manager in the decision making
process but not replace it,
• And improve the directions of the decision
9
making.
Evolution of Database Technology
1960s :
• Data collection, database creation, IMS and
network DBMS
1970s :
• Relational data model, relational DBMS
implementation
1980s:
• RDBMS, advanced data models and
application-oriented DBMS
1990s—2000s:
• Data mining, data warehousing (DW),
multimedia databases and web databases 10
Origins of Data Warehouses
▪ Database developers understood that their
software was required for both transactional and
analytical processing.
▪ However, operational and analytical data are
separate with different requirements and different
user communities.
▪ Once these differences were understood, new data
bases were created specifically for analysis use.
11
Origins of Data Warehouses
▪ Operational processing (transactional processing)
captures, stores and manipulates data to support
daily operations.
▪ Information processing is the analysis of data or
other forms of information to support decision
making.
▪ DW can consolidate and integrate information
from many internal and external sources and
arrange it in a meaningful format for making
business decisions.
12
What is a Data Warehouse ?
▪ According to Inmon’s (father of data warehousing) :
It is a collection of integrated, subject-oriented,
databases designed to support the DSS function,
where each unit of data is non-volatile and relevant
to some moment in time.
▪ Or a DW is : A subject-oriented, integrated, time-
variant, non-updatable collection of data used in
support of management decision-making processes:
▪ Subject-oriented: e.g. customers, patients, products
▪ Integrated: Consistent naming conventions, formats,
encoding structures; from multiple data sources
▪ Time-variant: Can study trends and changes
▪ Non-updatable: Read-only, periodically refreshed 13
Need for Data Warehousing
▪ A DW allows its users to extract required data, for
business analysis and strategic decision making
▪ DW is a process, not a product
▪ DW is an architecture : is the way of organizing data
14
Database, Data warehouse and
Data set
▪ DB : contains tables, rows refer to records and
columns to fields. Most DBs are relational DBs
(relating tables to reduce redundancy & improve
DB performance via the normalization process)
▪ DW : is a type of DB that has been denormalized
& archived.
▪ Denormalization is the process of combining
some tables into a single table. This may
introduce duplicate data, but will reduce the
number of joins a query has to process.
▪ Data set : is a sub-set of a DW or a DB. It is usually
denormalized so that only one table is used. 15
How Do Data Warehouses Differ
From Operational Systems?
▪ Goals
▪ Structure
▪ Size
▪ Performance optimization
▪ Technologies used
16
Need to separate operational and
information systems
Three primary factors:
▪ A DW centralizes data that are scattered
throughout disparate operational systems and
makes them available for DM.
▪ A well-designed data warehouse adds value to
data by improving their quality and consistency.
▪ A separate DW eliminates much of the contention
for resources that results when information
applications are mixed with operational
processing.
17
Comparison of Database Types
Data warehouse Operational system
Subject oriented Transaction oriented
Large (hundreds of GB up to Small (MB up to several GB)
several TB)
Historical data Current data
Denormalized table structure Normalized table structure
(few tables, many columns per (many tables, few columns per
table) table)
Batch updates Continuous updates
Usually very complex queries Simple to complex queries
18
From the Data Warehouse to Data
Marts
▪ A data mart contains only those data that are
specific to a particular group. For example, the
marketing data mart may contain only data
related to items, customers, and sales.
▪ Data marts are confined to subjects.
▪ Data marts are small in size.
▪ Data marts are customized by department.
19
How Data Warehousing works
20
How Data Warehousing works
Extraction Transformation Loading–ETL tools
Extract Transform Load
& Clean
Sources DSA DW
DSA: A Data Staging Area is a temporary location where data from
21
source systems are copied.
Example of analytical questions: Historical
analysis of the number of passengers
•How many passengers are frequent flyers?
• How much time do passengers spend on
average in the different zones?
•How much time do passengers spend on average
before entering different zones? How is the
distribution of time spent per passenger in a
given zone?
•Which day of week are the zones used the most?
•When is there a risk of bottlenecks in specific
zones?
22
ER Model vs. Multidimensional
Model
▪ Why don’t we use the entity-relationship (ER)
model in data warehousing?
▪ ER model: a data model for general purposes
– All types of data are equal, difficult to identify the
data that is:
• important for business analysis
• No difference between: What is important ? What
just describes the important?
• Normalized databases (many details that can affect
privacy and security)
– Hard to overview a large ER diagram (e.g., over 100
entities/relations for an enterprise)
23
ER Model vs. Multidimensional
Model
▪ Traditional DBs generally deal with two-dimensional
data. However, querying performance in a multi-
dimensional data storage model is more efficient.
▪ More built in “meaning”
– What is important
– What describes the important
– What we want to optimize
▪ Recognized by OLAP/BI tools : Tools that offer powerful
query facilities based on Multi-Dimensional (MD) design
24
Multidimensional Model
▪ Data is divided into: Facts and Dimensions
▪ A fact is the important entity: exp a sale
▪ Facts have measures that can be aggregated: sales
price
▪ Dimensions describe facts
▪ Facts “live” in a MD cube
▪ Goal for dimensional modeling:
– Surround facts with as much context (dimensions) as
possible
– Hint: redundancy may be ok (in well-chosen places)
– But you should not try to model all relationships in the
data (unlike E/R and OO modeling!) 25
Dimension
▪ Dimensions are the core of MD databases
▪ Dimensions are used for
▪ Selection of data
▪ Grouping of data at the right level of detail
▪ Dimensions consist of dimension values
▪ Product dimension has values ”milk”, ”cream”, …
▪ Time dimension has values ”1/1/2001”, ”2/1/2001”,…
▪ Dimension values may have an ordering
▪ Used for comparing cube data across values
▪ Especially used for Time dimension
26
Dimension
▪ Dimensions have hierarchies with levels
▪ Typically 3-5 levels (of detail)
▪ Dimension values are organized in a tree structure
▪ Product: Product->Type->Category
▪ Store: Store->Area->City->County
▪ Time: Day->Month->Quarter->Year
▪ Dimensions have a bottom level and a top level
▪ Levels may have attributes
▪ Simple, non-hierarchical information
▪ Day has Workday as attribute
▪ Dimensions should contain much information
▪ Time dimension may contain holiday, season, events,…
▪ Good dimensions have 50-100 or more attributes/levels
27
Facts
▪ Facts represent the subject of the desired analysis
• The important in the business that should be
analyzed
▪ A fact is identified via its dimension values
• A fact is a non-empty cell
▪ Generally, a fact should:
• Be attached to exactly one dimension value in
each dimension
• Only be attached to dimension values in the
bottom levels
28
Measures
▪ Measures represent the fact property that the
users want to study and optimize
▪ Example: total sales price
▪ A measure has two components
▪ Numerical value: (exp: sales price)
▪ Aggregation formula (exp: SUM): used for
aggregating/combining a number of measure values
into one
29
Multidimensional Model
Example: sales of supermarkets
• Facts and measures
– Each sales record is a fact, and its sales value is a
measure
• Dimensions
– Group correlated attributes into the same
dimension
– Each sales record is associated with its values of
Product, store, Time
30
Granularity: Dimensionality Hierarchy
▪ Granularity of facts is important
▪ Level of detail
▪ Given by combination of bottom levels
▪ A dimensional hierarchy defines mappings from a set of
lower-level concepts to higher level concepts.
Country
Year
2D data
Region Season
Quarter
City
Month Week
Area
31
ZipCode Day
Schema Design
▪ A schema is a logical description of the entire
database.
▪ Much like a database, a data warehouse also
requires to maintain a schema.
▪ A database uses relational model, while a data
warehouse uses Star, Snowflake, and Fact
Constellation schema.
32
Star schema
▪ A star schema consists of two types of tables:
• fact table
• dimension tables
▪ Each dimension in a star schema is represented
with only one-dimension table.
▪ This dimension table contains the set of
attributes.
33
Star schema: Components
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
34
Snowflake schema
▪ Snowflake schema is an expanded version of a
star schema in which dimension tables are
normalized into several related tables.
▪ Advantages
• Small saving in storage space
• Normalized structures are easier to update and
maintain
▪ Disadvantages
• A schema that is less intuitive
• The ability to browse through the content is difficult
• A degraded query performance because of additional
joins.
35
Snowflake schema : Example
time
item
time_key
day item_key supplier
day_of_the_week Sales Fact Table item_name
supplier_key
month brand
time_key supplier_type
quarter type
year item_key supplier_key
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key city
branch_type
dollars_sold
city_key
avg_sales city
province_or_street
Measures country
36
Fact Constellation Schema
▪ A fact constellation has multiple fact tables. It is
also known as galaxy schema.
▪ The following diagram shows two fact tables,
namely sales and shipping.
37
Fact Constellation Schema
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location
branch location_key location to_location
branch_key location_key dollars_cost
branch_name
units_sold
street
branch_type dollars_sold city units_shipped
province_or_state
avg_sales country shipper
Measures shipper_key
shipper_name
location_key
38
shipper_type
38
On-Line Analytical Processing (OLAP)
▪ Original definition : The dynamic synthesis,
analysis, and consolidation of large volumes of
multi-dimensional data, [Codd, 1993].
▪ Describes a technology that is designed to
optimize the storing and querying of large
volumes of multi-dimensional data that is
aggregated (summarized) to various levels of
detail to support the analysis of this data.
39
The Complete Decision Support
System
Information Data Warehouse OLAP Servers Clients
Sources Server (Tier 2) (Tier 3)
(Tier 1)
e.g., MOLAP
OLAP
Semi-structured
Sources
serve
extract Query/Reporting
transform Data
serve
load Warehouse
refresh e.g., ROLAP
.
Data Mining
Operational serve
DB’s
Data Marts
40