
Chapter 3:

Data Warehousing

Learning Objectives for Chapter 3

1. Understand the basic definitions and concepts of data warehouses


2. Understand data warehousing architectures
3. Describe the processes used in developing and managing data warehouses
4. Explain data warehousing operations
5. Explain the role of data warehouses in decision support
6. Explain data integration and the extraction, transformation, and load (ETL) processes
7. Describe real-time (active) data warehousing
8. Understand data warehouse administration and security issues

Section 3.2

Data warehouse

A data warehouse is defined in this section as “a pool of data produced to support decision
making.” It is a physical repository where relational data are specially organized to provide
enterprise-wide, cleansed data in a standardized format.

The data warehouse is “a collection of integrated, subject-oriented databases designed to support
DSS functions, where each unit of data is nonvolatile and relevant to some moment in time.”

Data warehouse versus database

Technically a data warehouse is a database, albeit with certain characteristics to facilitate its role in
decision support. Specifically, however, it is an “integrated, time-variant, nonvolatile, subject-
oriented repository of detail and summary data used for decision support and business
analytics within an organization.” These characteristics are not necessarily true of databases in
general—though each could apply individually to a given one.

As a practical matter, most databases are highly normalized, in part to avoid update anomalies.
Data warehouses are highly denormalized for performance reasons. This is acceptable because
their content is never updated, just added to. Historical data are static.
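
To make the normalization contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module and a hypothetical sales subject area (not an example from the text): the operational design splits customers and orders into separate normalized tables, while the warehouse fact table repeats customer attributes in every row so analytical queries avoid joins.

import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized operational tables: each fact is stored once, which avoids
# update anomalies when customer details change.
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     order_date TEXT, amount REAL,
                     FOREIGN KEY (customer_id) REFERENCES customer (customer_id));
""")

# Denormalized warehouse table: customer attributes are repeated in every row,
# so analytical queries need no joins. This is acceptable because rows are
# only ever appended, never updated.
conn.execute("""
CREATE TABLE sales_fact (order_id INTEGER, order_date TEXT, amount REAL,
                         customer_name TEXT, customer_region TEXT)
""")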

Operational Data Store (ODS) is the database from which a business operates on an on-going
basis.
An EDW (Enterprise Data Warehouse) is an all-encompassing DW that covers all subject areas
of interest to the entire organization.

Data mart is a smaller DW designed around one problem, organizational function, topic, or other
suitable focus area.

Metadata are data about data.

Metadata are beneficial because they are the means through which applications and users access the
content of a data warehouse, through which its security is managed, and through which
organizational management manages, in the true sense of the word, its information assets. Most
database management systems would be unable to function without at least some metadata. Indeed,
the use of metadata, which enables data access through names and logical relationships rather than
physical locations, is fundamental to the very concept of a DBMS. Metadata are essential to any
database, not just a data warehouse.
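
As a rough illustration (a hypothetical catalog, not part of the chapter), the fragment below shows how metadata can map a logical name to a physical table and carry access rules, so applications query by name rather than by physical location.

# Minimal metadata-catalog sketch; all names here are hypothetical.
metadata_catalog = {
    "SalesByRegion": {                      # logical name used by applications
        "physical_table": "dw.sales_fact",  # where the data actually lives
        "columns": {"region": "TEXT", "total_sales": "REAL"},
        "refresh_frequency": "daily",
        "allowed_roles": ["analyst", "manager"],  # security managed via metadata
    }
}

def resolve(logical_name, role):
    """Return the physical table for a logical name if the role is allowed."""
    entry = metadata_catalog[logical_name]
    if role not in entry["allowed_roles"]:
        raise PermissionError(f"{role} may not access {logical_name}")
    return entry["physical_table"]

print(resolve("SalesByRegion", "analyst"))   # -> dw.sales_fact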

Section 3.3

Description of the data warehousing process

The data warehousing process consists of the following steps:


1. Data are imported from various internal and external sources
2. Data are cleansed and organized consistently with the organization’s needs
3. Data are loaded into the enterprise data warehouse or into data marts
4. If desired, data marts are created as subsets of the EDW, or existing data marts are consolidated
into the EDW
5. Analyses are performed as needed

The major components of a data warehouse

 Data sources. Data are sourced from operational systems and possibly from external data
sources.
 Data extraction and transformation. Data are extracted and properly transformed using
custom-written or commercial software called ETL.
 Data loading. Data are loaded into a staging area, where they are transformed and cleansed.
The data are then ready to load into the data warehouse.
 Comprehensive database. This is the EDW that supports decision analysis by providing
relevant summarized and detailed information.
 Metadata. Metadata are maintained for access by IT personnel and users. Metadata include
rules for organizing data summaries that are easy to index and search.
 Middleware tools. Middleware tools enable access to the data warehouse from a variety of
front-end applications.

Middleware Tools and their Role

Middleware tools enable access to the data warehouse. Power users such as analysts may write
their own SQL queries. Others may access data through a managed query environment. There are
many front-end applications that business users can use to interact with data stored in the data
repositories, including data mining, OLAP, reporting tools, and data visualization tools. All of these
have their own data access requirements, which may not match how a given data warehouse must be
accessed. Middleware translates between the two.

Section 3.4 DW Architecture

Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data & software
3. Client (front-end) software that allows users to access and analyze data

Tier 1: Client workstation
Tier 2: Application server
Tier 3: Database server

Two-tier architecture

Tier 1: Client workstation
Tier 2: Application & database server

Key Similarities and Differences between a Two-Tiered Architecture and a Three-Tiered


Architecture

Both provide the same user visibility through a client system that accesses a DSS/BI application
remotely. The difference is behind the scenes and is invisible to the user: in a two-tiered
architecture, the application and data warehouse reside on the same machine; in a three-tiered
architecture, they are on separate machines.

The Web has also influenced data warehouse design by making Web-based data warehousing
possible.

Web-based data warehousing: the client (Web browser) connects over the Internet, an intranet, or an
extranet to a Web server; the Web server passes requests to an application server, which queries the
data warehouse, and the results are returned to the client as Web pages.

Alternative data warehousing architectures

 Independent data marts architecture


 Data mart bus architecture with linked dimensional data marts
 Hub-and-spoke architecture (corporate information factory)
 Centralized data warehouse architecture
 Federated architecture

(a) Independent Data Marts Architecture: source systems → ETL/staging area → independent data
marts (atomic/summarized data) → end-user access and applications

(b) Data Mart Bus Architecture with Linked Dimensional Data Marts: source systems → ETL/staging
area → dimensionalized data marts linked by conformed dimensions (atomic/summarized data) →
end-user access and applications

(c) Hub-and-Spoke Architecture (Corporate Information Factory): source systems → ETL/staging
area → normalized relational warehouse (atomic data) → dependent data marts (summarized/some
atomic data) → end-user access and applications

ETL (Extract, Transform, and Load) is the process in data warehousing responsible for pulling data
out of the source systems and placing it into a data warehouse.
ETL involves:
- extracting the data from source systems (SAP (System Analysis and Program Development), ERP,
and other operational systems); during extraction, data from different source systems are converted
into one consolidated data warehouse format that is ready for transformation processing;
- transforming the data, which may involve the following tasks (illustrated in the sketch after this list):
applying business rules (so-called derivations, e.g., calculating new measures and dimensions),
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F"),
filtering (e.g., selecting only certain columns to load),
splitting a column into multiple columns and vice versa,
joining together data from multiple sources (e.g., lookup, merge);
- loading the data into a data warehouse, data repository, or other reporting applications.
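
The following is a minimal sketch of these steps in Python; the source rows, column names, and tax derivation are hypothetical, and sqlite3 stands in for the warehouse.

import sqlite3

def extract():
    # Extract: in practice, read from SAP/ERP exports or operational databases;
    # here, two hard-coded rows stand in for the source system.
    return [
        {"customer_id": "1", "amount": "100.0", "gender": "Male"},
        {"customer_id": "2", "amount": "",      "gender": "Female"},
    ]

def transform(rows):
    out = []
    for r in rows:
        amount = float(r["amount"] or 0)          # cleaning: NULL/empty -> 0
        out.append({
            "customer_id": int(r["customer_id"]),
            "amount": amount,
            "gender": {"Male": "M", "Female": "F"}.get(r["gender"], r["gender"]),  # cleaning
            "amount_with_tax": round(amount * 1.15, 2),  # derivation: a new measure
        })  # filtering: only these columns are kept for loading
    return out

def load(rows, conn):
    # Load: append the transformed rows into the warehouse fact table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                 "(customer_id INT, amount REAL, gender TEXT, amount_with_tax REAL)")
    conn.executemany("INSERT INTO sales_fact VALUES "
                     "(:customer_id, :amount, :gender, :amount_with_tax)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)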


The factors that should be considered when deciding which architecture to use in developing a
data warehouse:

1. Information interdependence between organizational units


2. Upper management’s information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors

Section 3.5
Data Integration

Data integration is an umbrella term that covers three processes that combine to move data from
multiple sources into a data warehouse: accessing the data, combining different views of the
data, and capturing changes to the data.

Data integration flow: data from packaged applications, legacy systems, and other internal
applications pass through extract, transform, cleanse, and load steps (possibly via a transient data
source) into the data warehouse and data marts.

ETL process:

ETL is the process through which data are loaded into a data warehouse; a DW could not exist
without it. The ETL process also contributes to the quality of the data in a DW. It includes the
following steps:

Extraction: selecting data from one or more sources and reading the selected data.

Transformation: converting data from their original form to whatever form the DW needs. This
step often also includes cleansing of the data to remove as many errors as possible.

Load: putting the converted (transformed) data into the DW.

Section 3.6
Benefits of data warehouses
Direct benefits include:
 Allowing end users to perform extensive analysis in numerous ways.
 A consolidated view of corporate data (i.e., a single version of the truth).
 Better and more timely information. A data warehouse permits information processing to be
offloaded from costly operational systems onto low-cost servers; therefore, end-user
information requests can be processed more quickly.
 Enhanced system performance. A data warehouse frees production processing because some
operational system reporting requirements are moved to DSS.
 Simplification of data access.

Indirect benefits arise when end users take advantage of these direct benefits.

Criteria for selecting a data warehouse vendor

Six important criteria are: financial strength, ERP linkages, qualified consultants, market share,
industry experience, and established partnerships. These are important to indicate that a
vendor is likely to be in business for the long term, to have the support capabilities its customers
need, and to provide products that interoperate with other products the potential user has or may
obtain.

Additional criteria include product functionality (does it do what we need?), vendor strategic vision
(does their direction make sense for our future plans, and is it consistent with industry trends?), and
quality of customer references (what do their existing customers think of them?).

OLAP (online analytical processing) is one of the most commonly used data analysis techniques in
data warehouses. OLAP is an approach to quickly answering ad hoc questions that require data
analysis. Data stored in a data warehouse can be analyzed using techniques referred to as OLAP.

The main operational structure in OLAP is based on a concept called the cube. A cube in OLAP is a
multidimensional data structure (actual or virtual) that allows fast analysis of data.

Using OLAP, an analyst can navigate through the database and screen for a particular subset of the
data (and its progression over time) by changing the data’s orientations and defining analytical
calculations. These types of user-initiated navigation of data through the specification of slices (via
rotations) and drill down/up (via aggregation and disaggregation) are sometimes called “slice and
dice.” Commonly used OLAP operations include slice and dice, drill down, roll up, and pivot; a
short sketch of these operations follows the list below.

 Slice: A slice is a subset of a multidimensional array (usually a two-dimensional


representation) corresponding to a single value set for one (or more) of the dimensions not
in the subset.

 Dice: The dice operation is a slice on more than two dimensions of a data cube.

 Drill Down/Up: Drilling down or up is a specific OLAP technique whereby the user
navigates among levels of data ranging from the most summarized (up) to the most detailed
(down).
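
As a rough illustration of these operations, the sketch below uses pandas on a small hypothetical sales table; the dimension and measure names are made up for the example.

import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "region":  ["East", "West", "East", "West", "East"],
    "product": ["A", "A", "B", "B", "A"],
    "amount":  [100, 150, 120, 200, 90],
})

# Roll up: aggregate to the year level (most summarized).
by_year = sales.groupby("year")["amount"].sum()

# Drill down: from year to year + region detail.
by_year_region = sales.groupby(["year", "region"])["amount"].sum()

# Slice: fix a single value of one dimension (year = 2024).
slice_2024 = sales[sales["year"] == 2024]

# Dice: restrict several dimensions at once (year = 2024 and region = East).
dice = sales[(sales["year"] == 2024) & (sales["region"] == "East")]

# Pivot: rotate the cube so regions become columns.
pivot = sales.pivot_table(values="amount", index="year",
                          columns="region", aggfunc="sum")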

OLTP (online transaction processing) is concerned with the capture and storage of data.

ROLAP stands for Relational Online Analytical Processing. ROLAP is an alternative to the
MOLAP (Multidimensional OLAP) technology. While both ROLAP and MOLAP analytic tools
are designed to allow analysis of data through the use of a multidimensional data model, ROLAP
differs in that it does not require the pre-computation and storage of information. Instead, ROLAP
tools access the data in a relational database and generate SQL queries to calculate information at
the appropriate level when an end user requests it. MOLAP is an alternative to the ROLAP
technology. MOLAP differs from ROLAP significantly in that it requires the pre-computation and
storage of information in the cube—the operation known as preprocessing. MOLAP stores this data
in an optimized multidimensional array storage, rather than in a relational database.
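
A minimal sketch of the ROLAP approach (hypothetical table and column names): rather than reading a precomputed cube, a SQL aggregation query is generated against the relational warehouse at the moment the user asks for it.

def rolap_query(measures, dimensions, table="sales_fact", filters=None):
    # Build a GROUP BY query at request time instead of precomputing a cube.
    select = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in measures])
    sql = f"SELECT {select} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    sql += " GROUP BY " + ", ".join(dimensions)
    return sql

# An end user requesting sales by year and region would trigger:
print(rolap_query(["amount"], ["year", "region"], filters=["year >= 2023"]))
# SELECT year, region, SUM(amount) AS amount FROM sales_fact
# WHERE year >= 2023 GROUP BY year, region

A MOLAP tool, by contrast, would answer the same request by looking up cells that were precomputed and stored in the cube during preprocessing.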

HOLAP (Hybrid Online Analytical Processing) is a combination of ROLAP and MOLAP.


HOLAP allows storing part of the data in a MOLAP store and another part of the data in a ROLAP
store. The degree of control that the cube designer has over this partitioning varies from product to
product.

All of these are variations of OLAP.

Section 3.7
Major DW implementation tasks that can be performed in parallel

Reeves (2009) and Solomon (2005) provided some guidelines regarding the critical questions that
must be asked, some risks that should be weighed, and some processes that can be followed to help
ensure a successful data warehouse implementation. They compiled a list of 11 major tasks that
could be performed in parallel:

Establishment of service-level agreements and data-refresh requirements


Identification of data sources and their governance policies
Data quality planning
Data model design
ETL tool selection
Relational database software and platform selection

Data transport
Data conversion
Reconciliation process
Purge and archive planning
End-user support

DW implementation guidelines

 Senior management must support development of the data warehouse. The DW needs a
project champion at a high position in the organization chart. Benefits of a DW project may
be difficult to measure, so management support makes it more likely the project will receive
funding.
 Web-based data warehouses may need special security requirements. These ensure that only
authorized users have access to the data.
 Users should participate in the development process. Their participation is essential for data
modeling and access modeling. User participation ensures that the DW includes the needed
data and that decision makers can retrieve the data they need.
 DW implementation requires certain skills from members of the development team: in-depth
knowledge of database technology and the development tools used.

Risks and issues to consider and potentially avoid when developing a successful data warehouse

 Starting with the wrong sponsorship chain


 Setting expectations that you cannot meet
 Engaging in politically naive behavior
 Loading the data warehouse with information just because it is available
 Believing that data warehousing database design is the same as transactional database design
 Choosing a data warehouse manager who is technology oriented rather than user oriented
 Focusing on traditional internal record-oriented data and ignoring the value of external data
and of text, images, and, perhaps, sound and video
 Delivering data with overlapping and confusing definitions
 Believing promises of performance, capacity, and scalability
 Believing that your problems are over when the data warehouse is up and running
 Focusing on ad hoc data mining and periodic reporting instead of alerts

Scalability refers to the degree to which a system can adjust to changes in demand without major
additional changes or investments. DW scalability issues are the amount of data in the
warehouse, how quickly the warehouse is expected to grow, the number of concurrent users, and
the complexity of user queries. A data warehouse must scale both horizontally and vertically.
The warehouse will grow as a function of data growth and the need to expand the warehouse to
support new business functionality. Data growth may be a result of the addition of current cycle
data (e.g., this month’s results) and/or historical data.

Section 3.8

An RDW (real-time data warehouse) is one in which decision-making data are updated on an
ongoing basis as business transactions occur; it is the same as an active data warehouse (ADW).

The benefits of an RDW

The RDW extends the benefits of data warehousing, in general, down into tactical, and perhaps
operational, decision making. It empowers people who interact directly with customers and
suppliers by providing them with information to make decisions. It can then be extended to
customers and suppliers themselves, thus affecting almost all aspects of customer service, SCM,
logistics, and beyond. It can also facilitate e-business activities, as when sales outlets such as
overstock.com (cited in the text) use historical data to price new close-outs.

Differences between a traditional data warehouse and an RDW

1. A traditional data warehouse (TDW) is used for strategic decisions (and sometimes
tactical); an RDW for strategic and tactical (sometimes operational) ones.
2. The results of using a TDW can be hard to measure; results of using an RDW are
measured by operational data.
3. Acceptable TDW refresh rates range from daily to monthly; RDW data must be up to the
minute.
4. TDW summaries are often appropriate; RDWs must supply detailed data.
5. Small user community at upper organizational levels means a TDW supports few
concurrent users; an RDW must support many, perhaps over a thousand.
6. TDWs typically use restrictive reporting to confirm or check patterns, often predefined
summary tables; RDWs need flexible, ad hoc reporting.
7. TDW user community generally consists of power users, knowledge workers, managers,
other internal users; RDWs are used by operational staff, call centers, perhaps external
users.

The drivers for RDW

 A business often cannot afford to wait a whole day for its operational data to load into the
data warehouse for analysis.
 Traditional data warehouses have captured snapshots of an organization’s fixed states
instead of incremental real-time data showing every state change and almost analogous
patterns over time.
 With a traditional hub-and-spoke architecture, keeping the metadata in sync is difficult. It
is also costly to develop, maintain, and secure many systems as opposed to one large data
warehouse in which data are centralized for BI/BA tools.
 In cases of huge nightly batch loads, the necessary ETL setup and processing power for
large nightly data warehouse loading might be very high, and the processes might take too
long. EAI (enterprise application integration) with real-time data collection can reduce or
eliminate the nightly batch processes (see the sketch after this list).
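
The sketch below illustrates the trickle-load idea behind this last point; the queue, table, and timing are hypothetical, and a real system would use an EAI/CDC feed rather than an in-process queue.

import sqlite3, queue

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (order_id INT, amount REAL, load_ts TEXT)")

change_stream = queue.Queue()   # stands in for a real-time feed of transactions

def on_transaction(order_id, amount):
    # Each business transaction is captured as it occurs...
    change_stream.put((order_id, amount))

def micro_batch_load():
    # ...and trickled into the warehouse in small, frequent loads
    # instead of one large nightly batch.
    rows = []
    while not change_stream.empty():
        rows.append(change_stream.get())
    dw.executemany("INSERT INTO sales_fact VALUES (?, ?, datetime('now'))", rows)
    dw.commit()

on_transaction(1001, 250.0)
on_transaction(1002, 75.5)
micro_batch_load()   # run every few seconds or minutes rather than once a night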
Section 3.9
Effective security in a data warehouse should focus on four main areas:

Step 1. Establishing effective corporate and security policies and procedures. An effective security
policy should start at the top and be communicated to everyone in the organization.

Step 2. Implementing logical security procedures and techniques to restrict access. This includes
user authentication, access controls, and encryption (a sketch of a simple access-control check
follows these steps).
Step 3. Limiting physical access to the data center environment.
Step 4. Establishing an effective internal control review process for security and privacy.
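
As a rough sketch of the logical-security step, the fragment below shows a role-based access check placed in front of warehouse queries; the roles and table names are hypothetical.

# Map each role to the warehouse tables it may read.
ROLE_PERMISSIONS = {
    "analyst":     {"sales_fact", "customer_dim"},
    "call_center": {"customer_dim"},
}

def authorize(user_role, table):
    """Allow a query only if the user's role may read the table."""
    if table not in ROLE_PERMISSIONS.get(user_role, set()):
        raise PermissionError(f"role '{user_role}' may not read '{table}'")
    return True

authorize("analyst", "sales_fact")        # permitted
# authorize("call_center", "sales_fact")  # would raise PermissionError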

Skills of a data warehouse administrator (DWA)

 Familiarity with high-performance hardware, software, and networking technologies, since


the data warehouse is based on those
 Solid business insight, to understand the purpose of the DW and its business justification
 Familiarity with business decision-making processes to understand how the DW will be
used
 Excellent communication skills, to communicate with the rest of the organization

Following are some of the recently popularized concepts and technologies that will play a
significant role in defining the future of data warehousing.

Sourcing: acquisition of data from diverse and dispersed sources
 Web, social media, and Big Data
 Open source software
 SaaS (software as a service), “the extended ASP model”
 Cloud computing

Infrastructure: architectural (hardware and software) enhancements
 Columnar databases (a new way to store and access data in the database; see the sketch after
this list)
 Real-time data warehousing
 Data warehouse appliances (all-in-one solutions to DW)
 Data management technologies and practices
 In-database processing technology (putting the algorithms where the data is)
 In-memory storage technology (moving the data into memory for faster processing)
 New database management systems
 Advanced analytics
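
To illustrate the columnar idea, the sketch below stores the same hypothetical table row-wise and column-wise; an analytic query that sums one measure touches a single column in the columnar layout instead of reading every full row.

rows = [  # row-oriented: each record stored together (suits OLTP updates)
    {"order_id": 1, "region": "East", "amount": 100.0},
    {"order_id": 2, "region": "West", "amount": 150.0},
]

columns = {  # column-oriented: each column stored contiguously (suits scans)
    "order_id": [1, 2],
    "region":   ["East", "West"],
    "amount":   [100.0, 150.0],
}

# Summing one measure reads a single column rather than every full row.
total = sum(columns["amount"])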
As the world of business becomes more global and complex, the need for business intelligence and
data warehousing tools becomes more prominent. Fast-improving information technology tools and
techniques seem to be moving in the right direction to address the needs of future business
intelligence systems.
