DATA WAREHOUSINGING & DATA MINING
(KOE093)
UNIT-I
INTRODUCTION
A Database Management System (DBMS) stores data in the form of tables and uses an ER
model and the goal is ACID properties. For example, a DBMS of a college has tables for
students, faculty, etc.
A Data Warehouse is separate from DBMS, it stores a huge amount of data, which is
typically collected from multiple heterogeneous sources like files, DBMS, etc. The goal is
to produce statistical results that may help in decision-making. For example, a college
might want to see quick different results, like how the placement of CS students has
improved over the last 10 years, in terms of salaries, counts, etc
Need for Data Warehouse
An ordinary Database can store MBs to GBs of data and that too for a specific purpose. For
storing data of TB size, the storage shifted to the Data Warehouse. Besides this, a
transactional database doesn’t offer itself to analytics. To effectively perform analytics, an
organization keeps a central Data Warehouse to closely study its business by organizing,
understanding, and using its historical data for making strategic decisions and analyzing
trends.
Benefits of Data Warehouse
Better business analytics: Data warehouse plays an important role in every business to
store and analysis of all the past data and records of the company. which can further
increase the understanding or analysis of data for the company.
Faster Queries: The data warehouse is designed to handle large queries that’s why it
runs queries faster than the database.
Improved data Quality: In the data warehouse the data you gathered from different
sources is being stored and analyzed it does not interfere with or add data by itself so
your quality of data is maintained and if you get any issue regarding data quality then
the data warehouse team will solve this.
Historical Insight: The warehouse stores all your historical data which contains details
about the business so that one can analyze it at any time and extract insights from it.
Data Warehouse vs DBMS
Database Data Warehouse
A common Database is based on operational
A data Warehouse is based on analytical
or transactional processing. Each operation
processing.
is an indivisible transaction.
A Data Warehouse maintains historical
Generally, a Database stores current and up- data over time. Historical data is the data
to-date data which is used for daily kept over years and can used for trend
operations. analysis, make future predictions and
decision support.
A Data Warehouse is integrated generally
at the organization level, by combining
data from different databases.
A database is generally application specific.
Example – A data warehouse integrates
Example – A database stores related data,
the data from one or more databases , so
such as the student details in a school.
that analysis can be done to get results ,
such as the best performing school in a
city.
Constructing a Database is not so Constructing a Data Warehouse can be
expensive. expensive
Applications of Data Warehousing
Data Warehousing can be applied anywhere where we have a huge amount of data and we
want to see statistical results that help in decision making.
Social Media Websites: The social networking websites like Facebook, Twitter,
Linkedin, etc. are based on analyzing large data sets. These sites gather data related to
members, groups, locations, etc., and store it in a single central repository. Being a large
amount of data, Data Warehouse is needed for implementing the same.
Banking: Most of the banks these days use warehouses to see the spending patterns of
account/cardholders. They use this to provide them with special offers, deals, etc.
Government: Government uses a data warehouse to store and analyze tax payments
which are used to detect tax thefts.
Features of Data Warehousing
Data warehousing is essential for modern data management, providing a strong foundation
for organizations to consolidate and analyze data strategically. Its distinguishing features
empower businesses with the tools to make informed decisions and extract valuable
insights from their data.
Centralized Data Repository: Data warehousing provides a centralized repository for
all enterprise data from various sources, such as transactional databases, operational
systems, and external sources. This enables organizations to have a comprehensive view
of their data, which can help in making informed business decisions.
Data Integration: Data warehousing integrates data from different sources into a
single, unified view, which can help in eliminating data silos and reducing data
inconsistencies.
Historical Data Storage: Data warehousing stores historical data, which enables
organizations to analyze data trends over time. This can help in identifying patterns and
anomalies in the data, which can be used to improve business performance.
Query and Analysis: Data warehousing provides powerful query and analysis
capabilities that enable users to explore and analyze data in different ways. This can
help in identifying patterns and trends, and can also help in making informed business
decisions.
Data Transformation: Data warehousing includes a process of data transformation,
which involves cleaning, filtering, and formatting data from various sources to make it
consistent and usable. This can help in improving data quality and reducing data
inconsistencies.
Data Mining: Data warehousing provides data mining capabilities, which enable
organizations to discover hidden patterns and relationships in their data. This can help
in identifying new opportunities, predicting future trends, and mitigating risks.
Data Security: Data warehousing provides robust data security features, such as access
controls, data encryption, and data backups, which ensure that the data is secure and
protected from unauthorized access.
Advantages of Data Warehousing
Intelligent Decision-Making: With centralized data in warehouses, decisions may be
made more quickly and intelligently.
Business Intelligence: Provides strong operational insights through business
intelligence.
Historical Analysis: Predictions and trend analysis are made easier by storing past
data.
Data Quality: Guarantees data quality and consistency for trustworthy reporting.
Scalability: Capable of managing massive data volumes and expanding to meet
changing requirements.
Effective Queries: Fast and effective data retrieval is made possible by an optimized
structure.
Cost reductions: Data warehousing can result in cost savings over time by reducing
data management procedures and increasing overall efficiency, even when there are
setup costs initially.
Data security: Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to certain
data.
Disadvantages of Data Warehousing
Cost: Building a data warehouse can be expensive, requiring significant investments in
hardware, software, and personnel.
Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.
Time-consuming: Building a data warehouse can take a significant amount of time,
requiring businesses to be patient and committed to the process.
Data integration challenges: Data from different sources can be challenging to
integrate, requiring significant effort to ensure consistency and accuracy.
Data security: Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.
Implementation of Data Warehouse
Data Warehouse has been defined in many ways, making it difficult to formulate a
rigorous definition. Gradually speaking, a data warehouse is a data repository that is
kept separate from an organization’s operational database. Data warehouse systems
allow the integration of a wide variety of application systems. They support
information processing by providing a solid plan of aggregated historical data for
analysis.
Data in a data warehouse comes from the organization’s operational systems as well as
other external sources. These are collectively referred to as the source systems. The data
extracted from the source systems is stored in an area called the data staging area, where
the data is cleaned, transformed, assembled, and duplicated to prepare the data in the data
warehouse.
The data staging area is usually a set of machines where simple activities like sorting and
sequential processing take place. The data staging area does not provide as soon as possible
a system provides query or presentation services, it is classified as a presentation server. A
presentation server is the destination machine on which data is loaded from the data staging
area and directly stored for query by end-users, report authors, and other applications.
There are three different types of systems required for a data warehouse –
1. Source Systems
2. Data Staging Area
3. Presentation Server
The data moves from the data source area through the staging area to the presentation
server. The entire process is better known as ETL (extract, transform, and
load) or ETT (extract, transform, and transfer).
Components of Data Warehouse Architecture and their tasks:
1. Operational Source –
An operational Source is a data source consists of Operational Data and External Data.
Data can come from Relational DBMS like Informix, Oracle.
2. Load Manager –
The Load Manager performs all operations associated with the extraction of loading
data in the data warehouse.
These tasks include the simple transformation of data to prepare data for entry into the
warehouse.
3. Warehouse Manager –
The warehouse manager is responsible for the warehouse management process.
The operations performed by the warehouse manager are the analysis, aggregation,
backup and collection of data, de-normalization of the data.
4. Query Manager –
Query Manager performs all the tasks associated with the management of user queries.
The complexity of the query manager is determined by the end-user access operations
tool and the features provided by the database.
5. Detailed Data –
It is used to store all the detailed data in the database schema.
Detailed data is loaded into the data warehouse to complement the data collected.
6. Summarized Data –
Summarized Data is a part of the data warehouse that stores predefined aggregations
These aggregations are generated by the warehouse manager.
7. Archive and Backup Data –
The Detailed and Summarized Data are stored for the purpose of archiving and backup.
The data is relocated to storage archives such as magnetic tapes or optical disks.
8. Metadata –
Metadata is basically data stored above data.
It is used for extraction and loading process, warehouse, management process, and
query management process.
9. End User Access Tools –
End-User Access Tools consist of Analysis, Reporting, and mining.
By using end-user access tools users can link with the warehouse.
Advantages and disadvantages of the components commonly found in data
warehouses:
Data sources: Data sources are the systems or databases that provide data to the data
warehouse. Advantages of using multiple data sources include increased data coverage and
the ability to integrate diverse data types. However, disadvantages include potential data
quality issues, data inconsistencies, and increased complexity in data integration.
ETL (Extract, Transform, Load) processes: ETL processes are used to extract data from
source systems, transform it to conform to the data warehouse schema, and load it into the
data warehouse. Advantages of ETL processes include efficient data integration and
improved data quality. However, disadvantages include potential data loss or corruption,
increased processing time and complexity, and potential data inconsistency due to data
transformations.
Data storage: Data storage is the component of the data warehouse that stores the data.
Advantages of data storage in a data warehouse include the ability to store large amounts of
data in a single location, fast and efficient data retrieval, and improved data quality due to
data cleansing and standardization. Disadvantages include the high cost of data storage,
potential data loss or corruption, and potential security risks associated with storing large
amounts of sensitive data in a single location.
Data modeling: Data modeling is the process of designing the structure of the data
warehouse. Advantages of data modeling include the ability to organize and structure data
in a way that is optimized for BI activities, improved data quality due to data cleansing and
standardization, and increased scalability and flexibility. However, disadvantages include
the potential for complex data relationships and the need for specialized skills and
knowledge to design and implement an effective data model.
Data access tools: Data access tools are used to access and analyze data in the data
warehouse. Advantages of data access tools include the ability to easily access and analyze
data, improved data quality due to data cleansing and standardization, and increased speed
and efficiency of BI activities. Disadvantages include the potential for user error, the need
for specialized skills and knowledge to use the tools effectively, and potential security risks
associated with data access.
Data Marts
Data warehouse and Data Mart, both are storage components of HDFS. Data mart is such
a storage component which is concerned on a specific department of an organization. It is a
subset of the data stored in the data warehouse. Data mart is focused only on particular
function of an organization and it is maintained by single authority only, e.g. finance,
Marketing. Data Marts are small in size and are flexible.
Types of Data Mart:
There are three types of data marts:
Dependent Data Mart –
Dependent Data Mart is created by extracting the data from central repository, Data
warehouse. First data warehouse is created by extracting data (through ETL tool) from
external sources and then data mart is created from data warehouse. Dependent data mart is
created in top-down approach of data warehouse architecture. This model of data mart is
used by big organizations.
Independent Data Mart –
independent Data Mart is created directly from external sources instead of data warehouse.
First data mart is created by extracting data from external sources and then data warehouse
is created from the data present in data mart. Independent data mart is designed in bottom-
up approach of data warehouse architecture. This model of data mart is used by small
organizations and is cost effective comparatively.
Hybrid Data Mart –
This type of Data Mart is created by extracting data from operational source or from data
warehouse. 1Path reflects accessing data directly from external sources and 2Path reflects
dependent data model of data mart
Need of Data Mart:
1. Data Mart focuses only on functioning of particular department of an organization.
2. It is maintained by single authority of an organization.
3. Since, it stores the data related to specific part of an organization, data retrieval from it
is very quick.
4. Designing and maintenance of data mart is found to be quite cinch as compared to data
warehouse.
5. It reduces the response time of user as it stores small volume of data.
6. It is small in size due to which accessing data from it very fast.
7. This Storage unit is used by most of organizations for the smooth running of their
departments.
Advantages of Data Mart:
1. Implementation of data mart needs less time as compared to implementation of data
warehouse as data mart is designed for a particular department of an organization.
2. Organizations are provided with choices to choose model of data mart depending upon
cost and their business.
3. Data can be easily accessed from data mart.
4. It contains frequently accessed queries, so enable to analyse business trend
Mapping the data warehouse architecture to Multiprocessor architecture
The functions of data warehouse are based on the relational data base technology. The
relational data base technology is implemented in parallel manner. There are two advantages
of having parallel relational data base technology for data warehouse:
Linear Speed up: refers the ability to increase the number of processor to reduce response
time
Linear Scale up: refers the ability to provide same performance on the same requests as the
database size increases
Types of parallelism
There are two types of parallelism:
Inter query Parallelism: In which different server threads or processes handle multiple
requests at the same time.
Intra query Parallelism: This form of parallelism decomposes the serial SQL query into
lower level operations such as scan, join, sort etc. Then these lower level operations are
executed concurrently in parallel.
Intra query parallelism can be done in either of two ways:
Horizontal parallelism: which means that the data base is partitioned across multiple disks
and parallel processing occurs within a specific task that is performed concurrently on
different processors against different set of data
Vertical parallelism: This occurs among different tasks. All query components such as scan,
join, sort etc are executed in parallel in a pipelined fashion. In other words, an output from
one task becomes an input into another task.
Data partitioning:
Data partitioning is the key component for effective parallel execution of data base
operations. Partition can be done randomly or intelligently.
Random portioning includes random data striping across multiple disks on a single server.
Another option for random portioning is round robin fashion partitioning in which each
record is placed on the next disk assigned to the data base.
Intelligent partitioning assumes that DBMS knows where a specific record is located and
does not waste time searching for it across all disks.
The various intelligent partitioning include:
Hash partitioning: A hash algorithm is used to calculate the partition number based on the
value of the partitioning key for each row
Key range partitioning: Rows are placed and located in the partitions according to the value
of the partitioning key. That is all the rows with the key value from A to K are in partition 1,
L to T are in partition 2 and so on.
Schema portioning: an entire table is placed on one disk; another table is placed on different
disk etc. This is useful for small reference tables.
User defined portioning: It allows a table to be partitioned on the basis of a user defined
expression.
DBMS schemas for decision support
The basic concepts of dimensional modeling are: facts, dimensions and measures. A fact is a
collection of related data items, consisting of measures and context data. It typically
represents business items or business transactions. A dimension is a collection of data that
describe one business dimension. Dimensions determine the contextual background for the
facts; they are the parameters over which we want to perform OLAP. A measure is a numeric
attribute of a fact, representing the performance or behavior of the business relative to the
dimensions. Considering Relational context, there are three basic schemas that are used in
dimensional modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational data base semantics is
provided by the data base schema design called star schema. The basic of stat schema is that
information can be classified into two groups:
Facts
Dimension
Star schema has one large central table (fact table) and a set of smaller tables (dimensions)
arranged in a radial pattern around the central table.
Facts are core data element being analyzed while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be
based upon the analysis of project requirements, accessible tools and project team
preferences.
The star schema architecture is the simplest data warehouse schema. It is called a star schema
because the diagram resembles a star, with points radiating from a center. The center of the
star consists of fact table and the points of the star are the dimension tables. Usually the fact
tables in a star schema are in third normal form(3NF) whereas dimensional tables are de-
normalized. Despite the fact that the star schema is the simplest architecture, it is most
commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table is a table that contains summarized numerical and historical data (facts) and a
multipart index composed of foreign keys from the primary keys of related dimension tables.
A fact table typically has two types of columns: foreign keys to dimension tables and
measures those that contain numeric facts. A fact table can contain fact's data on detail or
aggregated level.
Dimension Tables
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary
in a fact table can be viewed by a Time dimension (profit by month, quarter, year), Region
dimension
(profit by country, state, city), Product dimension (profit for product1, product2). A
dimension is a structure usually composed of one or more hierarchies that categorizes data. If
a dimension hasn't got a hierarchies and levels it is called flat dimension or list. The primary
keys of each of the dimension tables are part of the composite primary key of the fact table.
Dimensional attributes help to describe the dimensional value. They are normally descriptive,
textual values. Dimension tables are generally small in size then fact table. Typical fact tables
store data about sales while dimension tables data about geographic region (markets, cities),
clients, products, times, channels.
Measures
Measures are numeric data based on columns in a fact table. They are the primary data which
end users are interested in. E.g. a sales fact table may contain a profit measure which
represents profit on each sale.
Aggregations are pre calculated numeric data. By calculating and storing the answers to a
query before users ask for it, the query processing time can be reduced. This is key in
providing fast query performance in OLAP.
Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical
capabilities to clients.
The main characteristics of star schema:
Simple structure -> easy to understand schema
Great query effectives -> small number of tables to join
Relatively long time of loading data into dimension tables -> de-normalization,
redundancy data caused that size of the table could be large.
The most commonly used in the data warehouse implementations -> widely supported
by a large number of business intelligence tools
Snowflake schema: is the result of decomposing one or more of the dimensions. The many-
to one relationship among sets of attributes of a dimension can separate new dimension
tables, forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical
structure of dimensions very well.
Fact constellation schema: For each star schema it is possible to construct fact constellation
schema (for example by splitting the original star schema into more star schemes each of
them describes facts on another level of dimension hierarchies). The fact constellation
architecture contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design because
many variants for particular kinds of aggregation must be considered and selected. Moreover,
dimension tables are still large.
Multi relational Database:
The relational implementation of multidimensional data base systems is referred to as multi
relational database systems.
Multi Dimensional Data Model
The multi-Dimensional Data Model is a method which is used for ordering data in the
database along with good arrangement and assembling of the contents in the database.
The Multi Dimensional Data Model allows customers to interrogate analytical questions
associated with market or business trends, unlike relational databases which allow
customers to access data in the form of queries. They allow users to rapidly receive answers
to the requests which they made by creating and examining the data comparatively fast.
OLAP (online analytical processing) and data warehousing uses multi dimensional
databases. It is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow to model and view the data
from many dimensions and perspectives. It is defined by dimensions and facts and is
represented by a fact table. Facts are numerical measures and fact tables contain measures
of the related dimensional tables or names of the facts
Working on a Multidimensional Data Model
On the basis of the pre-decided steps, the Multidimensional Data Model works.
The following stages should be followed by every project for building a Multi Dimensional
Data Model :
Stage 1 : Assembling data from the client : In first stage, a Multi Dimensional Data
Model collects correct data from the client. Mostly, software professionals provide
simplicity to the client about the range of data which can be gained with the selected
technology and collect the complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage, the Multi
Dimensional Data Model recognizes and classifies all the data to the respective section they
belong to and also builds it problem-free to apply step by step.
Stage 3 : Noticing the different proportions : In the third stage, it is the basis on which
the design of the system is based. In this stage, the main factors are recognized according to
the user’s point of view. These factors are also known as “Dimensions”.
Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth
stage, the factors which are recognized in the previous step are used further for identifying
the related qualities. These qualities are also known as “attributes” in the database.
Stage 5 : Finding the actuality of factors which are listed previously and their qualities
: In the fifth stage, A Multi Dimensional Data Model separates and differentiates the
actuality from the factors which are collected by it. These actually play a significant role in
the arrangement of a Multi Dimensional Data Model.
Stage 6 : Building the Schema to place the data, with respect to the information
collected from the steps above : In the sixth stage, on the basis of the data which was
collected previously, a Schema is built.
Features of multidimensional data models:
Measures: Measures are numerical data that can be analyzed and compared, such as sales
or revenue. They are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location,
or product. They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between
measures and dimensions in a data model. They provide a fast and efficient way to retrieve
and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels
of detail. This is a key feature of multidimensional data models, as it enables users to
quickly analyze data at different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary
of data to a lower level of detail, while roll-up is the opposite process of moving from a
lower-level detail to a higher-level summary. These features enable users to explore data in
greater detail and gain insights into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For
example, a time dimension might be organized into years, quarters, months, and days.
Hierarchies provide a way to navigate the data and perform drill-down and roll-up
operations.
OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model
that supports fast and efficient querying of large datasets. OLAP systems are designed to
handle complex queries and provide fast response times.
Advantages of Multi Dimensional Data Model
The following are the advantages of a multi-dimensional data model :
A multi-dimensional data model is easy to handle.
It is easy to maintain.
Its performance is better than that of normal databases (e.g. relational databases).
The representation of data is better than traditional databases. That is because the multi-
dimensional databases are multi-viewed and carry different types of factors.
It is workable on complex systems and applications, contrary to the simple one-
dimensional database systems.
The compatibility in this type of database is an upliftment for projects having lower
bandwidth for maintenance staff.
Disadvantages of Multi Dimensional Data Model
The following are the disadvantages of a Multi Dimensional Data Model :
The multi-dimensional Data Model is slightly complicated in nature and it requires
professionals to recognize and examine the data in the database.
During the work of a Multi-Dimensional Data Model, when the system caches, there is
a great effect on the working of the system.
It is complicated in nature due to which the databases are generally dynamic in design.
The path to achieving the end product is complicated most of the time.
As the Multi Dimensional Data Model has complicated systems, databases have a large
number of databases due to which the system is very insecure when there is a security
break.
Prepared By:
Manoj Kumar Sharma
Assistant Professor
Department of CSE
VGI