DMDW Notes Unit 1
A Data Warehouse consists of data from multiple heterogeneous data sources and is used for
analytical reporting and decision making. Data Warehouse is a central place where data is stored
from different data sources and applications.
A Data Warehouse is used for reporting and analysis of information and stores both historical
and current data. The data in a DW system is used for analytical reporting, which is later used by
business analysts, sales managers, or knowledge workers for decision-making.
Although a data warehouse and a traditional database share some similarities, they are not the
same thing. The main difference is that in a database, data is collected for transactional
purposes, whereas in a data warehouse, data is collected on an extensive scale to perform
analytics. Databases provide real-time data, while warehouses store data to be accessed for big
analytical queries.
ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The process of ETL can be broken down into the following
three stages:
Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various
source systems, which can be in different formats such as relational databases, NoSQL, XML,
and flat files, into the staging area. It is important to extract the data from the various source
systems and store it in the staging area first, rather than directly in the data warehouse, because
the extracted data is in various formats and can also be corrupted. Loading it directly into the
data warehouse could therefore damage it, and rollback would be much more difficult. This
makes extraction one of the most important steps of the ETL process.
Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions is applied to the extracted data to convert it into a single standard format. It
may involve tasks such as:
Cleaning – filling up the NULL values with some default values, mapping U.S.A, United
States, and America into USA, etc.
Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is loaded into the data warehouse
very frequently, and sometimes it is loaded at longer but regular intervals. The rate and period
of loading depend solely on the requirements and vary from system to system.
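As a simple illustration of these three stages, the following minimal Python sketch outlines an ETL flow. The source file name, warehouse table, column names and cleaning rules are hypothetical and are used only for illustration.

import csv
import sqlite3

# Extract: read raw rows from a (hypothetical) source file into a staging list.
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))  # the staging area is a plain list of dicts

# Transform: apply rules to bring the extracted rows into a single standard format.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(rows):
    cleaned = []
    for row in rows:
        country = COUNTRY_MAP.get(row.get("country", ""), row.get("country", ""))
        amount = float(row["amount"]) if row.get("amount") else 0.0  # fill NULLs with a default
        cleaned.append((country, amount))
    return cleaned

# Load: write the transformed rows into the warehouse table.
def load(rows, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (country TEXT, amount REAL)")
    con.executemany("INSERT INTO sales (country, amount) VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))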
The ETL process can also use the pipelining concept, i.e. as soon as some data is extracted, it
can be transformed, and during that period new data can be extracted. Likewise, while the
transformed data is being loaded into the data warehouse, the already extracted data can be
transformed.
TYPES OF DATA WAREHOUSE
The three main types of data warehouses are enterprise data warehouse (EDW), operational data
store (ODS), and data mart.
An enterprise data warehouse (EDW) is a centralized warehouse that provides decision support
services across the enterprise. EDWs are usually a collection of databases that offer a unified
approach for organizing data and classifying data according to subject.
An operational data store (ODS) is a central database used for operational reporting as a data
source for the enterprise data warehouse described above. An ODS is a complementary element
to an EDW and is used for operational reporting, controls, and decision making. An ODS is
refreshed in real-time, making it preferable for routine activities such as storing employee
records. An EDW, on the other hand, is used for tactical and strategic decision support.
A data mart is considered a subset of a data warehouse and is usually oriented to a specific team
or business line, such as finance or sales. It is subject-oriented, making specific data available to
a defined group of users more quickly, providing them with critical insights. The availability of
specific data ensures that they do not need to waste time searching through an entire data
warehouse.
Online Analytical Processing (OLAP)
OLAP stands for Online Analytical Processing. OLAP systems can analyze database
information from multiple systems at the same time. The primary goal of an OLAP service is
data analysis, not data processing.
OLTP stands for Online Transaction Processing. OLTP administers the day-to-day transactions
of an organization. The main goal of OLTP is data processing, not data analysis.
Online Analytical Processing (OLAP) consists of a type of software tool used for data analysis
to support business decisions. OLAP provides an environment to get insights from data
retrieved from multiple database systems at one time.
OLAP Examples
Any type of data warehouse system is an OLAP system. However, OLAP systems have some
notable characteristics and limitations, described below.
OLAP services require professionals to handle the data because of the complex modelling
procedure.
OLAP services are expensive to implement and maintain when datasets are large.
Analysis of data can be performed only after extraction and transformation of the data, which
delays the system.
OLAP services are not efficient for operational decision-making, as the data is updated only on
a periodic basis.
OLTP Examples
An example of an OLTP system is an ATM center: the person who authenticates first receives
the amount first, on the condition that the amount to be withdrawn is present in the ATM. Some
benefits of OLTP systems are described below.
OLTP services allow users to perform read, write and delete operations quickly.
OLTP services support large numbers of users and transactions, which helps in real-time
access to data.
OLTP services help to provide better security by applying multiple security features.
OLTP services help in better decision making by providing accurate and current data.
OLTP services provide data integrity, consistency, and high availability.
However, OLTP systems also have some limitations:
OLTP has limited analysis capability, as it is not intended for complex analysis or reporting.
OLTP has high maintenance costs because of frequent maintenance, backups, and recovery.
OLTP services are hampered whenever there is a hardware failure, which leads to the failure of
online transactions.
OLTP services often experience issues such as duplicate or inconsistent data.
DIFFERENCE BETWEEN OLAP AND OLTP
Definition: OLAP is well known as an online database query management system, whereas
OLTP is well known as an online database modifying system.
Application: OLAP is subject-oriented and is used for data mining, analytics, decision making,
etc., whereas OLTP is application-oriented and is used for business tasks.
Backup and Recovery: OLAP only needs backup from time to time as compared to OLTP,
whereas in OLTP the backup and recovery process is maintained rigorously.
The technique of frequent pattern mining is built upon a number of fundamental ideas. The
analysis is based on transaction databases, which include records or transactions that represent
collections of objects. Items inside these transactions are grouped together as itemsets.
One of the most popular methods, the Apriori algorithm, uses a step-by-step procedure to find
frequent itemsets. It starts by creating candidate itemsets of length 1, determining their support,
and eliminating any that fall below the predetermined cutoff. The method then repeatedly joins
the frequent itemsets from the previous phase to produce bigger itemsets. The procedure is
repeated until no more frequent itemsets can be found. The Apriori approach is commonly used
because of its efficiency and simplicity, but because it requires numerous database scans for big
datasets, it can be computationally inefficient.
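Before walking through the example, here is a minimal pure-Python sketch of the Apriori idea described above (join frequent (k-1)-itemsets, prune candidates with infrequent subsets, then count support). The function name and the sample transactions are hypothetical, for illustration only.

from itertools import combinations

def apriori(transactions, min_support=2):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    transactions = [set(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    current = {s: c for s in items if (c := support(s)) >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets that share (k-2) items
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count support of the surviving candidates against the dataset
        current = {c: s for c in candidates if (s := support(c)) >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transactions, for illustration only
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"}]
print(apriori(T, min_support=2))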
Consider the following dataset and we will find frequent itemsets and generate association rules
for them.
Step-1:
K=1
(I) Create a table containing the support count of each item present in the dataset – called C1
(candidate set).
(II) Compare the candidate set items' support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L1.
Step-2:
K=2
Generate candidate set C2 using L1 (this is called the join step). The condition for joining
Lk-1 and Lk-1 is that they should have (K-2) elements in common.
Check whether all subsets of an itemset are frequent or not, and if not, remove that itemset.
(For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for
each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate set (C2) support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L2.
Step-3:
Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is
that they should have (K-2) elements in common, so here, for L2, the first element should
match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.
Check whether all subsets of these itemsets are frequent or not, and if not, remove that
itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3} and {I1, I3}, which are
frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check
every itemset.)
Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate set (C3) support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than min_support, remove
that item. This gives us itemset L3.
Step-4:
Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4)
is that they should have (K-2) elements in common, so here, for L3, the first two elements
(items) should match.
Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by
joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent.) So there is no itemset
in C4.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association
rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence – a confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
So here, taking one frequent itemset as an example, we will show the rule generation. Consider
the itemset {I1, I2, I3} from L3. The candidate rules and their confidences are:
[I1^I2] => [I3], confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3] => [I2], confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3] => [I1], confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1] => [I2^I3], confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2] => [I1^I3], confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
[I3] => [I1^I2], confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
So if the minimum confidence is 50%, the first three rules can be considered strong association
rules.
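The confidence arithmetic above can be verified with a few lines of Python. The support counts below are taken directly from the calculations above (sup(I1) = 6, sup(I2) = 7, sup(I3) = 6, sup(I1^I2) = sup(I1^I3) = sup(I2^I3) = 4, sup(I1^I2^I3) = 2); the rest is a sketch.

from itertools import combinations

# Support counts taken from the confidence calculations above
support = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

def confidence(antecedent, consequent):
    # Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
    return support[frozenset(antecedent) | frozenset(consequent)] / support[frozenset(antecedent)]

min_conf = 0.5
itemset = {"I1", "I2", "I3"}
for r in range(1, len(itemset)):
    for antecedent in map(set, combinations(sorted(itemset), r)):
        consequent = itemset - antecedent
        conf = confidence(antecedent, consequent)
        status = "strong" if conf >= min_conf else "weak"
        print(f"{sorted(antecedent)} => {sorted(consequent)}: {conf:.0%} ({status})")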
Advantages of Apriori
The algorithm is easy to understand, and its join and prune steps are easy to implement on large
itemsets in large databases.
Disadvantages of Apriori
It requires a significant amount of computation if the itemsets are very large and the minimum
support is kept very low.
A full scan of the entire database is required.
FP-GROWTH ALGORITHM
A different strategy for frequent pattern mining is provided by the FP-growth algorithm. It
creates a compact data structure known as the FP-tree that effectively describes the dataset
without creating candidate itemsets. The FP-growth algorithm constructs the FP-tree recursively
and then directly mines frequent itemsets from it. FP-growth can be much quicker than Apriori
by skipping the construction of candidate itemsets, which lowers the number of passes over the
dataset. It is very helpful for sparse and huge datasets.
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the elements
whose frequency is greater than or equal to the minimum support. These elements are stored in
descending order of their respective frequencies. After insertion of the relevant items, the set L
looks like this:-
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by iterating the
Frequent Pattern set and checking if the current item is contained in the transaction in question.
If the current item is contained, the item is inserted in the Ordered-Item set for the current
transaction. The following table is built for all the transactions:
Now, all the Ordered-Item sets are inserted into a Trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in the set,
and the support count for each item is initialized as 1.
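A minimal sketch of this insertion step is given below. The node structure and function names are illustrative; only the ordered-item set {K, E, M, O, Y} from the text is used as input, and the remaining ordered-item sets would be inserted the same way, incrementing counts along shared prefixes.

class FPNode:
    # A node of the FP-tree (trie): an item, its support count, and its children
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode

def insert_ordered_itemset(root, ordered_items):
    # Link the items one after the other, incrementing the count along the path
    node = root
    for item in ordered_items:
        if item not in node.children:
            node.children[item] = FPNode(item)
        node = node.children[item]
        node.count += 1  # a newly created node starts at 1; shared prefixes accumulate

root = FPNode()  # the null root of the FP-tree
insert_ordered_itemset(root, ["K", "E", "M", "O", "Y"])  # ordered-item set (a) above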
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the
set of elements that is common in all the paths in the Conditional Pattern Base of that item and
calculating its support count by summing the support counts of all the paths in the Conditional
Pattern Base.
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by
pairing the items of the Conditional Frequent Pattern Tree set with the corresponding item,
as given in the table below.
For each row, two types of association rules can be inferred; for example, for the first row, the
rules K -> Y and Y -> K can be inferred. To determine the valid rule, the confidence of both
rules is calculated, and the one with confidence greater than or equal to the minimum
confidence value is retained.
The FP Growth algorithm in data mining has several advantages over other frequent itemset
mining algorithms, as mentioned below:
Efficiency:
FP Growth algorithm is faster and more memory-efficient than other frequent itemset
mining algorithms such as Apriori, especially on large datasets with high dimensionality.
This is because it generates frequent itemsets by constructing the FP-Tree, which
compresses the database and requires only two scans.
Scalability:
FP Growth algorithm scales well with increasing database size and itemset
dimensionality, making it suitable for mining frequent itemsets in large datasets.
Resistant to noise:
FP Growth algorithm is more resistant to noise in the data than other frequent itemset
mining algorithms, as it generates only frequent itemsets and ignores infrequent itemsets
that may be caused by noise.
Parallelization:
FP Growth algorithm can be easily parallelized, making it suitable for distributed
computing environments and allowing it to take advantage of multi-core processors.
Disadvantages of FP Growth Algorithm
While the FP Growth algorithm in data mining has several advantages, it also has some
limitations and disadvantages, as mentioned below:
Memory consumption:
Although the FP Growth algorithm is more memory-efficient than other frequent itemset
mining algorithms, storing the FP-Tree and the conditional pattern bases can still require
a significant amount of memory, especially for large datasets.
Complex implementation:
The FP Growth algorithm is more complex than other frequent itemset mining
algorithms, making it more difficult to understand and implement.
Eclat Algorithm
Eclat, which stands for Equivalence Class Clustering and bottom-up Lattice Traversal, is a
popular frequent pattern mining method. It explores the itemset lattice using a depth-first
search approach, concentrating on a vertical data format representation. Transaction identifiers
(TIDs) are used effectively by Eclat to compute intersections between itemsets. This technique
is known for its ease of use and low memory requirements, making it appropriate for mining
frequent itemsets in vertical databases.
The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up Lattice
Traversal. It is one of the popular methods of Association Rule mining. It is a more efficient and
scalable version of the Apriori algorithm. While the Apriori algorithm works in a horizontal
sense imitating the Breadth-First Search of a graph, the ECLAT algorithm works in a vertical
manner just like the Depth-First Search of a graph. This vertical approach of the ECLAT
algorithm makes it faster than the Apriori algorithm.
How does the algorithm work?
The basic idea is to use Transaction Id Set (tidset) intersections to compute the support value
of a candidate and to avoid the generation of subsets which do not exist in the prefix tree. In the
first call of the function, all single items are used along with their tidsets. Then the function is
called recursively and in each recursive call, each item-tidset pair is verified and combined with
other item-tidset pairs. This process is continued until no candidate item-tidset pairs can be
combined. Let us now understand the above working with an example. Consider the following
transaction record:
The above-given data is a boolean matrix where, for each cell (i, j), the value denotes whether
the j-th item is included in the i-th transaction or not; 1 means true and 0 means false. We now
call the function for the first time and arrange each item with its tidset in a tabular fashion
(k = 1, minimum support = 2). We then recursively call the function, combining item-tidset
pairs and intersecting their tidsets, for k = 2, k = 3 and k = 4.
We stop at k = 4 because there are no more item-tidset pairs to combine. Since minimum support
= 2, we conclude the following rules from the given dataset:-
Items Bought    Recommended Products
Bread           Butter
Bread           Milk
Bread           Jam
Butter          Milk
Butter          Coke
Butter          Jam
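The vertical, tidset-intersection recursion described above can be sketched in a few lines of Python. The transactions below are hypothetical (the boolean matrix from the notes is not reproduced here), so the printed itemsets are for illustration only.

def eclat(prefix, items, min_support, frequent):
    # Depth-first ECLAT: extend the current prefix by intersecting tidsets
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            frequent[tuple(sorted(prefix + [item]))] = len(tids)
            # Project the remaining items onto this prefix by intersecting tidsets
            suffix = [(other, tids & other_tids) for other, other_tids in items
                      if len(tids & other_tids) >= min_support]
            eclat(prefix + [item], sorted(suffix), min_support, frequent)

# Hypothetical transactions, for illustration only
transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"},
                {"Bread", "Jam"}, {"Butter", "Coke"}, {"Bread", "Butter", "Jam"}]

# Vertical representation: item -> set of transaction ids (tidset)
tidsets = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidsets.setdefault(item, set()).add(tid)

frequent = {}
eclat([], sorted(tidsets.items()), min_support=2, frequent=frequent)
print(frequent)  # frequent itemsets with their support counts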
Advantages of the ECLAT Algorithm over Apriori
Memory Requirements: Since the ECLAT algorithm uses a Depth-First Search approach, it
uses less memory than the Apriori algorithm.
Speed: The ECLAT algorithm is typically faster than the Apriori algorithm.
Number of Computations: The ECLAT algorithm does not involve repeated scanning of the
data to compute the individual support values.
Association analysis can provide valuable insights into consumer behavior and preferences. It
can help retailers identify the items that are frequently purchased together, which can be used to
optimize product placement and promotions. Similarly, it can help e-commerce websites
recommend related products to customers based on their purchase history.
Types of Associations
Here are the most common types of associations used in data mining:
Correlation Analysis is a data mining technique used to identify the degree to which two or
more variables are related or associated with each other. Correlation refers to the statistical
relationship between two or more variables, where the variation in one variable is associated
with the variation in another variable. In other words, it measures how changes in one variable
are related to changes in another variable. Correlation can be positive, negative, or zero,
depending on the direction and strength of the relationship between the variables.
For example, we are studying the relationship between the hours of study and the grades
obtained by students. If we find that as the number of hours of study increases, the grades
obtained also increase, then there is a positive correlation between the two variables. On the
other hand, if we find that as the number of hours of study increases, the grades obtained
decrease, then there is a negative correlation between the two variables. If there is no relationship
between the two variables, we would say that there is zero correlation.
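As a small illustration of the hours-of-study example, the Pearson correlation coefficient can be computed with Python's standard library (Python 3.10+); the numbers below are hypothetical.

from statistics import correlation  # Pearson correlation coefficient

hours_studied = [1, 2, 3, 4, 5, 6]        # hypothetical data
grades = [52, 58, 63, 70, 74, 81]         # hypothetical data

r = correlation(hours_studied, grades)
print(f"r = {r:.2f}")  # close to +1, i.e. a strong positive correlation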
After performing a correlation analysis, it is important to interpret the results to draw meaningful
conclusions about the relationship between the analyzed variables. One common way to interpret
correlation coefficients is by using the following general guidelines -
Any score from +0.5 to +1 indicates a strong positive correlation, meaning that the variables
are strongly related in a positive direction, i.e. they increase together.
Any score from -0.5 to -1 indicates a strong negative correlation, meaning that the variables
are strongly related in a negative direction, i.e. as one variable increases, the other decreases
and vice-versa.
A score of 0 indicates no correlation, meaning there is no relationship between the analyzed
variables.