
DATA SCIENCES: DATA WAREHOUSING AND DATA MINING (2018603A)

UNIT 1
INTRODUCTION

Unit Content:
• Data Mining Task Primitives
• Integration of a Data Mining System with a Database or Data Warehouse System
• Major Issues in Data Mining

Course Outcome (CO) Covered for this unit: CO 214.1 Illustrate the concepts of overview of Data Mining.

Number of Lectures planned in Teaching Plan: TWELVE

Reference:
➢ 1. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Elsevier
➢ 2. Margaret H Dunham, Data Mining Introductory and Advanced Topics, Pearson Education


Lecture Number: 01

Topics to cover: Introduction through Concept Map, Motivation, Importance, Definitions

Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. It is the analysis of observational data sets to discover unsuspected relationships and to summarize the records in novel ways that are both understandable and useful to the data owner.
Data mining is closely related to data science. It is carried out by a person, in a particular situation, on a specific data set, with an objective. It spans several kinds of services including text mining, web mining, audio and video mining, pictorial data mining, and social media mining, and it is performed with software that ranges from general-purpose to highly specialized.
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, because of the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
Data mining can be viewed as a result of the natural evolution of information technology. The database industry has followed an evolutionary path in developing functionalities such as data collection and database creation, data management, and advanced data analysis.
For example, the early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, and for query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target.
Data can be stored in many types of databases and data repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site to support management decision making.
Data warehouse technology involves data cleaning, data integration, and online analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation, and the ability to view data from different angles.

Lecture Number: 02

Topics to cover: Kinds of Data, Data Mining Functionalities

Types of data that can be mined


1. Data stored in the database
A database system, also called a database management system (DBMS), stores a collection of interrelated data together with a set of software programs used to manage the data and provide easy access to it. These programs serve many purposes, including defining the structure of the database, ensuring that the stored information remains secure and consistent, and managing different kinds of data access, such as shared, distributed, and concurrent access.
A relational database consists of tables, each with a unique name, a set of attributes (columns), and potentially many rows (records) of data. Every record stored in a table is identified by a unique key. An entity-relationship (ER) model is often created to provide a representation of a relational database in terms of entities and the relationships that exist between them.
2. Data warehouse
A data warehouse is a single data storage location that collects data from multiple sources and then stores it under a unified schema. Before data is stored in a data warehouse, it undergoes cleaning, integration, loading, and refreshing. Data stored in a data warehouse is organized and summarized; if you want information on data from 6 or 12 months back, you will typically get it in the form of a summary.
3. Transactional data
A transactional database stores records that are captured as transactions: flight bookings, customer purchases, clicks on a website, and so on. Every transaction record has a unique ID and lists the items that make up the transaction.
4. Other types of data
There are many other types of data as well, known for their structure, semantic meaning, and versatility. They are used in a lot of applications. Here are a few of those data types: data streams, engineering design data, sequence data, graph data, spatial data, multimedia data, and more.

Data mining functionalities


Data mining functionalities are used to specify the types of patterns that can be discovered in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities which are as follows −
• Data characterization − It is a summarization of the general characteristics of an object class
of data. The data corresponding to the user-specified class is generally collected by a database
query. The output of data characterization can be presented in multiple forms.
• Data discrimination − It is a comparison of the general characteristics of target class data
objects with the general characteristics of objects from one or a set of contrasting classes. The
target and contrasting classes can be represented by the user, and the equivalent data objects
fetched through database queries.
• Association Analysis − It analyses the sets of items that frequently occur together in a transactional dataset. Two parameters are used for determining association rules −
o Support, which identifies the frequent item sets in the database.
o Confidence, which is the conditional probability that an item occurs in a transaction given that another item occurs.
• Classification − Classification is the procedure of discovering a model that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).


• Prediction − It is used to predict missing or unavailable data values or pending trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes. It can be a prediction of missing numerical values or of increase/decrease trends in time-related information.
• Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes themselves. It is a form of unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
• Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are data objects whose behaviour differs markedly from the general behaviour of the other data objects. The analysis of this type of data can be essential for mining knowledge.
• Evolution analysis − It describes and models trends for objects whose behaviour changes over time.

Lecture Number: 03

Topics to cover: Kinds of Patterns, Classification of Data Mining Systems

Introduction to Predictive Analytics and Data Mining


What Kinds of Patterns Can Data Mining Discover?
Using the most relevant data (which may come from organizational databases or may be obtained from
outside sources), data mining builds models to identify patterns among the attributes (i.e., variables or
characteristics) that exist in a data set. Models are usually the mathematical representations (simple
linear correlations and/or complex highly nonlinear relationships) that identify the relationships among
the attributes of the objects (e.g., customers) described in the data set. Some of these patterns are
explanatory (explaining the interrelationships and affinities among the attributes), whereas others are
predictive (projecting future values of certain attributes). In general, data mining seeks to identify three
major types of patterns:

• Associations find commonly co-occurring groupings of things, such as “beers and diapers” or
“bread and butter” commonly purchased and observed together in a shopping cart (i.e., market-
basket analysis). Another type of association pattern captures the sequences of things. These
sequential relationships can discover time-ordered events, such as predicting that an existing
banking customer who already has a checking account will open a savings account followed by
an investment account within a year.

• Predictions tell the nature of future occurrences of certain events based on what has happened in
the past, such as predicting the winner of the Super Bowl or forecasting the absolute temperature
on a particular day.


• Clusters identify natural groupings of things based on their known characteristics, such as
assigning customers in different segments based on their demographics and past purchase
behaviors.

These types of patterns have been manually extracted from data by humans for centuries, but the
increasing volume of data in modern times has created a need for more automatic approaches. As data
sets have grown in size and complexity, direct manual data analysis has increasingly been augmented
with indirect, automatic data processing tools that use sophisticated methodologies, methods, and
algorithms. The manifestation of such evolution of automated and semi-automated means of
processing large data sets is now commonly referred to as data mining.

As mentioned earlier, generally speaking, data mining tasks and patterns can be classified into three
main categories: prediction, association, and clustering. Based on the way in which the patterns are
extracted from the historical data, the learning algorithms of data mining methods can be classified as
either supervised or unsupervised. With supervised learning algorithms, the training data includes both
the descriptive attributes (i.e., independent variables or decision variables) and the class attribute (i.e.,
output variable or result variable). In contrast, with unsupervised learning, the training data includes
only the descriptive attributes. Figure 2.3 shows a simple taxonomy for data mining tasks, along with
the learning methods and popular algorithms for each of the data mining tasks. Out of the three main
categories of tasks, prediction patterns/models can be classified as the outcome of a supervised learning
procedure, while association and clustering patterns/models can be classified as the outcome of
unsupervised learning procedures.


Lecture Number: 04

Topics to cover: Data Mining Task Primitives, Integration of a Data Mining System with a Database or Data Warehouse System

Data Mining Task Primitives


A data mining task can be specified in the form of a data mining query, which is input to the data
mining system. A data mining query is defined in terms of data mining task primitives. These
primitives allow the user to interactively communicate with the data mining system during discovery
to direct the mining process or examine the findings from different angles or depths. The data mining primitives specify the following:

1. Set of task-relevant data to be mined.


2. Kind of knowledge to be mined.
3. Background knowledge to be used in the discovery process.
4. Interestingness measures and thresholds for pattern evaluation.
5. Representation for visualizing the discovered patterns.

A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a foundation
on which user-friendly graphical interfaces can be built.

List of Data Mining Task Primitives


A data mining query is defined in terms of the following primitives, such as:

1. The set of task-relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested. This includes
the database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions).

In a relational database, the set of task-relevant data can be collected via a relational query involving
operations like selection, projection, join, and aggregation.


The data collection process results in a new data relation called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task.

This initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.

2. The kind of knowledge to be mined

This specifies the data mining functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution
analysis.

3. The background knowledge to be used in the discovery process

Background knowledge is knowledge about the domain that can guide the discovery process. Concept hierarchies are a popular form of background knowledge: a concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.

o Rolling up (generalization of data): Replaces lower-level concepts with higher-level ones, allowing data to be viewed at more meaningful and explicit abstractions and making it easier to understand. It also compresses the data, so fewer input/output operations are required.
o Drilling down (specialization of data): Replaces higher-level concept values with lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.

For example, a concept hierarchy for the attribute (or dimension) age might map raw ages to ranges such as 20-39, 40-59, and 60-89, which in turn map to categories such as young, middle-aged, and senior. User beliefs regarding relationships in the data are another form of background knowledge.

4. The interestingness measures and thresholds for pattern evaluation

o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity for human comprehension. For example, the more complex the structure of a rule is, the more difficult it is to interpret, and hence the less interesting it is likely to be.
o Certainty (confidence): Each discovered pattern should have a measure of certainty associated with it that assesses the validity or "trustworthiness" of the pattern. For association rules of the form "A => B", where A and B are sets of items, the certainty measure is confidence. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
Confidence (A => B) = # tuples containing both A and B / # tuples containing A
o Utility (support): The potential usefulness of a pattern is another factor defining its interestingness. It can be estimated by a utility function such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true (a short Python sketch of both measures follows this list):
Support (A => B) = # tuples containing both A and B / total # of tuples


o Novelty: Novel patterns are those that contribute new information or improved performance to the given pattern set; a data exception is one example. Another strategy for detecting novelty is to remove redundant patterns.
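As referenced in the utility item above, the following minimal Python sketch computes support and confidence for a rule A => B over a toy set of transactions. The transactions and item names are invented purely for illustration.

```python
# Toy transactions (invented for illustration); each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "diapers", "beer"},
    {"butter", "milk"},
    {"bread", "butter", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))      # 3 of 5 transactions -> 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 3 of 4 bread transactions -> 0.75
```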

5. The expected representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed, which may include rules,
tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.

Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of knowledge.

For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for
presenting characteristic descriptions, whereas decision trees are common for classification.

What is the integration of a data mining system with a database system?

The data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database system. The possible schemes for integrating these systems are as follows −
No coupling − No coupling means that a data mining system will not use any function of a database or data warehouse system. It may retrieve data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
Loose coupling − The data mining system uses some facilities of a database or data warehouse system. The data is fetched from a data repository managed by these systems, data mining approaches are used to process the data, and the results are then stored either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of the data stored in databases by using query processing or other system facilities.
Semitight coupling − Efficient implementations of a few essential data mining primitives are provided in the database/data warehouse system. These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of some important statistical measures such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling means that the data mining system is smoothly integrated into the database/data warehouse system. The data mining subsystem is treated as one functional component of the information system.

Lecture Number: 05

Topics to cover: Major Issues in Data Mining


Data mining is not an easy task: the algorithms used can get very complex, and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we discuss the major issues regarding −

• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues

These issues are described in the sections below.

Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −
• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive, allowing users to focus the search for patterns and to provide and refine data mining requests based on the returned results.
• Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.


• Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered may fail to be interesting because they represent common knowledge or lack novelty, so good interestingness measures are needed to guide the discovery process.

Performance Issues
There can be performance-related issues such as the following −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mining results when the data changes, without mining the entire data again from scratch.

Diverse Data Types Issues

• Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, and so on. It is not possible for one system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.

Lecture Number: 06

Topics to cover: Types of Data Sets and Attribute Values

Data: Data consists of data objects and their attributes.
• An attribute is a property or characteristic of an object, for example a person's hair colour or the humidity of the air.
• A set of attributes describes an object; the object is also referred to as a record, an instance, or an entity.
Different types of attributes or data types:


In data mining, understanding the different types of attributes or data types is essential as it helps
to determine the appropriate data analysis techniques to use. The following are the different types
of data:
1] Nominal Data:
This type of data is also referred to as categorical data. Nominal data represents data that is
qualitative and cannot be measured or compared with numbers. In nominal data, the values
represent a category, and there is no inherent order or hierarchy. Examples of nominal data include
gender, race, religion, and occupation. Nominal data is used in data mining for classification and
clustering tasks.
2] Ordinal Data:
This type of data is also categorical, but with an inherent order or hierarchy. Ordinal data
represents qualitative data that can be ranked in a particular order. For instance, education level can
be ranked from primary to tertiary, and social status can be ranked from low to high. In ordinal
data, the distance between values is not uniform. This means that it is not possible to say that the
difference between high and medium social status is the same as the difference between medium
and low social status. Ordinal data is used in data mining for ranking and classification tasks.
3] Binary Data:
This type of data has only two possible values, often represented as 0 or 1. Binary data is
commonly used in classification tasks, where the target variable has only two possible outcomes.
Examples of binary data include yes/no, true/false, and pass/fail. Binary data is used in data mining
for classification and association rule mining tasks.
4] Interval Data:
This type of data represents quantitative data with equal intervals between consecutive values.
Interval data has no absolute zero point, and therefore, ratios cannot be computed. Examples of
interval data include temperature, IQ scores, and time. Interval data is used in data mining for
clustering and prediction tasks.
5] Ratio Data:
This type of data is similar to interval data, but with an absolute zero point. In ratio data, it is
possible to compute ratios of two values, and this makes it possible to make meaningful
comparisons. Examples of ratio data include height, weight, and income. Ratio data is used in data
mining for prediction and association rule mining tasks.
6] Text Data:
This type of data represents unstructured data in the form of text. Text data can be found in social
media posts, customer reviews, and news articles. Text data is used in data mining for sentiment
analysis, text classification, and topic modeling tasks.
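To make the attribute types above concrete, here is a small, hedged Python sketch using pandas; the column names and values are invented for the example, and the point is only to show how nominal, ordinal, binary, interval, and ratio attributes can be represented in a table.

```python
import pandas as pd

# Invented records mixing several attribute types.
df = pd.DataFrame({
    "occupation": ["teacher", "engineer", "nurse"],       # nominal
    "education": ["primary", "tertiary", "secondary"],    # ordinal
    "pass_exam": [1, 0, 1],                                # binary
    "temperature_c": [21.5, 19.0, 23.2],                   # interval (no absolute zero)
    "income": [42000, 58000, 51000],                       # ratio (ratios are meaningful)
})

# Declare the ordinal column with an explicit order so comparisons make sense.
df["education"] = pd.Categorical(
    df["education"], categories=["primary", "secondary", "tertiary"], ordered=True
)

print(df.dtypes)
print(df["education"] > "primary")  # ordinal comparison uses the declared order
```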
Data Quality: Why do we preprocess the data?
Data preprocessing is an essential step in data mining and machine learning as it helps to ensure
the quality of data used for analysis. There are several factors that are used for data quality
assessment, including:
1. Incompleteness:
This refers to missing data or information in the dataset. Missing data can result from various
factors, such as errors during data entry or data loss during transmission. Preprocessing techniques,
such as imputation, can be used to fill in missing values to ensure the completeness of the dataset.
2. Inconsistency:


This refers to conflicting or contradictory data in the dataset. Inconsistent data can result from
errors in data entry, data integration, or data storage. Preprocessing techniques, such as data
cleaning and data integration, can be used to detect and resolve inconsistencies in the dataset.
3. Noise:
This refers to random or irrelevant data in the dataset. Noise can result from errors during data
collection or data entry. Preprocessing techniques, such as data smoothing and outlier detection,
can be used to remove noise from the dataset.
4. Outliers:
Outliers are data points that are significantly different from the other data points in the dataset.
Outliers can result from errors in data collection, data entry, or data transmission. Preprocessing
techniques, such as outlier detection and removal, can be used to identify and remove outliers from
the dataset.
5. Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the dataset. Redundant data
can result from data integration or data storage. Preprocessing techniques, such as data
deduplication, can be used to remove redundant data from the dataset.
6. Data format:
This refers to the structure and format of the data in the dataset. Data may be in different formats,
such as text, numerical, or categorical. Preprocessing techniques, such as data transformation and
normalization, can be used to convert data into a consistent format for analysis.
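The quality factors above translate directly into routine preprocessing steps. The following is a minimal pandas sketch over a toy DataFrame with invented columns, showing one possible way to handle redundancy, incompleteness, and outliers.

```python
import numpy as np
import pandas as pd

# Toy data (invented) with a duplicate row, a missing value, and an obvious outlier.
df = pd.DataFrame({
    "age": [25, 25, np.nan, 31, 240],
    "salary": [30000, 30000, 42000, 39000, 41000],
})

df = df.drop_duplicates()                          # redundancy: remove duplicate records
df["age"] = df["age"].fillna(df["age"].median())   # incompleteness: median imputation

# Outliers: flag ages far outside the interquartile range.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```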

Lecture Number: 07

Topics to cover: Basic Statistical Descriptions of Data, Data Visualization

Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify patterns and trends. It is also referred to as quantitative analysis.
• Non-statistical Analysis: This analysis provides generalized information and covers data such as sound, still images, and moving images.
In statistics, there are two main categories:
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the
main characteristics of that data. Graphs or numbers summarize the data. Average, Mode,
SD(Standard Deviation), and Correlation are some of the commonly used descriptive statistical
methods.
• Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about populations
and make models of relationships within data.
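A few of the descriptive measures mentioned above (average, mode, standard deviation, correlation) can be computed directly with pandas; the sample values below are invented.

```python
import pandas as pd

# Invented sample: study hours and exam scores for five students.
scores = pd.DataFrame({
    "hours_studied": [2, 4, 4, 6, 8],
    "exam_score": [55, 62, 62, 75, 88],
})

print(scores["exam_score"].mean())     # average
print(scores["exam_score"].mode()[0])  # most frequent value
print(scores["exam_score"].std())      # standard deviation
print(scores.corr())                   # pairwise correlation matrix
```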


There are various statistical terms that one should be aware of while dealing with statistics. Some
of these are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
Now, let's start discussing statistical methods: the analysis of raw data using mathematical formulas, models, and techniques. Through the use of statistical methods, information is extracted from research data, and different ways are available to judge the robustness of research outputs. In fact, the statistical methods used in the data mining field today are typically drawn from the vast statistical toolkit developed to answer problems arising in other fields, and they are taught in standard science curricula. In practice it is necessary to formulate and test several hypotheses; such hypothesis tests help us assess the validity of our data mining endeavour when attempting to draw inferences from the data under study. When more complex and sophisticated statistical estimators and tests are used, these issues become even more pronounced.
For extracting knowledge from databases containing different types of observations, a variety of
statistical methods are available in Data Mining and some of these are:
• Logistic regression analysis
• Correlation analysis
• Regression analysis
• Discriminate analysis
• Linear discriminant analysis (LDA)
• Classification
• Clustering
• Outlier detection
• Classification and regression trees,
• Correspondence analysis
• Nonparametric regression,
• Statistical pattern recognition,
• Categorical data analysis,
• Time-series methods for trends and periodicity
• Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in data
mining:
• Linear Regression: The linear regression method uses the best linear relationship between the independent and dependent variables to predict the target variable. To achieve the best fit, the distances between the fitted line and the actual observations at each point are made as small as possible.
• Classification: This is a data mining method in which a collection of data is categorized so that it can be analyzed and predictions can be made with a greater degree of accuracy. Classification is an effective way to analyze very large datasets.
• Logistic Regression: Logistic regression is widely applied in machine learning and predictive analytics. In this approach, the dependent variable is either binary (binary regression) or multinomial (multinomial regression), i.e., one of two or one of several possible categories. With a logistic regression equation, one can estimate probabilities regarding the relationship between the independent variables and the dependent variable.
• Discriminant Analysis: Discriminant analysis is a statistical method of analyzing data based on the measurements of categories or clusters and categorizing new observations into one or more populations that were identified a priori. Discriminant analysis models each response class separately and then uses Bayes' theorem to flip these projections around into estimates of the probability of each response category given the value of X. These models can be either linear or quadratic.
• Linear Discriminant Analysis (LDA): In linear discriminant analysis, each observation is assigned a discriminant score that is used to classify it into a response variable class. These scores are obtained by combining the independent variables in a linear fashion.
• Quadratic Discriminant Analysis (QDA): An alternative approach is provided by quadratic discriminant analysis. LDA and QDA both assume Gaussian distributions for the observations within each class of Y. Unlike LDA, QDA allows each class to have its own covariance matrix; as a result, the predictor variables can have different variances across the k levels of Y.
• Correlation Analysis: In statistical terms, correlation analysis captures the relationship between a pair of variables. The values of such variables are usually stored in columns or rows of a database table and represent properties of objects.
• Regression Analysis: Based on a set of numeric data, regression is a data mining method that predicts a range of numerical (continuous) values. You could, for instance, use regression to predict the cost of goods and services based on other variables. Regression models are used across numerous industries for forecasting financial data, modeling environmental conditions, and analyzing trends.
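As a concrete illustration of the regression idea, here is a hedged scikit-learn sketch that fits a simple linear model; the feature (advertising spend), the target (units sold), and all values are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: advertising spend (thousands) vs. units sold.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([12.0, 19.0, 29.0, 37.0, 45.0])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept of the best-fit line
print(model.predict([[6.0]]))         # predicted units sold for a new spend level
```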

What is data visualization?


Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

Idea generation

Data visualization is commonly used to spur idea generation across teams. Visualizations are frequently leveraged during brainstorming or design thinking sessions at the start of a project, supporting the collection of different perspectives and highlighting the common concerns of the collective. While these visualizations are usually unpolished and unrefined, they help set the foundation within the project to ensure that the team is aligned on the problem it is looking to address for key stakeholders.


Idea illustration

Data visualization for idea illustration assists in conveying an idea, such as a tactic or process. It is commonly used in learning settings, such as tutorials, certification courses, and centers of excellence, but it can also be used to represent organization structures or processes, facilitating communication
between the right individuals for specific tasks. Project managers frequently use Gantt charts and
waterfall charts to illustrate workflows. Data modeling also uses abstraction to represent and better
understand data flow within an enterprise’s information system, making it easier for developers,
business analysts, data architects, and others to understand the relationships in a database or data
warehouse.

Visual discovery

Visual discovery and everyday data viz are more closely aligned with data teams. Visual discovery helps data analysts, data scientists, and other data professionals identify patterns and trends within a dataset, while everyday data viz supports the subsequent storytelling after a new insight has been found.

Data visualization
Data visualization is a critical step in the data science process, helping teams and individuals convey data more effectively to colleagues and decision makers. Common types of visualizations include the following:
• Tables: These consist of rows and columns used to compare variables. Tables can show a great deal of information in a structured way, but they can also overwhelm users who are simply looking for high-level trends.
• Pie charts and stacked bar charts: These graphs are divided into sections that represent parts of a whole. They provide a simple way to organize data and compare the size of each component to one another.
• Line charts and area charts: These visuals show change in one or more quantities by plotting a
series of data points over time and are frequently used within predictive analytics. Line graphs utilize
lines to demonstrate these changes while area charts connect data points with line segments, stacking
variables on top of one another and using color to distinguish between variables.
• Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces between the
bars), representing the quantity of data that falls within a particular range. This visual makes it easy
for an end user to identify outliers within a given dataset.
• Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, they can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble.
• Heat maps: These graphical representations are helpful in visualizing behavioral data by location. This can be a location on a map, or even a webpage.
• Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Tree maps are great for comparing the proportions between categories via their area size.
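Several of the chart types listed above can be produced in a few lines with matplotlib; the sketch below draws a histogram and a scatter plot on randomly generated, invented data.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)    # invented measurements
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + rng.normal(scale=3.0, size=100)      # roughly linear relationship

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=20)        # distribution of a single variable
ax1.set_title("Histogram")
ax2.scatter(x, y, alpha=0.6)     # relationship between two variables
ax2.set_title("Scatter plot")
plt.tight_layout()
plt.show()
```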
Open source visualization tools

Access to data visualization tools has never been easier. Open source libraries, such as D3.js, provide
a way for analysts to present data in an interactive way, allowing them to engage a broader audience
with new data. Some of the most popular open source visualization libraries include:

• D3.js: A front-end JavaScript library for producing dynamic, interactive data visualizations in web browsers. D3.js uses HTML, CSS, and SVG to create visual representations of data that can be viewed in any browser. It also provides features for interactions and animations.
• ECharts: A powerful charting and visualization library that offers an easy way to add intuitive, interactive, and highly customizable charts to products, research papers, presentations, etc. ECharts is based on JavaScript and ZRender, a lightweight canvas library.
• Vega: Vega defines itself as a "visualization grammar," providing support to customize visualizations across large datasets which are accessible from the web.
• deck.gl: Part of Uber's open source visualization framework suite. deck.gl is a framework used for exploratory data analysis on big data; it helps build high-performance, GPU-powered visualizations on the web.
Data visualization best practices

With so many data visualization tools readily available, there has also been a rise in ineffective
information visualization. Visual communication should be simple and deliberate to ensure that your
data visualization helps your target audience arrive at your intended insight or conclusion. The
following best practices can help ensure your data visualization is useful and clear:

Know your audience(s): Think about who your visualization is designed for and then make sure
your data visualization fits their needs. What is that person trying to accomplish? What kind of
questions do they care about? Does your visualization address their concerns?

Choose an effective visual: Specific visuals are designed for specific types of datasets. For instance,
scatter plots display the relationship between two variables well, while line graphs display time series
data well. Ensure that the visual actually assists the audience in understanding your main takeaway.
Misalignment of charts and data can result in the opposite, confusing your audience further versus
providing clarity.

Lecture Number: 08

Topics to cover: Measuring Data Similarity


Similarity and dissimilarity:

• Metric:

A given distance (i.e., a dissimilarity measure) is a metric if and only if it satisfies the following four conditions:

1- Non-negativity: d(p, q) ≥ 0 for all p and q.

2- Symmetry: d(p, q) = d(q, p) for all p and q.

3- Triangle inequality: d(p, q) ≤ d(p, r) + d(r, q) for all p, q, r.

4- Identity: d(p, q) = 0 if and only if p = q.

B. Distance Functions:
The technique used to measure distances depends on the particular situation you are working on. For instance, in some areas the Euclidean distance can be optimal and useful for computing distances.

1. L2 norm, Euclidean distance.


(Figure: Euclidean distance contours.)

The most common distance function for numeric attributes or features is the Euclidean distance, defined between two points p and q in n-dimensional space as:

d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )

As you may know, this distance metric has well-known properties: it is symmetric, differentiable, convex, and spherical.

In 2-dimensional space, the previous formula can be expressed as:


d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² )

which is equal to the length of the hypotenuse of a right-angled triangle.

Moreover, the Euclidean distance is a metric, because it satisfies all four conditions listed above.


Furthermore, the distance calculated using this formula represents the smallest distance between each pair of points: it is the shortest path from point A to point B in a 2-D Cartesian coordinate system.

Thus, this formula is appropriate whenever you want to compute the distance between two points in the absence of obstacles on the pathway. Conversely, when obstacles block the direct path, the Euclidean distance is a poor choice; instead, you want to use other metrics such as the Manhattan distance, which will be explained later in this section.
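A minimal NumPy sketch of the Euclidean formula above; the sample points are arbitrary.

```python
import numpy as np

def euclidean(p, q):
    """L2 distance between two points given as equal-length sequences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean([0, 0], [3, 4]))        # 5.0, the hypotenuse of a 3-4-5 right triangle
print(euclidean([1, 2, 3], [4, 6, 8]))  # works in any number of dimensions
```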

As a worked example, consider the well-known Iris dataset, plotted for two types of flowers in a two-feature space. For simplicity and demonstration purposes, let's choose only two features, petal length and petal width, and exclude the Iris-virginica data. In this way, we can plot the data points in a 2-D space where the x-axis and the y-axis represent the petal length and the petal width, respectively.

Each data point comes with its own label: Iris-Setosa or Iris-versicolor (0 and 1 in the dataset). Thus, the dataset can be used for KNN classification, since KNN is a supervised ML algorithm by nature. Let's assume that our ML model (KNN with k = 4) has been trained on this dataset, where we have selected two input features and only twenty data points.

Up to this point everything looks fine, and our KNN classifier is ready to classify a new data point. Therefore, we need a way to let the model decide how the new data point should be classified.


To predict the label for a new data point, the Euclidean distance is used to let each trained data point vote on where the new sample fits: Iris-Setosa or Iris-versicolor. The Euclidean distance is therefore calculated from the new data point to each point of the training data. In this example, all four nearest neighbours vote for Iris-Setosa, so the new data sample is classified as Iris-Setosa. Using this analogy, you can imagine higher dimensions and other classifiers.
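The voting procedure just described can be reproduced with scikit-learn. The sketch below is an illustration under stated assumptions: it uses the library's built-in copy of the Iris data, keeps only the petal length and petal width columns, drops Iris-virginica, and sets k = 4 as in the example above (the twenty-point training subset from the notes is not reproduced exactly).

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target   # keep petal length and petal width only
mask = y != 2                           # keep Setosa (0) and Versicolor (1)
X, y = X[mask], y[mask]

knn = KNeighborsClassifier(n_neighbors=4)   # Euclidean distance is the default metric
knn.fit(X, y)

new_point = [[2.0, 0.6]]          # an invented, unlabeled observation
print(knn.predict(new_point))     # predicted class index (0 = Setosa, 1 = Versicolor)
print(knn.kneighbors(new_point))  # distances and indices of the four voters
```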

As stated earlier, each domain requires a specific way of computing distances; the alternatives below illustrate why.

2. Squared Euclidean distance.

Computing distances using this approach avoids the need for the square-root function. As the name reflects, the squared Euclidean distance (SED) is simply the Euclidean distance squared, so it reduces the computational work needed when comparing distances between observations. It can be used, for instance, in clustering, classification, image processing, and other domains. Between two points in n-dimensional space it is defined as:

d²(p, q) = (p1 − q1)² + (p2 − q2)² + … + (pn − qn)²

3. L1 norm, City Block, Manhattan, or taxicab distance.

(Figure: Manhattan distance contours.)

This metric is very useful for measuring the distance between two places in a given city, where the distance can be expressed in terms of the number of blocks that separate them; two points might, for instance, be roughly four blocks apart. The method was created to compute the distance between a source and a destination in a city where it is nearly impossible to move in a straight line, because buildings grouped into a grid block the straight pathway. Hence the name City Block.

In n-dimensional space, the Manhattan distance is expressed as:

d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
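A short NumPy sketch of the Manhattan distance, mirroring the formula above; the sample points are arbitrary.

```python
import numpy as np

def manhattan(p, q):
    """L1 (city block) distance: the sum of absolute coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(p - q))

print(manhattan([0, 0], [3, 4]))  # 7.0: walk 3 blocks one way and 4 blocks the other
print(manhattan([1, 2], [4, 6]))  # 7.0 as well, versus a Euclidean distance of 5.0
```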

Recall the previous KNN example: computing the Manhattan distance from the new data point to the training data selects a different set of nearest neighbours. In this case, two data points vote for Iris-Setosa and another two vote for Iris-versicolor, which means a tie.

You may have encountered this problem before. An intuitive solution is to change the value of k, decreasing it by one if k is larger than one, or increasing it by one otherwise. However, each of these choices gives the KNN classifier different behaviour. In our example k = 4; changing it to k = 3 gives an odd number of voters and breaks the tie.
Lecture Number: 09, 10

Topics to cover: Pre-processing: Data Quality, Major Tasks in Data Preprocessing

What Is Data Preprocessing? Why Is It Important?

Data preprocessing is the process of transforming raw data into an understandable format. It is also an

important step in data mining as we cannot work with raw data. The quality of the data should be

checked before applying machine learning or data mining algorithms.


Why Is Data Preprocessing Important?

Accuracy: To check whether the data entered is correct or not.

Completeness: To check whether all required data is available and recorded.

Consistency: To check whether copies of the same data kept in different places match.

Timeliness: The data should be kept up to date.

Believability: The data should be trustworthy.

Interpretability: The data should be easy to understand.

Major Tasks in Data Preprocessing

There are 4 major tasks in data preprocessing – Data cleaning, Data integration, Data reduction, and

Data transformation.

Lecture Number: 11

Topics to cover: Data Reduction, Data Transformation, and Data Discretization

Data Reduction
This process helps in the reduction of the volume of the data, which makes the analysis easier yet
produces the same or almost the same result. This reduction also helps to reduce storage space. Some of the data reduction techniques are dimensionality reduction, numerosity reduction, and data compression.

Dimensionality reduction: This process is necessary for real-world applications, as the data size is big. It reduces the number of random variables or attributes under consideration, combining and merging attributes without losing the original characteristics of the data. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the "Curse of Dimensionality" occurs.
Numerosity reduction: In this method, the representation of the data is made smaller by reducing its volume while preserving the information it carries.
Data compression: Representing the data in a compressed form is called data compression. The compression can be lossless or lossy. When no information is lost during compression, it is called lossless compression, whereas lossy compression discards information, removing only what is unnecessary.

Data Transformation
The change made in the format or the structure of the data is called data transformation. This step can
be simple or complex based on the requirements. There are some methods for data transformation.

Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps bring out the important features of the dataset and makes it possible to detect even small changes that help in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. Data coming from multiple sources is integrated and aggregated into a description for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data: when the quality and the quantity of the data are good, the results are more relevant.
Discretization: Continuous data is split into intervals, which reduces the data size. For example, rather than specifying the exact class time, we can record an interval such as 3 pm-5 pm or 6 pm-8 pm.

Normalization: This is the method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
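A brief pandas sketch of discretization and min-max style normalization on an invented column of class start times; the bin edges, labels, and target range are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({"class_hour": [13, 15, 16, 18, 19, 20]})  # invented 24-hour start times

# Discretization: collapse continuous hours into two labeled intervals.
df["slot"] = pd.cut(df["class_hour"], bins=[12, 17, 21], labels=["afternoon", "evening"])

# Normalization: rescale the hours into the range [-1.0, 1.0].
lo, hi = df["class_hour"].min(), df["class_hour"].max()
df["hour_scaled"] = 2 * (df["class_hour"] - lo) / (hi - lo) - 1

print(df)
```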

Lecture Number: 12

Topics to cover: Data Cleaning and Data Integration.

Data Cleaning
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data from the
datasets, and it also replaces the missing values. Here are some techniques for data cleaning:

Handling missing values

Standard values like “Not Available” or “NA” can be used to replace the missing values.


Missing values can also be filled in manually, but this is not recommended when the dataset is big.
The attribute's mean value can be used to replace a missing value when the data is normally distributed; in the case of a non-normal distribution, the median value of the attribute can be used instead.
When using regression or decision tree algorithms, the missing value can be replaced by the most probable value.

Handling noisy data

Noisy data generally means data containing random errors or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to the optimization of the model we are using. Here are some of the methods to handle noisy data.

Binning: This method is used to smooth noisy data. First the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin. Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as boundaries, and each value in the bin is replaced by the closest boundary value.
Regression: This is used to smooth the data and helps handle unnecessary variation in it. For analysis purposes, regression also helps to decide which variables are suitable for our analysis.
Clustering: This is used for finding outliers and also for grouping the data. Clustering is generally used in unsupervised learning.
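A compact NumPy sketch of two of the binning variants described above (smoothing by bin mean and by bin boundaries), using an invented, already-sorted list of values split into equal-frequency bins of size three.

```python
import numpy as np

values = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # invented sorted values
bins = values.reshape(-1, 3)                            # three bins of three values each

# Smoothing by bin mean: every value in a bin is replaced by the bin's mean.
smoothed_mean = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: each value snaps to the closer of the bin's min or max.
lo = np.repeat(bins.min(axis=1), 3)
hi = np.repeat(bins.max(axis=1), 3)
smoothed_boundary = np.where(values - lo <= hi - values, lo, hi)

print(smoothed_mean)      # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(smoothed_boundary)  # [ 4  4 15 21 21 24 25 25 34]
```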

Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Some problems have to be considered during data integration.

Schema integration: Integrating metadata (data that describes other data) from different sources.
Entity identification problem: Identifying the same entities across multiple databases. For example, the system or the user should recognize that the student id in one database and the student name in another database refer to the same entity.
Detecting and resolving data value conflicts: Data taken from different databases may differ when merged; the attribute values in one database may differ from those in another. For example, the date format may differ, like "MM/DD/YYYY" or "DD/MM/YYYY".
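The date-format mismatch mentioned above is a typical data-value conflict. The following hedged pandas sketch shows one way to resolve it (and the entity-identification mismatch) while integrating two sources; the table and column names are invented.

```python
import pandas as pd

# Two invented sources recording the same field in different formats.
us_source = pd.DataFrame({"student_id": [1, 2], "enrolled": ["03/15/2023", "07/01/2023"]})
eu_source = pd.DataFrame({"studentname": [3, 4], "enrolled": ["15/03/2023", "01/07/2023"]})

# Resolve the data value conflict: parse each source with its own date format.
us_source["enrolled"] = pd.to_datetime(us_source["enrolled"], format="%m/%d/%Y")
eu_source["enrolled"] = pd.to_datetime(eu_source["enrolled"], format="%d/%m/%Y")

# Entity identification: map differently named identifier columns onto one schema.
eu_source = eu_source.rename(columns={"studentname": "student_id"})

combined = pd.concat([us_source, eu_source], ignore_index=True)
print(combined)
```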


DATA SCIENCES: DATA WAREHOUSING AND DATA MINING

UNIT 2
DATA WAREHOUSING AND ON-LINE ANALYTICAL PROCESSING

Unit Content:
• Data Warehouse & Modeling
• Data Cube and OLAP
• Data Warehouse Implementation

Course Outcome (CO) Covered for this unit: CO 214.2 Explain Data Warehousing and On-line Analytical Processing in detail.

Number of Lectures planned in Teaching Plan: EIGHT

Reference:
➢ 1. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Elsevier
➢ 2. Margaret H Dunham, Data Mining Introductory and Advanced Topics, Pearson Education

Lecture Number: 13

Topics to cover: Data Warehouse basic concepts.

Data Warehouse
A data warehouse is built with relational database management system (RDBMS) technology, but unlike an operational database it is not designed to meet the requirements of transaction processing systems. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented towards satisfying decision-making requests, together with a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. Data warehousing therefore supports architectures and tools that help business executives systematically organize, understand, and use their information to make strategic decisions.

A data warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that handle the process of gathering information and delivering it to business users.

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and
multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.

A Data Warehouse is a group of data specific to the entire organization, not only to a particular group
of users.

It is not used for daily operations and transaction processing but used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of


management's decisions."

Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires only two procedures in data
accessing: Initial loading of data and access to data.


Goals of Data Warehousing


o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:


1. Business User: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2. Store historical data: A data warehouse is required to store time-variant data from the past. This input is used for various purposes.
3. Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.
4. For data consistency and quality: By bringing the data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5. High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.

Benefits of Data Warehouse


1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses makes it easier for end-users to navigate, understand, and query the data.
4. Queries that would be complex in many normalized databases can be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

Lecture Number: 14

Topics to cover: Data Warehouse Modeling

Data Warehouse Modeling


Data warehouse modeling is the process of designing the schemas of the detailed and summarized information of the data warehouse. The goal of data warehouse modeling is to develop a schema describing reality, or at least the part of it that the data warehouse is required to support.

Data modeling in data warehouses is different from data modeling in operational database systems.
The primary function of data warehouses is to support decision support system (DSS) processes. Thus, the objective of data warehouse modeling is to make the data warehouse efficiently support complex queries on long-term information.

In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover, data
warehouses are designed for the customer with general information knowledge about the enterprise,
whereas operational database systems are more oriented toward use by software specialists for creating
distinct applications.

The data within the warehouse itself has a particular architecture, with an emphasis on various levels of summarization, as described below:


The current detail data is central in importance as it:

o Reflects the most current happenings, which are commonly the most interesting.
o Is voluminous, as it is saved at the lowest level of granularity.
o Is almost always saved on disk storage, which is fast to access but expensive and difficult to manage.

Older detail data is stored in some form of mass storage; it is infrequently accessed and kept at a level of detail consistent with the current detailed data.

Lightly summarized data is data extracted from the low level of detail found at the current detailed level and is usually stored on disk storage. When building the data warehouse, we have to remember what unit of time the summarization is done over, and also which components or attributes the summarized data will contain.

Highly summarized data is compact and directly available and can even be found outside the
warehouse.

Metadata is the final element of the data warehouse. It is of a different dimension in that it is not the same as data drawn from the operational environment; rather, it is used as:

o A directory to help the DSS investigator locate the items of the data warehouse.
o A guide to the mapping of records as the data is transformed from the operational environment to the data warehouse environment.
o A guide to the algorithms used for summarization between the current, detailed data and the lightly summarized information, and between the lightly summarized and the highly summarized data.

Conceptual Data Model


A conceptual data model recognizes the highest-level relationships between the different entities.

Characteristics of the conceptual data model

o It contains the essential entities and the relationships among them.


o No attribute is specified.
o No primary key is specified.

We can see that the only data shown via the conceptual data model is the entities that define the data and the relationships between those entities. No other information is shown through the conceptual data model.


Logical Data Model


A logical data model describes the data in as much detail as possible, without regard to how it will be physically implemented in the database. The primary objective of logical data modeling is to document the business data structures, processes, rules, and relationships in a single view, the logical data model.

Features of a logical data model

o It involves all entities and relationships among them.


o All attributes for each entity are specified.
o The primary key for each entity is stated.
o Referential Integrity is specified (FK Relation).

The steps for designing the logical data model are as follows:

o Specify primary keys for all entities.


o List the relationships between different entities.
o List all attributes for each entity.
o Normalization.
o No data types are listed


Physical Data Model


The physical data model describes how the model will be implemented in the database. A physical database model shows all table structures, column names, data types, constraints, primary keys, foreign keys, and relationships between tables. The purpose of physical data modeling is the mapping of the logical data model onto the physical structures of the RDBMS that hosts the data warehouse.

Characteristics of a physical data model

o Specification all tables and columns.


o Foreign keys are used to recognize relationships between tables.

The steps for physical data model design which are as follows:

o Convert entities to tables.


o Convert relationships to foreign keys.
o Convert attributes to columns.


Types of Data Warehouse Models

Enterprise Warehouse
An enterprise warehouse collects all of the records about subjects spanning the entire organization. It supports corporate-wide data integration, usually from one or more operational systems or external data providers, and it is cross-functional in scope. It generally contains detailed information as well as summarized information and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to develop and build.


Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of users.
The scope is confined to particular selected subjects. For example, a marketing data mart may restrict
its subjects to the customer, items, and sales. The data contained in the data marts tend to be
summarized.

Data Marts is divided into two parts:

Independent Data Mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area.

Dependent Data Mart: Dependent data marts are sourced directly from enterprise data warehouses.

Virtual Warehouses
A virtual data warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

Lecture Number: 15

Topics to cover: Data Cube and OLAP

What is OLAP?
OLAP stands for Online Analytical Processing, which is a technology that enables multi-
dimensional analysis of business data. It provides interactive access to large amounts of data
and supports complex calculations and data aggregation. OLAP is used to support business
intelligence and decision-making processes.

A typical example is a 3D data cube with dimensions such as branch (A, B, C, D), item type (home entertainment, computer, phone, security), and year (1997, 1998, 1999).

Data cube classification:


The data cube can be classified into two categories:


Multidimensional data cube: It stores large amounts of data by making use of a multi-dimensional array. It increases its efficiency by keeping an index on each dimension; thus, it is able to retrieve data quickly.
Relational data cube: It stores large amounts of data by making use of relational tables. Each relational table represents a dimension of the data cube. It is slower compared to a multidimensional data cube.

Data cube operations:

Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for analysis purposes. There are mainly 5 operations, listed below (a small code sketch of these operations follows the list).
Roll-up: this operation aggregates data that share the same dimension values by climbing up a dimension hierarchy. For example, if the data cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.

Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular information and then subdivide it further for finer-granularity analysis; it zooms into more detail. For example, if India is a value in the country column and we wish to see villages in India, then the drill-down operation splits India into states, districts, towns, cities, and villages and then displays the required information.

Slicing: this operation filters out the unnecessary portions by selecting a single value on one dimension. Suppose that, in a particular dimension, the user does not need everything for analysis but only a particular value. For example, country = "Jamaica" will display only the data about Jamaica and exclude the other countries present in the country list.


Dicing: this operation performs a multidimensional cut: it does not cut only one dimension but can also go to other dimensions and cut a certain range of them. As a result, it looks like a sub-cube carved out of the whole cube. For example, the user wants to see the annual salary of Jharkhand state employees.

Pivot: this operation is important from a viewing point of view. It transforms the view of the data cube without changing the data present in it. For example, if the user is comparing year versus branch, then using the pivot operation the user can change the viewpoint and compare branch versus item type instead.
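As referenced above, the following pandas sketch is one way to mimic the five data cube operations on a small hypothetical sales table; it is only an illustration, not a full OLAP engine.

import pandas as pd

# Hypothetical fact table: sales by branch, item type, and year.
sales = pd.DataFrame({
    "branch": ["A", "A", "B", "B", "C", "C"],
    "item":   ["phone", "computer", "phone", "security", "computer", "phone"],
    "year":   [1997, 1998, 1997, 1999, 1998, 1999],
    "amount": [120, 300, 150, 80, 260, 90],
})

# Roll-up: aggregate away the item dimension (sales by branch and year only).
rollup = sales.groupby(["branch", "year"], as_index=False)["amount"].sum()

# Drill-down: return to a finer granularity (branch, year, and item).
drilldown = sales.groupby(["branch", "year", "item"], as_index=False)["amount"].sum()

# Slice: fix one dimension to a single value (year = 1997).
slice_1997 = sales[sales["year"] == 1997]

# Dice: select a sub-cube over several dimensions at once.
dice = sales[sales["branch"].isin(["A", "B"]) & (sales["year"] >= 1998)]

# Pivot: change the viewpoint, e.g. branch versus item instead of branch versus year.
pivot = sales.pivot_table(index="branch", columns="item", values="amount",
                          aggfunc="sum", fill_value=0)
print(rollup, drilldown, slice_1997, dice, pivot, sep="\n\n")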

Advantages of data cubes:

Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business


data, allowing users to view data from different perspectives and levels of detail.
Interactivity: Data cubes provide interactive access to large amounts of data, allowing
users to easily navigate and manipulate the data to support their analysis.
Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and
efficient querying and aggregation of data.
Data aggregation: Data cubes support complex calculations and data aggregation,
enabling users to quickly and easily summarize large amounts of data.
Improved decision-making: Data cubes provide a clear and comprehensive view of
business data, enabling improved decision-making and business intelligence.
Accessibility: Data cubes can be accessed from a variety of devices and platforms,
making it easy for users to access and analyze business data from anywhere.
Data cubes help in giving a summarized view of data.
Data cubes store large amounts of data in a simple way.
Data cube operations provide quick and better analysis.
They improve the performance of data retrieval.

Disadvantages of data cube:

Complexity: OLAP systems can be complex to set up and maintain, requiring


specialized technical expertise.
Data size limitations: OLAP systems can struggle with very large data sets and may
require extensive data aggregation or summarization.
Performance issues: OLAP systems can be slow when dealing with large amounts of
data, especially when running complex queries or calculations.
Data integrity: Inconsistent data definitions and data quality issues can affect the
accuracy of OLAP analysis.
Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to
the need for specialized hardware and software.
Inflexibility: OLAP systems may not easily accommodate changing business needs and
may require significant effort to modify or extend.


Lecture Number: 16

Topics to cover: Data Warehouse

A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived
from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts.
Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”

A typical data warehouse often includes the following elements:

A relational database to store and manage data


An extraction, loading, and transformation (ELT) solution for preparing the data for analysis
Statistical analysis, reporting, and data mining capabilities
Client analysis tools for visualizing and presenting data to business users
Other, more sophisticated analytical applications that generate actionable information by applying data
science and artificial intelligence (AI) algorithms, or graph and spatial features that enable more kinds of
analysis of data at scale

Benefits of a Data Warehouse


Subject-oriented. They can analyze data about a particular subject or functional area (such as sales).
Integrated. Data warehouses create consistency among different data types from disparate sources.
Nonvolatile. Once data is in a data warehouse, it’s stable and doesn’t change.
Time-variant. Data warehouse analysis looks at change over time.

Data Warehouse Architecture


The architecture of a data warehouse is determined by the organization’s specific needs. Common
architectures include

Simple. All data warehouses share a basic design in which metadata, summary data, and raw data are
stored within the central repository of the warehouse. The repository is fed by data sources on one end and
accessed by end users for analysis, reporting, and mining on the other end.
Simple with a staging area. Operational data must be cleaned and processed before being put in the
warehouse. Although this can be done programmatically, many data warehouses add a staging area for data
before it enters the warehouse, to simplify data preparation.


Hub and spoke. Adding data marts between the central repository and end users allows an organization to
customize its data warehouse to serve various lines of business. When the data is ready for use, it is moved
to the appropriate data mart.
Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly and informally
explore new datasets or ways of analyzing data without having to conform to or comply with the formal
rules and protocol of the data warehouse.

Lecture Number: 17

Topics to cover: Design and Usage

A data warehouse is a single data repository where records from multiple data sources are integrated for online analytical processing (OLAP). This implies that a data warehouse needs to meet the requirements of all the business stages within the entire organization.

There are two approaches

1. "top-down" approach
2. "bottom-up" approach

Top-down Design Approach


In the "Top-Down" design approach, a data warehouse is described as a subject-oriented, time-variant,
non-volatile and integrated data repository for the entire enterprise data from different sources are
validated, reformatted and saved in a normalized (up to 3NF) database as the data warehouse.

Advantages of top-down design

Data Marts are loaded from the data warehouses.

Developing new data mart from the data warehouse is very easy.

Disadvantages of top-down design

This technique is inflexible to changing departmental needs.

The cost of implementing the project is high.


Bottom-Up Design Approach


In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data specifical
architecture for query and analysis," term the star schema. In this approach, a data mart is created first
to necessary reporting and analytical capabilities for particular business processes (or subjects). Thus
it is needed to be a business-driven approach in contrast to Inmon's data-driven approach.


Advantages of bottom-up design

Documents can be generated quickly.

The data warehouse can be extended to accommodate new business units.

It is just developing new data marts and then integrating with other data marts.

Disadvantages of bottom-up design

The locations of the data warehouse and the data marts are reversed in the bottom-up approach design.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

Top-Down Design Approach:
o Breaks the vast problem into smaller subproblems.
o Inherently architected - not a union of several data marts.
o Single, central storage of information about the content.
o Centralized rules and control.
o It includes redundant information.
o It may see quick results if implemented with repetitions.

Bottom-Up Design Approach:
o Solves the essential low-level problems and integrates them into a higher one.
o Inherently incremental; can schedule essential data marts first.
o Departmental information is stored.
o Departmental rules and control.
o Redundancy can be removed.
o Less risk of failure, favorable return on investment, and proof of techniques.

Lecture Number: 18

Topics to cover: Data Warehouse Implementation

Data Warehouse Implementation


There are various implementation in data warehouses which are as follows


1. Requirements analysis and capacity planning: The first process in data warehousing involves defining enterprise needs, defining architectures, carrying out capacity planning, and selecting the hardware and software tools. This step will involve consulting senior management as well as the different stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to be put together by integrating the servers, the storage systems, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views. This may involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed. This involves designing the physical data warehouse organization, data placement, data partitioning, deciding on access techniques, and indexing.

5. Sources: The information for the data warehouse is likely to come from several data sources. This step involves identifying and connecting the sources using gateways, ODBC drivers, or other wrappers.

6. ETL: The data from the source systems will need to go through an ETL phase. The process of designing and implementing the ETL phase may involve identifying suitable ETL tool vendors and purchasing and implementing the tools. It may also involve customizing the tools to suit the needs of the enterprise (a minimal ETL sketch is given after step 9 below).


7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will be needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may be used to populate the warehouse given the schema and view definitions.

8. User applications: For the data warehouses to be helpful, there must be end-user applications. This
step contains designing and implementing applications required by the end-users.

9. Roll out the warehouse and applications: Once the data warehouse has been populated and the end-client applications tested, the warehouse system and the operations may be rolled out for the user community to use.
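As promised in step 6 above, here is a minimal ETL sketch using pandas and SQLite; the source data, cleaning rules, and table names are hypothetical stand-ins for what a real ETL tool would do.

import io
import sqlite3
import pandas as pd

# Extract: in practice the data comes from operational systems; here a small
# in-memory CSV stands in for the source extract.
source = io.StringIO(
    "branch,item,amount,sale_date\n"
    "a ,phone,120,2023-03-15\n"
    "B,computer,,2023-03-16\n"
    "c,phone,90,2023-03-17\n"
)
raw = pd.read_csv(source)

# Transform: clean and conform the data before it enters the warehouse.
raw = raw.dropna(subset=["amount"])                       # drop incomplete rows
raw["branch"] = raw["branch"].str.strip().str.upper()     # standardise branch codes
raw["sale_date"] = pd.to_datetime(raw["sale_date"]).dt.strftime("%Y-%m-%d")

# Load: append the conformed rows into a warehouse fact table.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales_fact", conn, if_exists="append", index=False)
conn.close()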

Implementation Guidelines

1. Build incrementally: Data warehouses must be built incrementally. Generally, it is recommended that a data mart be created with one particular project in mind; once it is implemented, several other sections of the enterprise may also want to implement similar systems. An enterprise data warehouse can then be implemented in an iterative manner, allowing all data marts to extract information from the data warehouse.

2. Need a champion: A data warehouse project must have a champion who is willing to carry out considerable research into the expected costs and benefits of the project. Data warehousing projects require inputs from many units in an enterprise and therefore need to be driven by someone who is capable of interacting with people across the enterprise and can actively persuade colleagues.

3. Senior management support: A data warehouse project must be fully supported by senior management. Given the resource-intensive nature of such projects and the time they can take to implement, a warehouse project calls for a sustained commitment from senior management.

4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the organization should be loaded into the data warehouse.

5. Corporate strategy: A data warehouse project must fit with corporate strategy and business goals. The purpose of the project must be defined before the project begins.

6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and a project plan for a data warehouse project must be clearly outlined and understood by all stakeholders. Without such understanding, rumors about expenditures and benefits can become the only source of information, undermining the project.

7. Training: Data warehouse projects must not overlook training requirements. For a data warehouse project to be successful, the users must be trained to use the warehouse and to understand its capabilities.

Lecture Number: 19

Topics to cover: Data Generalization by Attribute-Oriented Induction

Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization-based characterization was first proposed in 1989 (KDD '89 workshop), a few years before the introduction of the data cube approach.

The data cube approach can be considered as a data warehouse-based, precomputation-oriented, materialized approach.

It performs off-line aggregation before an OLAP or data mining query is submitted for processing.

On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, online data analysis technique.

However, there is no inherent barrier distinguishing the two approaches based on online aggregation
versus offline precomputation.

Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.


It was proposed in 1989 (KDD ‘89 workshop).

It is not confined to categorical data nor particular measures.

How is it done?

Collect the task-relevant data( initial relation) using a relational database query
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
Reduces the size of the generalized data set.
Interactive presentation with users.

Basic Principles Of Attribute Oriented Induction

Data focusing:

Analyzing task-relevant data, including dimensions, and the result is the initial relation.

Attribute-removal:

To remove attribute A if there is a large set of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher-level concepts are expressed in terms of other attributes.

Attribute-generalization:

If there is a large set of distinct values for A, and there exists a set of generalization operators on A,
then select an operator and generalize A.
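A rough sketch of attribute-oriented induction on a toy relation follows; the initial relation and the concept hierarchies (city to country, age to age range) are hypothetical.

from collections import Counter

# Hypothetical task-relevant initial relation: (name, city, age).
initial_relation = [("Anna", "Vancouver", 22), ("Ben", "Toronto", 24),
                    ("Chen", "Mumbai", 23), ("Dev", "Pune", 31), ("Eli", "Ottawa", 33)]

# Hypothetical generalization operators (concept hierarchies).
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Ottawa": "Canada",
                   "Mumbai": "India", "Pune": "India"}
def age_to_range(age):
    return "20-29" if age < 30 else "30-39"

# Attribute removal: 'name' has many distinct values and no generalization
# operator, so it is dropped.  Attribute generalization: city -> country and
# age -> age range.  Identical generalized tuples are merged and counted.
generalized = Counter((city_to_country[city], age_to_range(age))
                      for _, city, age in initial_relation)
for (country, age_range), count in generalized.items():
    print(country, age_range, "count =", count)
# e.g. Canada 20-29 count = 2, India 20-29 count = 1, ...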

Lecture Number: 20

Topics to cover: Data Cube Computation

What are the techniques for Data Cube Computations?


The following are general optimization techniques for the efficient computation of data cubes:
Sorting, hashing, and grouping: Sorting, hashing, and grouping operations should be applied to the dimension attributes in order to reorder and cluster related tuples. In cube computation, aggregation is performed on the tuples that share the same set of dimension values. Therefore, it is important to explore sorting, hashing, and grouping operations to access and group such data together so as to facilitate the evaluation of such aggregates.


For example, to compute total sales by branch, day, and item, it can be more efficient to sort tuples or cells by branch, then by day, and then group them according to the item name. Efficient implementations of such operations in large data sets have been extensively studied in the database research community, and such implementations can be extended to data cube computation.
This technique can also be further extended to perform shared-sorts (i.e., sharing sorting costs across multiple cuboids when sort-based methods are used) or shared-partitions (i.e., sharing the partitioning cost across multiple cuboids when hash-based algorithms are used).
Simultaneous aggregation and caching of intermediate results: In cube computation, it is efficient to compute higher-level aggregates from previously computed lower-level aggregates, rather than from the base fact table. Moreover, simultaneous aggregation from cached intermediate computation results may lead to a reduction in expensive disk input/output (I/O) operations.
To compute sales by branch, for instance, we can use the intermediate results derived from the computation of a lower-level cuboid such as sales by branch and day. This technique can be further extended to perform amortized scans (i.e., computing as many cuboids as possible at the same time to amortize disk reads).
Aggregation from the smallest child when there exist multiple child cuboids: When there exist several child cuboids, it is generally more efficient to compute the desired parent (i.e., more generalized) cuboid from the smallest, previously computed child cuboid.
The Apriori pruning method can be explored to compute iceberg cubes efficiently: The Apriori property, in the context of data cubes, is stated as follows: if a given cell does not satisfy minimum support, then no descendant of the cell (i.e., no more specialized cell) will satisfy minimum support either. This property can be used to substantially reduce the computation of iceberg cubes.
The specification of iceberg cubes includes an iceberg condition, which is a constraint on the cells to be materialized. A common iceberg condition is that the cells must satisfy a minimum support threshold, such as a minimum count or sum. In this situation, the Apriori property can be used to prune away the exploration of the cell's descendants.
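A small pandas sketch of the iceberg idea follows: only the cells of a cuboid whose count meets a minimum support threshold are kept, and descendants of failing cells need not be examined. The fact table and threshold are hypothetical.

import pandas as pd

# Hypothetical base fact table.
fact = pd.DataFrame({
    "branch": ["A", "A", "A", "B", "B", "C"],
    "item":   ["phone", "phone", "tv", "phone", "tv", "tv"],
    "month":  ["Jan", "Jan", "Feb", "Jan", "Feb", "Feb"],
})
min_support = 2   # iceberg condition: keep only cells with count >= 2

# Compute the (branch, item) cuboid and apply the iceberg condition.
cuboid = fact.groupby(["branch", "item"]).size().reset_index(name="count")
iceberg_cells = cuboid[cuboid["count"] >= min_support]
print(iceberg_cells)   # only the cell (A, phone) survives, with count 2

# Apriori-style pruning: because the cell (B, tv) already fails minimum support,
# none of its descendants, such as (B, tv, Feb), can satisfy it either, so they
# need not be computed at all.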


DATA SCIENCES: DATA WAREHOUSING AND DATA MINING

UNIT 3
PATTERNS, ASSOCIATIONS AND CORRELATIONS

Unit Content: • Mining Frequent Patterns, Associations and Correlations
• Frequent Itemset Mining Methods
• Pattern Evaluation Methods and Applications

Course Outcome (CO) Covered for this unit: CO 214.3 Describe the concepts of Patterns, Associations and Correlations with relevant examples.

Number of Lectures planned in Teaching Plan: EIGHT

Reference: ➢ 1. Jiawei Han, Micheline Kamber, Jian Pei, Data Mining: Concepts and Techniques, Elsevier
➢ 2. Margaret H Dunham, Data Mining Introductory and Advanced Topics, Pearson Education


Lecture Number: 21

Topics to cover: Mining Frequent Patterns

Frequent pattern mining in data mining is the process of identifying patterns or associations
within a dataset that occur frequently. This is typically done by analyzing large datasets to
find items or sets of items that appear together frequently.
There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent
pattern mining. It uses a “bottom-up” approach to identify frequent itemsets and then
generates association rules from those itemsets.
2. ECLAT algorithm: This algorithm uses a “depth-first search” approach to identify
frequent itemsets. It is particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a “compression” technique to find frequent
patterns efficiently. It is particularly efficient for datasets with a large number of
transactions.
Frequent pattern mining has many applications, such as Market Basket Analysis, Recommender Systems, Fraud Detection, and many more.

Advantages:

1. It can find useful information which is not visible in simple data browsing
2. It can find interesting association and correlation among data items

Disadvantages:

1. It can generate a large number of patterns


2. With high dimensionality, the number of patterns can be very large, making it difficult
to interpret the results.

Data mining is the process of converting raw data into suitable patterns based on trends.
Data mining has different types of patterns and frequent pattern mining is one of them.
This concept was introduced for mining transaction databases. Frequent patterns are


patterns(such as items, subsequences, or substructures) that appear frequently in the database.


It is an analytical process that finds frequent patterns, associations, or causal structures from
databases in various databases. This process aims to find the frequently occurring item in a
transaction. By frequent patterns, we can identify strongly correlated items together and we
can identify similar characteristics and associations among them. By doing frequent data
mining we can go further for clustering and association.
Frequent pattern mining is a major concern: it plays a major role in finding associations and correlations and discloses an intrinsic and important property of a dataset.


Lecture Number: 22


Topics to cover: Associations and Correlations: Basic Concepts

A frequent pattern is a pattern which appears frequently in a data set. By identifying frequent patterns we can observe strongly correlated items together and easily identify similar characteristics and associations among them. Frequent pattern mining also leads to further analysis such as clustering, classification, and other data mining tasks.

Before mining frequent patterns, we should focus on two terms, "support" and "confidence," because they measure whether an association rule qualifies for a particular data set.

Support: how often a given rule appears in the database being mined

Confidence: the number of times a given rule turns out to be true in practice

Let's practice this with a sample data set of five transactions over the items A, B, C, D, and E:


Example: one possible association rule is A => D.

Total no. of transactions (N) = 5
Frequency(A, D) => total no. of transactions containing A together with D = 3
Frequency(A) => total no. of transactions containing A = 4
Support(A => D) = 3 / 5
Confidence(A => D) = 3 / 4

1. Generate Candidate set 1, do the first scan and generate One item
set

In this stage, we take the sample data set, count each individual item, and make the frequent itemset of size 1 (K = 1).

Candidate set 1

The "Candidate set 1" figure shows the support count of each individual item. Since the minimum support is 2, item E will be removed from Candidate set 1 as an infrequent item (disqualified).


Frequent itemset from the first scan

The frequent itemset based on the minimum support value is shown in the figure "Frequent itemset from the first scan" as the "One item set".

2. Generate Candidate set 2, do the second scan and generate Second item
set

Through this step, you create candidate set 2 (K = 2) and take each of its support counts.

Candidate set 2

The "Candidate set 2" figure is generated by joining Candidate set 1 with itself and taking the frequency of the related occurrences. Since the minimum support is 2, itemset {B, D} will be removed from Candidate set 2 as an infrequent itemset.


Frequent item set from the second scan

“Frequent item set from the second scan” is the frequent itemset based on the
minimum support value and it will generate the “Second item set”.

3. Generate Candidate set 3, do the third scan and generate Third item set

In this iteration, create candidate set 3 (K = 3) and take the support counts. Then compare them with the minimum support value for Candidate set 3.

Candidate set 3

As you can see, here we compare Candidate set 3 with the minimum support value and generate the Third item set. The frequent itemset from the third scan is the same as above.

4. Generate Candidate set 4, do the fourth scan and generate Fourth item set

By considering the frequent set, we can generate Candidate set 4 by joining Candidate set 3. The only possible Candidate set 4 is {A, B, C, D}, which has support less than the minimum support of 2. Therefore we have to stop the calculation here, because we cannot go through any more iterations. So, for the above data set, the frequent patterns are {A, B, C} and {A, C, D}.

Considering one of the frequent sets, {A, B, C}, the possible association rules are as follows:

1. A => B, C

2. A, B => C

3. A, C => B

4. B => A, C

5. B, C => A

6. C => A, B



As you can see, at each step the Apriori algorithm generates candidates, and this takes much time when the list of candidates becomes larger.

The pseudo-code idea of the Apriori algorithm is sketched below.
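The original pseudo-code figure is not reproduced here; as a stand-in, the following Python sketch captures the same level-wise idea: generate size-k candidates from the frequent (k-1)-itemsets, prune candidates that have an infrequent subset, count support, and repeat. The transaction list in the usage example is hypothetical.

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise frequent itemset mining (a sketch of the Apriori idea).
    transactions = [frozenset(t) for t in transactions]
    items = {frozenset([i]) for t in transactions for i in t}

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}                                   # itemset -> support count
    current = {s for s in items if support_count(s) >= min_support}
    k = 1
    while current:
        frequent.update({s: support_count(s) for s in current})
        k += 1
        # Join step: combine frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = {c for c in candidates if support_count(c) >= min_support}
    return frequent

# Usage with a hypothetical transaction list and a minimum support count of 2:
txns = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]
for itemset, count in sorted(apriori(txns, 2).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), count)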


Lecture Number: 23

Topics to cover: Efficient and Scalable Frequent Itemset Mining Methods

1. Frequent item sets are a fundamental concept in association rule mining, which is a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify sets of items in a dataset that frequently occur together and the relationships between them.
2. A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a dataset
contains 100 transactions and the item set {milk, bread} appears in 20 of those
transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find
frequent item sets and generate association rules. These algorithms work by iteratively
generating candidate item sets and pruning those that do not meet the minimum support
threshold. Once the frequent item sets are found, association rules can be generated by
using the concept of confidence, which is the ratio of the number of transactions that
contain the item set and the number of transactions that contain the antecedent (left-
hand side) of the rule.


4. Frequent item sets and association rules can be used for a variety of tasks such as
market basket analysis, cross-selling and recommendation systems. However, it should
be noted that association rule mining can generate a large number of rules, many of
which may be irrelevant or uninteresting. Therefore, it is important to use appropriate
measures such as lift and conviction to evaluate the interestingness of the generated
rules.
Association Mining searches for frequent items in the data set. In frequent mining usually,
interesting associations and correlations between item sets in transactional and relational
databases are found. In short, Frequent Mining shows which items appear together in a
transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from a
Transactional Dataset. If there are 2 items X and Y purchased frequently then it’s good to
put them together in stores or provide some discount offer on one item on purchase of
another item. This can really increase sales. For example, it is likely to find that if a
customer buys Milk and bread he/she also buys Butter. So the association rule
is [‘milk]^[‘bread’]=>[‘butter’]. So the seller can suggest the customer buy butter if
he/she buys Milk and Bread.

Important Definitions:

Support: It is one of the measures of interestingness. It tells about the usefulness and certainty of rules. A 5% support means that 5% of the total transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / N, where N is the total number of transactions.
Confidence: A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
Support_count(X): The number of transactions in which X appears. If X is A ∪ B, then it is the number of transactions in which both A and B are present.
Maximal Itemset: An itemset is maximal frequent if none of its supersets is frequent.
Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset.
K-Itemset: An itemset which contains K items is a K-itemset. An itemset is frequent if its support count meets the minimum support count.


Example on finding frequent itemsets: consider a transactional data set and apply the definitions above with a chosen minimum support count, as in the sketch below.
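Since the original transaction table is not reproduced here, the sketch below uses a small hypothetical market-basket data set to show how support, confidence, and lift of a rule are computed from transaction counts.

# Hypothetical transactions; the rule evaluated is milk -> butter.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
antecedent, consequent = {"milk"}, {"butter"}

n = len(transactions)
count_a  = sum(1 for t in transactions if antecedent <= t)
count_b  = sum(1 for t in transactions if consequent <= t)
count_ab = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = count_ab / n              # Support(A -> B) = support_count(A U B) / N
confidence = count_ab / count_a     # Confidence(A -> B) = support_count(A U B) / support_count(A)
lift = confidence / (count_b / n)   # lift > 1 suggests positive correlation

print(f"support = {support:.2f}, confidence = {confidence:.2f}, lift = {lift:.2f}")
# support = 0.60, confidence = 0.75, lift = 0.94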

Lecture Number: 24

Topics to cover: Pattern Evaluation Methods, Applications of frequent patterns and associations

In data mining, pattern evaluation is the process of assessing the quality of discovered
patterns. This process is important in order to determine whether the patterns are useful and
whether they can be trusted. There are a number of different measures that can be used to
evaluate patterns, and the choice of measure will depend on the application.
There are several ways to evaluate pattern mining algorithms:
1. Accuracy
The accuracy of a data mining model is a measure of how correctly the model predicts the
target values. The accuracy is measured on a test dataset, which is separate from the training
dataset that was used to train the model

2. Classification Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to classify
new data. This is typically done by taking a set of data that has been labeled with known class
labels and then using the discovered patterns to predict the class labels of the data. The
accuracy can then be computed by comparing the predicted labels to the actual labels.
Overall, classification accuracy is a good metric to use when evaluating classification
models. However, it is important to be aware of its limitations and to use other evaluation
metrics in situations where classification accuracy could be misleading.


3. Clustering Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to cluster
new data. This is typically done by taking a set of data that has been labeled with known
cluster labels and then using the discovered patterns to predict the cluster labels of the data.
The accuracy can then be computed by comparing the predicted labels to the actual labels.
There are a few ways to evaluate the accuracy of a clustering algorithm:
External indices: these indices compare the clusters produced by the algorithm to some
known ground truth. For example, the Rand Index or the Jaccard coefficient can be used
if the ground truth is known.
Internal indices: these indices assess the goodness of clustering without reference to any
external information. The most popular internal index is the Dunn index.
Stability: this measures how robust the clustering is to small changes in the data. A
clustering algorithm is said to be stable if, when applied to different samples of the same
data, it produces the same results.
Efficiency: this measures how quickly the algorithm converges to the correct clustering.
4. Coverage
This measures how many of the possible patterns in the data are discovered by the algorithm.
This can be computed by dividing the number of patterns discovered by the algorithm by the total number of possible patterns in the data.
5. Visual Inspection
This is perhaps the most common method, where the data miner simply looks at the patterns
to see if they make sense. In visual inspection, the data is plotted in a graphical format and
the pattern is observed.

6. Running Time
This measures how long it takes for the algorithm to find the patterns in the data. This is
typically measured in seconds or minutes. There are a few different ways to measure the
performance of a machine learning algorithm, but one of the most common is to simply
measure the amount of time it takes to train the model and make predictions. This is known
as the running time pattern evaluation.

7. Support
The support of a pattern is the percentage of the total number of records that contain the
pattern. Support Pattern evaluation is a process of finding interesting and potentially useful
patterns in data. The purpose of support pattern evaluation is to identify interesting patterns
that may be useful for decision-making. Support pattern evaluation is typically used in data
mining and machine learning applications.

8. Confidence
The confidence of a pattern is the percentage of times that the pattern is found to be correct.
Confidence Pattern evaluation is a method of data mining that is used to assess the quality of
patterns found in data.


9. Lift
The lift of a rule is the ratio of its observed confidence to the confidence expected if the antecedent and consequent were independent: Lift(A -> B) = Confidence(A -> B) / Support(B) = Support(A ∪ B) / (Support(A) × Support(B)). A lift greater than 1 suggests a positive correlation. Lift pattern evaluation is a data mining technique that can be used to evaluate the performance of a predictive model; the lift chart is a graphical representation of the model's performance and can be used to identify potential problems with the model.

10. Prediction
The prediction of a pattern is the percentage of times that the pattern is found to be correct.
Prediction Pattern evaluation is a data mining technique used to assess the accuracy of
predictive models. It is used to determine how well a model can predict future outcomes
based on past data. Prediction Pattern evaluation can be used to compare different models,
or to evaluate the performance of a single model.

11. Precision
Precision Pattern Evaluation is a method for analyzing data that has been collected from a
variety of sources. This method can be used to identify patterns and trends in the data, and to
evaluate the accuracy of data. Precision Pattern Evaluation can be used to identify errors in
the data, and to determine the cause of the errors

12. Cross-Validation
This method involves partitioning the data into two sets, training the model on one set, and
then testing it on the other. This can be done multiple times, with different partitions, to get
a more reliable estimate of the model's performance. Cross-validation is a model validation technique for assessing how the results of a data mining analysis will generalize to an independent data set (a minimal sketch appears after this list).

13. Test Set


This method involves partitioning the data into two sets, training the model on the entire
data set, and then testing it on the held-out test set. This is more reliable than cross-validation
but can be more expensive if the data set is large. There are a number of ways to evaluate the
performance of a model on a test set.

14. Bootstrapping
This method involves randomly sampling the data with replacement, training the model on
the sampled data, and then testing it on the original data. This can be used to get a distribution
of the model’s performance, which can be useful for understanding how robust the model is.
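As mentioned in the cross-validation item above, here is a minimal scikit-learn sketch of cross-validated and test-set accuracy; the synthetic data set and the decision tree classifier are arbitrary choices for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic labelled data standing in for a real data set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Cross-validation: accuracy estimated over 5 different train/test partitions.
cv_scores = cross_val_score(model, X, y, cv=5)
print("cross-validated accuracy:", round(cv_scores.mean(), 3))

# Held-out test set: train on one split, evaluate on the other.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
print("test-set accuracy:", round(test_accuracy, 3))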

Lecture Number: 25

Topics to cover: Frequent Patterns and Association Mining: A Road Map

Association Rule Mining:


It is easy to derive association rules from frequent patterns:
for each frequent pattern x, for each non-empty proper subset y ⊂ x,
calculate the support and confidence of the rule y -> (x - y);
if it is greater than the threshold, keep the rule.
There are two classic algorithms that explore this itemset lattice:
1. Apriori algorithm
2. Eclat algorithm
Apriori:
o It performs "perfect" pruning of infrequent item sets.
o It requires a lot of memory (all frequent item sets are represented), and support counting takes very long for large transactions, so it is not always efficient in practice.

Eclat:
o It reduces memory requirements and is faster.
o It stores the data as transaction (tid) lists.

The measures support and confidence characterize an association rule.

Support: how often a given rule appears in the database being mined, i.e., the fraction of transactions that contain x ∪ y.
Confidence: the number of times the given rule turns out to be true in practice, i.e., the conditional probability that a transaction containing x also contains y.

Working principle (a simple point-of-sale application for any supermarket that has a good range of products):
The product data will be entered into the database.
The taxes and commissions are entered.
The product will be purchased and sent to the billing counter.
The billing operator will scan the product with the bar code machine; it will match the product in the database and then show the information about the product.
The bill will be paid by the customer and he will receive the products.


Tasks in the frequent pattern mining:


Association
Cluster analysis: frequent-pattern-based clustering is well suited for high-dimensional data; by extending it across dimensions, sub-space clustering is obtained.
Data warehouse: iceberg cubes and cube gradients
Broad applications
There are some concepts that improve the efficiency of these tasks, described below.

Closed Pattern:

A closed pattern is a frequent pattern: it meets the minimum support criteria, and all super-patterns of a closed pattern are less frequent than the closed pattern.

Max Pattern:

A max pattern also meets the minimum support criteria (like a closed pattern), but none of the super-patterns of a max pattern is a frequent pattern. Both kinds of patterns generate fewer patterns and therefore increase the efficiency of the task.

Applications of Frequent Pattern Mining:

basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log
analysis, and DNA sequence analysis.
Issues of frequent pattern mining


flexibility and reusability for creating frequent patterns


most of the algorithms used for mining frequent item sets do not offer flexibility for
reusing
much research is needed to reduce the size of the derived patterns
Frequent pattern mining has several applications in different areas, including:
Market Basket Analysis: This is the process of analyzing customer purchasing patterns
in order to identify items that are frequently bought together. This information can be
used to optimize product placement, create targeted marketing campaigns, and make
other business decisions.
Recommender Systems: Frequent pattern mining can be used to identify patterns in user
behavior and preferences in order to make personalized recommendations.
Fraud Detection: Frequent pattern mining can be used to identify abnormal patterns of
behavior that may indicate fraudulent activity.
Network Intrusion Detection: Network administrators can use frequent pattern mining to
detect patterns of network activity that may indicate a security threat.
Medical Analysis: Frequent pattern mining can be used to identify patterns in medical
data that may indicate a particular disease or condition.
Text Mining: Frequent pattern mining can be used to identify patterns in text data, such
as keywords or phrases that appear frequently together in a document.
Web usage mining: Frequent pattern mining can be used to analyze patterns of user
behavior on a website, such as which pages are visited most frequently or which links
are clicked on most often.
Gene Expression: Frequent pattern mining can be used to analyze patterns of gene
expression in order to identify potential biomarkers for different diseases.
These are a few examples of the application of frequent pattern mining. The list is not
exhaustive and the technique can be applied in many other areas, as well.

Lecture Number: 26

Topics to cover: Mining Various Kinds of Association Rules

Types of Association Rules in Data Mining

Association rule learning is a machine learning technique used for discovering interesting
relationships between variables in large databases. It is designed to detect strong rules in the


database based on some interesting metrics. For any given multi-item transaction, association
rules aim to obtain rules that determine how or why certain items are linked. Association
rules are created to find information about general if-then patterns using specific criteria, with support and confidence used to define what the key relationships are. They help to show how frequently an itemset appears in the data, while confidence reflects the number of times an if-then statement is found to be true.

Types of Association Rules:

There are various types of association rules in data mining:-


Multi-relational association rules
Generalized association rules
Quantitative association rules
Interval information association rules
1. Multi-relational association rules: Multi-Relation Association Rules (MRAR) are a new class of association rules, different from the original, simple, and even multi-relational association rules (usually extracted from multi-relational databases), in which each rule element consists of one entity but several relationships. These relationships represent indirect relationships between the entities.
2. Generalized association rules: Generalized association rule extraction is a powerful tool
for getting a rough idea of interesting patterns hidden in data. However, since patterns are
extracted at each level of abstraction, the mined rule sets may be too large to be used effectively
for decision-making. Therefore, in order to discover valuable and interesting knowledge, post-
processing steps are often required. Generalized association rules should have categorical
(nominal or discrete) properties on both the left and right sides of the rule.
3. Quantitative association rules: Quantitative association rules are a special type of association rule. Unlike general association rules, where both the left and right sides of the rule should be categorical (nominal or discrete) attributes, at least one attribute (left or right) of a quantitative association rule must contain a numeric attribute.

Uses of Association Rules

Some of the uses of association rules in different fields are given below:
Medical Diagnosis: Association rules in medical diagnosis can be used to help doctors
cure patients. As all of us know that diagnosis is not an easy thing, and there are many
errors that can lead to unreliable end results. Using the multi-relational association rule, we
can determine the probability of disease occurrence associated with various factors and
symptoms.
Market Basket Analysis: It is one of the most popular examples and uses of association
rule mining. Big retailers typically use this technique to determine the association between
items.


Lecture Number: 27

Topics to cover: Constraint-Based Frequent Pattern Mining

Constraint-Based Frequent Pattern Mining


Data mining can extract thousands of rules that appear to be important from a dataset; nevertheless, the majority of these rules are likely to provide users with little value. Users often have a clear concept of the "form" of the patterns or rules they want to discover and the "direction" of mining that may lead to interesting pattern discoveries. These expectations can be expressed as constraints of the following kinds:
▪ Knowledge Type Constraint: These constraints, which may include association, correlation, classification, or clustering, describe the nature of the knowledge that is to be mined.
▪ Data Constraint: These constraints specify the data that is required to finish a job. The mining effort can be guided in the right direction by imposing constraints such as limitations on the dimensions or levels of the data, abstractions, or concept hierarchies.
▪ Interestingness Constraints: These constraints establish minimum and maximum values for statistical measures of rule interestingness, including support, confidence, and correlation.
▪ Rule Constraint: The form or requirements of the rules to be mined are outlined by the constraints placed on them.

In the form of metarules (rule templates), one can express constraints on the number of predicates permitted to
occur in the rule antecedent or consequent, as well as on relationships among attributes, attribute values, and/or
aggregates. It is necessary to have both a graphical user interface and a high-level declarative
data mining query language to be able to express such constraints.


The first four kinds of constraints have each received substantial attention earlier in this material. In this
section, we discuss how the use of rule constraints can help reduce the overall scope of the mining process.
This constraint-based mining approach optimizes the data mining process by allowing users to describe the rules
they want to uncover and then searching only for those rules. In addition, the user-specified constraints can be
exploited by an intelligent mining query optimizer, which in turn increases the efficiency of the mining operation.

When using constraints, it is possible to perform interactive exploratory mining and analysis. This section
covers metarule-guided mining, a technique in which syntactic rule constraints are described using rule
templates. It also details the use of pattern space pruning (removing portions of the pattern space that cannot
contain patterns satisfying the constraints) and data space pruning (removing portions of the data space whose
further exploration cannot contribute to finding such patterns).

We present anti-monotonicity, monotonicity, and succinctness as classes of properties that aid in pruning the
pattern space via constraint-based search space reduction. Convertible constraints are also discussed; these are
constraints that are not anti-monotonic or monotonic in general but can be pushed deep into the iterative mining
process, without losing their pruning power, given the right ordering of the data. We then investigate how data
space pruning can be included in the mining process by introducing two classes of properties: data succinctness
and data anti-monotonicity.

For the sake of discussion, we will assume that the user is searching for association rules. The proposed
methods extend easily to mining correlation rules by adding a correlation measure of interestingness to the
support-confidence framework. For a better understanding of constraint-based frequent pattern mining, it also
helps to learn about the six stages of data science processing.
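As a concrete illustration of constraint-based mining, the following Python sketch (items, prices, and the price limit are hypothetical) pushes an anti-monotone rule constraint, "total price of the itemset ≤ 30", into a level-wise itemset search: once an itemset violates the constraint, none of its supersets are generated.

# Hypothetical item prices; the constraint is anti-monotone because once an
# itemset exceeds the budget, every superset of it also exceeds the budget.
prices = {"tv": 25, "phone": 15, "cable": 5, "case": 3}
MAX_TOTAL = 30

def ok(itemset):
    return sum(prices[i] for i in itemset) <= MAX_TOTAL

level = [frozenset([i]) for i in prices if ok([i])]   # level-1 itemsets
survivors = list(level)
while level:
    items = sorted(set().union(*level))
    next_level = {a | {x} for a in level for x in items if x not in a}
    level = [c for c in next_level if ok(c)]          # constraint-based pruning
    survivors.extend(level)

print(sorted(tuple(sorted(s)) for s in survivors))

Only the surviving itemsets would then be checked against the minimum support threshold, which shrinks the search space compared with unconstrained mining.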

Lecture Number: 28

Topics to cover: Extended Applications of Frequent Patterns.


There are various applications of pattern mining, which are as follows −
• Pattern mining is generally used for noise filtering and data cleaning as preprocessing in several data-
intensive applications. It can be used, for example, to explore microarray data, which may include tens of
thousands of dimensions (e.g., describing genes).
• Pattern mining aids in the discovery of inherent structures and clusters hidden in the data. Given the DBLP
data set, for example, frequent pattern mining can easily discover interesting clusters such as coauthor clusters
(by determining authors who frequently collaborate) and conference clusters (by determining the sharing of many
authors and terms). Such structure or cluster discovery can be used as preprocessing for more sophisticated data
mining.
• Frequent patterns can be used effectively for subspace clustering in high-dimensional space. Clustering is
difficult in high-dimensional space, where the distance between two objects is hard to measure because it is
dominated by the different sets of dimensions the objects occupy.
• Pattern analysis is useful in the analysis of spatiotemporal data, time-series data, image data, video data,
and multimedia data. One application of spatiotemporal data analysis is the analysis of colocation patterns.
These can help decide, for example, whether a specific disease is geographically colocated with certain objects
such as a well, a hospital, or a river.
• In time-series data analysis, researchers discretize time-series values into several intervals so that small
fluctuations and value differences can be ignored. The data can then be summarized into sequential patterns,
which can be indexed to facilitate similarity search or comparative analysis.
• In image analysis and pattern recognition, researchers treat frequently appearing visual fragments as "visual
words," which can be used for effective clustering, classification, and comparative analysis.
• Pattern mining has also been used for the analysis of sequence and structural data, including trees, graphs,
subsequences, and networks. In software engineering, researchers have identified consecutive or gapped
subsequences in program execution as sequential patterns that help identify software bugs. Copy-and-paste bugs
in large software programs can be identified by extended sequential pattern analysis of source code. Plagiarized
software programs can be identified based on their essentially identical program flow/loop structures.
• Frequent and discriminative patterns can be used as primitive indexing structures (called graph indices) to
help search large, complex, structured data sets and networks. These support similarity search in graph-structured
data such as chemical compound databases or XML-structured databases. Such patterns can also be used for data
compression and summarization.


Lecture Number: 29

Topics to cover: Basic Concepts, Decision Tree Induction

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class
label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a
company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf
node represents a class.

The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
Decision Tree Induction Algorithm
A machine researcher named J. Ross Quinlan in 1980 developed a decision tree algorithm known as ID3
(Iterative Dichotomiser). Later, he presented C4.5, which was the successor of ID3. ID3 and C4.5 adopt a
greedy approach. In this algorithm, there is no backtracking; the trees are constructed in a top-down
recursive divide-and-conquer manner.
Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
    Data partition D, which is a set of training tuples and their associated class labels;
    attribute_list, the set of candidate attributes;
    Attribute_selection_method, a procedure to determine the splitting criterion that best
    partitions the data tuples into individual classes. This criterion consists of a
    splitting_attribute and, possibly, either a split point or a splitting subset.

Output:
    A decision tree.

Method:
    create a node N;
    if the tuples in D are all of the same class C then
        return N as a leaf node labeled with the class C;
    if attribute_list is empty then
        return N as a leaf node labeled with the majority class in D;    // majority voting
    apply Attribute_selection_method(D, attribute_list)
        to find the "best" splitting_criterion;
    label node N with splitting_criterion;
    if splitting_attribute is discrete-valued and multiway splits are allowed then
        // not restricted to binary trees
        attribute_list = attribute_list - splitting_attribute;    // remove the splitting attribute
    for each outcome j of splitting_criterion
        // partition the tuples and grow subtrees for each partition
        let Dj be the set of data tuples in D satisfying outcome j;    // a partition
        if Dj is empty then
            attach a leaf labeled with the majority class in D to node N;
        else
            attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
    end for
    return N;
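The Attribute_selection_method step is typically based on a measure such as information gain (used by ID3). A minimal Python sketch of the computation, using a small made-up set of tuples (age group, buys_computer), is given below:

import math
from collections import Counter

data = [("youth", "no"), ("youth", "no"), ("middle", "yes"),
        ("senior", "yes"), ("senior", "no"), ("middle", "yes")]

def entropy(labels):
    # Expected information (in bits) needed to classify a tuple.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(rows, attr_index=0, label_index=1):
    labels = [r[label_index] for r in rows]
    base = entropy(labels)
    expected = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[label_index] for r in rows if r[attr_index] == value]
        expected += len(subset) / len(rows) * entropy(subset)
    return base - expected          # reduction in entropy from splitting on the attribute

print(f"Gain(age) = {info_gain(data):.3f}")   # about 0.667 for these made-up tuples

The attribute with the highest gain over all candidates would be chosen as the splitting attribute at node N.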
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The
pruned trees are smaller and less complex.

Tree Pruning Approaches


There are two approaches to prune a tree −
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning - This approach removes a sub-tree from a fully grown tree.
Cost Complexity
The cost complexity is measured by the following two parameters, as illustrated by the sketch below −

• Number of leaves in the tree, and


• Error rate of the tree.
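As a rough sketch of the idea, the two parameters can be combined into a single pruning score; the weighting factor alpha and the numbers below are purely hypothetical:

def cost_complexity(error_rate, n_leaves, alpha=0.01):
    # Lower cost is better: it trades off the error rate against the tree size.
    return error_rate + alpha * n_leaves

full_tree   = cost_complexity(error_rate=0.10, n_leaves=20)   # 0.30
pruned_tree = cost_complexity(error_rate=0.12, n_leaves=8)    # 0.20
print(full_tree, pruned_tree)   # prefer the subtree with the lower cost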


Lecture Number: 30

Topics to cover: Bayesian Classification Methods

Data Mining Bayesian Classifiers


In numerous applications, the connection between the attribute set and the class variable is non-
deterministic. In other words, the class label of a test record cannot be predicted with certainty
even though its attribute set is identical to that of some training examples. These circumstances may
arise due to noisy data or the presence of certain confounding factors that influence classification but
are not included in the analysis. For example, consider the task of predicting whether an
individual is at risk of liver illness based on the individual's eating habits and working efficiency.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to
provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = P(Y/X) · P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability that describes the occurrence of event X is given that Y is true.

P(Y/X) is a conditional probability that describes the occurrence of event Y is given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is known as
the marginal probability.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects the
degree of belief in a hypothesis before and after accounting for evidence. For example, let us consider
the example of a coin. If we toss a coin, we get either heads or tails, and the probability of occurrence of
either heads or tails is 50%. If the coin is flipped a number of times and the outcomes are observed, the
degree of belief may rise, fall, or remain the same depending on the outcomes.

For proposition X and evidence Y,

o P(X), the prior, is the initial degree of belief in X.


o P(X/Y), the posterior is the degree of belief having accounted for Y.


o The quotient P(Y/X) / P(Y) represents the support that Y provides for X.

Bayes' theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X⋂Y) / P(Y)   and   P(Y/X) = P(X⋂Y) / P(X)

where P(X⋂Y) is the joint probability of both X and Y being true. Because the two expressions share the same
joint probability P(X⋂Y), equating them gives P(X/Y) · P(Y) = P(Y/X) · P(X), which rearranges to Bayes' theorem.

Bayesian network:

A Bayesian Network falls under the classification of Probabilistic Graphical Modelling (PGM) procedure
that is utilized to compute uncertainties by utilizing the probability concept. Generally known as Belief
Networks, Bayesian Networks are used to show uncertainties using Directed Acyclic Graphs (DAG)

A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical graph, a
DAG consists of a set of nodes and links, where the links signify the connection between the nodes.

The nodes here represent random variables, and the edges define the relationship between these variables.


A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution
(CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of
each variable in the network.
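A small worked example of Bayes' theorem in Python (all probabilities below are made up for illustration), with X = "patient has liver disease" and Y = "test is positive":

p_x = 0.01            # prior P(X)
p_y_given_x = 0.95    # likelihood P(Y/X)
p_y_given_not_x = 0.10

# Marginal probability P(Y) via the law of total probability.
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Posterior P(X/Y) = P(Y/X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))   # about 0.088

Even a fairly accurate test yields a modest posterior here because the prior is small, which is exactly the kind of belief update the Bayesian interpretation describes.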

Lecture Number: 31

Topics to cover: Rule Based Classification

IF-THEN Rules
Rule-based classifiers make use of a set of IF-THEN rules for classification. We can express a rule in the
following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent (condition) part consists of one or more attribute tests, and these tests are
logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm
Sequential covering algorithms can be used to extract IF-THEN rules from the training data. We do not
need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the
tuples of that class.


Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the
rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and
the process continues for the remaining tuples.
Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the
path to each leaf in a decision tree corresponds to a rule.
The following is the sequential covering algorithm, where rules are learned for one class at a time. When
learning a rule for a class Ci, we want the rule to cover tuples of class Ci only and no tuples from any
other class.
Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:
    Rule_set = { };                          // initial set of rules learned is empty
    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove tuples covered by Rule from D;
            Rule_set = Rule_set + Rule;      // add the new rule to the rule set
        until termination condition;
    end for
    return Rule_set;
Rule Pruning
Rules are pruned for the following reasons −
• The assessment of quality is made on the original set of training data. A rule may perform
well on the training data but less well on subsequent data; that is why rule pruning is
required.
• A rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has
greater quality than R as assessed on an independent set of tuples (the pruning set).
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,
FOIL_Prune(R) = (pos − neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value increases with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value
is higher for the pruned version of R, then we prune R.
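A minimal Python sketch of the FOIL pruning measure (the coverage counts below are hypothetical):

def foil_prune(pos, neg):
    # pos, neg: positive and negative tuples covered by rule R on the pruning set.
    return (pos - neg) / (pos + neg)

original = foil_prune(pos=80, neg=40)   # about 0.33
pruned   = foil_prune(pos=75, neg=20)   # about 0.58
if pruned >= original:
    print("keep the pruned version of R")   # higher value -> better accuracy on the pruning set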


Lecture Number: 32

Topics to cover: Model Evaluation and Selection

Model selection and evaluation is a hugely important procedure in the machine learning workflow. This is
the stage of the workflow in which we analyse our model: we look at more insightful statistics of its
performance and decide what actions to take in order to improve it. This step is usually the difference
between a model that performs well and a model that performs very well. When we evaluate our model, we
gain greater insight into what it predicts well and what it doesn't, and this helps us turn it from a model
that predicts our dataset with a 65% accuracy level into one closer to 80% or 90%.

Metrics and Scoring

Let's say we have two hypotheses for a task, h(x) and h'(x). How would we know which one is better?
From a high-level perspective, we might take the following steps:

1. Measure the accuracy of both hypotheses.

2. Determine whether there is any statistically significant difference between the two results. If there
is, select the better-performing hypothesis. If not, we cannot say with any statistical certainty that
either h(x) or h'(x) is better.

When we have a classification task, we will consider the accuracy of our model by its
ability to assign an instance to its correct class. Consider this on a binary level. We
have two classes, 1 and 0. We would classify a correct prediction therefore as being
when the model classifies a class 1 instance as class 1, or a class 0 instance as class 0.

Assuming our 1 class as being the ‘Positive class’ and the 0 class being the ‘Negative
class’, we can build a table that outlines all the possibilities our model might produce.

We also have names for these classifications. The True Positive and True Negative are our correct
classifications: in both cases the actual class and the predicted class are the same. The other two cases,
in which the model predicts incorrectly, are the False Positive (a negative-class instance predicted as
positive) and the False Negative (a positive-class instance predicted as negative).
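A small sketch (with made-up labels) of counting these four outcomes and computing accuracy; scikit-learn's metrics are used here only as one convenient option, not as a method prescribed by these notes:

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")           # TP=3 TN=3 FP=1 FN=1
print("accuracy:", accuracy_score(y_true, y_pred))  # 0.75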

Lecture Number: 33

Topics to cover: Techniques to Improve Classification Accuracy

1. Working on the data side


1. Method 1: Acquire more data
2. Method 2: missing value treatment
3. Method 3: Outlier treatment
4. Method 4: Feature engineering
2. Working on the model side
1. Method 1: Hyperparameter tuning
2. Method 2: Applying different models
3. Method 3: Ensembling methods
4. Method 4: Cross-validation

Let’s start with understanding classification models.

About classification models


This section assumes that the reader has some knowledge of classification models; for those who do not,
we begin with a general introduction to classification models.

Classification data, in data science, consists of information divided into classes. For example, in data
about people from a city, the gender of a person is included in the information about that person, so the
gender attribute can be considered the class attribute.

A classification model can be defined as an algorithm that can tell us about the
class of the data points using the other information of the data. For example, in
the data of people from a city, we can have information such as name,
education, work, and marital status. Based on this information, a trained model
can tell us about the gender of the person.

Building such a model, which can tell us the class of any data point using the other information the data
point consists of, is called classification modelling. Anyone who practises classification modelling knows
that we cannot build a model with no error; there will always be some possibility of error, and as modellers
it is our responsibility to reduce the errors of the model, that is, to increase its accuracy.

In this section, we discuss some of the methods that can increase the accuracy of a classification model;
the same discussion can also be read as a set of methods for reducing the error of the model. Since
acquiring data is one of the first steps of the modelling procedure, we start with the methods that can be
applied on the data side.


Working on the data side


Various methods can be applied to the data to increase the accuracy of the model. Some of them are as
follows:

Method 1: Acquire more data


One thing that classification modelling always requires is to let the data tell the model about itself as
much as possible. We may run into problems with classification modelling when the amount of data is very
small. In such a scenario we need techniques or sources that can provide more data.

If we are practising modelling or participating in competitions, getting more data is difficult, and we may
resort to duplicating (oversampling) training records to boost accuracy; if we are working on a company
project, we can ask the data source for more data if possible. This is one of the most basic methods that
can lead to higher accuracy than before. Let's move to the second method.

Method 2: Missing value treatment


There are various reasons why missing values appear in data. Here we are not concerned with how missing
values arise but with how to treat them. Take a look at the table below.

Age   Name      Gender   Working status
18    Rahul     M        N
20    Archana   F        Y
29    Ashok     M        Y
18    Anamika   -        N
27    Ankit     M        Y

Let's take a look at the table below, which represents the percentage of working people according to gender.

Gender   Percentage working
Female   100%
Male     66.66%


Here we can see that there is a missing value, and because of it the records show that 100% of females are
working. Now if we fill the missing gender with F, the results will be as follows.

Gender   Percentage working
Female   50%
Male     66.66%

Here we can see the effect of missing values in the final prediction.

There are various ways to deal with missing values; some of them are as follows (a short imputation sketch
is given after the list):

• Mean, median, and mode: This method is for dealing with missing values in continuous data (age in our
case). We can fill missing values using the mean, median, or mode of the continuous variable.
• Separation of class: This method separates the data points that have missing values into their own
class; it can be used for categorical data.
• Missing value prediction: This method uses another model that predicts the missing values. After
filling in the missing values using this model, we can continue with the main modelling task.
• KNN imputation: This method finds the data points whose attributes are most similar to the record with
the missing value and fills in the value based on the information available in those similar data points.
• Filling the closest value: This method fills in the nearest observed value in place of the missing
value. It can be used with continuous data but works best with time-series data.
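A short sketch (hypothetical values) of three of the treatments above: mean imputation for a numeric column, mode imputation for a categorical column, and filling with the closest previous value.

import pandas as pd

df = pd.DataFrame({
    "Age":     [18, 20, None, 18, 27],
    "Gender":  ["M", "F", "M", None, "M"],
    "Working": ["N", "Y", "Y", None, "Y"],
})

df["Age"] = df["Age"].fillna(df["Age"].mean())              # mean of the observed ages
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])  # most frequent category
df["Working"] = df["Working"].ffill()                       # closest previous value
print(df)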

Method 3: Outlier treatment


Fitting a classification model can also be thought of as fitting a line or region to the data points. So if
the data points are close to each other, fitting a model can give better results because the prediction
region is dense. Some ways of treating outliers are listed below, followed by a short detection sketch.


• Deleting outliers: By plotting the data points we can detect values that lie far from the dense area, and
we can delete such sparse data points if we have a very large amount of data. With a small amount of data,
this way of treating outliers is not a good choice.

• Data transformation: Data transformation can help reduce the effect of outlier data points in modelling.
One way to do this is to take the logarithm of the data points to reduce the sparsity of the data. Binning
is another way to transform data, and many algorithms, such as decision trees, deal with outlier data
points by binning the data.
• Mean, median, mode: This method is similar to the one discussed for missing values. Before using it, we
need to determine whether the detected outlier is natural or artificial. If the value is artificial, we can
replace the outlier with the mean, median, or mode of the other data points.
• Separation: If the number of outliers is large, we can separate them from the main data and fit a model
on them separately.
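A minimal sketch of detecting outliers with the interquartile range (the values are made up); detected points can then be deleted, capped, or replaced as described above:

import pandas as pd

ages = pd.Series([18, 20, 22, 21, 19, 23, 20, 95])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual 1.5*IQR fences

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())   # [95]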

Method 4: Feature engineering


Feature engineering is one of the best ways to increase the accuracy of a classification model, since it
lets the model work with only those variables that are highly correlated with the target variable. We can
also think of this method as forming a hypothesis about what drives the accuracy of the model. We can
perform feature engineering in three steps:

1. Transformation of features: At its core, transformation of the features includes two main processes
that can be applied one after the other or, in some cases, individually.
   1. Scaling the data: in this process we normalize the data so that it lies between 0 and 1 (see the
      scaling sketch after this list). For example, if we have variables expressed in grams, kilograms,
      and tonnes, it is always advisable to fit the model after normalizing such data to improve its
      accuracy.
   2. Removing skewness: many models give higher accuracy when the data is approximately normally
      distributed, so before fitting a classification model it is usually worth reducing the skewness of
      the data as much as possible.

2. Creation of features: This process can be considered as the creation of new features from the old
features of the data. It can help us understand the data and generate new insights from it. Suppose, for
example, that the hour of the day has no relationship with the daily traffic on a website but a feature
derived from the minutes does; such information can help improve the accuracy of the model.
3. Feature selection: This process tells us about the relationship of each feature with the target
variable. Using it, we generally reduce the number of features that are going to be modelled; since only
the best features are fed into the model, it helps improve the results. Several approaches help in
selecting the best features:

   1. Knowledge: Based on domain knowledge we can say which variables are the most important to model. For
      example, for the daily sales of a shop, the day and the amount of material are important, but the
      name of the customer is not.
   2. Parameters: Parameters such as the p-value help in determining the relationship between variables.
      Using them we can differentiate between important and unimportant features for modelling.
   3. Dimension reduction: Various dimensionality reduction algorithms project the data into a
      lower-dimensional space and also help us understand the inherent relationships in the data.
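As promised above, a brief sketch of the scaling step (step 1) using scikit-learn's MinMaxScaler; the weights are hypothetical and assumed to have been converted to a common unit (kg) first:

from sklearn.preprocessing import MinMaxScaler

weights_kg = [[0.5], [2.0], [75.0], [1000.0]]   # originally grams, kilograms, tonnes

scaler = MinMaxScaler()                  # rescales each feature to the 0-1 range
print(scaler.fit_transform(weights_kg))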

Working on the model side


Some of the techniques that can be followed to increase the accuracy of
classification modelling are as follows:

Method 1: Hyperparameter tuning


In the core of the models, we find some of the units that drive the model to
make final results. We call these units as parameters of the model. These
parameters take some values to perform their task under the model. For
example, the Gini impurity under the decision tree model helps the tree to split
the data into branches.


Since the way the data is split in a decision tree has a large impact on its accuracy, we need to find
optimal settings such as the splitting criterion and the tree depth. Finding optimal values for the
parameters of a model is known as hyperparameter tuning.

Hyperparameter tuning can have a large impact on the performance and accuracy of the model, and various
packages help perform it, even in an automated manner. A grid search sketch is given below.
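A small sketch of hyperparameter tuning with an exhaustive grid search; the dataset and parameter values are illustrative only, not recommendations:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4, None]}

# Every combination in param_grid is evaluated with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))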

Method 2: Applying different models


There is always a chance that applying a single model, even after extensive hyperparameter tuning, will not
boost the accuracy of the procedure. In such a scenario, applying different classification models to the
data can help increase accuracy; hyperparameter tuning can then be applied after finalizing a model for
even better results.

Various packages help in finding an optimal model and the hyperparameters of that model in an automated
manner.

Method 3: Ensembling methods


In machine learning, ensemble methods differ from single-model approaches: they fit several weak models to
the data and combine their results, and it is this combination of results that makes them more accurate.
We can categorize ensembling methods into two categories (a short sketch follows the list):

• Averaging methods: These methods combine the results of different models by averaging them. They can be
considered better than applying a single model to the data. Examples are bagging meta-estimators and
forests of random trees (random forests).
• Boosting methods: These methods reduce the bias of the combined estimator by applying the base models
sequentially. They are very powerful in terms of performance and accuracy. Examples are AdaBoost and
gradient tree boosting.
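A minimal sketch comparing an averaging ensemble (random forest) with a boosting ensemble (AdaBoost) on a toy dataset; the dataset and settings are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),   # averaging
              AdaBoostClassifier(n_estimators=100, random_state=0)):      # boosting
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_te, y_te), 3))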

Method 4: Cross-validation
Although this method is related to both the data and the model side, since we apply models several times
in this technique we can treat cross-validation as a method of working on the modelling side. A model that
performs well on the training data is not necessarily accurate on unseen data; cross-validation is a way of
verifying the accuracy of the model on held-out folds.
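A short sketch of 5-fold cross-validation; the model and dataset are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # one score per held-out fold, plus the average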

Lecture Number: 34

Topics to cover: Ensemble Methods

Lecture Number: 35

Topics to cover: Handling Different Kinds of Cases in Classification

Lecture Number: 36

Topics to cover: Classification by Neural Networks-I


A neural network is a method in artificial intelligence that teaches computers to process data in a way that is inspired by
the human brain. It is a type of machine learning process, called deep learning, that uses interconnected nodes or neurons
in a layered structure that resembles the human brain. It creates an adaptive system that computers use to learn from their
mistakes and improve continuously. Thus, artificial neural networks attempt to solve complicated problems, like
summarizing documents or recognizing faces, with greater accuracy.

Why are neural networks important?


Neural networks can help computers make intelligent decisions with limited human assistance. This is because they can
learn and model the relationships between input and output data that are nonlinear and complex. For instance, they can do
the following tasks.


Make generalizations and inferences


Neural networks can comprehend unstructured data and make general observations without explicit training. For instance,
they can recognize that two different input sentences have a similar meaning:

• Can you tell me how to make the payment?


• How do I transfer money?

A neural network would know that both sentences mean the same thing. Or it would be able to broadly recognize that
Baxter Road is a place, but Baxter Smith is a person’s name.

What are neural networks used for?


Neural networks have several use cases across many industries, such as the following:

• Medical diagnosis by medical image classification


• Targeted marketing by social network filtering and behavioral data analysis
• Financial predictions by processing historical data of financial instruments
• Electrical load and energy demand forecasting
• Process and quality control
• Chemical compound identification

Some of the important applications of neural networks are given below.

Computer vision
Computer vision is the ability of computers to extract information and insights from images and videos. With neural
networks, computers can distinguish and recognize images similar to humans. Computer vision has several applications,
such as the following:

• Visual recognition in self-driving cars so they can recognize road signs and other road users
• Content moderation to automatically remove unsafe or inappropriate content from image and video archives
• Facial recognition to identify faces and recognize attributes like open eyes, glasses, and facial hair
• Image labeling to identify brand logos, clothing, safety gear, and other image details

Speech recognition
Neural networks can analyze human speech despite varying speech patterns, pitch, tone, language, and accent. Virtual
assistants like Amazon Alexa and automatic transcription software use speech recognition to do tasks like these:

• Assist call center agents and automatically classify calls


• Convert clinical conversations into documentation in real time
• Accurately subtitle videos and meeting recordings for wider content reach

Natural language processing


Natural language processing (NLP) is the ability to process natural, human-created text. Neural networks help computers
gather insights and meaning from text data and documents. NLP has several use cases, including in these functions:


• Automated virtual agents and chatbots


• Automatic organization and classification of written data
• Business intelligence analysis of long-form documents like emails and forms
• Indexing of key phrases that indicate sentiment, like positive and negative comments on social media
• Document summarization and article generation for a given topic

Lecture Number: 37

Topics to cover: Classification by Neural Networks-II

What are the types of neural networks?


Artificial neural networks can be categorized by how the data flows from the input node to the output node. Below are
some examples:

Feedforward neural networks


Feedforward neural networks process data in one direction, from the input node to the output node. Every node in one
layer is connected to every node in the next layer. A feedforward network uses a feedback process to improve predictions
over time.

Backpropagation algorithm
Artificial neural networks learn continuously by using corrective feedback loops to improve their predictive analytics. In
simple terms, you can think of the data flowing from the input node to the output node through many different paths in the
neural network. Only one path is the correct one that maps the input node to the correct output node. To find this path, the
neural network uses a feedback loop, which works as follows:

1. Each node makes a guess about the next node in the path.
2. It checks if the guess was correct. Nodes assign higher weight values to paths that lead to more correct guesses
and lower weight values to node paths that lead to incorrect guesses.
3. For the next data point, the nodes make a new prediction using the higher weight paths and then repeat Step 1.

Convolutional neural networks


The hidden layers in convolutional neural networks perform specific mathematical functions, like summarizing or
filtering, called convolutions. They are very useful for image classification because they can extract relevant features from
images that are useful for image recognition and classification. The new form is easier to process without losing features
that are critical for making a good prediction. Each hidden layer extracts and processes different image features, like
edges, color, and depth.

How to train neural networks?


Neural network training is the process of teaching a neural network to perform a task. Neural networks learn by initially
processing several large sets of labeled or unlabeled data. By using these examples, they can then process unknown inputs
more accurately.


Supervised learning
In supervised learning, data scientists give artificial neural networks labeled datasets that provide the right answer in
advance. For example, a deep learning network training in facial recognition initially processes hundreds of thousands of
images of human faces, with various terms related to ethnic origin, country, or emotion describing each image.

The neural network slowly builds knowledge from these datasets, which provide the right answer in advance. After the
network has been trained, it starts making guesses about the ethnic origin or emotion of a new image of a human face that
it has never processed before.
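As a small illustration of supervised training of a feedforward network with backpropagation, the sketch below uses scikit-learn's MLPClassifier on a labeled digits dataset; the layer size and iteration count are arbitrary, and this is only one possible way to train such a network:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)              # labeled images of digits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One hidden layer; the connection weights are adjusted by backpropagating the error.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_tr, y_tr)
print("test accuracy:", round(net.score(X_te, y_te), 3))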

Lecture Number: 38

Topics to cover: Support Vector.Machines

Support Vector Machine


Support Vector Machine (SVM) is a supervised machine learning algorithm used for both
classification and regression. Though it can be used for regression problems as well, it is best suited
for classification.
The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional
space that can separate the data points in different classes in the feature space.
The hyperplane is chosen so that the margin between the closest points of the different classes is as
large as possible. The dimension of the hyperplane depends on the number of features: if the number of
input features is two, the hyperplane is just a line; if the number of input features is three, the
hyperplane becomes a 2-D plane. It becomes difficult to imagine when the number of features exceeds three.
Let’s consider two independent variables x1, x2, and one dependent variable which is either a blue
circle or a red circle.

Linearly Separable Data points


From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate our data points or do a
classification between red and blue circles. So how do we choose the best line or in general the
best hyperplane that segregates our data points?

How does SVM work?

One reasonable choice as the best hyperplane is the one that represents the largest separation or
margin between the two classes.

Multiple hyperplanes separate the data from two classes

So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2. Let’s consider a scenario like shown below


Selecting hyperplane for data with outlier

Here we have one blue ball in the boundary of the red ball. So how does SVM classify the data?
It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The SVM
algorithm has the characteristics to ignore the outlier and finds the best hyperplane that maximizes
the margin. SVM is robust to outliers.

Hyperplane which is the most optimized one

So for this type of data, SVM finds the maximum margin as done with the previous data sets, and in
addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft
margins. When there is a soft margin, the SVM tries to minimize (1/margin + λ(∑penalty)). Hinge loss is a
commonly used penalty: if there are no violations there is no hinge loss, and if there are violations the
hinge loss is proportional to the distance of the violation.
Till now, we were talking about linearly separable data (the groups of blue balls and red balls are
separable by a straight/linear line). What do we do if the data is not linearly separable?

Original 1D dataset for classification

Say our data is as shown in the figure above. SVM solves this by creating a new variable using a kernel.
For a point xi on the line we create a new variable yi as a function of its distance from the origin o.
If we plot this, we get something like what is shown below.

Mapping 1D data to 2D to become able to separate the two classes

In this case, the new variable y is created as a function of distance from the origin. A non-linear
function that creates a new variable is referred to as a kernel.

Support Vector Machine Terminology

1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data
points of different classes in a feature space. In the case of linear classifications, it will
be a linear equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, which
makes a critical role in deciding the hyperplane and margin.
3. Margin: Margin is the distance between the support vector and hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. The wider
margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original
input data points into high-dimensional feature spaces, so, that the hyperplane can be
easily found out even if the data points are not linearly separable in the original input
space. Some of the common kernel functions are linear, polynomial, radial basis
function(RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM
permits a soft margin technique. Each data point has a slack variable introduced by the
soft-margin SVM formulation, which softens the strict margin requirement and permits
certain misclassifications or violations. It discovers a compromise between increasing
the margin and reducing violations.


7. C: Margin maximisation and misclassification fines are balanced by the regularisation


parameter C in SVM. The penalty for going over the margin or misclassifying data items
is decided by it. A stricter penalty is imposed with a greater value of C, which results in
a smaller margin and perhaps fewer misclassifications.
8. Hinge Loss: A typical loss function in SVMs is hinge loss. It punishes incorrect
classifications or margin violations. The objective function in SVM is frequently formed
by combining it with the regularisation term.

Types of Support Vector Machine

Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are
very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature space,
where the data points can be linearly separated. A linear SVM is used to locate a
nonlinear decision boundary in this modified space.

Advantages of SVM
• Effective in high-dimensional cases.
• Memory efficient, as it uses only a subset of the training points (the support vectors) in the
decision function.
• Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
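A minimal sketch of training a soft-margin SVM classifier with an RBF kernel; the dataset and the values of C and kernel are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0)        # soft-margin SVM with an RBF kernel
clf.fit(X_tr, y_tr)
print("support vectors per class:", clf.n_support_)
print("test accuracy:", round(clf.score(X_te, y_te), 3))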

Lecture Number: 39

Topics to cover: Pattern-Based Classification


Lecture Number: 40

Topics to cover: Lazy Learners (or Learning from Your Neighbours).

What is Lazy Learning?


Lazy learning is a type of machine learning that doesn't process training data until it needs to
make a prediction. Instead of building models during training, lazy learning algorithms wait
until they encounter a new query. This method stores and compares training examples when
making predictions. It's also called instance-based or memory-based learning.

Lazy Learning Explained


Lazy learning algorithms work by memorizing the training data rather than constructing a
general model. When a new query is received, lazy learning retrieves similar instances from
the training set and uses them to generate a prediction. The similarity between instances is
usually calculated using distance metrics, such as Euclidean distance or cosine similarity.
One of the most popular lazy learning algorithms is the k-nearest neighbors (k-NN) algorithm.
In k-NN, the k closest training instances to the query point are considered, and their class
labels are used to determine the class of the query. Lazy learning methods excel in situations
where the underlying data distribution is complex or where the training data is noisy.
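A small sketch of the k-NN lazy learner: fit() essentially just stores the training instances, and the distance comparisons happen at prediction time (the dataset and the choice of k are arbitrary):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_tr, y_tr)                       # essentially memorises the training set
print("test accuracy:", round(knn.score(X_te, y_te), 3))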

Examples of Real-World Lazy Learning Applications


Lazy learning has found applications in various domains. Here are a few examples:
• Recommendation systems. Lazy learning is widely used in recommender systems to
provide personalized recommendations. By comparing user preferences to similar users
in the training set, lazy learning algorithms can suggest items or products of interest,
such as movies, books, or products.
• Medical diagnosis. Lazy learning can be employed in medical diagnosis systems. By
comparing patient symptoms and medical histories to similar cases in the training data,
lazy learning algorithms can assist in diagnosing diseases or suggesting appropriate
treatments.
• Anomaly detection. Lazy learning algorithms are useful for detecting anomalies or
outliers in datasets. For example, an algorithm can detect credit card fraud by comparing
a transaction to nearby transactions based on factors like location and history. If the


transaction is unusual, such as being made in a faraway location for a large amount, it
may be flagged as fraudulent.

Lazy Learning vs Eager Learning Models


Lazy learning stands in contrast to eager learning methods, such as decision trees or neural
networks, where models are built during the training phase. Here are some key differences:
• Training phase. Eager learning algorithms construct a general model based on the
entire training dataset, whereas lazy learning algorithms defer model construction until
prediction time.
• Computational cost. Lazy learning algorithms can be computationally expensive
during prediction since they require searching through the training data to find nearest
neighbors. In contrast, eager learning algorithms typically have faster prediction times
once the model is trained.
• Interpretability. Eager learning methods often provide more interpretability as they
produce explicit models, such as decision trees, that can be easily understood by
humans. Lazy learning methods, on the other hand, rely on the stored instances and do
not provide explicit rules or models.

What are the Benefits of Lazy Learning?


Lazy learning offers several advantages:
• Adaptability. Lazy learning algorithms can adapt quickly to new or changing data.
Since the learning process happens at prediction time, they can incorporate new
instances without requiring complete retraining of the model.
• Robustness to outliers. Lazy learning algorithms are less affected by outliers compared
to eager learning methods. Outliers have less influence on predictions because they are
not used during the learning phase.
• Flexibility. When it comes to handling complex data distributions and nonlinear
relationships, lazy learning algorithms are effective. They can capture intricate decision
boundaries by leveraging the information stored in the training instances.


Lecture Number: 41

Topics to cover: Basic Concepts of Cluster Analysis

INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data mining that groups similar data
points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such
that the data points within each group are more similar to each other than to data points in other
groups. This process is often used for exploratory data analysis and can help identify patterns or
relationships within the data that may not be immediately obvious. There are many different
algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based
clustering. The choice of algorithm will depend on the specific requirements of the analysis and
the nature of the data being analyzed.
Cluster Analysis is the process of finding similar groups of objects in order to form clusters. It is an
unsupervised machine learning-based algorithm that acts on unlabelled data. A group of similar data
points comes together to form a cluster, in which all the objects belong to the same group.
The given data is divided into different groups by combining similar objects into a group. This
group is nothing but a cluster. A cluster is nothing but a collection of similar data which is
grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as
cars, buses, and bicycles. As this is unsupervised learning, there are no class labels such as Cars or
Bikes; all the data is combined and is not organized in a structured manner.
Now our task is to convert the unlabelled data to labelled data and it can be done using clusters.
The main idea of cluster analysis is that it would arrange all the data points by forming clusters
like cars cluster which contains all the cars, bikes clusters which contains all the bikes, etc.
Simply it is the partitioning of similar objects which are applied to unlabelled data.

Properties of Clustering :
1. Clustering Scalability: Nowadays there are vast amounts of data, and clustering algorithms must deal
with huge databases. To handle extensive databases, the clustering algorithm should be scalable; if it is
not, we may not get appropriate results, which would lead to wrong conclusions.
2. High Dimensionality: The algorithm should be able to handle high dimensional space along
with the data of small size.


3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with
algorithms of clustering. It should be capable of dealing with different types of data like discrete,
categorical and interval-based data, binary data etc.
4. Dealing with unstructured data: There would be some databases that contain missing
values, and noisy or erroneous data. If the algorithms are sensitive to such data then it may lead
to poor quality clusters. So it should be able to handle unstructured data and give some structure
to the data by organising it into groups of similar data objects. This makes the job of the data
expert easier in order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable. The interpretability reflects how easily the data is understood.

Lecture Number: 42
Partitioning Methods, Hierarchical Methods
Topics to cover:

Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to make partitions of the data in order to form clusters. If "n"
partitions are made of the "p" objects of the database, then each partition is represented by a cluster,
where n ≤ p. The two conditions which need to be satisfied by this partitioning clustering method are:
• Each object must belong to exactly one group.
• There should be no group without even a single object.
In the partitioning method there is a technique called iterative relocation, which means an object may be
moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data
objects is created. We can classify hierarchical methods and will be able to know the purpose of
classification on the basis of how the hierarchical decomposition is formed. There are two types
of approaches for the creation of hierarchical decomposition, they are:
• Agglomerative Approach: The agglomerative approach is also known as the bottom-
up approach. Initially, the given data is divided into which objects form separate
groups. Thereafter it keeps on merging the objects or the groups that are close to one
another which means that they exhibit similar properties. This merging process
continues until the termination condition holds.
• Divisive Approach: The divisive approach is also known as the top-down approach.
In this approach, we would start with the data objects that are in the same cluster. The
group of individual clusters is divided into small clusters by continuous iteration. The

iteration continues until the condition of termination is met or until each cluster
contains one object.
Once the group is split or merged then it can never be undone as it is a rigid method and is not so
flexible. The two approaches which can be used to improve the Hierarchical Clustering Quality
in Data Mining are: –
• One should carefully analyze the linkages of the object at every partitioning of
hierarchical clustering.
• One can use a hierarchical agglomerative algorithm for the integration of hierarchical
agglomeration. In this approach, first, the objects are grouped into micro-clusters.
After grouping data objects into microclusters, macro clustering is performed on the
microcluster.
Density-Based Method: The density-based method mainly focuses on density. In this method,
the given cluster will keep on growing continuously as long as the density in the neighbourhood
exceeds some threshold, i.e, for each data point within a given cluster. The radius of a given
cluster has to contain at least a minimum number of points.
Grid-Based Method: In the Grid-Based method a grid is formed using the object together,i.e,
the object space is quantized into a finite number of cells that form a grid structure. One of the
major advantages of the grid-based method is fast processing time and it is dependent only on
the number of cells in each dimension in the quantized space. The processing time for this
method is much faster so it can save time.
Model-Based Method: In the model-based method, all the clusters are hypothesized in order to
find the data which is best suited for the model. The clustering of the density function is used to
locate the clusters for a given model. It reflects the spatial distribution of data points and also
provides a way to automatically determine the number of clusters based on standard statistics,
taking outlier or noise into account. Therefore it yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the
incorporation of application or user-oriented constraints. A constraint refers to the user
expectation or the properties of the desired clustering results. Constraints provide us with an
interactive way of communication with the clustering process. The user or the application
requirement can specify constraints.
Applications Of Cluster Analysis:
• It is widely used in image processing, data analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the web.
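To make the partitioning and hierarchical methods described above concrete, the following sketch runs k-means and agglomerative clustering on the same synthetic data (the dataset and the choice of three clusters are illustrative):

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)        # partitioning method
agglo = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)   # hierarchical (agglomerative)

print("k-means labels:      ", kmeans.labels_[:10])
print("agglomerative labels:", agglo.labels_[:10])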

Lecture Number: 43

Topics to cover: Clustering Structures, Major Clustering Approaches

What is a Cluster?
o A cluster is a subset of similar objects

o A subset of objects such that the distance between any of the two objects in the cluster is less than the
distance between any object in the cluster and any object that is not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.

What is clustering in Data Mining?


o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called
clusters.
o It helps users to understand the structure or natural grouping in a data set, and it is used either as a stand-alone
tool to get better insight into the data distribution or as a pre-processing step for other algorithms.

Important points:
o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data similarity and then
assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out
important characteristics that differentiate distinct groups.

Applications of cluster analysis in data mining:


o In many applications, clustering analysis is widely used, such as data analysis, market research, pattern
recognition, and image processing.
o It assists marketers in finding different groups in their client base; based on purchasing patterns, they
can characterize their customer groups.
o It helps in allocating documents on the internet for data discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to
analyze the characteristics of each cluster.
o In terms of biology, it can be used to determine plant and animal taxonomies, categorize genes with
the same functionalities, and gain insight into structures inherent in populations.
o It helps in the identification of areas of similar land that are used in an earth observation database and the
identification of house groups in a city according to house type, value, and geographical location.

Lecture Number: 44

Topics to cover: Density-Based Methods, Model-Based Clustering

Density-based clustering in data mining


Density-based clustering refers to a method that is based on a local cluster criterion, such as density-connected
points. In this section, we discuss density-based clustering and its basic definitions, with an example.

What is Density-based clustering?


Density-Based Clustering refers to one of the most popular unsupervised learning methodologies used in
model building and machine learning algorithms. Data points that lie in low-density regions separating two
clusters are treated as noise. The surroundings within a radius ε of a given object are known as the
ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum
number of objects, MinPts, then the object is called a core object.

Density-Based Clustering - Background


Density-based clustering is controlled by two parameters:

EPS: It is considered as the maximum radius of the neighborhood.

MinPts: MinPts refers to the minimum number of points in an Eps neighborhood of that point.

NEps(i) = { k | k belongs to D and dist(i, k) <= Eps }

Directly density reachable:

A point i is considered directly density reachable from a point k with respect to Eps and MinPts if

i belongs to NEps(k)

and k satisfies the core point condition:

|NEps(k)| >= MinPts

Density reachable:

A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of
points i1, ..., in with i1 = j and in = i such that i(m+1) is directly density reachable from i(m) for every m.

Density connected:

A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that
both i and j are density reachable from o with respect to Eps and MinPts.
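The definitions above are the building blocks of DBSCAN, the classic density-based clustering
algorithm. The sketch below runs DBSCAN with scikit-learn, where the parameter eps plays the
role of Eps and min_samples plays the role of MinPts; the sample data and parameter values are
assumptions made only for demonstration.

# A minimal sketch of density-based clustering with DBSCAN (scikit-learn).
# The sample data, eps and min_samples values are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be labelled as noise.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],
              [4.5, 4.5]])

# eps         -> the Eps neighbourhood radius described above
# min_samples -> MinPts, the minimum number of points required in that radius
model = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(model.labels_)   # noise points are labelled -1, clusters 0, 1, ...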

Lecture Number: 45

Topics to cover: Why outlier analysis? Identifying and handling of outliers


Outlier analysis is the process of identifying outliers, or abnormal observations, in a dataset. Also known as
outlier detection, it’s an important step in data analysis, as it removes erroneous or inaccurate observations
which might otherwise skew conclusions.

There are a wide range of techniques and tools used in outlier analysis. However, as we’ll see later, it’s
often very easy to spot outlying data points. As a result, there’s really no excuse not to perform outlier
analysis on any and all datasets.

Benefits
Unlike other data analysis processes, outlier analysis only really has one benefit: it improves the quality of
the dataset being subject to analysis. Of course, this in turn brings benefits. With a higher-quality dataset,
analysts can expect to draw more accurate conclusions (and more of them).

When to Do Outlier Analysis


As mentioned, outlier analysis should be performed as part of any data analysis procedure. In this case,
outlier analysis should be one of the first — if not the first — steps in data analysis. This way, when the
dataset reaches steps that truly involve assessing and interpreting the data, any outliers will have already
been removed.

Lecture Number: 46

Topics to cover: Outlier Detection Techniques

Outlier Analysis Techniques


There are a wide variety of techniques that can be used to identify outliers in datasets. In this section, we’ll
look at just a few of these techniques, including both straightforward and sophisticated ones.

Sorting
For an amateur data analyst, sorting is by far the easiest technique for outlier analysis. The premise is
simple: load your dataset into any kind of data manipulation tool (such as a spreadsheet), and sort the
values by their magnitude. Then, look at the range of values of various data points. If any data points are
significantly higher or lower than others in the dataset, they may be treated as outliers.

Graphing
An equally forgiving tool for outlier analysis is graphing. Once again, the premise is straightforward: plot
all of the data points on a graph, and see which points stand out from the rest. The advantage of using a
graphing approach over a sorting approach is that it visualizes the magnitude of the data points, which
makes it much easier to spot outliers.
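For instance, a box plot makes unusually large or small values stand out immediately. The sketch
below uses matplotlib on made-up numbers; the data values are an assumption for demonstration only.

# A minimal sketch of spotting an outlier by graphing (a box plot).
# The sample data is an illustrative assumption.
import matplotlib.pyplot as plt

data = [10, 12, 11, 13, 12, 11, 13, 12, 95]   # 95 stands out from the rest

plt.boxplot(data)      # points beyond the whiskers are drawn individually
plt.ylabel("value")    # and are natural outlier candidates
plt.show()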

Z-score
A more statistical technique that can be used to identify outliers is the Z-score. The Z-score measures how
far a data point is from the average, as measured in standard deviations. By calculating the Z-score for each
data point, it’s easy to see which data points are placed far from the average. Unfortunately, like sorting,
this doesn’t take into account the influence of a second variable.
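A common rule of thumb flags points whose absolute Z-score exceeds about 3. The sketch below
applies this rule with NumPy; the synthetic data and the threshold are assumptions made only for
demonstration.

# A minimal sketch of Z-score based outlier detection.
# The synthetic data and the |z| > 3 threshold are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=50, scale=5, size=100), [120.0])  # 120 is injected as an outlier

z_scores = (data - data.mean()) / data.std()   # distance from the mean,
                                               # measured in standard deviations
outliers = data[np.abs(z_scores) > 3]
print(outliers)   # the injected value 120.0 is flagged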

Other tests
Aside from sorting, graphing, and Z-scores, there are a whole host of statistical tests that can be used to
identify outliers in a dataset. For most intents and purposes, sorting and graphing are more than enough for
outlier analysis. Z-scores or other statistical tests may only be necessary for academic or high-stakes
purposes, where the true statistical aspect is much more important.

Lecture Number: 47

Topics to cover: WEB MINING: Basic concepts


Web Mining is the application of Data Mining techniques to automatically discover and extract
information from Web documents and services. The main purpose of web mining is to discover
useful information from the World Wide Web and its usage patterns.
Applications of Web Mining:
Web mining is the process of discovering patterns, structures, and relationships in web data. It
involves using data mining techniques to analyze web data and extract valuable insights. The
applications of web mining are wide-ranging and include:
Personalized marketing:
Web mining can be used to analyze customer behavior on websites and social media platforms.
This information can be used to create personalized marketing campaigns that target customers
based on their interests and preferences.
E-commerce:
Web mining can be used to analyze customer behavior on e-commerce websites. This
information can be used to improve the user experience and increase sales by recommending
products based on customer preferences.
Search engine optimization:
Web mining can be used to analyze search engine queries and search engine results pages
(SERPs). This information can be used to improve the visibility of websites in search engine
results and increase traffic to the website.
Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This information can be used
to prevent financial fraud, identity theft, and other types of online fraud.
Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment from posts,
comments, and reviews. This information can be used to understand customer sentiment towards
products and services and make informed business decisions.
Web content analysis:
Web mining can be used to analyze web content and extract valuable information such as
keywords, topics, and themes. This information can be used to improve the relevance of web
content and optimize search engine rankings.
Customer service:
Web mining can be used to analyze customer service interactions on websites and social media
platforms. This information can be used to improve the quality of customer service and identify
areas for improvement.
Healthcare:
Web mining can be used to analyze health-related websites and extract valuable information
about diseases, treatments, and medications. This information can be used to improve the quality
of healthcare and inform medical research.
Process of Web Mining:

Web Mining Process

Web mining can be broadly divided into three types of mining techniques: Web Content Mining,
Web Structure Mining, and Web Usage Mining. These are explained below.

Categories of Web Mining

Lecture Number: 48

Topics to cover: Different types of web mining


1. Web Content Mining: Web content mining is the process of extracting useful
information from the content of web documents. Web content consists of several
types of data: text, images, audio, video, etc. Content data is the collection of facts
that a web page is designed to convey to its users. It can provide effective and
interesting patterns about user needs.
2. Web Structure Mining: Web structure mining is the application of discovering
structure information from the web. The structure of the web graph consists of web
pages as nodes, and hyperlinks as edges connecting related pages.
3. Web Usage Mining: Web usage mining is the process of identifying or
discovering interesting usage patterns from large data sets; these patterns help in
understanding user behaviour. In web usage mining, user access data on the web is
collected in the form of logs, so web usage mining is also called log mining.

Comparison Between Data Mining and Web Mining:

Definition
Data Mining: the process that attempts to discover patterns and hidden knowledge in large data sets in any system.
Web Mining: the process of applying data mining techniques to automatically discover and extract information from web documents.

Application
Data Mining: very useful for web page analysis.
Web Mining: very useful for a particular website and e-service.

Target Users
Data Mining: data scientists and data engineers.
Web Mining: data scientists along with data analysts.

Access
Data Mining: accesses data privately.
Web Mining: accesses data publicly.

Structure
Data Mining: gets the information from explicitly structured data.
Web Mining: gets the information from structured, unstructured and semi-structured web pages.

Problem Type
Data Mining: clustering, classification, regression, prediction, optimization and control.
Web Mining: web content mining, web structure mining.

Tools
Data Mining: tools such as machine learning algorithms.
Web Mining: special tools for web mining such as Scrapy, PageRank and Apache logs.

Skills
Data Mining: data cleansing, machine learning algorithms, statistics and probability.
Web Mining: application-level knowledge and data engineering, with mathematical modules such as statistics and probability.

Lecture Number: 49

Topics to cover: PAGE RANK Algorithm

The PageRank algorithm is applied to web pages and is used by Google Search to rank websites
in its search engine results. The algorithm was named after Larry Page, one of the founders of
Google. We can say that the PageRank algorithm is a way of measuring the importance of website
pages. The web is modelled as a directed graph with two components, nodes and connections:
the pages are the nodes and the hyperlinks are the connections.

Let us see how to solve a PageRank example. Compute the page rank of every node at the end of the
second iteration, using a teleportation factor of 0.8. The example graph has six nodes A to F; from the
calculations below, node A links to B and C, node B links to A, D, E and F, and node C links to A and F.

So the formula is,


PR(A) = (1-β) + β * [PR(B) / Cout(B) + PR(C) / Cout(C)+ ...... + PR(N) / Cout(N)]
Here, β is the teleportation factor, i.e. 0.8, and Cout(X) is the number of out-links of page X.
NOTE: we need to compute the page ranks for at least two iterations.
Let us create a table of the 0th Iteration, 1st Iteration, and 2nd Iteration.

NODES ITERATION 0 ITERATION 1 ITERATION 2

A 1/6 = 0.16 0.3 0.392

B 1/6 = 0.16 0.32 0.3568

C 1/6 = 0.16 0.32 0.3568

D 1/6 = 0.16 0.264 0.2714

E 1/6 = 0.16 0.264 0.2714

F 1/6 = 0.16 0.392 0.4141

Iteration 0:
For iteration 0 assume that each page is having page rank = 1/Total no. of nodes
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 = 0.16
Iteration 1:
By using the above-mentioned formula
PR(A) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [0.16/4 + 0.16/2]
= 0.3

What we have done here for node A is to look at its incoming links, which come from B and C. For
each of these incoming links, we count the outgoing links of the source node: B has 4 outgoing links
and C has 2 outgoing links. The same procedure applies to the remaining nodes and iterations.
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(F) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [0.32/4 + 0.32/2]
= 0.392
This was for iteration 1, now let us calculate iteration 2.
Iteration 2:
By using the above-mentioned formula
PR(A) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [0.32/4 + 0.32/2]
= 0.392
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4

= (1-0.8) + 0.8 * 0.3568/4


= 0.2714
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
= 0.2714
PR(F) = (1-0.8) + 0.8 * [PR(B)/4 + PR(C)/2]
= (1-0.8) + 0.8 * [0.3568/4 + 0.3568/2]
= 0.4141
So, the final PAGE RANK for the above-given question is,

NODES ITERATION 0 ITERATION 1 ITERATION 2

A 1/6 = 0.16 0.3 0.392

B 1/6 = 0.16 0.32 0.3568

C 1/6 = 0.16 0.32 0.3568

D 1/6 = 0.16 0.264 0.2714

E 1/6 = 0.16 0.264 0.2714

F 1/6 = 0.16 0.392 0.4141
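The same two iterations can be reproduced with a short script. The sketch below implements the
formula used above with in-place updates (each node uses the most recently computed ranks,
exactly as in the calculations); the out-link structure of the graph is inferred from the worked
example, since the original figure is not reproduced in these notes.

# A minimal sketch of the iterative PageRank computation worked out above.
# The out-links below are inferred from the example calculations.
out_links = {
    'A': ['B', 'C'],
    'B': ['A', 'D', 'E', 'F'],
    'C': ['A', 'F'],
    'D': [], 'E': [], 'F': [],   # out-links of D, E, F do not appear in the calculations
}
beta = 0.8                                   # teleportation factor
nodes = list(out_links)
pr = {n: 1 / len(nodes) for n in nodes}      # iteration 0: every node gets 1/6

for iteration in (1, 2):                     # the example computes two iterations
    for n in nodes:                          # in-place update: use the latest ranks
        incoming = [m for m in nodes if n in out_links[m]]
        pr[n] = (1 - beta) + beta * sum(pr[m] / len(out_links[m]) for m in incoming)
    print("Iteration", iteration, {k: round(v, 4) for k, v in pr.items()})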

Lecture Number: 50

Topics to cover: HITS Algorithm.

At the same time that PageRank was being developed, Jon Kleinberg, a professor in the Department of
Computer Science at Cornell, came up with his own solution to the web search problem. He developed an
algorithm that makes use of the link structure of the web in order to discover and rank pages relevant for a
particular topic. HITS (hyperlink-induced topic search) is now part of the Ask search engine
(www.Ask.com).

Jon Kleinberg's algorithm called HITS identifies good authorities and hubs for a topic by assigning two
numbers to a page: an authority and a hub weight. These weights are defined recursively. A higher
authority weight occurs if the page is pointed to by pages with high hub weights. A higher hub weight
occurs if the page points to many pages with high authority weights.

In order to get a set rich in both hubs and authorities for a query Q, we first collect the top 200
documents that contain the highest number of occurrences of the search phrase Q. These, as pointed out
before may not be of tremendous practical relevance, but one has to start somewhere. Kleinberg points out
that the pages from this set called root (RQ) are essentially very heterogeneous and in general contain only
a few (if any) links to each other. So the web subgraph determined by these nodes is almost totally
disconnected; in particular, we can not enforce Page Rank techniques on RQ.

Authorities for the query Q are not extremely likely to be in the root set RQ. However, they are likely to
be pointed to by at least one page of the root set. For this reason, the root set RQ is expanded to a base
set SQ by adding the pages that are pointed to by, or that point to, pages in RQ.
From here on, we translate everything into mathematical language. We associate to each page i two
numbers: an authority weight ai and a hub weight hi. We consider pages with a higher ai as being
better authorities, and pages with a higher hi as being better hubs. Given the weights {ai} and {hi}
of all the nodes in SQ, we dynamically update the weights as follows:

ai = sum of the hub weights hj of all pages j that point to i (authority update)
hi = sum of the authority weights aj of all pages j that i points to (hub update)

A good hub increases the authority weight of the pages it points to. A good authority increases the hub
weight of the pages that point to it. The idea is then to apply the two operations above alternately until
equilibrium values for the hub and authority weights are reached.
Let A be the adjacency matrix of the graph SQ (Aij = 1 if page i links to page j, and 0 otherwise), and
denote the authority weight vector by v = (a1, ..., an)^t and the hub weight vector by u = (h1, ..., hn)^t.

The two update operations described above then translate to:

v = A^t u (authority update)
u = A v (hub update)

If we consider that the initial hub weights of the nodes are u0 = (1, 1, ..., 1)^t, then after k steps we get
the system:

vk = (A^t A)^(k-1) A^t u0
uk = (A A^t)^k u0

Example: Let us consider a very simple graph with three nodes and two edges, 1 → 3 and 2 → 3.

The adjacency matrix of the graph is

A = [ 0 0 1
      0 0 1
      0 0 0 ]

and its transpose is

A^t = [ 0 0 0
        0 0 0
        1 1 0 ]

Assume the initial hub weight vector is u0 = (1, 1, 1)^t.

We compute the authority weight vector by v1 = A^t u0 = (0, 0, 2)^t.

Then, the updated hub weight is u1 = A v1 = (2, 2, 0)^t.
This already corresponds to our intuition that node 3 is the most authoritative, since it is the only one
with incoming edges, and that nodes 1 and 2 are equally important hubs. If we repeat the process further,
we will only obtain scalar multiples of the vectors v and u computed at step 1. So the relative weights of
the nodes remain the same.
For more complicated examples of graphs, we would expect the convergence to be problematic, and the
equilibrium solutions (if there are any) to be more difficult to find.

Theorem: Under the assumption that A A^t and A^t A are primitive matrices, the following statements hold:

1. If v1, ..., vk, ... is the sequence of authority weight vectors we have computed, then V1, ..., Vk, ... converges to the
unique probabilistic vector corresponding to the dominant eigenvalue of the matrix A^t A. With a slight abuse
of notation, we denote here by Vk the vector vk normalized so that the sum of its entries is 1.
2. Likewise, if u1, ..., uk, ... are the hub weight vectors that we have iteratively computed, then U1, ..., Uk, ... converges
to the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix A A^t. We use the
same notation, Uk = (1/c) uk, where c is the scalar equal to the sum of the entries of the vector uk.
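The alternating hub/authority updates are easy to sketch with NumPy. The code below uses the small
three-node example graph from above (edges 1 → 3 and 2 → 3) and normalizes the weights after every
step so that they sum to 1; the number of iterations is an arbitrary illustrative choice.

# A minimal sketch of the HITS power iteration on the three-node example above.
import numpy as np

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)   # adjacency matrix: edges 1->3 and 2->3

u = np.ones(A.shape[0])                  # initial hub weights u0 = (1, 1, 1)^t

for _ in range(20):
    v = A.T @ u                          # authority update: v = A^t u
    u = A @ v                            # hub update:       u = A v
    v /= v.sum()                         # normalize so each weight vector
    u /= u.sum()                         # sums to 1 (prevents blow-up)

print("authority weights:", v)           # -> node 3 gets all the authority
print("hub weights:      ", u)           # -> nodes 1 and 2 are equal hubs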
