DSDM Notes
UNIT 1
INTRODUCTION
Course Outcome (CO) 214.1, covered for this unit: Illustrate the concepts of overview of Data Mining.
Number of lectures planned in the teaching plan: TWELVE

DATA SCIENCES: DATA WAREHOUSING AND DATA MINING (2018603A)
Lecture Number: 01
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting through large amounts of data saved in repositories, using pattern recognition technologies including statistical and mathematical techniques. It is the analysis of factual datasets to discover unsuspected relationships and to summarize the records in novel ways that are both understandable and useful to the data owner.
Data mining is closely related to data science. It is carried out by a person, in a particular situation, on a specific data set, with an objective. It spans several kinds of mining, including text mining, web mining, audio and video mining, pictorial data mining, and social media mining, and it is carried out with software that may be simple or highly specialized.
Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, because of the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has followed an evolutionary path in the development of the following functionalities: data collection and database creation, data management, and advanced data analysis.
For example, the early development of data collection and database creation mechanisms served as a prerequisite for the later development of effective mechanisms for data storage and retrieval, and for query and transaction processing. With numerous database systems offering query and transaction processing as common practice, advanced data analysis has naturally become the next target.
Data can be stored in many types of databases and data repositories. One data repository architecture that has emerged is the data warehouse, a repository of multiple heterogeneous data sources organized under a unified schema at a single site to support management decision making.
Data warehouse technology involves data cleaning, data integration, and online analytical processing (OLAP), that is, analysis techniques with functionalities including summarization, consolidation, and aggregation, and the ability to view data from multiple angles.
Lecture Number: 02
1. Relational database
A database system consists of a collection of interrelated data and a set of software programs to manage and access it, including defining structures for the database, making sure that the stored information remains secure and consistent, and managing different types of data access, such as shared, distributed, and concurrent access.
A relational database is a collection of tables, each of which has a unique name and a set of attributes (columns) and can store a large number of rows (records). Every record stored in a table is identified by a unique key. An entity-relationship (ER) model is created to provide a representation of a relational database that features entities and the relationships that exist between them.
2. Data warehouse
A data warehouse is a single data storage location that collects data from multiple sources and then stores it under a unified schema. When data is stored in a data warehouse, it undergoes cleaning, integration, loading, and refreshing. Data stored in a data warehouse is usually summarized; if you want information on data that was stored 6 or 12 months back, you will get it in the form of a summary.
3. Transactional data
A transactional database stores records that are captured as transactions. These transactions include flight bookings, customer purchases, clicks on a website, and others. Every transaction record has a unique ID and lists the items that make up the transaction.
4. Other types of data
We have a lot of other types of data as well that are known for their structure, semantic meanings,
and versatility. They are used in a lot of applications. Here are a few of those data types: data
streams, engineering design data, sequence data, graph data, spatial data, multimedia data, and more.
• Prediction − It predicts some unavailable data values or pending trends. An object can be predicted based on the attribute values of the object and the attribute values of the classes. It may involve the prediction of missing numerical values or of increase/decrease trends in time-related data.
• Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
• Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are the data objects whose behaviour deviates from the general behaviour of other data objects. The analysis of this type of data can be essential for mining knowledge.
• Evolution analysis − It describes the trends for objects whose behaviour changes over time.
Lecture Number: 03
• Associations find commonly co-occurring groupings of things, such as “beers and diapers” or
“bread and butter” commonly purchased and observed together in a shopping cart (i.e., market-
basket analysis). Another type of association pattern captures the sequences of things. These
sequential relationships can discover time-ordered events, such as predicting that an existing
banking customer who already has a checking account will open a savings account followed by
an investment account within a year.
• Predictions tell the nature of future occurrences of certain events based on what has happened in
the past, such as predicting the winner of the Super Bowl or forecasting the absolute temperature
on a particular day.
• Clusters identify natural groupings of things based on their known characteristics, such as
assigning customers in different segments based on their demographics and past purchase
behaviors.
These types of patterns have been manually extracted from data by humans for centuries, but the
increasing volume of data in modern times has created a need for more automatic approaches. As data
sets have grown in size and complexity, direct manual data analysis has increasingly been augmented
with indirect, automatic data processing tools that use sophisticated methodologies, methods, and
algorithms. The manifestation of such evolution of automated and semi-automated means of
processing large data sets is now commonly referred to as data mining.
As mentioned earlier, generally speaking, data mining tasks and patterns can be classified into three
main categories: prediction, association, and clustering. Based on the way in which the patterns are
extracted from the historical data, the learning algorithms of data mining methods can be classified as
either supervised or unsupervised. With supervised learning algorithms, the training data includes both
the descriptive attributes (i.e., independent variables or decision variables) and the class attribute (i.e.,
output variable or result variable). In contrast, with unsupervised learning, the training data includes
only the descriptive attributes. Figure 2.3 shows a simple taxonomy for data mining tasks, along with
the learning methods and popular algorithms for each of the data mining tasks. Out of the three main
categories of tasks, prediction patterns/models can be classified as the outcome of a supervised learning
procedure, while association and clustering patterns/models can be classified as the outcome of
unsupervised learning procedures.
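To make the distinction concrete, here is a minimal, hedged sketch using scikit-learn (assumed to be installed; the Iris dataset and parameter choices are illustrative only): the supervised classifier is trained with both the descriptive attributes and the class attribute, while the clustering algorithm sees only the descriptive attributes.

```python
# Minimal sketch: supervised vs. unsupervised learning (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)   # X: descriptive attributes, y: class attribute

# Supervised learning: the training data includes the class attribute y.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Predicted class:", clf.predict(X[:1]))

# Unsupervised learning: the training data includes only the descriptive attributes X.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster label of first object:", km.labels_[0])
```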
Lecture Number: 04
A data mining query language can be designed to incorporate these primitives, allowing users to
interact with data mining systems flexibly. Having a data mining query language provides a foundation
on which user-friendly graphical interfaces can be built.
Task-relevant data: This specifies the portions of the database or the set of data in which the user is interested. This includes the database attributes or data warehouse dimensions of interest (the relevant attributes or dimensions). In a relational database, the set of task-relevant data can be collected via a relational query involving operations such as selection, projection, join, and aggregation.
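As a hedged sketch of these relational operations in pandas (the table names, columns, and values are hypothetical), the following selection, projection, join, and aggregation steps produce the initial data relation described next:

```python
# Hypothetical sketch: collecting task-relevant data via relational operations (pandas assumed).
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "city": ["Pune", "Mumbai", "Pune"]})
sales = pd.DataFrame({"cust_id": [1, 1, 2, 3], "amount": [100, 250, 80, 40]})

joined = sales.merge(customers, on="cust_id")               # join
selected = joined[joined["city"] == "Pune"]                 # selection
projected = selected[["cust_id", "amount"]]                 # projection
aggregated = projected.groupby("cust_id")["amount"].sum()   # aggregation
print(aggregated)   # the initial data relation used for mining
```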
The data collection process results in a new data relation, called the initial data relation. The initial data relation can be ordered or grouped according to the conditions specified in the query. This data retrieval can be thought of as a subtask of the data mining task.
This initial relation may or may not correspond to a physical relation in the database. Since virtual relations are called views in the field of databases, the set of task-relevant data for data mining is called a minable view.
The kind of knowledge to be mined: This specifies the data mining functions to be performed, such as characterization, discrimination, association or correlation analysis, classification, prediction, clustering, outlier analysis, or evolution analysis.
Background knowledge (concept hierarchies): A concept hierarchy defines a sequence of mappings from low-level concepts to higher-level, more general concepts.
o Rolling up - generalization of data: allows data to be viewed at more meaningful and explicit abstractions and makes it easier to understand. It also compresses the data, so mining would require fewer input/output operations.
o Drilling down - specialization of data: higher-level concept values are replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.
User beliefs regarding relationships in the data are another form of background knowledge. An example of a concept hierarchy for the attribute (or dimension) age is sketched below.
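As a placeholder for the age-hierarchy example mentioned above, here is a minimal sketch of one possible hierarchy (the cut-points are illustrative assumptions, not fixed by the source): raw ages roll up to the concepts youth, middle_aged, and senior, which in turn roll up to all(age).

```python
# Illustrative concept hierarchy for the attribute "age" (cut-points are assumptions).
def age_concept(age: int) -> str:
    """Roll a raw age value up to a higher-level concept."""
    if age <= 20:
        return "youth"
    elif age <= 50:
        return "middle_aged"
    else:
        return "senior"

ages = [15, 32, 47, 68]
print([age_concept(a) for a in ages])   # ['youth', 'middle_aged', 'middle_aged', 'senior']
# Rolling up one more level, every concept maps to the most general concept all(age).
```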
Interestingness measures:
o Simplicity: A factor contributing to the interestingness of a pattern is the pattern's overall simplicity for human comprehension. For example, the more complex the structure of a rule is, the more difficult it is to interpret, and hence the less interesting it is likely to be.
o Certainty (confidence): Each discovered pattern should have a measure of certainty associated with it that assesses the validity or "trustworthiness" of the pattern. A certainty measure for association rules of the form "A => B", where A and B are sets of items, is confidence. Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
Confidence (A => B) = # tuples containing both A and B / # tuples containing A
o Utility (support): The potential usefulness of a pattern is a factor defining its interestingness. It can be estimated by a utility function, such as support. The support of an association pattern refers to the percentage of task-relevant data tuples (or transactions) for which the pattern is true (a small computation sketch for support and confidence follows this list).
Support (A => B) = # tuples containing both A and B / total # of tuples
o Novelty: Novel patterns are those that contribute new information or increased performance to the given pattern set; for example, a data exception. Another strategy for detecting novelty is to remove redundant patterns.
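Here is the small computation sketch referred to above, applying the support and confidence formulas to a toy set of transactions (the transaction data is made up purely for illustration):

```python
# Toy computation of support and confidence for the rule A => B (data is illustrative).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]
A, B = {"bread"}, {"butter"}

n_total = len(transactions)
n_A = sum(1 for t in transactions if A <= t)           # tuples containing A
n_AB = sum(1 for t in transactions if (A | B) <= t)    # tuples containing both A and B

support = n_AB / n_total       # Support(A => B) = 2/4 = 0.5
confidence = n_AB / n_A        # Confidence(A => B) = 2/3 ≈ 0.67
print(support, confidence)
```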
Presentation and visualization of discovered patterns: This refers to the form in which discovered patterns are to be displayed, which may include rules, tables, cross tabs, charts, graphs, decision trees, cubes, or other visual representations.
Users must be able to specify the forms of presentation to be used for displaying the discovered
patterns. Some representation forms may be better suited than others for particular kinds of knowledge.
For example, generalized relations and their corresponding cross tabs or pie/bar charts are good for
presenting characteristic descriptions, whereas decision trees are common for classification.
The data mining system is integrated with a database or data warehouse system so that it can perform its tasks effectively. A data mining system operates in an environment that requires it to communicate with other data systems, such as a database system. The possible integration schemes are as follows −
No coupling − No coupling means that a data mining system will not use any function of a database or data warehouse system. It can retrieve data from a specific source (including a file system), process the data using some data mining algorithms, and then save the mining results in a separate file.
Loose coupling − In this scheme, the data mining system uses some services of a database or data warehouse system. The data is fetched from a data repository handled by these systems. Data mining approaches are used to process the data, and the processed data is then saved either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of the data stored in databases by using query processing or other system facilities.
Semitight coupling − In this scheme, efficient implementations of a few essential data mining primitives are provided in the database/data warehouse system. These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join, and pre-computation of some important statistical measures, such as sum, count, max, min, and standard deviation.
Tight coupling − Tight coupling defines that a data mining system is smoothly integrated into the
database/data warehouse system. The data mining subsystem is considered as one functional element
of an information system.
Lecture Number: 05
Data mining is not an easy task, as the algorithms used can get very complex, and data is not always available in one place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues. The major issues are discussed below −
• Data mining query languages and ad hoc data mining − Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − The data cleaning methods are required to handle the
noise and incomplete objects while mining the data regularities. If the data cleaning methods
are not there then the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty are not interesting.
Performance Issues
There can be performance-related issues such as follows −
• Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.
Lecture Number: 06
Data: It is how the data objects and their attributes are stored.
• An attribute is an object's property or characteristic, for example, a person's hair colour or the air humidity.
• An attribute set defines an object. The object is also referred to as a record, an instance, or an entity.
Different types of attributes or data types:
In data mining, understanding the different types of attributes or data types is essential, as it helps to determine the appropriate data analysis techniques to use. The following are the different types of data (a short pandas sketch follows this list):
1] Nominal Data:
This type of data is also referred to as categorical data. Nominal data represents data that is
qualitative and cannot be measured or compared with numbers. In nominal data, the values
represent a category, and there is no inherent order or hierarchy. Examples of nominal data include
gender, race, religion, and occupation. Nominal data is used in data mining for classification and
clustering tasks.
2] Ordinal Data:
This type of data is also categorical, but with an inherent order or hierarchy. Ordinal data
represents qualitative data that can be ranked in a particular order. For instance, education level can
be ranked from primary to tertiary, and social status can be ranked from low to high. In ordinal
data, the distance between values is not uniform. This means that it is not possible to say that the
difference between high and medium social status is the same as the difference between medium
and low social status. Ordinal data is used in data mining for ranking and classification tasks.
3] Binary Data:
This type of data has only two possible values, often represented as 0 or 1. Binary data is
commonly used in classification tasks, where the target variable has only two possible outcomes.
Examples of binary data include yes/no, true/false, and pass/fail. Binary data is used in data mining
for classification and association rule mining tasks.
4] Interval Data:
This type of data represents quantitative data with equal intervals between consecutive values.
Interval data has no absolute zero point, and therefore, ratios cannot be computed. Examples of
interval data include temperature, IQ scores, and time. Interval data is used in data mining for
clustering and prediction tasks.
5] Ratio Data:
This type of data is similar to interval data, but with an absolute zero point. In ratio data, it is
possible to compute ratios of two values, and this makes it possible to make meaningful
comparisons. Examples of ratio data include height, weight, and income. Ratio data is used in data
mining for prediction and association rule mining tasks.
6] Text Data:
This type of data represents unstructured data in the form of text. Text data can be found in social
media posts, customer reviews, and news articles. Text data is used in data mining for sentiment
analysis, text classification, and topic modeling tasks.
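Here is the short, hedged sketch referred to above, showing one way these attribute types might be represented in pandas (the column names and values are illustrative assumptions):

```python
# Representing different attribute types in pandas (illustrative values).
import pandas as pd

df = pd.DataFrame({
    "occupation": ["teacher", "engineer", "nurse"],              # nominal
    "education": ["primary", "secondary", "tertiary"],           # ordinal
    "passed": [1, 0, 1],                                         # binary
    "temperature_c": [21.5, 30.0, 18.2],                         # interval
    "income": [42000, 55000, 39000],                             # ratio
    "review": ["great service", "too slow", "would recommend"],  # text
})

df["occupation"] = df["occupation"].astype("category")           # unordered categories
df["education"] = pd.Categorical(                                # ordered categories
    df["education"], categories=["primary", "secondary", "tertiary"], ordered=True
)
print(df.dtypes)
```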
Data Quality: Why do we preprocess the data?
Data preprocessing is an essential step in data mining and machine learning as it helps to ensure
the quality of data used for analysis. There are several factors that are used for data quality
assessment, including:
1. Incompleteness:
This refers to missing data or information in the dataset. Missing data can result from various
factors, such as errors during data entry or data loss during transmission. Preprocessing techniques,
such as imputation, can be used to fill in missing values to ensure the completeness of the dataset.
2. Inconsistency:
This refers to conflicting or contradictory data in the dataset. Inconsistent data can result from
errors in data entry, data integration, or data storage. Preprocessing techniques, such as data
cleaning and data integration, can be used to detect and resolve inconsistencies in the dataset.
3. Noise:
This refers to random or irrelevant data in the dataset. Noise can result from errors during data
collection or data entry. Preprocessing techniques, such as data smoothing and outlier detection,
can be used to remove noise from the dataset.
4. Outliers:
Outliers are data points that are significantly different from the other data points in the dataset.
Outliers can result from errors in data collection, data entry, or data transmission. Preprocessing
techniques, such as outlier detection and removal, can be used to identify and remove outliers from
the dataset.
5. Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the dataset. Redundant data
can result from data integration or data storage. Preprocessing techniques, such as data
deduplication, can be used to remove redundant data from the dataset.
6. Data format:
This refers to the structure and format of the data in the dataset. Data may be in different formats, such as text, numerical, or categorical. Preprocessing techniques, such as data transformation and normalization, can be used to convert data into a consistent format for analysis (a short sketch of some of these quality checks follows this list).
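The following is a minimal, hedged sketch of checking a few of these quality factors with pandas (the tiny dataset and the IQR threshold are assumptions made only for illustration):

```python
# Illustrative data-quality checks: incompleteness, redundancy, and outliers (pandas assumed).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 27, None, 27, 120],
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Pune"],
})

print(df.isna().sum())         # incompleteness: count missing values per column
print(df.duplicated().sum())   # redundancy: count exact duplicate rows

# Simple outlier check using the interquartile-range (IQR) rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)

clean = df.drop_duplicates().copy()
clean["age"] = clean["age"].fillna(clean["age"].median())   # impute missing ages with the median
```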
Lecture Number: 07
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to
identify patterns and trends. Alternatively, it is referred to as quantitative analysis.
• Non-statistical Analysis: This analysis provides generalized information and includes sound,
still images, and moving images.
In statistics, there are two main categories:
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the
main characteristics of that data. Graphs or numbers summarize the data. Average, Mode,
SD(Standard Deviation), and Correlation are some of the commonly used descriptive statistical
methods.
• Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about populations
and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics. Some
of these are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
Now, let’s start discussing statistical methods. This is the analysis of raw data using mathematical
formulas, models, and techniques. Through the use of statistical methods, information is extracted
from research data, and different ways are available to judge the robustness of research outputs.
As a matter of fact, today's statistical methods used in the data mining field are typically derived from the vast statistical toolkit developed to answer problems arising in other fields. These techniques are taught in science curricula. It is necessary to check and test several hypotheses; these hypotheses help us assess the validity of the data mining endeavor when attempting to draw inferences from the data under study. When more complex and sophisticated statistical estimators and tests are used, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations, a variety of
statistical methods are available in Data Mining and some of these are:
• Logistic regression analysis
• Correlation analysis
• Regression analysis
• Discriminant analysis
• Linear discriminant analysis (LDA)
• Classification
• Clustering
• Outlier detection
• Classification and regression trees
• Correspondence analysis
• Nonparametric regression
• Statistical pattern recognition
• Categorical data analysis
• Time-series methods for trends and periodicity
• Artificial neural networks
Now, let’s try to understand some of the important statistical methods which are used in data
mining:
• Linear Regression: The linear regression method uses the best linear relationship between the independent and dependent variables to predict the target variable. In order to achieve the best fit, make sure that all the distances between the fitted line and the actual observations at each point are as small as possible (see the short sketch after this list).
• Classification: This is a method of data mining in which a collection of data is categorized so that a greater degree of accuracy can be predicted and analyzed. Classifying very large datasets is an effective way to analyze them.
• Logistic Regression: This method is also applied in machine learning applications and predictive analytics. In this approach, the dependent variable is either binary (binary logistic regression) or multinomial (multinomial logistic regression), taking one of a small set of possible categories. With a logistic regression equation, one can estimate probabilities regarding the relationship between the independent variable and the dependent variable.
• Discriminant Analysis: A Discriminant Analysis is a statistical method of analyzing
data based on the measurements of categories or clusters and categorizing new
observations into one or more populations that were identified a priori. The
discriminant analysis models each response class independently then uses Bayes’s
theorem to flip these projections around to estimate the likelihood of each response
category given the value of X. These models can be either linear or quadratic.
• Linear Discriminant Analysis: According to Linear Discriminant Analysis, each
observation is assigned a discriminant score to classify it into a response variable class.
By combining the independent variables in a linear fashion, these scores can be obtained.
• Quadratic Discriminant Analysis: An alternative approach is provided by Quadratic
Discriminant Analysis. LDA and QDA both assume Gaussian distributions for the
observations of the Y classes. Unlike LDA, QDA considers each class to have its own
covariance matrix. As a result, the predictor variables have different variances across the
k levels in Y.
• Correlation Analysis: In statistical terms, correlation analysis captures the
relationship between variables in a pair. The value of such variables is usually stored
in a column or rows of a database table and represents a property of an object.
• Regression Analysis: Based on a set of numeric data, regression is a data mining
method that predicts a range of numerical values (also known as continuous values).
You could, for instance, use regression to predict the cost of goods and services based
on other variables. A regression model is used across numerous industries for
forecasting financial data, modeling environmental conditions, and analyzing trends.
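Here is the short sketch referred to in the list above: a hedged example of linear and logistic regression with scikit-learn (the synthetic data and parameters are made up for illustration and are not the only way to apply these methods).

```python
# Minimal sketch: linear regression and logistic regression (scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))

# Linear regression: fit the best linear relationship between X and a numeric target.
y_numeric = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)
lin = LinearRegression().fit(X, y_numeric)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)

# Logistic regression: the dependent variable is binary (0/1).
y_binary = (X[:, 0] > 5).astype(int)
log = LogisticRegression().fit(X, y_binary)
print("P(class = 1 | x = 7):", log.predict_proba([[7.0]])[0, 1])
```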
Idea generation
Data visualization is commonly used to spur idea generation across teams. They are frequently
leveraged during brainstorming or Design Thinking sessions at the start of a project by supporting the
collection of different perspectives and highlighting the common concerns of the collective. While
these visualizations are usually unpolished and unrefined, they help set the foundation within the
project to ensure that the team is aligned on the problem that they’re looking to address for key
stakeholders.
Idea illustration
Data visualization for idea illustration assists in conveying an idea, such as a tactic or process. It is
commonly used in learning settings, such as tutorials, certification courses, and centers of excellence, but
it can also be used to represent organization structures or processes, facilitating communication
between the right individuals for specific tasks. Project managers frequently use Gantt charts and
waterfall charts to illustrate workflows. Data modeling also uses abstraction to represent and better
understand data flow within an enterprise’s information system, making it easier for developers,
business analysts, data architects, and others to understand the relationships in a database or data
warehouse.
Visual discovery
Visual discovery and everyday data viz are more closely aligned with data teams. While visual discovery helps data analysts, data scientists, and other data professionals identify patterns and trends within a dataset, everyday data viz supports the subsequent storytelling after a new insight has been found.
Data visualization
Data visualization is a critical step in the data science process, helping teams and individuals convey data more effectively to colleagues and decision makers. Teams that manage reporting systems typically use defined template views to monitor performance. Common types of visualizations include the following:
• Tables: These consist of rows and columns used to compare variables. Tables can show a great deal of information in a structured way, but they can also overwhelm users that are simply looking for high-level trends.
• Pie charts and stacked bar charts: These graphs are divided into sections that represent parts of a whole. They provide a simple way to organize data and compare the size of each component to one another.
• Line charts and area charts: These visuals show change in one or more quantities by plotting a
series of data points over time and are frequently used within predictive analytics. Line graphs utilize
lines to demonstrate these changes while area charts connect data points with line segments, stacking
variables on top of one another and using color to distinguish between variables.
• Histograms: This graph plots a distribution of numbers using a bar chart (with no spaces between the
bars), representing the quantity of data that falls within a particular range. This visual makes it easy
for an end user to identify outliers within a given dataset.
• Scatter plots: These visuals are beneficial in revealing the relationship between two variables, and they are commonly used within regression data analysis. However, they can sometimes be confused with bubble charts, which are used to visualize three variables via the x-axis, the y-axis, and the size of the bubble (a small plotting sketch follows this list).
• Heat maps: These graphical representations are helpful in visualizing behavioral data by location. The location can be a place on a map, or even a webpage.
• Tree maps: These display hierarchical data as a set of nested shapes, typically rectangles. Tree maps are great for comparing the proportions between categories via their area size.
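The small plotting sketch referred to above: a hedged example of two of these chart types using matplotlib (the data is randomly generated purely for illustration).

```python
# Illustrative histogram and scatter plot (matplotlib assumed).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=500)
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(0, 2, size=100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=20)    # histogram: distribution of a numeric variable
ax1.set_title("Histogram")
ax2.scatter(x, y)            # scatter plot: relationship between two variables
ax2.set_title("Scatter plot")
plt.tight_layout()
plt.show()
```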
Open source visualization tools
Access to data visualization tools has never been easier. Open source libraries, such as D3.js, provide
a way for analysts to present data in an interactive way, allowing them to engage a broader audience
with new data. Some of the most popular open source visualization libraries include:
• D3.js: It is a front-end JavaScript library for producing dynamic, interactive data visualizations in web browsers. D3.js uses HTML, CSS, and SVG to create visual representations of data that can be viewed in any browser. It also provides features for interactions and animations.
• ECharts: A powerful charting and visualization library that offers an easy way to add intuitive, interactive, and highly customizable charts to products, research papers, presentations, etc. ECharts is based on JavaScript and ZRender, a lightweight canvas library.
• Vega: Vega defines itself as a "visualization grammar," providing support for customizing visualizations across large datasets which are accessible from the web.
• deck.gl: It is part of Uber's open source visualization framework suite. deck.gl is a framework used for exploratory data analysis on big data. It helps build high-performance, GPU-powered visualizations on the web.
Data visualization best practices
With so many data visualization tools readily available, there has also been a rise in ineffective
information visualization. Visual communication should be simple and deliberate to ensure that your
data visualization helps your target audience arrive at your intended insight or conclusion. The
following best practices can help ensure your data visualization is useful and clear:
Know your audience(s): Think about who your visualization is designed for and then make sure
your data visualization fits their needs. What is that person trying to accomplish? What kind of
questions do they care about? Does your visualization address their concerns?
Choose an effective visual: Specific visuals are designed for specific types of datasets. For instance,
scatter plots display the relationship between two variables well, while line graphs display time series
data well. Ensure that the visual actually assists the audience in understanding your main takeaway.
Misalignment of charts and data can result in the opposite, confusing your audience further versus
providing clarity.
Lecture Number: 08
• Metric:
A given distance (e.g., a dissimilarity) is a metric if and only if it satisfies the following four conditions:
1- d(p, q) >= 0 for all p and q (non-negativity).
2- d(p, q) = d(q, p) for all p and q (symmetry).
3- d(p, r) <= d(p, q) + d(q, r) for all p, q, and r (triangle inequality).
4- d(p, q) = 0 only if p = q.
• Distance functions:
The technique used to measure distances depends on the particular situation you are working on. For instance, in some areas the Euclidean distance can be optimal and useful for computing distances.
(Figure: Euclidean distance contours.)
The most common distance function used for numeric attributes or features is the Euclidean distance, which is defined by the following formula:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
As you may know, this distance metric has well-known properties: it is symmetric, differentiable, and convex, and its contours are spherical.
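A tiny sketch of this formula in NumPy (the two points are arbitrary examples chosen for illustration):

```python
# Euclidean distance between two points (NumPy assumed).
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
d = np.sqrt(np.sum((p - q) ** 2))   # sqrt((1-4)^2 + (2-6)^2) = 5.0
print(d)
```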
The Euclidean distance satisfies all the conditions for being a metric.
Furthermore, the distance calculated using this formula represents the smallest distance between each pair of points. In other words, it is the shortest path to go from point A to point B in a 2-D Cartesian coordinate system, as the following figure illustrates:
The Euclidean distance is the shortest path (excluding the case of a wormhole in a quantum world).
Thus, it is useful to use this formula whenever you want to compute the distance between two points in the absence of obstacles on the pathway. When obstacles lie on the pathway, that is one of the situations where you don't want to compute the Euclidean distance; instead, you want to use other metrics like the Manhattan distance, which will be explained later in this article.
For simplicity and demonstration purposes, let's choose only two features, petal length and petal width, and exclude the Iris-virginica data. In this way, we can plot the data points in a 2-D space where the x-axis and the y-axis represent the petal length and the petal width, respectively.
(Figure: the training dataset.)
Each data point comes along with its own label: Iris-setosa or Iris-versicolor (0 and 1 in the dataset). Thus, the dataset can be used for KNN classification, because KNN is a supervised ML algorithm by nature. Let's assume that our ML model (KNN with k = 4) has been trained on this dataset, where we have selected two input features and only twenty data points, as the previous graph shows.
Up to this point everything looks great, and our KNN classifier is ready to classify a new data point. Therefore, we need a way to let the model decide where the new data point belongs.
As you may be thinking, the Euclidean distance has been chosen to let each training data point vote on where the new data sample fits: Iris-setosa or Iris-versicolor. Thus, the Euclidean distance is calculated from the new data point to each point of our training data, as the following figure shows:
Therefore, the new data sample has been classified as Iris-setosa. Using this analogy, you can imagine higher dimensions and other classifiers. Hopefully, you get the idea!
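A hedged sketch of the setup described above with scikit-learn (k = 4, Euclidean distance, the two petal features, Iris-virginica excluded); the query point used here is an assumption chosen only for illustration:

```python
# KNN (k = 4, Euclidean distance) on two Iris features, excluding Iris-virginica (scikit-learn assumed).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
mask = iris.target != 2               # keep only Iris-setosa (0) and Iris-versicolor (1)
X = iris.data[mask][:, [2, 3]]        # petal length, petal width
y = iris.target[mask]

knn = KNeighborsClassifier(n_neighbors=4, metric="euclidean").fit(X, y)

new_point = [[2.5, 0.75]]             # hypothetical new data sample
print(iris.target_names[knn.predict(new_point)[0]])
# Trying metric="manhattan" instead can change the votes, as discussed later in these notes.
```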
As stated earlier, each domain requires a specific way of computing distances. As we progress more
throughout this article, you will find out what is meant by stating this.
Computing distances using this approach avoids the need to use the square root function. As the name reflects, the squared Euclidean distance (SED) is equal to the Euclidean distance squared. Therefore, SED can reduce computational work while calculating distances between observations; for instance, it can be used in tasks where only the relative order of distances matters.
(Figure: Manhattan distance contours.)
The Manhattan distance is the sum of the absolute differences between the coordinates of two points: d(p, q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|. This metric is very useful for measuring the distance between two streets in a given city, where the distance can be measured in terms of the number of blocks that separate two different places. For instance, according to the following image, the distance between point A and point B is roughly equal to 4 blocks.
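A tiny sketch of the Manhattan (city-block) distance in NumPy, matching the block-counting intuition above (the two points are arbitrary examples):

```python
# Manhattan (city-block) distance between two points (NumPy assumed).
import numpy as np

a = np.array([0, 0])
b = np.array([3, 1])
d = np.sum(np.abs(a - b))   # |0-3| + |0-1| = 4 blocks
print(d)
```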
Recall the previous KNN example. Computing the Manhattan distance from the new data point to the training data will produce the following values:
(Figure: KNN classification using the Manhattan distance: a tie!)
As you can see, two data points have voted for Iris-Setosa, and another two
points have voted for Iris-versicolor, which means a tie.
You may have encountered this problem before. An intuitive solution would be to change the value of k, decreasing it by one if k is larger than one, or increasing it by one otherwise. However, you will get different behavior from the KNN classifier with each of these solutions. For instance, in our example k = 4; changing it to k = 3 would result in the following values:
Lecture Number: 09, 10
Data preprocessing is the process of transforming raw data into an understandable format. It is also an important step in data mining, as we cannot work with raw data. The quality of the data should be checked before applying data mining algorithms; one such quality check is consistency, i.e., checking whether the same data is kept consistent in all the places where it is stored.
There are 4 major tasks in data preprocessing – Data cleaning, Data integration, Data reduction, and
Data transformation.
Lecture Number: 11
Data Reduction
This process helps in the reduction of the volume of the data, which makes the analysis easier yet
produces the same or almost the same result. This reduction also helps to reduce storage space. Some
of the data reduction techniques are dimensionality reduction, numerosity reduction, and data
compression.
Dimensionality reduction: This process is necessary for real-world applications, as the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set can be reduced, combining and merging the attributes of the data without losing its original characteristics. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the "curse of dimensionality" occurs.
Numerosity Reduction: In this method, the representation of the data is made smaller by reducing the
volume. There will not be any loss of data in this reduction.
Data compression: Representing data in a compressed form is called data compression. This compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression reduces information, removing only unnecessary information.
Data Transformation
The change made in the format or the structure of the data is called data transformation. This step can
be simple or complex based on the requirements. There are some methods for data transformation.
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in
knowing the important features of the dataset. By smoothing, we can find even a simple change that
helps in prediction.
Aggregation: In this method, the data is stored and presented in the form of a summary. The data set, which may come from multiple sources, is integrated into a data analysis description. This is an important step since the accuracy of the result depends on the quantity and quality of the data. When the quality and the quantity of the data are good, the results are more relevant.
Discretization: The continuous data here is split into intervals. Discretization reduces the data size.
For example, rather than specifying the class time, we can set an interval like (3 pm-5 pm, or 6 pm-8
pm).
Normalization: It is the method of scaling the data so that it can be represented in a smaller range.
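As a minimal, hedged sketch of two of these transformations with pandas (the values and interval edges below are illustrative assumptions):

```python
# Illustrative min-max normalization and discretization (pandas assumed).
import pandas as pd

marks = pd.Series([35, 48, 62, 75, 90])

# Normalization: rescale the values into the smaller range [0, 1].
normalized = (marks - marks.min()) / (marks.max() - marks.min())

# Discretization: split the continuous values into labeled intervals.
grades = pd.cut(marks, bins=[0, 40, 60, 80, 100], labels=["D", "C", "B", "A"])

print(normalized.tolist())
print(grades.tolist())
```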
Lecture Number: 12
Data Cleaning
Data cleaning is the process of removing incorrect data, incomplete data, and inaccurate data from the
datasets, and it also replaces the missing values. Here are some techniques for data cleaning:
Standard values like “Not Available” or “NA” can be used to replace the missing values.
Missing values can also be filled manually, but this is not recommended when the dataset is big.
The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
While using regression or decision tree algorithms, the missing value can be replaced by the most
probable value.
Noisy data generally means data containing random errors or unnecessary data points. Handling noisy data is one of the most important steps, as it leads to the optimization of the model we are using. Here are some of the methods to handle noisy data.
Binning: This method is used to smooth or handle noisy data. First the data is sorted, and then the sorted values are separated and stored in the form of bins. There are three methods for smoothing the data in a bin (a small sketch of smoothing by bin means follows this list of methods). Smoothing by bin mean: the values in the bin are replaced by the mean value of the bin. Smoothing by bin median: the values in the bin are replaced by the median value of the bin. Smoothing by bin boundary: the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
Regression: This is used to smooth the data and helps handle data when unnecessary data is present. For analysis purposes, regression helps decide which variables are suitable for our analysis.
Clustering: This is used for finding the outliers and also in grouping the data. Clustering is generally
used in unsupervised learning.
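Returning to the binning method above, here is the small sketch referred to there: smoothing by bin means in plain Python (the data values and the choice of three equal-frequency bins are assumptions for illustration).

```python
# Smoothing noisy data by equal-frequency binning and bin means (illustrative data).
data = [4, 8, 9, 15, 21, 21, 24, 25, 26]   # already sorted
n_bins = 3
bin_size = len(data) // n_bins

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    bin_mean = sum(bin_values) / len(bin_values)
    smoothed.extend([round(bin_mean, 1)] * len(bin_values))

print(smoothed)   # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```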
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. The data integration process is one of the main components of data management. There are some problems to be considered during data integration.
Schema integration: Integrates metadata(a set of data that describes other data) from different sources.
Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that the student id in one database and the student name in another database belong to the same entity.
Detecting and resolving data value concepts: The data taken from different databases while merging
may differ. The attribute values from one database may differ from another database. For example, the
date format may differ, like “MM/DD/YYYY” or “DD/MM/YYYY”.
UNIT 2
DATA WAREHOUSING AND ON-LINE ANALYTICAL PROCESSING
Number of lectures planned in the teaching plan: EIGHT
Lecture Number: 13
Data Warehouse
A data warehouse is a relational database management system (RDBMS) construct designed to support query, analysis, and decision making rather than transaction processing. It can be loosely described as any centralized data repository which can be queried for business benefit. It is a database that stores information oriented to satisfy decision-making requests. It is a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and analyst) to make better and faster decisions. So, data warehousing supports architectures and tools for business executives to systematically organize, understand, and use their information to make strategic decisions.
Data Warehouse environment contains an extraction, transportation, and loading (ETL) solution, an
online analytical processing (OLAP) engine, customer analysis tools, and other applications that
handle the process of gathering information and delivering it to business users.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group
of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires only two procedures in data
accessing: Initial loading of data and access to data.
1) Business users: Business users require a data warehouse to view summarized data from the past. Since these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past. This input is made to be used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse. So, the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing the data from different sources to a common place, the user can effectively ensure uniformity and consistency in the data.
5) High response time: The data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and quick response time.
Lecture Number: 14
Data modeling in data warehouses is different from data modeling in operational database systems.
The primary function of data warehouses is to support DSS processes. Thus, the objective of data
warehouse modeling is to make the data warehouse efficiently support complex queries on long term
information.
In contrast, data modeling in operational database systems targets efficiently supporting simple
transactions in the database such as retrieving, inserting, deleting, and changing data. Moreover, data
warehouses are designed for the customer with general information knowledge about the enterprise,
whereas operational database systems are more oriented toward use by software specialists for creating
distinct applications.
The data within the specific warehouse itself has a particular architecture with the emphasis on various
levels of summarization, as shown in figure:
o Current detail data: reflects the most recent happenings, which are commonly the most interesting. It is voluminous because it is stored at the lowest level of granularity. It is almost always stored on disk storage, which is fast to access but expensive and difficult to manage.
Older detail data is stored in some form of mass storage, and it is infrequently accessed and kept at a
level detail consistent with current detailed data.
Lightly summarized data is data extracted from the low level of detail found at the current detailed level and is usually stored on disk storage. When building the data warehouse, we have to remember over what unit of time the summarization is done, and also what components or attributes the summarized data will contain.
Highly summarized data is compact and directly available and can even be found outside the
warehouse.
Metadata is the final element of the data warehouse. It is of a different dimension in that it is not data drawn from the operational environment, but it is used as:
o A directory to help the DSS investigator locate the items of the data warehouse.
o A guide to the mapping of data as it is transformed from the operational environment to the data warehouse environment.
o A guide to the method used for summarization between the current, accurate data and the lightly
summarized information and the highly summarized data, etc.
We can see that the only things shown via the conceptual data model are the entities that define the data and the relationships between those entities. No other detail is shown through the conceptual data model.
The phases for designing the logical data model are as follows:
The steps for physical data model design are as follows:
Enterprise Warehouse
An Enterprise warehouse collects all of the records about subjects spanning the entire organization. It
supports corporate-wide data integration, usually from one or more operational systems or external
data providers, and it's cross-functional in scope. It generally contains detailed information as well as
summarized information and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, UNIX superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.
Data Mart
A data mart includes a subset of corporate-wide data that is of value to a specific collection of users.
The scope is confined to particular selected subjects. For example, a marketing data mart may restrict
its subjects to the customer, items, and sales. The data contained in the data marts tend to be
summarized.
Independent data mart: An independent data mart is sourced from data captured from one or more operational systems or external data providers, or from data generated locally within a particular department or geographic area.
Dependent data mart: Dependent data marts are sourced directly from enterprise data warehouses.
Virtual Warehouses
A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Lecture Number: 15
What is OLAP?
OLAP stands for Online Analytical Processing, which is a technology that enables multi-
dimensional analysis of business data. It provides interactive access to large amounts of data
and supports complex calculations and data aggregation. OLAP is used to support business
intelligence and decision-making processes.
Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for the analysis purpose. There are mainly 5 operations, listed below (a short pandas sketch of these operations follows the list)-
Roll-up: this operation aggregates similar data attributes along a dimension by climbing up a concept hierarchy. For example, if the data cube displays the daily income of a customer, we can use a roll-up operation to find his monthly income.
Drill-down: this operation is the reverse of the roll-up operation. It allows us to take particular information and then subdivide it further for finer-granularity analysis; it zooms into more detail. For example, if India is an attribute of a country column and we wish to see villages in India, then the drill-down operation splits India into states, districts, towns, cities, and villages and then displays the required information.
Slicing: this operation filters out the unnecessary portions. Suppose that in a particular dimension the user doesn't need everything for analysis, but rather only a particular attribute. For example, with country = "Jamaica", the cube will display only information about Jamaica and will not display the other countries present in the country list.
Dicing: this operation does a multidimensional cut: it not only cuts one dimension but can also go to another dimension and cut a certain range of it. As a result, it looks like a subcube taken out of the whole cube (as depicted in the figure). For example, the user wants to see the annual salary of Jharkhand state employees.
Pivot: this operation is very important from a viewing point of view. It basically transforms
the data cube in terms of view. It doesn’t change the data present in the data cube. For
example, if the user is comparing year versus branch, using the pivot operation, the user
can change the viewpoint and now compare branch versus item type.
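The short pandas sketch referred to before the list: a hedged emulation of these cube operations on a flat table (the sales data, dimensions, and filter values are made up for illustration; a real OLAP engine operates on a multidimensional cube rather than a flat DataFrame).

```python
# Emulating roll-up, slice, dice, and pivot on a flat table (pandas assumed; data is illustrative).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2023],
    "country": ["India", "Jamaica", "India", "Jamaica", "India"],
    "branch":  ["B1", "B2", "B1", "B2", "B2"],
    "amount":  [100, 150, 120, 90, 200],
})

rollup = sales.groupby(["year", "country"])["amount"].sum()            # roll-up: aggregate along dimensions
slice_ = sales[sales["country"] == "Jamaica"]                          # slice: fix one dimension value
dice = sales[(sales["country"] == "India") & (sales["year"] == 2023)]  # dice: select a sub-cube
pivot = sales.pivot_table(values="amount", index="year",
                          columns="branch", aggfunc="sum")             # pivot: rotate the view
print(rollup, slice_, dice, pivot, sep="\n\n")
```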
Lecture Number: 16
Data Warehouse
A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical data. The data within a data warehouse is usually derived
from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts.
Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”
Simple. All data warehouses share a basic design in which metadata, summary data, and raw data are
stored within the central repository of the warehouse. The repository is fed by data sources on one end and
accessed by end users for analysis, reporting, and mining on the other end.
Simple with a staging area. Operational data must be cleaned and processed before being put in the
warehouse. Although this can be done programmatically, many data warehouses add a staging area for data
before it enters the warehouse, to simplify data preparation.
Hub and spoke. Adding data marts between the central repository and end users allows an organization to
customize its data warehouse to serve various lines of business. When the data is ready for use, it is moved
to the appropriate data mart.
Sandboxes. Sandboxes are private, secure, safe areas that allow companies to quickly and informally
explore new datasets or ways of analyzing data without having to conform to or comply with the formal
rules and protocol of the data warehouse.
Lecture Number: 17
A data warehouse is a single data repository where records from multiple data sources are integrated for online business analytical processing (OLAP). This implies that a data warehouse needs to meet the requirements of all the business stages within the entire organization. There are two approaches to building one:
1. "top-down" approach
2. "bottom-up" approach
Developing new data mart from the data warehouse is very easy.
2. The "bottom-up" approach: the locations of the data warehouse and the data marts are reversed; individual data marts are developed first and then integrated into a data warehouse. It is just developing new data marts and then integrating them with other data marts.
Top-down approach: Breaks the vast problem into smaller subproblems. Inherently architected, not a
union of several data marts. It may see quick results if implemented with iterations.
Bottom-up approach: Solves the essential low-level problems and integrates them into a higher one.
Inherently incremental; can schedule essential data marts first. Less risk of failure, favorable return
on investment, and proof of techniques.
Lecture Number: 18
1. Requirements analysis and capacity planning: The first process in data warehousing involves
defining enterprise needs, defining the architecture, carrying out capacity planning, and selecting the
hardware and software tools. This step will involve consulting senior management as well as the
different stakeholders.
2. Hardware integration: Once the hardware and software have been selected, they need to be put
together by integrating the servers, the storage systems, and the user software tools.
3. Modeling: Modeling is a significant stage that involves designing the warehouse schema and views.
This may involve using a modeling tool if the data warehouse is sophisticated.
4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is needed.
This involves designing the physical data warehouse organization, data placement, data partitioning,
deciding on access methods, and indexing.
5. Sources: The data for the data warehouse is likely to come from several data sources. This
step involves identifying and connecting the sources using a gateway, ODBC drivers, or another
wrapper.
6. ETL: The data from the source systems will need to go through an ETL phase. The process of
designing and implementing the ETL phase may involve identifying suitable ETL tool vendors and
purchasing and implementing the tools. This may include customizing the tool to suit the needs of the
enterprise.
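A very small ETL sketch in Python, assuming a hypothetical daily_sales.csv source file and an SQLite file acting as the warehouse (the file, table, and column names are illustrative only):

import sqlite3
import pandas as pd

# Extract: read raw operational data (hypothetical CSV; columns assumed: date, branch, item, amount)
raw = pd.read_csv("daily_sales.csv")

# Transform: clean and standardise the records in a staging step
staged = raw.dropna(subset=["amount"])          # drop incomplete records
staged["branch"] = staged["branch"].str.upper()
staged["amount"] = staged["amount"].astype(float)

# Load: append the cleaned records into a warehouse fact table
conn = sqlite3.connect("warehouse.db")
staged.to_sql("sales_fact", conn, if_exists="append", index=False)
conn.close()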
7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the tools will
be needed, perhaps using a staging area. Once everything is working adequately, the ETL tools may
be used in populating the warehouse given the schema and view definitions.
8. User applications: For the data warehouse to be helpful, there must be end-user applications. This
step involves designing and implementing the applications required by the end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated and the
end-client applications tested, the warehouse system and the applications may be rolled out for the user
community to use.
Implementation Guidelines
2. Need a champion: A data warehouse project must have a champion who is willing to carry out
considerable research into the expected costs and benefits of the project. Data warehousing projects
require inputs from many units in an enterprise and therefore need to be driven by someone who is
capable of interacting with people across the enterprise and can actively persuade colleagues.
3. Senior management support: A data warehouse project must be fully supported by senior
management. Given the resource-intensive nature of such projects and the time they can take to
implement, a warehouse project calls for a sustained commitment from senior management.
4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the
organization should be loaded into the data warehouse.
5. Corporate strategy: A data warehouse project must fit with corporate strategy and business
goals. The purpose of the project must be defined before the project begins.
6. Business plan: The financial costs (hardware, software, and peopleware), expected benefits, and
a project plan for a data warehouse project must be clearly outlined and understood by all
stakeholders. Without such understanding, rumors about expenditure and benefits can become the only
sources of information, undermining the project.
7. Training: Data warehouse projects must not overlook training requirements. For
a data warehouse project to be successful, the users must be trained to use the warehouse and
to understand its capabilities.
Lecture Number: 19
Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and summarization – based
characterization was first proposed in 1989 (KDD ‘89 workshop) a few years before the introduction
of the data cube approach.
The data cube approach can be considered as a data warehouse-based, precomputation-oriented,
materialized approach.
It performs offline aggregation before an OLAP or data mining query is submitted for processing.
On the other hand, the attribute-oriented induction approach, at least in its initial proposal, is a
relational database query-oriented, generalization-based, online data analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on online aggregation
versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line precomputation of
multidimensional space can speed up attribute-oriented induction as well.
How is it done?
Collect the task-relevant data (initial relation) using a relational database query.
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
This reduces the size of the generalized data set.
Finally, present the results interactively to the users.
Data focusing:
Analyzing task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal:
Remove attribute A if there is a large set of distinct values for A but (1) there is no generalization
operator on A, or (2) A’s higher-level concepts are expressed in terms of other attributes.
Attribute-generalization:
If there is a large set of distinct values for A, and there exists a set of generalization operators on A,
then select an operator and generalize A.
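A minimal sketch of these AOI steps in Python (pandas), assuming a toy initial relation and made-up concept hierarchies for the city and gpa attributes:

import pandas as pd

# Toy initial relation (task-relevant data) with hypothetical attributes
data = pd.DataFrame({
    "name": ["Amit", "Rita", "John", "Mary"],     # many distinct values, no hierarchy -> remove
    "city": ["Pune", "Mumbai", "Nagpur", "Pune"],
    "gpa":  [3.9, 3.2, 2.8, 3.6],
})

# Attribute removal: drop 'name' (large set of distinct values, no generalization operator)
data = data.drop(columns=["name"])

# Attribute generalization: climb concept hierarchies (city -> state, gpa -> grade band)
city_to_state = {"Pune": "Maharashtra", "Mumbai": "Maharashtra", "Nagpur": "Maharashtra"}
data["state"] = data["city"].map(city_to_state)
data["grade"] = pd.cut(data["gpa"], bins=[0, 3.0, 3.5, 4.0], labels=["average", "good", "excellent"])

# Merge identical generalized tuples and accumulate their counts
generalized = (data.groupby(["state", "grade"], observed=True)
                   .size()
                   .reset_index(name="count"))
print(generalized)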
Lecture Number: 20
For example, to compute total sales by branch, day, and item, it can be more efficient to sort tuples or
cells by branch, then by day, and then group them by item name. The efficient performance of such
operations on huge data sets has been widely studied in the database research community.
Such techniques can be extended to data cube computation. This approach can also be extended to
implement shared-sorts (i.e., sharing sorting costs across different cuboids when sort-based techniques
are used), or to implement shared-partitions (i.e., sharing the partitioning cost across different cuboids
when hash-based algorithms are utilized).
Simultaneous aggregation and caching of intermediate results − In cube computation, it is
effective to calculate higher-level aggregates from previously computed lower-level aggregates, instead
of from the base fact table. Furthermore, simultaneous aggregation from cached intermediate
computation results can lead to a reduction of costly disk input/output (I/O) operations.
To compute sales by branch, for instance, it can use the intermediate results derived from the
computation of a lower-level cuboid such as sales by branch and day. This method can be extended
to implement amortized scans (i.e., computing as many cuboids as possible simultaneously to
amortize disk reads).
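As a small illustration of aggregating a higher-level cuboid from a previously computed child cuboid (the (branch, day) totals below are made up):

import pandas as pd

# Lower-level cuboid: (branch, day) with precomputed sales totals (hypothetical values)
branch_day = pd.DataFrame({
    "branch": ["B1", "B1", "B2", "B2"],
    "day":    ["Mon", "Tue", "Mon", "Tue"],
    "sales":  [100, 120, 90, 300],
})

# Higher-level cuboid (branch) computed from the child cuboid instead of the base fact table
branch = branch_day.groupby("branch", as_index=False)["sales"].sum()
print(branch)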
Aggregation from the smallest child when there exist multiple child cuboids − When there exist
several child cuboids, it is generally more effective to calculate the desired parent (i.e., more
generalized) cuboid from the smallest, formerly computed child cuboid.
The Apriori pruning method can be explored to compute iceberg cubes efficiently − The Apriori
property, in the context of data cubes, is defined as follows: if a given cell does not satisfy minimum
support, then no descendant of the cell (i.e., no more specialized cell) will satisfy minimum support either.
This property can be used to substantially reduce the computation of iceberg cubes.
The specification of iceberg cubes includes an iceberg condition, which is a constraint on the cells to be
materialized. A typical iceberg condition is that the cells should satisfy a minimum support threshold
such as a minimum count or sum. In this case, the Apriori property can be used to prune away the
exploration of the cell’s descendants.
UNIT 3
PATTERNS, ASSOCIATIONS AND CORRELATIONS
Unit Content:
Number Lectures
planned in Teaching EIGHT
Plan:
Lecture Number: 21
Frequent pattern mining in data mining is the process of identifying patterns or associations
within a dataset that occur frequently. This is typically done by analyzing large datasets to
find items or sets of items that appear together frequently.
There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent
pattern mining. It uses a “bottom-up” approach to identify frequent itemsets and then
generates association rules from those itemsets.
2. ECLAT algorithm: This algorithm uses a “depth-first search” approach to identify
frequent itemsets. It is particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a “compression” technique to find frequent
patterns efficiently. It is particularly efficient for datasets with a large number of
transactions.
Frequent pattern mining has many applications, such as Market Basket Analysis,
Recommender Systems, Fraud Detection, and many more.
Advantages:
1. It can find useful information which is not visible in simple data browsing
2. It can find interesting association and correlation among data items
Disadvantages:
Data mining is the process of converting raw data into suitable patterns based on trends.
Data mining has different types of patterns, and frequent pattern mining is one of them.
This concept was introduced for mining transaction databases. Frequent patterns are
itemsets, subsequences, or substructures that appear in a data set with a frequency no less
than a user-specified minimum support threshold.
Lecture Number: 22
Before moving on to mining frequent patterns, we should focus on two terms,
“support” and “confidence”, because they provide a measure of whether an
association rule is qualified or not for a particular data set.
Support: how often a given rule appears in the database being mined.
Confidence: the number of times a given rule turns out to be true in practice.
1. Generate Candidate set 1, do the first scan and generate the One item set
In this stage, we take the sample data set, count each individual item, and make
frequent item set 1 (K = 1).
Candidate set 1
The “Candidate set 1” figure shows the support count of each individual item. Since the
minimum support is 2, item E will be removed from Candidate set 1 as an infrequent
(disqualified) item.
The frequent itemset based on the minimum support value is shown in the
figure “Frequent item set from the first scan” as the “One item set”.
2. Generate Candidate set 2, do the second scan and generate the Second item set
Through this step, you create frequent set 2 (K = 2) and take each of their
support counts.
Candidate set 2
The “Candidate set 2” figure is generated by joining Candidate set 1 with itself and taking the
frequency of the related occurrences. Since the minimum support is 2, itemset {B, D}
will be removed from Candidate set 2 as an infrequent item set.
“Frequent item set from the second scan” is the frequent itemset based on the
minimum support value, and it generates the “Second item set”.
3. Generate Candidate set 3, do the third scan and generate the Third item set
In this iteration we create frequent set 3 (K = 3), take the support counts, and then
compare them with the minimum support value for Candidate set 3.
Candidate set 3
As you can see, comparing Candidate set 3 with the minimum support
value generates the Third item set. The frequent set from the third scan is the same
as above.
4. Generate Candidate set 4, do the fourth scan and generate Fourth item set
In Candidate set 4, every candidate has a support count less than the minimum support of 2.
Therefore we have to stop the calculation here because no further iterations are possible.
Hence, for the above data set, the frequent patterns are [A, B, C] and [A, C, D].
Considering one of the frequent sets, {A, B, C}, the possible association
rules are as follows:
1. A => B, C
2. A, B => C
3. A, C => B
4. B => A, C
5. B, C => A
6. C => A, B
As you can see, at each step the Apriori algorithm generates candidate sets,
and this takes a lot of time when the list of candidates becomes large.
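The candidate-generation-and-pruning loop described above can be sketched in a few lines of Python. This is a minimal illustration, and the transaction list is hypothetical rather than the exact data set used in the figures:

from itertools import combinations

# Hypothetical transactions in the same A-E style as above (not the exact figure data)
transactions = [
    {"A", "B", "C"}, {"A", "C", "D"}, {"A", "B", "C"},
    {"B", "C"}, {"A", "C", "D"},
]
min_support = 2

def frequent_itemsets(transactions, min_support):
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}       # candidate 1-itemsets
    all_frequent = {}
    k = 1
    while current:
        # One scan: count the support of each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        # Join step: build (k+1)-candidates from frequent k-itemsets
        keys = list(frequent)
        current = {a | b for i, a in enumerate(keys) for b in keys[i + 1:] if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset of a candidate must itself be frequent
        current = {c for c in current
                   if all(frozenset(s) in frequent for s in combinations(c, k))}
        k += 1
    return all_frequent

for itemset, count in sorted(frequent_itemsets(transactions, min_support).items(),
                             key=lambda x: (len(x[0]), -x[1])):
    print(set(itemset), count)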
Lecture Number: 23
1. Frequent item sets are a fundamental concept in association rule mining, which is a
technique used in data mining to discover relationships between items in a dataset.
The goal of association rule mining is to identify relationships between items in a
dataset that occur frequently together.
2. A frequent item set is a set of items that occur together frequently in a dataset. The
frequency of an item set is measured by the support count, which is the number of
transactions or records in the dataset that contain the item set. For example, if a dataset
contains 100 transactions and the item set {milk, bread} appears in 20 of those
transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find
frequent item sets and generate association rules. These algorithms work by iteratively
generating candidate item sets and pruning those that do not meet the minimum support
threshold. Once the frequent item sets are found, association rules can be generated by
using the concept of confidence, which is the ratio of the number of transactions that
contain the item set and the number of transactions that contain the antecedent (left-
hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks such as
market basket analysis, cross-selling and recommendation systems. However, it should
be noted that association rule mining can generate a large number of rules, many of
which may be irrelevant or uninteresting. Therefore, it is important to use appropriate
measures such as lift and conviction to evaluate the interestingness of the generated
rules.
Association mining searches for frequent itemsets in the data set. In frequent mining,
interesting associations and correlations between item sets in transactional and relational
databases are usually found. In short, frequent mining shows which items appear together in a
transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from a
transactional dataset. If two items X and Y are purchased frequently, then it is good to
put them together in stores or to provide some discount offer on one item on purchase of
the other item. This can really increase sales. For example, it is likely to find that if a
customer buys milk and bread, he/she also buys butter. So the association rule
is {milk, bread} => {butter}. So the seller can suggest that the customer buy butter if
he/she buys milk and bread.
Important Definitions :
Support: It is one of the measures of interestingness. This tells about the usefulness
and certainty of rules. 5% support means that a total of 5% of the transactions in the database
follow the rule.
Support(A -> B) = Support_count(A ∪ B) / total number of transactions
Confidence: A confidence of 60% means that 60% of the customers who purchased
milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
Support_count(X): Number of transactions in which X appears. If X is A union B then
it is the number of transactions in which A and B both are present.
Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
Closed Itemset: An itemset is closed if none of its immediate supersets has the same
support count as the itemset.
K-Itemset: An itemset which contains K items is a K-itemset. An itemset is frequent if
its support count is greater than or equal to the minimum
support count.
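As a small, self-contained illustration of these measures (the transactions below are made up for the milk/bread/butter example; they are not from the notes):

# Support and confidence for the rule {milk, bread} => {butter}
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def support_count(itemset):
    # Number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

antecedent = {"milk", "bread"}
rule_items = {"milk", "bread", "butter"}

support = support_count(rule_items) / len(transactions)            # fraction of all transactions
confidence = support_count(rule_items) / support_count(antecedent)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")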
Example On finding Frequent Itemsets – Consider the given dataset with given
transactions.
Lecture Number: 24
In data mining, pattern evaluation is the process of assessing the quality of discovered
patterns. This process is important in order to determine whether the patterns are useful and
whether they can be trusted. There are a number of different measures that can be used to
evaluate patterns, and the choice of measure will depend on the application.
There are several ways to evaluate pattern mining algorithms:
1. Accuracy
The accuracy of a data mining model is a measure of how correctly the model predicts the
target values. The accuracy is measured on a test dataset, which is separate from the training
dataset that was used to train the model
2. Classification Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to classify
new data. This is typically done by taking a set of data that has been labeled with known class
labels and then using the discovered patterns to predict the class labels of the data. The
accuracy can then be computed by comparing the predicted labels to the actual labels.
Overall, classification accuracy is a good metric to use when evaluating classification
models. However, it is important to be aware of its limitations and to use other evaluation
metrics in situations where classification accuracy could be misleading.
3. Clustering Accuracy
This measures how accurately the patterns discovered by the algorithm can be used to cluster
new data. This is typically done by taking a set of data that has been labeled with known
cluster labels and then using the discovered patterns to predict the cluster labels of the data.
The accuracy can then be computed by comparing the predicted labels to the actual labels.
There are a few ways to evaluate the accuracy of a clustering algorithm:
External indices: these indices compare the clusters produced by the algorithm to some
known ground truth. For example, the Rand Index or the Jaccard coefficient can be used
if the ground truth is known.
Internal indices: these indices assess the goodness of clustering without reference to any
external information. The most popular internal index is the Dunn index.
Stability: this measures how robust the clustering is to small changes in the data. A
clustering algorithm is said to be stable if, when applied to different samples of the same
data, it produces the same results.
Efficiency: this measures how quickly the algorithm converges to the correct clustering.
4. Coverage
This measures how many of the possible patterns in the data are discovered by the algorithm.
It can be computed by dividing the number of patterns discovered by the algorithm by the
total number of possible patterns.
5. Visual Inspection
This is perhaps the most common method, where the data miner simply looks at the patterns
to see if they make sense. In visual inspection, the data is plotted in a graphical format and
the pattern is observed.
6. Running Time
This measures how long it takes for the algorithm to find the patterns in the data. This is
typically measured in seconds or minutes. There are a few different ways to measure the
performance of a machine learning algorithm, but one of the most common is to simply
measure the amount of time it takes to train the model and make predictions. This is known
as the running time pattern evaluation.
7. Support
The support of a pattern is the percentage of the total number of records that contain the
pattern. Support Pattern evaluation is a process of finding interesting and potentially useful
patterns in data. The purpose of support pattern evaluation is to identify interesting patterns
that may be useful for decision-making. Support pattern evaluation is typically used in data
mining and machine learning applications.
8. Confidence
The confidence of a pattern is the percentage of times that the pattern is found to be correct.
Confidence Pattern evaluation is a method of data mining that is used to assess the quality of
patterns found in data.
9. Lift
The lift of a pattern is the ratio of the number of times that the pattern is found to be correct
to the number of times that the pattern is expected to be correct. Lift Pattern evaluation is a
data mining technique that can be used to evaluate the performance of a predictive model.
The lift pattern is a graphical representation of the model’s performance and can be used to
identify potential problems with the model.
10. Prediction
The prediction of a pattern is the percentage of times that the pattern is found to be correct.
Prediction Pattern evaluation is a data mining technique used to assess the accuracy of
predictive models. It is used to determine how well a model can predict future outcomes
based on past data. Prediction Pattern evaluation can be used to compare different models,
or to evaluate the performance of a single model.
11. Precision
Precision Pattern Evaluation is a method for analyzing data that has been collected from a
variety of sources. This method can be used to identify patterns and trends in the data, and to
evaluate the accuracy of data. Precision Pattern Evaluation can be used to identify errors in
the data, and to determine the cause of the errors
12. Cross-Validation
This method involves partitioning the data into two sets, training the model on one set, and
then testing it on the other. This can be done multiple times, with different partitions, to get
a more reliable estimate of the model’s performance (a short sketch follows after this list).
Cross-validation is a model validation technique for assessing how the results of a data
mining analysis will generalize to an independent data set.
14. Bootstrapping
This method involves randomly sampling the data with replacement, training the model on
the sampled data, and then testing it on the original data. This can be used to get a distribution
of the model’s performance, which can be useful for understanding how robust the model is.
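As a minimal illustration of the cross-validation idea from item 12 above (a sketch using scikit-learn; the iris data set and the decision tree classifier are arbitrary choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeated 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())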
Lecture Number: 25
working principle (it is a simple point-of-sale application for any supermarket which has a
good range of products)
The product data will be entered into the database.
The taxes and commissions are entered.
The product will be purchased and it will be sent to the bill counter.
The billing operator will scan the product with the bar code machine; it will check
and match the product in the database and then show the information of the product.
The bill will be paid by the customer and he will receive the products.
Closed Pattern:
A closed pattern is a frequent pattern; it meets the minimum support criterion, and all super-patterns
of a closed pattern are less frequent than the closed pattern.
Max Pattern:
A max pattern also meets the minimum support criterion (like a closed pattern), but none of the
super-patterns of a max pattern are frequent. Both kinds of patterns generate fewer patterns and
therefore increase the efficiency of the task.
Applications include basket data analysis, cross-marketing, catalog design, sales campaign analysis,
web log analysis, and DNA sequence analysis.
Issues of frequent pattern mining
Lecture Number: 26
Association rule learning is a machine learning technique used for discovering interesting
relationships between variables in large databases. It is designed to detect strong rules in the
database based on some interesting metrics. For any given multi-item transaction, association
rules aim to obtain rules that determine how or why certain items are linked. Association
rules are created for finding information about general if-then patterns using specific criteria
with support and confidence to define what the key relationships are. They help to show the frequency
of an item in specific data, since confidence is defined by the number of times an if-then
statement is found to be true.
Some of the uses of association rules in different fields are given below:
Medical Diagnosis: Association rules in medical diagnosis can be used to help doctors
cure patients. As all of us know that diagnosis is not an easy thing, and there are many
errors that can lead to unreliable end results. Using the multi-relational association rule, we
can determine the probability of disease occurrence associated with various factors and
symptoms.
Market Basket Analysis: It is one of the most popular examples and uses of association
rule mining. Big retailers typically use this technique to determine the association between
items.
Lecture Number: 27
In the form of metarules, one can express limitations placed on the number of predicates permitted to occur in
the antecedent or consequent of a rule, as well as on the relationships between attributes, attribute values, and/or
aggregates (rule templates). It is necessary to have both a graphical user interface and a high-level declarative
data mining query language to be able to express such constraints.
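For instance (a textbook-style illustration, not taken verbatim from these notes), a metarule can be written as a rule template such as

P1(X, Y) ∧ P2(X, W) ⇒ buys(X, “software”)

where X is a variable representing a customer and P1, P2 are predicate variables that are instantiated to attributes such as age or income during mining, so that a matching mined rule might look like age(X, “30..39”) ∧ income(X, “41K..60K”) ⇒ buys(X, “software”).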
The first four kinds of limitations have each received a substantial amount of attention over the entirety of this
book and in the chapters that came before this one. In this section, we will discuss how the application of rule
limits may assist in reducing the overall scope of the mining process. This constraint-based mining approach
optimizes the data mining process by allowing users to describe the rules they want to uncover and then look for
those rules. In addition, the user-specified limits can be utilized by an intelligent mining query optimizer, which
in turn increases the mining operation's efficiency.
When using constraints, it is possible to do interactive exploratory mining and analysis. In this section, you will
learn about metarule-guided mining, a technique in which syntactic rule restrictions are described using rule
templates. It also details the use of data space pruning (removing portions of the data space for which further
exploration cannot contribute to finding patterns matching the requirements) and pattern space pruning
(removing portions of the pattern space that are not being mined).
We present antimonotonicity, monotonicity, and succinctness as classes of properties that aid in pruning pattern
spaces via constraint-based search space reduction. Convertible constraints are also discussed; they are a subset of
monotonic and anti-monotonic constraints that may be pushed farther into the iterative mining process without
losing their pruning power, given the right data ordering. We investigate how data space pruning might be included
in a data mining workflow by introducing two sets of properties: data succinctness and data antimonotonicity.
We will assume the user is looking for association rules for the sake of discussion. Adding a correlation measure
of interestingness to the support-confidence framework makes it simple to apply the proposed methods to mining
correlation rules. For a better understanding of constraint-based frequent pattern mining, you need to learn about
the six stages of data science processing.
Lecture Number: 28
of colocation patterns. These can help decide if a specific disease is geographically colocated with
specific objects like a well, a hospital, or a river.
In time-series data analysis, researchers have discretized time-series values into several intervals
so that small fluctuations and value differences can be ignored. The data can be summarized
into sequential patterns, which can be indexed to facilitate similarity search or comparative analysis.
In image analysis and pattern recognition, researchers have also treated frequently appearing visual
fragments as visual words, which can be used for efficient clustering, classification, and comparative
analysis.
Pattern mining has been used for the analysis of sequence or structural data including trees, graphs,
subsequences, and networks. In software engineering, researchers have identified consecutive or
gapped subsequences in code execution as sequential patterns that help identify software errors.
Copy-and-paste errors in huge software programs can be recognized by extended sequential pattern
analysis of source code. Plagiarized software programs can be recognized based on their substantially
identical program flow/loop structure.
Frequent and discriminative patterns can be used as a primitive indexing mechanism (called graph
indices) to support searching large, complex, structured data sets and networks. These support similarity
search in graph-structured data such as chemical compound databases or XML-structured databases.
Such patterns can also be used for data compression and description.
Lecture Number: 29
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class
label. The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that indicates whether a customer at a
company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf
node represents a class.
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion consists of a
splitting_attribute and, possibly, either a split-point or a splitting subset.
Output:
A Decision Tree
Method
create a node N;
if the tuples in D are all of the same class, C, then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then
    attribute_list = attribute_list - splitting_attribute;
for each outcome j of splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The
pruned trees are smaller and less complex.
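A short, hedged sketch of decision tree induction in Python using scikit-learn; here max_depth and ccp_alpha are used as simple stand-ins for pruning, and the iris data set is an arbitrary example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Induce a tree; max_depth and ccp_alpha act as simple pruning controls
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, ccp_alpha=0.01, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # accuracy on unseen tuples
print(export_text(tree))            # the learned test conditions, one root-to-leaf path per rule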
Lecture Number: 30
Bayes' theorem is named after Thomas Bayes, who first used conditional probability to
provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:
P(X | Y) = P(Y | X) P(X) / P(Y)
where:
P(X | Y) is a conditional probability: the probability of event X occurring given that Y is true.
P(Y | X) is a conditional probability: the probability of event Y occurring given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are known as
the marginal probabilities.
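A tiny numerical illustration of the formula in Python (the probabilities below are made up):

# Bayes' rule: P(X | Y) = P(Y | X) * P(X) / P(Y)
def posterior(p_y_given_x, p_x, p_y):
    return p_y_given_x * p_x / p_y

p_x = 0.01          # prior probability of hypothesis X
p_y_given_x = 0.9   # likelihood of observing evidence Y if X is true
p_y = 0.05          # overall probability of observing Y

print(posterior(p_y_given_x, p_x, p_y))   # updated degree of belief in X given Y (0.18)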
Bayesian interpretation:
In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem connects the
degree of belief in a hypothesis before and after accounting for evidence. For example, let us consider the
example of a coin. If we toss a coin, then we get either heads or tails, and the probability of occurrence of
either heads or tails is 50%. If the coin is flipped a number of times and the outcomes are observed, the
degree of belief may rise, fall, or remain the same depending on the outcomes.
where P(X ∩ Y) is the joint probability of both X and Y being true, because P(X ∩ Y) = P(X | Y) P(Y) = P(Y | X) P(X).
Bayesian network:
A Bayesian network falls under the category of Probabilistic Graphical Models (PGM), which are used
to compute uncertainties using the concept of probability. Also known as Belief
Networks, Bayesian networks model uncertainties using Directed Acyclic Graphs (DAG).
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical graph, a
DAG consists of a set of nodes and links, where the links signify the connection between the nodes.
The nodes here represent random variables, and the edges define the relationship between these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability Distribution
(CPD) of each random variable. A Conditional Probability Table (CPT) is used to represent the CPD of
each variable in the network.
Lecture Number: 31
IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the
following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and these tests are
logically ANDed.
• The consequent part consists of class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
Rule Induction Using Sequential Covering Algorithm
Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not
require generating a decision tree first. In this algorithm, each rule for a given class covers many of the
tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the
rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed and
the process continues for the remaining tuples. In contrast, the path to each leaf in a decision tree
corresponds to a rule.
Note − The Decision tree induction can be considered as learning a set of rules simultaneously.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When
learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci only and no tuple from
any other class.
Algorithm: Sequential Covering
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { };   // the set of learned rules starts empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;   // add the new rule to the rule set
    until termination condition;
end for
return Rule_set;
Lecture Number: 32
Let’s say we have two hypotheses for a task, h(x) and h’(x). How would we know
which one is better? From a high-level perspective, we might take the following
steps:
When we have a classification task, we will consider the accuracy of our model by its
ability to assign an instance to its correct class. Consider this on a binary level. We
have two classes, 1 and 0. We would therefore classify a correct prediction as being
when the model classifies a class 1 instance as class 1, or a class 0 instance as class 0.
Assuming our 1 class is the ‘Positive class’ and the 0 class is the ‘Negative
class’, we can build a table that outlines all the possibilities our model might produce.
We also have names for these classifications. Our True Positive and True Negative are
our correct classifications, as in both cases the actual class and the
predicted class are the same. The other two cases, in which the model predicts
incorrectly, can be explained as follows: a False Positive occurs when the model predicts
class 1 but the actual class is 0, and a False Negative occurs when the model predicts
class 0 but the actual class is 1.
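These four outcomes can be computed directly from predictions; a minimal sketch with made-up labels using scikit-learn:

from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # made-up actual labels for illustration
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]   # made-up model predictions

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")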
Lecture Number: 33
Classification data, in data science, consists of information divided into
classes, for example data of people from a city where the gender of a person is
included in the information of that person. In such data, the gender attribute can
be considered as the attribute that holds the classes.
A classification model can be defined as an algorithm that can tell us about the
class of the data points using the other information of the data. For example, in
the data of people from a city, we can have information such as name,
education, work, and marital status. Based on this information, a trained model
can tell us about the gender of the person.
Building such a model that can tell us about the class of any data point using the
other information the data point consists of can be called classification
modelling. Now whether we are practitioners or experts in classification
modelling, we know that we cannot build a model with no error.
There will always be a possibility of error, but as modelers it is our
responsibility to reduce the errors of the model, or in other words to increase the
accuracy of the model.
In this article, we are going to discuss some of the methods that can increase the
accuracy of the classification model. We can also think of this discussion as a
method to reduce the error of the classification model. Since in the life of the
modelling procedure acquiring data is one of the first steps, in this article, we
are going to start our discussion on the methods that can be applied on the data
side.
One thing that classification modelling always requires is to let the data tell it
about itself as much as possible. We may find problems with classification
modelling when the amount of data is very small. In such a scenario we
require techniques or sources that can provide us with more
data.
Consider the following sample data, in which one gender value is missing:
Age  Name     Gender  Working
18   Rahul    M       N
20   Archana  F       Y
29   Ashok    M       Y
18   Anamika          N
27   Ankit    M       Y
Let’s take a look at the below table, which represents the percentage of working
people according to gender:
Gender   Percentage
Female   100%
Male     66.66%
Here we can see that there is a missing value, and the records show that 100% of the
females in the data are working. Now if we fill the missing value with F, the
results will be as follows:
Gender   Percentage
Female   50%
Male     66.66%
Here we can see the effect of missing values in the final prediction.
There are various ways to deal with the missing values some of them are as
follows:
• Mean, median, and mode: This method is for dealing with missing values
in continuous data (age in our case). We can fill missing values
using the mean, median, or mode of the continuous variable.
• Separation of class: This method separates out the data points that have
missing values; we can use this method for categorical data.
• Missing value prediction: This method uses another model that can make
predictions on the missing values. After filling the missing values using a
model we can continue our work on the main modelling.
• KNN imputation: This method can be utilized to fill missing values; it
works by finding the data points whose attributes are most similar to the
record with the missing value and fills in the value based on the information
available in those similar data points.
• Filling closest value: This method fills in the nearest observed value in place of the
missing value. It can be used with continuous data but works best
with time series data. (A short sketch of some of these approaches follows below.)
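A minimal sketch of two of the approaches above (median fill and KNN imputation) on a made-up version of the table, using pandas and scikit-learn:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

people = pd.DataFrame({
    "age":     [18, 20, 29, np.nan, 27],   # one missing continuous value
    "working": [0, 1, 1, 0, 1],
})

# Median fill for a continuous column
people["age_median_filled"] = people["age"].fillna(people["age"].median())

# KNN imputation: fill the missing value from the most similar rows
imputer = KNNImputer(n_neighbors=2)
people[["age_knn", "working_knn"]] = imputer.fit_transform(people[["age", "working"]])
print(people)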
Deleting outliers: By plotting the data points, we can detect the
values that lie far from the dense region, and we can delete such sparse data points from
the data if we have a very large amount of data. With a small amount of data,
this way of treating outliers is not a good approach.
1. Knowledge: Based on knowledge of the domain, we can say which
variables are the most important to model. For example, for daily sales
from a shop, the day and the amount of material are important,
but the name of the customer is not.
2. Parameters: There are parameters such as the p-value that help in
determining the relation between variables. Using these parameters we
can differentiate between important and unimportant features of the data
for modelling.
3. Dimension reduction: Various dimensionality reduction algorithms not only
project the data into a lower dimension but also help in understanding the
inherent relationships in the data.
We know that how the data is split in a decision tree has a large
impact on the accuracy of the decision tree model. So, to split better, we need to
find optimal values for parameters such as the Gini impurity criterion. Finding optimal
values for the parameters of a model is known as hyperparameter tuning.
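A short sketch of hyperparameter tuning with a grid search (scikit-learn; the parameter grid and the data set are arbitrary illustrations):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search over the split criterion and tree depth to find the best-performing combination
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 3, 4, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)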
Method 4: Cross-validation
Although this method is related to both the data side and the model side, since we apply
the model several times in this technique, we can consider cross-validation a
method of working on the modelling side. Also, a model performing accurately on the
data it was trained on does not mean it is accurate in general. Cross-validation is a way
to verify the accuracy of the model.
Lecture Number: 34
Lecture Number: 35
Lecture Number: 36
A neural network would know that both sentences mean the same thing. Or it would be able to broadly recognize that
Baxter Road is a place, but Baxter Smith is a person’s name.
Computer vision
Computer vision is the ability of computers to extract information and insights from images and videos. With neural
networks, computers can distinguish and recognize images similar to humans. Computer vision has several applications,
such as the following:
• Visual recognition in self-driving cars so they can recognize road signs and other road users
• Content moderation to automatically remove unsafe or inappropriate content from image and video archives
• Facial recognition to identify faces and recognize attributes like open eyes, glasses, and facial hair
• Image labeling to identify brand logos, clothing, safety gear, and other image details
Speech recognition
Neural networks can analyze human speech despite varying speech patterns, pitch, tone, language, and accent. Virtual
assistants like Amazon Alexa and automatic transcription software use speech recognition to do tasks like these:
Lecture Number: 37
Backpropagation algorithm
Artificial neural networks learn continuously by using corrective feedback loops to improve their predictive analytics. In
simple terms, you can think of the data flowing from the input node to the output node through many different paths in the
neural network. Only one path is the correct one that maps the input node to the correct output node. To find this path, the
neural network uses a feedback loop, which works as follows:
1. Each node makes a guess about the next node in the path.
2. It checks if the guess was correct. Nodes assign higher weight values to paths that lead to more correct guesses
and lower weight values to node paths that lead to incorrect guesses.
3. For the next data point, the nodes make a new prediction using the higher weight paths and then repeat Step 1.
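In standard implementations this corrective feedback takes the form of gradient-based weight updates (backpropagation). A minimal single-neuron sketch with made-up data, assuming a sigmoid activation and a simple squared-error-style update:

import numpy as np

# Tiny made-up training set: two inputs, binary target (roughly the OR function)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)
b = 0.0
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    out = sigmoid(X @ w + b)          # forward pass: make a prediction
    error = out - y                   # compare the guess with the correct answer
    # Backward pass: adjust weights in the direction that reduces the error
    grad_w = X.T @ (error * out * (1 - out)) / len(y)
    grad_b = np.mean(error * out * (1 - out))
    w -= lr * grad_w
    b -= lr * grad_b

print(np.round(sigmoid(X @ w + b), 2))   # predictions move toward the targets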
Supervised learning
In supervised learning, data scientists give artificial neural networks labeled datasets that provide the right answer in
advance. For example, a deep learning network training in facial recognition initially processes hundreds of thousands of
images of human faces, with various terms related to ethnic origin, country, or emotion describing each image.
The neural network slowly builds knowledge from these datasets, which provide the right answer in advance. After the
network has been trained, it starts making guesses about the ethnic origin or emotion of a new image of a human face that
it has never processed before.
Lecture Number: 38
From the figure above it’s very clear that there are multiple lines (our hyperplane here is a line
because we are considering only two input features x1, x2) that segregate our data points or do a
classification between red and blue circles. So how do we choose the best line or in general the
best hyperplane that segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest separation or
margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2. Let’s consider a scenario like shown below
Here we have one blue ball in the boundary of the red ball. So how does SVM classify the data?
It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The SVM
algorithm has the characteristics to ignore the outlier and finds the best hyperplane that maximizes
the margin. SVM is robust to outliers.
So for this type of data point, what SVM does is find the maximum margin as done with the previous
data sets, and along with that it adds a penalty each time a point crosses the margin. The margins in
these types of cases are called soft margins. When there is a soft margin in the data set, the SVM
tries to minimize (1/margin) + λ(∑ penalties). Hinge loss is a commonly used penalty: if there are no
violations there is no hinge loss; if there are violations, the hinge loss is proportional to the distance
of the violation.
Till now, we were talking about linearly separable data (the group of blue balls and red balls are
separable by a straight/linear line). What do we do if the data are not linearly separable?
Say our data is as shown in the figure above. SVM solves this by creating a new variable using
a kernel. For a point xi on the line, we create a new variable yi as a function of its distance
from the origin o. If we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A non-linear
function that creates a new variable is referred to as a kernel.
1. Hyperplane: Hyperplane is the decision boundary that is used to separate the data
points of different classes in a feature space. In the case of linear classifications, it will
be a linear equation i.e. wx+b = 0.
2. Support Vectors: Support vectors are the closest data points to the hyperplane, which
play a critical role in deciding the hyperplane and the margin.
3. Margin: Margin is the distance between the support vector and the hyperplane. The main
objective of the support vector machine algorithm is to maximize the margin. A wider
margin indicates better classification performance.
4. Kernel: Kernel is the mathematical function, which is used in SVM to map the original
input data points into high-dimensional feature spaces, so, that the hyperplane can be
easily found out even if the data points are not linearly separable in the original input
space. Some of the common kernel functions are linear, polynomial, radial basis
function(RBF), and sigmoid.
5. Hard Margin: The maximum-margin hyperplane or the hard margin hyperplane is a
hyperplane that properly separates the data points of different categories without any
misclassifications.
6. Soft Margin: When the data is not perfectly separable or contains outliers, SVM
permits a soft margin technique. Each data point has a slack variable introduced by the
soft-margin SVM formulation, which softens the strict margin requirement and permits
certain misclassifications or violations. It discovers a compromise between increasing
the margin and reducing violations.
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
• Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are
very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the decision boundary.
• Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature space,
where the data points can be linearly separated. A linear SVM is used to locate a
nonlinear decision boundary in this modified space.
Advantages of SVM
• Effective in high-dimensional cases.
• Its memory is efficient as it uses a subset of training points in the decision function
called support vectors.
• Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
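A brief sketch of linear and RBF-kernel SVMs with scikit-learn on synthetic data (the data set and parameter values are arbitrary illustrations):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM (margin softness controlled by C) and a non-linear RBF-kernel SVM
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print(linear_svm.score(X_test, y_test), rbf_svm.score(X_test, y_test))
print(len(linear_svm.support_vectors_))   # the support vectors define the margin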
Lecture Number: 39
Lecture Number: 40
If a transaction is unusual, such as being made in a faraway location or for a large amount, it
may be flagged as fraudulent.
Lecture Number: 41
INTRODUCTION:
Cluster analysis, also known as clustering, is a method of data mining that groups similar data
points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such
that the data points within each group are more similar to each other than to data points in other
groups. This process is often used for exploratory data analysis and can help identify patterns or
relationships within the data that may not be immediately obvious. There are many different
algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based
clustering. The choice of algorithm will depend on the specific requirements of the analysis and
the nature of the data being analyzed.
Cluster analysis is the process of finding similar groups of objects in order to form clusters. It is an
unsupervised machine learning-based algorithm that acts on unlabelled data. A group of similar data
points is brought together to form a cluster, in which all the objects belong to the
same group.
The given data is divided into different groups by combining similar objects into a group. This
group is nothing but a cluster. A cluster is nothing but a collection of similar data which is
grouped together.
For example, consider a dataset of vehicles which contains information about different
vehicles like cars, buses, bicycles, etc. As it is unsupervised learning, there are no class labels
like Cars, Bikes, etc. for the vehicles; all the data is combined and is not in a structured
manner.
Now our task is to convert the unlabelled data to labelled data and it can be done using clusters.
The main idea of cluster analysis is that it would arrange all the data points by forming clusters
like cars cluster which contains all the cars, bikes clusters which contains all the bikes, etc.
Simply it is the partitioning of similar objects which are applied to unlabelled data.
Properties of Clustering :
1. Clustering Scalability: Nowadays there is a vast amount of data, and clustering algorithms must deal with
huge databases. In order to handle extensive databases, the clustering algorithm should be
scalable; if it is not scalable, we may not get the appropriate result,
which can lead to wrong conclusions.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data as well
as data of small size.
3. Algorithm Usability with multiple data kinds: Different kinds of data can be used with
algorithms of clustering. It should be capable of dealing with different types of data like discrete,
categorical and interval-based data, binary data etc.
4. Dealing with unstructured data: There would be some databases that contain missing
values, and noisy or erroneous data. If the algorithms are sensitive to such data then it may lead
to poor quality clusters. So it should be able to handle unstructured data and give some structure
to the data by organising it into groups of similar data objects. This makes the job of the data
expert easier in order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable. The interpretability reflects how easily the data is understood.
Lecture Number: 42
Partitioning Methods, Hierarchical Methods
Topics to cover:
Clustering Methods:
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method: It is used to make partitions of the data in order to form clusters. If “n”
partitions are made of “p” objects of the database, then each partition is represented by a cluster
and n ≤ p. The two conditions which need to be satisfied by this partitioning clustering
method are:
• Each object must belong to exactly one group.
• Each group must contain at least one object.
In the partitioning method, there is a technique called iterative relocation, which means an
object can be moved from one group to another to improve the partitioning (a minimal k-means
sketch follows below).
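A minimal k-means sketch (scikit-learn) on made-up 2-D data, as one concrete example of a partitioning method:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data: 2-D points forming a few natural groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in ([0, 0], [3, 3], [0, 4])])

# k-means partitions the points into k clusters and iteratively relocates them
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
print(kmeans.labels_[:10])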
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data
objects is created. We can classify hierarchical methods and will be able to know the purpose of
classification on the basis of how the hierarchical decomposition is formed. There are two types
of approaches for the creation of hierarchical decomposition, they are:
• Agglomerative Approach: The agglomerative approach is also known as the bottom-
up approach. Initially, each object forms its own separate group. Thereafter it keeps on
merging the objects or groups that are close to one
another, which means that they exhibit similar properties. This merging process
continues until the termination condition holds.
• Divisive Approach: The divisive approach is also known as the top-down approach.
In this approach, we would start with the data objects that are in the same cluster. The
group of individual clusters is divided into small clusters by continuous iteration. The
iteration continues until the condition of termination is met or until each cluster
contains one object.
Once a group is split or merged, the step can never be undone, since the method is rigid and not
flexible. Two approaches can be used to improve the quality of hierarchical clustering in data
mining:
• Carefully analyse the linkages between objects at every partitioning of the hierarchical
clustering.
• Integrate hierarchical agglomeration with other clustering techniques: first group the
objects into micro-clusters, and then perform macro-clustering on the micro-clusters. A
minimal sketch of the bottom-up agglomerative approach is given after this list.
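A minimal sketch of the agglomerative (bottom-up) approach follows, assuming single-linkage distance and a target number of clusters k as the termination condition; the data and these choices are illustrative, not prescribed by the notes.

def agglomerative(points, k):
    # Bottom-up clustering: start with singleton groups and merge the closest pair repeatedly.
    clusters = [[p] for p in points]            # initially each object is its own group

    def single_linkage(c1, c2):
        # Distance between two clusters = distance of their closest pair of points.
        return min((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 for a in c1 for b in c2)

    while len(clusters) > k:                    # termination condition
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j] # merge the two closest groups
        del clusters[j]
    return clusters

print(agglomerative([(1, 1), (1.2, 1.1), (5, 5), (5.3, 5.2), (9, 1)], k=3))

The divisive approach would run in the opposite direction, starting from one cluster containing all objects and repeatedly splitting it.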
Density-Based Method: The density-based method focuses on density. In this method, a cluster
keeps on growing as long as the density in its neighbourhood exceeds some threshold, i.e. the
neighbourhood of a given radius around each data point in the cluster has to contain at least a
minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number
of cells that form a grid structure. A major advantage of the grid-based method is its fast
processing time, which depends only on the number of cells in each dimension of the quantized
space rather than on the number of data objects; a small quantization sketch follows.
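The quantization step behind grid-based methods can be sketched as below: the 2-D object space is divided into square cells and each point is assigned to its cell. The cell size and the sample data are assumptions for illustration; a real grid-based algorithm such as STING or CLIQUE would then work on per-cell statistics rather than on individual points, which is why the cost depends on the number of cells.

from collections import defaultdict

def grid_cells(points, cell_size):
    # Quantize the 2-D object space into square cells and collect the objects in each cell.
    cells = defaultdict(list)
    for (x, y) in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return cells

data = [(0.2, 0.3), (0.8, 0.9), (5.1, 5.4), (5.6, 5.2), (5.9, 5.8)]
for cell, members in grid_cells(data, cell_size=1.0).items():
    print(cell, len(members), "objects")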
Model-Based Method: In the model-based method, a model is hypothesized for each cluster in
order to find the data that best fits that model. The clusters are located by clustering the density
function, which reflects the spatial distribution of the data points. This approach also provides a
way to automatically determine the number of clusters based on standard statistics, taking
outliers or noise into account, and therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by incorporating
application- or user-oriented constraints. A constraint refers to the user expectation or the
properties of the desired clustering results. Constraints provide an interactive way of
communicating with the clustering process, and they can be specified by the user or by the
application requirement.
Applications Of Cluster Analysis:
• It is widely used in image processing, data analysis, and pattern recognition.
• It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the web.
Lecture Number: 43
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any two objects in the cluster is less than the
distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
Important points:
o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data similarities
and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps
single out useful features that distinguish different groups.
Lecture Number: 44
Eps: Eps is the radius of the neighbourhood around a point; NEps(k) denotes the set of points whose
distance from k is at most Eps.
MinPts: MinPts refers to the minimum number of points required in the Eps-neighbourhood of a point
for it to be a core point.
Directly density reachable: A point i is directly density-reachable from a point k with respect to Eps
and MinPts if i belongs to NEps(k) and NEps(k) contains at least MinPts points (i.e. k is a core point).
Density reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if there is a chain of
points i1, ..., in with i1 = j and in = i such that i(t+1) is directly density-reachable from i(t) for each t.
Density connected:
A point i is density-connected to a point j with respect to Eps and MinPts if there is a point o such that
both i and j are density-reachable from o with respect to Eps and MinPts. A minimal sketch of these
neighbourhood checks follows.
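To make the Eps and MinPts definitions concrete, here is a minimal sketch of the Eps-neighbourhood and the directly-density-reachable test used in DBSCAN-style clustering; the sample points and parameter values are assumptions for illustration.

def eps_neighbourhood(points, p, eps):
    # N_Eps(p): all points whose distance from p is at most Eps (p itself included).
    return [q for q in points
            if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= eps ** 2]

def directly_density_reachable(points, i, k, eps, min_pts):
    # i is directly density-reachable from k if i lies in N_Eps(k) and k is a core point.
    neighbourhood = eps_neighbourhood(points, k, eps)
    return i in neighbourhood and len(neighbourhood) >= min_pts

data = [(0, 0), (0.5, 0), (1, 0), (5, 5)]
print(directly_density_reachable(data, (0.5, 0), (0, 0), eps=1.0, min_pts=3))  # True
print(directly_density_reachable(data, (5, 5), (0, 0), eps=1.0, min_pts=3))    # False

Density-reachability then simply chains this test, and density-connectivity asks for a common point from which both ends are density-reachable.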
Lecture Number: 45
There is a wide range of techniques and tools used in outlier analysis. However, as we’ll see later, it is
often very easy to spot outlying data points. As a result, there is really no excuse not to perform outlier
analysis on any dataset.
Benefits
Unlike other data analysis processes, outlier analysis only really has one benefit: it improves the quality of
the dataset being subject to analysis. Of course, this in turn brings benefits. With a higher-quality dataset,
analysts can expect to draw more accurate conclusions (and more of them).
Lecture Number: 46
There are a wide variety of techniques that can be used to identify outliers in datasets. In this section, we’ll
look at just a few of these techniques, including both straightforward and sophisticated ones.
Sorting
For an amateur data analyst, sorting is by far the easiest technique for outlier analysis. The premise is
simple: load your dataset into any kind of data manipulation tool (such as a spreadsheet), and sort the
values by their magnitude. Then, look at the range of values of various data points. If any data points are
significantly higher or lower than others in the dataset, they may be treated as outliers.
Graphing
An equally accessible tool for outlier analysis is graphing. Once again, the premise is straightforward: plot
all of the data points on a graph, and see which points stand out from the rest. The advantage of using a
graphing approach over a sorting approach is that it visualizes the magnitude of the data points, which
makes it much easier to spot outliers.
Z-score
A more statistical technique that can be used to identify outliers is the Z-score. The Z-score measures how
far a data point is from the average, as measured in standard deviations. By calculating the Z-score for each
data point, it’s easy to see which data points are placed far from the average. Unfortunately, like sorting,
this doesn’t take into account the influence of a second variable.
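A minimal sketch of Z-score based flagging is shown below; the sample values and the cutoff are assumptions (a cutoff of 3 standard deviations is a common choice, but a smaller one is used here because the sample is tiny).

def z_score_outliers(values, threshold=3.0):
    # Flag values lying more than `threshold` standard deviations away from the mean.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 95]        # 95 is an obvious outlier
print(z_score_outliers(data, threshold=2.0))   # -> [95]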
Other tests
Aside from sorting, graphing, and Z-scores, there are a whole host of statistical tests that can be used to
identify outliers in a dataset. For most intents and purposes, sorting and graphing are more than enough for
outlier analysis. Z-scores or other statistical tests may only be necessary for academic or high-stakes
purposes, where the true statistical aspect is much more important.
Lecture Number: 47
Personalized marketing:
Web mining can be used to analyze customer behavior on websites and social media platforms.
This information can be used to create personalized marketing campaigns that target customers
based on their interests and preferences.
E-commerce:
Web mining can be used to analyze customer behavior on e-commerce websites. This
information can be used to improve the user experience and increase sales by recommending
products based on customer preferences.
Search engine optimization:
Web mining can be used to analyze search engine queries and search engine results pages
(SERPs). This information can be used to improve the visibility of websites in search engine
results and increase traffic to the website.
Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This information can be used
to prevent financial fraud, identity theft, and other types of online fraud.
Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment from posts,
comments, and reviews. This information can be used to understand customer sentiment towards
products and services and make informed business decisions.
Web content analysis:
Web mining can be used to analyze web content and extract valuable information such as
keywords, topics, and themes. This information can be used to improve the relevance of web
content and optimize search engine rankings.
Customer service:
Web mining can be used to analyze customer service interactions on websites and social media
platforms. This information can be used to improve the quality of customer service and identify
areas for improvement.
Healthcare:
Web mining can be used to analyze health-related websites and extract valuable information
about diseases, treatments, and medications. This information can be used to improve the quality
of healthcare and inform medical research.
Process of Web Mining:
Web mining can be broadly divided into three different types of techniques of mining: Web
Content Mining, Web Structure Mining, and Web Usage Mining. These are explained as follows.
Lecture Number: 48
Comparison of data mining and web mining:
• Application: Data mining is very useful for web page analysis, whereas web mining is very
useful for a particular website and e-services.
• Target users: Data mining is used by data scientists and data engineers, while web mining is
used by data scientists along with data analysts.
• Access: Data mining accesses data privately; web mining accesses data publicly.
• Tools: Data mining includes tools like machine learning algorithms, whereas special tools for
web mining include Scrapy, PageRank, and Apache logs.
Lecture Number: 49
The PageRank algorithm is applied to web pages. It is used by Google Search to rank websites
in its search engine results, and it was named after Larry Page, one of the founders of Google.
PageRank can be described as a way of measuring the importance of website pages. The web is
modelled as a directed graph with two components, nodes and connections: the pages are the
nodes and the hyperlinks are the connections.
Let us see how to work through the PageRank algorithm. Compute the page rank of every node at
the end of the second iteration, using teleportation factor d = 0.8. The update formula is
PR(X) = (1 - d) + d * Σ PR(Y)/C(Y), where the sum runs over all pages Y that link to X and C(Y)
is the number of outgoing links of Y.
Iteration 0:
For iteration 0, assume that each page has page rank = 1/(total number of nodes).
Therefore, PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 ≈ 0.16
Iteration 1:
Using the formula above,
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.16/4 + 0.16/2)
≈ 0.3
What we have done here is: for node A we look at how many incoming links there are, namely
from B and from C. For each page providing an incoming link we divide its page rank by its
number of outgoing links: B has 4 outgoing links and C has 2 outgoing links. The same procedure
applies to the remaining nodes and iterations.
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.3/2
= 0.32
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(E) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.32/4
= 0.264
PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.32/4 + 0.32/2)
= 0.392
This was for iteration 1, now let us calculate iteration 2.
Iteration 2:
By using the above-mentioned formula
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2)
= (1-0.8) + 0.8 * (0.32/4 + 0.32/2)
= 0.392
NOTE: USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
PR(B) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(C) = (1-0.8) + 0.8 * PR(A)/2
= (1-0.8) + 0.8 * 0.392/2
= 0.3568
PR(D) = (1-0.8) + 0.8 * PR(B)/4
= (1-0.8) + 0.8 * 0.3568/4
≈ 0.271
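The example graph itself is not reproduced in these notes. A plausible reconstruction that is consistent with the in-links and out-degrees used in the calculation above is A→B, A→C; B→A, B→D, B→E, B→F; C→A, C→F (the out-links of D, E and F do not enter the calculation and are left empty here). The sketch below iterates PR(X) = (1 - d) + d * Σ PR(Y)/C(Y) with d = 0.8, reusing updated ranks within an iteration just as the notes do; its results differ slightly from the hand calculation only because the notes round intermediate values (1/6 ≈ 0.16, PR(A) ≈ 0.3).

def pagerank(out_links, d=0.8, iterations=2):
    # Iterative PageRank: PR(X) = (1 - d) + d * sum(PR(Y) / C(Y)) over pages Y linking to X.
    nodes = list(out_links)
    pr = {n: 1 / len(nodes) for n in nodes}     # iteration 0: 1 / total number of nodes
    # Invert the graph: for each node, the pages linking to it together with their out-degrees.
    in_links = {n: [(m, len(out_links[m])) for m in nodes if n in out_links[m]] for n in nodes}
    for _ in range(iterations):
        for n in nodes:                         # updated ranks are reused immediately
            pr[n] = (1 - d) + d * sum(pr[m] / c for m, c in in_links[n])
    return pr

graph = {"A": ["B", "C"], "B": ["A", "D", "E", "F"],
         "C": ["A", "F"], "D": [], "E": [], "F": []}   # assumed reconstruction
print(pagerank(graph, d=0.8, iterations=2))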
Lecture Number: 50
At the same time that PageRank was being developed, Jon Kleinberg, a professor in the Department of
Computer Science at Cornell, came up with his own solution to the web search problem. He developed an
algorithm that makes use of the link structure of the web in order to discover and rank pages relevant to a
particular topic. HITS (hyperlink-induced topic search) is now part of the Ask search engine
(www.Ask.com).
Jon Kleinberg's algorithm called HITS identifies good authorities and hubs for a topic by assigning two
numbers to a page: an authority and a hub weight. These weights are defined recursively. A higher
authority weight occurs if the page is pointed to by pages with high hub weights. A higher hub weight
occurs if the page points to many pages with high authority weights.
In order to get a set rich in both hubs and authorities for a query Q, we first collect the top 200
documents that contain the highest number of occurrences of the search phrase Q. These, as pointed out
before, may not be of tremendous practical relevance, but one has to start somewhere. Kleinberg points out
that the pages from this set, called the root set RQ, are essentially very heterogeneous and in general
contain only a few (if any) links to each other. The web subgraph determined by these nodes is therefore
almost totally disconnected; in particular, we cannot apply PageRank techniques to RQ.
Authorities for the query Q are not very likely to be in the root set RQ; however, they are likely to be
pointed to by at least one page in RQ. The root set is therefore extended to a base set SQ by adding the
pages that are linked to or from pages in RQ.
From here on, we translate everything into mathematical language. We associate to each page i two
numbers: an authority weight ai, and a hub weight hi. We consider pages with a higher ai number as being
better authorities, and pages with a higher hi number as being better hubs. Given the weights {ai} and {hi}
of all the nodes in SQ, we dynamically update the weights as follows:
A good hub increases the authority weight of the pages it points to. A good authority increases the hub
weight of the pages that point to it. The idea is then to apply the two operations above alternately until
equilibrium values for the hub and authority weights are reached.
Let A be the adjacency matrix of the graph SQ, with Aij = 1 if page i links to page j and 0 otherwise.
Denote the authority weight vector by v (with entries ai) and the hub weight vector by u (with entries hi),
written as column vectors. The two update operations described above then translate to
v = At u and u = A v,
where At is the transpose of A; each step is followed by normalising the weights.
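The worked example referred to next was given in figures that are not reproduced here; the surrounding text suggests a small graph in which node 3 is the only node with incoming edges and nodes 1 and 2 are its equally good hubs, so the sketch below assumes the edges 1→3 and 2→3. It also assumes NumPy is available; the vector a plays the role of the authority vector v and h the role of the hub vector u.

import numpy as np

# Assumed reconstruction of the example graph: A[i][j] = 1 if node i+1 links to node j+1.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)

a = np.ones(3)                       # authority weights
h = np.ones(3)                       # hub weights
for _ in range(20):
    a = A.T @ h                      # authority update: v = At u
    h = A @ a                        # hub update:       u = A v
    a, h = a / a.sum(), h / h.sum()  # normalise so the weights sum to 1

print("authorities:", a)             # node 3 receives all the authority weight
print("hubs:       ", h)             # nodes 1 and 2 are equally good hubs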
This already corresponds to our intuition that node 3 is the most authoritative, since it is the only one
with incoming edges, and that nodes 1 and 2 are equally important hubs. If we repeat the process further,
we will only obtain scalar multiples of the vectors v and u computed at step 1. So the relative weights of
the nodes remain the same.
For more complicated examples of graphs, we would expect the convergence to be problematic, and the
equilibrium solutions (if there are any) to be more difficult to find.
Theorem: Under the assumptions that AAt and AtA are primitive matrices, the following statements hold:
1. If v1, ..., vk, ... is the sequence of authority weight vectors we have computed, then V1, ..., Vk, ... converges to
the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix AtA. With a slight abuse
of notation, we denote here by Vk the vector vk normalised so that the sum of its entries is 1.
2. Likewise, if u1, ... ,uk, ... are the hub weights that we have iteratively computed, then U1, ... ,Uk, ... converges
to the unique probabilistic vector corresponding to the dominant eigenvalue of the matrix AAt. We use the
same notation, that Uk = (1 ⁄ c)uk, where c is the scalar equal to the sum of the entries of the vector uk.