Data Mining Basics & Techniques
UNIT - I
Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in Data Mining −
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −
Class/Concept Description
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways − data characterization, which summarizes the data of the class under study, and data discrimination, which compares the target class with one or more contrasting classes.
Some people don’t differentiate data mining from knowledge discovery while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process −
Data Cleaning − In this step, the noise and inconsistent data are removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation − In this step, data are transformed or consolidated into forms appropriate for mining.
Data Mining − In this step, intelligent methods are applied in order to extract data patterns.
Pattern Evaluation − In this step, the discovered patterns are evaluated for interestingness.
Knowledge Presentation − In this step, the mined knowledge is presented to the user using visualization and representation techniques.
Knowledge Discovery
Characterization
Discrimination
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place.
It needs to be integrated from various heterogeneous data sources. These factors also create
some issues. Here in this tutorial, we will discuss the major issues regarding −
Mining Methodology and User Interaction Issues
Performance Issues
Diverse Data Types Issues
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data
cleaning methods are not there then the accuracy of the discovered patterns will be
poor.
Pattern evaluation − The patterns discovered should be interesting; patterns that represent common knowledge or lack novelty are not useful.
Data mining is one of the forms of artificial intelligence that uses perception models, analytical models, and multiple algorithms to simulate the techniques of the human brain. Data mining supports machines in making human-like decisions and choices.
The user of data mining tools will have to direct the machine with rules, preferences, and even experiences in order to obtain decision support. The data mining metrics are as follows −
Usefulness − Usefulness involves several metrics that tell us whether the model provides useful information. For instance, a data mining model that correlates store location with sales can be both accurate and reliable, but may not be useful, because it cannot generalize that result by inserting more stores at the same location.
Furthermore, it does not answer the fundamental business question of why specific locations have more sales. It can also turn out that a model that appears successful is meaningless because it depends on cross-correlations in the data.
Return on Investment (ROI) − Data mining tools will find interesting patterns buried
inside the data and develop predictive models. These models will have several
measures for denoting how well they fit the records. It is not clear how to create a
decision based on some of the measures reported as an element of data mining
analyses.
Access Financial Information during Data Mining − The simplest way to frame decisions in
financial terms is to augment the raw information that is generally mined to also contain
financial data. Some organizations are investing and developing data warehouses, and data
marts.
The design of a warehouse or mart contains considerations about the types of analyses and data needed for expected queries. Designing warehouses in a way that allows access to financial information, along with access to more typical data on product attributes, user profiles, etc., can be useful.
Converting Data Mining Metrics into Financial Terms − A common data mining metric is the measure of "Lift". Lift is a measure of what is achieved by using the specific model or pattern relative to a base rate in which the model is not used. High values mean much is achieved. It may seem, then, that one can simply make a decision based on Lift.
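As a minimal illustrative sketch (the item names and transaction counts below are hypothetical), Lift for an association A → B can be computed from co-occurrence counts as P(A and B) / (P(A) × P(B)):

# Lift(A -> B) = P(A and B) / (P(A) * P(B)); values above 1 indicate that A and B
# occur together more often than expected if they were independent.
def lift(count_a, count_b, count_ab, n_transactions):
    p_a = count_a / n_transactions
    p_b = count_b / n_transactions
    p_ab = count_ab / n_transactions
    return p_ab / (p_a * p_b)

# Hypothetical basket counts: 600 of 1000 transactions contain bread,
# 400 contain butter, and 300 contain both.
print(lift(600, 400, 300, 1000))   # 1.25 -> positive association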
Accuracy − Accuracy is a measure of how well the model correlates an outcome with the attributes in the data that has been supplied. There are several measures of accuracy, but all measures of accuracy are dependent on the data that is used. In reality, values can be missing or approximate, or the data can have been changed by several processes.
Because data mining is a procedure of exploration and development, one can decide to accept a specific amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a model that predicts sales for a specific store based on past sales can be strongly correlated and very accurate, even if that store consistently used the wrong accounting techniques. Thus, measurements of accuracy should be balanced by assessments of reliability.
There are various social implications of data mining which are as follows −
Privacy − It is a loaded issue. In recent years privacy concerns have taken on a more important role in American society as merchants, insurance companies, and government agencies amass warehouses containing personal records.
The concerns that people have over the collection of this data will naturally extend to the analytic capabilities applied to the data. Users of data mining should start thinking about how their use of this technology will be impacted by legal issues associated with privacy.
Profiling − Data mining and profiling is a developing field that attempts to organize, understand, analyze, reason about, and use the explosion of data in this information age. The process involves using algorithms and experience to extract patterns or anomalies that are very complex, difficult, or time-consuming to identify.
The founder of Microsoft's Exploration Team used complex data mining algorithms to solve a problem that had haunted astronomers for years: the problem of reviewing, describing, and categorizing 2 billion sky objects recorded over 3 decades. The algorithms were able to extract the features that characterized sky objects as stars or galaxies. This developing field of data mining and profiling has several frontiers where it can be used.
Unauthorized Use − Trends obtained through data mining, intended to be used for marketing or other ethical purposes, can be misused. Unethical businesses or people can use the information obtained through data mining to take advantage of vulnerable people or discriminate against a specific group of people. Furthermore, data mining techniques are not 100 percent accurate; thus mistakes do happen, which can have serious consequences.
Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases.
Data mining is used to explore increasingly large databases and to improve market
segmentation. By analysing the relationships between parameters such as customer age,
gender, tastes, etc., it is possible to guess their behaviour in order to direct personalised
loyalty campaigns.
The data mining process is usually broken into steps such as data cleaning, data integration, data selection, data transformation, pattern mining, and evaluation of the discovered knowledge.
Supermarkets, for example, use joint purchasing patterns to identify product associations and decide how to place them in the aisles and on the shelves. Data mining also detects which offers are most valued by customers or increase sales at the checkout queue.
Data mining is the process of finding anomalies, patterns and correlations within large data
sets to predict outcomes. Using a broad range of techniques, you can use this information to
increase revenues, cut costs, improve customer relationships, reduce risks and more.
DATA MINING
Data mining is the process of analyzing hidden patterns of data according to different perspectives in order to categorize it into useful information, which is collected and assembled in common areas, such as data warehouses, for efficient analysis.
Data mining refers to extracting or mining knowledge from large amounts of data. It could more appropriately be named knowledge mining, a term which emphasizes mining from large amounts of data. It draws on methods from artificial intelligence, machine learning, statistics, and database systems, and its overall goal is to extract information from a data set and transform it into an understandable structure for further use.
The key properties of data mining are the automatic discovery of patterns, the prediction of likely outcomes, the creation of actionable information, and a focus on large data sets and databases.
Data mining is the process of sorting through large data sets to identify patterns and establish relationships to solve problems through data analysis. Data mining tools allow enterprises to predict future trends.
Data mining uses algorithms and various other techniques to convert large collections of data into useful output. The most popular types of data mining techniques include association rules, also referred to as market basket analysis, which search for relationships between variables.
DM Techniques
1.Classification:
This analysis is used to retrieve important and relevant information about data, and
metadata. This data mining method helps to classify data in different classes.
2. Clustering:
Clustering analysis is a data mining technique to identify data that are like each other.
This process helps to understand the differences and similarities between the data.
3. Regression:
Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to identify the likelihood of a specific variable, given the presence of other variables.
4. Association Rules:
This data mining technique helps to find the association between two or more items. It discovers hidden patterns in the data set.
5. Outlier Detection:
This type of data mining technique refers to the observation of data items in the dataset which do not match an expected pattern or expected behavior. This technique can be used in a variety of domains, such as intrusion detection, fraud or fault detection, etc.
6. Sequential Patterns:
This data mining technique helps to discover or identify similar patterns or trends in transaction data over a given period.
7. Prediction:
Prediction uses a combination of the other data mining techniques, such as trends, sequential patterns, clustering, and classification. It analyzes past events or instances in the right sequence to predict a future event.
Insurance - Data mining helps insurance companies to price their products profitably and to promote new offers to new or existing customers.
Manufacturing - With the help of Data Mining, manufacturers can predict wear and tear of production assets. They can anticipate maintenance, which helps them minimize downtime.
Banking - Data mining helps the finance sector get a view of market risks and manage regulatory compliance.
Retail - Data Mining techniques help retail malls and grocery stores identify and arrange the most sellable items in the most attentive positions. It helps store owners come up with offers that encourage customers to increase their spending.
Service Providers - Service providers like mobile phone and utility industries use Data Mining to predict the reasons when a customer leaves their company. They analyze billing details, customer service interactions, and complaints made to the company to assign each customer a probability of churning.
E-Commerce - E-commerce websites use Data Mining to offer cross-sells and up-sells through their websites. One of the most famous names is Amazon, which uses Data Mining techniques to bring more customers to its e-commerce store.
Supermarkets - Data Mining allows supermarkets to develop rules to predict whether their shoppers are likely to be expecting. By evaluating their buying patterns, they can find women customers who are most likely pregnant and start targeting products such as baby items.
There are various techniques of statistical data mining which are as follows −
Regression − These approaches are used to forecast the value of a response
(dependent) variable from one or more predictor (independent) variables where the
variables are numeric.
There are several forms of regression, including linear, multiple, weighted, polynomial,
nonparametric, and robust (robust techniques are beneficial when errors fail to satisfy
normalcy conditions or when the data includes significant outliers).
Analysis of variance − These methods analyze experimental data for two or more populations described by a numeric response variable and one or more categorical variables (factors). In general, an ANOVA (single-factor analysis of variance) problem involves a comparison of k population or treatment means to decide whether at least two of the means are different.
Mixed-effect models − These models are for analyzing grouped data—data that can be categorized according to one or more grouping variables. They generally describe relationships between a response variable and some covariates in data grouped according to one or more factors. Typical areas of application include multilevel data, repeated measures data, block designs, and longitudinal data.
Factor analysis − This method is used to determine which variables are combined to generate a given factor. For instance, for some psychiatric data, it is not feasible to measure a specific factor of interest directly (such as intelligence); however, it is possible to measure other quantities (such as student test scores) that reflect the factor of interest. Here, none of the variables is designated as dependent.
Discriminant analysis − This technique is used to predict a categorical response variable. The process tries to determine several discriminant functions (linear combinations of the independent variables) that discriminate among the groups defined by the response variable. Discriminant analysis is commonly used in social sciences.
Time series analysis − There are some statistical techniques for analyzing time-series
data, including auto-regression methods, univariate ARIMA (autoregressive integrated
moving average) modeling, and long-memory time-sequence modeling.
Quality control − Several statistics can be used to prepare charts for quality control,
including Shewhart charts and CUSUM charts (both of which display group summary
statistics). These statistics contain the mean, standard deviation, range, count, moving
average, moving standard deviation, and moving range.
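As a minimal sketch of two of the techniques listed above, the snippet below runs a simple linear regression and a single-factor ANOVA with SciPy; the samples are hypothetical and only illustrate the calls:

from scipy.stats import f_oneway, linregress

# Hypothetical predictor and response values for a simple linear regression.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
reg = linregress(x, y)
print(reg.slope, reg.intercept, reg.rvalue)

# Hypothetical samples from three populations for a single-factor ANOVA:
# are at least two of the group means different?
group1 = [20.1, 19.8, 20.5, 20.0]
group2 = [21.2, 21.5, 20.9, 21.3]
group3 = [19.0, 19.4, 18.8, 19.2]
f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)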
In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.
Clustering consists of grouping certain objects that are similar to each other; it can be used to decide whether two items are similar or dissimilar in their properties.
In a Data Mining sense, the similarity measure is a distance with dimensions describing object
features.
That means if the distance between two data points is small, then there is a high degree of similarity between the objects, and vice versa.
The similarity is subjective and depends heavily on the context and application. For example,
similarity among vegetables can be determined from their taste, size, colour etc.
Most clustering approaches use distance measures to assess the similarities or differences
between a pair of objects, the most popular distance measures used are:
1. Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with geometry. It can be
simply explained as the ordinary distance between two points.
It is one of the most used algorithms in the cluster analysis. One of the algorithms that use
this formula would be K-mean.
2. Manhattan Distance:
This determines the absolute difference among the pair of the coordinates.
Suppose we have two points P and Q to determine the distance between these points we
simply have to calculate the perpendicular distance of the points from X-Axis and Y-Axis.
3. Jaccard Index:
The Jaccard distance measures the similarity of the two data set items as the intersection of
those items divided by the union of the data items.
4. Minkowski Distance:
The Minkowski distance is a generalization of both the Euclidean and Manhattan distances, controlled by an order parameter p: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. A short sketch of these distance computations is given below.
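The sketch below is a minimal illustration of these measures using hypothetical points and item sets (the values are only examples):

import math

# Two hypothetical numeric points.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))        # 5.0
manhattan = sum(abs(a - b) for a, b in zip(p, q))                     # 7.0
minkowski_p3 = sum(abs(a - b) ** 3 for a, b in zip(p, q)) ** (1 / 3)  # order p = 3

# Two hypothetical item sets for the Jaccard index.
basket_1 = {"milk", "bread", "butter"}
basket_2 = {"milk", "bread", "jam"}
jaccard_similarity = len(basket_1 & basket_2) / len(basket_1 | basket_2)  # 0.5
jaccard_distance = 1 - jaccard_similarity

print(euclidean, manhattan, minkowski_p3, jaccard_distance)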
Decision Tree
Decision tree is the most powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label.
Introduction:
The decision tree algorithm falls under the category of supervised learning. It can be used to solve both regression and classification problems.
A decision tree uses the tree representation to solve the problem, in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree. We can represent any boolean function on discrete attributes using a decision tree.
Definition of Decision Tree
Machine learning methods are divided into two categories:
1. supervised
2. unsupervised.
The same division can be found in algorithms, and the decision tree belongs to the former
category. It’s a supervised algorithm you can use to regress or classify data. It relies on
training data to predict values or outcomes.
What’s the first thing you notice when you look at a tree? If you’re like most people, it’s
probably the leaves and branches.
The decision tree algorithm has the same elements. Add nodes to the equation, and you have
the entire structure of this algorithm right in front of you.
Nodes – There are several types of nodes in decision trees. The root node is the parent of all
nodes, which represents the overriding message. Chance nodes tell you the probability of a
certain outcome, whereas decision nodes determine the decisions you should make.
Branches – Branches connect nodes. Like rivers flowing between two cities, they show your
data flow from questions to answers.
Leaves – Leaves are also known as end nodes. These elements indicate the outcome of your
algorithm. No more nodes can spring out of these nodes. They are the cornerstone of
effective decision-making.
When you go to a park, you may notice various tree species: birch, pine, oak, and acacia. By
the same token, there are multiple types of decision tree algorithms:
1. Classification Trees – These decision trees map observations about particular data by
classifying them into smaller groups. The chunks allow machine learning specialists to
predict certain values.
2. Regression Trees – According to IBM, regression decision trees can help anticipate
events by looking at input variables.
Knowing the definition, types, and components of decision trees is useful, but it doesn’t give
you a complete picture of this concept. So, buckle your seatbelt and get ready for an in-depth
overview of this algorithm.
Just as there are hierarchies in your family or business, there are hierarchies in any decision
tree in data mining. Top-down arrangements start with a problem you need to solve and break
it down into smaller chunks until you reach a solution. Bottom-up alternatives sort of wing it –
they enable data to flow with some supervision and guide the user to results.
ID3 (Iterative Dichotomiser 3) – Developed by Ross Quinlan, ID3 is a versatile algorithm that can solve a multitude of issues. It’s a greedy algorithm (yes, it’s OK to be greedy sometimes), meaning it selects the attribute that maximizes information gain at each step.
CART (Classification and Regression Trees) – This algorithm drills down on predictions. It
describes how you can predict target values based on other, related information.
CHAID (Chi-squared Automatic Interaction Detection) – If you want to check out how your
variables interact with one another, you can use this algorithm. CHAID determines how
variables mingle and explain particular outcomes.
No discussion about decision tree algorithms is complete without looking at the most
significant concept from this area:
Entropy
As previously mentioned, decision trees are like trees in many ways. Conventional trees
branch out in random directions. Decision trees share this randomness, which is where
entropy comes in.
Entropy tells you the degree of randomness (or surprise) of the information in your decision
tree.
Information Gain
A decision tree isn’t the same before and after splitting a root node into other nodes. You can
use information gain to determine how much it’s changed. This metric indicates how much
your data has improved since your last split. It tells you what to do next to make better
decisions.
Gini Index
Mistakes can happen, even in the most carefully designed decision tree algorithms. However,
you might be able to prevent errors if you calculate their probability.
Enter the Gini index (Gini impurity). It establishes the likelihood of misclassifying an instance
when choosing it randomly.
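A minimal sketch of these three measures, computed on hypothetical class labels (the split shown is illustrative only):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    # Gini impurity: probability of misclassifying a randomly chosen label.
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, splits):
    # Entropy of the parent minus the weighted entropy of the child splits.
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in splits)
    return entropy(parent) - weighted

# Hypothetical labels before and after splitting on some attribute.
parent = ["yes"] * 9 + ["no"] * 5
left = ["yes"] * 6 + ["no"] * 1
right = ["yes"] * 3 + ["no"] * 4

print(entropy(parent))                        # about 0.940 bits
print(gini(parent))                           # about 0.459
print(information_gain(parent, [left, right]))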
Pruning
You don’t need every branch on your apple or pear tree to get a great yield. Likewise, not all
data is necessary for a decision tree algorithm.
Pruning is a compression technique that allows you to get rid of this redundant information
that keeps you from classifying useful data.
Growing a tree is straightforward – you plant a seed and water it until it is fully formed.
Creating a decision tree is simpler than some other algorithms, but quite a few steps are
involved nevertheless.
Neural Network:
Neural Network is an information processing paradigm that is inspired by the human nervous
system.
As in the Human Nervous system, we have Biological neurons in the same way in Neural
networks we have Artificial Neurons which is a Mathematical Function that originates from
biological neurons.
The human brain is estimated to have around 10 billion neurons each connected on average to
10,000 other neurons.
Each neuron receives signals through synapses that control the effects of the signal on the
neuron.
Artificial Neuron:
The McCulloch-Pitts model is considered to be the first neural network, and the Hebbian learning rule is one of the earliest and simplest learning rules for neural networks. Neural network models can be broadly divided into the following types:
1. Feedforward Neural Network: Signals travel in only one direction, from the input layer towards the output layer; there are no feedback loops.
2. Feedback Neural Network: Signals can travel in both directions in a feedback network.
Feedback neural networks are very powerful and can become very complex. Feedback networks are dynamic: the “states” in such a network are constantly changing until an equilibrium point is reached.
They stay at equilibrium until the input changes and a new equilibrium needs to be
found. Feedback neural network architectures are also known as interactive or
recurrent. Feedback loops are allowed in such networks. They are used for content
addressable memory.
Genetic algorithms (GA) are adaptive search algorithms- adaptive in terms of the number of
parameters you provide or the types of parameters you provide.
The algorithm selects the optimal solution from among several candidate solutions, and its design is based on the natural process of genetic evolution.
Genetic algorithm emulates the principles of natural evolution, i.e. survival of the fittest.
Natural evolution propagates the genetic material in the fittest individuals from one
generation to the next.
The genetic algorithm applies the same technique in data mining – it iteratively performs the
selection, crossover, mutation, and encoding process to evolve the successive generation of
models.
Its main design choices include the selection procedure, the replacement technique, and the termination condition.
At every iteration, the algorithm delivers a model that inherits its traits from the previous model and competes with the other models until the most predictive model survives.
Genetic algorithms are based on an analogy with genetic structure and behaviour of
chromosomes of the population. Following is the foundation of GAs based on this analogy –
Those individuals who are most successful (fittest) mate to create more offspring than others. Genes from the “fittest” parents propagate through the generations; sometimes parents create offspring which is better than either parent.
Search space:
The population of individuals are maintained within search space. Each individual represents a
solution in search space for given problem. Each individual is coded as a finite length vector
(analogous to chromosome) of components. These variable components are analogous to
Genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score
A fitness score is given to each individual, which shows the ability of an individual to “compete”. Individuals having an optimal (or near-optimal) fitness score are sought.
The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. The individuals having better fitness scores are given more chance to reproduce than others.
The individuals with better fitness scores are selected who mate and produce better offspring
by combining chromosomes of parents.
The population size is static, so room has to be created for new arrivals. Thus, some individuals die and get replaced by new arrivals, eventually creating a new generation when all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive, while the least fit die out.
Once the initial generation is created, the algorithm evolves the generation using the following operators –
1) Initial population
Being the first phase of the algorithm, it includes a set of individuals where each individual is a
solution to the concerned problem. We characterize each individual by the set of parameters
that we refer to as genes.
2) Calculate Fitness
A fitness function is implemented to compute the fitness of each individual in the population. The function provides a fitness score to each individual in the population. The fitness score determines the probability of an individual being selected for the reproduction process.
3) Selection Operator: The idea is to give preference to the individuals with good fitness
scores and allow them to pass their genes to successive generations.
4) Crossover Operator: This represents mating between individuals. Two individuals are selected using the selection operator and crossover sites are chosen randomly. Then the genes at these crossover sites are exchanged, thus creating a completely new individual (offspring).
5) Mutation Operator: The key idea is to insert random genes in offspring to maintain diversity in the population and avoid premature convergence. A short sketch of these operators is given below.
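The following is a minimal sketch of these operators on a hypothetical problem (maximizing the number of 1-bits in a chromosome); the population size, mutation rate, and fitness function are illustrative choices, not tuned values:

import random

def fitness(chromosome):
    # Hypothetical fitness: the number of 1-bits in the chromosome.
    return sum(chromosome)

def select(population):
    # Tournament selection: the fitter of two random individuals wins.
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent1, parent2):
    # Single-point crossover at a randomly chosen site.
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:]

def mutate(chromosome, rate=0.01):
    # Flip each gene with a small probability to maintain diversity.
    return [1 - gene if random.random() < rate else gene for gene in chromosome]

# Initial population of random 20-bit chromosomes.
population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]

for generation in range(50):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(len(population))]

best = max(population, key=fitness)
print(fitness(best), best)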
So far, we have studied that the genetic algorithm is an adaptive, robust method that is used in situations where the search space is large. The algorithm optimizes a fitness function based on the criteria preferred by data mining so as to obtain an optimal solution, for example:
1. Knowledge discovery system
2. MASSON system
However, the data mining application based on the genetic algorithm is not as rich as the
application based on fuzzy sets. In the section ahead, we have categorized some systems
based on the genetic algorithm used in data mining.
Regression
Data mining identifies human interpretable patterns; it includes a prediction that determines a
future value from the available variable or attributes in the database. The basic assumption of
the linear multi regression model is that there is no interaction among the attributes.
GA handles the interaction among the attributes in a far better way. The non-linear multi-regression model uses GA to single out a suitable model from the training data set.
Association Rules
Multiple-objective GA deals with problems that have multiple objective functions and constraints, in order to determine the optimal set of solutions: no other solution in the search space should dominate any member of this set.
Such algorithms are used for rule mining with a large search space with many attributes and
records. To obtain the optimal solutions, multi-objective GA performs the global search with
multiple objectives. Such as a combination of factors like predictive accuracy,
comprehensibility and interestingness.
Advantages
GA uses the payoff information instead of derivatives to yield an optimal solution.
MCQ QUESTIONS :
1. Which of the options listed below helps to identify abstracted patterns in unlabeled data?
Hybrid learning
Unsupervised learning
Supervised learning
Reinforcement learning
2. Which of the options listed below helps to infer a model from labeled data?
Hybrid learning
Unsupervised learning
Supervised learning
Reinforcement learning
3. Which of the following can query unstructured textual data?
Information retrieval
Information access
Information manipulation
Information update
4. Which of the following processes is not involved in the data mining process?
Data exploration
Data transformation
Data archaeology
Knowledge extraction
5. Which of the following is taken into account before diving into the data mining process?
Vendor consideration
Functionality
Compatibility
7. Which of the following processes uses intelligent methods to extract data patterns?
Data mining
Text mining
Warehousing
Data selection
Logical system
Transaction system
Database table
Online database
Flat files
It is a subdivision of a set.
It can be defined as the process of extracting information from a large collection of data
The process of data mining involves several other processes like data cleaning, data
transformation, and data integration.
The data in Update-Driven approach can be copied, integrated, summarized, and restructured
in the semantic data store in advance.
Both a and b
Both a and b
These are the group of the same objects that differ majorly from the other objects.
Symbolic representation of facts and ideas from which information can be extracted using the
data mining process
Answer: 1. These are the group of the same objects that differ majorly from the other objects.
19. Which statement given below closely defines the term data selection?
The selection of correct data for the Knowledge Discovery in Databases (KDD) process
Answer: 2. The selection of correct data for the Knowledge Discovery in Databases (KDD) process
20. Which statement given below closely defines the term discovery?
It is hidden in a database and needs to be found out using certain clues given (for example, it is encrypted)
An extremely complex molecule that occurs in the human chromosomes and that carries
genetic information in the form of genes.
It is the process of extracting implicit, previously unknown, and potentially useful information from the data.
Answer: 3. It is the process of extracting implicit, previously unknown, and potentially useful information from the data.
UNIT-II
ALGORITHMS
What Is Classification?
• Classification is the process of recognizing, understanding, and grouping ideas
and objects into preset categories or “sub-populations.”
• Using pre-categorized training datasets, machine learning programs use a
variety of algorithms to classify future datasets into categories.
• Classification algorithms in machine learning use input training data to predict the
likelihood that subsequent data will fall into one of the predetermined categories.
• One of the most common uses of classification is filtering emails into “spam” or “non-spam.” In short, classification is a form of “pattern recognition,” with classification algorithms applied to the training data to find the same pattern (similar words or sentiments, number sequences, etc.) in future sets of data.
• Using classification algorithms, which we’ll go into more detail about below, text
analysis software can perform tasks like aspect-based sentiment analysis to
categorize unstructured text by topic and polarity of opinion (positive, negative,
neutral, and beyond).
• Try out this pre-trained sentiment classifier to understand how classification
algorithms work in practice, then read on to learn more about different types of
classification algorithms.
The data classification process includes two steps −
1. Learning step (training phase) − A classification model is built by analyzing a training set made up of database tuples and their associated class labels.
2. Classification step − The model is used to predict class labels for new, unseen data, and its accuracy is estimated on test data.
Statistical Analysis:
• Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns.
• Theoreticians and practitioners are continually seeking improved techniques to make the
process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways
in data mining:
Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify patterns
and trends. Alternatively, it is referred to as quantitative analysis.
Non-statistical Analysis: This analysis provides generalized information and includes sound, still images,
and moving images.
Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the main
characteristics of that data. Graphs or numbers summarize the data. Average, Mode, SD(Standard
Deviation), and Correlation are some of the commonly used descriptive statistical methods.
Inferential Statistics: The process of drawing conclusions based on probability theory and generalizing
the data. By analyzing sample statistics, you can infer parameters about populations and make models
of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics. Some of
these are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
This is the analysis of raw data using mathematical formulas, models, and techniques. Through the use
of statistical methods, information is extracted from research data, and different ways are available to
judge the robustness of research outputs.
As a matter of fact, today’s statistical methods used in the data mining field typically are derived from
the vast statistical toolkit developed to answer problems arising in other fields. These techniques are
taught in science curriculums.
It is necessary to check and test several hypotheses. Such hypotheses help us assess the validity of our data mining endeavor when attempting to draw any inferences from the data under study. When using more complex and sophisticated statistical estimators and tests, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations, a variety of statistical methods are available in data mining, and some of these are:
Now, let’s try to understand some of the important statistical methods which are used in data mining:
Linear Regression: The linear regression method uses the best linear relationship between the independent and dependent variables to predict the target variable. In order to achieve the best fit, make sure that all the distances between the fitted line and the actual observations at each point are as small as possible. A good fit can be determined by checking that no other position of the line would produce a smaller total error, given the form chosen.
Simple linear regression and multiple linear regression are the two major types of linear regression. By
fitting a linear relationship to the independent variable, the simple linear regression predicts the
dependent variable. Using multiple independent variables, multiple linear regression fits the best linear
relationship with the dependent variable.
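A minimal sketch of multiple linear regression with scikit-learn, using hypothetical data (two independent variables and one dependent variable):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical observations: two independent variables per row.
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]], dtype=float)
y = np.array([8.1, 7.0, 17.2, 16.1, 24.9])

# Fit the best linear relationship between the predictors and the response.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)     # one coefficient per independent variable

# Predict the dependent variable for a new observation.
print(model.predict([[6.0, 4.0]]))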
Classification: This is a method of data mining in which a collection of data is categorized so that a
greater degree of accuracy can be predicted and analyzed. An effective way to analyze very large
datasets is to classify them. Classification is one of several methods aimed at improving the efficiency of
the analysis process. A Logistic Regression and a Discriminant Analysis stand out as two major
classification techniques.
Logistic Regression: It can also be applied to machine learning applications and predictive analytics. In
this approach, the dependent variable is either binary (binary regression) or multinomial (multinomial
regression): either one of the two or a set of one, two, three, or four options. With a logistic regression
equation, one can estimate probabilities regarding the relationship between the independent variable
and the dependent variable. For understanding logistic regression analysis in detail, you can refer to
logistic regression.
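A minimal sketch with scikit-learn, using a hypothetical binary example (predicting purchase versus no purchase from a single numeric feature):

from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours spent on a product page and whether a purchase was made.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict([[1.2], [3.2]]))     # predicted classes for new values
print(model.predict_proba([[3.2]]))      # estimated class probabilities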
Discriminant Analysis: A Discriminant Analysis is a statistical method of analyzing data based on the
measurements of categories or clusters and categorizing new observations into one or more populations
that were identified a priori. The discriminant analysis models each response class independently then
uses Bayes’s theorem to flip these projections around to estimate the likelihood of each response
category given the value of X. These models can be either linear or quadratic.
Linear Discriminant Analysis: According to Linear Discriminant Analysis, each observation is assigned a
discriminant score to classify it into a response variable class. By combining the independent variables in
a linear fashion, these scores can be obtained. Based on this model, observations are drawn from a
Gaussian distribution, and the predictor variables are assumed to share a common covariance structure across all k levels of the response variable Y.
Correlation Analysis: In statistical terms, correlation analysis captures the relationship between
variables in a pair. The value of such variables is usually stored in a column or rows of a database table
and represents a property of an object.
Regression Analysis: Based on a set of numeric data, regression is a data mining method that predicts a
range of numerical values (also known as continuous values). You could, for instance, use regression to
predict the cost of goods and services based on other variables. A regression model is used across
numerous industries for forecasting financial data, modeling environmental conditions, and analyzing
trends.
Clustering :
Clustering is similar to classification except that the groups are not predefined, but rather defined by
the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be
thought of as partitioning or segmenting the data into groups that might or might not be disjointed. The
clustering is usually accomplished by determining the similarity among the data on predefined
attributes. The most similar data are grouped into clusters.
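A minimal clustering sketch with scikit-learn's K-Means on hypothetical two-dimensional points; the groups are not predefined but discovered from the data:

from sklearn.cluster import KMeans

# Hypothetical points forming two loose groups.
points = [[1.0, 1.2], [0.8, 1.1], [1.1, 0.9],
          [7.9, 8.1], [8.2, 7.8], [8.0, 8.3]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # the two cluster centroids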
• Regression − Regression issues deal with the estimation of an output value based on input values. When utilized for classification, the input values are values from the database and the output values define the classes. Regression can be used to solve classification problems, but it is also used for other applications, including forecasting. The elementary form of regression is simple linear regression, which involves only one predictor and a prediction.
Regression can be used to implement classification using two various methods which are as follows −
• Division − The data are divided into regions based on class.
• Prediction − Formulas are created to predict the output class’s value.
• Bayesian Classification − Statistical classifiers are used for classification. Bayesian classification is based on Bayes’ theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases.
Bayes Theorem − Let X be a data tuple. In the Bayesian method, X is treated as “evidence.” Let H be some hypothesis, such as that the data tuple X belongs to a particular class C. The probability P(H|X) is determined in order to classify the data. This probability P(H|X) is the probability that hypothesis H holds given the “evidence,” or observed data tuple X.
P(H|X) is the posterior probability of H conditioned on X. For instance, suppose the set of data tuples is limited to users described by the attributes age and income, and that X is a 30-year-old user with Rs. 20,000 income. Assume that H is the hypothesis that the user will purchase a computer. Then P(H|X) reflects the probability that user X will purchase a computer given that the user’s age and income are known.
P (H) is the prior probability of H. For instance, this is the probability that any given user will purchase
a computer, regardless of age, income, or some other data. The posterior probability P (H|X) is located
on more data than the prior probability P (H), which is free of X.
Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user X is 30 years old and earns Rs. 20,000, given that we know the user will purchase a computer.
P (H), P (X|H), and P (X) can be measured from the given information. Bayes theorem supports a
method of computing the posterior probability P (H|X), from P (H), P (X|H), and P(X). It is given by
P(H|X) = P(X|H) P(H) / P(X)
Bayes’ rule allows us to assign probabilities of hypotheses given a data value, P(hj | Xi). Here we discuss tuples, when in actuality each Xi may be an attribute value or other data label. Each hj may be an attribute value, a set of attribute values (such as a range), or even a combination of attribute values.
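A tiny numeric sketch of the theorem using the computer-purchase example above; the three input probabilities are hypothetical values chosen only to show the calculation:

# Hypothetical probabilities for the computer-purchase example.
p_h = 0.40            # P(H): prior probability that any user buys a computer
p_x_given_h = 0.30    # P(X|H): probability a buyer is 30 years old with Rs. 20,000 income
p_x = 0.20            # P(X): probability that any user matches that age/income profile

# Bayes theorem: posterior probability P(H|X) = P(X|H) * P(H) / P(X).
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)    # 0.6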
Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a
population parameter to the test. It is used to estimate the relationship between 2 statistical
variables.
• A teacher assumes that 60% of his college's students come from lower-middle-class families.
• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.
Now that you know about hypothesis testing, look at the two types of hypothesis testing in
statistics.
Z = ( x̅ – μ0 ) / (σ / √n)
• Here, x̅ is the sample mean, μ0 is the hypothesized population mean, σ is the population standard deviation, and n is the sample size.
The null hypothesis is typically an equality hypothesis between population parameters; for
example, a null hypothesis may claim that the population means return equals zero. The
alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population means
the return is not equal to zero). As a result, they are mutually exclusive, and only one can be
correct. One of the two possibilities, however, will always be correct.
The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected.
The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the
alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.
A sanitizer manufacturer claims that its product kills 95 percent of germs on average.
To put this company's claim to the test, create a null and alternate hypothesis.
Let's consider a hypothesis test for the average height of women in the United States. Suppose
our null hypothesis is that the average height is 5'4". We gather a sample of 100 women and
determine that their average height is 5'5". The standard deviation of population is 2.
z = ( x̅ – μ0 ) / (σ / √n)
z = (65 – 64) / (2 / √100) = 1 / 0.2
z = 5.0
We will reject the null hypothesis, as the z-score of 5.0 is much larger than the usual critical value of 1.96, and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".
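A minimal sketch of this z-test in Python, using the figures from the example above (heights expressed in inches):

import math
from scipy.stats import norm

# Sample mean 65 in (5'5"), hypothesized mean 64 in (5'4"),
# population standard deviation 2, sample size 100.
x_bar, mu0, sigma, n = 65.0, 64.0, 2.0, 100

z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = norm.sf(z)              # one-sided p-value for "greater than"
print(z, p_value)                 # z = 5.0, p-value far below 0.05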
It is critical to rephrase your original research hypothesis (the prediction that you wish to study)
as a null (Ho) and alternative (Ha) hypothesis so that you can test it quantitatively. Your first
hypothesis, which predicts a link between variables, is generally your alternate hypothesis. The
null hypothesis predicts no link between the variables of interest.
Step 2: Gather Data
For a statistical test to be legitimate, sampling and data collection must be done in a way that is
meant to test your hypothesis. You cannot draw statistical conclusions about the population you
are interested in if your data is not representative.
Other statistical tests are available, but they all compare within-group variance (how to spread
out the data inside a category) against between-group variance (how different the categories
are from one another). If the between-group variation is big enough that there is little or no
overlap between groups, your statistical test will display a low p-value to represent this. This
suggests that the disparities between these groups are unlikely to have occurred by accident.
Alternatively, if there is a large within-group variance and a low between-group variance, your
statistical test will show a high p-value. Any difference you find across groups is most likely
attributable to chance. The variety of variables and the level of measurement of your obtained
data will influence your statistical test selection.
Your statistical test results must determine whether your null hypothesis should be rejected or
not. In most circumstances, you will base your judgment on the p-value provided by the
statistical test. In most circumstances, your preset level of significance for rejecting the null
hypothesis will be 0.05 - that is, when there is less than a 5% likelihood that these data would
be seen if the null hypothesis were true. In other circumstances, researchers use a lower level
of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null
hypothesis.
The findings of hypothesis testing will be discussed in the results and discussion portions of
your research paper, dissertation, or thesis. You should include a concise overview of the data
and a summary of the findings of your statistical test in the results section. You can talk about
whether your results confirmed your initial hypothesis or not in the conversation. Rejecting or
failing to reject the null hypothesis is a formal term used in hypothesis testing. This is likely a
must for your statistics assignments.
T Test
A statistical test called a t-test is employed to compare the means of two groups. To determine
whether two groups differ or if a procedure or treatment affects the population of interest, it is
frequently used in hypothesis testing.
Chi-Square
You utilize a Chi-square test for hypothesis testing concerning whether your data is as
predicted. To determine if the expected and observed results are well-fitted, the Chi-square test
analyzes the differences between categorical variables from a random sample. The test's
fundamental premise is that the observed values in your data should be compared to the
predicted values that would be present if the null hypothesis were true.
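A minimal sketch of both tests with SciPy, using hypothetical samples and a hypothetical 2x2 contingency table:

from scipy.stats import ttest_ind, chi2_contingency

# Hypothetical samples for a two-sample t-test: do the group means differ?
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
group_b = [11.2, 11.5, 11.0, 11.4, 11.3, 11.6]
t_stat, t_p = ttest_ind(group_a, group_b)
print(t_stat, t_p)

# Hypothetical contingency table (rows: treatment/control, columns: improved/not).
table = [[30, 10],
         [20, 20]]
chi2, chi_p, dof, expected = chi2_contingency(table)
print(chi2, chi_p)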
1. Null Hypothesis (H0): Represents the default assumption, stating that there is no significant
effect or relationship in the data.
2. Alternative Hypothesis (Ha): Contradicts the null hypothesis and proposes a specific effect or
relationship that researchers want to investigate.
3. Nondirectional Hypothesis: An alternative hypothesis that doesn't specify the direction of the
effect, leaving it open for both positive and negative possibilities.
Hypothesis testing also adds rigor and validity: it adds scientific rigor to research by using statistical methods to analyze data, ensuring that conclusions are based on sound statistical evidence.
Distance Measures
Distance measures provide the foundations for many popular and effective machine learning algorithms like KNN (K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning.
Overview:
1. Hamming Distance
2. Euclidean Distance
3. Manhattan Distance
4. Minkowski Distance
5. Mahalanobis Distance
6. Cosine Similarity
Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house) or an event (such as a purchase, a claim, or a diagnosis).
Perhaps, the most likely way we can encounter distance measures is when
we are using a specific machine learning algorithm that uses distance
measures at its core. The most famous algorithm is KNN — [K-Nearest
Neighbours Algorithm]
KNN:
A few of the more popular machine learning algorithms that use distance measures at their core are:
1. K-Nearest Neighbours (KNN)
2. K-Means Clustering
• Perhaps the most widely known kernel method is the Support Vector Machine (SVM) algorithm.
• An example set might have real-valued, boolean, categorical, and ordinal values.
• Numerical values may have different scales. This can greatly impact the calculation of the distance measure, and it is often a good practice to normalize or standardize numerical values prior to calculating the distance measure.
• The calculation of the errors, such as the mean squared error or mean
absolute error, may resemble a standard distance measure.
As we can see, distance measures play an important role in machine learning.
1. Hamming Distance
2. Euclidean Distance
3. Manhattan Distance
4. Minkowski Distance
5. Mahalanobis
HAMMING DISTANCE
Hamming distance calculates the distance between two binary vectors, also
referred to as binary strings or bitstrings
For example, suppose the colours red and green are one-hot encoded as bitstrings. The distance between red and green could be calculated as the sum or the average number of bit differences between the two bitstrings. This is the Hamming distance.
For a One-hot encoded string, it might make more sense to summarize the
sum of the bit difference between the strings, which will always be a 0 or 1.
For bitstrings that may have many 1 bits, it is more common to calculate the
average number of bit differences to give a hamming distance score between
0(identical) and 1 (all different).
We can also perform the same calculation using the hamming() function from SciPy.
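A minimal sketch, assuming two example bitstrings that differ in two of six positions (an assumption that reproduces the value printed below):

from scipy.spatial.distance import hamming

# Two example bitstrings (assumed here) that differ in 2 of their 6 positions.
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]

# Manual Hamming distance: average number of differing bits.
manual = sum(abs(e1 - e2) for e1, e2 in zip(row1, row2)) / len(row1)
print(manual)

# The same calculation with SciPy's hamming() function.
print(hamming(row1, row2))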
Running the example, we can see that we get the same result, confirming our manual implementation.
0.33333333333333
Euclidean Distance
You are most likely to use Euclidean distance when calculating the distance between two rows of data that have numerical values, such as floating-point or integer values.
Euclidean distance is calculated as the square root of the sum of the squared
differences between the two vectors.
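A minimal sketch, assuming two example rows of numeric data (values chosen so that they reproduce the result printed below):

from math import sqrt
from scipy.spatial.distance import euclidean

# Two example rows of numeric data (assumed values).
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# Manual calculation: square root of the sum of the squared differences.
manual = sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(row1, row2)))
print(manual)

# The same calculation with SciPy's euclidean() function.
print(euclidean(row1, row2))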
Running the example, we can see we get the same result, confirming our
manual implementation.
6.082762530298219
• The Manhattan distance, also called the Taxicab distance or the City
Block distance, calculates the distance between two real-valued
vectors.
The Manhattan distance is related to the L1 vector norm and the sum
absolute error and mean absolute error metric.
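A minimal sketch using the same assumed example rows as in the Euclidean section:

from scipy.spatial.distance import cityblock

# The same assumed example rows as before.
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# Manual calculation: sum of the absolute differences.
manual = sum(abs(e1 - e2) for e1, e2 in zip(row1, row2))
print(manual)

# SciPy computes the Manhattan (city block) distance with cityblock().
print(cityblock(row1, row2))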
Running the example reports the Manhattan distance between the two
vectors.
13
Running the example, we can see we get the same result, confirming our
manual implementation.
13
Minkowski Distance
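The Minkowski distance generalizes the Euclidean and Manhattan distances through an order parameter p. A minimal sketch, again using the assumed example rows:

from scipy.spatial.distance import minkowski

# The same assumed example rows as before.
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

def minkowski_distance(a, b, p):
    # p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance.
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

print(minkowski_distance(row1, row2, 1), minkowski_distance(row1, row2, 2))

# The same calculation with SciPy's minkowski() function.
print(minkowski(row1, row2, 1), minkowski(row1, row2, 2))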
Running the example first calculates and prints the Minkowski distance
with p set to 1 to give the Manhattan distance, then with p set to 2 to give the
Euclidean distance, matching the values calculated on the same data from
the previous sections.
13.0
6.082762530298219
Running the example, we can see we get the same results, confirming our
manual implementation.
13.0
6.082762530298219
Mahalanobis Distance
import pandas as pd

filepath = 'local/input.csv'                      # path to the example data file
df = pd.read_csv(filepath).iloc[:, [0, 4, 6]]     # keep three numeric columns of interest
df.head()
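Since the contents of the CSV file are not shown, the sketch below uses hypothetical numeric data in place of the three selected columns and computes the Mahalanobis distance of one observation from the data's mean, using the inverse covariance matrix:

import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical numeric data standing in for the three selected columns.
data = np.array([[64.0, 580.0, 29.0],
                 [66.0, 570.0, 33.0],
                 [68.0, 590.0, 37.0],
                 [69.0, 660.0, 46.0],
                 [73.0, 600.0, 55.0]])

# Inverse covariance matrix of the data set.
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

# Mahalanobis distance of the first observation from the mean vector.
point = data[0]
mean = data.mean(axis=0)
print(mahalanobis(point, mean, cov_inv))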
Cosine Distance:
Mostly Cosine distance metric is used to find similarities between different
documents. In cosine metrics, we measure the degree of angle between two
documents/vectors(the term frequencies in different documents collected as
metrics). This particular metric is used when the magnitude between vectors
does not matter but the orientation.
Cosine similarity formula can be derived from the equation of dot products:-
Now, you must be thinking about which value of cosine angle will be helpful
in finding out the similarities.
Now that we have the values which will be considered in order to measure
the similarities, we need to know what do 1, 0, and -1 signify.
Here a cosine value of 1 is for vectors pointing in the same direction, i.e. there are similarities between the documents/data points. A value of 0 is for orthogonal vectors, i.e. unrelated (no similarity). A value of -1 is for vectors pointing in opposite directions (completely dissimilar).
sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)
Running the example, we can see we get the same results, confirming our
manual implementation.
0.972284251712
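As an additional minimal usage sketch (the term-frequency vectors here are hypothetical, so the printed value will differ from the figure above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-frequency vectors for two short documents.
a = np.array([1, 3, 0, 2, 0])
b = np.array([2, 1, 0, 1, 1])

# Manual calculation: dot product divided by the product of the vector norms.
manual = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(manual)

# The same calculation with scikit-learn (expects 2-D inputs).
print(cosine_similarity([a], [b]))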
Decision Tree Mining is a type of data mining technique that is used to build Classification
Models. It builds classification models in the form of a tree-like structure, just like its name.
In supervised learning, the target result is already known. Decision trees can be used for both
categorical and numerical data. The categorical data represent gender, marital status, etc.
while the numerical data represent age, temperature, etc.
An example of a decision tree with the dataset is shown below.
Decision tree induction is the method of learning the decision trees from the training set. The
training set consists of attributes and class labels. Applications of decision tree induction
include astronomy, financial analysis, medical diagnosis, manufacturing, and production.
A decision tree is a flowchart tree-like structure that is made from training set tuples. The
dataset is broken down into smaller subsets and is present in the form of nodes of a tree. The
tree structure has a root node, internal nodes or decision nodes, leaf node, and branches.
The root node is the topmost node. It represents the best attribute selected for classification. Internal nodes, or decision nodes, represent a test on an attribute of the dataset, while a leaf node, or terminal node, represents the classification or decision label. The branches show the outcome of the test performed.
1. Decision tree classification does not require any domain knowledge, hence, it is
appropriate for the knowledge discovery process.
2. The representation of data in the form of the tree is easily understood by humans
and it is intuitive.
3. It can handle multidimensional data.
4. It is a quick process with great accuracy.
Disadvantages Of Decision Tree Classification
Given below are the various demerits of Decision Tree Classification:
1. Sometimes decision trees become very complex and these are called overfitted
trees.
2. The decision tree algorithm may not be an optimal solution.
3. The decision trees may return a biased solution if some class label dominates it.
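A minimal decision tree induction sketch with scikit-learn on the bundled iris dataset (the parameter choices below, such as the depth limit, are illustrative only):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labeled dataset and split it into training and test tuples.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Induce a decision tree from the training set; "entropy" uses information gain,
# and limiting the depth is a simple form of pre-pruning against overfitting.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))   # classification accuracy on unseen tuples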
Neural Network:
Neural Network is an information processing paradigm that is inspired by the human nervous system. As
in the Human Nervous system, we have Biological neurons in the same way in Neural networks we have
Artificial Neurons which is a Mathematical Function that originates from biological neurons. The human
brain is estimated to have around 10 billion neurons each connected on average to 10,000 other
neurons. Each neuron receives signals through synapses that control the effects of the signal on the
neuron.
While there are numerous different neural network architectures that have been created by
researchers, the most successful applications in data mining neural networks have been multilayer
feedforward networks. These are networks in which there is an input layer consisting of nodes that
simply accept the input values and successive layers of nodes that are neurons as depicted in the above
figure of Artificial Neuron. The outputs of neurons in a layer are inputs to neurons in the next layer. The
last layer is called the output layer. Layers between the input and output layers are known as hidden
layers.
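A minimal multilayer feedforward network sketch with scikit-learn (the hidden-layer size and iteration count are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Input layer: 4 iris features; one hidden layer of 8 neurons; output layer: 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=1)
net.fit(X_train, y_train)            # signals flow forward, layer by layer
print(net.score(X_test, y_test))     # accuracy on the held-out tuples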
Rule Based Algorithm:
A rule-based data mining classifier is a direct approach for data mining. This classifier is simple and more easily interpretable than many other data mining algorithms. It learns sets of rules that are expressed using IF-THEN clauses, and it works very well with both numerical and categorical data.
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Points to remember −
The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
The consequent part (the conclusion) consists of a class prediction.
If the condition holds true for a given tuple, then the antecedent is satisfied.
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.
Points to remember −
One rule is created for each path from the root to a leaf node, because the path to each leaf in a decision tree corresponds to a rule.
A sequential covering algorithm can also be used to extract IF-THEN rules directly from the training data; we do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process continues for the remaining tuples.
Note − The Decision tree induction can be considered as learning a set of rules simultaneously.
The following is the sequential learning algorithm, where rules are learned for one class at a time. When learning a rule for a class Ci, we want the rule to cover all the tuples of class Ci only and no tuple from any other class.
Method:
Rule_Set = { };   // the set of learned rules is initially empty
for each class c do
   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);   // D: training tuples, Att_vals: attributes and their values
      remove the tuples covered by Rule from D;
      Rule_Set = Rule_Set + Rule;   // add the new rule to the rule set
   until terminating condition;
end for
return Rule_Set;
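The pseudocode above is abstract; below is a small, self-contained Python sketch of the same idea. It is a simplified sequential-covering learner using single attribute-value tests, not AQ, CN2, or RIPPER themselves, and the tiny data set is purely illustrative.

# Simplified sequential covering: for each class, repeatedly pick the single
# attribute test with the best accuracy on the remaining tuples, emit it as an
# IF-THEN rule, and remove the tuples it covers.
def learn_one_rule(rows, target_class):
    best = None
    for attr in rows[0]:
        if attr == "class":
            continue
        for value in {r[attr] for r in rows}:
            covered = [r for r in rows if r[attr] == value]
            correct = sum(1 for r in covered if r["class"] == target_class)
            score = correct / len(covered)
            if correct and (best is None or score > best[0]):
                best = (score, attr, value)
    return best  # (accuracy, attribute, value) or None

def sequential_covering(rows, classes):
    rules = []
    for c in classes:
        remaining = list(rows)
        while any(r["class"] == c for r in remaining):
            found = learn_one_rule(remaining, c)
            if found is None:
                break
            _, attr, value = found
            rules.append(f"IF {attr} = {value} THEN class = {c}")
            # remove the tuples covered by the rule and continue with the rest
            remaining = [r for r in remaining if r[attr] != value]
    return rules

data = [
    {"age": "youth",  "student": "yes", "class": "buys"},
    {"age": "youth",  "student": "no",  "class": "no_buy"},
    {"age": "senior", "student": "no",  "class": "no_buy"},
    {"age": "middle", "student": "yes", "class": "buys"},
]
print(sequential_covering(data, ["buys", "no_buy"]))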
There are other classification methods as well, such as Genetic Algorithms, the Rough Set Approach, and the Fuzzy Set Approach.
Data mining involves the use of sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates analysis and prediction.
In recent data mining projects, various major data mining techniques have been developed and used,
including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. This data
mining technique helps to classify data in different classes.
Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved. For example, an object-oriented database, a transactional database, a relational database, and so on.
Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities. For example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive frameworks offering several data mining functionalities together.
Classification of data mining frameworks as per the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account, the level of user interaction involved in the data mining
procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details, but achieves simplification: the data is modeled by its clusters.
From a historical point of view, clustering is rooted in data modeling, statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications.
For example, scientific data exploration, text mining, information retrieval, spatial database applications,
CRM, Web analysis, computational biology, medical diagnostics, and much more.
In other words, we can say that clustering analysis is a data mining technique used to identify similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
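A minimal sketch of such a regression model, assuming scikit-learn and entirely made-up historical figures for demand, competition and cost, is shown below.

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical historical data: [consumer_demand, number_of_competitors] -> cost
X = np.array([[100, 2], [150, 3], [200, 3], [250, 4], [300, 5]])
y = np.array([10.0, 12.5, 15.0, 17.0, 19.5])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # the fitted relationship between the variables
print(model.predict([[220, 4]]))       # projected cost for a new scenario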
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern
in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to help discover sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have various data, for example a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together, using the three measures defined below (a small worked sketch follows their definitions).
Lift:
This measure compares the confidence of the rule with how often item B is purchased overall; it tells us how much more likely B is to be purchased when A is purchased than would be expected by chance.
Support:
This measure tells how often the items are purchased together, relative to the total number of transactions in the dataset.
Confidence:
This measurement technique measures how often item B is purchased when item A is purchased as well.
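The worked sketch below computes all three measures for a single hypothetical rule, {bread} -> {butter}, over a made-up list of grocery transactions.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
n = len(transactions)

count_a  = sum(1 for t in transactions if "bread" in t)             # A = bread
count_b  = sum(1 for t in transactions if "butter" in t)            # B = butter
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t) # A and B together

support    = count_ab / n               # how often A and B are purchased together
confidence = count_ab / count_a         # how often B is purchased when A is purchased
lift       = confidence / (count_b / n) # confidence adjusted for how common B is

print(support, confidence, lift)   # 0.6, 0.75, 0.9375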
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.
The outlier is a data point that diverges too much from the rest of the dataset. The majority of the real-
world datasets have an outlier. Outlier detection plays a significant role in the data mining field. Outlier
detection is valuable in numerous fields like network intrusion identification, credit or debit card fraud detection, detecting outliers in wireless sensor network data, etc.
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc. In other words, this technique of data mining helps to discover or recognize similar patterns in transaction data over some period of time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
UNIT-III
Clustering:
The process of making a group of abstract objects into classes of similar objects is
known as clustering.
Points to Remember:
In the process of cluster analysis, the first step is to partition the set of data into groups
with the help of data similarity, and then groups are assigned to their respective labels.
It is widely used in many applications such as image processing, data analysis, and
pattern recognition.
It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
Clustering Methods:
Model-Based Method
Hierarchical Method
Constraint-Based Method
Grid-Based Method
Partitioning Method
Density-Based Method
The following are some of the requirements of clustering in data mining.
Ability to deal with different kinds of attributes – Algorithms should be able to work
with the type of data such as categorical, numerical, and binary data.
Discovery of clusters with attribute shape – The algorithm should be able to detect
clusters in arbitrary shapes and it should not be bounded to distance measures.
High dimensionality – The algorithm should be able to handle high dimensional space
instead of only handling low dimensional data.
Similarity Measures:
Similarity, or a similarity/distance measure, is a basic building block of data mining and is widely used in recommendation engines, clustering techniques, and anomaly detection.
Whenever you analyze a data set, you will have some assumptions about how the data was generated. If you find data points that are likely to contain some form of error, these are outliers, and depending on the context, you may want to correct or remove those errors.
The data mining process involves the analysis and prediction of the information that the data holds. In 1969, Grubbs introduced the first definition of outliers.
Global Outliers
Global outliers are also called point outliers. Global outliers are the simplest form of outliers. When a data point deviates from all the rest of the data points in a given data set, it is known as a global outlier. In most cases, outlier detection procedures are targeted at determining global outliers.
Collective Outliers
In a given data set, when a group of data points deviates from the rest of the data, it is called a collective outlier. Here, the individual data objects may not be outliers, but when you consider the data objects as a whole, they may behave as outliers.
To identify collective outliers, you need background information about the relationships among the data objects that show the outlying behavior.
For example, in an intrusion detection system, a denial-of-service (DoS) packet sent from one system to another may be taken as normal behavior. However, if this happens on various computers simultaneously, it is considered abnormal behavior, and such data points as a whole are called collective outliers.
Contextual Outliers
As the name suggests, "contextual" means this type of outlier is identified within a context. Such outliers occur when a data object deviates from the other data points because of a specific condition in the given data set. For example, in speech recognition, a burst of background noise may be an outlier only within the context of an otherwise quiet recording. Contextual outliers are also known as conditional outliers.
As we know, data objects have two types of attributes:
contextual attributes
behavioral attributes.
Contextual outlier analysis enables the users to examine outliers in different contexts
and conditions, which can be useful in various applications.
Outliers Analysis
Outliers are often discarded when data mining is applied, but outlier analysis is still used in many applications such as fraud detection, medical analysis, etc. This is usually because events that occur rarely can carry much more significant information than events that occur more regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through
outlier analysis in data mining.
Fraud detection in the telecom industry
In market analysis, outlier analysis enables marketers to identify the customer's
behaviors.
The process in which the behavior of the outliers in a dataset is identified is called outlier analysis. It is also known as "outlier mining", and it is regarded as a significant task of data mining.
Hierarchical Clustering:
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed.
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another, and it keeps doing so until all of the groups are merged into one or until the termination condition holds.
Algorithm:
# compute the distance matrix; it is symmetric about the primary diagonal,
# so we compute only the lower part of the primary diagonal
for i = 1 to N:
    for j = 1 to i:
        dis_mat[i][j] = distance(d_i, d_j)
# initially, each data point is a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.
Step-2: In the second step, comparable clusters are merged together to form a single cluster. Let's say cluster (B) and cluster (C) are very similar to each other, so we merge them in the second step; similarly for cluster (D) and cluster (E). At last, we get the clusters [(A), (BC), (DE), (F)].
Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process; The clusters DEF and BC are comparable and
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].
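A minimal sketch of the same bottom-up merging, assuming SciPy and purely illustrative 2-D coordinates for the points A–F, is given below.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# illustrative coordinates for the six points A, B, C, D, E, F
points = np.array([[1.0, 1.0],   # A
                   [2.0, 1.0],   # B
                   [2.2, 1.1],   # C
                   [5.0, 5.0],   # D
                   [5.1, 5.2],   # E
                   [8.0, 8.0]])  # F

# each point starts as a singleton cluster; the two closest clusters are merged repeatedly
Z = linkage(points, method="single")
print(Z)                                        # the sequence of merges and their distances
print(fcluster(Z, t=3, criterion="maxclust"))   # cut the hierarchy into 3 clusters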
2. Divisive:
This is the top-down approach. We start with all of the data objects in the same cluster, and in each iteration a cluster is split into smaller clusters until each object is in its own cluster or a termination condition holds.
Partitioning Algorithms:
Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters that has to be generated for the clustering method.
In the partitioning method, when a database (D) contains multiple (N) objects, the partitioning method constructs a user-specified number (K) of partitions of the data in which each partition represents a cluster and a particular region.
There are many algorithms that come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc. Here, we will see the working of the K-Means algorithm in detail.
K-Means (A centroid-based Technique):
The K means algorithm takes the input parameter K from the user and partitions the dataset
containing N objects into K clusters so that resulting similarity among the data objects inside
the group (intra cluster) is high but the similarity of data objects with the data objects from
outside the cluster is low (inter cluster).
The similarity of a cluster is determined with respect to the mean value of the cluster. It is a type of squared-error algorithm. At the start, K objects are randomly chosen from the dataset, each of which represents a cluster mean (centre).
For the rest of the data objects, they are assigned to the nearest cluster based on their distance
from the cluster mean.
The new mean of each cluster is then calculated with the newly assigned data objects.
Algorithm: k-means
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing N objects
Output: a set of K clusters
Method:
1. Arbitrarily choose K objects from D as the initial cluster centres.
2. (Re)assign each object to the cluster to which the object is most similar, based upon the mean values of the clusters.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated values.
4. Repeat steps 2 and 3 until there is no change.
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. Therefore, we get the clusters (16-29) and (36-66) as the 2 clusters obtained using the K-Means algorithm.
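A minimal sketch reproducing this worked example with scikit-learn is shown below; with K = 2 the algorithm typically converges to the same two groups, although the exact labels depend on the random initialization.

import numpy as np
from sklearn.cluster import KMeans

values = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                   36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
for label in sorted(set(km.labels_)):
    # print each cluster's mean (centre) and its members
    print(label, km.cluster_centers_[label][0], values[km.labels_ == label].ravel())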
Association rules :
Medicine. Doctors can use association rules to help diagnose patients. There
are many variables to consider when making a diagnosis, as many diseases
share symptoms. By using association rules and machine learning-fueled data
analysis, doctors can determine the conditional probability of a given illness by
comparing symptom relationships in the data from past cases. As new
diagnoses get made, machine learning models can adapt the rules to reflect the
updated data.
User experience (UX) design. Developers can collect data on how consumers
use a website they create. They can then use associations in the data to optimize
the website user interface -- by analyzing where users tend to click and what
maximizes the chance that they engage with a call to action, for example.
Entertainment. Services like Netflix and Spotify can use association rules to
fuel their content recommendation engines. Machine learning models analyze
past user behavior data for frequent patterns, develop association rules and use
those rules to recommend content that a user is likely to engage with, or organize
content in a way that is likely to put the most interesting content for a given user
first.
How do association rules work?
Association rule mining, at a basic level, involves the use of machine
learning models to analyze data for patterns, or co-occurrences, in a database. It
identifies frequent if-then associations, which themselves are the association rules.
An association rule has two parts: an antecedent (if) and a consequent (then). An
antecedent is an item found within the data.
With the AIS algorithm, itemsets are generated and counted as it scans the
data. In transaction data, the AIS algorithm determines which large itemsets are contained in a transaction, and new candidate itemsets are created by extending
the large itemsets with other items in the transaction data.
At the end of the pass, the support count of candidate itemsets is created by
aggregating the sequential structure. The downside of both the AIS and SETM
algorithms is that each one can generate and count many small candidate
itemsets, according to published materials from Dr. Saed Sayad, author
of Real Time Data Mining.
With the Apriori algorithm, candidate itemsets are generated using only the large
itemsets of the previous pass. The large itemset of the previous pass is joined
with itself to generate all itemsets with a size that's larger by one. Each
generated itemset with a subset that is not large is then deleted. The
remaining itemsets are the candidates.
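The join-and-prune step described above can be sketched in a few lines of plain Python; the itemsets used here are purely illustrative, and this is not a full Apriori implementation (support counting over the transactions is omitted).

from itertools import combinations

def apriori_gen(large_prev):
    """large_prev: set of frozensets, the large itemsets of the previous pass (size k-1)."""
    k = len(next(iter(large_prev))) + 1
    # join step: union pairs of (k-1)-itemsets whose union has exactly k items
    candidates = {a | b for a in large_prev for b in large_prev if len(a | b) == k}
    # prune step: delete candidates with a (k-1)-subset that is not large
    return {c for c in candidates
            if all(frozenset(s) in large_prev for s in combinations(c, k - 1))}

large_2 = {frozenset(p) for p in [("bread", "butter"), ("bread", "milk"),
                                  ("butter", "milk"), ("bread", "jam")]}
print(apriori_gen(large_2))
# only {bread, butter, milk} survives; e.g. {bread, butter, jam} is pruned
# because its subset {butter, jam} is not large.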
Parallel Computing:
Parallel computing offers several advantages:
It saves time and money, as many resources working together reduce the time and cut potential costs.
It can take advantage of non-local resources when the local resources are finite.
Types of Parallelism:
Bit-level parallelism –
It is the form of parallel computing that is based on increasing the processor's word size. It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two
16-bit integers. It must first sum up the 8 lower-order bits, then add the 8 higher-order
bits, thus requiring two instructions to perform the operation. A 16-bit processor can
perform the operation with just one instruction.
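The scenario above can be sketched in Python by emulating the two 8-bit additions (low byte, then high byte plus the carry) and comparing the result with a single 16-bit addition; the operand values are arbitrary.

def add16_with_8bit_alu(a, b):
    lo = (a & 0xFF) + (b & 0xFF)                 # first instruction: add the lower 8 bits
    carry = lo >> 8
    hi = ((a >> 8) + (b >> 8) + carry) & 0xFF    # second instruction: add the upper 8 bits plus carry
    return (hi << 8) | (lo & 0xFF)

a, b = 0x1234, 0x0FCD
print(hex(add16_with_8bit_alu(a, b)))   # 0x2201, built from two 8-bit additions
print(hex((a + b) & 0xFFFF))            # the same result in one 16-bit addition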
Instruction-level parallelism –
A processor can issue only a limited number of instructions in each clock cycle. Instructions can be re-ordered and grouped so that they are later executed concurrently without affecting the result of the program. This is called instruction-level parallelism.
Task Parallelism –
Task parallelism employs the decomposition of a task into subtasks and then allocating
each of the subtasks for execution. The processors perform the execution of sub-tasks
concurrently.
The whole real world runs in a dynamic nature, i.e., many things happen at a certain time but at different places concurrently. This data is extremely large and hard to manage.
Real-world data needs more dynamic simulation and modeling, and for achieving this, parallel computing is the key.
Complex, large datasets and their management can be organized only by using the parallel computing approach.
The algorithms must be managed in such a way that they can be handled in a
parallel mechanism.
The algorithms or programs must have low coupling and high cohesion. But it’s
difficult to create such programs.
The computational graph has undergone a great transition from serial computing
to parallel computing.
Tech giants such as Intel have already taken a step towards parallel computing by employing multicore processors.
Parallel computation will revolutionize the way computers work in the future, for the better. With all of the world connecting to each other even more than before, parallel computing plays a key role in helping us stay connected.
Data Collection: The set of relevant data in the database and data
warehouse is collected by query Processing and partitioned into a target
class and one or a set of contrasting classes.
Only the highly relevant dimensions are included in the further analysis.
Synchronous Generalization: The process of generalization is performed
upon the target class to the level controlled by the user or expert specified
dimension threshold, which results in a prime target class relation or
cuboid.
Case 1: An itemset is large in both the original database and the newly inserted transactions.
Case 2: An itemset is large in the original database, but is not large in the newly inserted transactions.
Case 3: An itemset is not large in the original database, but is large in the
newly inserted transactions.
Case 4: An itemset is not large in the original database and in the newly
inserted transactions.
Since itemsets in Case 1 are large in both the original database and the
new transactions, they will still be large after the weighted average of the
counts. Similarly, itemsets in Case 4 will still be small after the new
transactions are inserted. Thus Cases 1 and 4 will not affect the final
association rules. Case 2 may remove existing association rules, and Case
3 may add new association rules.
3. Seek itemsets that appear only in the newly inserted transactions and
determine whether they are large in the updated database.
For any given multi-item transaction, association rules aim to obtain rules
that determine how or why certain items are linked.
Association rules are created to find information about general if-then patterns, using specific criteria with support and confidence to define what the key relationships are.
Some of the uses of association rules in different fields are given below:
Market Basket Analysis: It is one of the most popular examples and uses
of association rule mining. Big retailers typically use this technique to
determine the association between items.
A cluster is the collection of data objects which are similar to each other within the
same group. The data objects of a cluster are dissimilar to data objects of other
groups or clusters.
Clustering Approaches:
Partitioning methods:
k-medoids
CLARANS
Density-based methods:
DBSCAN
OPTICS
Grid-based methods:
STING
WaveCluster
CLIQUE
Hierarchical methods:
DIANA
AGNES
BIRCH
CHAMELEON
If all the data objects in the cluster are highly similar then the cluster has high
quality. We can measure the quality of Clustering by using the
Dissimilarity/Similarity metric in most situations.
But there are some other methods to measure the Qualities of Good Clustering if
the clusters are alike.
1. Dissimilarity/Similarity metric (a short sketch computing these distances follows this list):
Euclidean distance
Mahalanobis distance
Cosine distance
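A short sketch computing these three distances, assuming SciPy and two illustrative feature vectors, is given below; the Mahalanobis distance additionally needs an inverse covariance matrix, estimated here from a small made-up sample.

import numpy as np
from scipy.spatial import distance

x = np.array([2.0, 4.0, 6.0])
y = np.array([3.0, 3.0, 5.0])

print("Euclidean:", distance.euclidean(x, y))
print("Cosine distance:", distance.cosine(x, y))   # 1 - cosine similarity

# Mahalanobis distance needs the inverse covariance matrix of the data the
# vectors come from; here it is estimated from a small illustrative sample.
sample = np.array([[2, 4, 6], [3, 3, 5], [1, 5, 7], [2, 3, 5], [4, 4, 4]], dtype=float)
VI = np.linalg.inv(np.cov(sample.T))
print("Mahalanobis:", distance.mahalanobis(x, y, VI))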
Let us consider the clustering C1, which contains the sub-clusters s1 and s2,
where the members of the s1 and s2 cluster belong to the same category
according to ground truth. Let us consider another clustering C2 which is
identical to C1 but now s1 and s2 are merged into one cluster.
The quality of those clusterings is then measured by the rag bag method. According to the rag bag method, we should put heterogeneous objects into a rag bag category.
According to the ground truth, this situation is noisy, and the quality of the clustering is measured using the rag bag criterion.
If we define a clustering quality measure Q, then according to the rag bag criterion, C2 will have higher clustering quality than C1, that is, Q(C2, Cg) > Q(C1, Cg).
The small cluster preservation criterion states that splitting a small category into pieces is not advisable, and doing so further decreases the quality of the clustering, since the small pieces can easily be lost as noise.
Suppose clustering C1 splits the data into three clusters, C11 = {d1, . . . , dn}, C12 = {dn+1}, and C13 = {dn+2}.
Let clustering C2 also split the data into three clusters, namely C21 = {d1, . . . , dn−1}, C22 = {dn}, and C23 = {dn+1, dn+2}. As C1 splits the small category of objects while C2 splits the big category, C2 is preferred according to the rule mentioned above, and the clustering quality measure Q should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
UNIT-IV
Data Warehousing:
Introduction :
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
Data warehousing is defined as a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components which aids the strategic use of data.
It is electronic storage of a large amount of information by a business, designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales).
The data may pass through an operational data store and may require data cleansing[2] for additional
operations to ensure data quality before it is used in the data warehouse for reporting.
Extract, transform, load (ETL) and extract, load, transform (ELT) are the two main approaches used to
build a data warehouse system.
Data warehouse characteristics
There are basic features that define the data in the data warehouse that include subject orientation,
data integration, time-variant, nonvolatile data, and data granularity.
Subject-oriented
Unlike the operational systems, the data in the data warehouse revolves around the subjects of the
enterprise. Subject orientation is not database normalization. Subject orientation can be really useful for
decision-making. Gathering the required objects is called subject-oriented.
Integrated
The data found within the data warehouse is integrated. Since it comes from several operational
systems, all inconsistencies must be removed. Consistencies include naming conventions, measurement
of variables, encoding structures, physical attributes of data, and so forth.
Time-variant
While operational systems reflect current values as they support day-to-day operations, data warehouse
data represents a long time horizon (up to 10 years) which means it stores mostly historical data. It is
mainly meant for data mining and forecasting. (E.g. if a user is searching for the buying pattern of a specific customer, the user needs to look at data on current and past purchases.)
Nonvolatile
The data in the data warehouse is read-only, which means it cannot be updated, created, or deleted
(unless there is a regulatory or statutory obligation to do so).[30]
Benefits
• A data warehouse maintains a copy of information from the source transaction systems. This
architectural complexity provides the opportunity to:
• Integrate data from multiple sources into a single database and data model. Greater congregation of data into a single database means a single query engine can be used to present data in an ODS.
• Mitigate the problem of database isolation level lock contention in transaction processing
systems caused by attempts to run large, long-running analysis queries in transaction processing
databases.
• Maintain data history, even if the source transaction systems do not.
• Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.
• Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad
data.
• Present the organization's information consistently.
• Provide a single common data model for all data of interest regardless of the data's source.
• Restructure the data so that it makes sense to the business users.
• Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.
• Add value to operational business applications, notably customer relationship management
(CRM) systems.
Design methods:
Bottom-up design
• In the bottom-up approach, data marts are first created to provide reporting and analytical
capabilities for specific business processes. These data marts can then be integrated to create a
comprehensive data warehouse.
• The data warehouse bus architecture is primarily an implementation of "the bus", a collection
of conformed dimensions and conformed facts, which are dimensions that are shared (in a
specific way) between facts in two or more data marts.
Top-down design
• The top-down approach is designed using a normalized enterprise data model. "Atomic" data,
that is, data at the greatest level of detail, are stored in the data warehouse.
• Dimensional data marts containing data needed for specific business processes or specific
departments are created from the data warehouse.
Hybrid design
• Data warehouses often resemble the hub and spokes architecture. Legacy systems feeding the
warehouse often include customer relationship management and enterprise resource planning,
generating large amounts of data.
• To consolidate these various data models, and facilitate the extract transform load process,
data warehouses often make use of an operational data store, the information from which is
parsed into the actual data warehouse. To reduce data redundancy, larger systems often store
the data in a normalized way. Data marts for specific reports can then be built on top of the data
warehouse.
• A hybrid (also called ensemble) data warehouse database is kept on third normal form to
eliminate data redundancy.
• A normal relational database, however, is not efficient for business intelligence reports where
dimensional modelling is prevalent. Small data marts can shop for data from the consolidated
warehouse and use the filtered, specific data for the fact tables and dimensions required.
• The data warehouse provides a single source of information from which the data marts can
read, providing a wide range of business information. The hybrid architecture allows a data
warehouse to be replaced with a master data management repository where operational (not
static) information could reside.
• The data vault modeling components follow hub and spokes architecture. This modeling style is
a hybrid design, consisting of the best practices from both third normal form and star schema.
The data vault model is not a true third normal form, and breaks some of its rules, but it is a top-
down architecture with a bottom up design.
• The data vault model is geared to be strictly a data warehouse. It is not geared to be end-user
accessible, which, when built, still requires the use of a data mart or star schema-based release
area for business purposes.
• A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), hence they draw data from a limited number of sources such as sales, finance
or marketing.
• Data marts are often built and controlled by a single department within an organization.
• The sources could be internal operational systems, a central data warehouse, or external data.
• Denormalization is the norm for data modeling techniques in this system. Given that data marts
generally cover only a subset of the data contained in a data warehouse, they are often easier
and faster to implement.
• The OLAP approach is used to analyze multidimensional data from multiple sources and
perspectives.
• The three basic operations in OLAP are Roll-up (Consolidation), Drill-down, and Slicing & Dicing.
• Online transaction processing (OLTP) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE).
• OLTP systems emphasize very fast query processing and maintaining data integrity in multi-
access environments.
• In OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF).
• Predictive analytics is about finding and quantifying hidden patterns in the data using complex
mathematical models that can be used to predict future outcomes.
• Predictive analysis is different from OLAP in that OLAP focuses on historical data analysis and is
reactive in nature, while predictive analysis focuses on the future. These systems are also used
for customer relationship management (CRM).
• Data modeling is the process of creating a visual representation of either a whole information
system or parts of it to communicate connections between data points and structures.
• The goal is to illustrate the types of data used and stored within the system, the relationships
among these data types, the ways the data can be grouped and organized and its formats and
attributes.
• Data models are built around business needs. Rules and requirements are defined upfront
through feedback from business stakeholders so they can be incorporated into the design of a
new system or adapted in the iteration of an existing one.
• Data can be modeled at various levels of abstraction. The process begins by collecting
information about business requirements from stakeholders and end users
• These business rules are then translated into data structures to formulate a concrete database
design. A data model can be compared to a roadmap, an architect’s blueprint or any formal
diagram that facilitates a deeper understanding of what is being designed.
• In the design process, database and information system design begins at a high level of abstraction and becomes increasingly more concrete and specific.
• Data models can generally be divided into three categories, which vary according to their
degree of abstraction.
• The process will start with a conceptual model, progress to a logical model and conclude with a
physical model. Each type of data model is discussed in more detail in subsequent sections:
Conceptual data models
• They are also referred to as domain models and offer a big-picture view of what the system will contain, how it will be organized, and which business rules are involved.
• Conceptual models are usually created as part of the process of gathering initial project
requirements. Typically, they include entity classes (defining the types of things that are
important for the business to represent in the data model), their characteristics and constraints,
the relationships between them and relevant security and data integrity requirements. Any
notation is typically simple.
Logical data models
• They are less abstract and provide greater detail about the concepts and relationships in the
domain under consideration.
• One of several formal data modeling notation systems is followed. These indicate data
attributes, such as data types and their corresponding lengths, and show the relationships
among entities.
• Logical data models don’t specify any technical system requirements. This stage is frequently
omitted in agile or DevOps practices.
• Logical data models can be useful in highly procedural implementation environments, or for
projects that are data-oriented by nature, such as data warehouse design or reporting system
development.
Physical data models
• They provide a schema for how the data will be physically stored within a database. As such, they're the least abstract of all.
• They offer a finalized design that can be implemented as a relational database, including
associative tables that illustrate the relationships among entities as well as the primary keys and
foreign keys that will be used to maintain those relationships.
• Physical data models can include database management system (DBMS)-specific properties,
including performance tuning.
Data modeling process
• As a discipline, data modeling invites stakeholders to evaluate data processing and storage in
painstaking detail.
• Data modeling techniques have different conventions that dictate which symbols are used to
represent the data, how models are laid out, and how business requirements are conveyed.
• All approaches provide formalized workflows that include a sequence of tasks to be performed
in an iterative manner.
The process of data modeling begins with the identification of the things, events or concepts that are
represented in the data set that is to be modeled. Each entity should be cohesive and logically discrete
from all others.
Identify key properties of each entity. Each entity type can be differentiated from all others because it
has one or more unique properties, called attributes. For instance, an entity called “customer” might
possess such attributes as a first name, last name, telephone number and salutation, while an entity
called “address” might include a street name and number, a city, state, country and zip code.
• This will ensure the model reflects how the business will use the data. Several formal data
modeling patterns are in widespread use.
• Object-oriented developers often apply analysis patterns or design patterns, while stakeholders
from other business domains may turn to other patterns.
• Assign keys as needed, and decide on a degree of normalization that balances the need to
reduce redundancy with performance requirements.
• Normalization is a technique for organizing data models (and the databases they represent) in
which numerical identifiers, called keys, are assigned to groups of data to represent
relationships between them without repeating the data.
• For instance, if customers are each assigned a key, that key can be linked to both their address
and their order history without having to repeat this information in the table of customer
names.
• Normalization tends to reduce the amount of storage space a database will require, but it can come at a cost to query performance.
• Finalize and validate the data model. Data modeling is an iterative process that should be
repeated and refined as business needs change.
Data modeling has evolved alongside database management systems, with model types increasing in
complexity as businesses' data storage needs have grown.
Hierarchical data models represent one-to-many relationships in a treelike format. In this type of model,
each record has a single root or parent which maps to one or more child tables.
This model was implemented in the IBM Information Management System (IMS), which was introduced
in 1966 and rapidly found widespread use, especially in banking. Though this approach is less efficient
than more recently developed database models, it’s still used in Extensible Markup Language (XML)
systems and geographic information systems (GISs).
Relational data models were initially proposed by IBM researcher E.F. Codd in 1970.
• They are still implemented today in the many different relational databases commonly used in
enterprise computing.
• Relational data modeling doesn’t require a detailed understanding of the physical properties of
the data storage being used. In it, data segments are explicitly joined through the use of tables,
reducing database complexity.
• Relational databases frequently employ structured query language (SQL) for data management.
• These databases work well for maintaining data integrity and minimizing redundancy. They’re
often used in point-of-sale systems, as well as for other types of transaction processing.
• Entity-relationship (ER) data models use formal diagrams to represent the relationships between
entities in a database.
• Several ER modeling tools are used by data architects to create visual maps that convey
database design objectives.
• Object-oriented data models emerged as object-oriented programming became popular in the mid-1990s. The “objects” involved are abstractions of real-world entities. Objects are grouped in class hierarchies and have associated features.
• Object-oriented databases can incorporate tables, but can also support more complex data
relationships. This approach is employed in multimedia and hypertext databases as well as other
use cases.
• Dimensional data models were developed by Ralph Kimball, and they were designed to optimize
data retrieval speeds for analytic purposes in a data warehouse.
• While relational and ER models emphasize efficient storage, dimensional models increase
redundancy in order to make it easier to locate information for reporting and retrieval. This
modeling is typically used across OLAP systems.
• Two popular dimensional data models are the star schema, in which data is organized into facts
(measurable items) and dimensions (reference information), where each fact is surrounded by
its associated dimensions in a star-like pattern.
• The other is the snowflake schema, which resembles the star schema but includes additional
layers of associated dimensions, making the branching pattern more complex.
• The star schema is the simplest type of Data Warehouse schema. It is known as star schema as
its structure resembles a star.
• Comparing Snowflake vs Star schema, a Snowflake Schema is an extension of a Star Schema, and
it adds additional dimensions. It is called snowflake because its diagram resembles a Snowflake.
• In a star schema, only single join defines the relationship between the fact table and any
dimension tables.
• Star schema contains a fact table surrounded by dimension tables.
• Snowflake schema is surrounded by dimension table which are in turn surrounded by dimension
table
• A snowflake schema requires many joins to fetch the data.
• Comparing Star vs Snowflake schema, the Star schema has a simple DB design, while the Snowflake schema has a very complex DB design.
• Star Schema in data warehouse, in which the center of the star can have one fact table and a
number of associated dimension tables. It is known as star schema as its structure resembles a
star.
• The Star Schema data model is the simplest type of Data Warehouse schema. It is also known as
Star Join Schema and is optimized for querying large data sets.
In the following Snowflake Schema example, Country is further normalized into an individual dimensional table.
Star Schema vs Snowflake Schema:
• Star schema: Hierarchies for the dimensions are stored in the dimensional table. Snowflake schema: Hierarchies are divided into separate tables.
• Star schema: It contains a fact table surrounded by dimension tables. Snowflake schema: One fact table is surrounded by dimension tables, which are in turn surrounded by dimension tables.
• Star schema: Only a single join creates the relationship between the fact table and any dimension tables. Snowflake schema: Many joins are required to fetch the data.
• Star schema: Denormalized data structure, so queries also run faster. Snowflake schema: Normalized data structure.
• Star schema: A single dimension table contains aggregated data. Snowflake schema: Data is split into different dimension tables.
• Star schema: Offers higher-performing queries using Star Join Query Optimization; tables may be connected with multiple dimensions. Snowflake schema: Represented by a centralized fact table that is unlikely to be connected with multiple dimensions.
• Online Analytical Processing Server (OLAP) is based on the multidimensional data model. It
allows managers, and analysts to get an insight of the information through fast, consistent, and
interactive access to information.
• This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP, statistical databases, and OLTP.
Relational OLAP (ROLAP)
ROLAP servers are placed between the relational back-end server and client front-end tools. To store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
Multidimensional OLAP (MOLAP)
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed data; the aggregations are stored separately in a MOLAP store.
Specialized SQL servers provide advanced query language and query processing support for SQL queries
over star and snowflake schemas in a read-only environment.
OLAP Operations
Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in
multidimensional data.
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up
• Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Drill-down
• Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.
Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to the level of
month.
• When drill-down is performed, one or more dimensions from the data cube are added. It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
Here Slice is performed for the dimension "time" using the criterion time = "Q1". It will form a new sub-cube by selecting one or more dimensions.
Dice
• Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
• The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an
alternative presentation of data. Consider the following diagram that shows the pivot operation.
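As an illustration of these four operations outside an OLAP server, the sketch below applies them to a tiny, made-up sales cube using pandas; the dimension values and sales figures are purely illustrative.

import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver", "Toronto", "Vancouver"],
    "quarter":  ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "item":     ["Mobile", "Modem", "Mobile", "Mobile", "Modem", "Modem"],
    "sales":    [605, 825, 14, 400, 300, 250],
})

# Roll-up: aggregate away the item dimension (climb the hierarchy / reduce dimensions)
print(cube.groupby(["location", "quarter"])["sales"].sum())

# Slice: fix one dimension, e.g. time = "Q1"
print(cube[cube["quarter"] == "Q1"])

# Dice: select on two or more dimensions
print(cube[cube["location"].isin(["Toronto", "Vancouver"]) & cube["quarter"].isin(["Q1"])])

# Pivot (rotate): present the same data along swapped axes
print(cube.pivot_table(index="item", columns="location", values="sales", aggfunc="sum"))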
OLAP vs OLTP
• OLAP systems are used by knowledge workers such as executives, managers, and analysts.
• OLTP systems are used by clerks, DBAs, or database professionals.
Establishing a data warehousing system infrastructure that enables you to meet all of your
business intelligence targets is by no means an easy task. With Astera Data Warehouse Builder,
you can cut down the numerous standard and repetitive tasks involved in the data warehousing
lifecycle to just a few simple steps.
In this article, we will examine a use case that describes the process of building a data
warehouse with a step-by-step approach using Astera Data Warehouse Builder.
Use Case
Shop-Stop is a fictitious online retail store that currently maintains its sales data in an SQL
database. The company has recently decided to implement a data warehouse across its
enterprise to improve business intelligence and gain a more solid reporting architecture.
However, their IT team and technical experts have warned them about the substantial amount
of capital and resources needed to execute and maintain the entire process.
As an alternative to the traditional data warehousing approach, Shop-Stop has decided to use
Astera Data Warehouse Builder to design, develop, deploy, and maintain their data warehouse.
Let’s take a look at the process we’d follow to build a data warehouse for them.
The first step in building a data warehouse with Astera Data Warehouse Builder is to identify
and model the source data. But before we can do that, we need to create a data warehousing
project that will contain all of the work items needed as part of the process. To learn how you
can create a data warehousing project and add new items to it, refer to the Astera documentation.
Once we’ve added a new data model to the project, we’ll reverse engineer Shop-Stop’s sales
database using the Reverse Engineer icon on the data model toolbar.
To learn more about reverse engineering from an existing database, refer to the Astera documentation.
Here’s what Shop-Stop’s source data model looks like once we’ve reverse engineered it:
Next, we’ll verify the data model to perform a check for errors and warnings. You can verify a
model through the Verify for Read and Write Deployment option in the main toolbar.
For more information on deploying a data model, refer to the Astera documentation. We’ve successfully created, verified, and deployed a source data model for Shop-Stop.
The next step in the process is to design a dimensional model that will serve as a destination
schema for Shop-Stop’s data warehouse. You can use the Entity object available in the data model toolbox, and the data modeler’s drag-and-drop interface, to design a model from scratch.
However, in Shop-Stop’s case, they’ve already designed a data warehouse schema in an SQL
database. First, we’ll reverse engineer that database. Here’s what the data model looks like:
Note: Each entity in this model represents a table in Shop-Stop’s final data warehouse.
Next, we’ll convert this model into a dimensional model by assigning facts and dimensions. The
type for each entity, when a database is reverse engineered, is set as General by default. You
can conveniently change the type to Fact or Dimension by right-clicking on the entity, hovering
over Entity Type in the context menu, and selecting an appropriate type from the given options.
In this model, the Sale entity in the center is the fact entity and the rest of them are dimension
entities. Here is a look at the model once we’ve defined all of the entity types and converted it
into a dimensional model:
To learn more about converting a data model into a dimensional model, refer to the Astera documentation.
Once the dimensions and facts are in place, we’ll configure each entity for enhanced
data storage and retrieval by assigning specified roles to the fields present in the layout
of each entity.
For dimension entities, the Dimension Role column in the Layout Builder provides a
comprehensive list of options. These include the following:
Record identifiers (Effective and Expiration dates, Current Record Designator, and
Version Number) to keep track of historical data.
Placeholder Dimension to keep track of late and early arriving facts and dimensions.
As an example, here is the layout of the Employee entity in the dimensional model after
we’ve assigned dimension roles to its fields.
To learn more about fact entities, refer to the Astera documentation.
Now that the dimensional model is ready, we’ll verify and deploy it for further usage.
In this step, we’ll populate Shop-Stop’s data warehouse by designing ETL pipelines to load
relevant source data into each table. In Astera Data Warehouse Builder, you can create ETL
pipelines in the dataflow designer.
Once you’ve added a new dataflow to the data warehousing project, you can use the extensive
set of objects available in the dataflow toolbox to design an ETL process. The Fact Loader and
Dimension Loader objects can be used to load data into fact and dimension tables, respectively.
Here is the dataflow that we’ve designed to load data into the Customer table in the data
warehouse:
On the left side, we’ve used a Database Table Source object to fetch data from a table present
in the source model. On the right side, we’ve used the Dimension Loader object to load data into
a table present in the destination dimensional model.
You’ll recall that both of the models mentioned above were deployed to the server and made
available for usage. While configuring the objects in this dataflow, we connected each of them
to the relevant model via the Astera Data Model connection in the list of data providers.
The Database Table Source object was configured with the source data model’s deployment.
On the other hand, the Dimension Loader object was configured with the destination
dimensional model’s deployment.
To learn more about the Data Model Query Source object, refer to the Astera documentation.
Now that all of the dataflows are ready, we’ll execute each of them to populate Shop-Stop’s data
warehouse with their sales data. You can execute or start a dataflow through the Start Dataflow
icon in the main toolbar.
To avoid executing all of the dataflows individually, we’ve designed a workflow to orchestrate
the entire process.
Finally, we’ll automate the process of refreshing this data through the built-in Job Scheduler. To
access the job scheduler, go to Server > Job Schedules in the main menu.
In the Scheduler tab, you can create a new schedule to automate the execution process at a
given frequency.
Step 4: Visualize and Analyze
Shop-Stop’s data warehouse can now be integrated with industry-leading visualization and
analytics tools such as Power BI, Tableau, Domo, etc. through a built-in OData service. The
company can use these tools to effectively analyze their sales data and gain valuable business
insights from it.
External Sources –
External source is a source from where data is collected irrespective of the type of data. Data
can be structured, semi structured and unstructured as well.
Stage Area –
Since the data extracted from the external sources does not follow a particular format, it needs to be validated before being loaded into the data warehouse. For this purpose, it is recommended to use an ETL tool.
E(Extract): Data is extracted from the external data sources.
T(Transform): Data is transformed into the standard format.
L(Load): Data is loaded into the data warehouse after transforming it into the standard format.
Data-warehouse –
After cleansing, the data is stored in the data warehouse as the central repository. It actually stores the metadata, while the actual data gets stored in the data marts. Note that the data warehouse stores the data in its purest form in this top-down approach.
Data Marts –
A data mart is also a part of the storage component. It stores the information of a particular function of an organisation, handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.
Data Mining –
The practice of analysing the big data present in the data warehouse is called data mining. It is used to find the hidden patterns present in the database or in the data warehouse with the help of data mining algorithms; a toy pattern-counting sketch follows.
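To make "finding hidden patterns" concrete, here is a toy Python sketch that counts item pairs occurring together in transactions, a simplified flavour of association mining. The baskets and the support threshold are invented for the example.

from collections import Counter
from itertools import combinations

# Invented transaction baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

min_support = 2  # a pair must appear in at least 2 transactions
frequent_pairs = {pair: count for pair, count in pair_counts.items() if count >= min_support}
print(frequent_pairs)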
Limited user involvement:
The top-down approach is typically driven by IT departments, which may lead to limited user involvement in the design and implementation process. This can result in data marts that do not meet the specific needs of business users.
Data latency:
The top-down approach may result in data latency, particularly when data is sourced from
multiple systems. This can impact the accuracy and timeliness of reporting and analysis.
Data ownership:
The top-down approach can create challenges around data ownership and control. Since data
is centralized in the data warehouse, it may not be clear who is responsible for maintaining and
updating the data.
Cost:
The top-down approach can be expensive to implement and maintain, particularly for smaller
organizations that may not have the resources to invest in a large-scale data warehouse and
associated data marts.
Data integration:
The top-down approach may face challenges in integrating data from different sources, particularly when data is stored in different formats or structures. This can lead to data inconsistencies and inaccuracies.
Bottom-up approach:
First, the data is extracted from external sources (the same as in the top-down approach).
Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse. The data marts are created first and provide reporting capability; each data mart addresses a single business area.
This approach was given by Kimball: data marts are created first and provide a thin view for analysis, and the data warehouse is created after the complete set of data marts has been created.
As the data marts are created first, reports are generated quickly.
We can accommodate a larger number of data marts here, and in this way the data warehouse can be extended.
Also, the cost and time taken in designing this model are comparatively low.
Incremental development: The bottom-up approach supports incremental development,
allowing for the creation of data marts one at a time. This allows for quick wins and incremental
improvements in data reporting and analysis.
User involvement: The bottom-up approach encourages user involvement in the design and
implementation process. Business users can provide feedback on the data marts and reports,
helping to ensure that the data marts meet their specific needs.
Flexibility: The bottom-up approach is more flexible than the top-down approach, as it allows for
the creation of data marts based on specific business needs. This approach can be particularly
useful for organizations that require a high degree of flexibility in their reporting and analysis.
Faster time to value: The bottom-up approach can deliver faster time to value, as the data marts
can be created more quickly than a centralized data warehouse. This can be particularly useful
for smaller organizations with limited resources.
Reduced risk: The bottom-up approach reduces the risk of failure, as data marts can be tested
and refined before being incorporated into a larger data warehouse. This approach can also
help to identify and address potential data quality issues early in the process.
Scalability: The bottom-up approach can be scaled up over time, as new data marts can be
added as needed. This approach can be particularly useful for organizations that are growing
rapidly or undergoing significant change.
Data ownership: The bottom-up approach can help to clarify data ownership and control, as
each data mart is typically owned and managed by a specific business unit. This can help to
ensure that data is accurate and up-to-date, and that it is being used in a consistent and
appropriate way across the organization.
Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, metadata is the roadmap to the warehouse: it defines the warehouse objects and acts as a directory to its contents.
Categories of Metadata
Business Metadata − It has the data ownership information, business definition, and changing
policies.
Technical Metadata − It includes database system names, table and column names and sizes,
data types and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices.
Operational Metadata − It includes the currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied to it.
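A minimal sketch of how these three categories might be recorded for a single warehouse table is shown below in Python; every value is invented purely for illustration and does not come from any particular metadata standard.

# Invented metadata for one warehouse table, grouped by category.
table_metadata = {
    "business": {
        "owner": "Sales Operations",
        "definition": "One row per retail customer",
        "change_policy": "Reviewed quarterly",
    },
    "technical": {
        "table": "dim_customer",
        "columns": {"customer_id": "INTEGER", "customer_name": "VARCHAR(100)"},
        "primary_key": ["customer_id"],
        "foreign_keys": [],
    },
    "operational": {
        "currency": "active",  # active / archived / purged
        "lineage": ["crm.customers", "staging.customers_clean"],
    },
}

print(table_metadata["operational"]["currency"])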
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from the warehouse data, yet it plays an important role. The
various roles of metadata are explained below.
Metadata acts as a directory. This directory helps the decision support system to locate the contents of the data warehouse.
Metadata helps the decision support system in mapping data when it is transformed from the operational environment to the data warehouse environment.
Metadata helps in summarization between the current detailed data and the highly summarized data.
Metadata also helps in summarization between the lightly detailed data and the highly summarized data.
Metadata Repository
Metadata repository is an integral part of a data warehouse system. It has the following
metadata −
Definition of data warehouse − It includes the description of the structure of the data warehouse. The description is defined by the schema, views, hierarchies, derived data definitions, and data mart locations and contents.
Business metadata − It contains the data ownership information, business definitions, and changing policies.
Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.
Data for mapping from the operational environment to the data warehouse − It includes the source databases and their contents, data extraction, data partitioning and cleaning, transformation rules, and data refresh and purging rules.
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of reports, validates data transformations, and ensures the accuracy of calculations. Metadata also enforces the definition of business terms for business end-users. With all these uses, metadata also has its challenges. Some of the challenges are discussed below.
Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications.
Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined.
There are no industry-wide accepted standards, and data management solution vendors have a narrow focus.
Data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. This helps in maintaining control over database instances.
The summaries are data-marted in the same way as they would have been designed within the data warehouse. Summary tables help to utilize all the dimension data in the starflake schema.
Hardware and Software Cost
Although data marts are created on the same hardware, they require some additional hardware and software. Handling user queries requires additional processing power and disk storage. If detailed data and the data mart exist within the data warehouse, then we face the additional cost of storing and managing the replicated data.
Note − Data marting is more expensive than aggregations, therefore it should be used as an
additional strategy and not as an alternative strategy.
Network Access
A data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.
Time Window Constraints
The extent to which a data mart loading process will eat into the available time window depends on the complexity of the transformations and the data volumes being shipped. The determination of how many data marts are possible depends on −
Network capacity.
Time window available.
Volume of data being transferred.
1. INTRODUCTION :
Helped by the fact that the laws and methods are well defined, e-Governance is a mode that offers a large number of advantages in implementing easy, transparent, fair and interactive solutions within a minimum time frame.
2. E-GOVERNANCE :
‘E’ in e-Government stands for much more than the electronic and digital world.
‘E’ indicates:
Efficient – do it the right way, with the goal of achieving maximum output with minimum effort and/or cost.
The advancements in ICT over the years, along with the Internet, provide an effective medium to establish communication between people and the government, thereby playing a major role in achieving good governance goals. Information technology plays a major role in assisting the government to provide effective governance in terms of time, cost and accessibility.
A data warehouse has been defined by Inmon as follows: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" [11]. Data from a large number of homogeneous and/or heterogeneous sources is accumulated to form the data warehouse. It provides a convenient and effective platform, with the help of online analytical processing (OLAP), to run queries over the consolidated data.
Data mining is an analysis tool used to extract knowledge from vast amounts of data for effective decision making. Mathematical and statistical concepts are used to uncover patterns, trends and relationships within the huge repository of data stored in a data warehouse [3].
There are a number of technical issues in e-Governance which need to be taken into consideration, such as the presentation of meaningful patterns for a timely decision-making process [24].
A large amount of data has been accumulated by governments over the years. To use such data for effective decision making, a data warehouse needs to be constructed over this enormous historical data. A number of queries that require complex analysis of data can then be handled effectively by decision makers. It also helps the government in making decisions that have a huge impact on citizens. The decision makers are also provided with strategic intelligence to gain a better view of the overall situation. This significantly assists the government in taking accurate decisions within a minimum time frame without depending on their IT staff.
The data mining approach extracts new and hidden interesting patterns (i.e. knowledge) from these large volumes of data.
The e-governance administrators can use this discovered knowledge to improve the quality of service. Decision-making activity in e-governance is mainly focused on the available funds, past experience and ground reports.
The government institutions are now analyzing large amounts of current and historical data to identify new and useful patterns from these large datasets. The areas of focus include:
1) Data warehousing, and
2) Data mining.
3. DATA WAREHOUSING AND DATA MINING :
Data mining is the tool to discover previously unknown, useful patterns from large heterogeneous databases. As historical data needs to be accumulated from distinct sources for better analysis, and with the prices of storage devices becoming drastically cheaper, the concept of data warehousing came into existence. If there is no centralized repository of accurate data, the application of data mining tools is almost impossible.
A data warehouse is used for collecting, storing and analyzing data to assist the decision-making process. Data mining can be applied to any kind of information repository, such as data warehouses, different types of database systems, the World Wide Web, flat files, etc.
Therefore, data warehousing and data mining are well suited to a number of e-Governance applications in the G2B (Government to Business), G2C (Government to Citizen) and G2G (Government to Government) environments. In order to have an effective implementation, there should be a solid data warehouse built on data collected from heterogeneous, reliable sources. The use of Data Warehousing and Data Mining (DWDM) technologies will assist decision makers in reaching important conclusions that can play an important role in any ‘e-Governance’ initiative. The need for DWDM in e-governance includes:
Provision of integrated data from diverse platforms for better implementation of strategies at
state or national level.
There is no requirement to use complex tools to derive information from the vast amount of data.
There is a mass reduction in dependence on IT staff.
It is a strong tool towards a corruption-free India.
The origin of e-governance in India in the mid-seventies was confined mainly to the area of defence and to handling queries involving large amounts of data related to census, elections and tax administration. The setting up of the National Informatics Centre (NIC) in 1976 by the Government of India was a major boost towards e-governance. The major push towards e-governance was initiated in 1987 with the launch of NICNET, the national satellite-based computer network.
The launch of the District Information System by NIC to computerize all district offices in India was another major step towards e-governance. During the early nineties, there was a significant increase in the use of IT in applications through which government policies started reaching non-urban areas, with good inputs from a number of NGOs and the private sector.
The area of e-governance has now become very wide. The government is implementing e-governance in every field, and it has spread its wings from urban to rural areas. There is hardly any field left that e-governance has not entered. E-governance plays a major role in routine transactions such as payment of bills and taxes, public grievance systems, municipal services like maintaining records of land and property, issue of birth/death/marriage certificates, registration and attorneys of properties, traffic management, health services, disaster management, the education sector, crime and criminal tracking systems, public distribution systems and, most importantly, providing up-to-date information in the agriculture sector. A number of states have set up their own portals, but most of these portals are incapable of providing a complete solution to people with just a click of the mouse. In most cases, ministries and individual departments have separate websites to provide the necessary information. This should not be the case, as the user has to visit multiple websites to get the relevant information. Ideally, the official website should act as a single window to provide the necessary information and services.
E-governance in India is at an infant stage. However, there are a limited number of successful and completed e-governance projects such as e-Seva, CARD, etc. Lack of insight can be attributed as a major factor for the failure of e-governance projects in India. Reservation and inflation can be topics of national debate, but e-governance was never an issue in Indian politics. Lately, the Government of India has risen to the occasion and started pushing projects related to e-governance.
a) Illiteracy and limited awareness regarding the positives of e-governance.
c) Lack of electricity and internet facilities, especially in rural areas, to reap the benefits of e-governance.
f) Absence of a qualified pool of resources to manage the system is a challenging task. The refusal of IT professionals to work in rural areas also affects the projects.
g) The role of the public in policy making is negligible. If the opinion of people at the grassroots level is taken into account, then the majority of problems can be solved.
A large number of national data warehouses can be identified from the existing data resources
within the central government ministries. These are potential subject areas in which data
warehouses may be developed at present and in the future.
Census Data :
Census data can help planners and others understand their communities' social,
economic, and demographic conditions.
Census data is the primary data used by planners to understand the social, economic,
and demographic conditions locally and nationally. We sometimes think the census is
unique to the United States, but census data is collected in other countries.
A census is the total process of collecting, compiling, and publishing demographic, economic, and social data pertaining, at a specified time, to all persons in a country or in a fixed or defined part of a country. Most countries also include a housing census.
Education planning
Assessment/Evaluation of programs