
DATA MINING AND WAREHOUSING

UNIT - I

BASICS AND TECHNIQUES

What is Data Mining?

Data mining is defined as the process of extracting useful information from huge sets of data; in other
words, it is the procedure of mining knowledge from data. The information or knowledge extracted in
this way can be used for applications such as −

 Market Analysis

 Fraud Detection

 Customer Retention

 Production Control

 Science Exploration

Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of patterns
to be mined, data mining functions fall into two categories −

 Descriptive

 Classification and Prediction

Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −

Class/Concept Description

 Mining of Frequent Patterns

 Mining of Associations

 Mining of Correlations

 Mining of Clusters

Class/Concept Description

Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and printers,
and concepts of customers include big spenders and budget spenders. Such
descriptions of a class or a concept are called class/concept descriptions. These
descriptions can be derived in the following two ways −

1. Data Characterization − This refers to summarizing the data of the class under study.
This class under study is called the Target Class.

2. Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting groups or classes.

What is Knowledge Discovery?

Some people don’t differentiate data mining from knowledge discovery while others view data
mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process −

 Data Cleaning − In this step, noise and inconsistent data are removed.

 Data Integration − In this step, multiple data sources are combined.


 Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.

 Data Transformation − In this step, data is transformed or consolidated into forms


appropriate for mining by performing summary or aggregation operations.

 Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.

 Pattern Evaluation − In this step, data patterns are evaluated.

 Knowledge Presentation − In this step, knowledge is represented.

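To make the sequence concrete, here is a small, hedged sketch (assuming pandas is available; the column names and values are invented) that walks a toy data set through cleaning, selection, and transformation before it would be handed to a mining algorithm:

# A minimal illustration of the KDD steps on a toy pandas DataFrame.
import pandas as pd

raw = pd.DataFrame({
    "age": [25, 25, None, 47, 200],        # contains a duplicate, a missing value, and an outlier
    "income": [30000, 30000, 42000, 58000, 61000],
    "region": ["N", "N", "S", "S", "E"],
})

# Data cleaning: drop duplicates, rows with missing values, and an obvious outlier.
clean = raw.drop_duplicates().dropna()
clean = clean[clean["age"] < 120]

# Data selection: keep only the attributes relevant to the analysis task.
selected = clean[["age", "income"]]

# Data transformation: aggregate into a form suitable for mining (mean income per age band).
summary = selected.assign(age_band=(selected["age"] // 10) * 10)
summary = summary.groupby("age_band")["income"].mean()
print(summary)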

Data mining functionalities refer to the kinds of functions to be performed. These functions are −

 Characterization

 Discrimination

 Association and Correlation Analysis

 Classification

 Prediction

 Clustering

 Outlier Analysis

 Evolution Analysis
Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. Concept hierarchies,
for example, are one form of background knowledge that allows data to be mined at multiple levels of
abstraction.

DATA MINING ISSUES:

Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place.

It needs to be integrated from various heterogeneous data sources. These factors also create
some issues. Here in this tutorial, we will discuss the major issues regarding −

 Mining Methodology and User Interaction

 Performance Issues

 Diverse Data Types Issues



Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −


 Mining different kinds of knowledge in databases − Different users may be interested in
different kinds of knowledge. Therefore, data mining needs to cover a broad range of
knowledge discovery tasks.

 Interactive mining of knowledge at multiple levels of abstraction − The data mining


process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.

 Incorporation of background knowledge − To guide discovery process and to express


the discovered patterns, the background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms
but at multiple levels of abstraction.

 Data mining query languages and ad hoc data mining − A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.

 Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.

 Handling noisy or incomplete data − The data cleaning methods are required to handle
the noise and incomplete objects while mining the data regularities. If the data
cleaning methods are not there then the accuracy of the discovered patterns will be
poor.

 Pattern evaluation − The patterns discovered may be uninteresting because they
represent common knowledge or lack novelty, so good measures of interestingness are needed.

DATA MINING METRICS:

Data mining is a form of artificial intelligence that uses perception models,
analytical models, and multiple algorithms to simulate techniques of the human brain.
Data mining helps machines support human decisions and choices.

The user of data mining tools has to supply the machine with rules, preferences, and even
experiences in order to obtain decision support. The main data mining metrics are as follows −

 Usefulness − Usefulness involves several metrics that tell us whether the model
provides useful data. For instance, a data mining model that correlates store
location with sales can be both accurate and reliable, yet still not be useful, because
that result cannot be generalized by adding more stores at the same location.
Furthermore, it does not answer the fundamental business question of why specific locations
have more sales. You may also find that a model that appears successful is meaningless
because it depends on cross-correlations in the data.

 Return on Investment (ROI) − Data mining tools will find interesting patterns buried
inside the data and develop predictive models. These models will have several
measures for denoting how well they fit the records. It is not clear how to create a
decision based on some of the measures reported as an element of data mining
analyses.

Access Financial Information during Data Mining − The simplest way to frame decisions in
financial terms is to augment the raw information that is generally mined to also contain
financial data. Some organizations are investing in and developing data warehouses and data
marts.

The design of a warehouse or mart includes considerations about the types of analyses and
data needed for expected queries. Designing warehouses in a way that allows access to
financial information, along with access to more typical data on product attributes, user
profiles, etc., can be useful.

Converting Data Mining Metrics into Financial Terms − A general data mining metric is the
measure of "Lift". Lift is a measure of what is achieved by using the specific model or pattern
relative to a base rate in which the model is not used. High values mean much is achieved. It
can seem then that one can simply create a decision based on Lift.
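As a rough numeric illustration (the transaction counts below are invented, not taken from these notes), here is a small Python sketch of computing lift for an association rule A → B against the base rate of B:

# Toy illustration of "lift" for an association rule A -> B.
# All counts are made up for illustration only.
n_total = 1000        # total transactions
n_A = 200             # transactions containing A
n_B = 150             # transactions containing B
n_A_and_B = 60        # transactions containing both A and B

support_B = n_B / n_total        # base rate of B
confidence = n_A_and_B / n_A     # P(B | A) when the rule is used
lift = confidence / support_B    # lift > 1 means the rule beats the base rate

print(f"confidence={confidence:.2f}, base rate={support_B:.2f}, lift={lift:.2f}")

Converting such a lift value into financial terms still requires the cost and revenue figures discussed above.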

 Accuracy − Accuracy is a measure of how well the model correlates results with the
attributes in the data that has been supported. There are several measures of accuracy,
but all measures of accuracy are dependent on the information that is used. In reality,
values can be missing or approximate, or the data can have been changed by several
processes.

Because data mining is a procedure of exploration and development, you can decide to accept a specific
amount of error in the data, especially if the data is fairly uniform in its characteristics. For example, a
model that predicts sales for a specific store based on past sales can be strongly correlated
and very accurate, even if that store consistently used the wrong accounting techniques. Thus,
measurements of accuracy should be balanced by assessments of reliability.

SOCIAL IMPLICATIONS OF DATA MINING:

There are various social implications of data mining which are as follows −

 Privacy − This is a loaded issue. In recent years, privacy concerns have taken on a more
important role in American society as merchants, insurance companies, and
government agencies amass warehouses containing personal records.

 The concerns that people have over the collection of this data generally extend to
the analytic capabilities applied to that data. Users of data mining should start thinking
about how their use of this technology will be affected by legal issues associated
with privacy.

 Profiling − Data mining and profiling is a developing field that attempts to organize,
understand, analyze, reason about, and use the explosion of data in this information age. The
process involves using algorithms and experience to extract patterns or anomalies that
are very complex, difficult, or time-consuming to identify.

 The founder of Microsoft's Exploration Team used complex data mining algorithms to
solve a problem that had haunted astronomers for years: reviewing, describing, and
categorizing two billion sky objects recorded over three decades. The algorithms were
able to extract the features that classify sky objects as stars or galaxies. This
developing field of data mining and profiling has several frontiers where it can be used.

 Unauthorized Use − Trends obtained through data mining, intended to be used for
marketing or other ethical goals, can be misused. Unethical businesses or
people can use the data obtained through data mining to take advantage of vulnerable
people or to discriminate against a specific group of people. Furthermore, data
mining techniques are not 100 percent accurate; thus mistakes do occur, which can have
serious consequences.

What is data mining from a database perspective?

Data mining, which is also referred to as knowledge discovery in databases, means a process
of nontrivial extraction of implicit, previously unknown and potentially useful information (such
as knowledge rules, constraints, regularities) from data in databases.

Data mining is used to explore increasingly large databases and to improve market
segmentation. By analysing the relationships between parameters such as customer age,
gender, tastes, etc., it is possible to guess their behaviour in order to direct personalised
loyalty campaigns.

The data mining process is usually broken into the following steps:

Step 1: Understand the Business

Step 2: Understand the Data

Step 3: Prepare the Data

Step 4: Build the Model

Step 5: Evaluate the Results

Step 6: Implement Change and Monitor

Supermarkets, for example, use joint purchasing patterns to identify product associations and
decide how to place products in the aisles and on the shelves. Data mining also detects which
offers are most valued by customers or increase sales at the checkout queue.

What is the role of data mining in database management?

Data mining is the process of finding anomalies, patterns and correlations within large data
sets to predict outcomes. Using a broad range of techniques, you can use this information to
increase revenues, cut costs, improve customer relationships, reduce risks and more.

DATA MINING

What is Data Mining?

 Data mining is the process of analyzing hidden patterns of data according to different
perspectives for categorization into useful information, which is collected and assembled
in common areas, such as data warehouses, for efficient analysis by data mining
algorithms, facilitating business decision making and other information requirements to
ultimately cut costs and increase revenue.

 Data mining is also known as data discovery and knowledge discovery.

 Data mining refers to extracting or mining knowledge from large amounts of data. The
term is actually a misnomer; data mining should more appropriately have been named
knowledge mining, which emphasizes mining from large amounts of data.

 It is the computational process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and
database systems.

 The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.

The key properties of data mining are:

 Automatic discovery of patterns

 Prediction of likely outcomes

 Creation of actionable information

 Focus on large datasets and databases

Data mining definition:

Data mining is the process of sorting through large data sets to identify patterns and

establish relationships to solve problems through data analysis. Data mining tools allow

enterprises to predict future trends.

Data Mining Techniques:

Data mining uses algorithms and various other techniques to convert large collections of data
into useful output. The most popular types of data mining techniques include: Association
rules, also referred to as market basket analysis, search for relationships between variables.

DM Techniques

1.Classification:

This analysis is used to retrieve important and relevant information about data, and

metadata. This data mining method helps to classify data in different classes.

2. Clustering:

Clustering analysis is a data mining technique to identify data that are like each other.

This process helps to understand the differences and similarities between the data.
3. Regression:

Regression analysis is the data mining method of identifying and analyzing the

relationship between variables. It is used to identify the likelihood of a specific variable,

given the presence of other variables.

4. Association Rules:

This data mining technique helps to find the association between two or more Items. It

discovers a hidden pattern in the data set.

5. Outlier detection:

This type of data mining technique refers to the observation of data items in the dataset
which do not match an expected pattern or expected behavior. This technique can be
used in a variety of domains, such as intrusion detection, fraud or fault detection, etc.
Outlier detection is also called Outlier Analysis or Outlier mining.

6. Sequential Patterns:

This data mining technique helps to discover or identify similar patterns or trends in
transaction data over a certain period.

7. Prediction:

Prediction uses a combination of the other data mining techniques such as trends,
sequential patterns, clustering, and classification. It analyzes past events or instances in
the right sequence to predict a future event.

Data Mining Application

 Communications - Data mining techniques are used in the communication sector to
predict customer behavior and offer highly targeted and relevant campaigns.

 Insurance - Data mining helps insurance companies price their products profitably
and promote new offers to their new or existing customers.

 Education - Data mining benefits educators to access student data, predict


achievement levels and find students or groups of students which need extra attention.

For example, students who are weak in maths subject.

 Manufacturing - With the help of data mining, manufacturers can predict wear and tear
of production assets. They can anticipate maintenance, which helps them minimize
downtime.

 Banking - Data mining helps finance sector to get a view of market risks and manage

regulatory compliance. It helps banks to identify probable defaulters to decide whether

to issue credit cards, loans, etc.

 Retail - Data mining techniques help retail malls and grocery stores identify and arrange
the most sellable items in the most attention-grabbing positions. It helps store owners come up
with offers which encourage customers to increase their spending.

 Service Providers - Service providers like mobile phone and utility industries use data
mining to predict the reasons why a customer leaves their company. They analyze
billing details, customer service interactions, and complaints made to the company to assign
each customer a probability score and offer incentives.

 E-Commerce - E-commerce websites use data mining to offer cross-sells and up-sells
through their websites. One of the most famous names is Amazon, which uses data
mining techniques to bring more customers into its e-commerce store.

 Supermarkets - Data mining allows supermarkets to develop rules to predict whether their
shoppers are likely to be expecting a baby. By evaluating their buying patterns, they can find
women customers who are most likely pregnant and start targeting products like
baby powder, baby soap, diapers, and so on.

Techniques of statistical data mining:

There are various techniques of statistical data mining which are as follows −
 Regression − These approaches are used to forecast the value of a response
(dependent) variable from one or more predictor (independent) variables where the
variables are numeric.

 There are several forms of regression, including linear, multiple, weighted, polynomial,
nonparametric, and robust (robust techniques are beneficial when errors fail to satisfy
normalcy conditions or when the data includes significant outliers).

 Generalized linear models − These models, and their generalization (generalized


additive models), enable a categorical response variable (or some transformation of it)
to be associated with a set of predictor variables like the modeling of a numeric
response variable using linear regression. Generalized linear models contain logistic
regression and Poisson regression.

 Analysis of variance − These methods analyze experimental data for two or more
populations described by a numeric response variable and one or more categorical
variables (factors). In general, a single-factor analysis of variance (ANOVA) problem
involves a comparison of k population or treatment means to decide whether at least two of
the means are different.

 Mixed-effect models − These models are for analyzing grouped data—data that can be
categorized as per one or more grouping variables. They generally define relationships
between a response variable and some covariates in data combined as per one or
more factors. Typical areas of application include multilevel data, repeated-measures
data, block designs, and longitudinal data.

 Factor analysis − This method can determine which variables combine to make a
given factor. For instance, for some psychiatric data, it is not feasible to measure a
specific factor of interest directly (such as intelligence); however, it is possible to
measure other quantities (such as student test scores) that reflect the factor of
interest. Here, none of the variables is designated as dependent.

 Discriminant analysis − This method can predict a categorical response variable.


Unlike generalized linear models, it implies that the independent variables follow a
multivariate normal distribution.

The process tries to determine some discriminant functions (linear set of the independent
variables) that discriminate between the groups represented by the response variable.
Discriminant analysis is generally used in social sciences.

 Time series analysis − There are some statistical techniques for analyzing time-series
data, including auto-regression methods, univariate ARIMA (autoregressive integrated
moving average) modeling, and long-memory time-sequence modeling.

 Survival analysis − Several well-established statistical methods exist for survival


analysis. These methods initially were designed to forecast the probability that a
patient undergoing medical treatment can survive at least to time t.

 Quality control − Several statistics can be used to prepare charts for quality control,
including Shewhart charts and CUSUM charts (both of which display group summary
statistics). These statistics contain the mean, standard deviation, range, count, moving
average, moving standard deviation, and moving range.

In recent data mining projects, various major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.

Measures of Distance in Data Mining

 Clustering consists of grouping objects that are similar to each other; it can be
used to decide whether two items are similar or dissimilar in their properties.

In a Data Mining sense, the similarity measure is a distance with dimensions describing object
features.

That means if the distance among two data points is small then there is a high degree of
similarity among the objects and vice versa.

The similarity is subjective and depends heavily on the context and application. For example,
similarity among vegetables can be determined from their taste, size, colour etc.

Most clustering approaches use distance measures to assess the similarities or differences
between a pair of objects, the most popular distance measures used are:
1. Euclidean Distance:

Euclidean distance is considered the traditional metric for problems with geometry. It can be
simply explained as the ordinary distance between two points.

It is one of the most used measures in cluster analysis. One of the algorithms that uses
this formula is K-means.

Mathematically, it computes the square root of the sum of squared differences between the
coordinates of two objects. For points P at (x1, y1) and Q at (x2, y2):

Euclidean distance between P and Q = sqrt((x1 – x2)² + (y1 – y2)²)

2. Manhattan Distance:

This determines the absolute difference among the pair of the coordinates.

Suppose we have two points P and Q to determine the distance between these points we
simply have to calculate the perpendicular distance of the points from X-Axis and Y-Axis.

In a plane with P at coordinate (x1, y1) and Q at (x2, y2).

Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|


Geometrically, this is the total length of the axis-parallel (grid) path between the two points.

3. Jaccard Index:

The Jaccard distance measures the similarity of the two data set items as the intersection of
those items divided by the union of the data items.
4. Minkowski distance:

It is the generalized form of the Euclidean and Manhattan distance measures. In an N-
dimensional space, a point is represented as (x1, x2, ..., xN), and the Minkowski distance of
order p between two points X and Y is:

D(X, Y) = (|x1 – y1|^p + |x2 – y2|^p + ... + |xN – yN|^p)^(1/p)

With p = 1 this reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.
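The following is a small, hedged Python sketch (not from the original notes) of the four distance measures above, using only the standard library; the sample points and item sets are invented:

# Distance measures on toy points; all values are illustrative.
import math

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r):
    # Generalized form: r = 1 gives Manhattan, r = 2 gives Euclidean.
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

def jaccard_distance(set_a, set_b):
    # 1 minus the size of the intersection divided by the size of the union.
    return 1 - len(set_a & set_b) / len(set_a | set_b)

P, Q = (1, 2), (4, 6)
print(euclidean(P, Q))       # 5.0
print(manhattan(P, Q))       # 7
print(minkowski(P, Q, 3))    # about 4.5
print(jaccard_distance({"milk", "bread", "eggs"}, {"bread", "eggs", "butter"}))  # 0.5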
DECISION TREES

Decision tree is the most powerful and popular tool for classification and prediction. A

Decision tree is a flowchart like tree structure, where each internal node denotes a test

on an attribute, each branch represents an outcome of the test, and each leaf node

(terminal node) holds a class label.

Introduction:

Decision tree algorithms fall under the category of supervised learning. They can be used to
solve both regression and classification problems.

A decision tree uses a tree representation to solve the problem, in which each
leaf node corresponds to a class label and attributes are represented on the
internal nodes of the tree.

We can represent any boolean function on discrete attributes using a decision tree.

Decision Tree Construction Principles:

Definition of Decision Tree

If you’re a college student, you learn in two ways –

1. supervised

2. unsupervised.

The same division can be found in algorithms, and the decision tree belongs to the former
category. It’s a supervised algorithm you can use to regress or classify data. It relies on
training data to predict values or outcomes.

Components of Decision Tree

What’s the first thing you notice when you look at a tree? If you’re like most people, it’s
probably the leaves and branches.

The decision tree algorithm has the same elements. Add nodes to the equation, and you have
the entire structure of this algorithm right in front of you.

Nodes – There are several types of nodes in decision trees. The root node is the parent of all
nodes, which represents the overriding message. Chance nodes tell you the probability of a
certain outcome, whereas decision nodes determine the decisions you should make.

Branches – Branches connect nodes. Like rivers flowing between two cities, they show your
data flow from questions to answers.
Leaves – Leaves are also known as end nodes. These elements indicate the outcome of your
algorithm. No more nodes can spring out of these nodes. They are the cornerstone of
effective decision-making.

Types of Decision Trees

When you go to a park, you may notice various tree species: birch, pine, oak, and acacia. By
the same token, there are multiple types of decision tree algorithms:

1. Classification Trees – These decision trees map observations about particular data by
classifying them into smaller groups. The chunks allow machine learning specialists to
predict certain values.

2. Regression Trees – According to IBM, regression decision trees can help anticipate
events by looking at input variables.

Decision Tree Algorithm in Data Mining

Knowing the definition, types, and components of decision trees is useful, but it doesn’t give
you a complete picture of this concept. So, buckle your seatbelt and get ready for an in-depth
overview of this algorithm.

Overview of Decision Tree Algorithms

Just as there are hierarchies in your family or business, there are hierarchies in any decision
tree in data mining. Top-down arrangements start with a problem you need to solve and break
it down into smaller chunks until you reach a solution. Bottom-up alternatives sort of wing it –
they enable data to flow with some supervision and guide the user to results.

Popular Decision Tree Algorithms

ID3 (Iterative Dichotomiser 3) – Developed by Ross Quinlan, the ID3 is a versatile algorithm
that can solve a multitude of issues. It’s a greedy algorithm (yes, it’s OK to be greedy
sometimes), meaning it selects attributes that maximize information output.

C4.5 – This is another algorithm created by Ross Quinlan. It generates outcomes according to
previously provided data samples. The best thing about this algorithm is that it works well
with incomplete information.

CART (Classification and Regression Trees) – This algorithm drills down on predictions. It
describes how you can predict target values based on other, related information.

CHAID (Chi-squared Automatic Interaction Detection) – If you want to check out how your
variables interact with one another, you can use this algorithm. CHAID determines how
variables mingle and explain particular outcomes.

Key Concepts in Decision Tree Algorithms

No discussion about decision tree algorithms is complete without looking at the most
significant concept from this area:

 Entropy

As previously mentioned, decision trees are like trees in many ways. Conventional trees
branch out in random directions. Decision trees share this randomness, which is where
entropy comes in.

Entropy tells you the degree of randomness (or surprise) of the information in your decision
tree.

 Information Gain

A decision tree isn’t the same before and after splitting a root node into other nodes. You can
use information gain to determine how much it’s changed. This metric indicates how much
your data has improved since your last split. It tells you what to do next to make better
decisions.

 Gini Index

Mistakes can happen, even in the most carefully designed decision tree algorithms. However,
you might be able to prevent errors if you calculate their probability.

Enter the Gini index (Gini impurity). It establishes the likelihood of misclassifying an instance
chosen at random (a small computational sketch of entropy, information gain, and the Gini index
follows after these concepts).

 Pruning

You don’t need every branch on your apple or pear tree to get a great yield. Likewise, not all
data is necessary for a decision tree algorithm.

Pruning is a compression technique that allows you to get rid of this redundant information
that keeps you from classifying useful data.
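As referenced above, here is a minimal, hedged Python sketch (not part of the original notes) that computes entropy, information gain for a split, and Gini impurity for a small list of class labels:

# Impurity measures for a toy set of class labels.
from collections import Counter
import math

def entropy(labels):
    # Degree of randomness (surprise) of the class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Probability of misclassifying a randomly chosen, randomly labeled instance.
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    # Reduction in entropy achieved by splitting `parent` into the `children` subsets.
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

labels = ["yes", "yes", "no", "no", "no"]
print(entropy(labels))   # about 0.971
print(gini(labels))      # 0.48
print(information_gain(labels, [["yes", "yes"], ["no", "no", "no"]]))  # about 0.971 (a pure split)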

 Building a Decision Tree in Data Mining

Growing a tree is straightforward – you plant a seed and water it until it is fully formed.
Creating a decision tree is simpler than some other algorithms, but quite a few steps are
involved nevertheless.
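The notes stop short of walking through those steps; as one possible illustration (assuming scikit-learn is installed, and using its bundled Iris data purely as a convenient stand-in), a classification tree can be built like this:

# Hedged sketch: build and evaluate a small classification tree with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion may be "gini" (default) or "entropy"; max_depth acts as simple pre-pruning.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))

Limiting max_depth (or pruning afterwards) keeps the tree from memorizing redundant details of the training data.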
Neural Network:

Neural Network is an information processing paradigm that is inspired by the human nervous
system.

Just as the human nervous system has biological neurons, neural networks have artificial
neurons, which are mathematical functions modeled on biological neurons.

The human brain is estimated to have around 10 billion neurons each connected on average to
10,000 other neurons.

Each neuron receives signals through synapses that control the effects of the signal on the
neuron.
Artificial Neuron:

Pitts model is considered to be the first neural network and the Hebbian learning rule is one of
the earliest and simplest learning rules for the neural network. The neural network model can
be broadly divided into the following three types:

1. Feed-Forward Neural Networks: In a feed-forward network, the output values cannot
be traced back to the input values and, for every input node, an output node is
calculated, so there is a forward flow of information and no feedback between the
layers. In simple words, the information moves in only one direction (forward) from the
input nodes, through the hidden nodes (if any), and to the output nodes. Such a type of
network is known as a feedforward network (a minimal numeric sketch follows after this list).

2. Feedback Neural Network: Signals can travel in both directions in a feedback network.
Feedback neural networks are very powerful and can become very complex. Feedback
networks are dynamic: the “states” in such a network are constantly changing until an
equilibrium point is reached.

They stay at equilibrium until the input changes and a new equilibrium needs to be
found. Feedback neural network architectures are also known as interactive or
recurrent. Feedback loops are allowed in such networks. They are used for content-
addressable memory.

3. Self-Organizing Neural Network: A Self-Organizing Neural Network (SONN) is a type of
artificial neural network that is trained using competitive learning rather than the error-
correction learning (e.g., backpropagation with gradient descent) used by other
artificial neural networks.

A Self-Organizing Neural Network (SONN) is an unsupervised learning model in
artificial neural networks, also termed Self-Organizing Feature Maps or Kohonen Maps. It
is used to produce a low-dimensional (typically two-dimensional) representation of a
higher-dimensional data set while preserving the topological structure of the data.
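As referenced under the feed-forward case above, here is a minimal, hedged numeric sketch (the weights are random placeholders, not a trained model) of a single forward pass through a one-hidden-layer feed-forward network:

# One forward pass: input -> hidden -> output, with information flowing forward only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, 0.9])        # input layer with 3 nodes

W1 = np.random.rand(4, 3)            # hidden layer: 4 artificial neurons
b1 = np.zeros(4)
W2 = np.random.rand(2, 4)            # output layer: 2 artificial neurons
b2 = np.zeros(2)

hidden = sigmoid(W1 @ x + b1)        # weighted sums passed through the activation function
output = sigmoid(W2 @ hidden + b2)
print(output)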

What is a Genetic Algorithm?

Genetic algorithms (GA) are adaptive search algorithms – adaptive in terms of the number and
types of parameters you provide.

The algorithm identifies the best (optimal) solution among several candidate solutions, and its
design is based on natural genetics.

Genetic algorithm emulates the principles of natural evolution, i.e. survival of the fittest.
Natural evolution propagates the genetic material in the fittest individuals from one
generation to the next.

The genetic algorithm applies the same technique in data mining – it iteratively performs the
selection, crossover, mutation, and encoding process to evolve the successive generation of
models.

The components of genetic algorithms consist of:

 Population incorporating individuals.

 Encoding or decoding mechanism of individuals.

 The objective function and an associated fitness evaluation criterion.

 Selection procedure.

 Genetic operators like recombination or crossover, mutation.

 Probabilities to perform genetic operations.

 Replacement technique.

 Termination combination.

At every iteration, the algorithm delivers a model that inherits its traits from the previous
model and competes with the other models until the most predictive model survive.

Foundation of Genetic Algorithms:

Genetic algorithms are based on an analogy with genetic structure and behaviour of
chromosomes of the population. Following is the foundation of GAs based on this analogy –

Individual in population compete for resources and mate

Those individuals who are successful (fittest) then mate to create more offspring than others.

Genes from “fittest” parent propagate throughout the generation, that is sometimes parents
create offspring which is better than either parent.

Thus each successive generation is more suited for their environment.

 Search space:

The population of individuals are maintained within search space. Each individual represents a
solution in search space for given problem. Each individual is coded as a finite length vector
(analogous to chromosome) of components. These variable components are analogous to
Genes. Thus a chromosome (individual) is composed of several genes (variable components).

 Fitness Score

A Fitness Score is given to each individual which shows the ability of an individual to
“compete”. The individual having optimal fitness score (or near optimal) are sought.

The GAs maintains the population of n individuals (chromosome/solutions) along with their
fitness scores.The individuals having better fitness scores are given more chance to
reproduce than others.
The individuals with better fitness scores are selected who mate and produce better offspring
by combining chromosomes of parents.

The population size is static, so room has to be created for new arrivals. So, some
individuals die and get replaced by new arrivals, eventually creating a new generation when all
the mating opportunities of the old population are exhausted.

It is hoped that over successive generations better solutions will arrive while the least fit die.

The algorithm is said to be converged to a set of solutions for the problem.

Operators of Genetic Algorithms

Once the initial generation is created, the algorithm evolves the generation using following
operators –
1) Initial population

Being the first phase of the algorithm, it includes a set of individuals where each individual is a
solution to the concerned problem. We characterize each individual by the set of parameters
that we refer to as genes.

2) Calculate Fitness

A fitness function is implemented to compute the fitness of each individual in the population.
The function provides a fitness score to each individual in the population. The fitness score
determines the probability that the individual is selected for the reproduction process.

3) Selection Operator: The idea is to give preference to the individuals with good fitness
scores and allow them to pass their genes to successive generations.

4) Crossover Operator: This represents mating between individuals. Two individuals are
selected using the selection operator and crossover sites are chosen randomly. Then the genes at
these crossover sites are exchanged, thus creating a completely new individual (offspring).

5) Mutation Operator: The key idea is to insert random genes in offspring to maintain the
diversity in the population and to avoid premature convergence.

The whole algorithm can be summarized as –


1) Randomly initialize populations p

2) Determine fitness of population

3) Until convergence repeat:

a) Select parents from population

b) Crossover and generate new population

c) Perform mutation on new population

d) Calculate fitness for new population

Example problem and solution using Genetic Algorithms
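The worked example itself is not reproduced in these notes; as a stand-in, here is a minimal, hedged Python sketch that follows the summarized steps above on the toy "OneMax" problem (evolve bit strings so that the fitness, the number of 1-bits, is maximized). All constants are arbitrary choices.

# Minimal GA: initialize -> evaluate fitness -> (select, crossover, mutate) until done.
import random

GENOME_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 30, 50, 0.01

def fitness(individual):
    return sum(individual)                      # count of 1-bits

def select(population):
    # Tournament selection: the fittest of a few random individuals wins.
    return max(random.sample(population, 3), key=fitness)

def crossover(parent_a, parent_b):
    point = random.randint(1, GENOME_LEN - 1)   # single randomly chosen crossover site
    return parent_a[:point] + parent_b[point:]

def mutate(individual):
    # Flip bits with a small probability to preserve diversity in the population.
    return [1 - g if random.random() < MUTATION_RATE else g for g in individual]

# 1) Randomly initialize the population.
population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

# 2)-3) Repeat selection, crossover, and mutation to create each new generation.
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("best fitness:", fitness(best), "out of", GENOME_LEN)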

Genetic Algorithms in Data Mining

So far, we have studied that the genetic algorithm is a classification method that is adaptive,
robust and used globally in situations where the area of classification is large. The algorithms
optimize a fitness function based on the criteria preferred by data mining so as to obtain an
optimal solution, for example:
1. Knowledge discovery system

2. MASSON system

However, the data mining application based on the genetic algorithm is not as rich as the
application based on fuzzy sets. In the section ahead, we have categorized some systems
based on the genetic algorithm used in data mining.

Regression

Data mining identifies human-interpretable patterns; it includes prediction, which determines a
future value from the available variables or attributes in the database. The basic assumption of
the linear multiple regression model is that there is no interaction among the attributes.

GA handles interaction among the attributes in a far better way. A non-linear multiple
regression model can use a GA to single out the best-fitting model from the training data set.

Association Rules

Multi-objective GA deals with problems that have multiple objective functions and
constraints, to determine an optimal set of solutions: no solution in the search
space should dominate any member of this set.

Such algorithms are used for rule mining with a large search space with many attributes and
records. To obtain the optimal solutions, multi-objective GA performs the global search with
multiple objectives. Such as a combination of factors like predictive accuracy,
comprehensibility and interestingness.

Advantages and Disadvantages

Advantages

 Easy to understand as it is based on the concept of natural evolution.

 Classifies an optimal solution from a set of solutions.

 GA uses the payoff (fitness) information instead of derivatives to yield an optimal solution.

 GA supports multi-objective optimization.

 GA is an adaptive search algorithm.

 GA also operates in a noisy environment.


Disadvantages

 An improper implementation may lead to a solution that is not optimal.

 Implementing fitness function iteratively may lead to computational challenges.

 GA is time-consuming as it deals with a lot of computation.

MCQ QUESTIONS :

1.Which of the listed below helps to identify abstracted patterns in unlabeled data?

Hybrid learning

Unsupervised learning

Supervised learning

Reinforcement learning

Answer: 2. Unsupervised learning

2. Which of the listed below helps to infer a model from labeled data?

Hybrid learning

Unsupervised learning

Supervised learning

Reinforcement learning

Answer: 3. supervised learning

3. Which of the following can be used to query unstructured textual data?

Information retrieval

Information access

Information manipulation

Information update

Answer: 1. information retrieval

4. Which of the following processes is not involved in the data mining process?
Data exploration

Data transformation

Data archaeology

Knowledge extraction

Answer: 3. Data archaeology

5. Which of the following is taken into account before diving into the data mining process?

Vendor consideration

Functionality

Compatibility

All of the above

Answer: 4. all of the above

6. What is the full form of OLTP?

Online transaction processing

Offline transaction processing

Online traffic processing

None of the above

Answer: 1. Online transaction processing

7.Which of the following process uses intelligent methods to extract data patterns?

Data mining

Text mining

Warehousing

Data selection

Answer: 1. data mining

8. What is the full form of KDD in the data mining process?


Knowledge data house

Knowledge data definition

Knowledge discovery data

Knowledge discovery in databases

Answer: 4. Knowledge discovery in databases

9. What are the chief functions of the data mining process?

Prediction and characterization

Cluster analysis and evolution analysis

Association and correction analysis classification

All of the above

Answer: 4. all of the above

10. Where is data warehousing used?

Logical system

Transaction system

Decision support system

None of the above

Answer: 3. decision support system

11.Among which of the following can be used by the warehouse?

Database table

Online database

Flat files

All of the above

Answer: 4. all of the above

12.Which of the following statement is true regarding classification?


It is a measure of accuracy.

It is a subdivision of a set.

It is the task of assigning a classification.

None of the above.

Answer: 2. It is a subdivision of a set

13. Which is the correct option regarding data mining?

It can be referred to as the mining of knowledge from data

It can be defined as the process of extracting information from a large collection of data

The process of data mining involves several other processes like data cleaning, data
transformation, and data integration.

All of the above

Answer: 4. all of the above

14. Which is the correct process of data mining?

Infrastructure, exploration, analysis, interpretation, and exploitation

Exploration, Infrastructure, analysis, interpretation, and exploitation

Exploration, Infrastructure, interpretation, analysis, and exploitation

Exploration, Infrastructure, analysis, exploitation, and analysis

Answer: 1. Infrastructure, exploration, analysis, interpretation, exploitation

15. Which statement is incorrect regarding data cleaning?

It refers to correcting the inconsistent data.

It refers to the process of data cleaning.

It refers to the conversion of the wrong data to the right data.

All of the above

Answer: 4. all of the above

16. Which is the right advantage regarding the Update-Driven approach?


Update-Driven approach enables high performance

The data in Update-Driven approach can be copied, integrated, summarized, and restructured
in the semantic data store in advance.

Both a and b

None of the above

Answer: 3. both a and b

17. Which statement is correct regarding query tools?

It is used to query the databases

Attributes to a database can only take numerical values.

Both a and b

None of the above

Answer: 1. it is used to query the database

18.Which statement given below closely defines the term cluster?

These are groups of similar objects that differ significantly from objects in other groups.

Symbolic representation of facts and ideas from which information can be extracted using the
data mining process

It is simply an operation performed on databases to simplify the information so that it can be


further transformed into a machine learning algorithm.

All of the above

Answer: 1. These are groups of similar objects that differ significantly from objects in other groups.

19. Which statement given below closely defines the term data selection?

It is a knowledge discovery process of the actual discovery phase

The selection of correct data for the process of Knowledge Discovery Database

Subject-oriented, integrated data in support of management.

All of the above

Answer: 2. The selection of correct data for the process of the Knowledge Discovery Database
20. Which statement given below closely defines the term discovery?

It is something hidden in a database that needs to be found using certain clues (for example, it is
encrypted).

An extremely complex molecule that occurs in the human chromosomes and that carries
genetic information in the form of genes.

It is a kind of process of extracting implicit, previously unknown, and potentially useful
information from the data.

None of the above

Answer: 3. It is a kind of process of extracting implicit, previously unknown, and potentially
useful information from the data.
UNIT-II
ALGORITHMS

What Is Classification?
• Classification is the process of recognizing, understanding, and grouping ideas
and objects into preset categories or “sub-populations.”
• Using pre-categorized training datasets, machine learning programs use a
variety of algorithms to classify future datasets into categories.
• Classification algorithms in machine learning use input training data to predict the
likelihood that subsequent data will fall into one of the predetermined categories.
• One of the most common uses of classification is filtering emails into “spam” or
“non-spam.” In short, classification is a form of “pattern recognition,” with
classification algorithms applied to the training data to find the same pattern
(similar words or sentiments, number sequences, etc.) in future sets of data.
• Using classification algorithms, which we’ll go into more detail about below, text
analysis software can perform tasks like aspect-based sentiment analysis to
categorize unstructured text by topic and polarity of opinion (positive, negative,
neutral, and beyond).
• Try out this pre-trained sentiment classifier to understand how classification
algorithms work in practice, then read on to learn more about different types of
classification algorithms.
Data Classification process includes two steps −

• Building the Classifier or Model


• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase.
• In this step the classification algorithms build the classifier.
• The classifier is built from the training set made up of database
tuples and their associated class labels.
• Each tuple that constitutes the training set belongs to a predefined
category or class. These tuples can also be referred to as
samples, objects or data points.

Using Classifier for Classification

• In this step, the classifier is used for classification. Here the


test data is used to estimate the accuracy of classification
rules.
• The classification rules can be applied to the new data tuples if
the accuracy is considered acceptable.
Classification and Prediction Issues
The major issue is preparing the data for Classification and
Prediction. Preparing the data involves the following activities −

• Data Cleaning − Data cleaning involves removing the noise and
treatment of missing values. The noise is removed by applying
smoothing techniques, and the problem of missing values is
solved by replacing a missing value with the most commonly
occurring value for that attribute.
• Relevance Analysis − The database may also have irrelevant
attributes. Correlation analysis is used to know whether any
two given attributes are related.
• Data Transformation and reduction − The data can be transformed
by any of the following methods.
• Normalization − The data is transformed using
normalization. Normalization involves scaling all values
for a given attribute in order to make them fall within a
small specified range (a minimal sketch follows after this
list). Normalization is used when, in the learning step,
neural networks or methods involving measurements are used.
• Generalization − The data can also be transformed by
generalizing it to the higher concept. For this purpose we
can use the concept hierarchies.
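As referenced above, here is a tiny, hedged illustration (the attribute values are invented) of min-max normalization, scaling a numeric attribute into the range [0, 1]:

# Min-max normalization of one attribute's values into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

incomes = [30000, 42000, 58000, 61000, 75000]   # hypothetical attribute values
print(min_max_normalize(incomes))               # 30000 -> 0.0, 75000 -> 1.0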
Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of Classification and
Prediction −

• Accuracy − Accuracy of a classifier refers to its ability to
predict the class label correctly; the accuracy of a predictor
refers to how well a given predictor can guess the value of the
predicted attribute for new data.
• Speed − This refers to the computational cost in generating and
using the classifier or predictor.
• Robustness − It refers to the ability of classifier or predictor to
make correct predictions from given noisy data.
• Scalability − Scalability refers to the ability to construct the
classifier or predictor efficiently, given a large amount of data.
• Interpretability − It refers to the extent to which the classifier or
predictor can be understood.

Statistical Analysis:
• Data mining refers to extracting or mining knowledge from large amounts of data. In other
words, data mining is the science, art, and technology of exploring large and complex bodies
of data in order to discover useful patterns.
• Theoreticians and practitioners are continually seeking improved techniques to make the
process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways
in data mining:

Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify patterns
and trends. Alternatively, it is referred to as quantitative analysis.

Non-statistical Analysis: This analysis provides generalized information and includes sound, still images,
and moving images.

In statistics, there are two main categories:

Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the main
characteristics of that data. Graphs or numbers summarize the data. Average, Mode, SD(Standard
Deviation), and Correlation are some of the commonly used descriptive statistical methods.
Inferential Statistics: The process of drawing conclusions based on probability theory and generalizing
the data. By analyzing sample statistics, you can infer parameters about populations and make models
of relationships within data.


There are various statistical terms that one should be aware of while dealing with statistics. Some of
these are:

• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable

This is the analysis of raw data using mathematical formulas, models, and techniques. Through the use
of statistical methods, information is extracted from research data, and different ways are available to
judge the robustness of research outputs.

As a matter of fact, today’s statistical methods used in the data mining field typically are derived from
the vast statistical toolkit developed to answer problems arising in other fields. These techniques are
taught in science curriculums.

It is necessary to check and test several hypotheses; these hypotheses help us assess the validity of our
data mining endeavor when attempting to draw inferences from the data under study. When using more
complex and sophisticated statistical estimators and tests, these issues become more pronounced.

For extracting knowledge from databases containing different types of observations, a variety of statistical
methods are available in Data Mining, some of which are:

• Logistic regression analysis


• Correlation analysis
• Regression analysis
• Discriminate analysis
• Linear discriminant analysis (LDA)
• Classification
• Clustering
• Outlier detection
• Classification and regression trees,
• Correspondence analysis
• Nonparametric regression,
• Statistical pattern recognition,
• Categorical data analysis,
• Time-series methods for trends and periodicity
• Artificial neural networks

Now, let’s try to understand some of the important statistical methods which are used in data mining:

Linear Regression: The linear regression method uses the best linear relationship between the
independent and dependent variables to predict the target variable. In order to achieve the best fit,
the distances between the fitted line and the actual observations at each point should be as small as
possible. A good fit is one for which no other choice of line would produce fewer errors.

Simple linear regression and multiple linear regression are the two major types of linear regression. By
fitting a linear relationship to the independent variable, the simple linear regression predicts the
dependent variable. Using multiple independent variables, multiple linear regression fits the best linear
relationship with the dependent variable.
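As a small, hedged sketch of both variants (assuming scikit-learn and NumPy are available; the toy arrays are invented):

# Simple vs. multiple linear regression on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])          # dependent variable

X_simple = np.array([[1], [2], [3], [4], [5]])   # one predictor
simple_model = LinearRegression().fit(X_simple, y)

X_multi = np.array([[1, 7], [2, 6], [3, 9], [4, 4], [5, 8]])  # several predictors
multi_model = LinearRegression().fit(X_multi, y)

print(simple_model.coef_, simple_model.intercept_)
print(multi_model.coef_, multi_model.intercept_)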

Classification: This is a method of data mining in which a collection of data is categorized so that a
greater degree of accuracy can be predicted and analyzed. An effective way to analyze very large
datasets is to classify them. Classification is one of several methods aimed at improving the efficiency of
the analysis process. A Logistic Regression and a Discriminant Analysis stand out as two major
classification techniques.

Logistic Regression: It can also be applied to machine learning applications and predictive analytics. In
this approach, the dependent variable is either binary (binary regression) or multinomial (multinomial
regression): either one of the two or a set of one, two, three, or four options. With a logistic regression
equation, one can estimate probabilities regarding the relationship between the independent variable
and the dependent variable. For understanding logistic regression analysis in detail, you can refer to
logistic regression.
Discriminant Analysis: A Discriminant Analysis is a statistical method of analyzing data based on the
measurements of categories or clusters and categorizing new observations into one or more populations
that were identified a priori. The discriminant analysis models each response class independently then
uses Bayes’s theorem to flip these projections around to estimate the likelihood of each response
category given the value of X. These models can be either linear or quadratic.

Linear Discriminant Analysis: According to Linear Discriminant Analysis, each observation is assigned a
discriminant score to classify it into a response variable class. By combining the independent variables in
a linear fashion, these scores can be obtained. Based on this model, observations are drawn from a
Gaussian distribution, and the predictor variable covariance is common across all k levels of the response
variable Y.

Quadratic Discriminant Analysis: An alternative approach is provided by Quadratic Discriminant


Analysis. LDA and QDA both assume Gaussian distributions for the observations of the Y classes. Unlike
LDA, QDA considers each class to have its own covariance matrix. As a result, the predictor variables
have different variances across the k levels in Y.

Correlation Analysis: In statistical terms, correlation analysis captures the relationship between
variables in a pair. The value of such variables is usually stored in a column or rows of a database table
and represents a property of an object.

Regression Analysis: Based on a set of numeric data, regression is a data mining method that predicts a
range of numerical values (also known as continuous values). You could, for instance, use regression to
predict the cost of goods and services based on other variables. A regression model is used across
numerous industries for forecasting financial data, modeling environmental conditions, and analyzing
trends.

Clustering :
Clustering is similar to classification except that the groups are not predefined, but rather defined by
the data alone. Clustering is alternatively referred to as unsupervised learning or segmentation. It can be
thought of as partitioning or segmenting the data into groups that might or might not be disjoint. The
clustering is usually accomplished by determining the similarity among the data on predefined
attributes. The most similar data are grouped into clusters.
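A minimal, hedged sketch of this idea with scikit-learn's KMeans (the 2-D points are invented and chosen so that two groups are obvious):

# Group points into clusters defined by the data alone.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 0.5],      # one group of similar points
                   [8, 8], [8.5, 9], [9, 8]])       # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # centroids derived from the data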

There are two types of statistical-based algorithms, which are as follows −

• Regression − Regression problems deal with the estimation of an output value based on input values.
When used for classification, the input values are values from the database and the output values
define the classes. Regression can be used to solve classification problems, but it is also used for other
applications such as forecasting. The elementary form of regression is simple linear regression, which
involves only one predictor and a prediction.
Regression can be used to implement classification using two methods, which are as follows (a small
sketch follows this list) −
• Division − The data are divided into regions based on class.
• Prediction − Formulas are created to predict the output class's value.
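The "prediction" method above can be sketched as follows, under the assumption of a single predictor and
0/1 class labels (the data are invented): a simple linear regression is fitted and its output is thresholded
at 0.5 to obtain a class.

# simple linear regression used for a two-class problem (illustrative sketch)
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])        # single predictor
y = np.array([0, 0, 0, 1, 1, 1])        # class labels encoded as 0/1

slope, intercept = np.polyfit(x, y, 1)  # simple linear regression: y = slope*x + intercept

def predict_class(value):
    # threshold the regression output to obtain a class label
    return 1 if slope * value + intercept >= 0.5 else 0

print(predict_class(2.0), predict_class(5.5))   # expected: 0 1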

• Bayesian Classification − Statistical classifiers are used for classification. Bayesian classification
is based on the Bayes theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large
databases.

Bayes Theorem − Let X be a data tuple. In the Bayesian method, X is treated as "evidence." Let H be
some hypothesis, such as that the data tuple X belongs to a particular class C. We want to determine
P(H|X), the probability that hypothesis H holds given the "evidence," that is, the observed data tuple X.

P(H|X) is the posterior probability of H conditioned on X. For instance, suppose the data tuples are
limited to users described by the attributes age and income, and that X is a 30-year-old user with an
income of Rs. 20,000. Suppose H is the hypothesis that the user will purchase a computer. Then P(H|X)
reflects the probability that user X will purchase a computer given that the user's age and income are
known.

P(H) is the prior probability of H. For instance, this is the probability that any given user will purchase
a computer, regardless of age, income, or any other data. The posterior probability P(H|X) is based on
more information than the prior probability P(H), which is independent of X.
Likewise, P(X|H) is the posterior probability of X conditioned on H. It is the probability that a user
is 30 years old and earns Rs. 20,000, given that we know the user will purchase a computer.

P(H), P(X|H), and P(X) can be estimated from the given data. Bayes theorem provides a way of computing
the posterior probability P(H|X) from P(H), P(X|H), and P(X). It is given by

P(H|X) = P(X|H) P(H) / P(X)

Bayes rule allows us to assign probabilities of hypotheses given a data value, P(hj | Xi). Here we
discuss tuples, when in actuality each Xi may be an attribute value or other data label. Each hj may
be an attribute value, a set of attribute values (such as a range), or even a combination of attribute
values.
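A small worked sketch of the computer-purchase example; the prior, likelihood, and evidence values below
are invented purely to show the arithmetic of Bayes theorem.

# worked Bayes-rule arithmetic with assumed numbers
p_h = 0.4           # P(H): prior probability that any user buys a computer (assumed)
p_x_given_h = 0.3   # P(X|H): probability a buyer is 30 years old with Rs. 20,000 income (assumed)
p_x = 0.2           # P(X): probability that any user has that age/income profile (assumed)

p_h_given_x = (p_x_given_h * p_h) / p_x   # Bayes theorem
print(p_h_given_x)   # 0.6 -> posterior probability that this user buys a computer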
Advantages
The advantages of the load-sensitive routing algorithm are as follows −

• The ability of dynamic routing to route around congested links and improve application performance
makes it a valuable traffic engineering tool.

Disadvantages
There are some problems with a load-sensitive routing protocol, such as −

• Deployment of load-sensitive routing is hampered by the overheads imposed by link-state update
propagation, path selection, and signaling.
• Higher overhead on routers and, especially, instability.

What Is Hypothesis Testing in Statistics?

Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a
population parameter to the test. It is used to estimate the relationship between 2 statistical
variables.

Let's discuss a few examples of statistical hypotheses from real life -

• A teacher assumes that 60% of his college's students come from lower-middle-class families.

• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.

Now that you know about hypothesis testing, look at the two types of hypothesis testing in
statistics.

Hypothesis Testing Formula

Z = ( x̅ – μ0 ) / (σ /√n)
• Here, x̅ is the sample mean,

• μ0 is the population mean,

• σ is the population standard deviation,

• n is the sample size.

How Does Hypothesis Testing Work?


An analyst performs hypothesis testing on a statistical sample to present evidence of the
plausibility of the null hypothesis. Measurements and analyses are conducted on a random
sample of the population to test a theory. Analysts use a random population sample to test two
hypotheses: the null and alternative hypotheses.

The null hypothesis is typically an equality hypothesis between population parameters; for
example, a null hypothesis may claim that the population mean return equals zero. The
alternative hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean
return is not equal to zero). As a result, they are mutually exclusive, and only one can be
correct. One of the two possibilities, however, will always be correct.

Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no
bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the
alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.

Let's understand this with an example.

A sanitizer manufacturer claims that its product kills 95 percent of germs on average.

To put this company's claim to the test, create a null and alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

Another straightforward example to understand this concept is determining whether or not a
coin is fair and balanced. The null hypothesis states that the probability of a show of heads is
equal to the likelihood of a show of tails. In contrast, the alternate hypothesis states that the
probability of a show of heads and tails would be very different.

Hypothesis Testing Calculation With Examples

Let's consider a hypothesis test for the average height of women in the United States. Suppose
our null hypothesis is that the average height is 5'4". We gather a sample of 100 women and
determine that their average height is 5'5". The population standard deviation is 2 inches.

To calculate the z-score, we would use the following formula:

z = ( x̅ – μ0 ) / (σ /√n)

z = (5'5" - 5'4") / (2" / √100)

z = 1 / (2 / 10)

z = 5

We will reject the null hypothesis, as the z-score of 5 is very large (well beyond the usual critical value
of 1.645 for a one-sided test at the 5% level), and conclude that there is evidence to suggest that the
average height of women in the US is greater than 5'4".
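The same calculation can be reproduced in Python; the snippet below is a sketch that also derives the
one-sided p-value with scipy.stats (the heights are simply converted to inches).

# z-test for the height example
from math import sqrt
from scipy.stats import norm

sample_mean = 65.0    # 5'5" in inches
mu0 = 64.0            # null-hypothesis mean, 5'4"
sigma = 2.0           # population standard deviation (inches)
n = 100               # sample size

z = (sample_mean - mu0) / (sigma / sqrt(n))
p_value = 1 - norm.cdf(z)       # P(Z >= z) for the one-sided alternative
print(z, p_value)               # z = 5.0, p-value is roughly 2.9e-07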

Steps of Hypothesis Testing

Step 1: Specify Your Null and Alternate Hypotheses

It is critical to rephrase your original research hypothesis (the prediction that you wish to study)
as a null (Ho) and alternative (Ha) hypothesis so that you can test it quantitatively. Your first
hypothesis, which predicts a link between variables, is generally your alternate hypothesis. The
null hypothesis predicts no link between the variables of interest.
Step 2: Gather Data

For a statistical test to be legitimate, sampling and data collection must be done in a way that is
meant to test your hypothesis. You cannot draw statistical conclusions about the population you
are interested in if your data is not representative.

Step 3: Conduct a Statistical Test

A variety of statistical tests are available, but they all compare within-group variance (how spread
out the data is inside a category) against between-group variance (how different the categories
are from one another). If the between-group variation is big enough that there is little or no
overlap between groups, your statistical test will display a low p-value to represent this. This
suggests that the disparities between these groups are unlikely to have occurred by accident.
Alternatively, if there is a large within-group variance and a low between-group variance, your
statistical test will show a high p-value. Any difference you find across groups is most likely
attributable to chance. The variety of variables and the level of measurement of your obtained
data will influence your statistical test selection.

Step 4: Determine Rejection Of Your Null Hypothesis

Your statistical test results must determine whether your null hypothesis should be rejected or
not. In most circumstances, you will base your judgment on the p-value provided by the
statistical test. In most circumstances, your preset level of significance for rejecting the null
hypothesis will be 0.05 - that is, when there is less than a 5% likelihood that these data would
be seen if the null hypothesis were true. In other circumstances, researchers use a lower level
of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null
hypothesis.

Step 5: Present Your Results

The findings of hypothesis testing will be discussed in the results and discussion portions of
your research paper, dissertation, or thesis. You should include a concise overview of the data
and a summary of the findings of your statistical test in the results section. You can talk about
whether your results confirmed your initial hypothesis or not in the conversation. Rejecting or
failing to reject the null hypothesis is a formal term used in hypothesis testing. This is likely a
must for your statistics assignments.

Types of Hypothesis Testing


Z Test

To determine whether a discovery or relationship is statistically significant, hypothesis testing
uses a z-test. It usually checks whether two means are the same (the null hypothesis). A z-test can be
applied only when the population standard deviation is known and the sample size is 30 data points or
more.

T Test

A statistical test called a t-test is employed to compare the means of two groups. To determine
whether two groups differ or if a procedure or treatment affects the population of interest, it is
frequently used in hypothesis testing.

Chi-Square

You utilize a Chi-square test for hypothesis testing concerning whether your data is as
predicted. To determine if the expected and observed results are well-fitted, the Chi-square test
analyzes the differences between categorical variables from a random sample. The test's
fundamental premise is that the observed values in your data should be compared to the
predicted values that would be present if the null hypothesis were true.
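As a hedged illustration, the snippet below runs a two-sample t-test and a Chi-square goodness-of-fit
test with scipy.stats on invented toy data; the group values and category counts are assumptions made
only for this example.

# t-test and Chi-square test with scipy.stats (toy data)
from scipy.stats import ttest_ind, chisquare

# t-test: do two small groups have the same mean?
group_a = [12, 14, 15, 13, 16]
group_b = [18, 17, 19, 20, 16]
t_stat, t_p = ttest_ind(group_a, group_b)
print(t_stat, t_p)

# Chi-square: do observed category counts match the expected counts?
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
chi_stat, chi_p = chisquare(observed, f_exp=expected)
print(chi_stat, chi_p)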

What are the 3 major types of hypotheses?

The three major types of hypotheses are:

1. Null Hypothesis (H0): Represents the default assumption, stating that there is no significant
effect or relationship in the data.

2. Alternative Hypothesis (Ha): Contradicts the null hypothesis and proposes a specific effect or
relationship that researchers want to investigate.

3. Nondirectional Hypothesis: An alternative hypothesis that doesn't specify the direction of the
effect, leaving it open for both positive and negative possibilities.

Why Is Hypothesis Testing Important in Research Methodology?

Hypothesis testing is crucial in research methodology for several reasons:

1. Provides evidence-based conclusions: It allows researchers to make objective conclusions based on
empirical data, providing evidence to support or refute their research hypotheses.
2. Supports decision-making: It helps make informed decisions, such as accepting or rejecting a
new treatment, implementing policy changes, or adopting new practices.

3. Adds rigor and validity: It adds scientific rigor to research using statistical methods to analyze
data, ensuring that conclusions are based on sound statistical evidence.

4. Contributes to the advancement of knowledge: By testing hypotheses, researchers contribute to the
growth of knowledge in their respective fields by confirming existing theories or discovering new
patterns and relationships.

Distance based algorithms:

Distance measures play an important role in machine learning

They provide the foundations for many popular and effective machine
learning algorithms like KNN (K-Nearest Neighbours) for supervised
learning and K-Means clustering for unsupervised learning.

Different distance measures must be chosen and used depending on the type of data. As such, it is
important to know how to implement and calculate a range of different popular distance measures and
the intuitions for the resulting scores.

In this blog, we’ll discover distance measures in machine learning.

Overview:

1. Role of Distance Measures

2. Hamming Distance

3. Euclidean Distance

4. Manhattan Distance (Taxicab or City Block)


5. Minkowski Distance

6. Mahalanobis Distance

7. Cosine Similarity

Role of Distance Measures

• Distance measures play an important role in machine learning

• A distance measure is an objective score that summarizes the relative difference between two objects
in a problem domain.

Most commonly, the two objects are rows of data that describes a subject
(such as a person, car, or house), or an event (such as purchases, a claim, or
a diagnosis)
Perhaps, the most likely way we can encounter distance measures is when
we are using a specific machine learning algorithm that uses distance
measures at its core. The most famous algorithm is KNN — [K-Nearest
Neighbours Algorithm]

KNN :

A classification or regression prediction is made for new examples by calculating the distance between
the new example and all existing examples in the training dataset.
The K examples in the training dataset with the smallest distance are then selected and a prediction is
made by averaging the outcome (the mode of the class labels for classification or the mean of the real
values for regression).
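A minimal KNN sketch, assuming k = 3, Euclidean distance, and a small invented training set, that
predicts the class of a new example by majority vote over its nearest neighbours.

# k-nearest-neighbours sketch (k=3) with Euclidean distance (illustrative data)
from math import sqrt
from collections import Counter

train = [([1.0, 1.0], 'A'), ([1.5, 2.0], 'A'), ([2.0, 1.5], 'A'),
         ([6.0, 6.0], 'B'), ([6.5, 7.0], 'B'), ([7.0, 6.5], 'B')]

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(x, k=3):
    # sort training examples by distance to x and vote over the k closest labels
    neighbours = sorted(train, key=lambda item: euclidean(item[0], x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

print(knn_predict([2.0, 2.0]))   # expected 'A'
print(knn_predict([6.0, 7.0]))   # expected 'B'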

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of which
use distance measures in a similar manner. Another popular instance-based algorithm that uses distance
measures is Learning Vector Quantization, or LVQ, an algorithm that may also be considered a type of
neural network.

Next, we have the Self-Organizing Map algorithm, or SOM, which also uses distance measures and can be
used for supervised and unsupervised learning. Another unsupervised learning algorithm that uses
distance measures at its core is the K-means clustering algorithm.

In instance-based learning, the training examples are stored verbatim and a distance function is used to
determine which member of the training set is closest to an unknown test instance. Once the nearest
training instance has been located, its class is predicted for the test instance.

A few of the more popular machine learning algorithms that use distance measures at their core are

1. K-Nearest Neighbors (KNN)


2. Learning Vector Quantization (LVQ)

3. Self-Organizing Map (SOM)

4. K-Means Clustering

• There are many kernel-based methods that may also be considered distance-based algorithms.

• Perhaps the most widely known kernel method is the Support Vector Machine algorithm (SVM).

• The example set might have real, boolean, categorical, and ordinal values.

• Numerical values may have different scales. This can greatly impact the calculation of a distance
measure, and it is often good practice to normalize or standardize numerical values prior to calculating
the distance measure.

• Numerical error in regression problems may also be considered a distance. For example, the error
between the expected value and the predicted value is a one-dimensional distance measure that can be
summed or averaged over all examples in a test set to give a total distance between the expected and
predicted outcomes in the dataset.

• The calculation of the errors, such as the mean squared error or mean absolute error, may resemble a
standard distance measure.
As we can see, distance measures play an important role in machine
learning,

the most commonly used distance measures in machine learning are

1. Hamming Distance

2. Euclidean Distance

3. Manhattan Distance

4. Minkowski Distance

5. Mahalanobis

The most important thing is to know how to calculate each of these distance measures when implementing
algorithms from scratch, and to have the intuition for what is being calculated when using algorithms
that make use of these distance measures.

HAMMING DISTANCE

Hamming distance calculates the distance between two binary vectors, also
referred to as binary strings or bitstrings

We are most likely going to encounter binary strings when we one-hot encode categorical columns of data.

For example, a categorical column with the values red, green, and blue can be one-hot encoded into
three bit columns (an example set with one-hot encodings):

red   = [1, 0, 0]
green = [0, 1, 0]
blue  = [0, 0, 1]

the distance between red and green could be calculated as the sum or the
average number of bit differences between the two bitstrings. This is
Hamming distance.
For a One-hot encoded string, it might make more sense to summarize the
sum of the bit difference between the strings, which will always be a 0 or 1.

• Hamming Distance = sum for i to N abs(v1[i] - v2[i])

For bitstrings that may have many 1 bits, it is more common to calculate the
average number of bit differences to give a hamming distance score between
0(identical) and 1 (all different).

• Hamming Distance = (sum for i to N abs(v1[i] - v2[i])) / N

We can demonstrate this with an example of calculating the Hamming distance between two bitstrings,
listed below.

# calculating hamming distance between bit strings
# calculate hamming distance
def hamming_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b)) / len(a)

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]

# calculate distance
dist = hamming_distance(row1, row2)
print(dist)
We can see that there are two differences between the strings, or 2 out of 6
bit different, which averaged (2/6) is about 1/3 or 0.33.
0.33333333333333

We can also perform the same calculation using the hamming() function from SciPy.

# calculating hamming distance between bit strings
from scipy.spatial.distance import hamming

# define data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]

# calculate distance
dist = hamming(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our manual implementation.
0.33333333333333
Euclidean Distance

Euclidean distance calculates the distance between two real-valued vectors.

In order to calculate the distance between two data points A and B, the Pythagorean theorem considers
the lengths along the x and y axes.

You are most likely to use Euclidean distance when calculating the distance between two rows of data
that have numerical values, such as floating-point or integer values.

If columns have values with differing scales, it is common to normalize or standardize the numerical
values across all columns prior to calculating the Euclidean distance. Otherwise, columns that have large
values will dominate the distance measure.

Euclidean distance is calculated as the square root of the sum of the squared
differences between the two vectors.

• Euclidean Distance = sqrt(sum for i to N (v1[i] - v2[i])²)

If the distance calculation is to be performed thousands or millions of times, it is common to remove the
square root operation in an effort to speed up the calculation. The resulting scores will have the same
relative proportions after this modification and can still be used effectively within a machine learning
algorithm for finding the most similar examples.

• Euclidean Distance (squared) = sum for i to N (v1[i] - v2[i])²

This calculation is related to the L2 vector norm and is equivalent to the sum squared error (and to the
root sum squared error if the square root is added).

We can demonstrate this with an example of calculating the Euclidean distance between two real-valued
vectors, listed below.

# calculating euclidean distance between vectors
from math import sqrt

# calculate euclidean distance
def euclidean_distance(a, b):
    return sqrt(sum((e1 - e2) ** 2 for e1, e2 in zip(a, b)))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# calculate distance
dist = euclidean_distance(row1, row2)
print(dist)
Running the example reports the Euclidean distance between the two vectors.
Running the example reports the Euclidean distance between the two
vectors.
6.082762530298219

We can also perform the same calculation using the euclidean() function from SciPy. The complete example
is listed below.

# calculating euclidean distance between vectors
from scipy.spatial.distance import euclidean

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# calculate distance
dist = euclidean(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our
manual implementation.
6.082762530298219

Manhattan Distance (Taxicab or City Block Distance)

• The Manhattan distance, also called the Taxicab distance or the City
Block distance, calculates the distance between two real-valued
vectors.

• It is perhaps more useful for vectors that describe objects on a uniform grid, like a chessboard or city
blocks. The taxicab name for the measure refers to the intuition for what the measure calculates: the
shortest path that a taxicab would take between city blocks (coordinates on the grid).

• It might make sense to calculate Manhattan distance instead of Euclidean distance for two vectors in an
integer feature space.
• Manhattan distance is calculated as the sum of the absolute
differences between the two vectors.

• Manhattan Distance = sum for i to N |v1[i] - v2[i]|

The Manhattan distance is related to the L1 vector norm and the sum
absolute error and mean absolute error metric.

We can demonstrate this with an example of calculating the Manhattan distance between two integer
vectors, listed below.

# calculating manhattan distance between vectors
# calculate manhattan distance
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# calculate distance
dist = manhattan_distance(row1, row2)
print(dist)

Running the example reports the Manhattan distance between the two vectors.

Running the example reports the Manhattan distance between the two
vectors.
13

We can also perform the same calculation using the cityblock() function from SciPy. The complete example
is listed below.

# calculating manhattan distance between vectors
from scipy.spatial.distance import cityblock

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# calculate distance
dist = cityblock(row1, row2)
print(dist)

Running the example, we can see we get the same result, confirming our
manual implementation.
13
Minkowski Distance

Minkowski distance calculates the distance between two real-valued vectors.

It is a generalization of the Euclidean and Manhattan distance measures and adds a parameter, called the
"order" or "p", that allows different distance measures to be calculated.

The Minkowski distance measure is calculated as follows:

• Minkowski Distance = (sum for i to N (abs(v1[i] - v2[i]))^p)^(1/p)

where "p" is the order parameter.

When p is set to 1, the calculation is the same as the Manhattan distance. When p is set to 2, it is the
same as the Euclidean distance.

• p=1: Manhattan distance.

• p=2: Euclidean distance.

Intermediate values provide a controlled balance between the two measures.

It is common to use Minkowski distance when implementing a machine learning algorithm that uses
distance measures, as it gives control over the type of distance measure used for real-valued vectors via
a hyperparameter "p" that can be tuned.

We can demonstrate this calculation with an example of calculating the Minkowski distance between two
real vectors, listed below.

# calculating minkowski distance between vectors
# calculate minkowski distance
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)

# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example first calculates and prints the Minkowski distance
with p set to 1 to give the Manhattan distance, then with p set to 2 to give the
Euclidean distance, matching the values calculated on the same data from
the previous sections.
13.0
6.082762530298219

We can also perform the same calculation using the minkowski_distance() function from SciPy. The
complete example is listed below.

# calculating minkowski distance between vectors
from scipy.spatial import minkowski_distance

# define data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

# calculate distance (p=1)
dist = minkowski_distance(row1, row2, 1)
print(dist)

# calculate distance (p=2)
dist = minkowski_distance(row1, row2, 2)
print(dist)

Running the example, we can see we get the same results, confirming our
manual implementation.
13.0
6.082762530298219
Mahalanobis Distance

• Mahalanobis distance is an effective multivariate distance metric that measures the distance between a
point (vector) and a distribution. It is an extremely useful metric with excellent applications in
multivariate anomaly detection, classification on highly imbalanced datasets, and one-class classification,
among other untapped use cases.

• Despite these useful applications, this metric is seldom discussed or used in statistics or machine
learning workflows. This section explains why and when to use Mahalanobis distance, and then explains
the intuition and the math with useful applications.

• Mahalanobis distance is the distance between a point and a distribution, not between two distinct
points. It is effectively a multivariate equivalent of the Euclidean distance.

1. It transforms the columns into uncorrelated variables

2. Scale the columns to make their variance equal to 1

3. Finally, it calculates the Euclidean distance


where,
- D^2 is the square of the Mahalanobis distance.
- x is the vector of the observation (row in a dataset),
- m is the vector of mean values of independent variables (mean of each
column),
- C^(-1) is the inverse covariance matrix of independent variables.

Compute Mahalanobis Distance

import pandas as pd
import scipy as sp
import scipy.linalg  # ensure the linalg submodule is loaded
import numpy as np

filepath = 'local/input.csv'
# keep three numeric columns; this sketch assumes they are named 'A', 'B' and 'C'
df = pd.read_csv(filepath).iloc[:, [0, 4, 6]]
df.head()

Let's write the function to calculate Mahalanobis Distance:

def mahalanobis(x=None, data=None, cov=None):
    """Compute the Mahalanobis Distance between each row of x and the data.
    x    : vector or matrix of data with, say, p columns.
    data : ndarray of the distribution from which the Mahalanobis distance
           of each observation of x is to be computed.
    cov  : covariance matrix (p x p) of the distribution. If None, it will
           be computed from data.
    """
    x_minus_mu = x - np.mean(data)
    if cov is None:
        cov = np.cov(data.values.T)
    inv_covmat = sp.linalg.inv(cov)
    left_term = np.dot(x_minus_mu, inv_covmat)
    mahal = np.dot(left_term, x_minus_mu.T)
    return mahal.diagonal()

df_x = df[['A', 'B', 'C']].head(500)
df_x['maha_dist'] = mahalanobis(x=df_x, data=df[['A', 'B', 'C']])
df_x.head()

Cosine Distance:
The cosine distance metric is mostly used to find similarities between different documents. In the cosine
metric, we measure the angle between two documents/vectors (for example, the term frequencies of
different documents collected as vectors). This particular metric is used when the magnitude of the
vectors does not matter but their orientation does.

The cosine similarity formula can be derived from the equation of the dot product:

cos(θ) = (A · B) / (||A|| × ||B||)

Now that we have the value that will be used to measure similarity, we need to know what 1, 0, and -1
signify.

Here a cosine value of 1 is for vectors pointing in the same direction, i.e., there are similarities between
the documents/data points. A value of 0 is for orthogonal vectors, i.e., unrelated (no similarity). A value
of -1 is for vectors pointing in opposite directions (completely dissimilar).
sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True)

Example for Cosine Similarity

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]

result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
print(result)

Another version based on NumPy

from numpy import dot
from numpy.linalg import norm

a = [3, 45, 7, 2]
b = [2, 54, 13, 15]

cos_sim = dot(a, b) / (norm(a) * norm(b))
print(cos_sim)

Defining the Cosine Similarity function

import math

def cosine_similarity(v1, v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/(||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    return sumxy / math.sqrt(sumxx * sumyy)

v1, v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(cosine_similarity(v1, v2))

Running the example, we can see we get the same results, confirming our
manual implementation.
0.972284251712

Data Mining - Decision Tree Based Algorithm

Decision tree mining is a type of data mining technique that is used to build classification models. It
builds classification models in the form of a tree-like structure, just as its name suggests.

This type of mining belongs to supervised learning.

In supervised learning, the target result is already known. Decision trees can be used for both
categorical and numerical data. Categorical data represent gender, marital status, etc., while numerical
data represent age, temperature, etc.
An example of a decision tree with the dataset is shown below.

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node
denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a
class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer, which indicates whether a customer at a
company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each
leaf node represents a class.
The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.


• It is easy to comprehend.
• The learning and classification steps of a decision tree are
simple and fast.
Decision Tree Induction Algorithm

Decision Tree Induction

Decision tree induction is the method of learning the decision trees from the training set. The
training set consists of attributes and class labels. Applications of decision tree induction
include astronomy, financial analysis, medical diagnosis, manufacturing, and production.

A decision tree is a flowchart-like tree structure that is made from the training set tuples. The dataset is
broken down into smaller subsets that are present in the form of nodes of a tree. The tree structure has
a root node, internal nodes or decision nodes, leaf nodes, and branches.
The root node is the topmost node. It represents the best attribute selected for classification. Internal
nodes (decision nodes) represent a test on an attribute of the dataset, while a leaf node (terminal node)
represents the classification or decision label. The branches show the outcome of the test performed.

An example of a splitting attribute is shown below:

a. The partitioning above is discrete-valued.

b. The partitioning above is continuous-valued.

Advantages Of Decision Tree Classification

Enlisted below are the various merits of Decision Tree Classification:

1. Decision tree classification does not require any domain knowledge, hence, it is
appropriate for the knowledge discovery process.
2. The representation of data in the form of the tree is easily understood by humans
and it is intuitive.
3. It can handle multidimensional data.
4. It is a quick process with great accuracy.
Disadvantages Of Decision Tree Classification
Given below are the various demerits of Decision Tree Classification:

1. Sometimes decision trees become very complex and these are called overfitted
trees.
2. The decision tree algorithm may not be an optimal solution.
3. The decision trees may return a biased solution if some class label dominates it.
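A hedged sketch of training a decision tree for a buys_computer-style problem with scikit-learn; the
records and the integer encoding of age and student status are invented for illustration.

# decision tree classification sketch (toy buys_computer-style data)
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [age_code, student], where age_code 0=youth, 1=middle_aged, 2=senior; student 0/1
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = [0, 1, 1, 1, 0, 1]          # buys_computer: 0 = no, 1 = yes

tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)
print(export_text(tree, feature_names=['age_code', 'student']))  # the learned tests
print(tree.predict([[0, 1]]))   # classify a young student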
Neural Network:
A neural network is an information processing paradigm that is inspired by the human nervous system.
Just as the human nervous system has biological neurons, neural networks have artificial neurons, which
are mathematical functions derived from biological neurons. The human brain is estimated to have
around 10 billion neurons, each connected on average to 10,000 other neurons. Each neuron receives
signals through synapses that control the effects of the signal on the neuron.

Neural Network Architecture:

While there are numerous different neural network architectures that have been created by
researchers, the most successful applications in data mining neural networks have been multilayer
feedforward networks. These are networks in which there is an input layer consisting of nodes that
simply accept the input values and successive layers of nodes that are neurons as depicted in the above
figure of Artificial Neuron. The outputs of neurons in a layer are inputs to neurons in the next layer. The
last layer is called the output layer. Layers between the input and output layers are known as hidden
layers.
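A minimal multilayer feedforward network sketch using scikit-learn's MLPClassifier with one hidden
layer; the tiny two-feature dataset and the chosen layer size are assumptions made only for this example.

# multilayer feedforward network sketch (toy data)
from sklearn.neural_network import MLPClassifier

X = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
y = [0, 0, 0, 1, 1, 1]

# input layer -> one hidden layer of 5 neurons -> output layer
net = MLPClassifier(hidden_layer_sizes=(5,), solver='lbfgs', max_iter=1000, random_state=1)
net.fit(X, y)
print(net.predict([[0.2, 0.2], [0.8, 0.8]]))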
Rule Based Algorithm:

Rule Based Data Mining Classifier is a direct approach for data mining. This classifier is simple and more
easily interpretable than regular data mining algorithms. These are learning sets of rules which are
implemented using the IF-THEN clause. It works very well with both numerical as well as categorical
data.

IF-THEN Rules

Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the
following from −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes

THEN buy_computer = yes

Points to remember −

The IF part of the rule is called rule antecedent or precondition.

The THEN part of the rule is called rule consequent.

The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically
ANDed.

The consequent part consists of class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ^ (student = yes) => (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.

Points to remember −

• To extract a rule from a decision tree −


• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
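A small sketch of applying a set of IF-THEN rules such as R1 to tuples in Python; the rule set, the default
class, and the attribute names are invented for illustration.

# rule-based classification sketch (invented rules and tuples)
rules = [
    # (antecedent as a dict of attribute tests, consequent class)
    ({'age': 'youth', 'student': 'yes'}, 'buys_computer = yes'),
    ({'age': 'senior', 'credit_rating': 'excellent'}, 'buys_computer = yes'),
]
default_class = 'buys_computer = no'

def classify(tuple_):
    for antecedent, consequent in rules:
        # the antecedent is satisfied when every attribute test holds (logical AND)
        if all(tuple_.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return default_class

print(classify({'age': 'youth', 'student': 'yes'}))   # triggered by R1
print(classify({'age': 'youth', 'student': 'no'}))    # falls through to the default class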
Rule Induction Using Sequential Covering Algorithm:

The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not
need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the
tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the
rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed
and the process continues for the remaining tuples.

Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the
path to each leaf in a decision tree corresponds to a rule.

The following is the sequential learning algorithm, where rules are learned for one class at a time. When
learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and no tuples
from any other class.

Algorithm: Sequential Covering

Input:
    D, a data set of class-labeled tuples;
    Att_vals, the set of all attributes and their possible values.

Output: a set of IF-THEN rules.

Method:
    Rule_set = { };   // initial set of rules learned is empty
    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove tuples covered by Rule from D;
            Rule_set = Rule_set + Rule;   // add the new rule to the rule set
        until terminating condition;
    end for
    return Rule_set;

Other classification methods include Genetic Algorithms, the Rough Set Approach, and the Fuzzy Set Approach.

Data Mining Techniques:

Data mining includes the utilization of refined data analysis tools to find previously unknown, valid
patterns and relationships in huge data sets. These tools can incorporate statistical models, machine
learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data
mining incorporates analysis and prediction.

In recent data mining projects, various major data mining techniques have been developed and used,
including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:

This technique is used to obtain important and relevant information about data and metadata. This data
mining technique helps to classify data in different classes.

Data mining techniques can be classified by different criteria, as follows:

Classification of Data mining frameworks as per the type of data sources mined:

This classification is based on the type of data handled, for example multimedia, spatial data, text data,
time-series data, World Wide Web data, and so on.

Classification of data mining frameworks as per the database involved:

This classification is based on the data model involved, for example object-oriented database,
transactional database, relational database, and so on.
Classification of data mining frameworks as per the kind of knowledge discovered:

This classification depends on the types of knowledge discovered or data mining functionalities, for
example discrimination, classification, clustering, characterization, etc. Some frameworks tend to be
comprehensive, offering several data mining functionalities together.

Classification of data mining frameworks according to data mining techniques used:

This classification is as per the data analysis approach utilized, such as neural networks, machine
learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented, etc.

The classification can also take into account, the level of user interaction involved in the data mining
procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.

2. Clustering:

Clustering is a division of information into groups of connected objects. Describing the data by a few
clusters inevitably loses certain fine details, but achieves simplification. It models data by its clusters.

From a historical point of view, data modeling places clustering in a framework rooted in statistics,
mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden
patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data
concept. From a practical point of view, clustering plays an extraordinary role in data mining applications.

For example, scientific data exploration, text mining, information retrieval, spatial database applications,
CRM, Web analysis, computational biology, medical diagnostics, and much more.

In other words, we can say that Clustering analysis is a data mining technique to identify similar data.
This technique helps to recognize the differences and similarities between the data. Clustering is very
similar to the classification, but it involves grouping chunks of data together based on their similarities

3. Regression:

Regression analysis is a data mining process used to identify and analyze the relationship between
variables in the presence of other factors. It is used to define the probability of a specific variable.
Regression is primarily a form of planning and modeling. For example, we might use it to project certain
costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it
gives the exact relationship between two or more variables in the given data set.
4. Association Rules:

This data mining technique helps to discover a link between two or more items. It finds a hidden pattern
in the data set.

Association rules are if-then statements that support to show the probability of interactions between
data items within large data sets in different types of databases. Association rule mining has several
applications and is commonly used to help sales correlations in data or medical data sets.

The way the algorithm works is that you have various data, For example, a list of grocery items that you
have been buying for the last six months. It calculates a percentage of items being purchased together.

These are the three major measurement techniques (a worked example follows):

Lift:

This measures the strength of the confidence relative to how often item B is purchased overall:

Lift = Confidence / ((Transactions containing Item B) / (Entire dataset))

Support:

This measures how often items A and B are purchased together, compared to the overall dataset:

Support = (Transactions containing Item A and Item B) / (Entire dataset)

Confidence:

This measures how often item B is purchased when item A is purchased:

Confidence = (Transactions containing Item A and Item B) / (Transactions containing Item A)
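The worked example below computes the three measures on a small invented basket dataset (the items
and transactions are assumptions) for the rule bread -> milk.

# support, confidence and lift for the rule A -> B (toy transactions)
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk', 'butter'},
    {'bread', 'milk'},
]
n = len(transactions)
a, b = 'bread', 'milk'

count_a = sum(1 for t in transactions if a in t)
count_b = sum(1 for t in transactions if b in t)
count_ab = sum(1 for t in transactions if a in t and b in t)

support = count_ab / n            # (Item A + Item B) / (Entire dataset)
confidence = count_ab / count_a   # (Item A + Item B) / (Item A)
lift = confidence / (count_b / n) # Confidence / ((Item B) / (Entire dataset))

print(support, confidence, lift)  # 0.6, 0.75, 0.9375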

5. Outlier Detection:

This type of data mining technique relates to the observation of data items in the data set that do not
match an expected pattern or expected behavior. This technique may be used in various domains such as
intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.

The outlier is a data point that diverges too much from the rest of the dataset. The majority of the real-
world datasets have an outlier. Outlier detection plays a significant role in the data mining field. Outlier
detection is valuable in numerous fields like network interruption identification, credit or debit card
fraud detection, detecting outlying in wireless sensor network data, etc.

6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover
sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the
interestingness of a subsequence can be measured in terms of different criteria like length, occurrence
frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over some time.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering, classification,
etc. It analyzes past events or instances in the right sequence to predict a future event.
UNIT-III

CLUSTERING AND ASSOCIATION

Clustering in Data Mining:

Clustering:

The process of making a group of abstract objects into classes of similar objects is
known as clustering.

Points to Remember:

One group is treated as a cluster of data objects

In the process of cluster analysis, the first step is to partition the set of data into groups
with the help of data similarity, and then groups are assigned to their respective labels.

The biggest advantage of clustering over classification is that it can adapt to changes and helps single
out useful features that differentiate different groups.

Applications of cluster analysis :

It is widely used in many applications such as image processing, data analysis, and
pattern recognition.

It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.

It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.

It also helps in information discovery by classifying documents on the web.

Clustering Methods:

It can be classified based on the following categories.

 Model-Based Method

 Hierarchical Method

 Constraint-Based Method

 Grid-Based Method

 Partitioning Method
 Density-Based Method

Requirements of clustering in data mining:

The following are some points why clustering is important in data mining.

Scalability – we require highly scalable clustering algorithms to work with large


databases.

Ability to deal with different kinds of attributes – Algorithms should be able to work
with the type of data such as categorical, numerical, and binary data.

Discovery of clusters with attribute shape – The algorithm should be able to detect
clusters in arbitrary shapes and it should not be bounded to distance measures.

Interpretability – The results should be comprehensive, usable, and interpretable.

High dimensionality – The algorithm should be able to handle high dimensional space
instead of only handling low dimensional data.

Similarity Measures:

Similarity or Similarity distance measure is a basic building block of data mining and
greatly used in Recommendation Engine, Clustering Techniques and Detecting
Anomalies.

What is an Outlier in data mining?

In data analysis, the term outlier often comes up. As the name suggests, "outliers" refer to data points
that lie outside of what is expected. The major question about outliers is what you do with them.

Whenever you analyze a data set, you will have some assumptions about how the data were generated. If
you find some data points that are likely to contain some form of error, then these are definitely
outliers, and depending on the context, you will want to deal with those errors.

The data mining process involves the analysis and prediction of data that the data
holds. In 1969, Grubbs introduced the first definition of outliers.

Difference between outliers and noise


Any unwanted error or random variance in a previously measured variable is called noise. Before finding
the outliers present in any data set, it is recommended to first remove the noise.
Types of Outliers
Outliers are divided into three different types

1. Global or point outliers


2. Collective outliers
3. Contextual or conditional outliers

Global Outliers
Global outliers are also called point outliers. Global outliers are taken as the simplest
form of outliers. When data points deviate from all the rest of the data points in a given
data set, it is known as the global outlier. In most cases, all the outlier detection
procedures are targeted to determine the global outliers. The green data point is the
global outlier.
Collective Outliers
 In a given set of data, when a group of data points deviates from the rest of the
data set is called collective outliers. Here, the particular set of data objects may
not be outliers, but when you consider the data objects as a whole, they may
behave as outliers.
 To identify the types of different outliers, you need to go through background
information about the relationship between the behavior of outliers shown by
different data objects.
 For example, in an intrusion detection system, a DoS (denial-of-service) packet sent from one
system to another may be taken as normal behavior on its own. However, if this happens across
various computers simultaneously, it is considered abnormal behavior, and as a whole these points
are called collective outliers. The green data points as a whole represent the collective outlier.

Contextual Outliers
 As the name suggests, "contextual" means this outlier is introduced within a context.
 For example, in speech recognition, a single background noise may be a contextual outlier.
Contextual outliers are also known as conditional outliers.
 These types of outliers happen if a data object deviates from the other data
points because of any specific condition in a given data set.
As we know, there are two types of attributes of objects of data:

 contextual attributes
 behavioral attributes.

Contextual outlier analysis enables the users to examine outliers in different contexts
and conditions, which can be useful in various applications.

For example, A temperature reading of 45 degrees Celsius may behave as an outlier in


a rainy season. Still, it will behave like a normal data point in the context of a summer
season. In the given diagram, a green dot representing the low-temperature value in
June is a contextual outlier since the same value in December is not an outlier.

Outliers Analysis
Outliers are discarded at many places when data mining is applied. But it is still used in
many applications like fraud detection, medical, etc. It is usually because the events that
occur rarely can store much more significant information than the events that occur
more regularly.

Other applications where outlier detection plays a vital role are given below.

Any unusual response that occurs due to medical treatment can be analyzed through
outlier analysis in data mining.
 Fraud detection in the telecom industry
 In market analysis, outlier analysis enables marketers to identify the customer's
behaviors.

 In the Medical analysis field.


 Fraud detection in banking and finance such as credit cards, insurance sector, etc.

The process in which the behavior of the outliers is identified in a dataset is called
outlier analysis. It is also known as "outlier mining", the process is defined as a
significant task of data mining.

Hierarchical Clustering:

This method creates a hierarchical decomposition of the given set of data objects. We can classify
hierarchical methods on the basis of how the hierarchical decomposition is formed.

 There are two approaches here −

Agglomerative Approach

 Divisive Approach

 Agglomerative Approach

This approach is also known as the bottom-up approach. In this, we start with each object forming a
separate group. It keeps on merging the objects or groups that are close to one another. It keeps on
doing so until all of the groups are merged into one or until the termination condition holds.

Algorithm:

given a dataset (d1, d2, d3, ..., dn) of size N

# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about the primary diagonal,
    # we compute only the lower part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance[di, dj]

each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains

Agglomerative Hierarchical clustering :

Step-1: Consider each alphabet as a single cluster and calculate the distance of one
cluster from all the other clusters.

Step-2: In the second step comparable clusters are merged together to form a single
cluster. Let’s say cluster (B) and cluster (C) are very similar to each other therefore we
merge them in the second step similarly to cluster (D) and (E) and at last, we get the
clusters [(A), (BC), (DE), (F)]

Step-3: We recalculate the proximity according to the algorithm and merge the two
nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]

Step-4: Repeating the same process; The clusters DEF and BC are comparable and
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].
Step-5: At last the two remaining clusters are merged together to form a single cluster
[(ABCDEF)].

2. Divisive:

We can say that Divisive Hierarchical clustering is precisely the opposite of


Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we take into
account all of the data points as a single cluster and in every iteration, we separate the
data points from the clusters which aren’t comparable. In the end, we are left with N
clusters.

Advantages of Hierarchical clustering

 It is simple to implement and gives the best output in some cases.


 It is easy and results in a hierarchy, a structure that contains more information.
 It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering

 It breaks the large clusters.


 It is Difficult to handle different sized clusters and convex shapes.
 It is sensitive to noise and outliers.
 Once a merge or split has been done, it can never be undone or changed later.

Partitioning Algorithms:

Partitioning Method: This clustering method classifies the information into multiple groups based on the
characteristics and similarity of the data. It requires the data analyst to specify the number of clusters
that have to be generated for the clustering method.

In the partitioning method, given a database D that contains N objects, the partitioning method
constructs K user-specified partitions of the data, in which each partition represents a cluster and a
particular region.

There are many algorithms that come under the partitioning method; some of the popular ones are
K-Mean, PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc. Here we will see the
working of the K-Mean algorithm in detail.

K-Mean (A centroid-based Technique):

The K means algorithm takes the input parameter K from the user and partitions the dataset
containing N objects into K clusters so that resulting similarity among the data objects inside
the group (intra cluster) is high but the similarity of data objects with the data objects from
outside the cluster is low (inter cluster).

The similarity of the cluster is determined with respect to the mean value of the cluster. It is a
type of square error algorithm. At the start randomly k objects from the dataset are chosen in
which each of the objects represents a cluster mean(centre).

For the rest of the data objects, they are assigned to the nearest cluster based on their distance
from the cluster mean.
The new mean of each cluster is then calculated with the added data objects.

Algorithm: K-Mean

Input:
    K: the number of clusters into which the dataset has to be divided
    D: a dataset containing N objects
Output: a set of K clusters
Method:

 Randomly assign K objects from the dataset(D) as cluster centres(C)

 (Re) Assign each object to which object is most similar based upon mean values.

 Update Cluster means, i.e., Recalculate the mean of each cluster with the updated values.

 Repeat Step 2 until no change occurs.


Figure – K-Mean Clustering Flowchart

Example: Suppose we want to group the visitors to a website using just their age as follows:

16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66

Initial Cluster:
Note: These two points are chosen randomly from the dataset.

Iteration-1:

C1 = 16.33 [16, 16, 17]

C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-2:

C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]

C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]

C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-4:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]

C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]

There is no change between Iteration 3 and Iteration 4, so we stop. Therefore, using the K-Mean
algorithm we get the two clusters (16-29) and (36-66).
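As a sketch, the same age data can be clustered with scikit-learn's KMeans (K = 2); with several random
initializations it should recover essentially the same two groups found by hand above.

# K-Mean clustering of the ages with scikit-learn (K=2)
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
print(kmeans.cluster_centers_.ravel())   # the two cluster means
print(kmeans.labels_)                    # cluster assignment of each visitor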

Association rules :

What are association rules in data mining?


Association rules are "if-then" statements, that help to show the probability of
relationships between data items, within large data sets in various types of
databases.

Association rule mining has a number of applications and is widely used to


help discover sales correlations in transactional data or in medical data sets.

What are use cases for association rules?


In data science, association rules are used to find correlations and co-occurrences
between data sets. They are ideally used to explain patterns in data from seemingly
independent information repositories, such as relational databases and
transactional databases.

The act of using association rules is sometimes referred to as "association rule


mining" or "mining associations."
Below are a few real-world use cases for association rules:

 Medicine. Doctors can use association rules to help diagnose patients. There
are many variables to consider when making a diagnosis, as many diseases
share symptoms. By using association rules and machine learning-fueled data
analysis, doctors can determine the conditional probability of a given illness by
comparing symptom relationships in the data from past cases. As new
diagnoses get made, machine learning models can adapt the rules to reflect the
updated data.

 Retail. Retailers can collect data about purchasing patterns, recording


purchase data as item barcodes are scanned by point-of-sale systems. Machine
learning models can look for co-occurrence in this data to determine which
products are most likely to be purchased together. The retailer can then adjust
marketing and sales strategy to take advantage of this information.

 User experience (UX) design. Developers can collect data on how consumers
use a website they create. They can then use associations in the data to optimize
the website user interface -- by analyzing where users tend to click and what
maximizes the chance that they engage with a call to action, for example.

 Entertainment. Services like Netflix and Spotify can use association rules to
fuel their content recommendation engines. Machine learning models analyze
past user behavior data for frequent patterns, develop association rules and use
those rules to recommend content that a user is likely to engage with, or organize
content in a way that is likely to put the most interesting content for a given user
first.
How do association rules work?
Association rule mining, at a basic level, involves the use of machine
learning models to analyze data for patterns, or co-occurrences, in a database. It
identifies frequent if-then associations, which themselves are the association rules.

An association rule has two parts: an antecedent (if) and a consequent (then). An
antecedent is an item found within the data.

A consequent is an item found in combination with the antecedent.

Association rule algorithms


Popular algorithms that use association rules include AIS, SETM, Apriori and
variations of the latter.

With the AIS algorithm, itemsets are generated and counted as it scans the data. In transaction data, the AIS algorithm determines which large itemsets are contained in a transaction, and new candidate itemsets are created by extending the large itemsets with other items in the transaction data.

The SETM algorithm also generates candidate itemsets as it scans a


database, but this algorithm accounts for the itemsets at the end of its scan.
New candidate itemsets are generated the same way as with the AIS
algorithm, but the transaction ID of the generating transaction is saved with
the candidate itemset in a sequential data structure.

At the end of the pass, the support count of candidate itemsets is created by
aggregating the sequential structure. The downside of both the AIS and SETM
algorithms is that each one can generate and count many small candidate
itemsets, according to published materials from Dr. Saed Sayad, author
of Real Time Data Mining.
With the Apriori algorithm, candidate itemsets are generated using only the large itemsets of the previous pass. The large itemsets of the previous pass are joined with themselves to generate all itemsets whose size is larger by one. Each generated itemset that has a subset which is not large is then deleted. The remaining itemsets are the candidates.

The Apriori algorithm considers any subset of a frequent itemset to also be a


frequent itemset. With this approach, the algorithm reduces the number of
candidates being considered by only exploring the itemsets whose support
count is greater than the minimum support count, according to Sayad.
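As a rough sketch of the join, prune, and count steps just described (not a full Apriori implementation), the function below performs a single pass: it joins the large (k-1)-itemsets, prunes candidates that have a non-large subset, and keeps candidates whose support count meets the minimum. The toy transactions and threshold are assumptions.

from itertools import combinations

def apriori_pass(transactions, prev_large, k, min_support):
    # Join step: combine large (k-1)-itemsets to build candidate k-itemsets.
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step: drop candidates with any (k-1)-subset that is not large.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}
    # Count step: keep only candidates that reach the minimum support count.
    counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
    return {c for c, n in counts.items() if n >= min_support}

transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"milk", "diapers", "beer"},
                 {"bread", "diapers", "beer"}, {"milk", "bread", "diapers", "beer"}]]
large_1 = {frozenset({item}) for item in {"milk", "bread", "diapers", "beer"}}
print(apriori_pass(transactions, large_1, k=2, min_support=2))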

Uses of association rules in data mining


In data mining, association rules are useful for analyzing and predicting
customer behavior. They play an important part in customer analytics, market
basket analysis, product clustering, catalog design and store layout.

Programmers use association rules to build programs capable of machine


learning. Machine learning is a type of artificial intelligence (AI) that seeks to
build programs with the ability to become more efficient without being
explicitly programmed.

Examples of association rules in data mining


A classic example of association rule mining refers to a relationship between
diapers and beers. The example, which seems to be fictional, claims that men
who go to a store to buy diapers are also likely to buy beer. Data that would
point to that might look like this:

A supermarket has 200,000 customer transactions. About 4,000 transactions,


or about 2% of the total number of transactions, include the purchase of
diapers. About 5,500 transactions (2.75%) include the purchase of beer. Of
those, about 3,500 transactions, 1.75%, include both the purchase of diapers
and beer. If diapers and beer were purchased independently, the expected overlap would be only about 0.055% of transactions (2% x 2.75%), far lower than the observed 1.75%. The fact that about 87.5% of diaper purchases (3,500 of 4,000) also include beer indicates a strong link between diapers and beer.
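The figures above map directly onto the standard measures used in association rule mining; a small sketch computing support, confidence, and lift (a common companion measure, not discussed above) for the rule {diapers} -> {beer}:

total   = 200_000      # all transactions
diapers = 4_000        # transactions containing diapers (about 2%)
beer    = 5_500        # transactions containing beer (about 2.75%)
both    = 3_500        # transactions containing both (about 1.75%)

support    = both / total                         # ~0.0175
confidence = both / diapers                       # ~0.875, i.e. 87.5%
expected   = (diapers / total) * (beer / total)   # ~0.00055 if independent
lift       = support / expected                   # ~31.8, a strong positive association

print(f"support={support:.4f} confidence={confidence:.3f} lift={lift:.1f}")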
Introduction to Parallel Computing :
 Parallel Computing :
It is the use of multiple processing elements simultaneously for solving any
problem. Problems are broken down into instructions and are solved
concurrently as each resource that has been applied to work is working at the
same time.

 Advantages of Parallel Computing over Serial Computing are as follows:

 It saves time and money as many resources working together will reduce the
time and cut potential costs.

 It can be impractical to solve larger problems on Serial Computing.

 It can take advantage of non-local resources when the local resources are finite.

 Serial Computing ‘wastes’ the potential computing power; thus Parallel Computing makes better use of the hardware.

Types of Parallelism:

Bit-level parallelism –
It is the form of parallel computing based on increasing the processor’s word size. It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.

Example: Consider a scenario where an 8-bit processor must compute the sum of two
16-bit integers. It must first sum up the 8 lower-order bits, then add the 8 higher-order
bits, thus requiring two instructions to perform the operation. A 16-bit processor can
perform the operation with just one instruction.
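To make this concrete, the sketch below simulates how an 8-bit machine would carry out a 16-bit addition as two 8-bit additions (low-order bytes first, then high-order bytes plus the carry); the function name is just for illustration.

def add16_on_8bit_alu(a, b):
    lo = (a & 0xFF) + (b & 0xFF)                  # instruction 1: add low-order bytes
    carry = lo >> 8
    hi = ((a >> 8) + (b >> 8) + carry) & 0xFF     # instruction 2: add high-order bytes
    return ((hi << 8) | (lo & 0xFF)) & 0xFFFF

print(add16_on_8bit_alu(0x1234, 0x0FCD) == (0x1234 + 0x0FCD) & 0xFFFF)   # True

A 16-bit (or wider) processor performs the same addition in a single instruction.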

Instruction-level parallelism –
Without instruction-level parallelism, a processor can issue only one instruction in each clock cycle. Instructions that do not depend on each other can be re-ordered and grouped, and then executed concurrently without affecting the result of the program. This is called instruction-level parallelism.
Task Parallelism –
Task parallelism employs the decomposition of a task into subtasks and then allocating
each of the subtasks for execution. The processors perform the execution of sub-tasks
concurrently.

Data-level parallelism (DLP) –
Instructions from a single stream operate concurrently on several data elements. It is limited by non-regular data manipulation patterns and by memory bandwidth.
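A minimal data-parallel sketch in Python: the same operation is applied to chunks of a list concurrently using a process pool. The chunking scheme and worker count are arbitrary illustrative choices.

from multiprocessing import Pool

def sum_of_squares(chunk):
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]             # split the data four ways
    with Pool(processes=4) as pool:
        partials = pool.map(sum_of_squares, chunks)     # chunks are processed in parallel
    print(sum(partials) == sum(x * x for x in data))    # True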

Why parallel computing?

The real world runs in a dynamic nature, i.e., many things happen at the same time but at different places concurrently. The resulting data is extremely large and difficult to manage.

 Real-world data needs more dynamic simulation and modeling, and for achieving
the same, parallel computing is the key.

 Parallel computing provides concurrency and saves time and money.

 Complex, large datasets and their management can be organized effectively only by using parallel computing’s approach.

 Ensures the effective utilization of the resources. The hardware is guaranteed to


be used effectively whereas in serial computation only some part of the hardware
was used and the rest rendered idle.

 Also, it is impractical to implement real-time systems using serial computing.

Applications of Parallel Computing:

 Databases and Data mining.

 Real-time simulation of systems.

 Science and Engineering.

 Advanced graphics, augmented reality, and virtual reality.

Limitations of Parallel Computing:

 It introduces issues such as communication and synchronization between multiple sub-tasks and processes, which are difficult to achieve.

 The algorithms must be managed in such a way that they can be handled in a
parallel mechanism.

 The algorithms or programs must have low coupling and high cohesion. But it’s
difficult to create such programs.

 Only more technically skilled and expert programmers can code a parallelism-based program well.

Future of Parallel Computing:

 The computational graph has undergone a great transition from serial computing
to parallel computing.

 Tech giants such as Intel have already taken a step towards parallel computing by employing multicore processors.

 Parallel computation will revolutionize the way computers work in the future, for
the better good. With all the world connecting to each other even more than
before, Parallel Computing does a better role in helping us stay that way.

 With faster networks, distributed systems, and multi-processor computers, parallel computing becomes even more necessary.

Class Comparison Methods in Data Mining

The general procedure for class comparison is as follows:

Data Collection: The set of relevant data in the database and data warehouse is collected by query processing and partitioned into a target class and one or a set of contrasting classes.

Dimension relevance analysis: If there are many dimensions and analytical


comparisons are desired, then dimension relevance analysis should be
performed.

Only the highly relevant dimensions are included in the further analysis.
Synchronous Generalization: The process of generalization is performed
upon the target class to the level controlled by the user or expert specified
dimension threshold, which results in a prime target class relation or
cuboid.

The concepts in the contrasting class or classes are generalized to the


same level as those in the prime target class relation or cuboid, forming the
prime contrasting class relation or cuboid.

Presentation of the derived comparison: The resulting class comparison


description can be visualized in the form of tables, charts, and rules. This
presentation usually includes a "contrasting" measure (such as count%)
that reflects the comparison between the target and contrasting classes.
As desired, the user can adjust the comparison description by applying drill-
down, roll-up, and other OLAP operations to the target and contrasting
classes.

For example, suppose the task is to compare graduate and undergraduate students using a discriminant rule. Such a comparison can be specified to the mining system with a DMQL query; a rough illustration of the comparison output is shown below.
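The pandas sketch below illustrates the kind of output such a comparison produces: the count% of each generalized dimension value in the target class versus the contrasting class. The data and column names are assumptions, and this is plain Python rather than DMQL.

import pandas as pd

students = pd.DataFrame({
    "status":    ["graduate", "graduate", "graduate", "undergraduate",
                  "undergraduate", "undergraduate", "undergraduate"],
    "residence": ["city", "city", "suburb", "city", "suburb", "suburb", "suburb"],
})

# Count% of each residence value within the target and contrasting classes.
comparison = pd.crosstab(students["status"], students["residence"], normalize="index") * 100
print(comparison.round(1))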

III. Incremental Association Rule Mining

The mining of association rules on transactional database is usually an


offline process since it is costly to find the association rules in large
databases. With usual market-basket applications, new transactions are
generated and old transactions may be obsolete as time advances.

As a result, incremental updating techniques should be developed for


maintenance of the discovered association rules to avoid redoing mining
on the whole updated database.
A database may allow frequent or occasional updates and such updates
may not only invalidate existing association rules but also activate new
rules. Thus it is nontrivial to maintain such discovered rules in large
databases.

Considering an original database and newly inserted transactions, the


following four cases may arise:

Case 1: An itemset is large in the original database and in the newly


inserted transactions.

Case 2: An itemset is large in the original database, but is not large in the
newly inserted transactions.

Case 3: An itemset is not large in the original database, but is large in the
newly inserted transactions.

Case 4: An itemset is not large in the original database and in the newly
inserted transactions.

Since itemsets in Case 1 are large in both the original database and the
new transactions, they will still be large after the weighted average of the
counts. Similarly, itemsets in Case 4 will still be small after the new
transactions are inserted. Thus Cases 1 and 4 will not affect the final
association rules. Case 2 may remove existing association rules, and Case
3 may add new association rules.

A good rule-maintenance algorithm should thus accomplish the following:

1. Evaluate large itemsets in the original database and determine whether


they are still large in the updated database;
2. Find out whether any small itemsets in the original database may
become large in the updated database;

3. Seek itemsets that appear only in the newly inserted transactions and
determine whether they are large in the updated database.
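A minimal sketch of the count-merging idea behind these cases, assuming the support counts over the newly inserted transactions have already been obtained by scanning only the increment; the itemsets, counts, and threshold are illustrative. Note that itemsets that were small in the original database (Case 3) still require a rescan of the original database to obtain their old counts.

def update_supports(old_counts, old_size, new_counts, new_size, min_support):
    updated = {}
    for itemset in set(old_counts) | set(new_counts):
        # Combined count over the updated database (original + increment).
        count = old_counts.get(itemset, 0) + new_counts.get(itemset, 0)
        if count / (old_size + new_size) >= min_support:
            updated[itemset] = count          # still (or newly) large
    return updated

old = {frozenset({"diapers", "beer"}): 3500, frozenset({"milk"}): 900}
new = {frozenset({"diapers", "beer"}): 40, frozenset({"milk"}): 2}
print(update_supports(old, 200_000, new, 1_000, min_support=0.01))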

Types of Association Rules in Data Mining:

Association rule learning is a machine learning technique used for


discovering interesting relationships between variables in large databases.
It is designed to detect strong rules in the database based on some
interesting metrics.

For any given multi-item transaction, association rules aim to obtain rules
that determine how or why certain items are linked.

Association rules are created for finding information about general if-then patterns using two specific criteria, support and confidence, to define what the key relationships are.

Support helps to show the frequency of an itemset in the data, while confidence is defined by the number of times an if-then statement is found to be true.
Types of Association Rules:

There are various types of association rules in data mining:-

 Multi-relational association rules

 Generalized association rules

 Quantitative association rules

 Interval information association rules

1. Multi-relational association rules:

Multi-Relation Association Rules (MRAR) are a class of association rules different from the original, simple rules (and even from the multi-relational association rules usually extracted from multi-relational databases): each rule element consists of one entity but several relationships. These relationships represent indirect relationships between the entities.

2. Generalized association rules: Generalized association rule extraction is


a powerful tool for getting a rough idea of interesting patterns hidden in
data. However, since patterns are extracted at each level of abstraction, the
mined rule sets may be too large to be used effectively for decision-making.

Therefore, in order to discover valuable and interesting knowledge, post-


processing steps are often required. Generalized association rules should
have categorical (nominal or discrete) properties on both the left and right
sides of the rule.
3. Quantitative association rules: Quantitative association rules are a special type of association rule. Unlike general association rules, where both the left and right sides of the rule should be categorical (nominal or discrete) attributes, at least one attribute (left or right) of a quantitative association rule must contain a numeric attribute.

Uses of Association Rules

Some of the uses of association rules in different fields are given below:

Medical Diagnosis: Association rules in medical diagnosis can be used to


help doctors cure patients. As all of us know that diagnosis is not an easy
thing, and there are many errors that can lead to unreliable end results.
Using the multi-relational association rule, we can determine the probability
of disease occurrence associated with various factors and symptoms.

Market Basket Analysis: It is one of the most popular examples and uses
of association rule mining. Big retailers typically use this technique to
determine the association between items.

There are three steps for measuring data quality.

1) Extract all association rules.

2) Select compatible association rules.

3) Use the confidence factor of the compatible rules as a criterion for the data quality of the transaction.
Measuring Clustering Quality in Data Mining

A cluster is the collection of data objects which are similar to each other within the
same group. The data objects of a cluster are dissimilar to data objects of other
groups or clusters.

Clustering in Data Mining

Clustering Approaches:

1. Partitioning approach: The partitioning approach constructs various partitions and then evaluates them by some criterion, e.g., minimizing the sum of squared errors. It adopts exclusive cluster separation (each object belongs to exactly one group) and uses iterative relocation techniques to improve the partitioning by moving objects from one group to another. It uses a greedy approach and may converge only to a local optimum. It finds clusters with spherical shapes in small to medium-sized databases.

Partitioning approach methods:


K-means

k-medoids

CLARANS

2. Density-based approach: This approach is based on connectivity and density functions. A cluster keeps growing as long as the density (number of objects) in its neighbourhood exceeds some threshold, which allows clusters of arbitrary shape to be discovered and noise to be filtered out.

Density-based methods:

DBSCAN

OPTICS

3. Grid-based approach: This approach quantizes the object space into a finite number of cells that form a grid structure. It has fast processing time and is independent of the number of data objects. The grid-based clustering method is an efficient approach for spatial data mining problems.

Grid-based approach methods:

STING

WaveCluster

CLIQUE

4. Hierarchical approach: This creates a hierarchical decomposition of the data


objects by using some measures.
Hierarchical approach methods:

DIANA

AGNES

BIRCH

CHAMELEON

Measures for Quality of Clustering:

If all the data objects in a cluster are highly similar, then the cluster has high quality. In most situations we can measure the quality of clustering by using a dissimilarity/similarity metric.

There are also other methods to measure the quality of clustering when ground-truth category information about the objects is available.

1. Dissimilarity/Similarity metric:

The similarity between objects can be expressed in terms of a distance function, which is represented by d(i, j). Distance functions are different for various data types and data variables: the measure differs for continuous-valued variables, categorical variables, and vector variables.

Commonly used distance functions for numeric data include the following (a small sketch follows the list):

 Euclidean distance

 Mahalanobis distance

 Cosine distance
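A small numeric sketch of these distance functions using NumPy; the sample vectors, and the covariance estimate needed for the Mahalanobis distance, are assumptions.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(x - y)
# Cosine distance is 0 here because y points in the same direction as x.
cosine = 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Mahalanobis distance also needs the inverse covariance matrix of the data,
# estimated here from a small toy sample.
sample = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 5.0], [3.0, 5.0, 9.0], [2.0, 3.0, 4.0]])
cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
diff = x - y
mahalanobis = np.sqrt(diff @ cov_inv @ diff)

print(euclidean, cosine, mahalanobis)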

2. Cluster completeness: Cluster completeness is the essential parameter for


good clustering, if any two data objects are having similar characteristics then
they are assigned to the same category of the cluster according to ground
truth. Cluster completeness is high if the objects are of the same category.

Let us consider the clustering C1, which contains the sub-clusters s1 and s2,
where the members of the s1 and s2 cluster belong to the same category
according to ground truth. Let us consider another clustering C2 which is
identical to C1 but now s1 and s2 are merged into one cluster.

Then, for a clustering quality measure Q that respects cluster completeness, C2 should receive a higher score than C1, that is, Q(C2, Cg) > Q(C1, Cg).

3. Ragbag: In some situations, there can be a few categories in which the


objects of those categories cannot be merged with other objects.

Then the quality of those cluster categories is measured by the Rag Bag
method. According to the rag bag method, we should put the heterogeneous
object into a rag bag category.

Let us consider a clustering C1 and a cluster C ∈ C1 so that all objects in C


belong to the same category of cluster C1 except the object o according to
ground truth.

Consider a clustering C2 which is identical to C1 except that o is assigned to


a cluster D which holds the objects of different categories.

According to the ground truth, this situation is noisy and the quality of
clustering is measured using the rag bag criteria.

A clustering quality measure Q that respects the rag bag criterion should give C2 a higher score than C1, that is, Q(C2, Cg) > Q(C1, Cg).

4. Small cluster preservation: If a small category of clustering is further split


into small pieces, then those small pieces of cluster become noise to the
entire clustering and thus it becomes difficult to identify that small category
from the clustering.

The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces, because the resulting small pieces can easily be mistaken for noise and the small category becomes hard to identify from the clustering.

Suppose the ground truth contains a large category {d1, . . . , dn} and a small category {dn+1, dn+2}. Clustering C1 splits the small category: C11 = {d1, . . . , dn}, C12 = {dn+1}, and C13 = {dn+2}.

Clustering C2 splits the large category instead: C21 = {d1, . . . , dn−1}, C22 = {dn}, and C23 = {dn+1, dn+2}. As C1 splits the small category and C2 splits the big one, which is preferable according to the rule mentioned above, the clustering quality measure Q should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
UNIT-IV

DATA WAREHOUSING AND MODELING

Data Warehousing:

Introduction :
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources that support analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.

What is a Data Warehouse:

Data warehousing is defined as a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components which aids the strategic use of data.

It is the electronic storage of a large amount of information by a business, designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.


A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.

The basic architecture of a data warehouse :

The data stored in the warehouse is uploaded from the operational systems (such as marketing or sales).
The data may pass through an operational data store and may require data cleansing[2] for additional
operations to ensure data quality before it is used in the data warehouse for reporting.

Extract, transform, load (ETL) and extract, load, transform (ELT) are the two main approaches used to
build a data warehouse system.
Data warehouse characteristics

There are basic features that define the data in the data warehouse that include subject orientation,
data integration, time-variant, nonvolatile data, and data granularity.

Subject-oriented

Unlike the operational systems, the data in the data warehouse revolves around the subjects of the
enterprise. Subject orientation is not database normalization. Subject orientation can be really useful for
decision-making. Gathering the required objects is called subject-oriented.

Integrated

The data found within the data warehouse is integrated. Since it comes from several operational
systems, all inconsistencies must be removed. Consistencies include naming conventions, measurement
of variables, encoding structures, physical attributes of data, and so forth.

Time-variant

While operational systems reflect current values as they support day-to-day operations, data warehouse
data represents a long time horizon (up to 10 years) which means it stores mostly historical data. It is
mainly meant for data mining and forecasting. (E.g., if a user is searching for the buying pattern of a specific customer, the user needs to look at data on current and past purchases.)

Nonvolatile

The data in the data warehouse is read-only, which means it cannot be updated, created, or deleted
(unless there is a regulatory or statutory obligation to do so).[30]
Benefits

• A data warehouse maintains a copy of information from the source transaction systems. This
architectural complexity provides the opportunity to:
• Integrate data from multiple sources into a single database and data model. Greater consolidation of data into a single database means a single query engine can be used to present data in an ODS.
• Mitigate the problem of database isolation level lock contention in transaction processing
systems caused by attempts to run large, long-running analysis queries in transaction processing
databases.
• Maintain data history, even if the source transaction systems do not.
• Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.
• Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad
data.
• Present the organization's information consistently.
• Provide a single common data model for all data of interest regardless of the data's source.
• Restructure the data so that it makes sense to the business users.
• Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.
• Add value to operational business applications, notably customer relationship management
(CRM) systems.

Design methods:

Bottom-up design

• In the bottom-up approach, data marts are first created to provide reporting and analytical
capabilities for specific business processes. These data marts can then be integrated to create a
comprehensive data warehouse.
• The data warehouse bus architecture is primarily an implementation of "the bus", a collection
of conformed dimensions and conformed facts, which are dimensions that are shared (in a
specific way) between facts in two or more data marts.
Top-down design

• The top-down approach is designed using a normalized enterprise data model. "Atomic" data,
that is, data at the greatest level of detail, are stored in the data warehouse.
• Dimensional data marts containing data needed for specific business processes or specific
departments are created from the data warehouse.

Hybrid design

• Data warehouses often resemble the hub and spokes architecture. Legacy systems feeding the
warehouse often include customer relationship management and enterprise resource planning,
generating large amounts of data.
• To consolidate these various data models, and facilitate the extract transform load process,
data warehouses often make use of an operational data store, the information from which is
parsed into the actual data warehouse. To reduce data redundancy, larger systems often store
the data in a normalized way. Data marts for specific reports can then be built on top of the data
warehouse.
• A hybrid (also called ensemble) data warehouse database is kept in third normal form to eliminate data redundancy.
• A normal relational database, however, is not efficient for business intelligence reports where dimensional modelling is prevalent. Small data marts can shop for data from the consolidated warehouse and use the filtered, specific data for the fact tables and dimensions required.
• The data warehouse provides a single source of information from which the data marts can
read, providing a wide range of business information. The hybrid architecture allows a data
warehouse to be replaced with a master data management repository where operational (not
static) information could reside.

• The data vault modeling components follow hub and spokes architecture. This modeling style is
a hybrid design, consisting of the best practices from both third normal form and star schema.
The data vault model is not a true third normal form, and breaks some of its rules, but it is a top-
down architecture with a bottom up design.
• The data vault model is geared to be strictly a data warehouse. It is not geared to be end-user
accessible, which, when built, still requires the use of a data mart or star schema-based release
area for business purposes.

Data mart, OLAP, OLTP, predictive analytics:

• A data mart is a simple form of a data warehouse that is focused on a single subject (or
functional area), hence they draw data from a limited number of sources such as sales, finance
or marketing.
• Data marts are often built and controlled by a single department within an organization.
• The sources could be internal operational systems, a central data warehouse, or external data.
• Denormalization is the norm for data modeling techniques in this system. Given that data marts
generally cover only a subset of the data contained in a data warehouse, they are often easier
and faster to implement.

Difference between data warehouse and data mart

• Online analytical processing (OLAP) is characterized by a relatively low volume of transactions.


Queries are often very complex and involve aggregations. For OLAP systems, response time is an
effective measure.
• OLAP applications are widely used by Data Mining techniques. OLAP databases store
aggregated, historical data in multi-dimensional schemas (usually star schemas).
• OLAP systems typically have a data latency of a few hours, as opposed to data marts, where
latency is expected to be closer to one day.

• The OLAP approach is used to analyze multidimensional data from multiple sources and
perspectives.
• The three basic operations in OLAP are Roll-up (Consolidation), Drill-down, and Slicing & Dicing.
• Online transaction processing (OLTP) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE).

• OLTP systems emphasize very fast query processing and maintaining data integrity in multi-
access environments.
• OLTP systems, effectiveness is measured by the number of transactions per second. OLTP
databases contain detailed and current data. The schema used to store transactional databases
is the entity model (usually 3NF).

Normalization is the norm for data modeling techniques in this system.

• Predictive analytics is about finding and quantifying hidden patterns in the data using complex
mathematical models that can be used to predict future outcomes.
• Predictive analysis is different from OLAP in that OLAP focuses on historical data analysis and is
reactive in nature, while predictive analysis focuses on the future. These systems are also used
for customer relationship management (CRM).

What is data modeling?

• Data modeling is the process of creating a visual representation of either a whole information
system or parts of it to communicate connections between data points and structures.
• The goal is to illustrate the types of data used and stored within the system, the relationships
among these data types, the ways the data can be grouped and organized and its formats and
attributes.
• Data models are built around business needs. Rules and requirements are defined upfront
through feedback from business stakeholders so they can be incorporated into the design of a
new system or adapted in the iteration of an existing one.
• Data can be modeled at various levels of abstraction. The process begins by collecting
information about business requirements from stakeholders and end users
• These business rules are then translated into data structures to formulate a concrete database
design. A data model can be compared to a roadmap, an architect’s blueprint or any formal
diagram that facilitates a deeper understanding of what is being designed.

Types of data models

• Like any design process, database and information system design begins at a high level of abstraction and becomes increasingly more concrete and specific.
• Data models can generally be divided into three categories, which vary according to their
degree of abstraction.
• The process will start with a conceptual model, progress to a logical model and conclude with a
physical model. Each type of data model is discussed in more detail in subsequent sections:

Conceptual data models

• They are also referred to as domain models and offer a big-picture view of what the system will
contain, how it will be organized, and which business rules are involved.
• Conceptual models are usually created as part of the process of gathering initial project
requirements. Typically, they include entity classes (defining the types of things that are
important for the business to represent in the data model), their characteristics and constraints,
the relationships between them and relevant security and data integrity requirements. Any
notation is typically simple.
Logical data models

• They are less abstract and provide greater detail about the concepts and relationships in the
domain under consideration.
• One of several formal data modeling notation systems is followed. These indicate data
attributes, such as data types and their corresponding lengths, and show the relationships
among entities.
• Logical data models don’t specify any technical system requirements. This stage is frequently
omitted in agile or DevOps practices.
• Logical data models can be useful in highly procedural implementation environments, or for
projects that are data-oriented by nature, such as data warehouse design or reporting system
development.

Physical data models

• They provide a schema for how the data will be physically stored within a database. As such,
they’re the least abstract of all.
• They offer a finalized design that can be implemented as a relational database, including
associative tables that illustrate the relationships among entities as well as the primary keys and
foreign keys that will be used to maintain those relationships.
• Physical data models can include database management system (DBMS)-specific properties,
including performance tuning.
Data modeling process

• As a discipline, data modeling invites stakeholders to evaluate data processing and storage in
painstaking detail.
• Data modeling techniques have different conventions that dictate which symbols are used to
represent the data, how models are laid out, and how business requirements are conveyed.
• All approaches provide formalized workflows that include a sequence of tasks to be performed
in an iterative manner.

Those workflows generally look like this:

Identify the entities.

The process of data modeling begins with the identification of the things, events or concepts that are
represented in the data set that is to be modeled. Each entity should be cohesive and logically discrete
from all others.

Identify key properties of each entity. Each entity type can be differentiated from all others because it
has one or more unique properties, called attributes. For instance, an entity called “customer” might
possess such attributes as a first name, last name, telephone number and salutation, while an entity
called “address” might include a street name and number, a city, state, country and zip code.

Identify relationships among entities


The earliest draft of a data model will specify the nature of the relationships each entity has with the
others. In the above example, each customer “lives at” an address. If that model were expanded to
include an entity called “orders,” each order would be shipped to and billed to an address as well. These
relationships are usually documented via unified modeling language (UML).

Map attributes to entities completely.

• This will ensure the model reflects how the business will use the data. Several formal data
modeling patterns are in widespread use.
• Object-oriented developers often apply analysis patterns or design patterns, while stakeholders
from other business domains may turn to other patterns.
• Assign keys as needed, and decide on a degree of normalization that balances the need to
reduce redundancy with performance requirements.
• Normalization is a technique for organizing data models (and the databases they represent) in
which numerical identifiers, called keys, are assigned to groups of data to represent
relationships between them without repeating the data.
• For instance, if customers are each assigned a key, that key can be linked to both their address and their order history without having to repeat this information in the table of customer names (a small sketch follows this list).
• Normalization tends to reduce the amount of storage space a database will require, but it can come at a cost to query performance.
• Finalize and validate the data model. Data modeling is an iterative process that should be
repeated and refined as business needs change.
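As a small illustration of the keys-and-normalization point above, the sketch below keeps customers, addresses, and orders in separate tables linked by numeric keys, so the customer's details are never repeated in the orders table. The layout is an assumption for illustration.

# Customers, addresses and orders held in separate, normalized tables.
customers = {1: {"first_name": "Alice", "last_name": "Smith", "address_key": 10}}
addresses = {10: {"street": "12 Main St", "city": "Springfield", "zip": "12345"}}
orders = [
    {"order_id": 100, "customer_key": 1, "total": 59.90},
    {"order_id": 101, "customer_key": 1, "total": 12.50},
]

# The address and order history are reached through keys rather than by
# repeating the customer's name and address in every order row.
for order in orders:
    cust = customers[order["customer_key"]]
    addr = addresses[cust["address_key"]]
    print(order["order_id"], cust["first_name"], addr["city"], order["total"])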

Types of data modeling

Data modeling has evolved alongside database management systems, with model types increasing in
complexity as businesses' data storage needs have grown.

Here are several model types:

Hierarchical data models represent one-to-many relationships in a treelike format. In this type of model,
each record has a single root or parent which maps to one or more child tables.

This model was implemented in the IBM Information Management System (IMS), which was introduced
in 1966 and rapidly found widespread use, especially in banking. Though this approach is less efficient
than more recently developed database models, it’s still used in Extensible Markup Language (XML)
systems and geographic information systems (GISs).

Relational data models were initially proposed by IBM researcher E.F. Codd in 1970.

• They are still implemented today in the many different relational databases commonly used in
enterprise computing.
• Relational data modeling doesn’t require a detailed understanding of the physical properties of
the data storage being used. In it, data segments are explicitly joined through the use of tables,
reducing database complexity.
• Relational databases frequently employ structured query language (SQL) for data management.
• These databases work well for maintaining data integrity and minimizing redundancy. They’re
often used in point-of-sale systems, as well as for other types of transaction processing.
• Entity-relationship (ER) data models use formal diagrams to represent the relationships between
entities in a database.
• Several ER modeling tools are used by data architects to create visual maps that convey
database design objectives.

Object-oriented data models

• These models gained traction as object-oriented programming became popular in the mid-1990s. The “objects” involved are abstractions of real-world entities. Objects are grouped in class hierarchies, and have associated features.
• Object-oriented databases can incorporate tables, but can also support more complex data
relationships. This approach is employed in multimedia and hypertext databases as well as other
use cases.
• Dimensional data models were developed by Ralph Kimball, and they were designed to optimize
data retrieval speeds for analytic purposes in a data warehouse.
• While relational and ER models emphasize efficient storage, dimensional models increase
redundancy in order to make it easier to locate information for reporting and retrieval. This
modeling is typically used across OLAP systems.
• Two popular dimensional data models are the star schema, in which data is organized into facts
(measurable items) and dimensions (reference information), where each fact is surrounded by
its associated dimensions in a star-like pattern.

• The other is the snowflake schema, which resembles the star schema but includes additional
layers of associated dimensions, making the branching pattern more complex.

Key Difference Between Star Schema and Snowflake Schema

• The star schema is the simplest type of Data Warehouse schema. It is known as star schema as
its structure resembles a star.
• Comparing Snowflake vs Star schema, a Snowflake Schema is an extension of a Star Schema, and
it adds additional dimensions. It is called snowflake because its diagram resembles a Snowflake.
• In a star schema, only single join defines the relationship between the fact table and any
dimension tables.
• Star schema contains a fact table surrounded by dimension tables.
• In a snowflake schema, the fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.
• A snowflake schema requires many joins to fetch the data.
• Comparing Star vs Snowflake schema, the Star schema has a simple DB design, while the Snowflake schema has a very complex DB design.

What is a Star Schema?

• A Star Schema in a data warehouse is a schema in which the center of the star has one fact table and a number of associated dimension tables. It is known as a star schema as its structure resembles a star.
• The Star Schema data model is the simplest type of Data Warehouse schema. It is also known as
Star Join Schema and is optimized for querying large data sets.

In the following Star Schema example, the fact table is at the center and contains keys to every dimension table, like Dealer_ID, Model ID, Date_ID, Product_ID, Branch_ID, and other attributes like units sold and revenue.
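A toy sketch of querying such a star schema with pandas: the fact table joins to each dimension through a single key. The table contents and column names are assumptions, not the exact schema of the figure.

import pandas as pd

dim_product = pd.DataFrame({"Product_ID": [1, 2], "Product_Name": ["Laptop", "Phone"]})
dim_date    = pd.DataFrame({"Date_ID": [10, 11], "Quarter": ["Q1", "Q2"]})
fact_sales  = pd.DataFrame({"Product_ID": [1, 1, 2], "Date_ID": [10, 11, 10],
                            "Units_Sold": [5, 3, 7], "Revenue": [5000, 3000, 2100]})

# One join per dimension connects the central fact table to its dimensions.
report = (fact_sales
          .merge(dim_product, on="Product_ID")
          .merge(dim_date, on="Date_ID")
          .groupby(["Product_Name", "Quarter"], as_index=False)[["Units_Sold", "Revenue"]]
          .sum())
print(report)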
What is a Snowflake Schema?

• Snowflake Schema in data warehouse is a logical arrangement of tables in a multidimensional


database such that the ER diagram resembles a snowflake shape.
• A Snowflake Schema is an extension of a Star Schema, and it adds additional dimensions. The
dimension tables are normalized which splits data into additional tables.

In the following Snowflake Schema example, Country is further normalized into an individual table.

Difference between Star Schema and Snowflake Schema

Following is a key difference between Snowflake schema vs Star schema:

• Star Schema: Hierarchies for the dimensions are stored in the dimensional table. Snowflake Schema: Hierarchies are divided into separate tables.

• Star Schema: It contains a fact table surrounded by dimension tables. Snowflake Schema: One fact table is surrounded by dimension tables, which are in turn surrounded by further dimension tables.

• Star Schema: Only a single join creates the relationship between the fact table and any dimension table. Snowflake Schema: Many joins are required to fetch the data.

• Star Schema: Simple DB design. Snowflake Schema: Very complex DB design.

• Star Schema: Denormalized data structure, and queries also run faster. Snowflake Schema: Normalized data structure.

• Star Schema: High level of data redundancy. Snowflake Schema: Very low level of data redundancy.

• Star Schema: A single dimension table contains aggregated data. Snowflake Schema: Data is split into different dimension tables.

• Star Schema: Cube processing is faster. Snowflake Schema: Cube processing might be slow because of the complex joins.

• Star Schema: Offers higher-performing queries using star-join query optimization; tables may be connected with multiple dimensions. Snowflake Schema: Represented by a centralized fact table which is unlikely to be connected with multiple dimensions directly.

Data Warehousing - OLAP :

• Online Analytical Processing Server (OLAP) is based on the multidimensional data model. It
allows managers, and analysts to get an insight of the information through fast, consistent, and
interactive access to information.
• This chapter covers the types of OLAP, operations on OLAP, and the differences between OLAP, statistical databases, and OLTP.

Types of OLAP Servers

• We have four types of OLAP servers −


• Relational OLAP (ROLAP)
• Multidimensional OLAP (MOLAP)
• Hybrid OLAP (HOLAP)
• Specialized SQL Servers
Relational OLAP

ROLAP servers are placed between relational back-end server and client front-end tools. To store and
manage warehouse data, ROLAP uses relational or extended-relational DBMS.

ROLAP includes the following −

• Implementation of aggregation navigation logic.


• Optimization for each DBMS back end.
• Additional tools and services.
Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With
multidimensional data stores, the storage utilization may be low if the data set is sparse.

Therefore, many MOLAP server use two levels of data storage representation to handle dense and
sparse data sets.

Hybrid OLAP

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers higher scalability of ROLAP and faster
computation of MOLAP. HOLAP servers allows to store the large data volumes of detailed information.
The aggregations are stored separately in MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support for SQL queries
over star and snowflake schemas in a read-only environment.
OLAP Operations

Since OLAP servers are based on multidimensional view of data, we will discuss OLAP operations in
multidimensional data.

Here is the list of OLAP operations −

• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

Roll-up :

• Roll-up performs aggregation on a data cube in any of the following ways −


• By climbing up a concept hierarchy for a dimension
• By dimension reduction

The following diagram illustrates how roll-up works.

Roll-up :

• Roll-up is performed by climbing up a concept hierarchy for the dimension location.


• Initially the concept hierarchy was "street < city < province < country". On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
• The data is grouped into countries rather than cities. When roll-up is performed, one or more dimensions from the data cube are removed.
Drill-down :

• Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −
• By stepping down a concept hierarchy for a dimension
• By introducing a new dimension.

The following diagram illustrates how drill-down works −


Drill-down is performed by stepping down a concept hierarchy for the dimension time.

Initially the concept hierarchy was "day < month < quarter < year."

• On drilling down, the time dimension is descended from the level of quarter to the level of
month.
• When drill-down is performed, one or more dimensions are added to the data cube. It navigates the data from less detailed data to highly detailed data.

Slice

The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
Here, slice is performed for the dimension "time" using the criterion time = "Q1". It forms a new sub-cube by selecting a single dimension.

Dice

• Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
• The dice operation on the cube based on the following selection criteria involves three
dimensions.
(location = "Toronto" or "Vancouver")

(time = "Q1" or "Q2")

(item =" Mobile" or "Modem")

Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an
alternative presentation of data. Consider the following diagram that shows the pivot operation.
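A small pandas sketch of roll-up, slice, and dice on a toy cube; the data, the city < country hierarchy, and the selection criteria loosely mirror the examples above but are otherwise assumptions.

import pandas as pd

cube = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "city":    ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":   [605, 825, 14, 400],
})

# Roll-up: climb the location hierarchy from city up to country.
rollup = cube.groupby(["country", "quarter"], as_index=False)["sales"].sum()

# Slice: fix a single dimension, e.g. time = "Q1".
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["city"].isin(["Toronto", "Vancouver"])
            & cube["quarter"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

print(rollup, slice_q1, dice, sep="\n\n")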
OLAP vs OLTP

• OLAP: Involves historical processing of information. OLTP: Involves day-to-day processing.

• OLAP: Systems are used by knowledge workers such as executives, managers and analysts. OLTP: Systems are used by clerks, DBAs, or database professionals.

• OLAP: Useful in analyzing the business. OLTP: Useful in running the business.

• OLAP: Focuses on information out. OLTP: Focuses on data in.

• OLAP: Based on the Star Schema, Snowflake Schema and Fact Constellation Schema. OLTP: Based on the Entity Relationship Model.

• OLAP: Contains historical data. OLTP: Contains current data.

• OLAP: Database size is from 100 GB to 1 TB. OLTP: Database size is from 100 MB to 1 GB.

• OLAP: Number of users is in hundreds. OLTP: Number of users is in thousands.

• OLAP: Highly flexible. OLTP: Provides high performance.


UNIT -V

APPLICATIONS OF DATA WAREHOUSE

Building a Data Warehouse – A Step-By-Step Approach

Establishing a data warehousing system infrastructure that enables you to meet all of your
business intelligence targets is by no means an easy task. With Astera Data Warehouse Builder,
you can cut down the numerous standard and repetitive tasks involved in the data warehousing
lifecycle to just a few simple steps.

In this article, we will examine a use case that describes the process of building a data
warehouse with a step-by-step approach using Astera Data Warehouse Builder.

Use Case

Shop-Stop is a fictitious online retail store that currently maintains its sales data in an SQL
database. The company has recently decided to implement a data warehouse across its
enterprise to improve business intelligence and gain a more solid reporting architecture.
However, their IT team and technical experts have warned them about the substantial amount
of capital and resources needed to execute and maintain the entire process.

As an alternative to the traditional data warehousing approach, Shop-Stop has decided to use
Astera Data Warehouse Builder to design, develop, deploy, and maintain their data warehouse.
Let’s take a look at the process we’d follow to build a data warehouse for them.

Step 1: Create a Source Data Model

The first step in building a data warehouse with Astera Data Warehouse Builder is to identify
and model the source data. But before we can do that, we need to create a data warehousing
project that will contain all of the work items needed as part of the process. To learn how you
can create a data warehousing project and add new items to it, click here.
Once we’ve added a new data model to the project, we’ll reverse engineer Shop-Stop’s sales
database using the Reverse Engineer icon on the data model toolbar.

To learn more about reverse engineering from an existing database, click here.

Here’s what Shop-Stop’s source data model looks like once we’ve reverse engineered it:

Next, we’ll verify the data model to perform a check for errors and warnings. You can verify a
model through the Verify for Read and Write Deployment option in the main toolbar.

For more information on verifying a data model, click here.


After the model has been verified successfully, all that’s left to do is deploy it to the server and
make it available for use in ETL/ELT pipelines or for data analytics. In Astera Data Warehouse
Builder, you can do that through the Deploy Data Model option in the data model toolbar.

For more information on deploying a data model, click here.We’ve successfully created, verified,
and deployed a source data model for Shop-Stop.

Step 2: Build and Deploy a Dimensional Model

The next step in the process is to design a dimensional model that will serve as a destination schema for Shop-Stop's data warehouse. You can use the Entity object available in the data model toolbox and the data modeler's drag-and-drop interface to design a model from scratch.
However, in Shop-Stop’s case, they’ve already designed a data warehouse schema in an SQL
database. First, we’ll reverse engineer that database. Here’s what the data model looks like:

Note: Each entity in this model represents a table in Shop-Stop’s final data warehouse.

Next, we’ll convert this model into a dimensional model by assigning facts and dimensions. The
type for each entity, when a database is reverse engineered, is set as General by default. You
can conveniently change the type to Fact or Dimension by right-clicking on the entity, hovering
over Entity Type in the context menu, and selecting an appropriate type from the given options.
In this model, the Sale entity in the center is the fact entity and the rest of them are dimension
entities. Here is a look at the model once we’ve defined all of the entity types and converted it
into a dimensional model:
 To learn more about converting a data model into a dimensional model, click here.

 Once the dimensions and facts are in place, we’ll configure each entity for enhanced
data storage and retrieval by assigning specified roles to the fields present in the layout
of each entity.

 For dimension entities, the Dimension Role column in the Layout Builder provides a
comprehensive list of options. These include the following:

 Surrogate Key and Business Key.

 Slowly Changing Dimension types (SCD1, SCD2, SCD3, and SCD6); a small SCD2 sketch is shown after this list.

 Record identifiers (Effective and Expiration dates, Current Record Designator, and
Version Number) to keep track of historical data.

 Placeholder Dimension to keep track of late and early arriving facts and dimensions.

 As an example, here is the layout of the Employee entity in the dimensional model after
we’ve assigned dimension roles to its fields.
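A minimal sketch of the SCD Type 2 idea from the list above: rather than overwriting a changed attribute, the current row is expired and a new version with a fresh surrogate key is appended. The column names are assumptions, and this is plain pandas rather than Astera-specific behaviour.

import pandas as pd

employee_dim = pd.DataFrame([{
    "Employee_Key": 1, "Employee_ID": "E01", "City": "Toronto",
    "Effective_Date": "2020-01-01", "Expiration_Date": None, "Is_Current": True,
}])

def apply_scd2(dim, business_id, new_city, change_date):
    current = (dim["Employee_ID"] == business_id) & dim["Is_Current"]
    if not dim.loc[current, "City"].eq(new_city).all():
        # Expire the current version of the record...
        dim.loc[current, ["Expiration_Date", "Is_Current"]] = [change_date, False]
        # ...and append a new version with a fresh surrogate key.
        dim = pd.concat([dim, pd.DataFrame([{
            "Employee_Key": dim["Employee_Key"].max() + 1,
            "Employee_ID": business_id, "City": new_city,
            "Effective_Date": change_date, "Expiration_Date": None, "Is_Current": True,
        }])], ignore_index=True)
    return dim

employee_dim = apply_scd2(employee_dim, "E01", "Vancouver", "2023-06-01")
print(employee_dim)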
To learn more about fact entities, click here.

Now that the dimensional model is ready, we’ll verify and deploy it for further usage.

Step 3: Populate the Data Warehouse

In this step, we’ll populate Shop-Stop’s data warehouse by designing ETL pipelines to load
relevant source data into each table. In Astera Data Warehouse Builder, you can create ETL
pipelines in the dataflow designer.

Once you’ve added a new dataflow to the data warehousing project, you can use the extensive
set of objects available in the dataflow toolbox to design an ETL process. The Fact Loader and
Dimension Loader objects can be used to load data into fact and dimension tables, respectively.
Here is the dataflow that we’ve designed to load data into the Customer table in the data
warehouse:

On the left side, we’ve used a Database Table Source object to fetch data from a table present
in the source model. On the right side, we’ve used the Dimension Loader object to load data into
a table present in the destination dimensional model.

You’ll recall that both of the models mentioned above were deployed to the server and made
available for usage. While configuring the objects in this dataflow, we connected each of them
to the relevant model via the Astera Data Model connection in the list of data providers.
The Database Table Source object was configured with the source data model’s deployment.

On the other hand, the Dimension Loader object was configured with the destination
dimensional model’s deployment.

Note: ShopStop_Source and ShopStop_Destination represent each deployed data model.

To learn more about how you can use deployed data models in ETL pipelines, click here.

We've designed separate dataflows to populate each table present in Shop-Stop's data warehouse.
The dataflow that we designed to load data into the fact table is a bit different than the rest of
the dataflows because the fact table contains fields from multiple source tables. The Database
Table Source object that we saw in the Customer_Dimension dataflow can only extract data
from one table at a time. An alternative to this is the Data Model Query Source object, which
allows you to extract multiple tables from the source model by selecting a root entity.

To learn more about the Data Model Query Source object, click here.

Now that all of the dataflows are ready, we’ll execute each of them to populate Shop-Stop’s data
warehouse with their sales data. You can execute or start a dataflow through the Start Dataflow
icon in the main toolbar.

To avoid executing all of the dataflows individually, we’ve designed a workflow to orchestrate
the entire process.
Finally, we’ll automate the process of refreshing this data through the built-in Job Scheduler. To
access the job scheduler, go to Server > Job Schedules in the main menu.

In the Scheduler tab, you can create a new schedule to automate the execution process at a
given frequency.
Step 4: Visualize and Analyze

Shop-Stop’s data warehouse can now be integrated with industry-leading visualization and
analytics tools such as Power BI, Tableau, Domo, etc. through a built-in OData service. The
company can use these tools to effectively analyze their sales data and gain valuable business
insights from it.

Data Warehouse Architecture

A data warehouse is a heterogeneous collection of different data sources organised under a unified schema. There are two approaches for constructing a data warehouse: the top-down approach and the bottom-up approach, which are explained below.
1. Top-down approach:

The essential components are discussed below:

External Sources –

External source is a source from where data is collected irrespective of the type of data. Data
can be structured, semi structured and unstructured as well.

Stage Area –

Since the data extracted from the external sources does not follow a particular format, there is a need to validate this data before loading it into the data warehouse. For this purpose, it is recommended to use an ETL tool.

E (Extract): Data is extracted from the external data sources.

T (Transform): Data is transformed into the standard format.

L (Load): Data is loaded into the data warehouse after it has been transformed into the standard format.
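To make the three steps concrete, here is a minimal sketch in Python using only the standard library; the source file customers.csv, its columns and the dim_customer table are hypothetical and stand in for whatever a real ETL tool would connect to.

    import csv
    import sqlite3

    # E (Extract): read the raw rows from a hypothetical source file.
    with open("customers.csv", newline="") as f:
        raw_rows = list(csv.DictReader(f))

    # T (Transform): bring the rows into a standard format
    # (trim whitespace, normalise case, drop rows without a key).
    clean_rows = [
        (row["customer_id"].strip(),
         row["name"].strip().title(),
         row["country"].strip().upper())
        for row in raw_rows
        if row.get("customer_id", "").strip()
    ]

    # L (Load): write the standardised rows into the warehouse table.
    conn = sqlite3.connect("warehouse.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(customer_id TEXT PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?)", clean_rows)
    conn.commit()
    conn.close()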

Data-warehouse –

After cleansing, the data is stored in the data warehouse as a central repository. It actually stores the metadata, while the actual data is stored in the data marts. Note that the data warehouse stores the data in its purest form in this top-down approach.
Data Marts –

A data mart is also a part of the storage component. It stores the information of a particular function of an organisation which is handled by a single authority. There can be as many data marts in an organisation as there are functions. We can also say that a data mart contains a subset of the data stored in the data warehouse.

Data Mining –

 The practice of analysing the big data present in the data warehouse is data mining. It is used to find the hidden patterns present in the database or data warehouse with the help of data mining algorithms.

 This approach is defined by Inmon as follows: the data warehouse is built as a central repository for the complete organisation, and data marts are created from it after the complete data warehouse has been created.

Disadvantages of Top-Down Approach –

Limited user involvement:

The top-down approach is typically driven by IT departments, which may lead to limited user involvement in the design and implementation process. This can result in data marts that do not meet the specific needs of business users.

Data latency:

The top-down approach may result in data latency, particularly when data is sourced from
multiple systems. This can impact the accuracy and timeliness of reporting and analysis.

Data ownership:

The top-down approach can create challenges around data ownership and control. Since data
is centralized in the data warehouse, it may not be clear who is responsible for maintaining and
updating the data.

Cost:

The top-down approach can be expensive to implement and maintain, particularly for smaller
organizations that may not have the resources to invest in a large-scale data warehouse and
associated data marts.

Data integration:

The top-down approach may face challenges in integrating data from different sources, particularly when data is stored in different formats or structures. This can lead to data inconsistencies and inaccuracies.
2. Bottom-up approach:

First, the data is extracted from external sources (same as in the top-down approach).

Then, the data goes through the staging area (as explained above) and is loaded into data marts instead of the data warehouse. The data marts are created first and provide reporting capability; each addresses a single business area.

These data marts are then integrated into the data warehouse.

This approach is given by Kimball as follows: data marts are created first and provide a thin view for analysis, and the data warehouse is created after the complete data marts have been created.

Advantages of Bottom-Up Approach –

As the data marts are created first, reports can be generated quickly.

More data marts can be accommodated here, and in this way the data warehouse can be extended.

Also, the cost and time taken in designing this model are comparatively low.
Incremental development: The bottom-up approach supports incremental development,
allowing for the creation of data marts one at a time. This allows for quick wins and incremental
improvements in data reporting and analysis.

User involvement: The bottom-up approach encourages user involvement in the design and
implementation process. Business users can provide feedback on the data marts and reports,
helping to ensure that the data marts meet their specific needs.

Flexibility: The bottom-up approach is more flexible than the top-down approach, as it allows for
the creation of data marts based on specific business needs. This approach can be particularly
useful for organizations that require a high degree of flexibility in their reporting and analysis.

Faster time to value: The bottom-up approach can deliver faster time to value, as the data marts
can be created more quickly than a centralized data warehouse. This can be particularly useful
for smaller organizations with limited resources.

Reduced risk: The bottom-up approach reduces the risk of failure, as data marts can be tested
and refined before being incorporated into a larger data warehouse. This approach can also
help to identify and address potential data quality issues early in the process.

Scalability: The bottom-up approach can be scaled up over time, as new data marts can be
added as needed. This approach can be particularly useful for organizations that are growing
rapidly or undergoing significant change.

Data ownership: The bottom-up approach can help to clarify data ownership and control, as
each data mart is typically owned and managed by a specific business unit. This can help to
ensure that data is accurate and up-to-date, and that it is being used in a consistent and
appropriate way across the organization.

What is Meta Data?

Metadata is simply defined as data about data. The data that is used to represent other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, we can say that metadata is the summarized data that leads us to the detailed data. In terms of a data warehouse, metadata is the road map to the data warehouse: it defines the warehouse objects and acts as a directory to its contents.

Categories of Metadata

Metadata can be broadly categorized into three categories −

Business Metadata − It has the data ownership information, business definition, and changing
policies.

Technical Metadata − It includes database system names, table and column names and sizes,
data types and allowed values. Technical metadata also includes structural information such as
primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of data means whether the data is active, archived, or purged. Lineage of data means the history of the data migrated and the transformations applied to it.
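As a small illustration, the three categories could be recorded together for each warehouse column in a simple in-house metadata store; this is only a sketch with illustrative field names, not a standard metadata format.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ColumnMetadata:
        # Technical metadata: names, types and key information.
        table_name: str
        column_name: str
        data_type: str
        is_primary_key: bool = False
        # Business metadata: ownership and business definition.
        business_owner: Optional[str] = None
        business_definition: Optional[str] = None
        # Operational metadata: currency and lineage of the data.
        currency: str = "active"       # active, archived or purged
        lineage: Optional[str] = None  # where the data came from and how it was transformed

    # Example entry (illustrative values only).
    customer_name = ColumnMetadata(
        table_name="dim_customer",
        column_name="name",
        data_type="TEXT",
        business_owner="Sales department",
        business_definition="Full legal name of the customer",
        lineage="Extracted from the Customer source table; trimmed and title-cased during ETL",
    )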

Role of Metadata

 Metadata has a very important role in a data warehouse. Although its role differs from that of the warehouse data itself, it is equally important. The various roles of metadata are explained below.

 Metadata acts as a directory.

 This directory helps the decision support system to locate the contents of the data
warehouse.

 Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.

 Metadata helps in summarization between current detailed data and highly summarized
data.

 Metadata also helps in summarization between lightly detailed data and highly
summarized data.

 Metadata is used for query tools.


 Metadata is used in extraction and cleansing tools.

 Metadata is used in reporting tools.

 Metadata is used in transformation tools.

 Metadata plays an important role in loading functions.


Metadata Repository

Metadata repository is an integral part of a data warehouse system. It has the following
metadata −

Definition of data warehouse − It includes the description of structure of data warehouse. The
description is defined by schema, view, hierarchies, derived data definitions, and data mart
locations and contents.

Business metadata − It contains the data ownership information, business definitions, and changing policies.

Operational Metadata − It includes currency of data and data lineage. Currency of data means
whether the data is active, archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.
Data for mapping from operational environment to data warehouse − It includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.

Algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.

Challenges for Metadata Management

The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also
enforces the definition of business terms to business end-users. With all these uses of
metadata, it also has its challenges. Some of the challenges are discussed below.

Metadata in a big organization is scattered across the organization. This metadata is spread in
spreadsheets, databases, and applications.

Metadata could be present in text files or multimedia files. To use this data for information
management solutions, it has to be correctly defined.

There are no industry-wide accepted standards. Data management solution vendors have a narrow focus.

There are no easy and accepted methods of passing metadata.

Designing Data Marts :

Data marts should be designed as a smaller version of the starflake schema within the data warehouse and should match the database design of the data warehouse. This helps in maintaining control over database instances.
The summaries are data marted in the same way as they would have been designed within the
data warehouse. Summary tables help to utilize all dimension data in the starflake schema.
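The following is a minimal sketch, assuming pandas and the hypothetical fact_sales and dim_customer tables used in the earlier examples, of how such a summary table could be pre-aggregated for a data mart.

    import sqlite3
    import pandas as pd

    warehouse = sqlite3.connect("shopstop_warehouse.db")  # hypothetical warehouse database

    # Pull detailed fact rows together with one dimension attribute (city).
    detail = pd.read_sql(
        "SELECT f.order_date, f.amount, c.city "
        "FROM fact_sales f JOIN dim_customer c ON f.customer_key = c.customer_key",
        warehouse,
        parse_dates=["order_date"],
    )

    # Pre-aggregate to monthly sales per city: a typical summary table for a mart.
    summary = (detail
               .assign(month=detail["order_date"].dt.to_period("M").astype(str))
               .groupby(["city", "month"], as_index=False)["amount"].sum())

    summary.to_sql("mart_sales_by_city_month", warehouse, if_exists="replace", index=False)

Because the summary is keyed on the same dimension attributes as the starflake schema, the mart stays consistent with the warehouse design.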

Cost of Data Marting :

The cost measures for data marting are as follows −


 Hardware and Software Cost

 Network Access

 Time Window Constraints

Hardware and Software Cost

Although data marts are created on the same hardware, they require some additional hardware and software. To handle user queries, additional processing power and disk storage are required. If the detailed data and the data mart exist within the data warehouse, then we would face an additional cost to store and manage the replicated data.

Note − Data marting is more expensive than aggregations, therefore it should be used as an
additional strategy and not as an alternative strategy.

Network Access

A data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.
Time Window Constraints

The extent to which a data mart loading process will eat into the available time window depends
on the complexity of the transformations and the data volumes being shipped. The
determination of how many data marts are possible depends on −

 Network capacity.

 Time window available

 Volume of data being transferred

 Mechanisms being used to insert data into a data mart.

1. INTRODUCTION :

The basic requirements of good governance are derived from the fact that the laws and methods are well defined, transparent and easily understandable by people. Providing such good governance in a developing country like India is a challenge in itself, as many of the people are not educated or are not economically strong. The challenge becomes larger in developing countries where democratic methods are used in forming the governments. In a number of cases, the rules and procedures defined in the constitution themselves become obstacles in the path of governance due to the absence of transparency and procedural clarity. The solution to the above problems lies in developing a mechanism that is interactive, fast and provides a clear repository of guidelines that can be used for effective decision making by both the government and the people.

e-Governance is a mode that has a large number of advantages in implementing easy, transparent, fair and interactive solutions within a minimum time frame.

2. E-GOVERNANCE :

e-Governance involves a collection of technology-based processes that enable greater interaction between government and citizens and thereby improve the delivery of public services.

e-Governance is based on the effective utilization of information and communication technologies (ICT), with the major objectives of making public representatives more transparent, accountable and effective by providing improved information and service delivery and enhanced participation of people in day-to-day activities [18].

 ‘E’ in e-Government stands for much more than electronic and digital world.

 ‘E’ indicates:
 Efficient – do it the right way with the goal to achieve maximum output with minimum
effort and/or cost.

 Effective – do the right thing

 Empowerment – active role in governance process

 Enterprise – initiative and innovation

 Enhanced – enhanced user interface by providing access to government based services

 Environment friendly – it is achieved through paperless governance.

The advancements in ICT over the years, along with the Internet, provide an effective medium for establishing communication between people and the government, thereby playing a major role in achieving good governance goals. Information technology assists the government in providing effective governance in terms of time, cost and accessibility.

3. DATA WAREHOUSING AND DATA MINING :

A data warehouse has been defined by Inmon as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process" [11]. Data from a large number of homogeneous and/or heterogeneous sources is accumulated to form the data warehouse. It provides a convenient and effective platform, with the help of online analytical processing (OLAP), to run queries over consolidated data extracted from multiple data sources. A centralized repository is maintained to improve user access, where a large amount of data is archived for analysis purposes.

Data mining is an analysis tool used to extract knowledge from vast amounts of data for effective decision making. Mathematical and statistical concepts are used to uncover patterns, trends and relationships in the huge repository of data stored in a data warehouse [3].
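As a toy illustration of this kind of pattern mining, the following sketch counts which pairs of public services are frequently requested together in some hypothetical e-governance transactions, using only the Python standard library; it is a simplified stand-in for a full association-rule algorithm such as Apriori.

    from collections import Counter
    from itertools import combinations

    # Hypothetical transactions: sets of services requested together by citizens.
    transactions = [
        {"birth_certificate", "passport", "name_change"},
        {"property_tax", "water_bill"},
        {"birth_certificate", "passport"},
        {"property_tax", "water_bill", "electricity_bill"},
        {"birth_certificate", "passport", "voter_id"},
    ]

    # Count how often each pair of services co-occurs.
    pair_counts = Counter()
    for services in transactions:
        for pair in combinations(sorted(services), 2):
            pair_counts[pair] += 1

    # Keep the pairs that appear in at least 40% of the transactions (minimum support).
    min_support = 0.4 * len(transactions)
    frequent_pairs = [(pair, n) for pair, n in pair_counts.items() if n >= min_support]
    print(frequent_pairs)  # includes (('birth_certificate', 'passport'), 3)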

4. NEED OF DATA WAREHOUSING AND MINING IN E-GOVERNANCE :

There are some technical issues in the implementation of e-Governance which need to be taken into consideration. Some technical issues are [24]:

 Technical infrastructure for e-governance

 Collection, handling and managing huge volumes of data

 Analysis of the data for effective and correct decision making

 Online support to all departments of government organizations

 Extraction of unknown, relevant and interesting patterns (i.e. knowledge) from the huge volume of data collected

 Presentation of meaningful patterns for a timely decision making process

Large amounts of data have been accumulated by governments over the years. To use such data for effective decision-making, a data warehouse needs to be constructed over this enormous historical data. Queries that require complex analysis of data can then be handled effectively by decision-makers. It also helps the government in making decisions that have a huge impact on citizens. The decision makers are also provided with strategic intelligence to have a better view of the overall situation. This significantly assists the government in taking accurate decisions within a minimum time frame without depending on their IT staff.

The data mining approach extracts new and hidden interesting patterns (i.e. knowledge) from these large data sets.

The e-governance administrators can use this discovered knowledge to improve the quality of service. The decision-making activity in e-governance is mainly focused on the available funds, experiences from the past and ground reports.

Government institutions are now analyzing large amounts of current and historical data to identify new and useful patterns. The areas of focus include:

1) Data warehousing,

2) On-line Analytical Processing (OLAP), and

3) Data Mining

E-GOVERNANCE USING DATA WAREHOUSING AND DATA MINING :

Data mining is the tool to discover previously unknown useful patterns from large heterogeneous databases. As historical data needs to be accumulated from distinct sources for better analysis, and with the prices of storage devices becoming drastically cheaper, the concept of data warehousing came into existence. If there is no centralized repository of accurate data, the application of data mining tools is almost impossible.

There is a wide disparity in the allocation of resources across various government departments. Resources may be allocated additionally in one department while there may be an acute shortage in another. The reason behind this is the non-availability of any facility to transfer information from one department to another. It is also possible that, even if various government departments are computerized, the information available in one department might not be beneficial to other departments, as the information may be in dissimilar formats in heterogeneous database systems on diverse platforms. There are two approaches to designing a data warehouse: the top-down and the bottom-up approach. Information that starts from the top is divided to generate information for lower levels (top-down approach), while information that begins at the grass-roots level is combined to generate information for higher levels (bottom-up approach). This technique provides an ideal domain for an ‘E-Governance’ framework using Data Warehouse and Data Mining applications.

Mining E-Governance Data Warehouse :

A data warehouse is used for collecting, storing and analyzing data to assist the decision-making process. Data mining can be applied to any kind of information repository, such as data warehouses, different types of database systems, the World Wide Web, flat files, etc.

Therefore, data warehousing and data mining are well suited for a number of e-Governance applications in the G2B (Government to Business), G2C (Government to Citizen) and G2G (Government to Government) environments. In order to have an effective implementation, there should be a solid data warehouse built on data collected from heterogeneous, reliable sources.

Need for Data Warehousing and Data Mining (DWDM) in e-Governance :

The use of DWDM technologies will assist decision makers in reaching important conclusions that can play an important role in any ‘e-Governance’ initiative. The need for DWDM in e-governance includes:

Provision of integrated data from diverse platforms for better implementation of strategies at
state or national level.

To minimize piracy of data, as the storage requirement is reduced.

To increase operational effectiveness, as various employees work in a coordinated manner.

To increase transparency at the highest level, as relevant information will be available on the web.

To have better understanding of requirements of citizens.

To have faster access of data for effective decision making.

Integration of Data Warehousing and Data Mining with e-Governance

The advantages of integrating DWDM with ‘e-Governance’ are:

There is no requirement to deal with heterogeneous databases.

Officers will be able to derive the output at multiple levels of granularity.

There is no requirement to use complex tools to derive information from vast amounts of data.

In-depth analysis of data is possible to provide solutions to complex queries.

There is a massive reduction in dependence on IT staff.

It is a strong tool towards a corruption-free India.

ORIGINS OF E-GOVERNANCE IN INDIA :

The origin of e-governance in India in the mid-seventies was confined mainly to the area of defense and to handling queries involving large amounts of data related to the census, elections and tax administration. The setting up of the National Informatics Centre (NIC) in 1976 by the Government of India was a major boost towards e-governance. The major push towards e-governance was initiated in 1987 with the launch of NICNET, the national satellite-based computer network.

The launch of the District Information System by NIC to computerize all district offices in India was another major step towards e-governance. During the early nineties, there was a significant increase in the use of IT in applications through which government policies started reaching non-urban areas, with good inputs from a number of NGOs and the private sector.

The area of e-governance has become very wide now. The government is now implementing e-
governance in every field. e-governance has now spread its wings from urban to rural areas.
There is hardly any field left in which e-governance has not entered. The e-governance is playing
major role in routine transactions like payment of bills and taxes, public grievance system,
municipal services like maintaining records of land and property, issue of birth/death/marriage
certificate, registration and attorneys of properties, traffic management, health services,
disaster management, education sector, crime and criminal tracking system, public distribution
systems and, most importantly, providing up-to-date information in the agriculture sector. A number of states have set up their own portals, but most of these portals are incapable of providing a complete solution to people with just a click of the mouse. In most cases, ministries and individual departments have separate websites to provide the necessary information. This should not be the case, as the user has to visit multiple websites to get relevant information. Ideally, the official website should act as a single window to provide the necessary information and services.

CHALLENGES FOR IMPLEMENTATION OF EGOVERNANCE IN INDIA :

E-governance in India is at an infant stage. However, there are a limited number of successful and completed e-governance projects, such as e-Seva, CARD, etc. Lack of insight can be attributed as a major factor for the failure of e-governance projects in India. Reservation and inflation can be topics of national debate, but e-governance has never been an issue in Indian politics. Lately, the Government of India has risen to the occasion and started pushing projects related to e-governance. Some of the major challenges are:
a) Illiteracy and limited awareness regarding positives of e-governance

b) Casual attitude of government officers towards the public; the officers cannot be punished in the absence of proper guidelines.

c) Lack of electricity and internet facilities especially in rural areas to reap benefits of e-
governance.

d) Huge delays in implementing e-governance projects due to technical reasons or lack of proper support. Lack of understanding and interest among senior personnel also affects projects.

e) The diversity of the country is the biggest challenge. Implementing e-governance projects in local languages is a huge task.

f) The absence of a qualified pool of resources to manage the system is a challenging task. The refusal of IT professionals to work in rural areas also affects the projects.

g) The role of the public in policy making is negligible. If the opinion of people at the grassroots level is taken into account, then the majority of problems can be solved.

National Data warehouse :

A large number of national data warehouses can be identified from the existing data resources
within the central government ministries. These are potential subject areas in which data
warehouses may be developed at present and in the future.

Census Data :

 Census data can help planners and others understand their communities' social,
economic, and demographic conditions.

 Census data is the primary data used by planners to understand the social, economic,
and demographic conditions locally and nationally. We sometimes think the census is
unique to the United States, but census data is collected in other countries.

 A census is the total process of collecting, compiling, and publishing demographic, economic, and social data pertaining, at a specified time, to all persons in a country or in a well-defined part of a country. Most countries also include a housing census.

 A census is a procedure of systematically acquiring and recording information about the members of a given population. It is a regularly occurring and official count of a particular population. The term is used mostly in connection with national population and housing censuses; other common censuses include agriculture, business, and traffic censuses.
 Possible uses of Census Data (Examples)

 Economic and social planning

 Education planning

 Health care planning

 Infrastructure (roads, railway) development

 Improving housing conditions

 Providing amenities (water, electricity, communication, etc.)

 Assessment/Evaluation of programs

 Emergency planning and disaster response

 Business decisions by the private sector and individuals

 Administrative purposes (elections, creation of new units, etc.)

 Projection of future populations and their socio-economic needs (schools, elderly, immunizations, etc.)
