
DATA MINING AND R PROGRAMMING 23BIT55S

UNIT I: Introduction: Data Mining Applications – Data Mining Techniques – The Future of Data Mining – Data Mining Software. Data understanding and data preparation: introduction – data collection and preprocessing – Outliers – Mining Outliers – Missing data – Types of Data – Computing Distance – Data summarising using basic statistical measurements – Displaying data graphically – Multidimensional Data Visualisation.

UNIT II: Association Rule Mining: Introduction – Basics – The Task and a Naïve Algorithm – The Apriori Algorithm – Improving the Efficiency of the Apriori Algorithm – Mining Frequent Patterns without Candidate Generation (FP-Growth) – Performance Evaluation of Algorithms.
Classification: Introduction – Decision Tree – Overfitting and Pruning – Decision Tree Rules – Naïve Bayes Method – Estimating Predictive Accuracy of Classification Methods – Improving Accuracy of Classification Methods – Other Evaluation Criteria for Classification Methods – Classification Software.
UNIT III: Cluster Analysis: Introduction – Features – Types of Data – Computing Distance – Types of Cluster Analysis Methods – Partitioned Methods – Hierarchical Methods – Density Based Methods – Quality and Validity of Cluster Analysis Methods – Cluster Analysis Software.

TEXT BOOK
1. G.K. Gupta, “Introduction to Data Mining with Case Studies”, Prentice Hall of India (Pvt) Ltd, India, 2008.
Data Understanding and Data Preparation
Introduction:

Improving the quality of data in databases for use in data mining is a challenging task. The presence of incorrect and inconsistent data can significantly impact the result of any data mining analysis, and therefore the potential benefits of data mining may not be achieved. When the data is large, it is not possible to identify and correct errors manually, so we must find automatic and semi-automatic procedures to identify and correct them. The CRISP-DM data mining process consists of the following six phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment

The basic requirements include:
1. The data must be available
2. The data must be relevant, adequate and clean
3. There must be a well-defined problem.
Data Collection is the systematic process of gathering, measuring, and recording
data for research, analysis, or decision-making. It involves collecting data from
various sources, such as surveys, interviews, observations, experiments,
documents, or existing databases, to obtain relevant and reliable information.

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and combining datasets to enrich data.
Data cleaning is a process used to detect inaccurate, incomplete or unreasonable data items in a dataset and then improve data quality by correcting the detected errors and omissions.
Data required for data mining needs to be extracted from a number of databases, integrated, and perhaps cleansed and transformed. This process is called ETL (Extraction, Transformation and Loading). Errors have to be detected and removed, and missing data values may have to be dealt with.

When datasets are large, it is inevitable that there will be errors in the data because data collection techniques are never perfect. Data collection involves humans and machines, and both can be prone to errors. A variety of errors include:
a. Noise (data received in a form that cannot be used by the program)
b. Missing data
c. Duplicate data
d. Input errors
e. Outliers (some data values that are too large or too small)
The level of errors in data is usually between 1% and 10%. Data mining techniques should not be used without preprocessing the data.
There is a need to preprocess data even when all the data comes from one source, and the need for preprocessing increases significantly when the data comes from multiple sources. Integrating data from multiple sources can also lead to redundant data, because the same data may be represented in different ways in different data sources. The aim of data cleaning is not only to identify errors but also to correct or remove them.
Data preprocessing involves the following steps:
a. Data understanding:
It is difficult to start data preprocessing without a good understanding of the
data and the types of errors that may be present in the data. Careful study of
metadata for each data source may be required.
b. Mapping rules (the process of matching fields from one database to another, e.g., Coimbatore saved as CBE in one database and COIM in another)
When the data has been obtained by integrating a number of data sources, it is necessary to devise mapping rules to transform data that uses different conventions. Data preprocessing software helps in supporting this step.
c. Verification
A single step of verification may not be sufficient in some cases because some errors are not immediately visible.
d. Transformation
The metadata (data about data; for example, the index of a book serves as metadata for the contents of the book) that has been collected should be saved so that it can be used the next time it is required. Information about transformations also needs to be documented and saved.
e. Backflow
Once the errors in the source system data are identified in the above process, the errors may be reported back to the staff members who manage the source systems. These staff should replace the dirty data with clean data.
Extraction, Transformation and Loading (ETL)
ETL is a process that extracts the data from different source systems, then
transforms the data (like applying calculations, concatenations, etc.) and finally
loads the data into the Data Warehouse system.
It’s tempting to think that creating a data warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a data warehouse. This is far from the truth: it requires a complex ETL process. The ETL process requires active inputs from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.

Successful implementation of an ETL system involves resolving many issues
including the following:

What are the source systems?


These systems include relational database systems (e.g., Oracle, SQL Server, MySQL), legacy systems and flat files, as well as other systems.
To what extent are the source systems and the target systems interoperable?
The more different the sources and the target, the more complex the ETL
process is likely to be.
What ETL technology will be used?
In-house custom built solution
Acquire a generic ETL package that will either meet the needs of the enterprise
or can be modified to meet these needs.
How will the quality and integrity of the data be monitored?
Data cleaning is required to deal with issues like missing values, date formats,
code values, primary key and referential integrity.
How will a log of the dataset be maintained?
Once the dataset is being used, a dispute may arise about the origin of some data. It is therefore necessary to be able to log where each piece of information came from and when.
How will recovery take place?
Although database systems normally have a well-defined recovery procedure,
the dataset may also need to be recovered if it fails.
How will the transformation be carried out?
It is useful to have a temporary staging database, separate from the source systems, so that the data can be copied across to it, transformed, and then copied to the target area. If anything goes wrong, it is likely to affect only the staging database, without creating any major problems.
How will data be copied from non-relational legacy systems that may still be
operational?
It is necessary to work out a sensible methodology of copying data from such
systems and transforming it to eventually load it across to the staging area.

ETL Functions


The ETL process consists of data extraction from source systems, data transformation (which includes data cleaning), and loading of the data into a separate staging area. Tables 2.1, 2.2 and 2.3 show records from three source systems; the three records belong to the same supplier.

There are many errors and missing values in the records extracted from the three different source systems. After the ETL process the record will look like table 2.4. The table still has errors, but they cannot be corrected without additional information.

Sources of Errors in the Data


Building an integrated database from a number of source systems may involve
dealing with some or all of the following data errors.
a. Instance identity errors
In the three collected sources the same individual may be represented slightly differently in different source systems. For example, the business name may be represented as Krishna Software Inc or Krishna Inc, and a personal name may be misspelled as Gopal Gopta or Gopal Gupka, or written as GK Gupta, Dr GK Gupta or Mr GK Gupta. There is a high possibility of mismatching between different systems that needs to be identified and corrected, but achieving 100% consistency requires a huge amount of resources.
b. Data Errors
Many different types of data errors other than identity errors are possible. For example:
 Data may have some missing attribute values
 Coding of some values in one database may not match the coding in other databases (different codes with the same meaning, or the same code for different meanings)
 The meaning of some code values may not be known
 There may be duplicate records
 There may be wrong aggregations
 There may be inconsistent use of nulls, spaces and empty values
 Some attribute values may be inconsistent (outside their domain)
 Some data may be wrong because of input errors
 There may be inappropriate use of address lines
 There may be non-unique identifiers
All these types of errors can be addressed during the ETL process.
c. Record linkage problem
Record linkage relates to the problem of linking information from different databases that relates to the same customer or client. The problem can arise if a unique identifier is not available in all the databases that are being linked. Record linkage can involve a large number of record comparisons to ensure linkages that have a high level of accuracy.
d. Semantic integration problem
Data integration is the problem of combining data residing at different sources and building a unified view of the data. The unified view is called the global schema. Some sources of data may be relational, some may be text documents, and some may be character strings or integers.
e. Data Integrity problem
The database is expressed in the relational model. During data integration, integrity constraints should also be available. Data integrity deals with issues like referential integrity, null values and domains of values; overcoming all these issues is difficult. Many techniques for correcting errors have been developed. Telephone numbers in most countries have a given format and can be checked against that format. Similarly, postcodes have a fixed number of digits and/or letters in every country, and values may be checked for legality. Many different formats for dates are used, and it is relatively easy to check for invalid dates.

f. Data Entry Errors
When data is being entered by humans, errors are not uncommon. Data errors arise because of unmotivated data entry staff. Because data entry positions are often poorly paid and the work of such staff is rarely recognized, fatigue can also be an issue. Data entry errors may also be reduced if high quality data entry software is used.
g. Measurement errors
Data is often captured by instruments, since there is an increasing tendency to use sensor technology, e.g., capturing car speeds or temperatures. Errors can creep in because of instrument malfunctioning, poor calibration or poor testing of the software used in the instrument.
h. Filtering errors
Data collection may include filtering, smoothing and summarization, perhaps to reduce the volume of data. Each of these steps has the potential to produce errors. Data often cannot be cleaned without human involvement; data visualization software and exploratory data mining help in understanding the data and the data sources.

Selecting an ETL Tool


An ETL tool is required to provide coordinated access to multiple data sources so that relevant data may be extracted from them. An ETL tool normally includes tools for data cleaning, reorganization, transformation, aggregation, calculation, and automatic loading of data into the target database. An ETL tool should also provide an easy, point-and-click user interface.
OUTLIERS
An outlier is a data point that is remarkably distinct from the other data points in a dataset. It can be caused by errors, such as measurement errors or data entry errors, or it can be a legitimate observation that is simply rare or unusual. In data mining, outliers can significantly impact the results of an analysis or model. Scatter plots often have a pattern; we call a data point an outlier if it doesn't fit the pattern.

Consider a scatter plot that shows data for students on a backpacking trip, where each point represents a student. Notice how two of the points don't fit the pattern very well. These points have been labeled Brad and Sharon, which are the names of the students they represent. Sharon could be considered an outlier because she is carrying a much heavier backpack than the pattern predicts. Brad could be considered an outlier because he is carrying a much lighter backpack than the pattern predicts.
1. A multivariate outlier is a combination of extreme values on several variables that is inappropriate for the data. Say that in a class of students the average height and weight are 1.7 m and 50 kg respectively. If a student has a height of 4 m and a weight of 10 kg, that record is considered a multivariate outlier. Multivariate outliers are multi-dimensional outliers: two variables (weight and height) are involved, and in a scatter plot of weight against height the data point (10, 4) would stand out as an outlier.

2. Univariate outliers are extreme values in the distribution of a specific variable. Considering the temperature of a city in summer, a recorded temperature of around 100 degrees Celsius on a particular day would be a univariate outlier. Univariate outliers are one-dimensional outliers: only one variable (temperature) is involved, and the data point of 100 degrees Celsius is an outlier.

3. Distance-based outliers

An object o in a data set S is a distance-based (DB) outlier with parameters p and d, i.e., DB(p, d), if at least a fraction p of the objects in S lie at a distance greater than d from o. In other words, instead of depending on statistical tests, we can think of distance-based outliers as those objects that do not have enough neighbors. The neighbors are defined based on distance from the given object. In comparison with statistical-based methods, distance-based outlier detection generalizes or merges the ideas behind discordancy testing for standard distributions. Hence, a distance-based outlier is also known as a unified outlier or UO-outlier.
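As a rough illustration of this definition, the short R sketch below uses made-up data and arbitrary values of p and d; it flags a point as a DB(p, d) outlier when at least a fraction p of the other points lie farther than distance d from it.

```r
# DB(p, d) outlier check: flag a point when at least a fraction p of the
# other points lie farther than distance d from it (illustrative values only).
db_outliers <- function(x, p = 0.9, d = 3) {
  dmat <- as.matrix(dist(x))          # pairwise Euclidean distances
  n <- nrow(dmat)
  sapply(seq_len(n), function(i) {
    mean(dmat[i, -i] > d) >= p        # fraction of other points farther than d
  })
}

set.seed(1)
pts <- rbind(matrix(rnorm(40), ncol = 2), c(10, 10))  # 20 ordinary points + 1 extreme
which(db_outliers(pts))               # expect only the last point to be flagged
```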
MISSING DATA
In a large dataset it is inevitable that some data values are missing. There can
be a number of reasons for missing values including:
 The particular data has no value associated with it.
 The field was not applicable, the event did not happen, or the data was
not available.
 It could be that the person who entered the data did not know the right
value, or did not care if a field was filled in.
 The value is to be provided by a later step of the process.

For example, an employee may not have a date of birth because he or she comes from an area where birth records are not maintained properly; maybe the employee was a refugee who had no papers on arrival. Missing values can be represented as nulls, as the value N/A or some other code, or as an artificial value such as 9999. Nulls are usually treated as missing values.
The ways to handle missing data (a short R sketch follows this list):
 Ignore the tuple: this is done when the class label is missing. This method is not effective.
 Fill in the missing value manually: time consuming and may not be feasible for a large dataset with many missing values.
 Use the attribute mean to fill in the missing value: for example, suppose the average income of customers is Rs. 5,00,000; use this value to replace a missing value for income.
 Use the most probable value to fill in the missing value: this is determined with Bayesian formulae or decision tree induction. For example, using the other customer attributes in the dataset, it may be possible to predict the missing value.
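As a small illustration (using a hypothetical income vector, not data from the text), the R sketch below shows two of the strategies above: dropping tuples with missing values and filling them in with the attribute mean.

```r
# Hypothetical income values with one missing entry (NA).
income <- c(450000, 520000, NA, 480000, 550000)

# Ignore the tuple: drop the missing entries.
income_complete <- income[!is.na(income)]

# Fill in the missing value with the attribute mean.
income_imputed <- income
income_imputed[is.na(income_imputed)] <- mean(income, na.rm = TRUE)
income_imputed   # the NA is replaced by 500000
```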

TYPES OF DATA
Quantitative data
Quantitative data is information that can be counted or measured, in other words quantified, and given a numerical value. The two main types of quantitative data are discrete data and continuous data. Height in feet, age in years, and weight in pounds are examples of quantitative data; other examples include revenue in dollars, weight in kilograms or pounds, age in months or years, distance in miles or kilometers, time in days or weeks, experiment results, website conversion rates, and website page load speed. Qualitative data, by contrast, is descriptive data that is not expressed numerically.

Binary data
A binary variable is a categorical variable that can only take one of two values, usually represented as a Boolean (True or False) or as an integer (0 or 1), where 0 typically indicates that the attribute is absent and 1 indicates that it is present.
Some examples of binary variables, i.e. attributes, are:

 Smoking is a binary variable with only two possible values: yes or no


 A medical test has two possible outcomes: positive or negative
 Gender is traditionally described as male or female
 Health status can be defined as diseased or healthy
 Company types may have two values: private or public
 E-mails can be assigned into two categories: spam or not
 Credit card transactions can be fraud or not

Qualitative data or categorical data


A categorical variable is made up of a categorical characteristic such as a person's gender, hometown, etc. Examples of categorical data include travel method to school, favourite sport, school postcode, birthdate, and many more; the birthdate and postcode both contain a number system. Categorical data is a form of qualitative data that can be grouped into categories instead of measured numerically, like pet preference (dogs or cats). Categorical data does not have intrinsic numeric value: the categories can only be compared, not measurably ranked, so one is not greater than the other.

Qualitative ranked data


Ranked data are the result of assigning ranks in a specified order using the integers 1, 2, 3, …, n. Ranks can be assigned according to the level of performance in a test, a contest, a competition, an interview, or a show. For example, as a result of their performance in an interview, candidates may be assigned ranks ranging from 1 to n. In this instance, the ranks correspond to an underlying continuous quality, namely performance.

COMPUTING DISTANCE
A number of data mining techniques require that the distance between two sets of data be computed. There are a number of methods for computing distance in a multidimensional environment.
Properties of distance:

 Distance is always positive.
 The distance from point x to itself is always zero.
 The distance from point x to point y cannot be greater than the sum of the distance from x to some intermediate point z and the distance from z to y (the triangle inequality).
 The distance from x to y is always the same as the distance from y to x.

a)Euclidean Distance

In Mathematics, the Euclidean distance is defined as the distance between two


points. In other words, the Euclidean distance between two points in the
Euclidean space is defined as the length of the line segment between two points.
Let us assume two points, such as (x1, y1) and (x2, y2) in the two-dimensional
coordinate plane.

Thus, the Euclidean distance formula is given by:

d = √[(x2 – x1)² + (y2 – y1)²]

Where,

“d” is the Euclidean distance


(x1, y1) is the coordinate of the first point
(x2, y2) is the coordinate of the second point

b) Manhattan distance is a distance measure that is calculated by taking the sum of the absolute differences between the x and y coordinates of the two points.

The Manhattan distance is also known as Manhattan length. In other words, it


is the distance between two points measured along axes at right angles.
c) Chebyshev Distance
The Chebyshev distance between two points is the maximum of the absolute differences of their coordinates. On a chessboard, where one is using a discrete Chebyshev distance rather than a continuous one, the circle of radius r is a square of side length 2r measured from the centers of squares, and thus each side contains 2r + 1 squares; for example, the circle of radius 1 on a chessboard is a 3×3 square.

d) Distance and Similarity


Similarity functions are used to measure the ‘distance’ between two vectors, numbers or pairs. A similarity function is a measure of how similar the two objects being measured are: the two objects are deemed to be similar if the distance between them is small, and vice versa.
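The sketch below computes the three distances for a pair of made-up points, both directly from their definitions and with R's built-in dist() function.

```r
x <- c(1, 2)
y <- c(4, 6)

sqrt(sum((x - y)^2))   # Euclidean distance: 5
sum(abs(x - y))        # Manhattan distance: 7
max(abs(x - y))        # Chebyshev distance: 4

# The same results from dist() on a two-row matrix:
m <- rbind(x, y)
dist(m, method = "euclidean")
dist(m, method = "manhattan")
dist(m, method = "maximum")   # "maximum" is the Chebyshev distance
```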

DATA SUMMARIZATION USING BASIC STATISTICAL MEASUREMENTS


Statistical measures are a descriptive analysis technique used to summarise the
characteristics of a data set. This data set can represent the whole population
or a sample of it. Statistical measures can be classified as measures of central
tendency and measures of spread.
Mean
To calculate the mean, we add all the numbers in the data and divide the sum by the number of entries in the list. For example:
Mean = (5 + 9 + 4 + 6) / 4 = 6

Median is the value that lies in the center of the data set. To calculate the
median, we first sort the data in ascending order and then choose the middle
value. If the number of entries is odd, then the median is simply the center value.
If this number is even, we take the average of the two central values to get the
median.
Mode represents the most frequent value of a data set. If no values are repeated
in the data, then there is no mode.
5, 5, 7, 8, 9, 1, 2, 5, 8

Then the mode would be 5, since it is repeated three times.

Standard deviation and Variance

The variance is the average of the squared differences from the mean. To figure out the variance, calculate the difference between each point within the data set and the mean, then square and average the results. The standard deviation is the square root of the variance.

For example, if a group of numbers ranges from one to 10, the mean is 5.5. If you square the differences between each number and the mean and find their sum, the result is 82.5. To figure out the sample variance:

σ² = Σ (xi – x̅)² / n for a population, or Σ (xi – x̅)² / (n - 1) for a sample

 Divide the sum, 82.5, by n - 1, which is the sample size (in this case 10) minus 1.
 The result is a variance of 82.5/9 = 9.17, and the standard deviation is √9.17 ≈ 3.03.
A short R sketch of these measures is given below.
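A minimal sketch in R, using a small made-up vector. R has no built-in mode function for data, so the most frequent value is found from a frequency table.

```r
x <- c(5, 5, 7, 8, 9, 1, 2, 5, 8)

mean(x)                       # arithmetic mean
median(x)                     # middle value of the sorted data
names(which.max(table(x)))    # most frequent value (the mode), "5" here
var(x)                        # sample variance (divides by n - 1)
sd(x)                         # standard deviation = square root of the variance
range(x)                      # lowest and highest values
```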

Covariance and correlation

Covariance and correlation both assess the relationship between variables. Covariance measures how two random variables vary together relative to their expected values; using covariance, we can only gauge the direction of the relationship. Correlation measures the strength of the relationship between variables: it is the scaled measure of covariance and is dimensionless.

A positive correlation is a relationship between two variables in which both variables move in the same direction: one variable increases as the other increases, or one decreases while the other decreases. An example of a positive correlation would be height and weight, since taller people tend to be heavier.

A negative correlation is a relationship between two variables in which an increase in one variable is associated with a decrease in the other. An example of a negative correlation would be height above sea level and temperature: as you climb a mountain (increase in height), it gets colder (decrease in temperature).
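A brief R sketch with made-up height and weight values, illustrating that cov() reports only the direction of the relationship while cor() gives its dimensionless strength:

```r
height <- c(150, 160, 165, 170, 180)   # cm (made-up values)
weight <- c(50, 56, 61, 63, 72)        # kg

cov(height, weight)   # positive: the two variables move together
cor(height, weight)   # close to +1: a strong positive correlation
```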

Quantiles
The word “quantile” comes from the word quantity. In simple terms, a quantile
is where a sample is divided into equal-sized, adjacent, subgroups (that’s why
it’s sometimes called a “fractile“). It can also refer to dividing a probability
distribution into areas of equal probability.
The median is a quantile: it is placed in a probability distribution so that exactly half of the data is lower than the median and half of the data is above it. The median cuts a distribution into two equal areas, and so it is sometimes called the 2-quantile.

Range of Values

The Range is the difference between the lowest and highest values.

Example: In {4, 6, 9, 3, 7} the lowest value is 3 and the highest is 9, so the range is 9 − 3 = 6.

DISPLAYING DATA GRAPHICALLY

Box Plot: It is a type of chart that depicts a group of numerical data through
their quartiles. It is a simple way to visualize the shape of our data. It makes
comparing characteristics of data between categories very easy.

A box plot gives a five-number summary of a set of data, which is:


 Minimum – It is the minimum value in the dataset excluding the
outliers
 First Quartile (Q1) – 25% of the data lies below the First (lower)
Quartile.
 Median (Q2) – It is the mid-point of the dataset. Half of the values
lie below it and half above.
 Third Quartile (Q3) – 75% of the data lies below the Third (Upper)
Quartile.
 Maximum – It is the maximum value in the dataset excluding the
outliers.

The area inside the box (50% of the data) is known as the Inter Quartile
Range. The IQR is calculated as –
IQR = Q3-Q1
Outliers are the data points below and above the lower and upper limit. The
lower and upper limit is calculated as –
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
The values below and above these limits are considered outliers and the
minimum and maximum values are calculated from the points which lie under
the lower and upper limit.
How to create a box plot
Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches –
100,120,110,150,110,140,130,170,120,220,140,110.
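A short R sketch applying these formulas to the cricket scores above (the exact quartile values depend slightly on the quantile method; R's default is assumed here):

```r
runs <- c(100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110)

q <- quantile(runs, c(0.25, 0.5, 0.75))   # Q1, median, Q3
iqr   <- q[3] - q[1]                      # inter-quartile range
lower <- q[1] - 1.5 * iqr                 # lower limit
upper <- q[3] + 1.5 * iqr                 # upper limit

runs[runs < lower | runs > upper]         # values flagged as outliers (220 here)
boxplot(runs, main = "Runs per match")    # draw the box plot
```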
b.Histogram
A histogram is a graph that shows the frequency of numerical data using
rectangles. The height of a rectangle (the vertical axis) represents the
distribution frequency of a variable (the amount, or how often that variable
appears). The width of the rectangle (horizontal axis) represents the value of
the variable (for instance, minutes, years, or ages).
A histogram is a graphical representation of a grouped frequency distribution with continuous classes.

c. Scatter Plots
A scatterplot is a graphical way to display the relationship between two quantitative sample variables. It consists of an X axis, a Y axis and a series of dots, where each dot represents one observation from a data set. The position of the dot refers to its X and Y values.
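A minimal sketch of both plots on made-up data:

```r
set.seed(2)
ages <- round(rnorm(100, mean = 35, sd = 8))
hist(ages, main = "Distribution of ages", xlab = "Age (years)")

height <- rnorm(50, mean = 170, sd = 10)
weight <- 0.9 * height - 90 + rnorm(50, sd = 5)
plot(height, weight, xlab = "Height (cm)", ylab = "Weight (kg)",
     main = "Height vs weight")
```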

Multidimensional Data Visualisation


The multidimensional data model is a method for organising data in the database so that its contents are well arranged and assembled. Let us take the example of the data of a factory which sells products per quarter in Bangalore. The data is represented in the table given below:

Let us consider the data according to item, time and location (like Kolkata,
Delhi, Mumbai). Here is the table :

This data can be represented in the form of three dimensions conceptually,


which is shown in the image below :

UNIT II- ASSOCIATION RULES MINING

Association rule mining is a technique used in machine learning to


discover interesting patterns in large datasets. These patterns are
expressed in the form of association rules, which represent relationships
between different items or attributes in the dataset. This is called
Association Rules Mining (ARM) or Market Basket Analysis. The
applications of ARM include marketing, customer segmentation,
medicine, electronic commerce, classification, clustering, web mining,
bioinformatics and finance.

BASICS

A shop sells only a small variety of products:

 Bread
 Cheese
 Juice
 Milk
 Tea
 Biscuits
 Newspaper
 Sugar
 Coffee

The shopkeeper has an electronic register that keeps records of what


each customer purchases. Each row in the table gives the set of items
one customer bought. The shopkeeper wants to find which products are
sold together frequently.

Terminology

 The number of items the shop stocks is n, i.e. n = 9. These items are represented by the vector I = (i1, i2, …, in).
 There are N transactions, here N = 10. We represent them by the vector (or tuple) T = (t1, t2, …, tN).
 Each transaction contains m items (i1, i2, …, im) with m < n; transactions differ in the number of items.
 Association rules are written as X → Y, meaning that whenever X appears Y also tends to appear. X and Y may be single items or sets of items; X is referred to as the antecedent and Y as the consequent.
 The support of an itemset is the number of transactions in which the itemset appears, divided by the total number of transactions.

In general, the support of an itemset can be calculated using the
following formula:
Support(X) = (Number of transactions containing X) / (Total number of
transactions)

For example, suppose we have a dataset of 1000 transactions, and


the itemset {milk, bread} appears in 100 of those transactions. The
support of the itemset {milk, bread} would be calculated as follows:

Support({milk, bread}) = Number of transactions containing


{milk, bread} / Total number of
transactions
= 100 / 1000
= 10%

 Confidence is a measure of the likelihood that an itemset will


appear if another itemset appears. For example, suppose we have
a dataset of 1000 transactions, and the itemset {milk, bread}
appears in 100 of those transactions. The itemset {milk} appears
in 200 of those transactions. The confidence of the rule “If a
customer buys milk, they will also buy bread” would be calculated
as follows:
Confidence("If a customer buys milk, they will also buy
bread")
= Number of transactions containing
{milk, bread} / Number of transactions containing {milk}
= 100 / 200
= 50%

So the confidence of the rule “If a customer buys milk, they will
also buy bread” is 50%. This means that in 50% of the
transactions where milk was purchased, bread was also
purchased.
In general, the confidence of a rule can be calculated using the
following formula:
Confidence(X => Y) = (Number of transactions containing X and Y)
/ (Number of transactions containing X)

 Association rules are typically represented as a set of items, such
as {A, B, C}, where A, B, and C are different items in the dataset.
The goal of association rule mining is to identify rules that have a
high level of support and confidence. Support is defined as the
number of times a particular itemset appears in the dataset, while
confidence measures the probability of observing item B given that
item A is already observed.
 Lift, on the other hand, measures the degree to which two items are
dependent on each other, compared to what would be expected if
they were independent. It is calculated as the ratio of the observed
support for both items to the expected support if they were
independent. A lift value of 1 indicates that the two items are
independent, a value greater than 1 indicates a positive association,
while a value less than 1 indicates a negative association.
 For example, if we have a dataset containing information about
purchases made by customers, we can use association rule mining
to identify which items are commonly purchased together. Suppose
that we find an association rule {beer, chips} -> {salsa} with a lift
value of 2. This means that customers who buy beer and chips are
twice as likely to also buy salsa, compared to what would be
expected if beer, chips, and salsa were purchased independently.
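The sketch below computes support, confidence and lift for the rule {milk} → {bread} over a small made-up list of transactions (not the shop data used later in this unit):

```r
transactions <- list(
  c("milk", "bread", "tea"),
  c("milk", "bread"),
  c("milk", "juice"),
  c("bread", "cheese"),
  c("milk", "bread", "cheese")
)
n <- length(transactions)

# TRUE for each transaction that contains all of the given items.
contains <- function(items) sapply(transactions, function(t) all(items %in% t))

supp_both  <- sum(contains(c("milk", "bread"))) / n   # support of {milk, bread}
supp_milk  <- sum(contains("milk")) / n
supp_bread <- sum(contains("bread")) / n

confidence <- supp_both / supp_milk                   # P(bread | milk)
lift       <- supp_both / (supp_milk * supp_bread)

c(support = supp_both, confidence = confidence, lift = lift)
```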

NAÏVE ALGORITHM

Consider a naïve brute-force algorithm for the task. Suppose the shopkeeper has only 4 items for sale, and each row of the list of transactions shows the purchases of one customer.

The four items, all combinations of the four items, and their frequencies of occurrence in the database are given below:

For this problem we require a minimum support of 50% and a minimum confidence of 75%. Since the minimum support is 50%, we must find the itemsets that occur in at least two transactions. All four 1-itemsets (bread, cheese, juice, milk) are frequent. The frequency goes down as we look at 2-itemsets (of the six possible pairs only a few are frequent), 3-itemsets (none are frequent) and 4-itemsets (none are frequent). The frequent itemsets are given in the table below:

We now determine whether the two 2-itemsets (bread, cheese) and (cheese, juice) lead to association rules with the required confidence of 75%. Every 2-itemset (A, B) can lead to two rules, A → B and B → A, if both satisfy the required confidence. The four possible rules and their confidence are as follows:

The last rule has confidence above the minimum 75% required and qualifies. Rules that have more than the user-specified minimum confidence are called confident.

IMPROVED NAÏVE ALGORITHM

Here we look at each transaction and count only the combinations that actually occur (we do not consider itemsets with zero frequency).

This works better since the list of item combinations is significantly reduced from 15 to 11, and this reduction is likely to be much larger for bigger problems.

THE APRIORI ALGORITHM

An improved algorithm for finding association rules is the Apriori algorithm. This algorithm consists of two parts:

1. In the first part, the itemsets that exceed the minimum support requirement are found.

2. In the second part, association rules that meet the minimum confidence requirement are derived from the frequent itemsets.

Steps in Apriori Algorithm.

1. Generate a list of frequent 1-itemsets: Scan the entire dataset


and find all frequent items that have support above p%.

2. Generate candidate item sets: the algorithm generates a list of
candidate item sets of length k+1 from the frequent k-item sets
identified in the previous step.

3. Count the support of each candidate item set: Scan the dataset again to count the number of transactions in which each candidate item set appears.

4. Prune the candidate item sets: Remove the item sets that do not
meet the minimum support.

5. Repeat the steps 2 to 4 until no more frequent sets can be


generated.

6. Generate Association Rules: Once the frequent item sets have


been identified, the algorithm generates association rules from
them. Association rules are rules of form A->B, where A and B
are in item sets. The rule indicates that if a transaction contains A,
it is also likely to contain B.

7. Evaluate the Association Rules: Finally association rules are


evaluated based on metrics such as confidence and lift.
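A sketch of these steps using the arules package (assuming the package is installed; the transactions below are made up, and the 50% support / 75% confidence thresholds mirror the example that follows):

```r
library(arules)

baskets <- list(
  c("bread", "cheese", "juice"),
  c("bread", "juice", "milk"),
  c("cheese", "juice"),
  c("bread", "milk")
)
trans <- as(baskets, "transactions")

# Run Apriori with minimum support 50% and minimum confidence 75%.
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.75, minlen = 2))
inspect(sort(rules, by = "confidence"))
```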

A simple Apriori Example:

The example considers only five transactions and six items. We find association rules with 50% support and 75% confidence. The transactions are given in the table below.

Now we find the frequent 1-itemsets, L1; the frequent items are shown below.

The candidate 2-itemsets or C2 therefore has six pairs. These pairs and
their frequencies are given below:

We have only two frequent item pairs that are {bread, juice} and
{cheese, juice}. We do not obtain a candidate 3-itemset since we do not
have two 2-itemsets that have the same first item. The two frequent 2-
itemsets above lead to the following possible rules:

The confidence of the four rules is ¾ = 75%, ¾ = 75%, 3/3 = 100% and ¾ = 75%. Since all of them have at least the minimum 75% confidence, they all qualify.

MINING FREQUENT PATTERNS WITHOUT CANDIDATE GENERATION(FP-


GROWTH)

This algorithm uses an approach different from that used by methods based on the Apriori algorithm. The major difference between frequent pattern growth (FP-growth) and other algorithms is that FP-growth does not generate candidates; it only tests, whereas the Apriori algorithm generates candidate itemsets and then tests them.

Generating a FP-tree

The algorithm works as follows:

1. Scan the transaction database once, as in the Apriori algorithm, to


find all frequent items and their support.

2. Sort the frequent items in descending order of their support.

3. Initially, start creating the FP-tree with a root “null”

4. Get the first transaction from the transaction database. Remove


all non-frequent items and list the remaining items according to
the order in the sorted frequent items.

5. Use the transaction to construct the first branch of the tree, with each node corresponding to a frequent item and showing that item's frequency, which is 1 for the first transaction.

6. Get the next transaction from the transaction database. Remove


all non-frequent items and list the remaining items according to
the order in the sorted frequent items.

7. Insert the transaction in the tree using any common prefix that
may appear. Increase the item counts.

8. Continue with step 6 until all transactions in the database are


processed.

Example FP-tree

The minimum support required is again 50% and confidence is 75%.

The frequent items sorted by their frequency are shown below.

Now remove the items that are not frequent from the transactions and order the remaining items according to their frequency.

Now start building the FP-tree. We use the symbols B, J, C, M for the items. The item header table on the left-hand side of the tree is there to help in tree traversal; each item in this list points to its first occurrence in the FP-tree. Nodes with the same item name in the tree are linked via dotted node-links. Once the FP-tree is built, the transaction database is not required.

An FP-tree consists of nodes. A node contains three fields: an item name, a count and a node link. The count gives the number of occurrences. The node link is a link to the next node in the FP-tree containing the same item name, or a null pointer if this node is the last one with this name. The FP-tree also has a header table with an entry for each frequent item and a link to the first node in the tree with the same name.

The tree is built by making a root node labelled null. A node is made for each frequent item in the first transaction and its count is set to 1. The nodes near the root of the tree are more frequent than those further down the tree. The algorithm goes through all the sorted transactions. If a path already exists in the tree, a counter is increased; otherwise a new node is inserted into the tree with a count of 1. The height of the FP-tree is always equal to the maximum number of frequent items in a transaction, excluding the root node.

Mining the FP-tree for frequent items

To find the frequent itemsets we should note that for any frequent item
a, all the frequent itemsets containing a can be obtained by following a’s
node-links, starting from a’s head in the FP-tree.

In the FP-tree we start with item M and find the following patterns:

BM(1), BJM(1), JCM(1)

No frequent itemset is discovered from these since no itemset appears


three times. Next we look at C and find the following:

BJC(2), JC(1)

Looking at J, the next frequent item in the table:

BJ(3), J(1)

There is no need to follow links from item B, and there are no other frequent itemsets. The process is represented by the conditional trees for M, C and J in the diagrams below.

The algorithm finds the conditional pattern base for each item. The
conditional pattern base of an item consists of all patterns leading to a
specific item. The conditional frequent pattern tree is constructed from
the conditional pattern base where only the frequent items are included.

PERFORMANCE EVALUATION OF ALGORITHMS:

The Apriori, CHARM and FP-growth methods have been evaluated using real-world data as well as artificial data. It is concluded that:

 The FP-growth method is better than the Apriori algorithm.

 CHARM was better than Apriori; CHARM was also better than the FP-growth method.

 The Apriori algorithm is better than the other algorithms if the required support is high.

 At low support, the number of frequent itemsets becomes large and none of the algorithms is able to handle large frequent sets gracefully.

CLASSIFICATION

Classification is a task in data mining that involves assigning a class


label to each instance in a dataset based on its features. The goal of
classification is to build a model that accurately predicts the class labels
of new instances based on their features.

DECISION TREE

 The decision tree is a tree which is used to make some decisions.

 The decision tree is a predictive model which is used for both


classification and regression.

 It is a supervised learning model. Supervised means that a labelled data set is used to train the algorithm.

 Decision tree algorithms include

 CART (Classification and Regression Trees)

 ID3 (used for classification)

 The training data set contains two features, height and weight, and the class labels are male and female. Each sample is classified as male or female.

 At each level an attribute is selected as the splitting attribute.

 The best split is chosen using a feature selection measure such as

 Information gain

 Gini index

 Entropy is a measure of impurity in the dataset. In a subset, if all the samples are in the same class then we call it a pure node. If the entropy is zero there is no impurity.

 Consider a dataset having three samples, where all three samples are female. Then it is a pure node, and the entropy of a pure node is zero.

 Consider another dataset having 4 samples, where the number of female samples equals the number of male samples. This node is called an impure node, and the entropy of such a maximally impure node is 1: if the entropy is 1 the impurity is greatest. For a two-class problem the value of entropy lies between 0 and 1.

For a two-class data set, entropy ranges from 0 to 1 (for example, when the class label is male or female).

For more than two classes, entropy ranges from 0 to log2(n), where n is the number of classes.

The entropy of a subset S of a dataset is

Entropy(S) = – Σ (i = 1 to C) pi log2(pi)

where S is the subset of the dataset, pi is the proportion of samples in S that belong to class i, and C is the number of class labels in S.

Information gain

Information gain is the measure of the reduction in entropy. If the entropy is zero, that node becomes a leaf node. At each level we have to find the feature which gives the best split; information gain is used to select the best splitting attribute when constructing the decision tree. If the entropy is reduced, we gain more information.

The information gain for a split on attribute A is

Gain(S, A) = I(S) – Σv (|Sv| / |S|) × I(Sv)

where S is the total set of samples before the split, A is the attribute being considered (for example height or weight), I(S) is the information (entropy) before the split, and the summation term is the weighted information after the split over the subsets Sv produced by each value v of A.
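A small R sketch of these two formulas; the class counts reproduce the "owns home" calculation from the worked example that follows.

```r
# Entropy (information) of a set given its class counts.
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # 0 * log2(0) is treated as 0
  -sum(p * log2(p))
}

info_before <- entropy(c(A = 3, B = 3, C = 4))   # about 1.57

# Split on "owns home": yes -> (1, 2, 2), no -> (2, 1, 2), 5 samples each.
info_after <- (5 / 10) * entropy(c(1, 2, 2)) + (5 / 10) * entropy(c(2, 1, 2))
gain <- info_before - info_after                 # about 0.05
gain
```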


We consider the below example for constructing the decision tree:

In the given dataset there are 10 samples (S = 10) and 3 classes (A, B, C). The frequencies of the classes are

A = 3, B = 3, C = 4

The information in the data, due to uncertainty regarding the risk class each person belongs to, is given by:

I = –(0.3 log2 0.3 + 0.3 log2 0.3 + 0.4 log2 0.4) ≈ 1.57
Now consider each attribute as a candidate to split the sample:

1. Attribute “owns home”

Value = Yes: there are 5 applicants who own their home, with A = 1, B = 2, C = 2.

Value = No: there are 5 applicants who do not own their home, with A = 2, B = 1, C = 2.

Compute the information for those who own their home and for those who do not. Each subset carries weight 5/10 = 0.5 and has information 1.52, so the information after splitting on “owns home” is 0.5 × 1.52 + 0.5 × 1.52 = 1.52, giving an information gain of about 1.57 – 1.52 = 0.05.

2. Attribute Married

3. Attribute Gender

4. Attribute “employed”

5. Attribute “credit rating”

In the above table the largest information gain is provided by the attribute “gender”, and that is the attribute used for the split. Now we can reduce the data by removing the attribute gender and removing class B, since all class B applicants have gender = male.

The information in this reduced data of two classes (A = 3 and C = 4 out of 7 samples), due to uncertainty regarding the class each person belongs to, is given by:

I = –(3/7) log2(3/7) – (4/7) log2(4/7) ≈ 0.99
Consider each attribute in turn as a candidate to split the sample:

1. Attribute “owns home”

Value = Yes: there are three applicants who own their home, with A = 1 and C = 2.

Value = No: there are 4 applicants who do not own their home, with A = 2 and C = 2.

2. Attribute “married”

There is no need to consider other attributes now, since no other attribute can be better. The split attribute is “married”. We now obtain the following decision tree.

SPLIT ALGORITHM BASED ON THE GINI INDEX

The Gini index is a measure of impurity in a set of data. It is calculated as one minus the sum of the squared probabilities of each class. A lower Gini index indicates a purer set of data. Impurity is a measure of how mixed up the classes are in a set of data; a more impure set of data will have a higher Gini index. The Gini index is calculated using the formula:

Gini = 1 – Σ pi²

where pi is the probability of an item belonging to a particular class.
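A one-function R sketch of this formula:

```r
# Gini index from class counts: 1 - sum(p_i^2).
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(5, 0))     # pure node: 0
gini(c(5, 5))     # 50/50 two-class split: 0.5
gini(c(3, 3, 4))  # the three-class (A, B, C) example above: 0.66
```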

Lorenz Curve

Developed by the American economist Max O. Lorenz in 1905, it is a


graphical curve representing inequality of the wealth distribution. This
curve is connected with the Gini index defined as a measure of statistical
dispersion of income distribution of the residents in a country.

OVERFITTING AND PRUNING

Overfitting is a common problem that needs to be handled while training a decision tree model. Overfitting occurs when a model fits too closely to the training data and may become less accurate when encountering new data or predicting future outcomes. In an overfit condition, a model memorizes the noise of the training data and fails to capture essential patterns.

In decision trees, in order to fit the data (even noisy data), the model keeps generating new nodes and ultimately the tree becomes too complex to interpret. The decision tree predicts well for the training data but can be inaccurate for new data. If a decision tree model is allowed to train to its full potential, it can overfit the training data.

Pruning is a technique that removes parts of the decision tree and


prevents it from growing to its full depth. Pruning removes those parts of
the decision tree that do not have the power to classify instances. Pruning
can be of two types — Pre-Pruning and Post-Pruning.

The above example highlights the differences between a pruned and an


unpruned decision tree. The unpruned tree is denser, more complex, and
has a higher variance — resulting in overfitting.

Using Gini Index

The attribute with the largest reduction in the Gini index is selected as the split attribute, so the split attribute is gender. Take gender out of consideration and take the three instances where the gender value was male; the same approach is used with the remaining data until the algorithm terminates.

DECISION TREE RULES

1. Selecting the Best Attribute: Using a metric like Gini


impurity, entropy, or information gain, the best attribute to
split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based
on the selected attribute.

3. Repeating the Process: The process is repeated recursively for
each subset, creating a new internal node or leaf node until a
stopping criterion is met.
4. Simplicity and Interpretability: Decision trees are easy to
understand and interpret. The visual representation closely
mirrors human decision-making processes.
5. Versatility: Can be used for both classification and regression
tasks.
6. No Need for Feature Scaling: Decision trees do not require
normalization or scaling of the data.
7. Handles Non-linear Relationships: Capable of capturing non-
linear relationships between features and target variables.
NAÏVE BAYES METHOD
o Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification
problems.
o It is mainly used in text classification that includes a high-
dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective
Classification algorithms which helps in building the fast machine
learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.
o Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.

The name Naïve Bayes combines the two words Naïve and Bayes, which can be described as follows:

o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape and taste, then a red, spherical and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, independently of the others.

o Bayes: It is called Bayes because it depends on the principle
of Bayes' Theorem.
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which
is used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the


observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the


evidence.

P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier can be understood with the help of


the below example:

Suppose we have a dataset of weather conditions and corresponding


target variable "Play". So using this dataset we need to decide that
whether we should play or not on a particular day according to the
weather conditions. So to solve this problem, we need to follow the
below steps:

1. Convert the given dataset into frequency tables.


2. Generate Likelihood table by finding the probabilities of given
features.
3. Now, use Bayes theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes

Frequency table for weather conditions

Weather    Yes   No    Total
Sunny      3     2     5/14 = 0.35
Rainy      2     2     4/14 = 0.29
Overcast   5     0     5/14 = 0.35
Total      10/14 = 0.71   4/14 = 0.29

Applying Bayes' theorem:

P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)

P(Sunny|Yes)= 3/10= 0.3


P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60

P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

P(Sunny|NO)= 2/4=0.5
P(No)= 0.29
P(Sunny)= 0.35

So P(No|Sunny)= 0.5*0.29/0.35 = 0.41

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
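The same calculation can be reproduced in R from the dataset above (the counts below are taken directly from the table):

```r
outlook <- c("Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
             "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast")
play    <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
             "Yes", "No", "No", "Yes", "No", "Yes", "Yes")

p_sunny_given_yes <- mean(outlook[play == "Yes"] == "Sunny")   # 3/10
p_yes             <- mean(play == "Yes")                       # 10/14
p_sunny           <- mean(outlook == "Sunny")                  # 5/14

p_sunny_given_yes * p_yes / p_sunny                            # P(Yes | Sunny), about 0.6
```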

ESTIMATING PREDICTIVE ACCURACY OF CLASSIFICATION METHODS

The confusion matrix is a matrix used to determine the performance of


the classification models for a given set of test data. It can only be
determined if the true values for test data are known.

o The matrix is divided into two dimensions, that are predicted


values and actual values along with the total number of
predictions.
o Predicted values are those values, which are predicted by the
model, and actual values are the true values for the given
observations.
o It looks like the below table:

TP= instances correctly predicted positive


TN= instances correctly predicted as negative
FP=instances incorrectly predicted as positive
FN= instances incorrectly predicted as negative.

For example, suppose a diagnostic test is applied to 100 people: 20 of them are correctly told that they have the disease (TP = 20); 5 people are told they have the disease when really they don't (FP = 5); 60 people correctly receive a negative result (TN = 60); and a further 15 people get a negative result when they do in fact have the disease (FN = 15). Accuracy is the proportion of results that are correct.

accuracy = (TP+TN) / (TP+TN+FP+FN).

In our example where we have tested 100 people, the accuracy is


(20+60) / (20+60+5+15) = 80/100 = 4/5. This means that 4 out of 5
predictions made by our test were correct.

The error rate is the proportion of our results that are incorrect. It can be
calculated using the formula: error rate = (FP+FN) / (TP+TN+FP+FN). For
our example, error rate = (5+15) / (20+60+5+15) = 20/100 = 1/5. error
rate = 1 – accuracy.

Sensitivity is the true positive rate: the proportion of those who have the disease that get a positive result, calculated as TP / (TP+FN). For our example, the sensitivity would be 20 / (20+15) = 20/35 = 4/7. In other words, 4 out of 7 people with the disease were correctly identified as being infected.

The specificity, with formula TN / (TN+FP), tells us the true negative rate
– the proportion of people that don’t have the disease and are correctly
given a negative result. For our example: specificity = 60 / (60+5) = 60/65
= 12/13. That is, 12 out of 13 of those without the disease were given a
correct result.
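These four measures can be computed directly from the counts of the example above:

```r
TP <- 20; FP <- 5; TN <- 60; FN <- 15

accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 0.80
error_rate  <- (FP + FN) / (TP + TN + FP + FN)   # 0.20
sensitivity <- TP / (TP + FN)                    # 20/35, about 0.57
specificity <- TN / (TN + FP)                    # 60/65, about 0.92

c(accuracy = accuracy, error_rate = error_rate,
  sensitivity = sensitivity, specificity = specificity)
```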

The following methods are used for estimating, and improving, the accuracy of classification methods:

Holdout Method. In this method, the data set is separated into two sets, called the training set and the test set. In the holdout method the data set is partitioned such that most of the data belongs to the training set and the remaining data belongs to the test set; if there are 20 data items present, 12 might be placed in the training set and the remaining 8 in the test set. If as many data items as possible are placed in the training set for construction of the model/classifier, the classifier's error rates and estimates would be very low and accuracy would be high, which is the sign of a good classifier/model.
Random sub-sampling method. Random sub-sampling, which is also known as Monte Carlo cross-validation, is based on randomly splitting the data into subsets, whereby the size of the subsets is defined by the user. Random sub-sampling is likely to produce better error estimates than the holdout method.

Leave-one-out method. In this method, we perform training on the whole dataset but leave out only one data point, and we iterate this for each data point. In LOOCV, the model is trained on n - 1 samples and tested on the one omitted sample, repeating this process for each data point in the dataset. An advantage of using this method is that we make use of all data points and hence the estimate has low bias.

K-fold cross-validation is a powerful technique for evaluating predictive models in data science. It involves splitting the dataset into k subsets or folds, where each fold is used as the validation set in turn while the remaining k - 1 folds are used for training. This process is repeated k times, and performance metrics such as accuracy, precision and recall are computed for each fold.
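A minimal sketch of the k-fold loop on made-up data; the "classifier" here is only a stand-in (it predicts the majority class of the training fold), so any real model-fitting call could take its place:

```r
set.seed(3)
data <- data.frame(x = rnorm(100),
                   y = factor(sample(c("A", "B"), 100, replace = TRUE)))

k <- 5
folds <- sample(rep(1:k, length.out = nrow(data)))   # random fold assignment

fold_accuracy <- sapply(1:k, function(i) {
  train <- data[folds != i, ]
  test  <- data[folds == i, ]
  majority <- names(which.max(table(train$y)))       # stand-in "model"
  mean(test$y == majority)                           # accuracy on the held-out fold
})

mean(fold_accuracy)   # cross-validated accuracy estimate
```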

The bootstrap method estimates the distribution of a statistic calculated from a sample of data by repeatedly resampling the data with replacement. It is often used to assess the variability and uncertainty of a statistic, such as the mean or standard deviation, by examining the distribution of the statistic over many bootstrapped samples of the data.

OTHER EVALUATION CRITERIA FOR CLASSIFICATION METHODS

Speed − This refers to the computational cost in generating and using


the classifier or predictor.
Robustness − It refers to the ability of classifier or predictor to make
correct predictions from given noisy data.
Scalability − Scalability refers to the ability to construct the classifier or
predictor efficiently; given large amount of data.
Interpretability − It refers to the extent to which the classifier or predictor can be understood.
Goodness of the model − For a model to be effective, it needs to fit the problem that is being solved.

CLASSIFICATION SOFTWARE
Two widely used decision tree classification systems are C4.5 and CART.
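In R, a CART-style tree is available through the rpart package (a sketch, assuming rpart is installed and using the built-in iris data); C4.5-style trees are provided by other packages such as RWeka's J48.

```r
library(rpart)

# Grow a classification tree (CART) on the iris data.
model <- rpart(Species ~ ., data = iris, method = "class")

printcp(model)                      # complexity table used to guide pruning
pruned <- prune(model, cp = 0.05)   # post-pruning at a chosen complexity value

predict(pruned, head(iris), type = "class")
```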

