New DM
UNIT I: Introduction: Data Mining Applications – Data Mining Techniques – The Future of Data Mining – Data Mining Software. Data understanding and data preparation: introduction – data collection and preprocessing – Outliers – Mining Outliers – Missing data – Types of Data – Computing Distance – Data summarising using basic statistical measurements – Displaying data graphically – Multidimensional Data Visualisation.
UNIT II: Association Rule Mining: Introduction – Basics – The Task and a Naïve Algorithm – The Apriori Algorithm – Improving the Efficiency of the Apriori Algorithm – Mining Frequent Patterns without Candidate Generation (FP-Growth) – Performance Evaluation of Algorithms.
Classification: Introduction – Decision Tree – Overfitting and Pruning – Decision Tree Rules – Naïve Bayes Method – Estimating Predictive Accuracy of Classification Methods – Improving Accuracy of Classification Methods – Other Evaluation Criteria for Classification Methods – Classification Software.
UNIT III: Cluster Analysis: Introduction – Features – Types of Data – Computing Distance – Types of Cluster Analysis Methods – Partitioned Methods – Hierarchical Methods – Density Based Methods – Quality and Validity of Cluster Analysis Methods – Cluster Analysis Software.
TEXT BOOK
1. G.K. Gupta, “Introduction to Data Mining with Case Studies”, Prentice Hall of India (Pvt) Ltd, India, 2008.
Data Understanding and Data Preparation
Introduction:
The basic requirements for a data mining exercise include:
   1. The data must be available.
   2. The data must be relevant, adequate and clean.
   3. There must be a well-defined problem.
Data Collection is the systematic process of gathering, measuring, and recording
data for research, analysis, or decision-making. It involves collecting data from
various sources, such as surveys, interviews, observations, experiments,
documents, or existing databases, to obtain relevant and reliable information.
Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step that often involves reformatting data, making corrections to data, and combining datasets to enrich the data.
Data cleaning is the process of detecting inaccurate, incomplete or unreasonable data items in a dataset and then improving data quality by correcting the detected errors and omissions.
Data required for data mining needs to be extracted from a number of databases, integrated, and perhaps cleansed and transformed. This process is called ETL (Extraction, Transformation and Loading). Errors have to be detected and removed, and missing data values may have to be dealt with.
When datasets are large, it is inevitable that there will be errors in the data because data collection techniques are never perfect. Data collection involves humans and machines, and both can be prone to errors. The variety of errors includes:
   a. Noise (data that arrives in a form that cannot be used by the program)
   b. Missing data
   c. Duplicate data
   d. Input errors
   e. Outliers (data values that are much too large or too small compared with the rest)
The level of errors in data is usually between 1% and 10%. Data mining techniques should not be used without pre-processing the data.
There is a need to preprocess data even when all the data comes from one source, and the need for pre-processing increases significantly when the data comes from multiple sources. Integrating data from multiple sources can also lead to redundant data, because the same data may be represented in different ways in different data sources. The aim of data cleaning is not only to identify errors but also to correct them or remove the erroneous data.
Data preprocessing involves the following steps:
    a. Data understanding:
It is difficult to start data preprocessing without a good understanding of the
data and the types of errors that may be present in the data. Careful study of
metadata for each data source may be required.
   b. Mapping rules (the process of matching fields from one database to another; for example, Coimbatore saved as CBE in one database and as COIM in another)
When the data has been obtained by integrating a number of data sources, it is necessary to devise mapping rules to transform data that uses different conventions. Data pre-processing software can help support this step.
   c. Verification
The single step of verification may not be sufficient in some cases because some errors are not immediately visible.
   d. Transformation
The metadata that has been collected (metadata is data about data; for example, the index of a book serves as metadata for the contents of the book) should be saved so that it can be used the next time it is required. Information about the transformations also needs to be documented and saved.
   e. Backflow
Once the errors in the source system data are identified in the above process, the errors may be reported back to staff members who manage the source systems. These staff should replace the dirty data with clean data.
Extraction, Transformation and Loading (ETL)
ETL is a process that extracts the data from different source systems, then
transforms the data (like applying calculations, concatenations, etc.) and finally
loads the data into the Data Warehouse system.
It is tempting to think that creating a data warehouse is simply a matter of extracting data from multiple sources and loading it into the database of a data warehouse. This is far from the truth: a complex ETL process is required. The ETL process demands active inputs from various stakeholders, including developers, analysts, testers and top executives, and is technically challenging.
Successful implementation of an ETL system involves resolving many issues
including the following:
There are many errors and missing values in the records extracted from the three different source systems. After the ETL process the records will look like those in Table 2.4. The table still has errors, but they cannot be corrected without additional information.
   f. Data Entry Errors
When data is being entered by humans, errors are not uncommon. Data errors arise partly because of unmotivated data entry staff: data entry positions are often poorly paid and the work of such staff is rarely recognized, so fatigue can also be an issue. Data entry errors may be reduced if high quality data entry software is used.
   g. Measurement errors
Data is often captured by instruments, since there is an increasing tendency to use sensor technology, for example for capturing car speeds or temperatures. Errors can creep in because of instrument malfunction, poor calibration or poor testing of the software used in the instrument.
   h. Filtering errors
Data collection may include filtering, smoothing and summarization, perhaps to reduce the volume of data. Each of these steps has the potential to introduce errors. Data often cannot be cleaned without human involvement; data visualization software and exploratory data mining help in understanding the data and the data sources.
OUTLIERS
Consider a scatter plot that shows data for students on a backpacking trip (each point represents a student).
Notice how two of the points don't fit the pattern very well. These points have been labeled Brad and Sharon, the names of the students they represent. Sharon could be considered an outlier because she is carrying a much heavier backpack than the pattern predicts. Brad could be considered an outlier because he is carrying a much lighter backpack than the pattern predicts.
   1. A multivariate outlier is a combination of extreme values on several variables which is inappropriate for the data. Let's say that in a class of students the average height and weight of a student are 1.7 m and 50 kg respectively. If a student in the class has a height of 4 m and a weight of 10 kg, it is considered a multivariate outlier. Multivariate outliers are n-dimensional (multi-dimensional) outliers. From the scatter plot below we can clearly see that two variables (two dimensions), weight and height, are involved in the outlier. In that respect the red circled data point (10, 4) is an outlier.
   2. A univariate outlier is an extreme value on a single variable (one dimension). With respect to the rest of the temperature readings, the red circled data point (100 degrees Celsius) is an outlier.
   3. Distance-based outliers: an object may be considered a distance-based outlier if it lies further than some specified distance from most of the other objects in the data set.
MISSING DATA
For example, an employee may not have a date of birth because he/she comes from an area where birth records are not maintained properly. Perhaps he/she was a refugee and had no papers on arrival.
Missing values can be represented as nulls, as the value N/A or some other code, or as an artificial value such as 9999. Nulls are treated as missing values.
The ways to handle missing data:
     Ignore the tuple: this is done when the class label is missing. This method is not effective.
     Fill in the missing value manually: time consuming and may not be feasible for a large dataset with many missing values.
     Use the attribute mean to fill in the missing value: for example, suppose the average income of customers is Rs. 5,00,000; use this value to replace the missing income values (a minimal sketch follows this list).
     Use the most probable value to fill in the missing value: this can be determined with Bayesian formulae or decision tree induction. For example, using the other attributes of the customer in the dataset, it may be possible to predict and fill in the missing value.
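As a simple illustration of the attribute-mean approach, the following sketch (assuming Python with pandas and a hypothetical income column) fills a missing value with the column mean:

import pandas as pd

# A small hypothetical customer dataset with one missing income value (NaN).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "income": [450000, 550000, None, 500000],
})

# Replace the missing value with the attribute (column) mean.
customers["income"] = customers["income"].fillna(customers["income"].mean())
print(customers)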
TYPES OF DATA
Quantitative data
Quantitative data is information that can be counted or measured, or in other words quantified, and given a numerical value. The two main types of quantitative data are discrete data and continuous data. Height in feet, age in years, and weight in pounds are examples of quantitative data; other examples include revenue in dollars, weight in kilograms or pounds, age in months or years, distance in miles or kilometers, time in days or weeks, experiment results, website conversion rates, and website page load speed. Qualitative data, in contrast, is descriptive data that is not expressed numerically.
Binary data
A binary variable is a categorical variable that can only take one of two values, usually represented as a Boolean (True or False) or an integer (0 or 1), where 0 typically indicates that the attribute is absent and 1 indicates that it is present. Some examples of binary variables (attributes) are smoker (yes/no), employed (yes/no) and passed (true/false).
COMPUTING DISTANCE
A number of data mining techniques require the distance between two sets of data to be compared. There are a number of methods for computing distance in a multidimensional environment.
Properties of distance:
     Distance is always positive.
     Distance from point x to itself is always zero.
     Distance from point x to point y cannot be greater than the sum of the distance from x to some intermediate point z and the distance from z to y (the triangle inequality).
     Distance from x to y is always the same as the distance from y to x.
a) Euclidean Distance
The Euclidean distance between two points x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) is
D(x, y) = sqrt((x1 - y1)² + (x2 - y2)² + ... + (xn - yn)²)
where n is the number of dimensions (attributes). It measures how far apart the two objects are. The two objects are deemed to be similar if the distance between them is small, and vice versa.
DATA SUMMARISING USING BASIC STATISTICAL MEASUREMENTS
The mean is the arithmetic average of the data set: the sum of all the values divided by the number of values.
Median is the value that lies in the center of the data set. To calculate the median, we first sort the data in ascending order and then choose the middle value. If the number of entries is odd, then the median is simply the center value. If this number is even, we take the average of the two central values to get the median.
Mode represents the most frequent value of a data set. If no values are repeated in the data, then there is no mode. For example, in the data set 5, 5, 7, 8, 9, 1, 2, 5, 8 the mode is 5, since it occurs three times.
A variance is the average of the squared differences from the mean. To figure out the variance, calculate the difference between each point within the data set and the mean, then square and average the results. The population variance is
σ² = Σ (xi - x̄)² / n
For a sample, the sum of squared differences is divided by n - 1 instead of n.
For example, if a group of numbers ranges from one to 10, the mean is 5.5. If you square the differences between each number and the mean and find their sum, the result is 82.5. To figure out the sample variance:
         Divide the sum, 82.5, by n - 1, which is the sample size (in this case 10) minus 1.
         The result is a variance of 82.5/9 ≈ 9.17 (see the sketch below).
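The same summary measures can be computed with Python's standard statistics module; a minimal sketch using the small data set from the mode example above:

import statistics

data = [5, 5, 7, 8, 9, 1, 2, 5, 8]

print(statistics.mean(data))      # arithmetic mean (about 5.56)
print(statistics.median(data))    # middle value of the sorted data (5)
print(statistics.mode(data))      # most frequent value (5)
print(statistics.variance(data))  # sample variance (divides by n - 1)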
Covariance and correlation
Covariance measures how two variables vary together; correlation is the covariance scaled by the standard deviations of the two variables, so that it always lies between -1 and +1.
Quantiles
The word “quantile” comes from the word quantity. In simple terms, a quantile is a point at which a sample is divided into equal-sized, adjacent subgroups (that is why it is sometimes called a “fractile”). It can also refer to dividing a probability distribution into areas of equal probability.
The median is a quantile; the median is placed in a probability distribution so that exactly half of the data is lower than the median and half of the data is above the median. The median cuts a distribution into two equal areas and so it is sometimes called the 2-quantile.
Range of Values
The Range is the difference between the lowest and highest values.
DISPLAYING DATA GRAPHICALLY
a. Box Plot
A box plot is a type of chart that depicts a group of numerical data through their quartiles. It is a simple way to visualize the shape of our data and makes comparing characteristics of data between categories very easy.
The area inside the box (50% of the data) is known as the Inter Quartile
Range. The IQR is calculated as –
IQR = Q3-Q1
Outliers are the data points below and above the lower and upper limit. The
lower and upper limit is calculated as –
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
The values below and above these limits are considered outliers and the
minimum and maximum values are calculated from the points which lie under
the lower and upper limit.
How to create a box plot
Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches –
100,120,110,150,110,140,130,170,120,220,140,110.
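A minimal sketch that computes the quartiles, IQR and outlier limits for the cricket scores above (exact quartile values can differ slightly depending on the quartile convention used):

import statistics

runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]

q1, q2, q3 = statistics.quantiles(runs, n=4)   # quartiles
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print(q1, q2, q3, iqr)
print([r for r in runs if r < lower_limit or r > upper_limit])   # outliers, e.g. 220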
b. Histogram
A histogram is a graph that shows the frequency of numerical data using rectangles. The height of a rectangle (the vertical axis) represents the distribution frequency of a variable (the amount, or how often that variable appears). The width of the rectangle (horizontal axis) represents the value of the variable (for instance, minutes, years, or ages).
A histogram is a graphical representation of a grouped frequency distribution with continuous classes.
c. Scatter Plots
A scatterplot is a graphical way to display the relationship between two quantitative sample variables. It consists of an X axis, a Y axis and a series of dots, where each dot represents one observation from a data set. The position of the dot refers to its X and Y values.
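A minimal sketch of both charts (assuming matplotlib is installed; the data values are made up for illustration):

import matplotlib.pyplot as plt

ages    = [21, 22, 22, 23, 25, 26, 26, 27, 30, 31, 35, 40]       # hypothetical sample
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 178, 180, 182]
weights = [45, 50, 52, 55, 58, 60, 63, 65, 70, 72, 78, 85]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ages, bins=5)                   # histogram: frequency of a numerical variable
ax1.set(title="Histogram", xlabel="Age", ylabel="Frequency")
ax2.scatter(heights, weights)            # scatter plot: one dot per observation
ax2.set(title="Scatter plot", xlabel="Height (cm)", ylabel="Weight (kg)")
plt.show()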
MULTIDIMENSIONAL DATA VISUALISATION
Let us consider the data according to item, time and location (like Kolkata, Delhi, Mumbai). Here is the table:
ASSOCIATION RULE MINING
BASICS
A shop sells only a small variety of products:
     Bread
     Cheese
     Juice
     Milk
     Tea
     Biscuits
     Newspaper
     Sugar
     Coffee
Terminology
    The number of items the shop stocks is n, i.e. n = 9. These items are represented by the vector I = (i1, i2, ..., in).
    There are N transactions, here N = 10. We represent them as the vector (or tuple) T = (t1, t2, ..., tN).
    Each transaction contains m items (i1, i2, ..., im) with m ≤ n; transactions differ in the number of items they contain.
    Association rules are written as X → Y, meaning that whenever X appears, Y also tends to appear. X and Y may be single items or sets of items. X is referred to as the antecedent and Y as the consequent.
    The support of an itemset is the number of transactions in which the itemset appears, divided by the total number of transactions.
     In general, the support of an itemset can be calculated using the
     following formula:
     Support(X) = (Number of transactions containing X) / (Total number of
     transactions)
     So the confidence of the rule “If a customer buys milk, they will
     also buy bread” is 50%. This means that in 50% of the
     transactions where milk was purchased, bread was also
     purchased.
     In general, the confidence of a rule can be calculated using the
     following formula:
     Confidence(X => Y) = (Number of transactions containing X and Y)
     / (Number of transactions containing X)
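A minimal sketch of these two formulas over a small, made-up list of transactions:

transactions = [
    {"milk", "bread", "tea"},
    {"milk", "juice"},
    {"bread", "cheese"},
    {"milk", "bread", "sugar"},
    {"tea", "biscuits"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of the combined itemset divided by the support of the antecedent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk"}))                 # 3/5 = 0.6
print(confidence({"milk"}, {"bread"}))   # 2/3, about 0.67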
    Association rules are typically expressed in terms of itemsets, such as {A, B, C}, where A, B, and C are different items in the dataset. The goal of association rule mining is to identify rules that have a high level of support and confidence. Support is the number (or fraction) of transactions in which a particular itemset appears, while confidence measures the probability of observing item B given that item A has already been observed.
   Lift, on the other hand, measures the degree to which two items are
    dependent on each other, compared to what would be expected if
    they were independent. It is calculated as the ratio of the observed
    support for both items to the expected support if they were
    independent. A lift value of 1 indicates that the two items are
    independent, a value greater than 1 indicates a positive association,
    while a value less than 1 indicates a negative association.
   For example, if we have a dataset containing information about
    purchases made by customers, we can use association rule mining
    to identify which items are commonly purchased together. Suppose
    that we find an association rule {beer, chips} -> {salsa} with a lift
    value of 2. This means that customers who buy beer and chips are
    twice as likely to also buy salsa, compared to what would be
    expected if beer, chips, and salsa were purchased independently.
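A minimal sketch of the lift calculation for the beer/chips/salsa example (the support figures are made up to reproduce a lift of 2):

support_beer_chips = 0.10     # fraction of transactions containing {beer, chips}
support_salsa      = 0.20     # fraction of transactions containing {salsa}
support_all_three  = 0.04     # fraction containing {beer, chips, salsa}

confidence_rule = support_all_three / support_beer_chips    # P(salsa | beer, chips) = 0.4
lift_rule = confidence_rule / support_salsa                 # 0.4 / 0.2 = 2.0
print(lift_rule)   # 2.0: twice as likely as if the items were independent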
NAÏVE ALGORITHM
Consider a naïve brute force algorithm to do the task. Suppose the shopkeeper has only 4 items for sale and we have a list of transactions, each showing the purchases of one customer.
The four items, all combinations of the four items, and their frequencies of occurrence in the database are given below:
The problem requires a minimum support of 50% and a minimum confidence of 75%. Since the minimum support is 50%, we must find the itemsets that occur in at least two transactions. In the table above all four 1-itemsets (bread, cheese, juice, milk) are frequent. The frequency goes down as we look at 2-itemsets (6 are frequent), 3-itemsets (none are frequent) and 4-itemsets (none). The frequent itemsets are given in the table below:
The last rule has confidence above the minimum 75% required and qualifies. Rules whose confidence exceeds the user-specified minimum confidence are called confident.
IMPROVED NAÏVE ALGORITHM
Here we look at each transaction and count only the combinations that actually occur (itemsets with zero frequency are not considered).
This works better since the list of item combinations is significantly reduced from 15 to 11, and this reduction is likely to be much larger for bigger problems.
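A minimal sketch of this improved counting in Python, using a few made-up transactions; only combinations that occur in some transaction ever get a counter:

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "cheese", "juice"},
    {"bread", "milk"},
    {"cheese", "juice"},
    {"bread", "cheese", "juice", "milk"},
]

counts = Counter()
for t in transactions:
    items = sorted(t)
    # Count every combination present in this transaction (zero-frequency
    # combinations are never generated).
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            counts[itemset] += 1

min_count = 0.5 * len(transactions)   # 50% minimum support
print({s: c for s, c in counts.items() if c >= min_count})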
THE APRIORI ALGORITHM
   1. Find frequent 1-itemsets: scan the dataset and count the support of each individual item; the items that meet the minimum support form the frequent 1-itemsets.
   2. Generate candidate itemsets: the algorithm generates a list of candidate itemsets of length k+1 from the frequent k-itemsets identified in the previous step.
   3. Count the support of each candidate itemset: scan the dataset again to count the number of transactions in which each candidate itemset appears.
   4. Prune the candidate itemsets: remove the itemsets that do not meet the minimum support.
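Putting these steps together, a minimal sketch of the level-wise search (with simplified candidate generation, made-up transactions and a 50% minimum support):

transactions = [
    {"bread", "cheese", "juice"},
    {"bread", "juice"},
    {"cheese", "juice"},
    {"bread", "milk"},
    {"bread", "cheese", "juice", "milk"},
]
min_count = 0.5 * len(transactions)

def keep_frequent(candidates):
    # Steps 3 and 4: count the support of each candidate and prune those below the minimum.
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_count}

# Step 1: frequent 1-itemsets.
level = keep_frequent({frozenset([i]) for t in transactions for i in t})
all_frequent = set(level)

# Step 2 (repeated): generate candidates of size k+1 from the frequent k-itemsets.
k = 1
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    level = keep_frequent(candidates)
    all_frequent |= level
    k += 1

print(sorted(tuple(sorted(s)) for s in all_frequent))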
The example considers only five transactions and six items. We find association rules with 50% support and 75% confidence. The transactions are given in the table below.
First we find the frequent 1-itemsets, L1; the frequent items are shown below.
The candidate 2-itemsets, C2, therefore contain six pairs. These pairs and their frequencies are given below:
We have only two frequent item pairs, {bread, juice} and {cheese, juice}. We do not obtain a candidate 3-itemset since we do not have two frequent 2-itemsets that share the same first item. The two frequent 2-itemsets above lead to the following possible rules:
MINING FREQUENT PATTERNS WITHOUT CANDIDATE GENERATION (FP-GROWTH)
Generating an FP-tree
   5. Use the first transaction to construct the first branch of the tree, with each node corresponding to a frequent item and showing that item's frequency, which is 1 for the first transaction.
   7. Insert each subsequent transaction into the tree using any common prefix that may appear, increasing the item counts along the shared prefix.
Example FP-tree
Now remove the items that are not frequent from the transactions and order the items in each transaction according to their frequency.
Now start building the FP-tree. We use the symbols B, J, C, M for the items. The item header table on the left-hand side of the tree helps in tree traversal: each item in this list points to its first occurrence in the FP-tree. Nodes with the same item name in the tree are linked via dotted node-links. Once the FP-tree is built, the transaction database is no longer required.
The tree is built by making a root node labelled null. A node is made for each frequent item in the first transaction and its count is set to 1. The nodes nearer the root of the tree are more frequent than those further down the tree. The algorithm goes through all the sorted transactions. If a path already exists in the tree, its counters are increased; otherwise a new node is inserted into the tree with a count of 1. The height of the FP-tree is at most the maximum number of frequent items in a transaction, excluding the root node.
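A minimal sketch of the insertion step described above (a simplified illustration only; item ordering, the header table and node-links are assumed to be handled elsewhere):

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item          # item name (None for the null root)
        self.count = 0            # number of transactions sharing this path
        self.parent = parent
        self.children = {}        # item name -> child FPNode

def insert_transaction(root, sorted_items):
    # Insert one transaction (its frequent items, sorted by frequency) into the tree.
    node = root
    for item in sorted_items:
        child = node.children.get(item)
        if child is None:                     # no common prefix left: add a new node
            child = FPNode(item, parent=node)
            node.children[item] = child
        child.count += 1                      # shared prefix: just increase the count
        node = child

root = FPNode()                               # root labelled null
for t in [["B", "J", "C"], ["B", "J"], ["B", "J", "M"]]:   # hypothetical sorted transactions
    insert_transaction(root, t)
print(root.children["B"].count)               # 3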
To find the frequent itemsets we should note that for any frequent item
a, all the frequent itemsets containing a can be obtained by following a’s
node-links, starting from a’s head in the FP-tree.
In the FP-tree we start with item M and find the patterns BM(1), BJM(1) and JCM(1); for item C we find BJC(2) and JC(1); and for item J we find BJ(3) and J(1).
There is no need to follow links from item B, and there are no other frequent itemsets. The process is represented by the conditional trees for M, C and J in the diagrams below.
The algorithm finds the conditional pattern base for each item. The
conditional pattern base of an item consists of all patterns leading to a
specific item. The conditional frequent pattern tree is constructed from
the conditional pattern base where only the frequent items are included.
PERFORMANCE EVALUATION OF ALGORITHMS:
CLASSIFICATION
DECISION TREE
    Decision tree algorithms are supervised learning methods that build a tree by repeatedly splitting the training data on the attribute that best separates the classes.
    For example, suppose the training data set contains two features, height and weight, and the class labels are male and female; each training sample is classified as male or female.
Two common measures for choosing the split attribute are:
 Information gain
 Gini index
For a two-class data set (for example, with the class labels male and female), entropy ranges from 0 to 1.
Entropy(S) = - Σ pi log2(pi)
where S is a subset of the dataset and pi is the proportion of samples in S belonging to class i.
Information gain
Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)
where S is the set of samples before the split, A is the attribute being considered (for example height or weight), and Sv is the subset of S for which attribute A has value v.
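A minimal sketch of both measures for a two-class label, using made-up height/gender data:

import math
from collections import Counter

def entropy(labels):
    # Entropy in bits: -sum(p_i * log2(p_i)) over the class proportions p_i.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Entropy before the split minus the weighted entropy of each subset S_v.
    total = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

genders = ["male", "male", "female", "female", "female", "male"]
heights = ["tall", "tall", "short", "short", "tall", "tall"]   # discretized attribute
print(entropy(genders))                    # 1.0 for a balanced two-class set
print(information_gain(genders, heights))  # about 0.46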
We consider the below example for constructing the decision tree:
Now consider each attribute as a candidate to split the sample, computing the information gain for each in turn:
   1. Attribute “owns home”
   2. Attribute “married”
   3. Attribute “gender”
   4. Attribute “employed”
(For example, for the attribute “owns home” with value = no, there are 4 applicants who do not own their home.) The attribute with the highest information gain is selected as the split attribute, and the same calculation is then repeated on each resulting subset.
SPLIT ALGORITHM BASED ON THE GINI INDEX
The Gini index of a set S is Gini(S) = 1 - Σ pi², where pi is the proportion of samples in S that belong to class i. The attribute that gives the largest reduction in the Gini index is chosen as the split attribute.
Lorenz Curve
Overfitting is a common problem that needs to be handled while training
a decision tree model. Overfitting occurs when a model fits too closely
to the training data and may become less accurate when encountering
new data or predicting future outcomes. In an overfit condition, a model
memorizes the noise of the training data and fails to capture essential
patterns.
In decision trees, in order to fit the data (even noisy data), the model keeps generating new nodes and ultimately the tree becomes too complex to interpret. The decision tree predicts well for the training data but can be inaccurate for new data. If a decision tree model is allowed to train to its full potential, it can overfit the training data.
Using Gini Index
The attribute with the largest reduction in the Gini index is selected as the split attribute. In this example the split attribute is gender. Gender is then taken out of consideration, and the three instances where the gender value was male are examined; the same approach is used with the remaining data until the algorithm terminates.
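A minimal sketch of the Gini calculation for a split (the class labels and attribute values are made up for illustration):

from collections import Counter

def gini(labels):
    # Gini index: 1 minus the sum of squared class proportions.
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_after_split(labels, attribute_values):
    # Weighted Gini index of the subsets produced by splitting on the attribute.
    total = len(labels)
    result = 0.0
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        result += (len(subset) / total) * gini(subset)
    return result

classes = ["approve", "approve", "reject", "approve", "reject", "reject"]
genders = ["male", "male", "male", "female", "female", "female"]
print(gini(classes))                        # Gini before the split: 0.5
print(gini_after_split(classes, genders))   # about 0.44; the reduction is 0.5 - 0.44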
  3. Repeating the Process: The process is repeated recursively for
     each subset, creating a new internal node or leaf node until a
     stopping criterion is met
  4. Simplicity and Interpretability: Decision trees are easy to
     understand and interpret. The visual representation closely
     mirrors human decision-making processes.
  5. Versatility: Can be used for both classification and regression
     tasks.
  6. No Need for Feature Scaling: Decision trees do not require
     normalization or scaling of the data.
  7. Handles Non-linear Relationships: Capable of capturing non-
     linear relationships between features and target variables.
NAÏVE BAYES METHOD
  o Naïve Bayes algorithm is a supervised learning algorithm, which is
     based on Bayes theorem and used for solving classification
     problems.
  o   It is mainly used in text classification that includes a high-
      dimensional training dataset.
   o   Naïve Bayes Classifier is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions.
  o   It is a probabilistic classifier, which means it predicts on the basis
      of the probability of an object.
  o   Some popular examples of Naïve Bayes Algorithm are spam
      filtration, Sentimental analysis, and classifying articles.
The name Naïve Bayes comes from the two words Naïve and Bayes, which can be described as follows:
   o   Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
   o   Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
   o   Bayes' theorem is also known as Bayes' Rule or Bayes' law, and is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
   o   The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the posterior probability of hypothesis A given the observed event B, P(B|A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the probability of the evidence B.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:
      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes
Weather     Yes           No            P(Weather)
Sunny       3             2             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Overcast    5             0             5/14 = 0.35
Total       10/14 = 0.71  4/14 = 0.29
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Yes) = 10/14 = 0.71
P(Sunny) = 5/14 = 0.35
So P(Yes|Sunny) = 0.3*0.71/0.35 = 0.60
P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
Since P(Yes|Sunny) > P(No|Sunny), the player should play on a sunny day.
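The same calculation can be checked in code; a minimal sketch using the dataset above (the value 0.41 in the hand calculation comes from rounding 4/14 and 5/14):

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play    = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
           "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

def posterior(weather, label):
    # P(label | weather) = P(weather | label) * P(label) / P(weather)
    n = len(play)
    p_label = play.count(label) / n                              # prior
    p_weather = outlook.count(weather) / n                       # evidence
    p_weather_given_label = sum(
        o == weather and p == label for o, p in zip(outlook, play)
    ) / play.count(label)                                        # likelihood
    return p_weather_given_label * p_label / p_weather

print(posterior("Sunny", "Yes"))   # 0.60
print(posterior("Sunny", "No"))    # 0.40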
ESTIMATING THE PREDICTIVE ACCURACY OF CLASSIFICATION METHODS
Suppose a test for a disease is applied to 100 people, giving 20 true positives (TP), 5 false positives (FP), 60 true negatives (TN) and 15 false negatives (FN). This means that, out of 100 people, 20 of them are correctly told that they have the disease; 5 people are told they have the disease when really they don't; 60 people correctly receive a negative result; and a further 15 people get a negative result when they do in fact have the disease.
Accuracy is the proportion of results that are correct: accuracy = (TP+TN) / (TP+TN+FP+FN) = (20+60)/100 = 4/5.
The error rate is the proportion of our results that are incorrect. It can be calculated using the formula: error rate = (FP+FN) / (TP+TN+FP+FN). For our example, error rate = (5+15) / (20+60+5+15) = 20/100 = 1/5. Note that error rate = 1 - accuracy.
Sensitivity is the true positive rate: the proportion of those who have the disease that get a positive result, TP / (TP+FN). For our example, the sensitivity would be 20 / (20+15) = 20/35 = 4/7. In other words, 4 out of 7 people with the disease were correctly identified as being infected.
The specificity, with formula TN / (TN+FP), tells us the true negative rate
– the proportion of people that don’t have the disease and are correctly
given a negative result. For our example: specificity = 60 / (60+5) = 60/65
= 12/13. That is, 12 out of 13 of those without the disease were given a
correct result.
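A minimal sketch that computes all four measures from the counts in this example:

tp, fp, tn, fn = 20, 5, 60, 15

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # 0.80
error_rate  = (fp + fn) / (tp + tn + fp + fn)   # 0.20
sensitivity = tp / (tp + fn)                    # 4/7, about 0.57 (true positive rate)
specificity = tn / (tn + fp)                    # 12/13, about 0.92 (true negative rate)

print(accuracy, error_rate, sensitivity, specificity)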
Holdout Method
In this method, the data set is separated into two sets, called the training set and the test set. The data set is partitioned such that most of the data belongs to the training set and the remaining data belongs to the test set; for example, if there are 20 data items, 12 may be placed in the training set and the remaining 8 in the test set. If the maximum possible number of data items is placed in the training set for construction of the model/classifier, the classifier's error rate estimates would be very low and its accuracy high, which is the sign of a good classifier/model.
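A minimal sketch of a holdout split (assuming scikit-learn is installed; a 60/40 split reproduces the 12/8 example above):

from sklearn.model_selection import train_test_split

data   = list(range(20))                     # 20 hypothetical data items
labels = [i % 2 for i in data]               # hypothetical class labels

train_x, test_x, train_y, test_y = train_test_split(
    data, labels, test_size=0.4, random_state=42)
print(len(train_x), len(test_x))             # 12 8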
Random sub-sampling method
Random sub-sampling, also known as Monte Carlo cross-validation, is based on randomly splitting the data into subsets several times, whereby the size of the subsets is defined by the user. Random sub-sampling is likely to produce better error estimates than the holdout method.
Goodness of the model: for a model to be effective, it needs to fit the problem that is being solved.
CLASSIFICATION SOFTWARE
C4.5 and CART