ASSIGNMENT
ON
DATA MINING
Submitted by
Name: Manjula.T
Roll.no:16suca003
Class: III BCA
1. Association rule:
Association rule learning is a rule-based machine learning method for
discovering interesting relations between variables in large databases. It is intended
to identify strong rules discovered in databases using some measures of
interestingness. For example, the rule found in the sales data of a supermarket
would indicate that if a customer buys onions and potatoes together, they are likely
to also buy hamburger meat. Such information can be used as the basis for
decisions about marketing activities such as, e.g., promotional pricing or product
placements
Example database with 5 transactions and 5 items
Transaction ID   Milk   Bread   Butter   Beer   Diapers
      1            1      1       0        0       0
      2            0      0       1        0       0
      3            0      0       0        1       1
      4            1      1       1        0       0
      5            0      1       0        0       0
Following the original definition by Agrawal, Imieliński and Swami,[2] the problem of
association rule mining is defined as:
Let I = {i1, i2, ..., in} be a set of n binary attributes called items.
Let D = {t1, t2, ..., tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅.
Every rule is composed of two different sets of items, also known as itemsets, X and Y,
where X is called the antecedent or left-hand side (LHS) and Y the consequent or
right-hand side (RHS).
To illustrate the concepts, we use a small example from the supermarket domain.
The set of items is I = {milk, butter, bread, beer, diapers}, and the table above shows a
small database containing the items, where, in each entry, the value 1 means the
presence of the item in the corresponding transaction and the value 0 represents
the absence of the item in that transaction.
An example rule for the supermarket could be {butter, bread} => {milk}, meaning that
if butter and bread are bought, customers also buy milk.
Note: this example is extremely small. In practical applications, a rule needs a
support of several hundred transactions before it can be considered statistically
significant, and datasets often contain thousands or millions of transactions.
Useful Concepts
In order to select interesting rules from the set of all possible rules, constraints on
various measures of significance and interest are used. The best-known constraints
are minimum thresholds on support and confidence.
Let X be an itemset, X=>Y an association rule and T a set of transactions of a
given database.
Support
Support is an indication of how frequently the itemset appears in the dataset.
The support of X with respect to T is defined as the proportion of transactions
in the dataset which contain the itemset X:
supp(X) = |{t ∈ T : X ⊆ t}| / |T|
In the example dataset, the itemset X = {beer, diapers} has a support of 1/5 = 0.2,
since it occurs in 20% of all transactions (1 out of 5 transactions). The argument
of supp() is a set of preconditions, and thus becomes more restrictive as it grows
(instead of more inclusive).[3]
Confidence
Confidence is an indication of how often the rule has been found to be true.
The confidence value of a rule, X => Y, with respect to a set of transactions T, is
the proportion of the transactions that contain X which also contain Y.
Confidence is defined as:
conf(X => Y) = supp(X ∪ Y) / supp(X)
For example, the rule {butter,bread}=> milk has a confidence of 0.2/0.2=1 in the
database, which means that for 100% of the transactions containing butter and
bread the rule is correct (100% of the times a customer buys butter and bread, milk
is bought as well).
Lift
The lift of a rule is defined as:
lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
or the ratio of the observed support to that expected if X and Y were independent.
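To make these measures concrete, here is a minimal Python sketch that computes support, confidence, and lift over the five example transactions above; the helper function names are illustrative, not part of any library.

# A minimal sketch (illustrative helper names) computing support,
# confidence and lift on the five example transactions above.

transactions = [
    {"milk", "bread"},           # transaction 1
    {"butter"},                  # transaction 2
    {"beer", "diapers"},         # transaction 3
    {"milk", "bread", "butter"}, # transaction 4
    {"bread"},                   # transaction 5
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    """Observed support divided by the support expected if X and Y were independent."""
    return (support(set(lhs) | set(rhs), transactions)
            / (support(lhs, transactions) * support(rhs, transactions)))

print(support({"beer", "diapers"}, transactions))               # 0.2
print(confidence({"butter", "bread"}, {"milk"}, transactions))  # 1.0
print(lift({"butter", "bread"}, {"milk"}, transactions))        # 0.2 / (0.2 * 0.4) = 2.5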
Process
[Figure: frequent itemset lattice, where the color of each box indicates how many
transactions contain that combination of items.] Note that an itemset at a lower level
of the lattice can occur in at most as many transactions as any of its parents; e.g.
{a, c} can occur in at most min(supp({a}), supp({c})) of the transactions. This is
called the downward-closure property.[2]
Association rules are usually required to satisfy a user-specified minimum support
and a user-specified minimum confidence at the same time. Association rule
generation is usually split up into two separate steps:
1. A minimum support threshold is applied to find all frequent itemsets in a
database.
2. A minimum confidence constraint is applied to these frequent itemsets in
order to form rules.
While the second step is straightforward, the first step needs more attention.
Finding all frequent itemsets in a database is difficult since it involves searching all
possible itemsets (item combinations). The set of possible itemsets is the power
set over I and has size 2^n - 1 (excluding the empty set, which is not a valid itemset).
Although the size of the power set grows exponentially in the number of items n in
I, efficient search is possible using the downward-closure property of support (also
called anti-monotonicity), which guarantees that for a frequent itemset, all its
subsets are also frequent and thus no infrequent itemset can be a subset of a
frequent itemset. Exploiting this property, efficient algorithms (e.g., Apriori[8] and
Eclat) can find all frequent itemsets.
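As a rough illustration of the first step, the Python sketch below enumerates frequent itemsets level by level in the Apriori style, pruning candidates that have an infrequent subset; the min_support value and the function name are assumptions for this example, not a full Apriori implementation.

from itertools import combinations

# Simplified Apriori-style search for frequent itemsets over the example
# transactions above; a sketch of the idea, not a complete implementation.

def frequent_itemsets(transactions, min_support=0.4):
    n = len(transactions)
    supp = lambda c: sum(c <= t for t in transactions) / n
    # Level 1: frequent single items.
    current = [frozenset([i]) for i in {item for t in transactions for item in t}
               if supp(frozenset([i])) >= min_support]
    result = list(current)
    k = 2
    while current:
        # Candidate generation: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Downward-closure pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in result for s in combinations(c, k - 1))}
        current = [c for c in candidates if supp(c) >= min_support]
        result.extend(current)
        k += 1
    return result

transactions = [
    {"milk", "bread"}, {"butter"}, {"beer", "diapers"},
    {"milk", "bread", "butter"}, {"bread"},
]
print(frequent_itemsets(transactions))   # frequent single items plus {milk, bread}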
2. K-Medoids Algorithm:
Partitioning Around Medoids (PAM), or the K-medoids algorithm, is a partitional clustering algorithm which is a slight modification of the K-means algorithm. Both attempt to minimize the squared error, but the K-medoids algorithm is more robust to noise than the K-means algorithm. In the K-means algorithm the means are chosen as the centroids, but in K-medoids, actual data points are chosen to be the medoids. A medoid can be defined as that object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
The difference between k-means and k-medoids is analogous to the difference between mean and median: the mean is the average value of all data items collected, while the median is the value around which the data items are evenly distributed.
The basic idea of this algorithm is to first compute the K representative objects, called medoids. After finding the set of medoids, each object of the data set is assigned to the nearest medoid; that is, object i is put into cluster vi when medoid mvi is nearer than any other medoid mw. The algorithm proceeds in two steps:
BUILD step: sequentially select k "centrally located" objects to be used as the initial medoids.
SWAP step: if the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, the swap is carried out. This is continued until the objective function can no longer be decreased.
The algorithm is as follows:
1. Initially select k random points as the medoids from the given n data points of the data set.
2. Associate each data point with the closest medoid, using any of the most common distance metrics.
3. For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih. If TCih < 0, i is replaced by h.
4. Repeat steps 2-3 until there is no change of the medoids.
There are four situations to be considered in this process:
i. Shift-out membership: an object pi may need to be shifted from the currently considered cluster of oj to another cluster.
ii. Update the current medoid: a new medoid oc is found to replace the current medoid oj.
iii. No change: objects in the current cluster result have the same or even smaller squared-error criterion (SEC) measure for all the possible redistributions considered.
iv. Shift-in membership: an outside object pi is assigned to the current cluster with the new (replaced) medoid oc.
PAM works efficiently for small data sets but does not scale well for large data sets. The complexity of this algorithm is O(k(n-k)^2).
Example: For a given k = 2, cluster the following data set using PAM:
(3, 4), (7, 4), (7, 6), (2, 6), (3, 8), (8, 5), (4, 7), (6, 2), (7, 3), (6, 4)
Let us choose (3, 4) and (7, 4) as the medoids, and suppose we use the Manhattan distance as the distance measure. Calculating the distance of each remaining point from the chosen medoids:
For (7, 6), this point is nearest to (7, 4).
For (2, 6), this point is nearest to (3, 4).
For (3, 8), this point is nearest to (3, 4).
For (8, 5), this point is nearest to (7, 4).
For (4, 7), this point is nearest to (3, 4).
For (6, 2), this point is nearest to (7, 4).
For (7, 3), this point is nearest to (7, 4).
For (6, 4), this point is nearest to (7, 4).
So, after the clustering, the clusters formed are {(3, 4), (2, 6), (3, 8), (4, 7)} and {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}.
Now we calculate the cost, which is the sum of the distances of each non-selected point from the medoid of the cluster it belongs to:
Total cost = cost((3, 4), (2, 6)) + cost((3, 4), (3, 8)) + cost((3, 4), (4, 7)) + cost((7, 4), (6, 2)) + cost((7, 4), (6, 4)) + cost((7, 4), (7, 3)) + cost((7, 4), (8, 5)) + cost((7, 4), (7, 6))
= 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20.
Now let us choose some other point to be a medoid instead of (7, 4); we randomly choose (7, 3). The new medoid set is (3, 4) and (7, 3). Repeating the same task as earlier, we calculate the distance of each remaining point from the new medoids:
For (7, 6), this point is nearest to (7, 3).
For (2, 6), this point is nearest to (3, 4).
For (3, 8), this point is nearest to (3, 4).
For (8, 5), this point is nearest to (7, 3).
For (4, 7), this point is nearest to (3, 4).
For (6, 2), this point is nearest to (7, 3).
For (7, 4), this point is nearest to (7, 3).
For (6, 4), this point is nearest to (7, 3).
Total cost = cost((3, 4), (2, 6)) + cost((3, 4), (3, 8)) + cost((3, 4), (4, 7)) + cost((7, 3), (7, 6)) + cost((7, 3), (8, 5)) + cost((7, 3), (6, 2)) + cost((7, 3), (7, 4)) + cost((7, 3), (6, 4))
= 3 + 4 + 4 + 3 + 3 + 2 + 1 + 2 = 22.
The total cost when (7, 3) is a medoid is greater than the total cost when (7, 4) was the medoid, so the swap is rejected and (7, 4) is kept as the medoid. Since there is no change in the medoid set, the algorithm ends here. Hence the clusters obtained finally are {(3, 4), (2, 6), (3, 8), (4, 7)} and {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}.
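The cost comparison above can be checked with a short Python sketch using the same ten points, the Manhattan distance, and the two candidate medoid sets; the helper names are illustrative.

# Reproduces the worked example: Manhattan distance, ten points,
# and the total cost for each candidate medoid set.

points = [(3, 4), (7, 4), (7, 6), (2, 6), (3, 8),
          (8, 5), (4, 7), (6, 2), (7, 3), (6, 4)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum of each non-medoid point's distance to its nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

print(total_cost(points, [(3, 4), (7, 4)]))  # 20
print(total_cost(points, [(3, 4), (7, 3)]))  # 22, so the swap is rejected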
3. CLARA concept
Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed
size (sampsize) and applies the PAM algorithm described above to generate an optimal set of
medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity
between every object in the entire data set and the medoid of its cluster, defined as the cost function.
CLARA repeats the sampling and clustering processes a pre-specified number of times in order to
minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal
cost. The CLARA algorithm is summarized in the next section.
CLARA Algorithm:
Input: a database D of objects and the number of clusters k
Repeat m times:
  Draw a sample S ⊆ D randomly from D
  Call PAM(S, k) to get k medoids
  Classify the entire data set D into clusters C1, C2, ..., Ck
  Calculate the quality of the clustering as the average dissimilarity
end
The algorithm is as follows:
1. Randomly split the data set into multiple subsets with fixed size (sampsize).
2. Compute PAM algorithm on each subset and choose the corresponding k
representative objects (medoids). Assign each observation of the entire data set
to the closest medoid.
3. Calculate the mean (or the sum) of the dissimilarities of the observations to
their closest medoid. This is used as a measure of the goodness of the
clustering.
4. Retain the sub-dataset for which the mean (or sum) is minimal. A further
analysis is carried out on the final partition.
Note that each sub-data set is forced to contain the medoids obtained from the best
sub-data set until then. Randomly drawn observations are added to this set until
sampsize has been reached.
The output of CLARA includes the following components:
medoids: objects that represent the clusters
clustering: a vector containing the cluster number of each object
sample: labels or case numbers of the observations in the best sample, that is,
the sample used by the CLARA algorithm for the final partition.
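A rough Python sketch of the CLARA procedure described above, under simplifying assumptions (Manhattan distance on 2-D points, a greedy swap-based stand-in for PAM, and illustrative values for the number of samples m and sampsize); none of the names come from an existing library.

import random

# Rough CLARA sketch: draw m samples, run a simplified swap-based PAM on each
# sample, classify the entire data set, and keep the medoid set with the
# lowest average dissimilarity. Parameter values are illustrative.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def avg_dissimilarity(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points) / len(points)

def simple_pam(sample, k):
    """Greedy swap-based K-medoids on a sample (a simplified stand-in for PAM)."""
    medoids = random.sample(sample, k)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in sample:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                if avg_dissimilarity(sample, candidate) < avg_dissimilarity(sample, medoids):
                    medoids, improved = candidate, True
    return medoids

def clara(points, k, m=5, sampsize=8):
    best_medoids, best_cost = None, float("inf")
    for _ in range(m):
        sample = random.sample(points, min(sampsize, len(points)))
        medoids = simple_pam(sample, k)
        cost = avg_dissimilarity(points, medoids)  # quality measured on the FULL data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    clustering = [min(range(k), key=lambda j: manhattan(p, best_medoids[j])) for p in points]
    return best_medoids, clustering

points = [(3, 4), (7, 4), (7, 6), (2, 6), (3, 8),
          (8, 5), (4, 7), (6, 2), (7, 3), (6, 4)]
print(clara(points, k=2))

Because each PAM run only sees sampsize points, this approach scales to data sets that PAM itself cannot handle, at the price of some sampling bias.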
4. CLARANS Algorithm:
1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
2. Set current to an arbitrary node in Gn,k (each node is a set of k medoids).
3. Set j to 1.
4. Consider a random neighbor S of current, and calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S and go to Step 3.
6. Otherwise, increment j by 1. If j <= maxneighbor, go to Step 4.
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost.
If the former is less than mincost, set mincost to the cost of current and set
bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step 2.
Steps 3 to 6 above search for nodes with progressively lower costs.
But, if the current node has already been compared with the maximum number
of the neighbors of the node (specified by maxneighbor) and is still of the
lowest cost, the current node is declared to be a “local” minimum.
Then, in Step 7, the cost of this local minimum is compared with the lowest
cost obtained so far.
The lower of the two costs above is stored in mincost. Algorithm CLARANS
then repeats to search for other local minima, until numlocal of them have been
found.
As shown above, CLARANS has two parameters: the maximum number of
neighbors examined (maxneighbor) and the number of local minima obtained
(numlocal). The higher the value of maxneighbor, the closer CLARANS is to
PAM, and the longer each search for a local minimum takes. But the quality of
such a local minimum is higher and fewer local minima need to be obtained.
Like many applications of randomized search [16], [17], we rely on
experiments to determine the appropriate values of these parameters.
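Following the numbered steps above, here is a rough Python sketch of CLARANS in which a node is a set of k medoids and a random neighbor differs from the current node in exactly one medoid; numlocal, maxneighbor, and all helper names are illustrative.

import random

# Rough CLARANS sketch: randomized search over medoid sets, restarted
# numlocal times, examining at most maxneighbor random neighbors per node.

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def cost(points, medoids):
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=20):
    bestnode, mincost = None, float("inf")
    for _ in range(numlocal):                      # Steps 1-2 and 8
        current = random.sample(points, k)         # arbitrary node in Gn,k
        j = 1
        while j <= maxneighbor:                    # Steps 3-6
            # Random neighbor: replace one medoid by a random non-medoid point.
            neighbor = current[:]
            neighbor[random.randrange(k)] = random.choice(
                [p for p in points if p not in current])
            if cost(points, neighbor) < cost(points, current):
                current, j = neighbor, 1           # Step 5: move to the neighbor, reset j
            else:
                j += 1                             # Step 6
        if cost(points, current) < mincost:        # Step 7: local minimum found
            bestnode, mincost = current, cost(points, current)
    return bestnode, mincost

points = [(3, 4), (7, 4), (7, 6), (2, 6), (3, 8),
          (8, 5), (4, 7), (6, 2), (7, 3), (6, 4)]
print(clarans(points, k=2))

With maxneighbor set close to k(n - k), almost every possible swap is examined and the search behaves much like PAM, which matches the trade-off described above.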
5. Decision tree:
A decision tree is a structure that includes a root node, branches, and leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes the outcome
of a test, and each leaf node holds a class label. The topmost node in the tree is the
root node.
A decision tree for the concept buy_computer, for example, indicates
whether a customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute. Each leaf node represents a class.
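As an illustration, the following is a minimal scikit-learn sketch that fits a small decision tree on a made-up buy_computer-style data set; the feature encoding and class labels are hypothetical, chosen only to show the structure of a learned tree.

from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative sketch only: a tiny, made-up buy_computer-style data set.
# Features: [age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit_rating (0=fair, 1=excellent)]
X = [
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0],
    [2, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [2, 1, 0],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Each internal node is a test on an attribute; each leaf holds a class label.
print(export_text(clf, feature_names=["age", "student", "credit_rating"]))
print(clf.predict([[0, 1, 0]]))  # e.g. a young student with fair credit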
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and fast.
Decision trees used in data mining are of two main types:
Classification tree analysis is when the predicted outcome is the class to which
the data belongs.
Regression tree analysis is when the predicted outcome can be considered a
real number (e.g. the price of a house, or a patient's length of stay in a hospital).
The term Classification And Regression Tree (CART) analysis is an umbrella
term used to refer to both of the above procedures, first introduced by Breiman et
al. in 1984.[3] Trees used for regression and trees used for classification have some
similarities - but also some differences, such as the procedure used to determine
where to split.[3]
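A similarly minimal sketch for the regression case, using scikit-learn's DecisionTreeRegressor on made-up house data to predict a real-valued price; all values are hypothetical.

from sklearn.tree import DecisionTreeRegressor

# Illustrative regression-tree sketch: predicting a real-valued target
# (e.g. a house price) from made-up numeric features.
X = [[50, 1], [60, 1], [80, 2], [100, 2], [120, 3], [150, 3]]  # [area_m2, rooms]
y = [110_000, 125_000, 160_000, 195_000, 230_000, 280_000]     # price

reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)
print(reg.predict([[90, 2]]))  # predicted price for a 90 m^2, 2-room house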
Some techniques, often called ensemble methods, construct more than one decision
tree:
Boosted trees: incrementally building an ensemble by training each new
instance to emphasize the training instances previously mis-modeled. A typical
example is AdaBoost. These can be used for regression-type and classification-
type problems.[4][5]
Bootstrap aggregated (or bagged) decision trees: an early ensemble method that
builds multiple decision trees by repeatedly resampling training data with
replacement, and voting the trees for a consensus prediction.[6]
A random forest classifier is a specific type of bootstrap aggregating
Rotation forest – in which every decision tree is trained by first
applying principal component analysis (PCA) on a random subset of the input
features.[7]
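The three ensemble flavours listed above can be sketched with scikit-learn as follows; the data set is the same hypothetical buy_computer-style example used earlier, and the parameter values are arbitrary.

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier

# Illustrative sketch of boosting, bagging and random forests on the same
# made-up buy_computer-style data used earlier.
X = [
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [2, 0, 0], [2, 1, 0],
    [2, 1, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0], [2, 1, 0],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

for model in (AdaBoostClassifier(n_estimators=20),       # boosted trees
              BaggingClassifier(n_estimators=20),        # bagged trees
              RandomForestClassifier(n_estimators=20)):  # bagging plus random feature subsets
    model.fit(X, y)
    print(type(model).__name__, model.predict([[0, 1, 0]]))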
A special case of a decision tree is a decision list,[8] which is a one-sided decision
tree, so that every internal node has exactly 1 leaf node and exactly 1 internal node
as a child (except for the bottommost node, whose only child is a single leaf node).
While less expressive, decision lists are arguably easier to understand than general
decision trees due to their added sparsity; they also permit non-greedy learning
methods[9] and monotonic constraints to be imposed.[10]
Decision tree learning is the construction of a decision tree from class-labeled
training tuples. A decision tree is a flow-chart-like structure, where each internal
(non-leaf) node denotes a test on an attribute, each branch represents the outcome
of a test, and each leaf (or terminal) node holds a class label. The topmost node in a
tree is the root node
******************************************************************