ML Unit-2
(6CS4-02)
Unit-2 Notes
Text Book:
Machine Learning - Tom M. Mitchell
K-Means Clustering
The term 'K' is a number: you need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is a way of finding out the best or optimum value of K for a given data set.
For a better understanding of k-means, let's take an example from cricket. Imagine you received
data on a lot of cricket players from all over the world, which gives information on the runs scored and the wickets taken by each player in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsmen and bowlers.
Solution:
Here, we have our data set plotted on the 'x' and 'y' coordinates. The information on the y-axis
is about the runs scored, and on the x-axis about the wickets taken by the players.
The first step in k-means clustering is the allocation of two centroids randomly (as K=2). Two
points are assigned as centroids. Note that the points can be anywhere, as they are random
points. They are called centroids, but initially, they are not the central point of a given data set.
The next step is to determine the distance of each data point from the randomly assigned centroids. For every point, the distance is measured from both the centroids, and
whichever distance is less, that point is assigned to that centroid. You can see the data points
attached to the centroids and represented here in blue and yellow.
The next step is to determine the actual centroid for these two clusters. The original randomly
allocated centroid is to be repositioned to the actual centroid of the clusters.
This process of calculating the distance and repositioning the centroid continues until we obtain
our final cluster. Then the centroid repositioning stops.
As seen above, the centroid doesn't need any more repositioning, which means the algorithm has converged, and we have the two clusters with their centroids.
Applications of K-Means Clustering
K-means clustering is used in a variety of real-world applications, such as:
• Academic performance
• Diagnostic systems
• Search engines
• Wireless sensor networks
Academic Performance
Diagnostic systems
The medical profession uses k-means in creating smarter medical decision support systems,
especially in the treatment of liver ailments.
Search engines
Clustering forms a backbone of search engines. When a search is performed, the search results
need to be grouped, and the search engines very often use clustering to do this.
Wireless sensor networks
The clustering algorithm plays the role of finding the cluster heads, which collect all the data in their respective clusters.
Distance Measure
Distance measure determines the similarity between two elements and influences the shape of
clusters.
Euclidean Distance
The most common case is determining the distance between two points. If we have a point P and a point Q, the Euclidean distance is the length of the ordinary straight line between them; it is the distance between the two points in Euclidean space.
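In coordinates, for P = (x_1, y_1) and Q = (x_2, y_2), this distance is

d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}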
Squared Euclidean Distance
This is identical to the Euclidean distance measure but does not take the square root at the end. The formula is shown below:
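A standard form, obtained by simply dropping the square root from the Euclidean formula, is

d^2(P, Q) = (x_2 - x_1)^2 + (y_2 - y_1)^2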
Manhattan Distance
The Manhattan distance is the simple sum of the horizontal and vertical components, or the distance between two points measured along axes at right angles.
Note that we are taking the absolute value so that the negative values don't come into play.
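A standard form for the same two points is

d(P, Q) = |x_2 - x_1| + |y_2 - y_1|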
Cosine Distance
In this case, we take the angle between the two vectors formed by joining the points to the origin. The formula is shown below:
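A standard form for the cosine similarity between vectors P and Q is

\cos(\theta) = \frac{P \cdot Q}{\lVert P \rVert \, \lVert Q \rVert}

and the corresponding cosine distance is 1 - \cos(\theta).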
How Does K-Means Clustering Work?
The flowchart below shows how k-means clustering works:
The goal of the K-Means algorithm is to find clusters in the given input data. There are a couple
of ways to accomplish this. We can use the trial and error method by specifying the value of K
(e.g., 3,4, 5). As we progress, we keep changing the value until we get the best clusters.
Another method is to use the Elbow technique to determine the value of K. Once we get the
value of K, the system will assign that many centroids randomly and measure the distance of
each of the data points from these centroids. Accordingly, it assigns those points to the
corresponding centroid from which the distance is minimum. So each data point will be assigned to the centroid closest to it, and thereby we have K initial clusters.
For the newly formed clusters, it calculates the new centroid position. The position of the
centroid moves compared to the randomly allocated one.
Once again, the distance of each point is measured from this new centroid point. If required,
the data points are relocated to the new centroids, and the mean position or the new centroid is
calculated once again.
If the centroid moves, the iteration continues indicating no convergence. But once the centroid
stops moving (which means that the clustering process has converged), it will reflect the result.
Step 1:
The Elbow method is the best way to find the number of clusters. The elbow method consists of running K-Means clustering on the dataset for a range of values of K.
Next, we use the within-cluster sum of squares as a measure to find the optimum number of clusters that can be formed for a given data set. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid.
The WSS is measured for each value of K. The value of K at which the WSS curve bends sharply and stops decreasing significantly (the 'elbow' of the curve) is taken as the optimum value.
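As an illustration, here is a minimal Python sketch of this idea using scikit-learn; the synthetic data and the range of K values are assumptions made for the example, not part of the original notes.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for the delivery points in the example below
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Compute the within-cluster sum of squares (WSS, exposed as inertia_) for each K
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss.append(km.inertia_)  # sum of squared distances to the closest centroid

for k, w in zip(range(1, 11), wss):
    print(f"K={k}: WSS={w:.1f}")
# Plotting WSS against K and looking for the 'elbow' suggests the optimum K.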
Step 2:
Let's assume that these are our delivery points:
We can randomly initialize two points called the cluster centroids.
Step 3:
Now the distance of each location from the centroid is measured, and each data point is assigned
to the centroid, which is closest to it.
Step 4:
Compute the actual centroid of data points for the first group.
Step 5:
Compute the actual centroid of data points for the second group.
Step 7:
Step 8:
Once the cluster becomes static, the k-means algorithm is said to be converged.
Step 1: Randomly pick K points as the initial centroids.
Step 2: Assign each x(i) to the closest cluster by computing the Euclidean distance (i.e., calculating its distance to each centroid).
Step 3: Identify new centroids by taking the average of the assigned points.
Step 4: Keep repeating step 2 and step 3 until convergence is achieved.
Let's look at each of these steps in detail.
Step 1:
We randomly pick K points as centroids. We name them c1, c2, ..., ck, and we can say that C = {c1, c2, ..., ck} is the set of all centroids.
Step 2:
We assign each data point to its nearest centroid, which is accomplished by calculating the Euclidean distance.
Here, we calculate the distance of each x value from each c value, i.e. the distance between x1-
c1, x1-c2, x1-c3, and so on. Then we find which is the lowest value and assign x1 to that
particular centroid.
Step 3:
We identify the actual centroid by taking the average of all the points assigned to that cluster.
It means the original point, which we thought was the centroid, will shift to the new position,
which is the actual centroid for each of these groups.
Step 4:
We keep repeating steps 2 and 3 until the assignments no longer change, which means the algorithm has converged.
Hierarchical Clustering
Divisive clustering is a top-down approach. We begin with the whole set and proceed to divide
it into successively smaller clusters.
How K-means works:
1. Decide the number of clusters (k)
2. Select k random points from the data as centroids
3. Assign all the points to the nearest cluster centroid
4. Calculate the centroid of newly formed clusters
5. Repeat steps 3 and 4
It is an iterative process. It will keep on running until the centroids of the newly formed clusters do not change or the maximum number of iterations is reached.
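A minimal NumPy sketch of these five steps is given below; the synthetic data, the value of k, and the function name kmeans are illustrative assumptions, not part of the original notes.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: select k random points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign every point to the nearest cluster centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: calculate the centroid of each newly formed cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs as toy data
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)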
But there are certain challenges with K-means. It always tries to make clusters of the same size.
Also, we have to decide the number of clusters at the beginning of the algorithm. Ideally, we would not know how many clusters we should have at the start, and hence this is a challenge with K-means.
This is a gap hierarchical clustering bridges with aplomb. It takes away the problem of having to
pre-define the number of clusters. Sounds like a dream! So, let's see what hierarchical clustering is and how it improves on K-means.
We are essentially building a hierarchy of clusters. That's why this algorithm is called hierarchical clustering. I will discuss how to decide the number of clusters in a later section. For now, let's look at the different types of hierarchical clustering.
We are merging (or adding) the clusters at each step, right? Hence, this type of clustering is also
known as additive hierarchical clustering.
So, it doesn't matter if we have 10 or 1000 data points. All these points will belong to the same
cluster at the beginning:
Now, at each iteration, we split the farthest point in the cluster and repeat this process until each
cluster only contains a single point:
We are splitting (or dividing) the clusters at each step, hence the name divisive hierarchical
clustering.
Agglomerative Clustering is widely used in the industry and that will be the focus in this article.
Divisive hierarchical clustering will be a piece of cake once we have a handle on the agglomerative
type.
Here's one way to calculate similarity: take the distance between the centroids of these clusters.
The points having the least distance are referred to as similar points and we can merge them. We
can refer to this as a distance-based algorithm as well (since we are calculating the distances
between the clusters).
In hierarchical clustering, we have a concept called a proximity matrix. This stores the distances between each pair of points. Let's take an example to understand this matrix as well as the steps to perform hierarchical clustering.
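As an illustration, here is a minimal Python sketch of agglomerative hierarchical clustering with SciPy; the synthetic data, the choice of Ward linkage, and the cut into two clusters are assumptions made for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Two toy groups of 2-D points
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])

# Proximity matrix: pairwise distances between every pair of points
proximity = squareform(pdist(X))
print(proximity.shape)  # (40, 40)

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(X, method="ward")

# Cut the hierarchy so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)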
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. They try to find associations between different items and products that can be sold together, which assists in the right product placement. Typically, it figures out what products are being bought together, and organizations can then place those products accordingly. Let's understand this better with an example:
People who buy Bread usually buy Butter too. The Marketing teams at retail stores should
target customers who buy bread and butter and provide an offer to them so that they buy the
third item, like eggs.
So if customers buy bread and butter and see a discount or an offer on eggs, they will be
encouraged to spend more and buy the eggs. This is what market basket analysis is all about.
This is just a small example. So, if you take the data on 10,000 items from your supermarket to a data scientist, just imagine the number of insights you can get. That is why Association Rule Mining is so important.
Antecedent (IF): This is an item, or group of items, that is typically found in the itemsets or datasets.
Consequent (THEN): This is an item, or group of items, that comes along with the antecedent in a rule.
But here comes a constraint. Suppose you made a rule about an item, you still have around
9999 items to consider for rule-making. This is where the Apriori Algorithm comes into play.
So before we understand the Apriori Algorithm, let's understand the math behind it. There are 3 ways to measure association:
• Support
• Confidence
• Lift
Support: It gives the fraction of transactions which contain both item A and item B. Basically, support tells us about the frequently bought items or the combinations of items bought frequently.
So with this, we can filter out the items that have a low frequency.
Confidence: It tells us how often the items A and B occur together, given the number of times A occurs.
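The formulas for these measures, for a rule A -> B over N transactions, can be written as

\text{Support}(A \Rightarrow B) = \frac{\text{freq}(A, B)}{N}

\text{Confidence}(A \Rightarrow B) = \frac{\text{freq}(A, B)}{\text{freq}(A)}

where freq(A, B) is the number of transactions containing both A and B.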
Typically, when you work with the Apriori Algorithm, you define these thresholds yourself. But how do you decide the values? Honestly, there isn't a fixed way to define them. Suppose you've assigned a support threshold of 2%. This means that unless an item's frequency is at least 2%, you will not consider that item for the Apriori algorithm. This makes sense, as considering items that are bought very infrequently is a waste of time.
Now suppose, after filtering you still have around 5000 items left. Creating association rules
for them is a practically impossible task for anyone. This is where the concept of lift comes
into play.
Lift: Lift indicates the strength of a rule over the random occurrence of A and B. It basically tells us the strength of any rule.
Focus on the denominator: it is the product of the individual support values of A and B, not their joint support. Lift explains the strength of a rule; the greater the lift, the stronger the rule. Let's say for A -> B the lift value is 4. It means that if you buy A, the chance of buying B is 4 times higher than it would be by chance.
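The corresponding formula is

\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A, B)}{\text{Support}(A) \times \text{Support}(B)} = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)}

A lift of 1 means A and B are independent, while a lift greater than 1 means buying A makes buying B more likely.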
Let's get started with the Apriori Algorithm now and see how it works.
Apriori Algorithm
The Apriori algorithm uses frequent itemsets to generate association rules. It is based on the concept that a subset of a frequent itemset must also be a frequent itemset. A frequent itemset is an itemset whose support value is greater than a threshold value (minimum support).
Iteration 1: Let's assume the minimum support value is 2, create the itemsets of size 1, and calculate their support values.
Item 4 has a support value of 1, which is less than the minimum support value, so we discard {4} in the upcoming iterations. We have the final table F1.
Iteration 2: Next, we will create itemsets of size 2 and calculate their support values. All the combinations of the itemsets in F1 are used in this iteration.
Itemsets having support less than 2 are eliminated again; in this case, {1,2}. Now, let's understand what pruning is and how it makes Apriori one of the best algorithms for finding frequent itemsets.
Pruning: We are going to divide the itemsets in C3 into subsets and eliminate the subsets that have a support value less than 2.
Iteration 3: We will discard {1,2,3} and {1,2,5} as they both contain {1,2}. This is the main
highlight of the Apriori Algorithm.
Iteration 4: Using sets of F3 we will create C4.
Since the Support of this itemset is less than 2, we will stop here and the final itemset we will
have is F3.
Note: Till now, we haven't calculated the confidence values.
Applying Rules: We will create rules from the frequent itemsets in F3, for example from {1,3,5}, and apply them. Let's assume a minimum confidence value of 60%; only rules whose confidence is at least 60% are kept.
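Below is a minimal Python sketch of the support counting, Apriori pruning, and confidence check described above. The transaction list is an assumption (the original table is not included in these notes), chosen so that it reproduces the walkthrough: {4} has support 1, {1,2} has support 1, and {1,3,5} ends up among the frequent itemsets in F3.

from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
min_support = 2        # minimum support count, as assumed in the example
min_confidence = 0.6   # 60%, as assumed when applying the rules

def support_count(itemset):
    # Number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = {}
for size in range(1, len(items) + 1):
    level = {}
    for cand in combinations(items, size):
        # Pruning: every (size-1)-subset of a candidate must already be frequent
        if size > 1 and not all(frozenset(s) in frequent
                                for s in combinations(cand, size - 1)):
            continue
        cnt = support_count(set(cand))
        if cnt >= min_support:
            level[frozenset(cand)] = cnt
    if not level:
        break
    frequent.update(level)

# Generate rules A -> B from each frequent itemset, keeping those with confidence >= 60%
for itemset, cnt in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            confidence = cnt / frequent[antecedent]
            if confidence >= min_confidence:
                print(set(antecedent), "->", set(itemset - antecedent),
                      f"confidence = {confidence:.2f}")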
Frequent Pattern (FP) Growth Algorithm
The FP growth algorithm represents the database in the form of a tree called a frequent pattern tree, or FP tree. This tree structure maintains the association between the itemsets. The database is fragmented using one frequent item, and this fragmented part is called a "pattern fragment". The itemsets of these fragmented patterns are analyzed. Thus, with this method, the search for frequent itemsets is reduced in comparison with Apriori.
FP Tree
Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the database.
The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Frequent Pattern Algorithm Steps
The frequent pattern growth method lets us find the frequent pattern without candidate
generation.
Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:
1) The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the database
is called support count or frequency of 1-itemset.
2) The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
3) The next step is to scan the database again and examine the transactions. Examine the first transaction and find out the itemset in it. The itemset with the maximum count is taken at the top, the itemset with the next lower count follows, and so on. It means that the branch of the tree is constructed with the transaction itemsets in descending order of count.
Table 1
Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
4) The next transaction in the database is examined. Its itemsets are ordered in descending order of count. If any itemset of this transaction is already present in another branch (for example, in the first transaction), then this transaction branch shares a common prefix from the root. This means that the common itemset is linked to the new node of another itemset in this transaction.
5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the common node and the new node counts are increased by 1 as they are created and linked according to the transactions.
6) The next step is to mine the created FP Tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents the frequent pattern of length 1. From this, traverse the path in the FP Tree. This path or paths are called the conditional pattern base.
Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with
the lowest node (suffix).
7) Construct a Conditional FP Tree, which is formed from the counts of the itemsets in these paths. The itemsets meeting the threshold support are considered in the Conditional FP Tree.
8) Frequent patterns are generated from the Conditional FP Tree.
Example: Find the frequent patterns for the transactions in Table 1 above, with a support threshold of 50%.
Solution:
Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
1. Count of each item
Table 2
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Sort the itemset in descending order of count (I5 is dropped, since its count of 2 is below min_sup = 3).
Table 3
Item Count
I2 5
I1 4
I3 4
I4 4
3. Build FP Tree
5. T3: I4, I5. Similarly, a new branch with I5 linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, hence it will be incremented by 1. Similarly, I1 will be incremented by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
4. Mining of FP-tree is summarized below:
1. The lowest node item I5 is not considered as it does not have a min support count, hence
it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths will be {I2,I1,I3:1} and {I2,I3:1}. This forms the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP-tree is constructed from it. This will contain {I2:2, I3:2}; I1 is not considered as it does not meet the min support count.
4. This path will generate all combinations of frequent patterns :
{I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP-tree {I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-tree: {I2:4}
and frequent patterns are generated: {I2, I1:4}.
The diagram given below depicts the conditional FP tree associated with the conditional node I3.
Advantages of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which
scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
FP Growth vs Apriori
Pattern Generation
FP Growth: generates patterns by constructing an FP tree.
Apriori: generates patterns by pairing the items into singletons, pairs and triplets.
Candidate Generation
FP Growth: there is no candidate generation.
Apriori: uses candidate generation.
Process
FP Growth: the process is faster as compared to Apriori; the runtime increases linearly with an increase in the number of itemsets.
Apriori: the process is comparatively slower than FP Growth; the runtime increases exponentially with an increase in the number of itemsets.
Memory Usage
FP Growth: a compact version of the database is saved in memory.
Apriori: the candidate combinations are saved in memory.
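For comparison with the hand-worked example above, here is a short Python sketch using the FP-Growth implementation in the mlxtend library (assuming mlxtend and pandas are installed); the encoding of Table 1 and the 50% support threshold follow the example.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Transactions T1..T6 from Table 1
transactions = [
    ["I1", "I2", "I3"],
    ["I2", "I3", "I4"],
    ["I4", "I5"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3", "I4"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# min_support = 0.5 corresponds to min_sup = 3 out of 6 transactions
patterns = fpgrowth(df, min_support=0.5, use_colnames=True)
print(patterns.sort_values("support", ascending=False))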
Gaussian Mixture Model
Gaussian Mixture Model or Mixture of Gaussian as it is sometimes called, is not so much a
model as it is a probability distribution. It is a universally used model for generative
unsupervised learning or clustering. It is also called Expectation-Maximization Clustering or
EM Clustering and is based on the optimization strategy. Gaussian Mixture models are used
for representing Normally Distributed subpopulations within an overall population. The
advantage of mixture models is that they do not require knowing which subpopulation a data point belongs to; it allows the model to learn the subpopulations automatically. This constitutes a
form of unsupervised learning.
A Gaussian is a type of distribution, and it is a popular and mathematically convenient type of
distribution. A distribution is a listing of outcomes of an experiment and the probability
associated with each outcome. Let's take an example to understand. We have a data table that lists a cyclist's speeds and how often each speed was recorded.
Speed (km/h)   Frequency
1              4
2              9
3              6
4              7
5              3
6              2
The cyclist reaches the speed of 1 km/h four times, 2 km/h nine times, 3 km/h six times, and so on. We can notice how the frequency goes up and then goes down: it looks like it follows a kind of bell curve. The frequencies go up as the speed goes up, reach a peak value, and then go down again, and we can represent this using a bell curve, otherwise known as a Gaussian distribution.
A Gaussian distribution is a type of distribution where half of the data falls on the left of the mean and the other half falls on the right. It is a symmetric distribution, and one can see intuitively that it is very mathematically convenient.
Gaussian distribution would be a great distribution to model the data in those cases where the
data reaches a peak and then decreases. Similarly, in Multi Gaussian Distribution, we will have
multiple peaks with multiple means and multiple standard deviations.
The formula for the Gaussian distribution, in terms of the mean and the standard deviation, is called the Probability Density Function:
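For a mean \mu and standard deviation \sigma, this density is

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)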
For a given point X, we can compute the associated Y value. The Y values are the probability densities for those X values. So, for any X value, we can calculate the likelihood of that X value being a part of the curve, i.e., being a part of the dataset.
This is a function of a continuous random variable whose integral across an interval gives the
probability that the value of the variable lies within the same interval.
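As an illustration of fitting such a mixture with the EM algorithm, here is a minimal scikit-learn sketch; the synthetic one-dimensional data and the choice of two components are assumptions made for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two normally distributed subpopulations
rng = np.random.default_rng(0)
speeds = np.concatenate([rng.normal(3, 1, 200), rng.normal(8, 1.5, 200)]).reshape(-1, 1)

# Fit a mixture of two Gaussians using Expectation-Maximization
gmm = GaussianMixture(n_components=2, random_state=0).fit(speeds)

print("means:", gmm.means_.ravel())
print("standard deviations:", np.sqrt(gmm.covariances_).ravel())
print("mixture weights:", gmm.weights_)

# Soft assignment: probability of each point belonging to each subpopulation
print(gmm.predict_proba(speeds[:5]))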