MODULE 3
What is Machine Learning?
Machine Learning is a concept that allows a machine to learn from examples and
experience without being explicitly programmed. Instead of writing the code yourself,
you feed data to a generic algorithm, and the algorithm/machine builds the logic based
on the given data.
How does Machine Learning Work?
A Machine Learning algorithm is trained using a training data set to create a model. When new
input data is introduced to the ML algorithm, it makes a prediction on the basis of the model.
The prediction is evaluated for accuracy, and if the accuracy is acceptable, the Machine
Learning algorithm is deployed. If the accuracy is not acceptable, the Machine Learning
algorithm is trained again and again with an augmented training data set.
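The short sketch below illustrates this train/evaluate/deploy loop. It assumes scikit-learn, its built-in Iris dataset, a decision tree as the model, and an arbitrary accuracy threshold of 0.9; none of these specific choices come from the text above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# train a model on the training data set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# evaluate the predictions on new input data
accuracy = model.score(X_test, y_test)

# deploy if the accuracy is acceptable, otherwise retrain with more data
if accuracy >= 0.9:   # 0.9 is an assumed, illustrative threshold
    print("accuracy acceptable, deploy the model")
else:
    print("retrain with an augmented training data set")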
Machine learning is sub-categorized into three types:
Supervised Learning – Train Me!
Unsupervised Learning – I am self-sufficient in learning
Reinforcement Learning – My life My rules! (Hit & Trial)
What is Supervised Learning?
Supervised Learning is the one, where you can consider the learning is guided by a teacher.
We have a dataset which acts as a teacher and its role is to train the model or the machine.
Once the model gets trained it can start making a prediction or decision when new data is
given to it.
➢ Classification: A machine learning task where the model predicts categorical labels
(e.g., spam vs. not spam).
➢ Naïve Bayes: A probabilistic classifier based on Bayes' Theorem, often used for text
classification.
➢ K-Nearest Neighbours (KNN): A non-parametric, instance-based learning algorithm
that classifies based on the majority vote of its k-nearest neighbours.
➢ Linear Regression: A regression algorithm that models relationships between
independent and dependent variables using a straight line (y = mx + b).
Naïve Bayes Classifier Algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
• It is mainly used in text classification, which involves a high-dimensional training
dataset.
• Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of
an object.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.
Why is it called Naive Bayes?
The name Naive Bayes is made up of two words, Naïve and Bayes, which can be
described as:
Naive: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on the basis
of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence
each feature individually contributes to identifying it as an apple, without depending on the
other features.
Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = [ P(B|A) × P(A) ] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A is
true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
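As a small worked example of the formula (with invented numbers, purely for illustration): suppose 1% of emails are spam, the word “offer” appears in 60% of spam emails, and “offer” appears in 5% of all emails. Then:

p_spam = 0.01                # P(A): prior probability of the hypothesis "spam"
p_offer_given_spam = 0.60    # P(B|A): likelihood of the evidence given spam
p_offer = 0.05               # P(B): marginal probability of the evidence
# posterior P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(p_spam_given_offer)    # 0.12, i.e. a 12% chance the email is spam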
Advantages of Naïve Bayes Classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for Binary as well as Multi-class Classifications.
It performs well in multi-class predictions as compared to the other Algorithms.
It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
It is used in text classification such as spam filtering and sentiment analysis.
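A minimal sketch of Naïve Bayes text classification using scikit-learn's MultinomialNB; the four tiny example messages and their labels are invented for illustration only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "cheap loans offer",
         "meeting at noon tomorrow", "project report attached"]
labels = ["spam", "spam", "not spam", "not spam"]

vectorizer = CountVectorizer()        # turn text into word-count features
X = vectorizer.fit_transform(texts)

model = MultinomialNB()               # probabilistic classifier based on Bayes' theorem
model.fit(X, labels)

new_mail = vectorizer.transform(["free prize offer"])
print(model.predict(new_mail))        # expected: ['spam']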
K-Nearest Neighbour (KNN) Algorithm
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
K-NN algorithm assumes the similarity between the new case/data and available cases and
puts the new case into the category that is most similar to the available categories.
K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears, it can be easily classified into a well-suited
category by using the K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs an action
on the dataset.
At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
Why do we need a KNN algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1; in which of these categories will this data point lie? To solve this type of problem,
we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or
class of a particular data point. Consider the below diagram:
Advantages of KNN Algorithm:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
We always need to determine the value of K, which may sometimes be complex.
The computation cost is high because the distance between the new data point and all the
training samples has to be calculated.
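A minimal sketch of K-NN classification with scikit-learn, using its built-in Iris dataset; the choice of K = 5 is an arbitrary illustrative value.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)   # lazy learner: training just stores the data
knn.fit(X_train, y_train)

# classification happens at prediction time, by majority vote of the 5 nearest neighbours
print(knn.score(X_test, y_test))            # accuracy on the held-out data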
Linear Regression
Linear regression is one of the most popular and simplest machine learning algorithms used
for predictive analysis. Predictive analysis means predicting something, and linear regression
makes predictions for continuous numeric values such as salary, age, etc.
It shows the linear relationship between the dependent and independent variables, and shows
how the dependent variable (y) changes according to the independent variable (x).
It tries to fit the best line between the dependent and independent variables, and this best-fit
line is known as the regression line.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
The equation for the regression line is:
y = a0 + a1·x + ε
Here, y = dependent variable
x = independent variable
a0 = intercept of the line
a1 = slope of the line (linear regression coefficient)
ε = random error
Linear regression is further divided into two types:
• Simple Linear Regression: In simple linear regression, a single independent variable
is used to predict the value of the dependent variable.
• Multiple Linear Regression: In multiple linear regression, more than one
independent variable is used to predict the value of the dependent variable
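A minimal sketch of simple linear regression with scikit-learn; the experience-vs-salary numbers are invented for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # independent variable (years of experience)
y = np.array([30, 35, 40, 45, 50])        # dependent variable (salary in thousands)

model = LinearRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)      # a0 (intercept) = 25, a1 (slope) = 5 for these data
print(model.predict([[6]]))               # predicted salary for 6 years: 55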
What is Unsupervised Learning?
The model learns through observation and finds structures in the data. Once the model is
given a dataset, it automatically finds patterns and relationships in the dataset by creating
clusters in it. What it cannot do is add labels to the clusters: it cannot say this is a group of
apples or mangoes, but it will separate all the apples from the mangoes.
Suppose we present images of apples, bananas, and mangoes to the model. Based on some
patterns and relationships, it creates clusters and divides the dataset into those clusters. Now
if new data is fed to the model, it adds it to one of the created clusters.
• Clustering: A technique to group similar data points together without predefined
labels.
• Hierarchical Algorithms:
o Agglomerative Clustering: A bottom-up approach where each data point starts
as a cluster and merges iteratively.
• Partitional Algorithms:
o K-Means Clustering: A centroid-based clustering technique where k clusters
are formed by iteratively assigning data points to the nearest cluster center.
What is clustering?
Cluster analysis or simply clustering is the process of partitioning a set of data objects (or
observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to
one another, yet dissimilar to objects in other clusters. The set of clusters resulting from a
cluster analysis can be referred to as a clustering.
For example, consider a dataset of vehicles that contains information about different vehicles
like cars, buses, bicycles, etc.
As it is unsupervised learning, there are no class labels like Cars, Bikes, etc. for the vehicles;
all the data is combined and is not structured.
Clustering is also called data segmentation in some applications because clustering partitions
large data sets into groups according to their similarity. Clustering can also be used for outlier
detection, where outliers (values that are “far away” from any cluster) may be more
interesting than common cases.
Applications of outlier detection include the detection of credit card fraud and the monitoring
of criminal activities in electronic commerce. For example, exceptional cases in credit card
transactions, such as very expensive and infrequent purchases, may be of interest as possible
fraudulent activities.
The clustering methods can be classified into the following categories:
• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
Partitioning Method
It is used to make partitions on the data in order to form clusters.
If “n” partitions are made on “p” objects of the database, then each partition is represented by
a cluster and n ≤ p.
The two conditions which need to be satisfied by this Partitioning Clustering Method are:
• Each object should belong to only one group.
• There should be no group without even a single object.
In the partitioning method, there is one technique called iterative relocation, which means an
object may be moved from one group to another to improve the partitioning.
K-means clustering
• K-means is a partitional clustering algorithm
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …,
xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes
(dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
o Each cluster has a cluster center, called centroid.
o k is specified by the user
K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids, cluster centers
2) Assign each data point to the closest centroid
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2.
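The steps above can be run directly with scikit-learn's KMeans, which repeats the assignment and re-computation steps until convergence; the six 2-D points and k = 2 below are arbitrary illustrative values.

import numpy as np
from sklearn.cluster import KMeans

D = np.array([[1, 1], [1.5, 2], [5, 8], [8, 8], [1, 0.5], [9, 11]])

# n_init random restarts; each run iterates assign / re-compute centroids until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(D)

print(kmeans.labels_)            # cluster assignment of each data point
print(kmeans.cluster_centers_)   # final centroids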
Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn), where n is the number of data points, k is the number
of clusters, and t is the number of iterations.
– Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
Weaknesses of k-means
The algorithm is only applicable if the mean is defined.
– For categorical data, the k-modes variant is used: the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very
different values.
Hierarchical Clustering
• Produces a nested sequence of clusters, a tree, also called a dendrogram.
Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom
level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single cluster (i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster is recursively divided further
– stops when only singleton clusters of individual data points remain, i.e., each cluster with
only a single point
Agglomerative Hierarchical Clustering
• An agglomerative hierarchical clustering method uses a bottom-up strategy.
• It starts by letting each object form its own cluster and iteratively merges clusters into larger
and larger clusters, until all the objects are in a single cluster or certain termination conditions
are satisfied.
• The single cluster becomes the hierarchy’s root.
• For the merging step, it finds the two clusters that are closest to each other (according to
some similarity measure), and combines the two to form one cluster. Because two clusters are
merged per iteration, where each cluster contains at least one object, an agglomerative
method requires at most n iterations.
Divisive Hierarchical Clustering:
A divisive hierarchical clustering method employs a top-down strategy.
It starts by placing all objects in one cluster, which is the hierarchy’s root. It then divides
the root cluster into several smaller subclusters, and recursively partitions those clusters into
smaller ones.
The partitioning process continues until each cluster at the lowest level is coherent
enough—either containing only one object, or the objects within a cluster are sufficiently
similar to each other.
In either agglomerative or divisive hierarchical clustering, a user can specify the desired
number of clusters as a termination condition.
Example:
Agglomerative versus divisive hierarchical clustering. Figure shows the application of
AGNES (Agglomerative NESting), an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, on a data set of five
objects, {a,b,c,d, e}.
Initially, AGNES, the agglomerative method, places each object into a cluster of its own.
The clusters are then merged step-by-step according to some criterion.
For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form
the minimum Euclidean distance between any two objects from different clusters.
This is a single-linkage approach in that each cluster is represented by all the objects in the
cluster, and the similarity between two clusters is measured by the similarity of the closest
pair of data points belonging to different clusters.
The cluster-merging process repeats until all the objects are eventually merged to form one
cluster.
DIANA, the divisive method, proceeds in the contrasting way. All the objects are used to
form one initial cluster. The cluster is split according to some principle such as the maximum
Euclidean distance between the closest neighbouring objects in the cluster. The cluster-
splitting process repeats until, eventually, each new cluster contains only a single object.
Dendrogram:
A tree structure called a dendrogram is commonly used to represent the process of
hierarchical clustering. It shows how objects are grouped together (in an agglomerative
method) or partitioned (in a divisive method) step-by-step. Figure-2 shows a dendrogram for
the five objects presented in Figure-1, where l = 0 shows the five objects as singleton clusters
at level 0. At l = 1, objects a and b are grouped together to form the first cluster, and they stay
together at all subsequent levels. We can also use a vertical axis to show the similarity scale
between clusters. For example, when the similarity of two groups of objects, {a,b} and {c,d,
e}, is roughly 0.16, they are merged together to form a single cluster.
A challenge with divisive methods is how to partition a large cluster into several smaller
ones. For example, there are 2^(n−1) − 1 possible ways to partition a set of n objects into two
exclusive subsets, where n is the number of objects.
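A minimal sketch of agglomerative (AGNES-style) single-linkage clustering with SciPy; the five 2-D points stand in for the objects {a, b, c, d, e} and are invented for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 5.3], [5.4, 5.0]])

Z = linkage(points, method='single')   # merge the closest pair of clusters at each step
print(Z)                               # each row: clusters merged, distance, new cluster size

# cut the tree to obtain a desired number of clusters (here 2)
print(fcluster(Z, t=2, criterion='maxclust'))

# dendrogram(Z, labels=list("abcde")) would plot the tree if matplotlib is available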
Advantages of Hierarchical clustering:
• It is simple to implement and gives the best output in some cases.
• It results in a hierarchy, a structure that contains more information than a flat set of clusters.
• It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
• It breaks the large clusters.
• It is Difficult to handle different sized clusters and convex shapes.
• It is sensitive to noise and outliers.
• Once a merging or splitting step has been performed, it can never be undone later.
Association Rule Mining
Association rule mining finds interesting associations and relationships among large sets of
data items. This rule shows how frequently an itemset occurs in a transaction.
Apriori Algorithm
The Apriori Algorithm is used to find frequent item sets (items that appear together often
in a dataset) and generate association rules. It is based on the Apriori Principle, which
states:
"If an itemset is frequent, then all of its subsets must also be frequent."
This means that if a set of items frequently appears in transactions, then its smaller subsets
must also appear frequently.
Key Metrics of Apriori Algorithm
• Support: This metric measures how frequently an item appears in the dataset relative
to the total number of transactions. A higher support indicates a more significant
presence of the itemset in the dataset. Support tells us how often a particular item or
combination of items appears in all the transactions (“Bread is bought in 20% of all
transactions.”)
• Confidence: Confidence assesses the likelihood that an item Y is purchased when
item X is purchased. It provides insight into the strength of the association between
two items. Confidence tells us how often items go together. (“If bread is bought,
butter is bought 75% of the time.”)
• Lift: Lift evaluates how much more likely two items are to be purchased together
compared to being purchased independently. A lift greater than 1 suggests a strong
positive association. Lift shows how strong the connection is between items. (“Bread
and butter are much more likely to be bought together than by chance.”)
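A minimal sketch that computes the three metrics by hand on five invented transactions; the numbers are chosen so that the confidence of the rule bread → butter matches the 75% example above.

transactions = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread"},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / n

sup_bread        = support({"bread"})              # 4/5 = 0.8
sup_butter       = support({"butter"})             # 3/5 = 0.6
sup_bread_butter = support({"bread", "butter"})    # 3/5 = 0.6

confidence = sup_bread_butter / sup_bread              # 0.75: butter bought 75% of the time bread is bought
lift = sup_bread_butter / (sup_bread * sup_butter)     # 1.25 > 1: positive association

print(sup_bread_butter, confidence, lift)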