Unit 4
Clustering is like sorting a bunch of similar items into different groups based on their characteristics. In
data mining and machine learning, it’s a powerful technique used to group similar data points together,
making it easier to find patterns or understand large datasets.
You can cluster almost anything, and the more similar the items are in the cluster, the better your
clusters are.
Clustering is sometimes called unsupervised classification because it produces the same result as
classification but without having predefined classes.
This notion of similarity depends on a similarity measurement. Essentially, clustering helps identify
natural groupings in your data.
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into different sets based
on their degree of similarity.
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.
It is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters.
Here K defines the number of pre-defined clusters that need to be created in the process: if K=2,
there will be two clusters, for K=3 there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group of points with similar properties.
It allows us to cluster the data into different groups and provides a convenient way to discover the
categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between each data point and the centroid of
its cluster.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until it finds the best clusters. The value of k should be
predetermined in this algorithm.
Hence each cluster has data points with some commonalities and is well separated from the other clusters.
Objective of K-Means Clustering
The main objective of k-means clustering is to partition your data into a specific number (k) of groups,
where data points within each group are similar to each other and dissimilar to points in other groups. It
achieves this by minimizing the distance between data points and their assigned cluster's center, called
the centroid.
Here are the objectives:
Grouping similar data points: K-means aims to identify patterns in your data by grouping data
points that share similar characteristics together. This allows you to discover underlying structures
within the data.
Minimizing within-cluster distance: The algorithm strives to make sure data points within a cluster
are as close as possible to each other, as measured by a distance metric (usually Euclidean distance).
This ensures tight-knit clusters with high cohesiveness.
Maximizing between-cluster distance: Conversely, k-means also tries to maximize the separation
between clusters. Ideally, data points from different clusters should be far apart, making the clusters
distinct from each other.
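Formally, the quantity that k-means minimizes (the within-cluster sum of squares, sometimes called inertia) can be written as

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2

where C_j is the set of points assigned to cluster j and \mu_j is the centroid (mean) of that cluster.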
The k-means algorithm works like this.
1. Initialization: Start by randomly selecting K points from the dataset. These points will act as the
initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that point and each of
the K centroids. Assign the data point to the cluster whose centroid is closest to it. This step
effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the centroids of
the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids no longer
change significantly or when a specified number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster centroids and the
assignment of each data point to a cluster.
K-Means Clustering
The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity,
we will use the Euclidean distance as a measurement.
We categorize each item to its closest mean, and we update the mean’s coordinates, which are the
averages of the items categorized in that cluster so far.
We repeat the process for a given number of iterations and at the end, we have our clusters.
The “points” mentioned above are called means because they are the mean values of the items categorized
in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means
at random items in the data set. Another method is to initialize the means at random values between the
boundaries of the data set. For example, if for a feature x the items have values in [0,3], we will initialize the
means with values for x in the range [0,3].
Euclidean distance is the most commonly used distance measure.
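A minimal sketch of this procedure in Python, assuming NumPy is available and initializing the means at random items from the dataset; the function name k_means and the toy data are illustrative, not taken from the notes:

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups using plain k-means."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k random items from the dataset as the initial means
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each point goes to the closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop when the centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Example: two obvious groups of 2-D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = k_means(X, k=2)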
Applications of the K-Means Clustering Algorithm
Image Segmentation
K-Means clustering can be used to segment an image into different regions based on the color or texture of
the pixels. This technique is widely used in computer vision applications, such as object recognition, image
retrieval, and medical imaging.
Customer Segmentation
K-Means clustering can be used to segment customers into different groups based on their purchasing
behavior or demographic characteristics. This technique is widely used in marketing applications, such as
customer retention, loyalty programs, and targeted advertising.
Anomaly Detection
K-Means clustering can be used to detect anomalies in a dataset by identifying data points that do not belong
to any cluster. This technique is widely used in fraud detection, network intrusion detection, and predictive
maintenance.
Looking at items commonly purchased together can give stores an idea of customers’ purchasing
behavior.
This knowledge, extracted from the sea of data, can be used for pricing, marketing promotions,
inventory management, and so on.
Looking for hidden relationships in large datasets is known as association analysis or association
rule learning.
The problem is, finding different combinations of items can be a time-consuming task and
prohibitively expensive in terms of computing power.
Brute-force solutions aren’t capable of solving this problem, so a more intelligent approach is
required to find frequent itemsets in a reasonable amount of time.
Association rule mining finds interesting associations and relationships among large sets of
data items.
This rule shows how frequently an itemset occurs in a transaction.
A typical example is Market Basket Analysis. Market Basket Analysis is one of the key
techniques used by large retailers to show associations between items.
It allows retailers to identify relationships between the items that people buy together
frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based
on the occurrences of other items in the transaction.
Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps accordingly so that it can be
more profitable.
It tries to find some interesting relations or associations among the variables of the dataset. It is
based on different rules to discover the interesting relations between variables in the database.
Association rule learning is one of the important concepts of machine learning, and
it is employed in market basket analysis, web usage mining, continuous production, etc.
Here, market basket analysis is a technique used by various big retailers to discover the
associations between items.
We can understand it by taking an example of a supermarket, as in a supermarket, all products that are purchased
together are put together
Association analysis is the task of finding interesting relationships in large datasets.
These interesting relationships can take two forms: frequent item sets or association rules.
Frequent item sets are a collection of items that frequently occur together.
The second way to view interesting relationships is association rules. Association rules suggest
that a strong relationship exists between two items.
• Frequent items sets are lists of items that commonly appear together.
• From the dataset we can also find an association rule such as bread ➞ milk. This means that if
someone buys bread, there’s a good chance they’ll buy milk.
• With the frequent item sets and association rules, retailers have a much better understanding of
their customers.
• Although common examples of association analysis are from the retail industry, it can be applied
to a number of other industries, such as website traffic analysis and medicine.
If lift = 1, it would imply that the probability of occurrence of the antecedent and that of
the consequent are independent of each other.
When two events are independent of each other no rule can be drawn involving those
two events.
If lift > 1, that lets us know the degree to which two occurrences are dependent on one
another and makes those rules potentially useful for predicting the consequent in future
data sets.
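To make these measures concrete, the snippet below computes support, confidence, and lift for one hypothetical rule (bread -> milk) over a small made-up transaction list; both the data and the rule are assumptions for illustration only.

# Toy transactions (hypothetical) to illustrate support, confidence and lift
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"milk"}
sup_rule = support(antecedent | consequent)      # P(bread and milk)
confidence = sup_rule / support(antecedent)      # P(milk | bread)
lift = confidence / support(consequent)          # >1 indicates positive dependence

print(sup_rule, confidence, lift)                # 0.6, 0.75, 0.9375 for this toy data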
Types of Association Rule Learning: Association rule learning can be divided into three algorithms: Apriori, Eclat, and FP-Growth (the FP-Growth algorithm is covered later in this unit).
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases
that contain transactions. This algorithm uses a breadth-first search and a Hash Tree to count candidate
itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.
Eclat Algorithm
The Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database. It generally executes faster than the Apriori
algorithm.
In association analysis two things are important: frequent itemsets and association rules.
One way to find frequent itemsets is the Apriori algorithm.
The Apriori algorithm needs a minimum support level as an input and a data set.
The algorithm will generate a list of all candidate itemsets with one item.
The transaction data set will then be scanned to see which sets meet the minimum support level.
Sets that don’t meet the minimum support level will get tossed out. The remaining sets will then be
combined to make itemsets with two elements.
Again, the transaction dataset will be scanned and itemsets not meeting the minimum support level
will get tossed.
This procedure will be repeated until all sets are tossed out.
Procedure
The Apriori algorithm uses the Apriori principle to reduce the number of sets that are checked against the database.
The Apriori principle states that if an item is infrequent, then supersets containing that item will also be
infrequent.
The Apriori algorithm starts from single itemsets and creates larger sets by combining sets that meet the
minimum support measure.
Support is used to measure how often a set appears in the original data.
Once frequent itemsets have been found, you can use the frequent itemsets to generate association rules.
The significance of an association rule is measured by confidence.
Confidence tells you how many times this rule applies to the frequent itemsets.
Let's understand the concept of the Apriori algorithm with the help of an example: given a small set of
transactions, we will find the frequent itemsets and generate association rules from them.
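A minimal pure-Python sketch of the procedure on a small hypothetical bread/butter/milk transaction list; the data, the min_support and min_confidence values, and the helper names are illustrative assumptions, not taken from the notes.

from itertools import combinations

# Hypothetical transactions for illustration
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def apriori(transactions, min_support=0.6):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    # Start with candidate itemsets of one item
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Keep only candidates that meet the minimum support level
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        # Combine the surviving sets to build candidates with k+1 items
        keys = list(current)
        candidates = list({a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1})
        k += 1
    return frequent

def rules(frequent, min_confidence=0.7):
    """Generate rules antecedent -> consequent from the frequent itemsets."""
    out = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    out.append((set(antecedent), set(itemset - antecedent), conf))
    return out

freq = apriori(transactions, min_support=0.6)
for a, c, conf in rules(freq, min_confidence=0.7):
    print(a, "->", c, round(conf, 2))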
The Apriori Algorithm, as demonstrated in the bread-butter example, is widely used in modern
startups like Zomato, Swiggy, and other food delivery platforms. These companies use it to
perform market basket analysis, which helps them identify customer behavior patterns and
optimize recommendations.
Applications
• Association analysis can be performed on many different items. Some
common examples are items in a store and pages visited on a website.
• Association analysis has also been used to look at the voting history of
elected officials and judges.
Efficiently finding frequent itemsets with FP-growth Algorithm
The Frequent Pattern Growth (FP-Growth) algorithm improves upon the Apriori algorithm by
eliminating the need for multiple database scans and reducing computational overhead.
By using a Trie data structure and focusing on ordered-item sets, FP-Growth efficiently mines
frequent itemsets, making it a faster and more scalable solution for large datasets.
This approach ensures faster processing while maintaining accuracy, making FP-Growth a
powerful tool for data mining.
Why is FP-growth faster?
The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset.
The FP-growth algorithm works with the Apriori principle but is much faster.
The Apriori algorithm generates candidate itemsets and then scans the dataset to see if they’re
frequent.
FP-growth is faster because it goes over the dataset only twice.
The dataset is stored in a structure called an FP-tree.
After the FP-tree is built, you can find frequent itemsets by finding conditional bases for an item
and building a conditional FP-tree.
This process is repeated, conditioning on more items, until the conditional FP-tree contains only one
item.
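The sketch below illustrates the two-pass idea by building an FP-tree over a small hypothetical transaction list; the FPNode class, the data, and the support threshold are illustrative assumptions, and the mining of conditional FP-trees is omitted for brevity.

from collections import defaultdict

class FPNode:
    """A node of the FP-tree: an item, a count, a parent link, and child links."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(transactions, min_support_count=2):
    # Pass 1: count the support of every single item
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_support_count}

    # Pass 2: insert each transaction, items ordered by descending support
    root = FPNode(None)
    header = defaultdict(list)   # item -> list of nodes (used to trace conditional bases)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header

# Hypothetical transactions
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
root, header = build_fp_tree(transactions)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})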
Applications
• The FP-growth algorithm can be used to find frequent words in a series of text documents.
• The microblogging site Twitter provides a number of APIs for developers to use their services.
• The Python module Python-Twitter allows easy access to Twitter.
• Applying the FP-growth algorithm to a Twitter feed on a certain topic can give you some summary
information for that topic.
• There are a number of other uses for frequent itemset generation such as shopping transactions,
medical diagnosis, and study of the atmosphere.
Dimensionality-reduction Techniques
While working with machine learning models, we often encounter datasets with a large number of
features. These datasets can lead to problems such as increased computation time and overfitting. To
address these issues, we use dimensionality reduction techniques.
The number of input features, variables, or columns present in a given dataset is known as dimensionality,
and the process to reduce these features is called dimensionality reduction.
These techniques are used to preprocess the inputs before applying any of the ML algorithms.
They reduce noise and improve the performance of machine learning algorithms, and they work
on both labeled and unlabeled data.
The Curse of Dimensionality
Handling the high-dimensional data is very difficult in practice, commonly known as the curse of
dimensionality.
If the dimensionality of the input dataset increases, any machine learning algorithm and model
becomes more complex.
As the number of features increases, the number of samples needed to cover the feature space also increases
proportionally, and the chance of overfitting also increases.
If the machine learning model is trained on high-dimensional data, it becomes overfitted and results
in poor performance.
Hence, it is often required to reduce the number of features, which can be done with dimensionality
reduction.
What is dimensionality reduction?
In machine learning classification problems, there are often too many factors on the basis of which the
final classification is done.
These factors are basically variables called features. The higher the number of features, the
harder it gets to visualize the training set and then work on it.
Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality
reduction algorithms come into play. Eg: humidity and rainfall, e-mail spam classification etc
Dimensionality reduction is the process of reducing the number of random variables under
consideration, by obtaining a set of principal variables.
It can be divided into feature selection and feature extraction.
A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2 dimensional space, and a 1-D problem to a simple line.
The figure described below illustrates this concept, where a 3-D feature space is split into two 2-D feature
spaces, and later, if the features are found to be correlated, the number of features can be reduced even further.
On the left, data points exist in a 3D space (X, Y, Z), but the Z dimension appears unnecessary since the
data primarily varies along the X and Y axes. The goal of dimensionality reduction is to remove less
important dimensions without losing valuable information.
Feature selection: find a subset of the original set of variables, or features, to get a smaller subset
which can be used to model the problem.
It usually involves three ways:
Filter
Wrapper
Embedded
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e.
a space with a smaller number of dimensions. E.g., PCA and SVD.
Feature Selection
Feature selection chooses the most relevant features from the dataset without altering them. It helps
remove redundant or irrelevant features, improving model efficiency. There are several methods for
feature selection including filter methods, wrapper methods, and embedded methods.
1. Filter methods rank the features based on their relevance to the target variable.
2. Wrapper methods use the model performance as the criteria for selecting features.
3. Embedded methods combine feature selection with the model training process.
Feature Extraction
Feature extraction involves creating new features by combining or transforming the original features.
PCA is a popular technique that projects the original features onto a lower-dimensional space while
preserving as much of the variance as possible.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given below:
Faster Computation: With fewer features, machine learning algorithms can process data more
quickly. This results in faster model training and testing, which is particularly useful when working
with large datasets.
Better Visualization: As we saw in the earlier figure, reducing dimensions makes it easier to visualize
data, revealing hidden patterns.
Prevent Overfitting: With fewer features, models are less likely to memorize the training data and
overfit. This helps the model generalize better to new, unseen data, improving its ability to make
accurate predictions.
By reducing the dimensions of the features, the space required to store the dataset is also reduced.
Less computation and training time is required with a reduced number of feature dimensions.
Reduced feature dimensions help in visualizing the data quickly.
It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction
Data Loss & Reduced Accuracy – Some important information may be lost during
dimensionality reduction, potentially affecting model performance.
Interpretability Challenges – The transformed features (e.g., principal components) may not
have clear meanings, making it harder to understand relationships in the original data.
Choosing the Right Components – Deciding how many dimensions to keep is difficult, as
keeping too few may lose valuable information, while keeping too many can lead to overfitting.
Some data may be lost due to dimensionality reduction.
In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes
not known in advance.
Principal Component Analysis (PCA)
This method was introduced by Karl Pearson.
Principal Component Analysis is a statistical process that converts the observations of correlated features
into a set of linearly uncorrelated features with the help of orthogonal transformation.
These new transformed features are called the Principal Components.
It works on a condition that while the data in a higher dimensional space is mapped to data in a lower
dimension space, the variance of the data in the lower dimensional space should be maximum.
It is one of the popular tools used for exploratory data analysis, image processing, movie
recommendation systems, and optimizing the power allocation in various communication channels.
• It involves the following steps:
i. Construct the covariance matrix of the data.
ii. Compute the eigenvectors of this matrix.
iii. The eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of the variance of the original data.
• Since we are left with a smaller number of eigenvectors, some data may be lost in the process.
• However, the most important variance is retained by the remaining eigenvectors.
Key Concepts of PCA
Dimensionality Reduction
PCA reduces the number of features while preserving as much variance (information)
as possible.
Orthogonal Transformation
PCA transforms the original correlated features into a set of new uncorrelated features
called principal components.
Variance Maximization
The first principal component captures the largest variance in the data.
The second principal component captures the next highest variance orthogonal to the
first.
This process continues for all components.
Principal Components in PCA
The transformed new features or the output of PCA are the Principal Components.
The number of these PCs is either equal to or less than the number of original features present in the dataset.
The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them.
Because sometimes, variables are highly correlated in such a way that they contain redundant
information. So, in order to identify these correlations, we compute the covariance matrix.
If the covariance value is positive, then it indicates that the variables are
correlated. (If X increases, Y also increases, and vice versa.)
If the covariance value is negative, then it indicates that the variables are
inversely correlated. (If X increases, Y decreases, and vice versa.)
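For reference, the covariance between two features X and Y over n samples is

\mathrm{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

and the covariance matrix collects these values for every pair of features, with the variances of the individual features on the diagonal.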
It is the eigenvectors and eigenvalues that are behind all the magic of the principal components, because the
eigenvectors of the covariance matrix are the directions of the axes with the most
variance (most information), and these are what we call the Principal Components.
And eigenvalues are simply the coefficients attached to eigenvectors, which give the amount of
variance carried in each Principal Component.
By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal
components in order of significance.
Step 4: Create a Feature Vector
In the previous step, computing the eigenvectors and ordering them by their eigenvalues in
descending order allows us to find the principal components in order of significance.
In this step, we choose whether to keep all of these components or discard those of
lesser significance (low eigenvalues), and form a matrix with the remaining eigenvectors
that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components
that we decide to keep.
This makes it the first step towards dimensionality reduction, because if we choose to keep
only p eigenvectors (components) out of n, the final data set will have only p dimensions.
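A short NumPy sketch of these steps (center the data, build the covariance matrix, take the top-p eigenvectors as the feature vector, and project); the pca function and the sample matrix are illustrative, not from the notes:

import numpy as np

def pca(X, p):
    """Project the rows of X onto the top-p principal components."""
    # Center the data (subtract the mean of each feature)
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features
    cov = np.cov(X_centered, rowvar=False)
    # Eigenvectors/eigenvalues of the covariance matrix (eigh: symmetric matrix)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]          # highest variance first
    # Feature vector = matrix whose columns are the kept eigenvectors
    feature_vector = eigenvectors[:, order[:p]]
    # Recast the data along the principal component axes
    return X_centered @ feature_vector

# Example: 5 samples with 3 correlated features, reduced to 2 dimensions
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])
X_reduced = pca(X, p=2)   # shape (5, 2)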
Disadvantages of Principal Component Analysis
Sometimes, PCA is difficult to interpret; you may find it difficult to identify the
most important features even after computing the principal components.
You may face some difficulties in calculating the covariances and covariance matrices.
Sometimes, the computed principal components are more difficult to interpret than
the original set of features.
Singular Value Decomposition
A matrix decomposition method for reducing a matrix to its constituent parts in order to make
certain subsequent matrix calculations simpler.
It represents our original data set with a much smaller data set.
It extracts the relevant features from a collection of noisy data.
It is a powerful tool used to distill information in a number of applications, from bioinformatics to finance,
including restaurant selection, etc.
Advantages of SVD
Dimensionality reduction: SVD can reduce the number of dimensions in high-dimensional data,
making it easier to visualize and analyze.
Applications in various fields: SVD has various applications in fields like image processing, natural
language processing, and recommendation systems.
Interpretability: The singular values and singular vectors obtained from SVD can provide insight into
the structure of the data and the relationship between features.
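As a brief illustration, the NumPy snippet below decomposes a small made-up matrix with numpy.linalg.svd and rebuilds a rank-k approximation by keeping only the largest singular values:

import numpy as np

# A small made-up data matrix (rows = samples, columns = features)
A = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 5.0],
              [1.0, 0.0, 4.0]])

# Full decomposition: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values for a rank-k approximation
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reduced representation of the rows in k dimensions
A_reduced = U[:, :k] * s[:k]

print(np.round(s, 3))          # singular values, largest first
print(np.round(A_approx, 2))   # low-rank reconstruction of A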
Disadvantages of SVD
Limitations in non-linear relationships: SVD assumes that the relationships between features are linear,
which can limit its ability to capture complex non-linear relationships.
SVD is sensitive to the scale of the features.
Applications of SVD
Image Compression
SVD allows for efficient storage and compression of images by keeping only the most significant
singular values. Reduces storage requirements while maintaining image quality.
Recommender Systems
Used in collaborative filtering for building recommendation systems (e.g., movie or product
recommendations). Netflix and Amazon use SVD-based techniques to improve user recommendations.
Signal Processing
Helps in noise reduction and feature extraction in signals like speech or EEG signals. Used in
filtering and denoising audio/video signals.
Anomaly Detection
Identifies outliers in data by analyzing the contribution of singular values.
Used in fraud detection and fault diagnosis.
Data Clustering
Helps in discovering hidden patterns in high-dimensional datasets.
Quantum Computing
SVD is used in quantum algorithms, including quantum principal component analysis.
Feature selection
Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant
features present in a dataset, in order to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.
To find a subset of the original set of variables, or features, to get a smaller subset which can
be used to model the problem.
Feature or variable ranking
It is the process of ordering the features by the value of some scoring function, which usually
measures feature-relevance.
The score S(fi) is computed from the training data, measuring some criteria of feature fi.
By convention, a high score is indicative of a valuable (relevant) feature.
A simple method for feature selection using variable ranking is to select the k highest ranked
features according to S.
This is usually not optimal, but often preferable to other, more complicated methods. It is
computationally efficient — only calculation and sorting of n scores.
Ranking Criteria: Correlation Criteria and Information Theoretic Criteria
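A small sketch of variable ranking using the absolute Pearson correlation with the target as the scoring function S(fi); the randomly generated data and the choice of k are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 6
X = rng.normal(size=(n_samples, n_features))
# Make the target depend mostly on features 0 and 3 (plus noise)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n_samples)

# Score S(f_i) = |correlation(f_i, y)|; a higher score means a more relevant feature
scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(n_features)])

# Select the k highest-ranked features
k = 2
top_k = np.argsort(scores)[::-1][:k]
print("scores:", np.round(scores, 2))
print("selected features:", top_k)   # expected to include 0 and 3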
Important features of the feature ranking method of dimensionality reduction
Feature ranking is a popular method of dimensionality reduction that involves selecting a subset of
features from a larger set of variables.
The goal is to identify the most important features that contribute to the prediction or classification
task. Here are some important features of feature ranking methods.
Ranking Criteria: Feature ranking methods use various criteria to rank the importance of
features, such as correlation, mutual information, entropy, and so on. The ranking criteria
should reflect the relevance of each feature to the problem at hand.
Selection Method: Once the features are ranked, a selection method is used to choose a subset
of the top-ranked features. The selection method can be based on a threshold value, such as
selecting the top k features, or it can be a more sophisticated method, such as a greedy
algorithm or a genetic algorithm.
Performance Evaluation: Feature ranking methods should be evaluated based on their ability
to improve the performance of the prediction or classification model. The selected subset of
features should lead to better performance compared to using all the features.
Robustness: Feature ranking methods should be robust to noise and outliers in the data. They
should also be able to handle missing values and deal with redundant or correlated features.
Interpretability: Feature ranking methods should provide interpretable results that can be easily
understood by domain experts. The importance of each feature should be explained in terms of its
relevance to the problem at hand.
Overall, feature ranking is a useful technique for reducing the dimensionality of high-dimensional
data and improving the performance of machine learning models
Feature Subset Selection
The Goal of Feature Subset Selection is to find the optimal feature subset.
Feature Subset Selection Methods can be classified into three broad categories
Filter Methods
Wrapper Methods
Embedded Methods
Filter Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken.
Filter methods select subsets of variables as a pre-processing step, independently of the classifier that will be used.
Filter methods evaluate each feature independently with respect to the target variable.
Features with a high correlation with the target variable are selected, as this means the feature has some
relationship with the target and can help us in making predictions.
These methods are used in the preprocessing phase to remove irrelevant or redundant features based on
statistical tests (correlation) or other criteria.
Key features of Filter Methods for Feature Subset Selection:
Filter Methods are usually fast
Filter Methods provide generic selection of features, not tuned by given learner (universal)
Filter Methods are also often criticized because the feature set is not optimized for the classifier that is used
Filter Methods are sometimes used as a pre-processing step for other methods.
Some common techniques of filters method are:
Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
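A short filter-method example, assuming scikit-learn is available: SelectKBest scores each feature against the target with an ANOVA F-test and keeps the best k, independently of any downstream classifier.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter step: score each feature with the ANOVA F-test, keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_filtered = selector.fit_transform(X, y)

print("scores:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_filtered.shape)   # (150, 2)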
Wrapper Methods
The wrapper method has the same goal as the filter method, but it takes a machine learning model
for its evaluation.
The learner is considered a black box.
In this method, a subset of features is fed to the ML model, and its performance is evaluated.
The performance decides whether to add or remove features to increase the accuracy of the
model.
This method is more accurate than the filter method but is more complex to work with.
Some common techniques of wrapper methods are:
Forward Selection
Backward Selection
Bi-directional Elimination
Wrapper methods are also referred to as greedy algorithms that train the algorithm iteratively.
They use different combinations of features, compute the relationship between these feature subsets and
the target variable, and based on the conclusions, features are added or removed.
Stopping criteria for selecting the best subset are usually pre-defined by the person training the model,
such as when the performance of the model decreases or a specific number of features is reached.
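A sketch of a wrapper method, again assuming scikit-learn is available: SequentialFeatureSelector wraps a classifier (logistic regression here) and performs forward selection by evaluating model performance for each candidate subset.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Wrapper step: the model itself (treated as a black box) judges each feature subset
model = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward", cv=5)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))
X_selected = sfs.transform(X)   # dataset reduced to the chosen subset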
Embedded Methods
Embedded Methods are specific to a given learning machine.
They perform variable selection (implicitly) in the process of training.
Embedded methods check the different training iterations of the machine learning model and
evaluate the importance of each feature.
E.g. WINNOW-algorithm (linear unit with multiplicative updates).
Some common techniques of Embedded methods are:
LASSO
Elastic Net
Ridge Regression, etc.
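A minimal sketch of an embedded method, assuming scikit-learn is available: LASSO performs feature selection implicitly while training, because the L1 penalty drives the coefficients of irrelevant features to exactly zero. The generated data and the alpha value are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only features 0 and 4 actually influence the target (illustrative data)
y = 4.0 * X[:, 0] - 3.0 * X[:, 4] + rng.normal(scale=0.5, size=200)

# Embedded step: selection happens during training via the L1 penalty
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

selected = np.nonzero(lasso.coef_)[0]
print("coefficients:", lasso.coef_.round(2))
print("features kept by LASSO:", selected)   # typically just features 0 and 4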