
Unit-IV

Machine Learning II: Unsupervised Learning

 Grouping unlabeled items using k-means clustering
 Association analysis
 Association analysis with the Apriori algorithm
 Efficiently finding frequent item sets with FP-growth
 Dimensionality reduction: Feature extraction - Principal component analysis, Singular value decomposition
 Feature selection – feature ranking and subset selection: filter, wrapper and embedded methods
Grouping unlabeled items using k-means clustering
Clustering
 Clustering is a type of unsupervised learning that automatically forms clusters of similar things. It’s like
automatic classification.

 Clustering is like sorting a bunch of similar items into different groups based on their characteristics. In
data mining and machine learning, it’s a powerful technique used to group similar data points together,
making it easier to find patterns or understand large datasets.

 You can cluster almost anything, and the more similar the items are in the cluster, the better your
clusters are.

 Clustering is sometimes called unsupervised classification because it produces the same result as
classification but without having predefined classes.

 This notion of similarity depends on a similarity measurement. Essentially, clustering helps identify
natural groupings in your data.
Types of Clustering
 Clustering is a type of unsupervised learning wherein data points are grouped into different sets based
on their degree of similarity.

 The various types of clustering are:


1. Hierarchical clustering
2. Partitioning clustering

 Hierarchical clustering is further subdivided into:


 Agglomerative clustering
 Divisive clustering
 Partitioning clustering is further subdivided into:
 K-Means clustering
 Fuzzy C-Means clustering
Hierarchical Clustering
Hierarchical clustering uses a tree-like structure.
Partitioning Clustering
Partitioning clustering is split into two subtypes: K-Means clustering and Fuzzy C-Means clustering.
K-Means Clustering Algorithm

 K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science.

 It is an unsupervised learning algorithm that groups the unlabeled dataset into different clusters.
Here K defines the number of pre-defined clusters that need to be created in the process; for example,
if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.

 It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way
that each data point belongs to only one group, whose members share similar properties.

 It allows us to cluster the data into different groups and provides a convenient way to discover the
categories in an unlabeled dataset on its own, without the need for any training.

 It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
 The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats
the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

 The k-means clustering algorithm mainly performs two tasks:


 Determines the best positions for the K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. The data points that are nearest to a particular
k-center form a cluster.

 Hence each cluster contains data points with some commonalities, and is separated from the other clusters.
Objective of k means Clustering
The main objective of k-means clustering is to partition your data into a specific number (k) of groups,
where data points within each group are similar and dissimilar to points in other groups. It achieves this
by minimizing the distance between data points and their assigned cluster’s center, called the centroid.

Specifically, the objectives are:

 Grouping similar data points: K-means aims to identify patterns in your data by grouping data
points that share similar characteristics together. This allows you to discover underlying structures
within the data.
 Minimizing within-cluster distance: The algorithm strives to make sure data points within a cluster
are as close as possible to each other, as measured by a distance metric (usually Euclidean distance).
This ensures tight-knit clusters with high cohesiveness.
 Maximizing between-cluster distance: Conversely, k-means also tries to maximize the separation
between clusters. Ideally, data points from different clusters should be far apart, making the clusters
distinct from each other.
The k-means algorithm works like this.

1. First, the k centroids are randomly initialized, typically at randomly chosen data points.


2. Next, each point in the dataset is assigned to a cluster.
3. The assignment is done by finding the closest centroid and assigning the point to that
centroid's cluster.
4. After this step, each centroid is updated to the mean value of all the points assigned to
its cluster.
5. "Closest centroid" is defined by a distance measure.
6. Performance on a dataset therefore depends on the distance measure used.
The working of the K-Means algorithm is explained in the below steps:

1. Initialization: Start by randomly selecting K points from the dataset. These points will act as the
initial cluster centroids.
2. Assignment: For each data point in the dataset, calculate the distance between that point and each of
the K centroids. Assign the data point to the cluster whose centroid is closest to it. This step
effectively forms K clusters.
3. Update centroids: Once all data points have been assigned to clusters, recalculate the centroids of
the clusters by taking the mean of all data points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids no longer
change significantly or when a specified number of iterations is reached.
5. Final Result: Once convergence is achieved, the algorithm outputs the final cluster centroids and the
assignment of each data point to a cluster.
K means Clustering

The algorithm will categorize the items into k groups or clusters of similarity. To calculate that similarity,
we will use the Euclidean distance as a measurement.

The algorithm works as follows:

First, we randomly initialize k points, called means or cluster centroids.

We categorize each item to its closest mean, and we update the mean’s coordinates, which are the
averages of the items categorized in that cluster so far.

We repeat the process for a given number of iterations and at the end, we have our clusters.

The “points” mentioned above are called means because they are the mean values of the items categorized
in them. To initialize these means, we have a lot of options. An intuitive method is to initialize the means
at random items in the data set. Another method is to initialize the means at random values between the
boundaries of the data set. For example, if a feature x takes values in [0,3], we initialize the means
with random values for x in [0,3].
Euclidean distance is the most commonly used distance measure.
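As a concrete illustration of the steps above, here is a minimal NumPy sketch of k-means (random initialization at data points, Euclidean-distance assignment, mean update). The toy data, function name, and parameter choices are illustrative assumptions, not part of the original material.

```python
# A minimal k-means sketch with NumPy, assuming 2-D toy data and Euclidean distance.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the means at k randomly chosen items from the data set
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence check
            break
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs of 2-D points
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(np.round(centroids, 2))
```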
Advantages of K-Means Clustering Algorithm

 It is very easy to understand and implement.


 If we have a large number of variables, K-means is faster than hierarchical clustering.
 When centroids are recomputed, an instance can change its cluster.
 Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages of K-Means Clustering Algorithm

 It is a bit difficult to predict the number of clusters i.e. the value of k.


 The output is strongly impacted by the initial inputs, such as the number of clusters (value of k).
 The order of the data can have a strong impact on the final output.
 It is very sensitive to rescaling. If we rescale the data by normalization or standardization, the
final output can change completely.
 It does not cluster well when the clusters have a complicated geometric shape.
Applications of K-Means Clustering

Image Segmentation
K-Means clustering can be used to segment an image into different regions based on the color or texture of
the pixels. This technique is widely used in computer vision applications, such as object recognition, image
retrieval, and medical imaging.

Customer Segmentation
K-Means clustering can be used to segment customers into different groups based on their purchasing
behavior or demographic characteristics. This technique is widely used in marketing applications, such as
customer retention, loyalty programs, and targeted advertising.

Anomaly Detection
K-Means clustering can be used to detect anomalies in a dataset by identifying data points that do not belong
to any cluster. This technique is widely used in fraud detection, network intrusion detection, and predictive
maintenance.

Genomic Data Analysis


K-Means clustering can be used to analyze gene expression data to identify different groups of genes that are
co-regulated or co-expressed. This technique is widely used in bioinformatics applications, such as drug
discovery, disease diagnosis, and personalized medicine.
Association analysis or association rule learning

 Looking at items commonly purchased together can give stores an idea of customers’ purchasing
behavior.
 This knowledge, extracted from the sea of data, can be used for pricing, marketing promotions,
inventory management, and so on.
 Looking for hidden relationships in large datasets is known as association analysis or association
rule learning.
 The problem is, finding different combinations of items can be a time-consuming task and
prohibitively expensive in terms of computing power.
 Brute-force solutions aren’t capable of solving this problem, so a more intelligent approach is
required to find frequent itemsets in a reasonable amount of time.
 Association rule mining finds interesting associations and relationships among large sets of
data items.
 These rules show how frequently an itemset occurs in transactions.

 A typical example is Market Basket Analysis. Market Basket Analysis is one of the key
techniques used by large retailers to show associations between items.

 It allows retailers to identify relationships between the items that people buy together
frequently.

 Given a set of transactions, we can find rules that will predict the occurrence of an item based
on the occurrences of other items in the transaction.
 Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps them accordingly, so that the
relationship can be exploited profitably.

 It tries to find some interesting relations or associations among the variables of dataset. It is
based on different rules to discover the interesting relations between variables in the database.

 The association rule learning is one of the very important concepts of machine learning, and
it is employed in Market Basket analysis, Web usage mining, continuous production, etc.

 Here, market basket analysis is a technique used by large retailers to discover the associations
between items.
We can understand it by taking the example of a supermarket, where products that are frequently
purchased together are placed together.
 Association analysis is the task of finding interesting relationships in large datasets.

 These interesting relationships can take two forms: frequent item sets or association rules.

 Frequent item sets are a collection of items that frequently occur together.

 The second way to view interesting relationships is association rules. Association rules suggest
that a strong relationship exists between two items.
• Frequent items sets are lists of items that commonly appear together.
• From the dataset we can also find an association rule such as bread ➞ milk. This means that if
someone buys bread, there’s a good chance they’ll buy milk.
• With the frequent item sets and association rules, retailers have a much better understanding of
their customers.
• Although common examples of association analysis are from the retail industry, it can be applied
to a number of other industries, such as website traffic analysis and medicine.
 If lift = 1, it would imply that the probability of occurrence of the antecedent and that of
the consequent are independent of each other.

 When two events are independent of each other no rule can be drawn involving those
two events.

 If lift > 1, that lets us know the degree to which two occurrences are dependent on one
another and makes those rules potentially useful for predicting the consequent in future
data sets.
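For reference, the sketch below shows how support, confidence, and lift are typically computed for a rule such as bread ➞ milk; the five-transaction dataset and the function name are made up purely for illustration.

```python
# Illustrative computation of support, confidence and lift for {bread} -> {milk}
# on a made-up five-transaction dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / n

antecedent, consequent = {"bread"}, {"milk"}
supp_rule = support(antecedent | consequent)      # P(bread and milk)
confidence = supp_rule / support(antecedent)      # P(milk | bread)
lift = confidence / support(consequent)           # confidence / P(milk)
print(supp_rule, confidence, lift)                # 0.4, 0.666..., 0.833...
```

Here lift is below 1, so bread and milk occur together less often than independence would suggest; a lift above 1 would indicate a potentially useful rule, as described above.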
Types of Association Rule Learning: Association rule learning can be divided into three algorithms:

Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases
that contain transactions. It uses a breadth-first search and a hash tree to count itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.

Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database. It performs faster execution than Apriori
Algorithm.

F-P Growth Algorithm


The FP-growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori
algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree
(FP-tree). The purpose of this tree is to extract the most frequent patterns.
Apriori algorithm
• Name of the algorithm is Apriori because it uses prior knowledge of frequent item set properties.
• How do we define these so-called interesting relationships? Who defines what’s interesting?
• When we’re looking for frequent item sets, what’s the definition of frequent?
• There are a number of concepts we can use to select these things, but the two most important are
support and confidence.
• This entire process breaks down into two steps: 1) find all frequent item sets within the
dataset, and 2) find rules, using the frequent item sets, that also have high confidence.
• The Apriori algorithm scans and determines the frequency of individual items and excludes
those that do not meet the threshold.
• The Apriori principle says that if an itemset is frequent, then all of its subsets are frequent.
• The rule turned around says that if an itemset is infrequent, then its supersets are also infrequent.
Finding frequent itemsets with the Apriori algorithm

 In association analysis two things are important i.e. Frequent item sets and association rules.
 The way to find frequent itemsets is the Apriori algorithm.
 The Apriori algorithm needs a minimum support level as an input and a data set.
 The algorithm will generate a list of all candidate itemsets with one item.
 The transaction data set will then be scanned to see which sets meet the minimum support level.
 Sets that don’t meet the minimum support level will get tossed out. The remaining sets will then be
combined to make itemsets with two elements.
 Again, the transaction dataset will be scanned and itemsets not meeting the minimum support level
will get tossed.
 This procedure will be repeated until all sets are tossed out.
Procedure
 Apriori principle to reduce the number of sets that are checked against the database.
 The Apriori principle states that if an item is infrequent, then supersets containing that item will also be
infrequent.
 The Apriori algorithm starts from single itemsets and creates larger sets by combining sets that meet the
minimum support measure.
 Support is used to measure how often a set appears in the original data.
 Once frequent itemsets have been found, you can use the frequent itemsets to generate association rules.
 The significance of an association rule is measured by confidence.
 Confidence tells you how many times this rule applies to the frequent itemsets.
Let's understand the concept of the Apriori algorithm with the help of an example: given a small
transaction dataset, we find the frequent itemsets and then generate association rules from them.
The Apriori Algorithm, as demonstrated in the bread-butter example, is widely used in modern
startups like Zomato, Swiggy, and other food delivery platforms. These companies use it to
perform market basket analysis, which helps them identify customer behavior patterns and
optimize recommendations.
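Since the example table is not reproduced here, the following pure-Python sketch illustrates the same idea on a made-up bread/butter transaction list: generate candidate itemsets level by level, scan the transactions, and keep only the sets that meet a minimum support. The dataset and the min_support value are illustrative assumptions.

```python
# A pure-Python Apriori sketch on a made-up bread/butter transaction list.
from itertools import chain

transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "jam"},
]
min_support = 0.4          # an itemset must appear in at least 40% of transactions
n = len(transactions)

def frequent(candidates):
    """Scan the transactions and keep only candidates meeting the minimum support."""
    result = {}
    for c in candidates:
        support = sum(c <= t for t in transactions) / n
        if support >= min_support:
            result[c] = support
    return result

# Level 1: candidate itemsets with a single item
level = frequent({frozenset([item]) for item in chain.from_iterable(transactions)})
all_frequent = dict(level)

# Levels 2..k: combine surviving sets, re-scan, and toss infrequent ones (Apriori principle)
k = 2
while level:
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    level = frequent(candidates)
    all_frequent.update(level)
    k += 1

for itemset, support in sorted(all_frequent.items(), key=lambda kv: -kv[1]):
    print(set(itemset), round(support, 2))
```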
Applications
• Association analysis can be performed on many different items. Some
common examples are items in a store and pages visited on a website.
• Association analysis has also been used to look at the voting history of
elected officials and judges.
Efficiently finding frequent itemsets with FP-growth Algorithm

FP-Growth is an algorithm for discovering itemsets (groups of items) occurring frequently in a
transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup
transactions from the transaction database, where minsup is a parameter given by the user.
Note:

 The Frequent Pattern Growth (FP-Growth) algorithm improves upon the Apriori algorithm by
eliminating the need for multiple database scans and reducing computational overhead.

 By using a Trie data structure and focusing on ordered-item sets, FP-Growth efficiently mines
frequent itemsets, making it a faster and more scalable solution for large datasets.

 This approach ensures faster processing while maintaining accuracy, making FP-Growth a
powerful tool for data mining.
Why is FP-growth faster?
 The FP-growth algorithm is an efficient way of finding frequent patterns in a dataset.
 The FP-growth algorithm works with the Apriori principle but is much faster.
 The Apriori algorithm generates candidate itemsets and then scans the dataset to see if they’re
frequent.
 FP-growth is faster because it goes over the dataset only twice.
 The dataset is stored in a structure called an FP-tree.
 After the FP-tree is built, you can find frequent itemsets by finding conditional pattern bases for an
item and building a conditional FP-tree.
 This process is repeated, conditioning on more items, until the conditional FP-tree has only one item.
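A brief usage sketch, assuming the third-party mlxtend library is installed: its TransactionEncoder and fpgrowth utilities mine frequent itemsets from a small made-up transaction list (the data and the min_support value are illustrative).

```python
# FP-growth via the mlxtend library (assumes: pip install mlxtend).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk", "butter"],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = te.fit_transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# FP-growth scans the data only twice: once to count items, once to build the FP-tree
frequent_itemsets = fpgrowth(df, min_support=0.5, use_colnames=True)
print(frequent_itemsets)
```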
Applications

• The FP-growth algorithm can be used to find frequent words in a series of text documents.
• The microblogging site Twitter provides a number of APIs for developers to use their services.
• The Python module Python-Twitter allows easy access to Twitter.
• Applying the FP-growth algorithm to a Twitter feed on a certain topic can give you some summary
information for that topic.
• There are a number of other uses for frequent itemset generation such as shopping transactions,
medical diagnosis, and study of the atmosphere.
Dimensionality-reduction Techniques
 While working with machine learning models, we often encounter datasets with a large number of
features. These datasets can lead to problems such as increased computation time and overfitting. To
address these issues, we use dimensionality reduction techniques.
 The number of input features, variables, or columns present in a given dataset is known as dimensionality,
and the process to reduce these features is called dimensionality reduction.
 To preprocess the inputs before using any of the ML algorithms
 These techniques reduce noise, improve the performance of machine learning algorithms, and work on
both labeled and unlabeled data.
The Curse of Dimensionality

 Handling the high-dimensional data is very difficult in practice, commonly known as the curse of
dimensionality.
 If the dimensionality of the input dataset increases, any machine learning algorithm and model
becomes more complex.
 As the number of features increases, the number of samples required also increases proportionally, and
the chance of overfitting also increases.
 If the machine learning model is trained on high-dimensional data, it becomes overfitted and results
in poor performance.
 Hence, it is often required to reduce the number of features, which can be done with dimensionality
reduction.
What is dimensionality reduction?

 In machine learning classification problems, there are often too many factors on the basis of which the
final classification is done.
 These factors are basically variables called features. The higher the number of features, the
harder it gets to visualize the training set and then work on it.
 Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality
reduction algorithms come into play. Eg: humidity and rainfall, e-mail spam classification etc
 Dimensionality reduction is the process of reducing the number of random variables under
consideration, by obtaining a set of principal variables.
 It can be divided into feature selection and feature extraction.
 A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2 dimensional space, and a 1-D problem to a simple line.
 The below figure illustrates this concept, where a 3-D feature space is split into two 2-D feature
spaces, and later, if the features are found to be correlated, their number can be reduced even further.

On the left, data points exist in a 3D space (X, Y, Z), but the Z-dimension appears unnecessary since the
data primarily varies along the X and Y axes. The goal of dimensionality reduction is to remove less
important dimensions without losing valuable information.

On the right, after reducing the dimensionality, the data is represented in lower-dimensional spaces. The
top plot (X-Y) maintains the meaningful structure, while the bottom plot (Z-Y) shows that the Z-dimension
contributed little useful information.

This process makes data analysis more efficient, improving computation speed and visualization while
minimizing redundancy.
 There are two components of dimensionality reduction:

 Feature selection: find a subset of the original set of variables, or features, to get a smaller subset
which can be used to model the problem.
 It usually involves three ways:

 Filter
 Wrapper
 Embedded
 Feature extraction: This transforms the data from a high-dimensional space to a space with a smaller
number of dimensions. E.g., PCA and SVD.
Feature Selection
Feature selection chooses the most relevant features from the dataset without altering them. It helps
remove redundant or irrelevant features, improving model efficiency. There are several methods for
feature selection including filter methods, wrapper methods, and embedded methods.

1. Filter methods rank the features based on their relevance to the target variable.
2. Wrapper methods use the model performance as the criteria for selecting features.
3. Embedded methods combine feature selection with the model training process.

Feature Extraction
Feature extraction involves creating new features by combining or transforming the original features.

PCA is a popular technique that projects the original features onto a lower-dimensional space while
preserving as much of the variance as possible.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given below:
 Faster Computation: With fewer features, machine learning algorithms can process data more
quickly. This results in faster model training and testing, which is particularly useful when working
with large datasets.
 Better Visualization: As we saw in the earlier figure, reducing dimensions makes it easier to visualize
data, revealing hidden patterns.
 Prevent Overfitting: With fewer features, models are less likely to memorize the training data and
overfit. This helps the model generalize better to new, unseen data, improving its ability to make
accurate predictions.
 By reducing the dimensions of the features, the space required to store the dataset also gets reduced.
 Less computation and training time is required with reduced feature dimensions.
 Reduced dimensions of features of the dataset help in visualizing the data quickly.
 It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of dimensionality Reduction

 Data Loss & Reduced Accuracy – Some important information may be lost during
dimensionality reduction, potentially affecting model performance.
 Interpretability Challenges – The transformed features (e.g., principal components) may not
have clear meanings, making it harder to understand relationships in the original data.
 Choosing the Right Components – Deciding how many dimensions to keep is difficult, as
keeping too few may lose valuable information, while keeping too many can lead to overfitting.
 Some data may be lost due to dimensionality reduction.
 In the PCA dimensionality reduction technique, the number of principal components to retain is
sometimes not known in advance.
Principal Component Analysis (PCA)
 This method was introduced by Karl Pearson.
 Principal Component Analysis is a statistical process that converts the observations of correlated features
into a set of linearly uncorrelated features with the help of orthogonal transformation.
 These new transformed features are called the Principal Components.
 It works on a condition that while the data in a higher dimensional space is mapped to data in a lower
dimension space, the variance of the data in the lower dimensional space should be maximum.
 It is one of the popular tools that is used for exploratory data analysis , image processing, movie
recommendation system, optimizing the power allocation in various communication channels.
• It involves the following steps:
i. Construct the covariance matrix of the data.
ii. Compute the eigenvectors of this matrix.
iii. Eigenvectors corresponding to the largest eigenvalues
are used to reconstruct a large fraction of variance of
the original data.
• Since we are left with fewer eigenvectors, some information may be lost in the process.
• However, the most important variance is retained by the remaining eigenvectors.
Key Concepts of PCA
 Dimensionality Reduction
PCA reduces the number of features while preserving as much variance (information)
as possible.

 Orthogonal Transformation
PCA transforms the original correlated features into a set of new uncorrelated features
called principal components.

 Variance Maximization

 The first principal component captures the largest variance in the data.
 The second principal component captures the next highest variance orthogonal to the
first.
 This process continues for all components.
Principal Components in PCA

The transformed new features or the output of PCA are the Principal Components.

The number of these PCs is either equal to or less than the number of original features present in the dataset.

Some properties of these principal components are given below:


1. Each principal component must be a linear combination of the original features.
2. These components are orthogonal, i.e., the correlation between any pair of them is zero.
3. The importance of each component decreases when going from 1 to n: the first PC has the most
importance, and the nth PC has the least importance.
In the above figure, we have several points plotted on a 2-D plane. There are two principal components.
PC1 is the primary principal component that explains the maximum variance in the data. PC2 is another
principal component that is orthogonal to PC1
 The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them.

 Because sometimes, variables are highly correlated in such a way that they contain redundant
information. So, in order to identify these correlations, we compute the covariance matrix.
 If an entry of the covariance matrix is positive, it indicates that the corresponding variables are
positively correlated (if X increases, Y also increases, and vice versa).

 If an entry of the covariance matrix is negative, it indicates that the variables are inversely
correlated (if X increases, Y decreases, and vice versa).

 If an entry is zero, the variables are not related to each other.


Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components
 Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the
covariance matrix in order to determine the principal components of the data.

 Eigenvectors and eigenvalues are behind all the magic of principal components: the eigenvectors of the
covariance matrix are the directions of the axes with the most variance (most information), and these
directions are what we call the principal components.

 And eigenvalues are simply the coefficients attached to eigenvectors, which give the amount of
variance carried in each Principal Component.

 By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal
components in order of significance.
Step 4: Create a Feature Vector

 In the previous step, computing the eigenvectors and ordering them by their eigenvalues in
descending order allows us to find the principal components in order of significance.

 In this step, we choose whether to keep all of these components or discard those of lesser
significance (with low eigenvalues), and form with the remaining ones a matrix of vectors that
we call the feature vector.

 So, the feature vector is simply a matrix that has as columns the eigenvectors of the components
that we decide to keep.

 This makes it the first step towards dimensionality reduction, because if we choose to keep
only p eigenvectors (components) out of n, the final data set will have only p dimensions.
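The steps above can be condensed into a short NumPy sketch: center the data, build the covariance matrix, take the eigenvectors with the largest eigenvalues as the feature vector, and project. The random data and the choice of p = 2 components are illustrative assumptions.

```python
# A compact NumPy sketch of the PCA steps described above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 samples, 5 features
X = X @ rng.normal(size=(5, 5))    # mix the features so they become correlated

# Step: center the data and construct the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step: eigen-decomposition; order eigenvectors by descending eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step: keep the top p eigenvectors as the feature vector and project the data
p = 2
feature_vector = eigvecs[:, :p]
X_reduced = X_centered @ feature_vector

print("explained variance ratio:", eigvals[:p] / eigvals.sum())
print("reduced shape:", X_reduced.shape)   # (100, 2)
```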
Advantages of Principal Component Analysis

 Easy to calculate and compute.


 Speeds up machine learning computing processes and algorithms.
 Prevents predictive algorithms from data overfitting issues.
 Increases performance of ML algorithms by eliminating unnecessary correlated variables.
 Principal Component Analysis retains the directions of highest variance and improves visualization.
 Helps reduce noise that cannot be eliminated automatically.

Disadvantages of Principal Component Analysis

 Sometimes, PCA is difficult to interpret. In rare cases, you may find it difficult to identify the
most important features even after computing the principal components.
 You may face some difficulties in calculating the covariances and covariance matrices.
 Sometimes, the computed principal components are more difficult to interpret than the original
set of features.
Singular Value Decomposition
 A matrix decomposition method for reducing a matrix to its constituent parts in order to make
certain subsequent matrix calculations simpler.
 It represents our original data set with a much smaller data set.
 It extracts the relevant features from a collection of noisy data.
 It is a powerful tool used to distill information in a number of applications, from bioinformatics to
finance, including recommendation tasks such as restaurant selection.

 Drawbacks of eigen decomposition:


 The concept of eigenvalues is only applicable to square matrices.
 Even for square matrices, the presence of complex eigenvalues restricts the number of real eigenspaces.
 It is basically a matrix factorization technique, which decomposes any matrix into 3 generic and familiar
matrices.
 The multiplication of a matrix and a vector produces another vector, which is the transformation
applied to that vector by the given matrix in the vector space.
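A short NumPy sketch of this factorization: decompose a matrix into U, the singular values, and V^T, then keep only the largest singular values to form a low-rank approximation. The matrix and the choice k = 2 are illustrative.

```python
# SVD with NumPy: factorize and build a rank-k approximation.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))

# Full decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (dimensionality reduction / compression)
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-k reconstruction error:", np.linalg.norm(A - A_approx))
```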
Advantages of SVD

 Dimensionality reduction: SVD can reduce the number of dimensions in high-dimensional data,
making it easier to visualize and analyze.
 Applications in various fields: SVD has various applications in fields like image processing, natural
language processing, and recommendation systems.
 Interpretability: The singular values and singular vectors obtained from SVD can provide insight into
the structure of the data and the relationship between features.

Disadvantages of SVD

 Limitations in non-linear relationships: SVD assumes that the relationships between features are linear,
which can limit its ability to capture complex non-linear relationships.
 SVD is sensitive to the scale of the features.
Applications of SVD in Image Processing

 Dimensionality Reduction (PCA)


SVD is used in Principal Component Analysis (PCA) to reduce the dimensionality of high-dimensional
data while preserving the most important features. Helps in visualization and noise reduction.

 Latent Semantic Analysis (LSA)


 Used in Natural Language Processing (NLP) to extract relationships between words and documents.
Helps in topic modeling and information retrieval.

 Image Compression
 SVD allows for efficient storage and compression of images by keeping only the most significant
singular values. Reduces storage requirements while maintaining image quality.

 Recommender Systems
 Used in collaborative filtering for building recommendation systems (e.g., movie or product
recommendations). Netflix and Amazon use SVD-based techniques to improve user recommendations.
 Signal Processing

Helps in noise reduction and feature extraction in signals such as speech or EEG. Used in
filtering and denoising audio/video signals.

 Pseudo inverse Calculation


SVD helps in computing the pseudoinverse of a non-square or singular matrix, which is
useful in solving linear least squares problems.

 Solving Linear Systems


Used for solving ill-conditioned or overdetermined linear systems in numerical analysis.

 Anomaly Detection
Identifies outliers in data by analyzing the contribution of singular values.
Used in fraud detection and fault diagnosis.

 Data Clustering
 Helps in discovering hidden patterns in high-dimensional datasets.

 Quantum Computing
SVD is used in quantum algorithms, including quantum principal component analysis.
Feature selection
 The process of selecting the subset of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high accuracy.
 In other words, it is a way of selecting the optimal features from the input dataset.
 To find a subset of the original set of variables, or features, to get a smaller subset which can
be used to model the problem.
Feature or variable ranking
 It is the process of ordering the features by the value of some scoring function, which usually
measures feature-relevance.
 The score S(fi) is computed from the training data, measuring some criteria of feature fi.
 By convention a high score is indicative for a valuable (relevant) feature.
 A simple method for feature selection using variable ranking is to select the k highest ranked
features according to S.
 This is usually not optimal, but often preferable to other, more complicated methods. It is
computationally efficient — only calculation and sorting of n scores.
 Ranking Criteria: Correlation Criteria and Information Theoretic Criteria
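As a hedged example of ranking by a scoring function, the sketch below uses scikit-learn's SelectKBest with the ANOVA F-statistic (a correlation-style criterion) to score each feature against the target and keep the k highest-ranked ones; the dataset and k = 2 are illustrative choices.

```python
# Variable ranking with scikit-learn: score each feature, keep the k best.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score S(f_i) for each feature: here the ANOVA F-statistic against the class label
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("feature scores:", selector.scores_.round(2))
print("selected feature indices:", selector.get_support(indices=True))
```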
Important features of feature ranking method of dimensionality reduction
 Feature ranking is a popular method of dimensionality reduction that involves selecting a subset of
features from a larger set of variables.

 The goal is to identify the most important features that contribute to the prediction or classification
task. Here are some important features of feature ranking methods .

 Ranking Criteria: Feature ranking methods use various criteria to rank the importance of
features, such as correlation, mutual information, entropy, and so on. The ranking criteria
should reflect the relevance of each feature to the problem at hand.
 Selection Method: Once the features are ranked, a selection method is used to choose a subset
of the top-ranked features. The selection method can be based on a threshold value, such as
selecting the top k features, or it can be a more sophisticated method, such as a greedy
algorithm or a genetic algorithm.
 Performance Evaluation: Feature ranking methods should be evaluated based on their ability
to improve the performance of the prediction or classification model. The selected subset of
features should lead to better performance compared to using all the features.
 Robustness: Feature ranking methods should be robust to noise and outliers in the data. They
should also be able to handle missing values and deal with redundant or correlated features.

 Interpretability: Feature ranking methods should provide interpretable results that can be easily
understood by domain experts. The importance of each feature should be explained in terms of its
relevance to the problem at hand.

 Computational Efficiency: Feature ranking methods should be computationally efficient,


especially for large datasets with a large number of features. They should be able to rank and
select features in a reasonable amount of time

 Overall, feature ranking is a useful technique for reducing the dimensionality of high-dimensional
data and improving the performance of machine learning models
Feature Subset Selection

 The Goal of Feature Subset Selection is to find the optimal feature subset.

 For Feature Subset Selection you’d need:

 A measure for assessing the goodness of a feature subset (scoring function)


 A strategy to search the space of possible feature subsets
 Finding a minimal optimal feature set for an arbitrary target concept is hard. It would need
Good Heuristics.

 Feature Subset Selection Methods can be classified into three broad categories
 Filter Methods
 Wrapper Methods
 Embedded Methods
Filter Methods
 In this method, the dataset is filtered, and a subset that contains only the relevant features is taken.
 Filter methods select subsets of variables as a pre-processing step, independently of the classifier used.
 They evaluate each feature independently against the target variable.
 Features with a high correlation with the target variable are selected, since this means the feature
has some relationship with the target and can help us in making predictions.
 These methods are used in the preprocessing phase to remove irrelevant or redundant features based on
statistical tests (correlation) or other criteria.
Key features of Filter Methods for Feature Subset Selection:
 Filter Methods are usually fast
 Filter Methods provide generic selection of features, not tuned by given learner (universal)
 Filter Methods are also often criticized (feature set not optimized for used classifier)
 Filter Methods are sometimes used as a pre-processing step for other methods.
 Some common techniques of filter methods (one is sketched below) are:
Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
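Below is a minimal sketch of a correlation-based filter on made-up data: each feature is scored independently against the target, and features above a chosen correlation threshold are kept. The data, threshold, and variable names are illustrative assumptions.

```python
# A correlation-based filter method on synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
    "f3": rng.normal(size=n),
})
# The target depends strongly on f1, weakly on f2; f3 is irrelevant noise
target = 3 * df["f1"] + 0.5 * df["f2"] + rng.normal(scale=0.5, size=n)

# Filter step: correlation of each feature with the target, independent of any classifier
scores = df.corrwith(target).abs().sort_values(ascending=False)
selected = scores[scores > 0.3].index.tolist()

print(scores.round(3))
print("selected features:", selected)
```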
Wrapper Methods
 The wrapper method has the same goal as the filter method, but it takes a machine learning model
for its evaluation.
 The learner is considered a black box.
 In this method, some features are fed to the ML model, and the performance is evaluated.
 The performance decides whether to add or remove those features in order to increase the accuracy of
the model.
 This method is more accurate than the filter method but more complex to work with.
 Some common techniques of wrapper methods are:
 Forward Selection
 Backward Selection
 Bi-directional Elimination
 Wrapper methods are also referred to as greedy algorithms, since they repeatedly train the learning
algorithm on different feature subsets.

 They use different combinations of features, compute the relationship between each subset and the
target variable, and based on the results decide whether to add or remove features.

 Stopping criteria for selecting the best subset are usually pre-defined by the person training the model
such as when the performance of the model decreases or a specific number of features are achieved.
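As a sketch of the wrapper approach, the example below uses scikit-learn's SequentialFeatureSelector to perform forward selection with a logistic regression learner treated as a black box; the dataset, estimator, and n_features_to_select=2 are illustrative choices (assumes scikit-learn 0.24 or later).

```python
# Forward selection (a wrapper method) with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The learner is a black box: candidate subsets are scored by cross-validated accuracy
model = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(model, n_features_to_select=2, direction="forward", cv=5)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))
```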
Embedded Methods
 Embedded methods are specific to a given learning machine.
 They perform variable selection (implicitly) in the process of training.
 Embedded methods check the different training iterations of the machine learning model and
evaluate the importance of each feature.
 E.g. WINNOW-algorithm (linear unit with multiplicative updates).
 Some common techniques of Embedded methods are:
LASSO
Elastic Net
Ridge Regression, etc.
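One of the techniques listed above, LASSO, is sketched below: the L1 penalty drives the coefficients of irrelevant features to zero during training, so feature selection happens implicitly as part of model fitting. The synthetic data and the alpha value are illustrative assumptions.

```python
# An embedded method: LASSO zeroes out coefficients of irrelevant features while training.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data where only a few features are truly informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)  # features with non-zero coefficients survive
print("coefficients:", lasso.coef_.round(2))
print("selected feature indices:", selected)
```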
