L5 Dimensionality Reduction

Industries: Oil and Gas | Petrochemical | Pulp and Paper | Water and Wastewater | Metal Industries

Applications: Data pre-processing, ML model development, AI algorithm development, Optimizer, DSS, Modelling & Simulation, APC

Dr. Senthilmurugan Subbiah, Department of Chemical Engineering, IITG.

Dimensionality reduction
Why do we need dimensionality reduction?
• Dimensionality reduction is a comprehensive approach to reduce dataset complexity.
• Dimensionality reduction techniques help us overcome the problems associated with high-dimensional data:
  • Mitigates the curse of dimensionality, enhancing model performance.
  • Reduces overfitting by eliminating irrelevant features.
  • Improves computational efficiency, speeding up training.
  • Facilitates data visualization for better interpretation.
  • Focuses on crucial features for clearer model insights.

February 17, 2024 | Slide 2


Example: Importance of dimensionality reduction
Soft sensor for distillate concentration estimation
[Figure: distillation column with condenser, reflux drum and reboiler; an analyser on the distillate stream feeds a controller that manipulates the reflux flow towards the concentration setpoint]
• In the distillation column, the top distillate concentration is an important parameter for controlling the product quality.
• It varies with the feed flow, composition, temperature, reflux ratio, etc.
• The composition of the top distillate can be controlled by manipulating the reflux flow.
• The controller uses input from the analyser to control the reflux flow with respect to the setpoint.
• This solution is expensive because the analyser is costly, and continuous measurement (e.g. every 1 s) is not possible.

February 17, 2024 | Slide 3


Soft sensor for distillate concentration estimation

[Figure: distillation column with tray temperature sensors T1–T3, a feed flow meter, and a soft sensor replacing the analyser]
• An alternative solution to replace the expensive analyser is a soft sensor.
• To develop a soft sensor, a model relating the distillate concentration at time t to the other dependent parameters is needed.
• The question is how to identify suitable dependent parameters: tray temperatures, surrounding temperature, feed temperature, flow, pressure, etc.
• $C_{dist} = f(t,\, T_{tray},\, T_E,\, Q_f,\, Q_d,\, Q_b,\, Q_T,\, T_b,\, \ldots)$
• In ML, both the model inputs and the output are called features.
• In general, in the first-principles model approach, the features are chosen based on physics.
• The general perception is that incorporating more features leads to a better model, but this is not true.

February 17, 2024 | Slide 4


Dimensionality reduction

t     T1     T2    T3     TE    Qf (kg/h)   Qd (kg/h)   Ct
5     90     80    70     25    92          63          29
10    88.5   78    52.61  25    89          61          28
15    91     72    49.19  25    88          60          28
20    87.5   68    59.35  25    88.25       60.1        28.15
25    93     75    50.85  25    90          61          29
30    87     61    42.88  25    82          53          29
35    89     62    31     25    76          50.4        25.6
40    94     64    44.12  25    80          57          23

• How to identify the best relation between Ct and the other variables?
  • By removing irrelevant features.
    • Example: the environmental temperature (TE) may not directly affect the distillate concentration, so including this variable may mislead the model prediction.
  • By removing redundant features.
    • Example: including the feed, distillate and bottom flows (Qf, Qd, Qb) leads to redundant features; knowing two of them, the third can be estimated.
• To overcome these challenges, we need to identify the key features of the data, together with knowing the physics behind the data.
• Two approaches:
  • feature selection
  • feature extraction

February 17, 2024 | Slide 5


Dimensionality reduction / Feature Engineering
Feature selection Vs Feature Extraction
Dimensionality reduction can be achieved
through two main strategies:
Feature extraction: This involves
creating new features from the existing
ones (e.g., PCA transforms the original
variables into a new set of uncorrelated
variables representing the same
information).
Feature selection: This strategy involves
selecting a subset of the most relevant
features from the original dataset without
transforming them.
February 17, 2024 | Slide 6
Feature selection Vs Feature Extraction

Feature selection:
Given the initial set of features
$F = \{x_1, x_2, x_3, \ldots, x_N\}$,
find a subset of M features within F,
$F' \subset F, \quad F' = \{x_1, x_2, x_3, \ldots, x_M\}$,
by optimizing for maximization of accuracy and minimization of complexity.

Feature extraction:
Given the initial set of features
$F = \{x_1, x_2, x_3, \ldots, x_N\}$,
find a projected/transformed set of M new features, with M less than the original N,
$F' = \{x'_1, x'_2, x'_3, \ldots, x'_M\}$,
by optimizing the same objectives.
February 17, 2024 | Slide 7
Dimensionality Reduction Techniques : Overview

Feature Selection: remove less significant features from the data so that the model is trained only on significant features.
• Filter Method: Correlation Method, Chi-Square Test, ANOVA, Variance Inflation Factor
• Wrapper Method: Step Forward, Step Backward, Recursive Feature Elimination (RFE)
• Embedded Method: Lasso, Ridge, Elastic Net; Random Forest, XGBoost, Decision Tree algorithms

Feature Extraction: a method by which the initial set of raw data is reduced to more manageable groups for processing. These techniques will form a part of our lecture:
• PCA: Principal Component Analysis
• SVD: Singular Value Decomposition
• ICA: Independent Component Analysis
• t-SNE: t-distributed Stochastic Neighbour Embedding
• UMAP: Uniform Manifold Approximation and Projection
• LDA: Linear Discriminant Analysis

February 17, 2024 | Slide 8


Identification of optimized subset
Feature selection - Filter method
• Number of possible subsets = $2^N$
• Identifying the best subset from the $2^N$ candidates by evaluating the performance of each individual subset may not be viable.
• Methods to find the best subset: optimization techniques, heuristics, randomized / grid search.
• Filter method: select features based on their statistical scores in relation to the target variable. These methods are usually fast and independent of any machine learning algorithm.
• Use filter methods for a quick and rough feature selection before applying more complex methods.
[Workflow: select all features → select the best subset based on the score → learning algorithm → analysis of performance]

Statistical scores for the filter method:

Input feature \ Output             Continuous / regression output   Categorical / classification output
Continuous / regression input      Pearson's coefficient            LDA
Categorical / classification input ANOVA                            ANOVA / Chi-square

February 17, 2024 | Slide 9


Defining statistical scores

• Pearson's Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson's correlation (r) is given as:
$r = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$
• LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
  • It maximizes the separation between multiple classes while minimizing the variance within each class.
  • LDA assumes that the features are normally distributed and that the classes have identical covariance matrices.
• ANOVA: ANOVA stands for Analysis of Variance.
  • It is similar to LDA except that it is operated using one or more categorical independent features and one continuous dependent feature.
  • It provides a statistical test of whether the means of several groups are equal or not.
• Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution.
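
As a rough sketch of how these filter scores might be computed in practice (assuming scikit-learn is available; the iris data and the synthetic continuous target below are stand-ins, not from the lecture):

```python
# Filter-method scoring sketch: Pearson's r for a continuous target,
# ANOVA F-test and chi-square for a categorical target.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)                    # 4 continuous features, 3 classes

# Pearson's r of each feature against a synthetic continuous target
rng = np.random.default_rng(0)
target = 2.0 * X[:, 2] + rng.normal(0.0, 0.2, len(X))
r = [np.corrcoef(X[:, j], target)[0, 1] for j in range(X.shape[1])]
print("Pearson r per feature:", np.round(r, 2))

# ANOVA F-test: continuous input features vs. categorical output
anova = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("ANOVA F-scores:", np.round(anova.scores_, 1))

# Chi-square test: requires non-negative (count-like) features
chi = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", np.round(chi.scores_, 1))

X_reduced = anova.transform(X)                       # keep the 2 highest-scoring features
print("Reduced shape:", X_reduced.shape)             # (150, 2)
```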
February 17, 2024 | Slide 10
Feature selection
Wrapper method
• Principle: select the subset of features that contributes most to the performance of a given model. This involves searching through different combinations of features and evaluating their performance.
• Select a subset of features and train a model using that subset.
• Based on the performance of the subset, decide whether to add or remove features from the subset.
• The problem is essentially reduced to a search problem.
• These methods are usually computationally very expensive.
[Workflow: select all features → select the initial subset → optimization (objective function) → learning algorithm → analysis of performance]

February 17, 2024 | Slide 11


Feature selection
Wrapper method - Heuristic Approach

Forward selection
• Step 1: Start with the first feature and estimate the accuracy.
• Step 2: Add the remaining features one by one and estimate the accuracy, i.e. the classification/regression error.
• Step 3: Select the feature that gives the maximum improvement (using the validation set).
• Step 4: Stop when there is no further improvement.

Backward selection
• Step 1: Start with the full feature set and estimate the accuracy.
• Step 2: Remove features one by one and estimate the accuracy, i.e. the classification/regression error.
• Step 3: Drop the feature that gives the minimum degradation in accuracy (use the validation set).
• Step 4: Stop when dropping any further feature causes a significant degradation in accuracy.

February 17, 2024 | Slide 12


Recursive Feature Elimination (RFE)

Principle: RFE is a more systematic approach compared to the first two methods. It involves recursively removing features, building a model, and evaluating it, using the model itself to estimate feature importance.

Process:
• Train a model using all features.
• Rank the features based on their importance (e.g., coefficients in regression).
• Remove the least important features.
• Rebuild the model with the remaining features and repeat the process until the desired number of features is reached.
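
A brief RFE sketch with scikit-learn (synthetic regression data and a linear model whose coefficients provide the importance ranking; all names are illustrative):

```python
# Recursive Feature Elimination sketch: rank features by model importance
# (here, the coefficients of a linear regression) and prune the weakest.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=0.1, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=4, step=1).fit(X, y)
print("Selected features:", rfe.get_support(indices=True))
print("Feature ranking  :", rfe.ranking_)     # 1 = kept; larger = eliminated earlier
X_reduced = rfe.transform(X)                  # shape (200, 4)
```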

February 17, 2024 | Slide 13


Forward selection, Backward selection, Recursive Feature Elimination (RFE)

[Figure: side-by-side flowcharts of forward selection, backward selection, and RFE — not reproduced]
February 17, 2024 | Slide 14
Feature Selection
Embedded Method
• Embedded methods combine the qualities of filter and wrapper methods.
• They are implemented by algorithms that have their own built-in feature selection methods.
• Some of the most popular examples of these methods are LASSO and RIDGE regression, which have built-in penalization functions to reduce overfitting.
• Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
• Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients.
• Further details and the implementation of LASSO and RIDGE regression will be covered after the mid-semester.
[Workflow: select all features → select the initial subset → optimization (objective function) → learning algorithm → analysis of performance]
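
A small sketch of the embedded idea (assuming scikit-learn, with synthetic data containing only 3 informative features): the L1 penalty of LASSO zeroes out the coefficients of uninformative features, so feature selection happens during model training.

```python
# Embedded selection sketch: LASSO (L1) drives coefficients of uninformative
# features to exactly zero; Ridge (L2) only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=1)
X = StandardScaler().fit_transform(X)          # penalties assume comparable scales

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("LASSO coefficients:", np.round(lasso.coef_, 2))   # many exact zeros
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # shrunk, rarely zero
print("Features kept by LASSO:", np.flatnonzero(lasso.coef_))
```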

February 17, 2024 | Slide 15


Feature Extraction
Principal component analysis (PCA)

▪ Water quality data from different locations.
▪ How to identify the relation between parameters and locations, or between parameters?

Parameter \ Source   L1     L2     L3     L4     L5     L6
pH (-)               2      6      2.4    6.5    7      7.5
TDS (g/l)            10     20     10.1   12     19     18.5
BOD (g/l)            1      0.1    1.2    0.2    0.3    0.15
COD (g/l)            1.0    0.2    1.5    0.1    0.2    0.3
Temp (C)             23     23     21     24     25     20

▪ In the case of a single parameter, the locations can be placed on a 1-D number line (e.g., by pH: L1, L3, L2, L4, L5, L6).
▪ In the case of two parameters, a 2-D scatter plot of TDS vs pH groups the locations (L1 and L3 at low pH / low TDS; L2, L5 and L6 at high pH / high TDS; L4 in between).
February 17, 2024 | Slide 16
Feature Extraction
Principal component analysis (PCA)
▪ In the case of three parameters, the locations can be shown in a 3-D scatter plot of pH, TDS and BOD (L1–L6).
• If there are more than three parameters, plotting and grouping may not be possible.
• Therefore, a higher-dimensional problem has to be reduced to a lower-dimensional problem.
• PCA helps to reduce the higher-dimensional problem to a lower dimension through principal components (PCs).

February 17, 2024 | Slide 18


How to apply PCA

[Slides 20–26: graphical walk-through of how PC1 and PC2 are fitted for the two-parameter (pH vs TDS) example, using the projection distances d1, d2, … of the points onto the candidate line — figures not reproduced]
Variation of PC

▪ The variation along each PC is the sum of squared projection distances divided by (n − 1):
$\dfrac{SS(\text{distances for PC1})}{n-1} = \text{Variation for PC1}, \qquad \dfrac{SS(\text{distances for PC2})}{n-1} = \text{Variation for PC2}$

▪ For the problem at hand:
Variation for PC1 = 1.81, Variation for PC2 = 26.23
⇒ Total variation around the PCs = 1.81 + 26.23 = 28.04
∴ PC1 accounts for 83% of the total variation around the PCs.

February 17, 2024 | Slide 27


Scree Plot
▪ A Scree Plot is a graphical representation of the percentages of variation that each PC accounts for.
[Scree plot for the two-parameter example: PC1 = 83%, PC2 = 17%]

February 17, 2024 | Slide 28


Similarly for 3D
▪ Taking the first three parameters from the water-quality table:

Parameter \ Source   L1    L2    L3    L4    L5    L6
pH (-)               2     6     2.4   6.5   7     7.5
TDS (g/l)            10    20    10.1  12    19    18.5
BOD (g/l)            1     0.1   1.2   0.2   0.3   0.15

[3-D scatter plot of pH, TDS and BOD for locations L1–L6 — not reproduced]

February 17, 2024 | Slide 30


Similarly for 3D

[Figures: a different view of the same 3-D scatter plot, and the 3-D scatter plot of the centered data — not reproduced]
February 17, 2024 | Slide 31
Similarly for 3D
Scree Plot
▪ Lastly, we find PC3, the best-fitting line that goes through the origin and is perpendicular to PC1 & PC2.
[Scree plot: PC1 = 93.39%, PC2 = 6.53%, PC3 = 0.08%]
▪ PC1 & PC2 account for the vast majority of the variation.

February 17, 2024 | Slide 32


Math behind the PCA

Step 1: Arrange the features and records in matrix form. X is an N × M matrix with N records and M features, with element $x_{ji}$ for feature j and record i:
$X = \begin{bmatrix} x_{11} & x_{21} & \cdots & x_{M1} \\ x_{12} & x_{22} & \cdots & x_{M2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1N} & x_{2N} & \cdots & x_{MN} \end{bmatrix}$

Step 2: Compute the mean of each feature:
$\bar{X}_j = \dfrac{1}{N} \sum_{i=1}^{N} x_{ji}, \quad j = 1{:}M$

Step 3: Centre the data with respect to the mean $\bar{X}$:
$B = X - \bar{X}$

Step 4: Calculate the eigenvectors that represent the directions of maximum variance, i.e. compute the covariance matrix of the data, $Cov(X) = B^T B$, and select the M largest eigenvectors V. The principal components are
$PC = B\,V$

[Scatter plots of the centred pH–TDS data with the eigenvector V overlaid — not reproduced]
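
A minimal NumPy sketch of these steps, using the pH and TDS columns of the water-quality table as X (purely illustrative):

```python
# Math-behind-PCA sketch: centre the data, form B^T B, take its eigenvectors,
# and project the centred data onto them (PC = B V).
import numpy as np

# pH and TDS for locations L1..L6 (from the water-quality table)
X = np.array([[2.0, 10.0], [6.0, 20.0], [2.4, 10.1],
              [6.5, 12.0], [7.0, 19.0], [7.5, 18.5]])

B = X - X.mean(axis=0)                 # Step 3: centre each feature on its mean
C = B.T @ B                            # Step 4: (unscaled) covariance matrix
eigvals, V = np.linalg.eigh(C)         # eigenvectors of the symmetric matrix C
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
eigvals, V = eigvals[order], V[:, order]

PC = B @ V                             # scores: data expressed along PC1, PC2
explained = eigvals / eigvals.sum()
print("Explained variation per PC:", np.round(explained, 3))
```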

February 17, 2024 | Slide 33


Applying PCA for Dimensionality Reduction

Step-by-Step Process
• Standardize the Data: Ensure each feature contributes equally.

• Compute the Covariance Matrix: Understand how variables relate.

• Eigen Decomposition: Find eigenvectors (directions of maximum variance) and


eigenvalues (magnitude) from the covariance matrix.
• Select Principal Components: Choose components that capture most variance
(often visualized through a scree plot).
Constructing Reduced Dataset
• Form a feature vector from the selected eigenvectors.

• Transform the original dataset into a lower-dimensional space using this feature
vector.
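
A compact sketch of the same workflow with scikit-learn's PCA (the synthetic 5-feature dataset below is a stand-in, and the 95% variance threshold is an illustrative choice):

```python
# Applying-PCA sketch: standardize, fit, inspect the explained variance (scree),
# then keep the components that capture most of the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # stand-in for a 5-feature dataset
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=100)    # add redundant, correlated features
X[:, 4] = X[:, 1] - X[:, 0]

X_std = StandardScaler().fit_transform(X)          # 1. standardize
pca = PCA().fit(X_std)                             # 2-3. covariance + eigen decomposition
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))

n_keep = np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1
X_reduced = PCA(n_components=n_keep).fit_transform(X_std)   # 4. reduced dataset
print(f"Kept {n_keep} components, reduced shape: {X_reduced.shape}")
```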
February 17, 2024 | Slide 34
Application of PCA
How to segregate image content using PCs — things like this can be achieved with PCA.

Github Link : PCA IRIS DATA VISUALISATION

February 17, 2024 | Slide 35


Singular Value Decomposition (SVD)
How it works
• Given a matrix A of dimensions m x n, SVD decomposes A into three matrices:
$A = U\Sigma V^T$
• U is an m x m orthogonal matrix, where the columns are the left singular vectors of
A
• Σ is an m x n diagonal matrix (with additional zeros if m ≠ n), where the diagonal
elements are the singular values of A, sorted in descending order. These values
are non-negative and are a measure of the "importance" or "strength" of the
corresponding vectors in U and 𝑉 𝑇 .
• 𝑉 𝑇 is the transpose of an n x n orthogonal matrix, where the columns are the right
singular vectors of A.
• Essential for revealing the underlying structure of data.

February 17, 2024 | Slide 36


Implementing SVD for Dimensionality Reduction

Data Preparation
• Normalize or standardize data for equal feature contribution.
Performing SVD
• Decompose the data matrix A to get $U$, $\Sigma$ and $V^T$.
Selecting Top Singular Values
• Choose the k largest singular values for dimension k.
• This retains the most significant data variance.
Constructing Reduced Data
• $A_k = U_k \Sigma_k V_k^T$, where k reflects the reduced dimensions.

Advantages of using SVD over PCA
• Numerically stable; avoids forming the covariance matrix.
• Works with non-centered data.
• Efficient for large, sparse matrices.
• More robust to outliers.
• Applicable directly to any m × n data matrix.
• Versatile beyond dimension reduction.
• Facilitates missing-data reconstruction.
• Informed rank selection for dimensionality reduction.
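
A short NumPy sketch of these steps (the random low-rank matrix A is a stand-in for a real data matrix):

```python
# Truncated-SVD sketch: decompose A, keep the k largest singular values,
# and reconstruct a rank-k approximation A_k = U_k Σ_k V_k^T.
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(100, 20)) @ rng.normal(size=(20, 50))   # m x n matrix of rank 20

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted in descending order
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k reconstruction
scores = A @ Vt[:k, :].T                           # k-dimensional representation of rows

energy = (s[:k] ** 2).sum() / (s ** 2).sum()
print(f"Rank-{k} approximation keeps {energy:.1%} of the squared singular values")
print("Reduced data shape:", scores.shape)         # (100, 5)
```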
February 17, 2024 | Slide 37
Singular Value Decomposition (SVD)

Singular Value Decomposition follows an approach similar to PCA, in which we find the eigenvectors of the feature space. This example shows how, by including more and more singular vectors (r), we recover an increasingly clear image. Even at r = 10 the image is nearly interpretable, despite having far fewer features than the original image; it retains only the characteristic features. This is the power of SVD. Github Link
February 17, 2024 | Slide 38
Improved version of PCA
Linear Discriminant Analysis (LDA) for improved classification
• Instead of only maximizing the spread (variance) of the projected data points,
• adding the following objectives improves the separation of classes/groups:
  • Maximize the distance between the centroids of different classes/groups.
  • Minimize the within-class/group distance.

February 17, 2024 | Slide 39


Linear Discriminant Analysis (LDA)

Maximise the combined objective
$J = \dfrac{(m_1 - m_2)^2}{s_1^2 + s_2^2}$
• $m_1$ = mean or centroid of class 1
• $m_2$ = mean or centroid of class 2
• $s_1$ = standard deviation (scatter) of class 1
• $s_2$ = standard deviation (scatter) of class 2

Advantages of LDA
Efficiency: LDA is particularly useful when the datasets are linearly separable.
Simplicity: It's straightforward to implement and understand.
Performance: It often outperforms other linear classifiers, especially when the assumptions of common covariance (homoscedasticity), Gaussian distributions and statistically independent features approximately hold.
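
A minimal scikit-learn sketch of LDA used both as a supervised projection and as a classifier (the wine dataset is an illustrative stand-in):

```python
# LDA sketch: project labelled data onto directions that maximize between-class
# separation while minimizing within-class scatter, then classify with the same model.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)                  # 13 features, 3 classes

lda = LinearDiscriminantAnalysis(n_components=2)   # at most (n_classes - 1) components
X_lda = lda.fit_transform(X, y)                    # supervised projection to 2-D
print("Projected shape:", X_lda.shape)             # (178, 2)

# LDA can also be used directly as a classifier
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
print(f"5-fold CV accuracy: {acc:.3f}")
```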

February 17, 2024 | Slide 40


Independent Component Analysis (ICA)
Unlike PCA, which finds orthogonal directions that maximize variance, ICA seeks to decompose the data into statistically independent components.
There are two key assumptions in ICA:
• The components we are trying to uncover must be statistically independent.
• They should be non-Gaussian in character.
• ICA assumes the observed variables are linear combinations of mutually independent source variables.

Consider the signals measured by the microphones at a cocktail party. The purple microphone records some linear combination of the red and blue speakers, who are mutually independent of each other. ICA takes the observed output of the microphone and traces it back to those red and blue speakers, which are mutually independent. This is a very important application for deconstructing an observed variable into mutually independent variables.

February 17, 2024 | Slide 41


Independent Component Analysis (ICA)

February 17, 2024 | Slide 42 Github Link


Steps to Apply ICA for Dimensionality Reduction

Standardize the Data:


• As with most data preprocessing for dimensionality reduction techniques, start by standardizing your data so that
each feature has zero mean and unit variance.
Choose the Number of Components:
• Decide how many independent components you want to extract. This is analogous to selecting the number of
principal components in PCA and will be the dimensionality of your reduced data.
Whitening:
• Before applying ICA, it's common to whiten the data. Whitening is a transformation that linearly transforms the data
so that the resulting signals are uncorrelated and their variances equal unity.
Apply an ICA Algorithm:
• Apply an ICA algorithm to the whitened data to estimate the independent components. Each independent
component will be a linear combination of the original features.
• FastICA is a popular algorithm for performing ICA, because it's efficient and typically converges quickly.
Form the Reduced Data Set:
• Once the independent components are found, you can form a reduced dataset by selecting a subset of the most
relevant components, according to some criterion relevant to your application; Domain Knowledge, Statistical Tests
and Performance on a Task
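
A small sketch of these steps with scikit-learn's FastICA on a toy two-microphone mixture (the sources, mixing matrix and signal shapes are invented for illustration):

```python
# ICA sketch (cocktail-party style): recover statistically independent sources
# from their observed linear mixtures. FastICA whitens the data internally.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                                # source 1 (one speaker)
s2 = np.sign(np.sin(3 * t))                       # source 2 (another speaker)
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

A = np.array([[1.0, 0.5],                         # unknown mixing matrix
              [0.5, 2.0]])
X = S @ A.T                                       # the observed microphone signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                      # estimated independent components
print("Observed shape:", X.shape, "-> recovered sources:", S_est.shape)
```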

February 17, 2024 | Slide 43


t-Distributed Stochastic Neighbor Embedding (t-SNE )

t-SNE is a powerful tool for visualizing high-dimensional data in two or three dimensions. Unlike PCA, which
preserves global structure, t-SNE focuses on preserving local relationships between points, making it excellent
for identifying clusters or groups in the data.
How t-SNE Works:
• Neighborhoods in High Dimensions: Imagine each data point in high-dimensional space has a bubble of
neighbors around it. This bubble isn't rigid; it's more like a cloud of points that are close to it, where
"closeness" is measured by probability (similar points have a higher probability of being neighbors).
• Bringing it Down to Earth: t-SNE aims to represent these high-dimensional neighborhoods faithfully on a
2D or 3D map. It's like taking a complex constellation of stars and drawing it on paper such that stars close
together in space are also close on the paper.
• Attention to the Local: While PCA reduces dimensions by capturing the most variance across the whole
dataset, t-SNE zeroes in on the local structure. It ensures that if two points are neighbors in the high-
dimensional space, they should also be neighbors in the lower-dimensional space.
• Clusters Emerge: By focusing on these local relationships, t-SNE allows clusters of similar data points to
emerge naturally in the visualization, even if those clusters are shaped by complex, nonlinear relationships
that PCA might miss.

February 17, 2024 | Slide 44


t-Distributed Stochastic Neighbor Embedding (t-SNE )

t-SNE

Link

When we need to reduce a non-linear data distribution into lower dimensions, t-SNE becomes a very important dimensionality reduction technique. It is commonly used to visualize the high-dimensional feature representations learned by CNN networks. In this technique, we calculate a similarity score for each point with every other point (the similarity of a target data point to another data point is the conditional probability that it would pick that point as its neighbour). That similarity-score distribution helps project the points into a lower dimension. Once projected onto the lower-dimensional scale, they are clustered according to the different groups they belong to.

February 17, 2024 | Slide 45


t-SNE vs. PCA

t-SNE vs. PCA: Key Differences
• Focus: PCA captures the directions of maximum variance, useful for reducing dimensions and sometimes for visualization. t-SNE, however, excels in visualization by preserving local neighbor relationships.
• Linearity vs. Non-linearity: PCA is linear, while t-SNE is non-linear, making t-SNE better for capturing the true intricacies of data.
• Global vs. Local: PCA looks at the big picture (global structure), while t-SNE focuses on the detailed local patterns, making it easier to identify clusters or groups.

Practical Tips for Using t-SNE
• t-SNE for Visualization: Use t-SNE when your main goal is to visualize high-dimensional data in a way that highlights clusters or groups.
• Starting with PCA: For very large datasets, it's often beneficial to first reduce the dimensionality with PCA (to about 50 dimensions) before applying t-SNE, to make the computation faster and less noisy (see the sketch after this list).
• Parameter Tuning: t-SNE has a few parameters (like perplexity) that can significantly affect the outcome. Experimenting with these can help achieve a more informative visualization.
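
A short sketch of the PCA-then-t-SNE recipe mentioned above (scikit-learn's digits dataset and the parameter values are illustrative choices):

```python
# Visualization sketch: pre-compress with PCA, then embed with t-SNE.
# Perplexity is the main knob worth tuning.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)               # 64-dimensional digit images

X_pca = PCA(n_components=50, random_state=0).fit_transform(X)    # pre-reduction
X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X_pca)

print("Embedded shape:", X_2d.shape)              # (1797, 2): scatter-plot, colour by y
```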

February 17, 2024 | Slide 46


Uniform Manifold Approximation and Projection (UMAP)

UMAP

PCA might not work well with very complex datasets: it only produces useful visualisations when the first two or three components capture most of the variance in the data. If that is not the case, we may not be able to visualise the data with PCA, and we use UMAP instead. UMAP is very popular because it is relatively faster than other techniques like t-SNE. It works on an idea similar to t-SNE, by calculating similarity scores, but it does not use probability measures as t-SNE does; instead it works with the log base 2 of the number of nearest neighbours one defines (an important UMAP parameter) to shape its characteristic similarity curves. It clusters together points whose similarity scores sum up to log base 2 of the number of nearest neighbours, scaling the curve for every point so that they all sum to the same value. Once the similarity scores are obtained, it moves the points accordingly in the lower dimension towards their clusters.
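
A minimal sketch with the third-party umap-learn package (pip install umap-learn); the digits dataset and parameter values are illustrative, and n_neighbors is the key parameter discussed above:

```python
# UMAP sketch: embed a 64-dimensional dataset into 2-D.
# Lower n_neighbors favours local detail; higher values preserve more global structure.
from sklearn.datasets import load_digits
import umap

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_2d = reducer.fit_transform(X)
print("Embedded shape:", X_2d.shape)     # (1797, 2): colour by y to see the clusters
```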

February 17, 2024 | Slide 47


Uniform Manifold Approximation and Projection (UMAP)

t-SNE : Time taken : 7.36 sec UMAP: Time taken : 1.23 sec

Github Link
February 17, 2024 | Slide 48
Disadvantages of Dimensionality Reduction

1. Interpretability: The new features formed by dimensionality reduction reduce the interpretability of the data, because they are not directly related to the original features. For example, we do not know whether Principal Component 1 corresponds to the slate value of the crude. Most of the time in the chemical industry, people are more concerned with the interpretability of the models than with the model itself.
2. Loss of Information: While doing dimensionality reduction, we lose some of the information, which can possibly affect the performance of subsequent training algorithms.
3. Computationally Intensive: Some dimensionality reduction algorithms are quite computationally intensive, e.g. t-SNE.
4. Limitations: Dimensionality reduction techniques have their limitations. For example, PCA cannot work on very complex datasets, t-SNE is a time-consuming algorithm, and each of these algorithms has its own computational-space requirements.
February 17, 2024 | Slide 49
