L5 Dimensionality Reduction
Applications
[Workflow diagram: Data → Data pre-processing → ML model development → AI algorithm development → Optimizer → DSS, with applications such as Modelling & Simulation and APC.]
Dimensionality reduction
Why do we need dimensionality reduction?
• Dimensionality reduction is a comprehensive approach to reduce dataset complexity.
• Dimensionality reduction techniques help us overcome the problems associated with high-dimensional data:
• Mitigates the curse of dimensionality, enhancing model performance.
• Reduces overfitting by eliminating irrelevant features.
• Improves computational efficiency, speeding up training.
• Facilitates data visualization for better interpretation.
• Focuses on crucial features for clearer model insights.
Feature extraction:
Given the initial set of $N$ features
$F = \{x_1, x_2, x_3, \dots, x_N\}$,
find a projected/transformed set of $M$ new features, with $M < N$, by optimizing
$F' = \{x'_1, x'_2, x'_3, \dots, x'_M\}$
Dimensionality Reduction Techniques: Overview

Feature Selection: Remove less significant features from the data so that the model is trained only on significant features.
• Filter Method: Correlation Method, Chi-Square Test, ANOVA, Variance Inflation Factor
• Wrapper Method: Step Forward, Step Backward, Recursive Feature Elimination (RFE)
• Embedded Method: Lasso, Ridge, Elastic Net; Random Forest, XGBoost, Decision Tree algorithms

Feature Extraction: A method by which the initial set of raw data is reduced to more manageable groups for processing. These techniques will form a part of our lecture:
• PCA: Principal Component Analysis
• SVD: Singular Value Decomposition
• ICA: Independent Component Analysis
• t-SNE: t-distributed Stochastic Neighbour Embedding
• UMAP: Uniform Manifold Approximation and Projection
• LDA: Linear Discriminant Analysis
• Pearson’s Correlation: It is used as a measure for quantifying the linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation (r) is given as:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$
• LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes
or separates two or more classes (or levels) of a categorical variable.
• It maximizes the separation between multiple classes while minimizing the variance within
each class.
• LDA assumes that the features are normally distributed and that the classes have identical covariance matrices.
• ANOVA: ANOVA stands for Analysis of Variance.
• It is similar to LDA except that it operates on one or more categorical independent features and one continuous dependent feature.
• It provides a statistical test of whether the means of several groups are equal or not.
• Chi-Square: It is a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution. (A code sketch of these filter tests follows below.)
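A minimal sketch of the filter methods above using SciPy and scikit-learn. The synthetic dataset, the choice of keeping 3 features, and the shifting of features to be non-negative for the chi-square test are illustrative assumptions, not part of the slides.

```python
# Filter-method feature scoring: Pearson's r, ANOVA F-test, chi-square
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, chi2

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)

# Pearson's correlation of each feature with the target
for i in range(X.shape[1]):
    r, p = pearsonr(X[:, i], y)
    print(f"feature {i}: r = {r:+.2f} (p = {p:.3f})")

# ANOVA F-test: keep the 3 features with the strongest group-mean differences
X_anova = SelectKBest(score_func=f_classif, k=3).fit_transform(X, y)

# Chi-square needs non-negative (e.g., count-like) features, so shift them
X_pos = X - X.min(axis=0)
X_chi2 = SelectKBest(score_func=chi2, k=3).fit_transform(X_pos, y)

print(X_anova.shape, X_chi2.shape)   # (200, 3) (200, 3)
```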
Feature selection
Wrapper method
• Principle: Select subsets of features that contribute most to the performance of a given model. This involves searching through different combinations and evaluating their performance.
• Select a subset of features and train a model using that subset.
• Based on the performance of the subset, decide to add or remove features from your subset.
• The problem is essentially reduced to a search problem.
• These methods are usually computationally very expensive.
[Flow diagram: select all features → select the initial subset → optimization (objective function, learning algorithm) → analysis of performance.]
Forward selection
• Step 1: Start with an initial feature set and estimate the accuracy.
• Step 2: Add the remaining features one by one and estimate the accuracy, i.e. the classification/regression error.
• Step 3: Select the feature that gives the maximum improvement (using the validation set).
• Step 4: Stop when there is no improvement.

Backward selection
• Step 1: Start with the full feature set and estimate the accuracy.
• Step 2: Remove features one by one and estimate the accuracy, i.e. the classification/regression error.
• Step 3: Drop the feature that gives the minimum degradation in accuracy (using the validation set).
• Step 4: Stop when dropping further features causes a significant degradation.

Recursive Feature Elimination (RFE) follows the same backward idea: repeatedly fit the model and eliminate the weakest feature(s) until the desired number remains. (A scikit-learn sketch follows below.)
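A minimal sketch of the wrapper strategies above with scikit-learn. The estimator, dataset, cross-validation setting, and the choice of keeping 4 features are illustrative assumptions.

```python
# Wrapper methods: step-forward, step-backward, and RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

# Step-forward selection: add features one by one while CV accuracy improves
forward = SequentialFeatureSelector(model, n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)

# Step-backward selection: start from all features and drop them one by one
backward = SequentialFeatureSelector(model, n_features_to_select=4,
                                     direction="backward", cv=5).fit(X, y)

# Recursive Feature Elimination: repeatedly drop the weakest-coefficient feature
rfe = RFE(model, n_features_to_select=4).fit(X, y)

print(forward.get_support(), backward.get_support(), rfe.support_, sep="\n")
```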
Feature Selection
Embedded Method
• Embedded methods combine the qualities of filter and wrapper methods.
• They are implemented by algorithms that have their own built-in feature selection methods.
• Some of the most popular examples of these methods are LASSO and Ridge regression, which have in-built penalization functions to reduce overfitting.
• Lasso regression performs L1 regularization, which adds a penalty equivalent to the absolute value of the magnitude of the coefficients.
• Ridge regression performs L2 regularization, which adds a penalty equivalent to the square of the magnitude of the coefficients.
• Further details and implementation of LASSO and Ridge regression will be covered after the mid-semester. (A short sketch follows below.)
[Flow diagram: select all features → select the initial subset → optimization (objective function, learning algorithm) → analysis of performance.]
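A minimal sketch of the embedded idea above: with L1 regularization weak coefficients are driven to (or very near) zero, effectively deselecting those features, while L2 only shrinks them. The dataset and alpha values are illustrative assumptions.

```python
# Embedded methods: Lasso (L1) vs Ridge (L2) coefficient behaviour
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))  # uninformative ones ~0 -> dropped
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # shrunk but non-zero
```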
Feature Extraction
Principal component analysis (PCA)
▪ Water quality data from different locations (L1–L6): how do we identify the relation between a parameter and a location, or between parameters?

Parameter    L1    L2    L3    L4    L5    L6
pH (-)       2     6     2.4   6.5   7     7.5
TDS (g/l)    10    20    10.1  12    19    18.5
BOD (g/l)    1     0.1   1.2   0.2   0.3   0.15
COD (g/l)    1.0   0.2   1.5   0.1   0.2   0.3
Temp (°C)    23    23    21    24    25    20

▪ In the case of a single parameter, the six locations can be compared along one axis (e.g., pH).
▪ In the case of two parameters, the locations can be plotted on a 2D scatter (TDS vs pH), where L1/L3 and L2/L5/L6 form visible groups.
[Plots: 1D comparison of the locations along the pH axis; 2D scatter of TDS vs pH.]
Feature Extraction
Principal component analysis (PCA)
▪ In the case of three parameters, the locations can still be shown on a 3D scatter plot (pH, TDS, BOD).
• If there are more than three parameters, plotting and grouping may no longer be possible.
• Therefore, a higher-dimensional problem has to be reduced to a lower-dimensional problem.
• PCA helps to reduce the higher-dimensional problem to a lower dimension through principal components (PCs).
[Figures: 3D scatter plot of the six locations (pH, TDS, BOD), a different view of the same 3D scatter plot, and the 3D scatter plot of the centered data; axes annotated with explained-variance percentages (83%, 6.53%).]
Similarly for 3D
Scree Plot
• Lastly, we find PC3, the best-fitting line that goes through the origin and is perpendicular to PC1 & PC2.
[Scree plot: PC1 explains 93.39% of the variance, PC2 6.53%, and PC3 0.08%.]

With $B$ the centered data matrix and $V$ the matrix of eigenvectors:
$B = X - \bar{X}$
$\mathrm{Cov}(X) = B^{T}B$
$PC = BV$

Step-by-Step Process
• Standardize the Data: Ensure each feature contributes equally.
• Compute the covariance matrix of the standardized data, then its eigenvalues and eigenvectors, and keep the top eigenvectors as the feature vector V.
• Transform the original dataset into a lower-dimensional space using this feature vector (a NumPy sketch follows below).
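A minimal sketch of these PCA steps in NumPy, using the water quality table from the earlier slide (rows = locations L1–L6, columns = pH, TDS, BOD, COD, Temp). Scaling by the sample standard deviation and keeping two PCs are my illustrative choices; the slide's $\mathrm{Cov}(X)=B^{T}B$ differs from the line below only by the $1/(n-1)$ factor.

```python
# PCA step by step: standardize, covariance, eigen-decomposition, projection
import numpy as np

X = np.array([[2.0, 10.0, 1.00, 1.0, 23.0],
              [6.0, 20.0, 0.10, 0.2, 23.0],
              [2.4, 10.1, 1.20, 1.5, 21.0],
              [6.5, 12.0, 0.20, 0.1, 24.0],
              [7.0, 19.0, 0.30, 0.2, 25.0],
              [7.5, 18.5, 0.15, 0.3, 20.0]])

# Step 1: standardize (centre, and scale so each feature contributes equally)
B = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # B = X - X_bar (scaled)

# Step 2: covariance matrix of the standardized data
C = B.T @ B / (B.shape[0] - 1)

# Step 3: eigen-decomposition; sort eigenvectors by explained variance
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
print("explained variance ratio:", np.round(eigvals / eigvals.sum(), 3))

# Step 4: project onto the first two PCs: PC = B V
PC = B @ V[:, :2]
print(PC)        # each location is now described by two coordinates
```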
Application of PCA
How to segregate the image content using PCs: results like this can be achieved with PCA.
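A minimal sketch of this kind of image application: keeping only a few principal components and reconstructing the images from them. Scikit-learn's digits dataset and the choice of 10 components are stand-in assumptions, not the example from the slides.

```python
# Image content through a few PCs: compress and approximately reconstruct
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 1797 images, 8x8 = 64 pixels each

pca = PCA(n_components=10).fit(X)              # keep only 10 of 64 components
X_reduced = pca.transform(X)                   # compressed representation
X_restored = pca.inverse_transform(X_reduced)  # approximate reconstruction

print("variance kept:", round(pca.explained_variance_ratio_.sum(), 3))
print("first image, original vs restored:")
print(X[0].reshape(8, 8).round(0))
print(X_restored[0].reshape(8, 8).round(0))
```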
Advantages of LDA
Efficiency: LDA is particularly useful when the datasets are linearly separable.
Simplicity: It is straightforward to implement and understand.
Performance: It often outperforms other linear classifiers, especially when the assumptions of common covariance (homoscedasticity), Gaussian distributions and statistically independent features approximately hold.
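A minimal sketch of LDA used both as a classifier and as a supervised dimensionality reduction step; the iris dataset and the train/test split are illustrative assumptions.

```python
# LDA: classify and project onto discriminant axes
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
print("test accuracy:", lda.score(X_test, y_test))

# 4 features -> 2 discriminant axes that maximize between-class separation
X_2d = lda.transform(X_train)
print(X_2d.shape)       # (n_train_samples, 2)
```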
t-SNE is a powerful tool for visualizing high-dimensional data in two or three dimensions. Unlike PCA, which
preserves global structure, t-SNE focuses on preserving local relationships between points, making it excellent
for identifying clusters or groups in the data.
How t-SNE Works:
• Neighborhoods in High Dimensions: Imagine each data point in high-dimensional space has a bubble of
neighbors around it. This bubble isn't rigid; it's more like a cloud of points that are close to it, where
"closeness" is measured by probability (similar points have a higher probability of being neighbors).
• Bringing it Down to Earth: t-SNE aims to represent these high-dimensional neighborhoods faithfully on a
2D or 3D map. It's like taking a complex constellation of stars and drawing it on paper such that stars close
together in space are also close on the paper.
• Attention to the Local: While PCA reduces dimensions by capturing the most variance across the whole
dataset, t-SNE zeroes in on the local structure. It ensures that if two points are neighbors in the high-
dimensional space, they should also be neighbors in the lower-dimensional space.
• Clusters Emerge: By focusing on these local relationships, t-SNE allows clusters of similar data points to
emerge naturally in the visualization, even if those clusters are shaped by complex, nonlinear relationships
that PCA might miss.
t-SNE
When we need to reduce a non-linear data distribution into lower dimensions, t-SNE becomes a very important dimensionality reduction technique; it is widely used, for example, to visualize the high-dimensional feature representations learned by CNNs. In this technique, we calculate a similarity score of each point with every other point, where the similarity of a target data point to another data point is the conditional probability that it would pick that point as its neighbour. That distribution of similarity scores is used to project the points into a lower dimension. Once projected onto the lower-dimensional space, the points are clustered according to the different groups they belong to.
t-SNE vs. PCA: Key Differences
• Focus: PCA captures the directions of maximum variance, useful for reducing dimensions and sometimes for visualization. t-SNE, however, excels in visualization by preserving local neighbor relationships.
• Linearity vs. Non-linearity: PCA is linear, while t-SNE is non-linear, making t-SNE better for capturing the true intricacies of the data.
• Global vs. Local: PCA looks at the big picture (global structure), while t-SNE focuses on the detailed local patterns, making it easier to identify clusters or groups.

Practical Tips for Using t-SNE
• t-SNE for Visualization: Use t-SNE when your main goal is to visualize high-dimensional data in a way that highlights clusters or groups.
• Starting with PCA: For very large datasets, it's often beneficial to first reduce the dimensionality with PCA (to about 50 dimensions) before applying t-SNE, to make the computation faster and less noisy (see the sketch after this list).
• Parameter Tuning: t-SNE has a few parameters (like perplexity) that can significantly affect the outcome. Experimenting with these can help achieve a more informative visualization.
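A minimal sketch of the tips above: PCA first, then t-SNE for a 2-D map. The digits dataset, the number of PCA components, and the perplexity value are illustrative assumptions.

```python
# PCA -> t-SNE pipeline for visualization
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)            # 64-dimensional inputs

X_pca = PCA(n_components=30).fit_transform(X)  # denoise and speed things up first
X_2d = TSNE(n_components=2, perplexity=30,
            init="pca", random_state=0).fit_transform(X_pca)

print(X_2d.shape)   # (n_samples, 2): scatter-plot this, coloured by y,
                    # and the digit classes appear as separate clusters
```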
UMAP
PCA might not work well for very complex datasets: PCA only supports such visualisations when the first two or three components capture most of the variance in the data. If that is not the case, we may not be able to visualise the data with PCA, and we use UMAP instead. UMAP is very popular because it is relatively faster than other techniques such as t-SNE. It works on a similar idea to t-SNE by calculating similarity scores, but it does not use t-SNE's probability measures; instead, it scales the characteristic similarity curve of each point so that the scores of its neighbours sum to log base 2 of the user-defined number of nearest neighbours (an important UMAP parameter), and clusters together points accordingly. Once the similarity scores are obtained, UMAP moves the points in the lower-dimensional space so that neighbouring points end up in the same cluster.
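A minimal sketch using the umap-learn package (pip install umap-learn); n_neighbors is the "number of nearest neighbours" parameter discussed above, and the dataset and parameter values are illustrative assumptions.

```python
# UMAP: non-linear dimensionality reduction for visualization
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2,
                    random_state=42)
X_2d = reducer.fit_transform(X)

print(X_2d.shape)   # (n_samples, 2); typically much faster than t-SNE
```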
Example runtime comparison on the same data: t-SNE took 7.36 sec, UMAP took 1.23 sec.
Disadvantages of Dimensionality Reduction