
Feature Selection

Feature Selection 2

• Feature selection is a procedure used in machine learning to find a subset of features that produces a good model for the given dataset while fulfilling the following requirements:
  • Avoiding overfitting
  • Achieving better generalization ability
  • Reducing storage
  • Reducing training time

Why Dimensionality Reduction? 3
• It is easy and convenient to collect data
• Data accumulates at an unprecedented speed
• Data preprocessing is an important part of effective machine learning and data mining
• Most machine learning and data mining techniques may not be effective for high-dimensional data
• Curse of Dimensionality
  • Query accuracy and efficiency degrade rapidly as the dimension increases
• The intrinsic dimension may be small
  • For example, the number of genes responsible for a certain type of disease may be small
• Dimensionality reduction is an effective approach to downsizing data

Why Dimensionality Reduction? 4

• Visualization: projection of high-dimensional data onto 2D or 3D.

• Data compression: efficient storage and retrieval.

• Noise removal: positive effect on query accuracy.

Application of Dimensionality Reduction 5

• Customer relationship management
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
• Face recognition
• Handwritten digit recognition
• Intrusion detection

Document Classification 6
[Figure: a document-term matrix. Documents D1, D2, ..., DM (web pages and emails gathered from digital libraries such as ACM, PubMed, IEEE Xplore, and the Portal, and from the Internet) form the rows; terms T1, T2, ..., TN form the columns; each entry is a term count, and each document carries a category label such as CS, Sports, Travel, Jobs, or Internet.]

■ Task: to classify unlabeled documents into categories
■ Challenge: thousands of terms
■ Solution: apply dimensionality reduction

Other Types of High-Dimensional Data 7

Face images Handwritten digits


Major Techniques of Dimensionality Reduction 8

1. Feature selection
2. Feature Extraction (reduction)

Feature Selection vs. Feature Extraction 9

• Feature selection
  • A process that chooses an optimal subset of features according to an objective function (only a subset of the original features is selected)

• Feature extraction/reduction
  • All original features are used
  • The transformed features are linear combinations of the original features

1. Feature Selection
10
• Feature (or variable) selection refers to the process of selecting the features that are used in predicting the target or output.
• The purpose of Feature Selection is to select the features that contribute the most to output prediction.
• The following line, from the abstract of Guyon and Elisseeff's "An Introduction to Variable and Feature Selection" (Journal of Machine Learning Research, 2003), sums up the purpose of Feature Selection:

  "The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."

• The following benefits of Feature Selection are usually quoted:
  • Reduces overfitting
  • Improves accuracy
  • Reduces training time

Feature Selection models 11
Feature Selection methods are categorized into filter methods, wrapper methods, and intrinsic (embedded) methods.

Feature Selection models 12

a. Filter Method 13

• Filter Methods
  • Filter methods select features based on their statistical scores with respect to the output column.
  • A filter method ranks each feature based on some univariate metric and then selects the highest-ranking features.
  • The selection of features is independent of any machine learning algorithm.
  • Rules of thumb:
    • The more a feature is correlated with the output column (the column to be predicted), the better the performance of the model.
    • Features should be least correlated with each other. If some input features are correlated with other input features, the situation is known as multicollinearity; it is recommended to remove it for better model performance.

a. Filter Method
14
Filter selection keeps independent features with:
• No constant variables
• No (or few) quasi-constant variables
• No duplicate rows
• High correlation with the target variable
• Low correlation with other independent variables
• Higher information gain or mutual information with the target
• mRMR score:
  • Selecting an optimal feature subset from a large feature space is considered a challenging problem.
  • The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework addresses it by selecting relevant features while controlling for the redundancy within the selected features (see the sketch below).
  • mRMR is used in various classification problems, for example in the marketing machine learning platform at Uber, which automates the creation and deployment of targeting and personalization models at scale.

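A minimal greedy mRMR-style sketch (an illustration only, not the implementation referenced above): relevance is scored with mutual information against the target, redundancy with the mean absolute correlation to the already selected features; the dataset and the choice of 5 features are arbitrary.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Relevance: mutual information of each feature with the target.
relevance = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
# Redundancy: absolute correlation between features.
corr = X.corr().abs()

selected, remaining = [], list(X.columns)
for _ in range(5):                          # keep 5 features (arbitrary for the sketch)
    if selected:
        redundancy = corr.loc[remaining, selected].mean(axis=1)
        score = relevance[remaining] - redundancy
    else:
        score = relevance[remaining]
    best = score.idxmax()
    selected.append(best)
    remaining.remove(best)

print(selected)
```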
a. Filter Method 15

a. Filter Method 16

• Removing features with low variance (low variance filter)
  • Variance Threshold is a simple baseline approach to Feature Selection.
  • It removes all features whose variance does not meet some threshold.
  • This involves removing features that have the same value in all (or almost all) of their rows (zero-variance or quasi-constant features).
  • Such features provide no value in building a machine learning predictive model.

Example: rental bikes
Predict the count of bikes that have been rented 17

Example contd. 18

Example: rental bikes
19

❑ Target variable: count of bikes
❑ Total of 6 variables or columns
❑ First drop the ID variable
❑ Apply the low variance filter and try to reduce the dimensionality of the data:
  1. Normalize the data
  2. Compute the variance of each feature
  3. Set a variance threshold (here, threshold >= 0.006)
  4. Select the features whose variance is greater than the set threshold (see the sketch below)

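A minimal sketch of these four steps with scikit-learn's VarianceThreshold; the file name and the column names "ID" and "count" are assumptions for illustration, while the 0.006 threshold follows the slide.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("bike_rentals.csv")                 # hypothetical file
X = df.drop(columns=["ID", "count"])                 # drop the ID and target columns

# Step 1: normalize so the variances are comparable.
X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

# Steps 2-4: compute variances, apply the threshold, keep the surviving features.
selector = VarianceThreshold(threshold=0.006)
selector.fit(X_norm)
print(X.columns[selector.get_support()].tolist())
```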
Example: rental bikes 20

• 4. Select the features whose variance is greater than the set threshold

How to choose a feature selection method? 21

B. Pearson’s correlation coefficient 22

• Pearson Correlation
  • A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related.
  • The Pearson correlation is also known as the "product moment correlation coefficient" (PMCC) or simply "correlation".
  • Pearson correlations are suitable only for metric variables.
  • The correlation coefficient takes values between -1 and 1:
    • A value closer to 0 implies a weaker correlation (exactly 0 implies no correlation)
    • A value closer to 1 implies a stronger positive correlation
    • A value closer to -1 implies a stronger negative correlation

Mpg data set: heat map 23
• The variables cyl and disp are highly correlated with each other (0.902033).
• We therefore compare them against the target variable: mpg is more highly correlated with cyl, so we keep cyl and drop disp.
• We then move to the next variable, and the same process is followed until the last variable.
• We are left with four features: wt, qsec, gear, and carb.
• These are the final features given by the Pearson correlation filter.
• Multicollinearity can also be quantified with the variance inflation factor (VIF), discussed later.

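A minimal sketch of the procedure described above, assuming a DataFrame `df` with a target column named "mpg" (mtcars-style data); the 0.9 cut-off is an illustrative choice.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, target: str = "mpg", threshold: float = 0.9):
    """Of every highly correlated feature pair, keep the one more correlated with the target."""
    corr = df.corr().abs()
    features = [c for c in df.columns if c != target]
    kept = list(features)
    for i, f1 in enumerate(features):
        for f2 in features[i + 1:]:
            if f1 in kept and f2 in kept and corr.loc[f1, f2] > threshold:
                weaker = f1 if corr.loc[f1, target] < corr.loc[f2, target] else f2
                kept.remove(weaker)
    return kept

# kept = drop_correlated(df)   # e.g. may leave wt, qsec, gear, carb
```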
B. Pearson’s correlation coefficient 24

• Example: A researcher in a scientific foundation wished to evaluate the relation between the annual salaries of mathematicians (Y, in thousands of dollars) and an index of work quality (X1), the number of years of experience (X2), and an index of publication success (X3). Find the correlation matrix for the given data set.

  Work quality   Years of          Publication     Annual salary
  index (X1)     experience (X2)   success (X3)    (Y, thousand dollars)
  3.5            9                 6.1             33.2
  5.1            18                7.4             38.7
  6.0            13                5.9             37.5
  3.1            5                 5.8             30.1
  4.5            25                5.0             38.2

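A quick way to obtain the requested correlation matrix with pandas; the short column names X1, X2, X3, Y follow the table above.

```python
import pandas as pd

df = pd.DataFrame({
    "X1": [3.5, 5.1, 6.0, 3.1, 4.5],       # index of work quality
    "X2": [9, 18, 13, 5, 25],              # years of experience
    "X3": [6.1, 7.4, 5.9, 5.8, 5.0],       # index of publication success
    "Y":  [33.2, 38.7, 37.5, 30.1, 38.2],  # annual salary (thousand dollars)
})
print(df.corr(method="pearson").round(3))
```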
Example: cont. 25

C. Mutual Information
(Information Gain) 26

• Information gain is a measure of how much information a feature provides about a class.
• The feature carrying the most information is considered important by the algorithm and is used for training the model.
• The effort is to reduce the entropy and maximize the information gain.
• Information gain helps to determine the order of attributes in the nodes of a decision tree.
• Information gain is used in decision trees and random forests to decide the best split.

C. Mutual Information 27

• A more robust approach is to use Mutual Information, which can be thought of as the reduction in uncertainty about one random variable given knowledge of another.

• Entropy of variable X:
  H(X) = - Σ_x P(x) log2 P(x)

• Entropy of X after observing Y:
  H(X | Y) = - Σ_y P(y) Σ_x P(x | y) log2 P(x | y)

• Information Gain:
  IG(X, Y) = H(X) - H(X | Y)

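A small sketch that turns these formulas into code: a hand-rolled entropy and information-gain computation on a made-up two-column table (for larger data sets, scikit-learn's mutual_info_classif gives a comparable estimate).

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    h_x = entropy(df[target])                                 # H(X)
    h_x_given_y = sum(                                        # H(X | Y)
        (len(g) / len(df)) * entropy(g[target]) for _, g in df.groupby(feature)
    )
    return h_x - h_x_given_y                                  # IG(X, Y)

df = pd.DataFrame({
    "weather": ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy"],
    "play":    ["yes",   "no",    "no",    "no",    "yes",   "no"],
})
print(round(information_gain(df, "weather", "play"), 3))      # ~0.459
```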
Entropy 28
• Entropy is an information-theoretic metric that measures the impurity or uncertainty in a group of observations.
• Entropy captures the uncertainty/randomness in the data: the more the randomness, the higher the entropy. Information gain uses entropy to make decisions.
• The lower the entropy, the higher the information.
• Higher entropy implies greater uncertainty or lack of predictability.

ENTROPY 29

Example#1: 30

Example#2: Illustrative Data Set 31

Sunburn data

Ex: Which attribute has the highest info gain for the data?
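The sunburn table itself appears only as an image in the slides, so the sketch below uses an assumed reconstruction of the classic sunburn data set (hair colour, height, weight, lotion, sunburned) purely to show how the attributes can be ranked by mutual information with scikit-learn.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Assumed version of the sunburn data; the actual table is on the slide image.
df = pd.DataFrame({
    "hair":   ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"],
    "height": ["average", "tall", "short", "short", "average", "tall", "average", "short"],
    "weight": ["light", "average", "average", "average", "heavy", "heavy", "heavy", "light"],
    "lotion": ["no", "yes", "yes", "no", "no", "no", "no", "yes"],
    "burned": ["yes", "no", "no", "yes", "yes", "no", "no", "no"],
})

# Encode the categorical attributes as integer codes and score them against the class.
X = df.drop(columns="burned").apply(lambda col: col.astype("category").cat.codes)
mi = mutual_info_classif(X, df["burned"], discrete_features=True, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```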

Filter Method 32

• Univariate Selection Methods
  • Univariate Feature Selection methods select the best features based on univariate statistical tests. Scikit-learn provides, among others, the following univariate feature selection methods (see the sketch below):
    • SelectKBest
    • SelectPercentile
  • These two are the most commonly used methods.

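A minimal SelectKBest sketch: keep the k features with the highest univariate score (here mutual information; f_classif or chi2 are other common score functions, and SelectPercentile works the same way but keeps a percentage rather than a fixed k). The dataset and k=10 are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)      # (569, 30) -> (569, 10)
```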
Filter Method 33
• Advantages of Filter methods
  • Filter methods are model agnostic (compatible with any model)
  • They rely entirely on the features in the dataset
  • They are computationally very fast
  • They are based on various statistical methods
• Disadvantages of Filter methods
  • The filter method looks at individual features to identify their relative importance. A feature may not be useful on its own but may be an important influencer when combined with other features. Filter methods may miss such features.
  • Keep in mind that the filter method does not remove multicollinearity, so you must deal with multicollinearity among the features before training models on your data.

2. Wrapper Methods (Feature Selection) 34

• Wrapper Methods: in wrapper methods, the problem of Feature Selection is reduced to a search problem.
• A model is built using a set of features and its accuracy is recorded.
• Based on the accuracy, features are added or removed, and the process is repeated.

2. Wrapper Methods (Feature Selection) 35

2. Wrapper Methods (Feature Selection) 36

• We have the following methods in Wrapper Methods:
  • Forward Selection
  • Backward Elimination
  • Exhaustive Search
  • Recursive Feature Elimination

Wrapper Methods 37

• A. Forward Selection
  • Forward Selection is an iterative method.
  • We start with one feature and keep adding features until no improvement in the model is observed.
  • The search is stopped once a pre-set criterion is met.
  • This is a greedy approach: at each step it adds the feature that gives the largest boost to performance.
  • If the number of features is large, it can be computationally expensive.
• B. Backward Elimination
  • This process is the opposite of the Forward Selection method.
  • It starts with all the features and keeps removing features until no improvement is observed.

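Both directions can be sketched with scikit-learn's SequentialFeatureSelector; the estimator, dataset, and number of features below are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Forward selection: start empty, greedily add the feature that helps CV accuracy most.
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward", cv=5)
forward.fit(X, y)
print("forward keeps:", forward.get_support(indices=True))

# Backward elimination: start with all features and greedily remove the least useful.
# backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward", cv=5)
# backward.fit(X, y)   # noticeably slower, since it starts from all 30 features
```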
Feature Selection :
Wrapper Methods 38
• C. Exhaustive Search
  • This Feature Selection method tries all possible combinations of features to select the best model.
  • This method is computationally very expensive.
  • For example, if we have five features, we will be evaluating 2^5 = 32 models before finalizing a model with good accuracy.
• D. Recursive Feature Elimination
  • Recursive Feature Elimination (RFE) involves the following steps:
    • We train the model on the initial set of features, and the importance of each feature is calculated.
    • In the second iteration, a model is built again using the most important features and excluding the least important features.
    • These steps are repeated recursively until we are left with the most important features for the problem under consideration.
  • The scikit-learn library provides functions for Recursive Feature Elimination (see the sketch below).

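A minimal RFE sketch: the estimator is refit repeatedly and the least important features are dropped until the requested number remains (the dataset, estimator, and n_features_to_select=5 are arbitrary).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

rfe = RFE(estimator=RandomForestClassifier(random_state=0), n_features_to_select=5, step=1)
rfe.fit(X, y)
print("selected feature indices:", rfe.get_support(indices=True))
print("ranking:", rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier
```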
Wrapper method:
Forward feature selection 39

Wrapper based model:
forward feature selection 40
• Fitness level prediction
• The first step in Forward Feature Selection is to train n models, one per feature, and check the performance of each.
• If we have three independent variables, we will train three models, using each of these three features individually.

forward feature selection: Example 41

• Let's say we trained the model using the Calories_Burnt feature and the target variable, Fitness_Level, and we got an accuracy of 87%.

forward feature selection: Example cont. 42

• Next, we train the model using the Gender feature, and we get an accuracy of 80%.

Example cont. 43
• Next, we repeat this process and add one variable at a time. We of course keep the Calories_Burnt variable and keep adding one more variable. Taking Gender together with Calories_Burnt, we get an accuracy of 88%.

Example cont 44

• Using Plays_Sport along with Calories_Burnt, we get an accuracy of 91%. The variable that produces the highest improvement is retained.

Feature Selection (Wrapper model):
Backward Feature Elimination 45

• These are our assumptions:
  • No missing values in the dataset
  • The variance of the variables is high
  • Low correlation between the independent variables

Backward Feature Elimination: Example 46

• Fitness level prediction
• The first step is to train the model using all the variables.
• You will of course not use the ID variable to train the model, as ID contains a unique value for each observation.
• So we first train the model using the other three independent variables and, of course, the target variable, which is Fitness_Level.
• We get an accuracy of 92% using all three independent variables.

47

Backward Feature Elimination: Example 48

• Gender produced the smallest change in the model's performance: accuracy was 92% with all the variables and 91.6% after dropping Gender. We can therefore infer that Gender does not have a high impact on the Fitness_Level variable, and hence it can be dropped.
• Finally, we repeat all these steps until no more variables can be dropped.
• It is a very simple, but very effective, technique.

3.Intrinsic Methods (Feature Selection)
49

• Intrinsic or Embedded Methods
  • Embedded methods learn which features contribute the most to the model's performance while the model is being created. You have seen some Feature Selection methods in the previous lessons, and we will discuss several more, such as decision-tree-based methods, in future lessons. Common embedded methods include (see the sketch below):
    • Ridge Regression (L2 regularization)
    • Lasso Regression (L1 regularization)
    • Elastic-Net Regression (uses both L1 and L2 regularization)
    • Decision-tree-based methods (Decision Tree classification, Random Forest classification, XGBoost classification, LightGBM)

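A minimal embedded-method sketch: an L1-regularized (Lasso-style) logistic regression wrapped in SelectFromModel, which keeps only the features whose coefficients stay non-zero; the dataset and C=0.1 are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # L1 penalties are sensitive to feature scale

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)
print("kept feature indices:", selector.get_support(indices=True))
```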
Variance Inflation Factor(VIF)
50
• Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a regression model are highly correlated with each other.
• It is a challenge in regression analysis because it becomes difficult to accurately determine the individual effect of each independent variable on the dependent variable.
• VIF measures the strength of the correlation between the independent variables.
• The VIF score of an independent variable represents how well that variable is explained by the other independent variables.

VIF 51

• VIF Range is 1 to ∞

• VIF = 1, no correlation between the independent variable and the other variables
• VIF between 1 and 5 = variables are moderately correlated
• VIF greater than 5 = variables are highly correlated
• VIF exceeding 10 indicates significant multicollinearity that needs to be corrected
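A minimal sketch of computing VIF scores with statsmodels; the DataFrame `df` of independent variables is assumed to already exist.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(df: pd.DataFrame) -> pd.Series:
    X = sm.add_constant(df)          # VIF should be computed on a design matrix with an intercept
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
        index=df.columns,
        name="VIF",
    )

# print(vif_table(df))   # variables with VIF above ~5-10 are candidates to drop or combine
```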

Example 52

53

• One method to detect multicollinearity is to calculate the variance inflation factor (VIF) for each independent variable.

• To fix multicollinearity, one can remove one of the highly correlated variables, combine them into a single variable, or use a dimensionality reduction technique such as principal component analysis (PCA) to reduce the number of variables while retaining most of the information.

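A brief PCA sketch for the last option above: project the (standardized) variables onto a smaller set of uncorrelated principal components; the dataset and the 95% variance target are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)              # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
```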
