Feature Selection
• Feature selection is a procedure used in machine learning to find a subset of
  features that produces a good model for the given dataset while:
• Avoiding overfitting
• Achieving better generalization ability
• Reducing storage
• Reducing training time
Why Dimensionality Reduction?
• It is easy and convenient to collect data
• Data accumulates at an unprecedented speed
• Data preprocessing is an important part of effective machine learning and data
  mining
• Most machine learning and data mining techniques may not be effective for
  high-dimensional data
  • Curse of Dimensionality
  • Query accuracy and efficiency degrade rapidly as the dimension increases
• The intrinsic dimension may be small
  • For example, the number of genes responsible for a certain type of disease may be small
• Dimensionality reduction is an effective approach to downsizing data
Why Dimensionality Reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.
• Data compression: efficient storage and retrieval.
• Noise removal: positive effect on query accuracy.
Applications of Dimensionality Reduction
• Customer relationship management
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
• Face recognition
• Handwritten digit recognition
• Intrusion detection
Document Classification

[Figure: documents collected from the Internet and from digital libraries such as ACM Portal, IEEE Xplore, and PubMed (web pages, emails, articles) are represented as a term-document matrix, with class labels such as Sports, Travel, and Jobs:]

            T1   T2   ……   TN
   D1       12    0   ……    6
   D2        3   10   ……   28
   …
   DM        0   11   ……   16

■ Task: to classify unlabeled documents into categories
■ Challenge: thousands of terms
■ Solution: apply dimensionality reduction
Other Types of High-Dimensional Data

[Figure: face images and handwritten digits.]
Major Techniques of Dimensionality Reduction
1. Feature selection
2. Feature extraction (reduction)
Feature Selection vs. Feature Extraction
• Feature selection
  • A process that chooses an optimal subset of features according to an objective
    function (only a subset of the original features is selected)
• Feature extraction/reduction
  • All original features are used
  • The transformed features are linear combinations of the original features
1. Feature Selection
• Feature or Variable Selection refers to the process of selecting the features that are used in
  predicting the target or output.
• The purpose of Feature Selection is to select the features that contribute the most to output
  prediction.
• The following line, from the abstract of a journal article on variable selection, sums up the
  purpose of Feature Selection:
The objective of variable selection is three-fold: improving the prediction performance of
the predictors, providing faster and more cost-effective predictors, and providing a better
            understanding of the underlying process that generated the data.
• The following benefits of Feature Selection are usually quoted:
  • Reduces overfitting
  • Improves accuracy
  • Reduces training time
Feature Selection Models
Feature selection methods are commonly categorized into filter, wrapper, and intrinsic (embedded) methods.
a. Filter Method
• Filter Methods
  • Filter methods select features based on their statistical scores with respect to the
    output column.
  • The filter method ranks each feature using some univariate metric and then
    selects the highest-ranking features.
  • The selection of features is independent of any machine learning algorithm.
  • Two rules of thumb:
    • The more a feature is correlated with the output column (the column to be
      predicted), the better the performance of the model.
    • Features should be least correlated with each other. If some of the input features
      are correlated with other input features, this situation is known as
      multicollinearity. It is recommended to remove it for better performance of the model.
a. Filter Method
Filter methods select independent features with:
  • No constant variables
  • No/few quasi-constant variables
  • No duplicate rows
  • High correlation with the target variable
  • Low correlation with other independent variables
  • Higher information gain or mutual information with the target
• mRMR score:
  • Selecting an optimal feature subset from a large feature space is a challenging problem.
  • The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework solves this
    problem by selecting relevant features while controlling for the redundancy within the selected features.
  • The mRMR feature selection method is used in different classification problems, such as the marketing
    machine learning platform at Uber, which automates the creation and deployment of targeting and
    personalization models at scale.
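The following is a minimal, illustrative sketch of the mRMR idea (not Uber's production implementation): at each step it greedily picks the feature whose relevance to the target (mutual information) minus its average redundancy (absolute correlation) with the already-selected features is largest. The dataset and column names are synthetic.

  # Greedy mRMR-style selection: relevance = mutual information with the target,
  # redundancy = mean absolute correlation with the already-selected features.
  import pandas as pd
  from sklearn.datasets import make_classification
  from sklearn.feature_selection import mutual_info_classif

  X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
  X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

  relevance = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
  corr = X.corr().abs()

  selected, candidates, k = [], list(X.columns), 5
  for _ in range(k):
      scores = {}
      for f in candidates:
          redundancy = corr.loc[f, selected].mean() if selected else 0.0
          scores[f] = relevance[f] - redundancy      # mRMR criterion (difference form)
      best = max(scores, key=scores.get)
      selected.append(best)
      candidates.remove(best)

  print("Selected features:", selected)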
a. Filter Method
• Removing features with low variance (low-variance filter)
  • Variance Threshold is a simple baseline approach to Feature Selection.
  • It removes all features whose variance doesn't meet some threshold.
  • In the simplest case this removes features that have the same value in all rows
    (zero-variance features), or in a specified fraction of rows.
  • Such features provide no value in building the machine learning predictive model.
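A minimal scikit-learn sketch of this low-variance filter on a toy matrix whose first column is constant; the threshold value here is just an example.

  from sklearn.feature_selection import VarianceThreshold
  import numpy as np

  X = np.array([[0, 2.0, 0.1],
                [0, 1.5, 0.2],
                [0, 3.0, 0.1],
                [0, 2.5, 0.3]])   # first column is constant (zero variance)

  selector = VarianceThreshold(threshold=0.0)   # drop features with zero variance
  X_reduced = selector.fit_transform(X)
  print(selector.get_support())   # [False  True  True]
  print(X_reduced.shape)          # (4, 2)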
Example: rental bikes
Predict the count of bikes that have been rented.
Example: rental bikes
❑ Target variable: count of bikes
❑ Total 6 variables or columns
❑ First drop the ID variable
❑ Apply the low-variance filter and try to reduce the dimensionality of the data:
  1. Normalize the data
  2. Compute the variance of each feature
  3. Set a variance threshold (here, 0.006)
  4. Select the features whose variance is greater than the set threshold
Example: rental bikes
4. Select the features whose variance is greater than the set threshold.
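A short pandas sketch of these four steps. The bike-rental values and column names below are hypothetical stand-ins (the actual dataset is not reproduced here); features are min-max normalized so that their variances are comparable before the 0.006 threshold is applied.

  import pandas as pd

  # Hypothetical bike-rental data; 'count' is the target and 'ID' is dropped first.
  df = pd.DataFrame({
      "ID": range(1, 7),
      "temperature": [9.8, 12.1, 14.3, 15.0, 14.9, 15.1],
      "humidity":    [81, 77, 70, 66, 64, 63],
      "windspeed":   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # constant column
      "count":       [120, 150, 210, 230, 225, 240],
  })
  X = df.drop(columns=["ID", "count"])

  # 1. Min-max normalize each feature (a constant column has zero range, hence the fillna)
  X_norm = ((X - X.min()) / (X.max() - X.min())).fillna(0.0)

  # 2./3./4. Compute variances and keep features above the threshold
  variances = X_norm.var()
  threshold = 0.006
  selected = variances[variances > threshold].index.tolist()
  print(variances.round(4))
  print("Selected features:", selected)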
How to choose a feature selection method?
B. Pearson's correlation coefficient
• A Pearson correlation is a number between -1 and 1 that indicates the
  extent to which two variables are linearly related.
• The Pearson correlation is also known as the "product moment
  correlation coefficient" (PMCC) or simply "correlation".
• Pearson correlations are suitable only for metric variables.
• The correlation coefficient takes values between -1 and 1:
  • A value closer to 0 implies weaker correlation (exactly 0 implies no linear correlation)
  • A value closer to 1 implies stronger positive correlation
  • A value closer to -1 implies stronger negative correlation
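A small pandas sketch (on synthetic data) of ranking features by the absolute value of their Pearson correlation with the target, following the rules of thumb above.

  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(0)
  n = 200
  df = pd.DataFrame({
      "x1": rng.normal(size=n),
      "x2": rng.normal(size=n),
      "x3": rng.normal(size=n),
  })
  df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.5, size=n)   # x3 is irrelevant

  # Pearson correlation of each feature with the target, ranked by absolute value
  corr_with_target = df.corr()["y"].drop("y").abs().sort_values(ascending=False)
  print(corr_with_target)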
Mpg data set: heat map
• It is seen that the variables cyl and disp are highly correlated with each
  other (0.902033).
• Hence we compare them with the target variable: mpg is more highly correlated
  with cyl, so we keep that one and drop the other.
• The same process is then repeated with the remaining variables, one by one,
  until the last variable.
• We are left with four features: wt, qsec, gear, carb.
• These are the final features given by the Pearson correlation filter.
• Multicollinearity: Variance Inflation Factor (VIF)
B. Pearson's correlation coefficient
• Example: A researcher in a scientific foundation wished to evaluate the relation between
  annual salaries of mathematicians (Y, in thousand dollars) and an index of work quality
  (X1), number of years of experience (X2), and an index of publication success (X3).

  X1 (work quality)   X2 (years of experience)   X3 (publication success)   Y (annual salary, $K)
  3.5                 9                          6.1                        33.2
  5.1                 18                         7.4                        38.7
  6.0                 13                         5.9                        37.5
  3.1                 5                          5.8                        30.1
  4.5                 25                         5.0                        38.2

  Find the correlation matrix for the given data set.
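One way to compute the requested correlation matrix is shown below (a pandas sketch; the numbers are exactly those from the table above).

  import pandas as pd

  data = pd.DataFrame({
      "X1_work_quality":        [3.5, 5.1, 6.0, 3.1, 4.5],
      "X2_years_experience":    [9, 18, 13, 5, 25],
      "X3_publication_success": [6.1, 7.4, 5.9, 5.8, 5.0],
      "Y_annual_salary":        [33.2, 38.7, 37.5, 30.1, 38.2],
  })

  # 4x4 Pearson correlation matrix of the three predictors and the target
  print(data.corr(method="pearson").round(3))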
C. Mutual Information (Information Gain)
• Information gain is a measure of how much information a feature provides
  about a class.
• The feature having the most information is considered important by the
  algorithm and is used for training the model.
• The effort is to reduce the entropy and maximize the information gain.
• Information gain helps to determine the order of attributes in the nodes of
  a decision tree.
• Information gain is used in decision trees and random forests to decide the
  best split.
C. Mutual Information
• A more robust approach is to use Mutual Information, which can be
  thought of as the reduction in uncertainty about one random variable given
  knowledge of another.
• Entropy of variable X
• Entropy of X after observing Y
• Information gain
(The formulas for these three quantities are given below.)
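The formulas themselves are not reproduced on the slide; the standard definitions (with base-2 logarithms) are:

  H(X) = -\sum_x P(x) \log_2 P(x)

  H(X \mid Y) = -\sum_y P(y) \sum_x P(x \mid y) \log_2 P(x \mid y)

  IG(X \mid Y) = H(X) - H(X \mid Y)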
Entropy
• Entropy is an information-theoretic metric that measures the impurity or uncertainty in a group of
  observations.
• Entropy is the uncertainty or randomness in the data; the more the randomness, the higher the entropy.
  Information gain uses entropy to make decisions.
• The lower the entropy, the more information is gained.
• Higher entropy implies greater uncertainty or lack of predictability.
Example #2: Illustrative Data Set
[Table: sunburn data.]
Exercise: Which attribute has the highest information gain for this data?
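To answer this kind of question, the information gain of each attribute can be computed directly. Below is a small self-contained sketch; the mini data set is an illustrative stand-in, not the actual sunburn table.

  import math
  from collections import Counter

  def entropy(labels):
      n = len(labels)
      return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

  def information_gain(attribute_values, labels):
      # IG = H(labels) - sum over values v of P(v) * H(labels | attribute = v)
      n = len(labels)
      remainder = 0.0
      for v in set(attribute_values):
          subset = [lab for a, lab in zip(attribute_values, labels) if a == v]
          remainder += (len(subset) / n) * entropy(subset)
      return entropy(labels) - remainder

  # Illustrative stand-in data (not the actual sunburn table)
  hair   = ["blonde", "blonde", "brown", "red", "brown", "blonde"]
  lotion = ["no", "yes", "no", "no", "yes", "no"]
  burned = ["yes", "no", "no", "yes", "no", "yes"]

  for name, attr in [("hair", hair), ("lotion", lotion)]:
      print(name, round(information_gain(attr, burned), 3))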
Filter Method
• Univariate Selection Methods
• Univariate feature selection methods select the best features based on
  univariate statistical tests. Scikit-learn provides the following univariate
  feature selection methods:
  • SelectKBest
  • SelectPercentile
• These two are the most commonly used methods.
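A minimal SelectKBest sketch on a built-in dataset; the scoring function (ANOVA F-test) and k=2 are just example choices, and SelectPercentile is used the same way with a percentile instead of k.

  from sklearn.datasets import load_iris
  from sklearn.feature_selection import SelectKBest, f_classif

  X, y = load_iris(return_X_y=True)

  # Keep the 2 features with the highest ANOVA F-scores against the class label
  selector = SelectKBest(score_func=f_classif, k=2)
  X_new = selector.fit_transform(X, y)
  print(selector.get_support())   # boolean mask of the selected columns
  print(X_new.shape)              # (150, 2)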
Filter Method
• Advantages of filter methods
  • Filter methods are model-agnostic
  • They rely entirely on features in the dataset
  • They are computationally very fast
  • They are based on different statistical methods
• Disadvantages of filter methods
  • The filter method looks at individual features to identify their relative importance.
    A feature may not be useful on its own but may be an important influencer when
    combined with other features. Filter methods may miss such features.
  • One thing that should be kept in mind is that the filter method does not remove
    multicollinearity. So, you must deal with the multicollinearity of features as well
    before training models for your data.
2. Wrapper Methods (Feature Selection)
• Wrapper Methods: In Wrapper Methods the problem of Feature
  Selection is reduced to a search problem.
    • A model is built using a set of features and its accuracy is recorded.
    • Based on the accuracy, more features are added or removed, and
      the process is repeated.
2. Wrapper Methods (Feature Selection)
• Wrapper methods include the following:
   • Forward Selection
   • Backward Elimination
   • Exhaustive Search
   • Recursive Feature Elimination
Wrapper Methods
• A. Forward Selection
  • Forward Selection is an iterative method.
  • In this method, we start with one feature and keep adding
    features until no improvement in the model is observed.
  • The search is stopped after a pre-set criterion is met.
  • This is a greedy approach: at each step it adds the feature that gives
    the largest boost to performance.
  • If the number of features is large, it can be computationally
    expensive.
• B. Backward Elimination
  • This process is the opposite of the Forward Selection method.
  • It starts with all the features and keeps removing features
    until no improvement is observed.
Feature Selection: Wrapper Methods
• C. Exhaustive Search
  • This feature selection method tries all possible combinations of features to
    select the best model.
  • This method is quite computationally expensive.
  • For example, if we have five features, we will be evaluating 2^5 = 32 models
    before finalizing a model with good accuracy.
• D. Recursive Feature Elimination
  • Recursive Feature Elimination (RFE) involves the following steps:
  • We train the model on the initial set of features and the importance of each
    feature is calculated.
  • In the second iteration, a model is built again using the most important features
    and excluding the least important features.
  • These steps are repeated recursively until we are left with the most important
    features for the problem under consideration.
  • The scikit-learn library provides a function for Recursive Feature Elimination
    (see the sketch below).
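A hedged sketch of Recursive Feature Elimination with scikit-learn's RFE; the estimator, synthetic dataset, and number of features to keep are arbitrary example choices.

  from sklearn.datasets import make_classification
  from sklearn.feature_selection import RFE
  from sklearn.linear_model import LogisticRegression

  X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)

  # Recursively drop the least important feature until 3 remain
  rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
  rfe.fit(X, y)
  print(rfe.support_)   # True for the selected features
  print(rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier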
Wrapper-based model: forward feature selection
• Fitness level prediction
• The first step in Forward Feature Selection is to train n models, using each
  feature individually, and check the performance.
• If we have three independent variables, we will train three models, one
  using each of these three features individually.
Forward feature selection: Example
• Let's say we trained the model using the Calories_Burnt feature and the target
  variable, Fitness_Level, and we get an accuracy of 87%.
Forward feature selection: Example cont.
• Next, we train the model using the Gender feature, and we get an accuracy
  of 80%.
Example cont.
• Next, we repeat this process and add one variable at a time. Of course we
  keep the Calories_Burnt variable and keep adding one variable. Taking
  Gender here, we get an accuracy of 88%.
Example cont.
• Using Plays_Sport along with Calories_Burnt, we get an accuracy of 91%.
  The variable that produces the highest improvement is retained.
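The forward procedure walked through above can be reproduced with scikit-learn's SequentialFeatureSelector. The data below is a synthetic stand-in for the fitness example (only the column names Calories_Burnt, Gender, and Plays_Sport come from the slides); passing direction="backward" would perform backward elimination instead.

  import numpy as np
  import pandas as pd
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n = 300
  X = pd.DataFrame({
      "Calories_Burnt": rng.normal(2000, 300, n),
      "Gender": rng.integers(0, 2, n),
      "Plays_Sport": rng.integers(0, 2, n),
  })
  # Synthetic Fitness_Level driven mainly by Calories_Burnt and Plays_Sport
  y = ((X["Calories_Burnt"] > 2000).astype(int) + X["Plays_Sport"] >= 1).astype(int)

  # Greedily add features (starting from none) until 2 are selected
  sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                  n_features_to_select=2, direction="forward", cv=5)
  sfs.fit(X, y)
  print(list(X.columns[sfs.get_support()]))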
Feature Selection (Wrapper model): Backward Feature Elimination
• These are our assumptions:
  • No missing values in the dataset
  • Variance of the variables is high
  • Low correlation between the independent variables
Backward Feature Elimination: Example
• Fitness level prediction
• The first step is to train the model using all the variables.
• Of course, we do not use the ID variable to train the model, since ID
  contains a unique value for each observation.
• So we first train the model using the other three independent variables
  and, of course, the target variable, which is Fitness_Level.
• We get an accuracy of 92% using all three independent variables.
Backward Feature Elimination: Example
• Gender produces the smallest change in model performance: accuracy was 92%
  when we took all the variables, and 91.6% when we dropped Gender. So we can
  infer that Gender does not have a high impact on the Fitness_Level variable,
  and hence it can be dropped.
• Finally, we repeat all these steps until no more variables can be dropped.
• It is a very simple but very effective technique.
3. Intrinsic Methods (Feature Selection)
• Intrinsic or Embedded Methods
  • Embedded methods learn which features contribute the most to the model's
    performance while the model is being created. You have seen feature selection
    methods in the previous lessons, and several more, like decision-tree-based
    methods, will be discussed in future lessons.
  • Ridge Regression (L2 regularization)
  • Lasso Regression (L1 regularization)
  • Elastic-Net Regression (uses both L1 and L2 regularization)
  • Decision-tree-based methods (Decision Tree classification, Random Forest
    classification, XGBoost classification, LightGBM)
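A hedged sketch of one embedded approach: Lasso (L1) regression combined with scikit-learn's SelectFromModel, which keeps only the features whose coefficients survive the L1 penalty. The synthetic dataset and the alpha value are example choices.

  from sklearn.datasets import make_regression
  from sklearn.feature_selection import SelectFromModel
  from sklearn.linear_model import Lasso
  from sklearn.preprocessing import StandardScaler

  X, y = make_regression(n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0)
  X = StandardScaler().fit_transform(X)

  # The L1 penalty zeroes out coefficients of uninformative features;
  # SelectFromModel keeps the features with non-zero coefficients.
  selector = SelectFromModel(Lasso(alpha=1.0))
  selector.fit(X, y)
  print(selector.get_support())
  print(selector.transform(X).shape)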
Variance Inflation Factor (VIF)
• Multicollinearity is a statistical phenomenon that occurs when two or more independent
  variables in a regression model are highly correlated with each other.
• It is a challenge in regression analysis because it becomes difficult to determine the
  individual effect of each independent variable on the dependent variable accurately.
• VIF measures the strength of the correlation between the independent variables.
• The VIF score of an independent variable represents how well that variable is explained
  by the other independent variables.
VIF
• VIF ranges from 1 to ∞
• VIF = 1: no correlation between the independent variable and the other variables
• VIF between 1 and 5: variables are moderately correlated
• VIF greater than 5: variables are highly correlated
• VIF exceeding 10 indicates significant multicollinearity that needs to be corrected
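For feature i, VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on the remaining independent variables. A short statsmodels sketch on synthetic data follows; a constant column is added because VIF assumes a regression with an intercept, and x2 is deliberately made almost collinear with x1.

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  rng = np.random.default_rng(0)
  n = 200
  x1 = rng.normal(size=n)
  X = pd.DataFrame({
      "x1": x1,
      "x2": x1 + rng.normal(scale=0.1, size=n),   # nearly collinear with x1
      "x3": rng.normal(size=n),
  })

  X_const = sm.add_constant(X)
  vif = pd.Series(
      [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
      index=X.columns,
  )
  print(vif.round(2))   # x1 and x2 should show VIF well above 5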
• One method to detect multicollinearity is to calculate the variance inflation factor
  (VIF) for each independent variable
• To fix multicollinearity, one can remove one of the highly correlated variables,
  combine them into a single variable, or use a dimensionality reduction technique
  such as principal component analysis (PCA) to reduce the number of variables while
  retaining most of the information.