Feature Selection

Abstract— Feature selection is an important step in feature reduction in data science. High-dimensional data generally contain extra or redundant features, because of which overfitting may occur and performance may degrade. A number of different feature selection techniques exist, and they are applied in different data science tasks such as classification, clustering, and regression. The objective of this research is to study the different types of feature selection methods in detail. This review gives detailed information about these methods, which are applied in different application areas to select the most important features and to remove redundant or irrelevant features.
I. INTRODUCTION
Knowledge may be extracted and patterns can be discovered through the application of machine learning and data mining techniques. The gathered data are typically accompanied by a significant amount of noise. Noise can be introduced into data in different ways and because of a variety of factors; two main causes are 1) the technologies used to collect the data and 2) the data's source of origin. It is not easy to sift through such vast amounts of noisy data and find patterns and meaningful information. A number of factors affect machine learning's success, and one of them is the quality of the data. It is exceedingly difficult to obtain an exact result, and the computation time increases, if the data is redundant, irrelevant, noisy, or inaccurate. Effective prevention and early treatment of high-risk diseases such as diabetes, obesity, and cardiovascular diseases can result from the prediction of complex diseases. The general process for prediction is shown in Fig. 1. Logistic regression, support vector machine, decision tree, artificial neural network, random forest, and k-nearest neighbor are some of the machine learning models that take input data and predict the risk of disease. Preprocessing is the initial step in this process. Removing redundant and noisy features is often achieved through dimensionality reduction. Feature extraction and feature selection are the two techniques for reducing dimension. By projecting features onto a new, lower-dimensional feature space, feature extraction techniques create new features, which are typically combinations of the original features. Linear Discriminant Analysis (LDA), Principal Component Analysis (PCA), and Canonical Correlation Analysis (CCA) [3] are some of the common feature extraction methods. Feature selection aims to select a subset of features that reduces duplication and keeps only the relevant features; Lasso, Fisher score, information gain, and Relief are some common feature selection techniques. Dimensionality is decreased by feature selection to create a high-quality dataset that can be utilized for model training and prediction. The hypothesis space is smaller when data dimensionality is decreased, which enables algorithms to run more quickly and efficiently.

The flow of the paper is as follows. Feature selection is covered in Section 2. Methods for feature selection are covered in Section 3. Related work is discussed in Section 4. Performance measurement is discussed in Section 5, and the discussion and conclusion are in Section 6.

Fig. 1. General prediction system

II. FEATURE SELECTION

The most popular technique for getting rid of superfluous and unnecessary features is feature selection. Feature selection techniques minimize the dimensionality of training data by 1) eliminating features that are redundant, 2) deleting features with little or no predictive ability, and 3) removing superfluous elements. A well-chosen feature set can improve predictive efficiency and accuracy [4]. Therefore, extracting the most significant features is crucial and highly beneficial to the researcher [1, 2].

Techniques for feature selection can be categorized as supervised, unsupervised, or semi-supervised based on whether the training set is labeled. Supervised feature selection techniques can be further divided into filter models, wrapper models, embedded models, and hybrid models; Fig. 2 presents this classification of feature selection methods. The filter model is based on metrics for the general properties of the training set, including correlation, information, consistency, distance, and dependency, and filter methods do not depend on a classifier. Among the filter model's most representative algorithms are those based on information gain, Fisher score, and Relief. The wrapper model uses a predetermined learning method's expected accuracy to determine the quality of selected features, so wrapper methods are classifier dependent. Because running these algorithms on datasets with many features is expensive and comes at a high computational cost, embedded and hybrid models were introduced to remove the gap between the filter and wrapper approaches.
1) Pearson Correlation:
The correlation between a feature Pi and the target Q is measured as cov(Pi, Q) / (σPi σQ), where cov(Pi, Q) is the covariance and σ is the standard deviation. This approach may not adequately capture many physical phenomena because it only looks for linear correlations.

2) Linear Discriminant Analysis (LDA):
This is employed for multi-class categorization. It minimizes the variance within each class while maximizing the distance between classes. In order to optimize the separation between the classes, LDA projects the data onto a lower-dimensional space. To do this, it identifies a set of linear discriminants that maximize the ratio of between-class variance to within-class variance.
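As a brief illustration of the projection described above, the following sketch uses scikit-learn's LinearDiscriminantAnalysis; the iris data and the choice of two components are assumptions made only for this example.

```python
# Project data onto discriminant axes that maximize between-class separation (sketch).
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA allows at most (number of classes - 1) components; iris has 3 classes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)

print(X_projected.shape)               # (150, 2): four features reduced to two
print(lda.explained_variance_ratio_)   # share of between-class variance per axis
```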
3) Mutual Information:
Mutual information operates on the variables' entropy. Mutual information (MI) is a non-negative quantity that expresses how dependent two random variables are on one another. It equals zero if and only if the two random variables are independent, and higher values denote stronger dependency. It is the degree to which one variable reveals information about another. The mutual information between two random variables X and Y is expressed as follows:

I(X; Y) = H(X) − H(X | Y)        (2)

where H(X) is the entropy of X, H(X | Y) is the conditional entropy of X given Y, and I(X; Y) is the mutual information between X and Y. The outcome is non-negative and is expressed in bits.
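A filter-style selection based on this criterion can be sketched with scikit-learn as below; the breast-cancer dataset and k = 5 are arbitrary choices for illustration, and chi2 could be substituted as the scoring function for the chi-square test discussed next.

```python
# Filter method: score each feature by mutual information with the target and
# keep the k best (illustrative sketch; dataset and k are arbitrary choices).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)   # 30 features cut down to 5
```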
4) Chi-square test of independence:
The Chi-Square Test of Independence [20] is a derivable (also known as inferential) statistical test that determines whether or not two sets of variables are likely to be related to one another. This test, which is regarded as non-parametric, is applied when we have counts of values for two nominal or categorical variables.

5) Missing value ratio:
The missing value ratio of a column is calculated by dividing the number of missing values by the total number of observations and then multiplying the result by 100. A threshold must be set, and features with missing value ratios higher than this threshold must be dropped.

Missing value ratio = (Number of missing values / Total number of observations) * 100
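The ratio above maps directly onto a couple of pandas operations; the toy DataFrame and the 30% threshold in this sketch are assumptions for illustration only.

```python
# Missing value ratio per column, with columns above a chosen threshold dropped.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "glucose": [148, 85, np.nan, 89, np.nan],
    "insulin": [np.nan, np.nan, np.nan, 94, 168],
    "age":     [50, 31, 32, 21, 33],
})

missing_ratio = df.isna().sum() / len(df) * 100   # (missing / observations) * 100
threshold = 30.0                                  # arbitrary cut-off for the sketch

to_drop = missing_ratio[missing_ratio > threshold].index
df_reduced = df.drop(columns=to_drop)

print(missing_ratio)
print("dropped columns:", list(to_drop))          # glucose and insulin in this toy data
```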
6) Dispersion Ratio:
This is the arithmetic mean divided by the geometric mean (AM/GM) of a feature. Greater dispersion produces a larger AM/GM value, which makes the feature a more significant property.

7) Relief:
Relief assigns each feature a score, which is used to rank the features and choose those with the highest scores. A "hit" occurs when a difference in feature values is found in a nearby instance pair belonging to the same class; this reduces the feature score. Conversely, the feature score rises when a "miss", a difference in feature values in a nearby instance pair with different class values, is observed.
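A minimal NumPy sketch of this scoring rule is given below. It is a simplified variant that range-scales the features, visits every instance once instead of sampling, and uses Manhattan distance; the tiny example data are invented for illustration.

```python
# Simplified Relief: a feature loses weight when it differs on the nearest
# same-class neighbour (hit) and gains weight when it differs on the nearest
# other-class neighbour (miss).
import numpy as np

def relief_scores(X, y):
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0                       # avoid division by zero
    Xn = (X - X.min(axis=0)) / span             # range-scale so diffs lie in [0, 1]
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(Xn - Xn[i]).sum(axis=1)   # Manhattan distance to every row
        dist[i] = np.inf                        # exclude the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        w -= np.abs(Xn[i] - Xn[hit]) / n        # penalise a difference on a hit
        w += np.abs(Xn[i] - Xn[miss]) / n       # reward a difference on a miss
    return w

X = [[1.0, 10.0], [1.1, 20.0], [0.9, 15.0], [3.0, 12.0], [3.2, 18.0]]
y = np.array([0, 0, 0, 1, 1])
print(relief_scores(X, y))   # feature 0 separates the classes, so it scores higher
```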
E. Wrapper method:
The wrapper approach is a strategy that uses several subsets of features to assess a model's performance and choose a subset of those features. A model is trained and assessed on each of the candidate feature subsets produced by the wrapper technique, and the optimal subset of features is chosen based on the model's performance. The wrapper approach requires a lot of computing power.

1) Forward feature selection:
A feature selection method wherein a predetermined criterion is used to gradually add one feature at a time, starting from an empty feature set. By repeatedly assessing the effectiveness of various feature combinations, it seeks to identify the optimal subset of features.

2) Backward feature selection:
Feature selection that begins with every feature in the model and proceeds to eliminate features one at a time until the subset of features that best serves the goals of the model is identified. The backward elimination method can be helpful when the aim is to identify the most significant features from a relatively small number of features.
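Both greedy directions can be sketched with scikit-learn's SequentialFeatureSelector; the linear-regression estimator, the diabetes dataset, and the target of three features are assumptions made for the example.

```python
# Wrapper method: greedy forward and backward selection scored by cross-validation.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression()

forward = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction="forward", cv=5).fit(X, y)
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=3, direction="backward", cv=5).fit(X, y)

print("forward keeps :", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))
```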
3) Exhaustive feature selection:
In a machine learning problem, this technique, also referred to as best subset selection, selects the optimal feature combination from a given set of features. The aim is to find the feature subset that optimizes the model's performance. Using a performance metric such as accuracy or mean squared error, the optimal subset of features is chosen after all potential feature combinations are assessed. The number of possible combinations increases exponentially with the number of features.
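The brute-force search can be sketched as below; the wine dataset, the decision-tree scorer, and the cap on subset size (needed to keep the exponential search small) are all assumptions for illustration.

```python
# Exhaustive (best subset) selection: evaluate every feature combination by CV.
from itertools import combinations

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

best_score, best_subset = -1.0, None
for r in range(1, 4):                              # cap subset size for the sketch
    for subset in combinations(range(X.shape[1]), r):
        score = cross_val_score(model, X[:, list(subset)], y, cv=3).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "cv accuracy: %.3f" % best_score)
```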
4) Recursive feature elimination:
Recursive Feature Elimination (RFE) is a feature selection method used to identify the critical features in a dataset. The method iteratively eliminates the least important features and then builds a model using the features that remain. This removal process is repeated until the target feature count defined for RFE is reached. In some circumstances, it may not be known how many of the original features should be kept. To solve this, several feature subsets are evaluated using cross-validation combined with RFE, and the most advantageous combination is finally chosen based on the scores. This method makes it easier to find the ideal feature count for the model.
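Both the fixed-count and the cross-validated variants are available in scikit-learn; the synthetic dataset and the random-forest estimator below are illustrative assumptions.

```python
# RFE drops the weakest feature each round; RFECV picks the feature count by CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, RFECV

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)
estimator = RandomForestClassifier(n_estimators=100, random_state=0)

rfe = RFE(estimator, n_features_to_select=5, step=1).fit(X, y)   # fixed target count
rfecv = RFECV(estimator, step=1, cv=5).fit(X, y)                 # count chosen by CV

print("RFE keeps  :", rfe.get_support(indices=True))
print("RFECV keeps:", rfecv.n_features_, "features")
```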
5) Embedded method:
Embedded methods combine the advantages of both the wrapper and filter approaches. The feature selection process is interwoven into the classification algorithm: during the training phase, the classifier adjusts its internal parameters and determines the appropriate weights and priorities for each feature in order to attain the best classification accuracy. Therefore, selecting the best feature subset and creating the model are completed in a single step when using an embedded technique; algorithm training and feature selection are carried out concurrently.

Ridge regression and LASSO are two of the most well-known applications of these techniques. Both approaches penalize the size of the feature coefficients while minimizing the error between predicted and actual values. Lasso regression carries out L1 regularization, applying a penalty that is directly proportional to the absolute values of the coefficients. Ridge regression performs L2 regularization, which applies a penalty proportional to the square of the coefficient magnitudes. Lasso regression is more effective at lowering the variance in data that contains a large number of insignificant features because it can neutralize the effects of irrelevant features: it can reduce a feature's coefficient to zero and eliminate the feature entirely. Ridge regression, in contrast, is unable to shrink the coefficients all the way to zero. When the data contains features that are certain to be relevant and valuable, ridge regression performs better.

Lasso = Residual Sum of Squares + λ * (sum of the absolute values of the coefficients)

Ridge = Residual Sum of Squares + λ * (sum of the squares of the coefficients)
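An embedded-style sketch with L1 regularization is shown below; the synthetic regression data and the use of LassoCV to pick the penalty strength (alpha in scikit-learn, the λ above) are assumptions for illustration.

```python
# Embedded method: the L1 penalty of LASSO drives irrelevant coefficients to zero,
# so model fitting and feature selection happen in a single step.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)                  # penalty strength chosen by CV
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))

# The same idea through scikit-learn's generic embedded-selection helper.
selector = SelectFromModel(LassoCV(cv=5)).fit(X, y)
print("SelectFromModel keeps:", selector.get_support(indices=True))
```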
The benefits and drawbacks of the three primary feature selection techniques are displayed in Table I.

TABLE I. BENEFITS AND DRAWBACKS OF FEATURE SELECTION METHODS

Filter method
Advantages: quick processing; effective computation; reduced chance of overfitting; quicker than a wrapper; independent of the classifier.
Disadvantages: feature dependencies are not considered; the interplay with classifiers is overlooked.

Wrapper method
Advantages: interaction with the classifier; feature dependencies are taken into account; accuracy greater than filter methods.
Disadvantages: risk of overfitting; high computing costs; slowness; classifier-dependent selection.

Embedded method
Advantages: interaction with the classifier; accuracy greater than filter methods; quick and precise; computational cost lower than the wrapper.
Disadvantages: identification of a small collection of features is challenging; complex implementation; classifier-dependent selection.

IV. RELATED WORK

The two main categories of current filter techniques are univariate and multivariate. While multivariate approaches take a selection of features into account at the same time, univariate methods examine each feature separately.

The paper [6] uses a wrapper-based feature selection method in order to minimize the number of feature characteristics. It employs Grey Wolf Optimization (GWO) and Adaptive Particle Swarm Optimization (APSO) techniques and achieves 96% accuracy.

To identify pertinent features and remove redundant features, more sophisticated multivariate feature approaches have been developed. In [7] the minimal-redundancy-maximal-relevance (mRMR) feature selection approach is used. Again, though, such approaches are primarily limited to feature interactions between pairs. Relief does not look for feature interactions in all possible ways; rather, it assigns a feature's relevance based on how efficiently its value separates samples that share similarities (like genotype) but fall into different classifications.

In [8] a decision tree classifier with hyperparameter optimization is used. The AUC of the classifier with all features and with the reduced features is calculated by k-fold cross-validation with k = 10; the CART classifier performed well on the reduced dataset without compromising performance. In [9] the Boruta feature selection algorithm is used with ensemble learning to select relevant features, and on the PIMA dataset they obtained 98% accuracy with k-fold cross-validation, where k = 10. In [10] rough set theory is used to select the important features from the PIMA dataset.

In [11] a filter algorithm based on correlation is used and applied to both continuous-class and discrete-class data. The results show that it can reduce the dimensionality of a dataset while increasing performance, reducing the feature set by 54%. In [12] a correlation feature selection method is used to find the feature set; the PIMA dataset is used in a MATLAB environment, and the selected features are given to a Probabilistic Neural Network for classification. In [13] a chi-square method and an advanced clustering algorithm are used for feature selection before classification, and the prediction accuracy is increased compared to previous work.

In [15] a method is used which dynamically selects the features that best represent the data. Twelve different datasets are used; different feature subsets and classifiers are selected and trained separately, and the predictions are then combined to obtain the ensemble prediction. The proposed method was found to give the best solution set.

In [16] the number of features is reduced in three different steps. In the first step, pairwise correlation is used and redundant features are discarded. In the second step, individual methods select their own feature sets independently. In the third step, the feature sets are equalized and then combined with the help of union and quorum techniques. In their experiment they obtained a result of 99.2% with the reduced dataset.

In [17] the PCA method is used for feature transformation. PCA identifies the best subset of feature components, and by using feature transformation accuracy can be increased compared with feature selection.

In [12] correlation-based feature selection is used. Irrelevant features are not taken into consideration, as they have low correlation with the class. In the initial stage, feature-feature and feature-class correlation matrices are calculated; the feasible subset of the feature space is then found in the subsequent exploration step.

In [18] attributes are assigned values based on their importance. Correlation between attributes is calculated and compared, and the best attributes are selected. SVM obtained an improved accuracy of 77% and Naive Bayes 82.3%.

An ensemble feature selection technique based on sort aggregation (SA-EFS) has also been presented. Chi-square, the maximum information coefficient, and the XGBoost algorithm were employed to generate candidate sets of multiple optimal feature subsets. The learning outcomes of the several candidate sets of optimal feature subsets are then combined, ranking the features based on significance, to produce the ideal feature subsets.
When choosing the optimal subset, wrapper methods inherently account for feature dependencies, such as interactions and redundancies. However, wrapper approaches are computationally demanding (compared to filter and embedded methods) because of the numerous calculations needed to build and assess the feature subsets. The wrapper method provides the "best" feature subset. As the output is already a feature subset, one advantage is that the user does not need to decide how many features to keep. A drawback is that it is not always obvious which of the features in the selected collection are the significantly more relevant ones.

Regularization models, such as LASSO or elastic net, and decision tree-based algorithms, such as decision trees, random forest, and gradient boosting, are examples of embedded techniques used in feature selection. Note that the decision tree-based and regularization methods discussed above, like many filter methods, also yield a ranked list of features. Decision tree-based algorithms prioritize features using metrics such as Mean Decrease in Impurity (MDI), while for regularization procedures the magnitude of the feature coefficients provides the feature ranking. Penalized approaches such as LASSO have the capacity to eliminate redundant features, unlike decision tree-based algorithms.

V. PERFORMANCE MEASUREMENT

The confusion matrix, which is shown in Fig. 3, is used to assess each model's performance and determine how effective each strategy is. The machine learning model's predicted values and the actual target values are compared in the confusion matrix, from which metrics such as accuracy, precision, recall, and F1 score are computed. Its constituents are true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Accuracy is calculated as the ratio of correct predictions to all of the model's predictions. Precision is the ability of a model to produce accurate positive predictions. The ratio of correct positive predictions to the total number of positives in the actual class is known as recall. The weighted average of recall and precision is known as the F1 score.

Fig. 3. Confusion Matrix

Accuracy = (TP + TN) / (TN + TP + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2 * (Recall * Precision) / (Recall + Precision)

Accuracy is the most commonly used correctness measure; it refers to the degree of correctness. Precision is used to check the accuracy of the positive predictions made by the model and is a useful metric in scenarios where false positives are costly or undesirable. Recall quantifies the model's accuracy in identifying positive examples. The F1 score is the harmonic mean of recall and precision and offers a balance between the two.
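The four formulas above can be checked directly against scikit-learn's confusion-matrix utilities; the small label vectors in the sketch below are invented purely to show the computation.

```python
# Accuracy, precision, recall and F1 computed from the confusion matrix counts.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions (illustrative)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tn + tp + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (recall * precision) / (recall + precision)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print("sklearn F1 for comparison:", f1_score(y_true, y_pred))
```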
VI. DISCUSSION AND CONCLUSION

This paper summarizes different feature selection methods. Each method has advantages and problems of its own, and numerous research works have contrasted the prediction abilities of the various feature selection techniques. Different approaches are appropriate and provide the greatest results depending on the challenge: the type of dataset being studied and the nature of the task determine which feature selection strategy is best for a given application. Since filter methods generate a sorted list of features, they are the most effective way to determine which features are relatively most significant. The best results will come from wrapper methods if the dataset has fewer features. Newer feature selection methods combine one or more feature selection techniques, so there is always a tradeoff between the complexity of the method and the performance of the algorithm. Hybrid methods give better performance compared to single feature selection methods.

REFERENCES

[1] E. H. Rachmawanto, D. R. Ignatius Moses Setiadi, N. Rijati, A. Susanto, I. U. Wahyu Mulyono, and H. Rahmalan, "Attribute Selection Analysis for the Random Forest Classification in Unbalanced Diabetes Dataset," in 2021 International Seminar on Application for Technology of Information and Communication (iSemantic), Sep. 2021, pp. 82–86. doi: 10.1109/iSemantic52711.2021.9573181.
[2] D. R. Ignatius Moses Setiadi et al., "Effect of Feature Selection on The Accuracy of Music Genre Classification using SVM Classifier," in 2020 International Seminar on Application for Technology of Information and Communication (iSemantic), Sep. 2020, pp. 7–11. doi: 10.1109/iSemantic50169.2020.9234222.
[3] S. Raghavendra and J. Santosh Kumar, "Performance evaluation of random forest with feature selection methods in prediction of diabetes," Int. J. Electr. Comput. Eng., vol. 10, no. 1, pp. 353–359, 2020. doi: 10.11591/ijece.v10i1.pp353-359.
[4] Y. Akhiat, Y. Asnaoui, M. Chahhou, and A. Zinedine, "A new graph feature selection approach," in 2020 6th IEEE Congress on Information Science and Technology (CiSt), Jun. 2021, pp. 156–161.
[5] Y. Bouchlaghem, Y. Akhiat, and S. Amjad, "Feature Selection: A Review and Comparative Study," E3S Web Conf., vol. 351, pp. 1–6, 2022. doi: 10.1051/e3sconf/202235101046.
[6] T. M. Le, T. M. Vo, T. N. Pham, and S. V. T. Dao, "A novel wrapper-based feature selection for early diabetes prediction enhanced with a metaheuristic," IEEE Access, vol. 9, pp. 7869–7884, 2020.
[7] H. Peng, F. Long, and C. Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 1226–1238, 2005. doi: 10.1109/TPAMI.2005.159.
[8] D. P. M. Abellana, R. R. Roxas, D. M. Lao, P. E. Mayol, and S. Lee, "Ensemble Feature Selection in Binary Machine Learning Classification: A Novel Application of the Evaluation Based on Distance from Average Solution (EDAS) Method," Mathematical Problems in Engineering, vol. 2022, Article ID 4126536, 13 pages, 2022. https://doi.org/10.1155/2022/4126536
[9] H. Zhou, Y. Xin, and S. Li, "A diabetes prediction model based on Boruta feature selection and ensemble learning," BMC Bioinformatics, vol. 24, 224, 2023. https://doi.org/10.1186/s12859-023-05300-5
[10] K. M. Kaka-Khan, H. Mahmud, and A. A. Ali, "Rough Set-Based Feature Selection for Predicting Diabetes Using Logistic Regression with Stochastic Gradient Decent Algorithm," UHD Journal of Science and Technology, vol. 6, no. 2, pp. 85–93, 2022.
[11] M. A. Hall, "Correlation-based feature selection of discrete and numeric class machine learning," Working Paper 00/08, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 2000.
[12] K. Kalaiselvi and P. Sujarani, "Correlation Feature Selection (CFS) and Probabilistic Neural Network (PNN) for Diabetes Disease Prediction," International Journal of Engineering & Technology, vol. 7, no. 3.27, pp. 325–330, 2018. https://doi.org/10.14419/ijet.v7i3.27.17965
recall and precision. [13] R. Mythily, and D. Mavaluru, “An efficient feature selection algorithm
for health care data analysis,” Bulletin of Electrical Engineering and
Informatics, vol. 9, no. 3, pp. 877-885, 2020, doi:
10.11591/eei.v9i3.1744.
[14] Pudjihartono N, Fadason T, Kempa-Liehr AW and O'Sullivan JM (2022)
A Review of Feature Selection Methods for Machine Learning-Based
Disease Risk Prediction. Front. Bioinform. 2:927312. doi:
10.3389/fbinf.2022.927312
[15] H. E. Kiziloz and A. Deniz, "Feature Selection with Dynamic Classifier
Ensembles," 2020 IEEE International Conference on Systems, Man,
and Cybernetics (SMC), Toronto, ON, Canada, 2020, pp. 2038-2043,
doi: 10.1109/SMC42975.2020.9282969.
[16] Doreswamy, M. K. Hooshmand, and I. Gad, "Feature selection approach using ensemble learning for network anomaly detection," CAAI Trans. Intell. Technol., vol. 5, no. 4, pp. 283–293, Dec. 2020, doi: 10.1049/trit.2020.0073.
[17] B. Senthil Kumar, & R. Gunavathi. (2020). Early prediction of diabetes
using Feature Transformation and hybrid Random Forest Algorithm.
International Journal of Engineering and Advanced Technology
(IJEAT), 9(5), 787–791. https://doi.org/10.35940/ijeat.E9836.069520
[18] Sneha, N., Gangil, T. Analysis of diabetes mellitus for early prediction
using optimal features selection. J Big Data 6, 13 (2019).
https://doi.org/10.1186/s40537-019-0175-6