Feature selection
Feature selection is the process of identifying and selecting a subset of relevant features (data columns) from your dataset, i.e., the ones that help a machine learning model make better predictions.
It helps make the model:
• Faster to train
• More accurate
• Easier to understand
Why Feature Selection Is Important:
1. Improves Accuracy: Reducing irrelevant features can lead to better model performance.
2. Speeds Up Training: Fewer features reduce model complexity and training time.
3. Enhances Interpretability: Simpler models are easier to understand and explain.
4. Reduces Overfitting: Less redundant data means less chance for the model to make decisions based on noise.
Steps in Feature Engineering:
1. Creating new features
2. Transforming features (e.g., scaling, encoding)
3. Selecting important features
4. Reducing the number of features (dimensionality reduction)
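Example (Python): a minimal sketch of step 2 (transforming features) using scikit-learn; the DataFrame and its column names (age, city, buys) are made up purely for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "buys": [0, 1, 1, 0],
})

# Transforming features: scale the numeric column, one-hot encode the categorical one.
age_scaled = StandardScaler().fit_transform(df[["age"]])
city_encoded = OneHotEncoder().fit_transform(df[["city"]]).toarray()

print(age_scaled.ravel())   # standardized ages (mean 0, unit variance)
print(city_encoded)         # one 0/1 column per city: Delhi, Mumbai, Pune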
Types of Feature Selection Methods:
1. Filter Methods (Like Pre-checking Features)
• Think of this like doing a quick check before training a model.
• It looks at each feature on its own to see if it has a strong relationship with the target (the thing you're trying
to predict).
• Example: "Does age affect whether someone buys a product?"
Tools used: correlation, chi-square test, etc.
Good: Fast and simple
Bad: Doesn't look at how features work together
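Example (Python): a minimal filter-method sketch using scikit-learn's SelectKBest with the chi-square score on the built-in Iris dataset (chi-square expects non-negative feature values, which Iris happens to satisfy); each feature is scored on its own and only the top 2 are kept.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)             # 4 features, 3 classes
selector = SelectKBest(score_func=chi2, k=2)  # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)    # one chi-square score per feature
print(X_selected.shape)    # (150, 2) -> only the 2 best features remain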
2. Wrapper Methods (Like Trying Different Combinations)
• These methods test many combinations of features by actually training models.
• They pick the combination that works best for prediction.
• Like trying different outfits to see which one fits best!
Tools used: Recursive Feature Elimination (RFE), forward selection, backward elimination
Good: Very accurate
Bad: Slower and takes more computer power
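Example (Python): a minimal wrapper-method sketch with Recursive Feature Elimination (RFE), which repeatedly trains a model and drops the weakest feature until the requested number remains; the choice of logistic regression here is just one reasonable option.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # True/False mask of the features that were kept
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier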
3. Embedded Methods (Built Into the Model)
• These methods let the model pick the best features while it's being trained.
• It's like cooking and choosing the best ingredients at the same time.
Tools used: Lasso (L1), Decision Trees, Random Forest
Good: A nice balance between speed and accuracy
Bad: Tied to specific types of models
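Example (Python): a minimal embedded-method sketch with Lasso (L1) regression on synthetic data; the L1 penalty shrinks the coefficients of unhelpful features to exactly zero while the model trains, so the non-zero coefficients are the "selected" features.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only 3 actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)

print(lasso.coef_)                  # most coefficients end up at (or near) 0
print(np.flatnonzero(lasso.coef_))  # indices of the features Lasso kept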
REQUIREMENT AND CLASSIFICATION OF FEATURE ENGINEERING
➤ Why is it needed?
• Raw data often contains irrelevant or redundant features.
• Helps models learn better by focusing on useful data.
➤ Classification:
1. Feature Selection
This means choosing only the most useful features from the data and removing the rest.
• Why? Some features may not help the model and can even make it worse.
• How?
o Filter Methods
o Wrapper Methods
o Embedded Methods
2. Feature Extraction
This means creating new features from the existing ones.
• Why? Sometimes combining or transforming old features can give better insights.
• Example: PCA (Principal Component Analysis) combines many features into fewer ones that still keep most
of the information.
3. Dimensionality Reduction
This means reducing the number of features while keeping the important information.
• Why?
o Makes models faster to train.
o Easier to visualize the data.
o Reduces overfitting (when a model learns too much from noise in the data).
• Common Techniques: PCA, t-SNE, LDA.
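Example (Python): a minimal dimensionality-reduction sketch for visualization, projecting the 4-dimensional Iris data down to 2 dimensions with t-SNE so it can be plotted.

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_iris(return_X_y=True)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)   # color each point by its class
plt.title("Iris projected to 2-D with t-SNE")
plt.show()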
Univariate Analysis
➤ What is it?
Univariate analysis means analyzing one variable at a time.
• It helps us understand the basic characteristics of that variable.
• Useful to see how data is spread and whether there are any unusual values (outliers).
Why is it important?
• Helps find central values (like average).
• Shows variability (how spread out the values are).
• Makes it easier to decide how to clean or transform the data.
Common Techniques & Tools
• Mean, Median, Mode: Show the central tendency.
• Histogram: Shows the distribution (how frequently values appear).
• Box Plot: Shows spread, median, and outliers.
• Standard Deviation & Range: Show how spread out the values are.
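Example (Python): a minimal univariate-analysis sketch on a single made-up column of exam scores, computing the central tendency and spread and drawing a histogram and a box plot.

import pandas as pd
import matplotlib.pyplot as plt

scores = pd.Series([55, 62, 67, 70, 71, 73, 73, 78, 80, 98], name="exam_score")

print(scores.mean(), scores.median(), scores.mode()[0])  # central tendency
print(scores.std(), scores.max() - scores.min())         # spread (std dev and range)

scores.plot(kind="hist", title="Distribution of exam scores")
plt.figure()
scores.plot(kind="box", title="Box plot (the 98 shows up as a possible outlier)")
plt.show()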
CORRELATION BASED FEATURE SELECTION (CFS)
➤ What is it?
• Selects features that are highly correlated with the target variable but not with each other.
Method: Chi-Square Test
• Used for categorical features.
• Measures the relationship between two variables.
• Large value → Strong relationship between feature and target.
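Example (Python): a minimal chi-square sketch for two categorical variables using scipy; the "gender" and "buys" columns are made up purely for illustration. A large statistic (small p-value) suggests the feature and the target are related.

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M", "F", "M"],
    "buys":   ["yes", "yes", "yes", "no", "yes", "no", "no", "no"],
})

table = pd.crosstab(df["gender"], df["buys"])   # contingency table of counts
stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3f}")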
CFS - Heatmap (Correlation Feature Selection Heatmap)
➤ What is it?
A CFS Heatmap is a visual tool used to show how strongly features in a dataset are related (correlated) to each other.
• It uses color intensity to represent the strength of correlation between pairs of features.
How to Read It:
• More intense colors = stronger correlation (positive or negative), on a typical color scale.
• Paler colors = weaker or no correlation.
• Values usually range from -1 to 1:
o +1 = Strong positive correlation (both features increase together)
o -1 = Strong negative correlation (one increases while the other decreases)
o 0 = No correlation
Why is it useful?
• Helps in Feature Selection: Identify and remove highly correlated (redundant) features.
• Makes complex relationships easy to understand visually.
• Often used before model building to clean up and simplify data.
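Example (Python): a minimal correlation-heatmap sketch, computing the pairwise correlation matrix of the Iris features and drawing it as a color-coded grid with seaborn.

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
corr = iris.data.corr()   # pairwise correlations, values between -1 and +1

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Feature correlation heatmap")
plt.show()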
FEATURE EXTRACTION – PCA (Principal Component Analysis)
➤ What is PCA?
PCA is a technique to reduce the number of features by combining them into new variables (principal components)
that still capture most of the information (variance).
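Example (Python): a minimal PCA sketch, compressing the 4 Iris features into 2 principal components and checking how much of the original variance they keep; features are standardized first because PCA is sensitive to scale.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (150, 2) -> 4 features became 2
print(pca.explained_variance_ratio_)  # share of variance kept by each component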
Dimensionality Reduction in Machine Learning
Dimensionality reduction is the process of reducing the number of input variables (features) in a dataset while
keeping as much important information as possible.
It helps simplify the data, make models faster, reduce overfitting, and improve visualization.
• Real-world data often has too many features (high-dimensional data).
• More features = more complexity, harder to visualize and slower to train.
• Some features may be irrelevant, redundant, or noisy.
Example:
Imagine you're analyzing students with these features:
• Age
• Height
• Weight
• Shoe Size
• Shirt Size
• Grade
• Exam Score
Many of these features are correlated (e.g., height and shoe size). Dimensionality reduction can combine them into
fewer features without losing useful information.
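Example (Python): a hypothetical sketch of the student scenario above, with made-up numbers; height, weight, and shoe size are generated to be strongly correlated, so PCA can fold the four columns into just two components while keeping most of the information.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height = rng.normal(165, 10, 300)
weight = 0.9 * height + rng.normal(0, 5, 300)   # correlated with height
shoe = 0.25 * height + rng.normal(0, 1, 300)    # correlated with height
exam = rng.normal(70, 12, 300)                  # unrelated to body measurements

X = np.column_stack([height, weight, shoe, exam])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
print(pca.explained_variance_ratio_.sum())  # most of the variance survives in just 2 components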