FEATURE TRANSFORMATION
Feature Transformation is the process of changing, improving, or deriving features
so that a machine learning model can use them more effectively.
It mainly includes:
1. Feature Construction (building/deriving new features).
2. Feature Extraction (reducing and combining features into useful
representations).
FEATURE CONSTRUCTION
1. Encoding Categorical (Nominal) Variables
Nominal variables represent categories without any order.
Eg: gender (Male, Female), city (Paris, London, Tokyo), color (Red, Blue, Green).
Since most ML algorithms require numeric input, these must be encoded.
Techniques:
i). One-Hot Encoding
Creates a new column for each category.
Eg: Color = {Red, Blue, Green} →
Red = [1,0,0]
Blue = [0,1,0]
Green = [0,0,1]
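A minimal Python sketch of one-hot encoding with pandas (the DataFrame and the "Color" values are illustrative, not from these notes):

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})
one_hot = pd.get_dummies(df["Color"])   # one 0/1 column per category
print(one_hot)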
ii). Dummy Encoding
Similar to one-hot, but one column is dropped to avoid redundancy.
Eg: Color = {Red, Blue, Green} →
Red = [1,0]
Blue = [0,1]
Green = [0,0]
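Dummy encoding can be sketched with the same pandas call by dropping one column; the data is the same illustrative example as above:

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})
dummies = pd.get_dummies(df["Color"], drop_first=True)   # k categories -> k-1 columns
print(dummies)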
2. Encoding Categorical (Ordinal) Variables
Ordinal variables represent categories with a natural order or ranking, but the
distance between levels is not meaningful.
Eg: education level (High School < Bachelor < Master < PhD), satisfaction (Low <
Medium < High).
Techniques:
i). Label / Ordinal Encoding
Assigns integers that follow the natural order of the categories.
Eg: Size = {Small, Medium, Large} →
Small = 1, Medium = 2, Large = 3
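A small sketch with scikit-learn's OrdinalEncoder, passing the order explicitly (the Size values are illustrative; note the codes start at 0 rather than 1):

from sklearn.preprocessing import OrdinalEncoder

sizes = [["Small"], ["Large"], ["Medium"]]
enc = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])  # explicit order
print(enc.fit_transform(sizes))   # Small=0.0, Medium=1.0, Large=2.0; add 1 to match the notes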
ii). Custom Scoring / Ranking
Numbers are given based on domain knowledge.
Eg: Credit Rating = {Poor, Average, Good, Excellent} →
Poor = 1, Average = 2, Good = 3, Excellent = 4
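Custom scores are usually just a hand-written mapping; a minimal pandas sketch with the credit-rating example (the values are illustrative):

import pandas as pd

ratings = pd.Series(["Good", "Poor", "Excellent", "Average"])
scores = {"Poor": 1, "Average": 2, "Good": 3, "Excellent": 4}   # domain-defined weights
print(ratings.map(scores))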
3. Transforming Numeric Features into Categories
Continuous/numeric variables are sometimes grouped into bins or categories to
simplify models or highlight meaningful ranges.
Techniques:
i). Binning / Discretization
Divides the value range into fixed intervals, either equal-width or chosen from domain knowledge.
Eg: Age →
0–17 = Child
18–59 = Adult
60+ = Senior
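A minimal sketch of fixed-range binning with pandas (the ages and bin edges are illustrative):

import pandas as pd

ages = pd.Series([5, 23, 47, 61, 80])
bins = [0, 17, 59, 120]                         # edges give (0,17], (17,59], (59,120]
labels = ["Child", "Adult", "Senior"]
print(pd.cut(ages, bins=bins, labels=labels))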
ii). Equal-Frequency Binning (Quantile Binning)
Groups data so each bin has roughly the same number of samples.
Eg: Income of 1000 people → split into 4 groups (25% each) = Low, Medium,
High, Very High.
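A sketch of equal-frequency binning with pandas qcut on synthetic incomes (the distribution is made up just to have 1000 values):

import numpy as np
import pandas as pd

incomes = pd.Series(np.random.default_rng(0).lognormal(10, 1, 1000))
labels = ["Low", "Medium", "High", "Very High"]
binned = pd.qcut(incomes, q=4, labels=labels)   # roughly 250 samples per bin
print(binned.value_counts())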
4. Text-Specific Feature Construction
Text is unstructured and must be transformed into numerical features for ML.
Techniques:
i). Bag of Words (BoW)
Represents text by counting how often each word appears, ignoring word order.
Eg: “I love AI” → {I:1, love:1, AI:1}
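A minimal Bag-of-Words sketch with scikit-learn (the documents are illustrative; note the default tokenizer drops one-letter words such as "I"):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love AI", "AI is the future"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)          # sparse document-term count matrix
print(vec.get_feature_names_out())
print(counts.toarray())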
ii). TF-IDF (Term Frequency–Inverse Document Frequency)
Weights words by how often they appear in a document and how rare they are across all documents, so distinctive words count more than common ones.
Eg: In product reviews, the word “excellent” is given higher weight than common
words like “the” or “is.”
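A TF-IDF sketch with scikit-learn on made-up review snippets:

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["excellent product", "the product is fine", "the delivery is slow"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(reviews)
print(vec.get_feature_names_out())
print(tfidf.toarray().round(2))           # distinctive words like "excellent" get higher weight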
FEATURE EXTRACTION
1. Principal Component Analysis (PCA)
PCA is an unsupervised dimensionality reduction technique.
It creates new features (principal components) that are linear combinations of the
original features.
The first component captures the maximum variance in the data, the second
captures the next highest variance, and so on.
PCA reduces the number of features while still keeping most of the important
information.
Helps in handling multicollinearity (when two or more features are highly
correlated).
Eg 1: A dataset with 100 features can be reduced to 20 features that still explain
about 95% of the data variance.
Eg 2: If “Height” and “Weight” are strongly correlated, PCA combines them into
one new feature like “Body Size.”
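A minimal PCA sketch with scikit-learn on synthetic data (200 samples x 100 random features, so the exact number of components kept is not meaningful in itself):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(200, 100))
pca = PCA(n_components=0.95)              # keep enough components for ~95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())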
2. Singular Value Decomposition (SVD)
SVD is a matrix factorization technique that decomposes a matrix into three matrices
(A = U Σ Vᵀ); keeping only the largest singular values gives a compact, truncated approximation.
It helps identify hidden patterns and structure in large datasets.
Very useful when data is represented as a large matrix (e.g., documents vs words,
users vs movies).
Helps reduce high-dimensional and sparse data into fewer meaningful dimensions.
Eg 1: In a term-document matrix (rows = documents, columns = words), SVD
reduces thousands of words into a smaller set of “concepts,” which improves
search engines and topic modeling.
Eg 2: Netflix uses SVD in its recommender system. From the sparse ratings of
thousands of users and movies, SVD predicts how much a user might like a new
movie.
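A truncated-SVD sketch with scikit-learn on a synthetic term-document-style count matrix (the sizes and counts are made up):

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.default_rng(0).poisson(0.1, size=(100, 1000))   # 100 "documents" x 1000 "words"
svd = TruncatedSVD(n_components=20, random_state=0)
X_concepts = svd.fit_transform(X)         # each document now lives in a 20-dim "concept" space
print(X_concepts.shape)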
3. Linear Discriminant Analysis (LDA)
LDA is a supervised dimensionality reduction method.
Unlike PCA (which ignores class labels), LDA uses class information to find
features that best separate different categories.
It projects data into a new space where the distance between different classes is
maximized and the variation within the same class is minimized.
Especially useful when the goal is classification.
Eg 1: In medical data, LDA can separate patients into “Cancer Present” vs “Cancer
Absent” groups by projecting features onto a line that maximizes class separation.
Eg 2: In face recognition, each image has thousands of pixel features. LDA
reduces them into fewer features that highlight differences between people’s faces,
making classification easier.
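A minimal LDA sketch with scikit-learn on the built-in Iris dataset (4 features, 3 classes), just to show the supervised fit_transform call:

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most n_classes - 1 = 2 components
X_lda = lda.fit_transform(X, y)                    # uses the class labels y, unlike PCA
print(X_lda.shape)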