Seventh Lab Instruction
Data Normalization
• Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbor
classification and clustering.
• If using the neural network backpropagation algorithm for classification,
normalizing the input values for each attribute measured in the training
tuples will help speed up the learning phase.
• For distance-based methods, normalization helps prevent attributes with large ranges (e.g., income) from outweighing attributes with smaller ranges (e.g., binary attributes), as the sketch below illustrates.
• It is also useful when you have no prior knowledge of the data.
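A minimal sketch of the range-dominance problem, using made-up records and reusing the income bounds from the min-max example later in this lab:

import numpy as np

# Two records: (income, binary attribute) -- illustrative values only
a = np.array([73600.0, 1.0])
b = np.array([12000.0, 0.0])

# Raw Euclidean distance is dominated entirely by income
print(np.linalg.norm(a - b))            # ~61600.0

# After min-max scaling income to [0, 1] (min=12,000, max=98,000),
# both attributes contribute on a comparable scale
a_s = np.array([(73600 - 12000) / (98000 - 12000), 1.0])
b_s = np.array([0.0, 0.0])
print(np.linalg.norm(a_s - b_s))        # ~1.23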
LP-Norm
• To normalize data using Lp norms, we first need to calculate those norms.
• The norm of a vector represents some property of the vector; e.g., the L2 norm of a vector denotes its length/magnitude.
• One important use of the L2 norm is to transform a given vector into a unit-length vector,
• i.e., making the magnitude of the vector equal to 1 while preserving its direction.
• This is achieved by dividing each element of the vector by its length, i.e., its L2 norm (see the sketch below).
• Another use of the L1 and L2 norms:
• in the computation of the loss in regularized gradient descent algorithms;
• they are also used in the famous Ridge and Lasso regression algorithms.
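A minimal sketch of unit-length scaling with NumPy (the vector values are made up):

import numpy as np

v = np.array([3.0, 4.0])

# L2 norm (Euclidean length) of v
l2 = np.linalg.norm(v)          # 5.0

# Divide each element by the L2 norm -> unit-length vector
v_unit = v / l2
print(v_unit)                   # [0.6 0.8]
print(np.linalg.norm(v_unit))   # 1.0, direction preserved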
LP-Norm
• The p-norm is a norm on suitable real vector spaces, given by the pth root of the sum (or integral) of the pth powers of the absolute values of the vector components.
• The general definition of the p-norm of a vector v that has N elements is:

  ||v||_p = ( |v_1|^p + |v_2|^p + … + |v_N|^p )^(1/p)

where p is any positive real value, Inf, or -Inf. Some common values of p are 1, 2, and Inf.
• If p is 1, then the resulting 1-norm is the sum of the absolute values of the vector elements.
• If p is 2, then the resulting 2-norm gives the vector magnitude, or Euclidean length, of the vector.
• If p is Inf, then the resulting Inf-norm is the maximum of the absolute values of the vector elements (a short sketch of all three follows).
• Reference: https://www.journaldev.com/45324/norm-of-vector-python
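A short sketch computing these three common norms with NumPy's np.linalg.norm (the vector is made up):

import numpy as np

v = np.array([3.0, -4.0, 12.0])

print(np.linalg.norm(v, 1))        # 1-norm: |3| + |-4| + |12| = 19.0
print(np.linalg.norm(v, 2))        # 2-norm: sqrt(9 + 16 + 144) = 13.0
print(np.linalg.norm(v, np.inf))   # Inf-norm: max absolute value = 12.0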
Rescaling
• Rescaling changes the distance between the min and max values in a data
set by stretching or squeezing the points along the number line.
• Min-max normalization maps a value v of attribute A to the new range [new_min_A, new_max_A]:

  v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
  (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

• It can be most appropriate when there are just a few outliers, since it provides a simple way to compare a data point to the norm. Both transforms are sketched below.
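A minimal sketch of both transforms in Python, reusing the income example above:

import numpy as np

income = np.array([12000.0, 47000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
mm = (income - income.min()) / (income.max() - income.min()) \
     * (new_max - new_min) + new_min
print(mm)                # 73,600 -> ~0.716

# Decimal scaling: divide by 10^j, smallest j with max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
print(income / 10**j)    # j = 5, so 98,000 -> 0.98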
Z-score
• z-scores measure the distance of a data point from the mean in
terms of the standard deviation.
• The standardized data set has mean=0 and standard deviation 1,
and retains the shape properties of the original data set.
Z-score
• The relative positions of the data points are preserved, so the shape of the distribution remains the same.
• E.g., for an attribute A (μ: mean, σ: standard deviation):

  v' = (v − μ_A) / σ_A

• Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
  (73,600 − 54,000) / 16,000 = 1.225
  (a short sketch follows).
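A minimal sketch of z-score standardization in NumPy (the sample values are made up; the last line reuses the slide's parameters):

import numpy as np

values = np.array([38000.0, 54000.0, 73600.0, 86000.0])

mu = values.mean()
sigma = values.std()              # population standard deviation

z = (values - mu) / sigma
print(z.mean(), z.std())          # ~0.0 and 1.0 after standardization

print((73600 - 54000) / 16000)    # 1.225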
normalize() method
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
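A short sketch of sklearn.preprocessing.normalize, which by default applies the L2 norm row-wise (the sample matrix is made up):

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# Default: each row is divided by its own L2 norm
print(normalize(X))              # [[0.6, 0.8], [0.707..., 0.707...]]

# norm can also be 'l1' or 'max'
print(normalize(X, norm='l1'))   # each row's absolute values sum to 1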
Demo of dropna()
• By default it returns only those rows that contain no occurrences of np.nan.
• It can also be applied column-wise, as in the sketch below.
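A minimal sketch of both behaviours (the DataFrame values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan],
                   'C': [7.0, 8.0, 9.0]})

# Row-wise (default): keep only rows with no np.nan anywhere
print(df.dropna())

# Column-wise: keep only columns with no np.nan anywhere
print(df.dropna(axis=1))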
Demo of dropna() and fillna()
• By default the inplace flag is set to False.
• Hence these fillna() or dropna() calls have no effect on the original DataFrame until you set inplace=True.
• After executing a number of fillna() and dropna() calls, if you display your DataFrame, you still get that same old DataFrame.
• You need to re-issue those statements with the inplace flag set, as in the sketch below.
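A minimal sketch of the inplace behaviour (the values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

df.fillna(0)                 # returns a modified copy; df is unchanged
print(df)                    # still contains np.nan

df.fillna(0, inplace=True)   # modifies df itself and returns None
print(df)                    # np.nan replaced by 0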
With inplace=True
Demo of conditional selection and column-wise selection
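A minimal sketch of both selection styles (the columns and values are made up):

import pandas as pd

df = pd.DataFrame({'price': [100, 250, 80],
                   'fuel': ['Petrol', 'Diesel', 'Petrol']})

# Conditional selection: rows where a boolean condition holds
print(df[df['price'] > 90])

# Column-wise selection: a single column or a list of columns
print(df['fuel'])
print(df[['price', 'fuel']])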
Normalization
Normalizer class
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
Normalizer
(default: row-wise)
Max norm
row-wise
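A minimal sketch of the Normalizer class with its default L2 norm and with norm='max', both applied row-wise (the sample matrix is made up):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [2.0, 8.0]])

# Default: each row is scaled to unit L2 norm
print(Normalizer().fit_transform(X))            # [[0.6, 0.8], ...]

# Max norm: each row is divided by its largest absolute value
print(Normalizer(norm='max').fit_transform(X))  # [[0.75, 1.0], [0.25, 1.0]]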
Comparison: Lp norms via the Normalizer class vs. the normalize() method (see the sketch below)
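A minimal sketch showing that the two APIs produce the same result for the same norm; Normalizer is an estimator (so it can sit in a Pipeline), while normalize() is a one-off function:

import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[3.0, 4.0], [1.0, 2.0]])

out_class = Normalizer(norm='l2').fit_transform(X)
out_func = normalize(X, norm='l2')

print(np.allclose(out_class, out_func))  # True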
# We can control the data type of the get_dummies resultant DF using the dtype option
df_fu_ow = pd.get_dummies(df, columns=['fuel', 'owner'], dtype=np.int32, drop_first=True)

Now we will need to OHE 'brand' and drop a few redundant columns before concatenating it with the normalized df_numeric (a hedged sketch follows).
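A minimal sketch of that step, assuming df, df_fu_ow, and a normalized df_numeric from the earlier steps; exactly which columns count as redundant here is illustrative:

import numpy as np
import pandas as pd

# One-hot encode 'brand' with the same pattern as fuel/owner
df_brand = pd.get_dummies(df[['brand']], dtype=np.int32, drop_first=True)

# Drop the raw categorical column that is now encoded, then
# concatenate the OHE columns with the normalized numeric columns
df_final = pd.concat(
    [df_fu_ow.drop(columns=['brand']), df_brand, df_numeric],
    axis=1)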
Method 2: Separating out categorical columns

df_categorical = df.select_dtypes(include=['object', 'category'])
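The complementary selection of the numeric columns works the same way, assuming the same df; a small sketch:

import numpy as np

# Select all numeric columns (ints and floats) in one call
df_numeric = df.select_dtypes(include=[np.number])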
• You must have an idea of how many unique values a categorical feature holds, in order to know how many OHE columns it will generate.
• If you use 'K−1' OHE outcomes with the option drop_first=True, then a column with 'K' unique values adds 'K−1' columns to the resulting DF; this holds for each categorical column in your df (see the sketch below).
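A minimal sketch of checking K and confirming the K−1 column count (the column and values are made up):

import pandas as pd

df = pd.DataFrame({'fuel': ['Petrol', 'Diesel', 'CNG', 'Petrol', 'LPG']})

k = df['fuel'].nunique()
print(k)                                         # K = 4 unique values

ohe = pd.get_dummies(df, columns=['fuel'], drop_first=True)
print(ohe.shape[1])                              # K - 1 = 3 OHE columns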
Pandas
● Often the data you need exists in two separate sources; fortunately, Pandas makes it easy to combine these together.
● The simplest combination arises when both sources are already in the same format: then a concatenation through the pd.concat() call is all that is needed.
Concatenating multiple df
More on pd.concat()
# Concatenating DataFrames row-wise (axis=0)
df_concat = pd.concat([df1, df2], ignore_index=True)
• ignore_index=True resets the index after stacking.
# Horizontal Concatenation (Merging Columns)
df_concat_cols = pd.concat([df1, df2], axis=1)
• Columns may get duplicated if they don't have unique labels (see the runnable sketch below).
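A runnable sketch of both directions (df1 and df2 are made-up stand-ins):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Row-wise stacking with a clean 0..n-1 index
print(pd.concat([df1, df2], ignore_index=True))

# Column-wise (horizontal) concatenation, aligned on the index
print(pd.concat([df1, df2], axis=1))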
Demo of df.join()
You get the same output with:
df_joined = pd.concat([df1, df2], axis=1)
A small sketch of the equivalence follows.
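A minimal sketch of df.join(), which aligns on the index just like pd.concat(..., axis=1) (the frames are made up):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[0, 1])

df_joined = df1.join(df2)
print(df_joined)
print(df_joined.equals(pd.concat([df1, df2], axis=1)))  # True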
Here the size of the resultant df is 8128 × 40: the first 38 columns come from the OHE df on 3 categorical fields, and the remaining 2 are the normalized numeric columns.
Demo on the 'brand' feature of Cars.csv
Count of columns in the resultant df = 32 − 23 = 9.
You saved on the dimensionality of the data without compromising much on data quality. (A hedged sketch of one way to achieve such a reduction follows.)
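The slide's screenshots are not reproduced here. As an assumption about the approach, one common way to cut a many-valued 'brand' column down before OHE is to lump infrequent brands into a single 'Other' category; the threshold below is purely illustrative:

import pandas as pd

# Hypothetical reduction: keep only frequent brands, lump the rest
counts = df['brand'].value_counts()
rare = counts[counts < 100].index        # illustrative cut-off

df['brand'] = df['brand'].where(~df['brand'].isin(rare), 'Other')

# Far fewer OHE columns now than with all original brand values
df_brand = pd.get_dummies(df[['brand']], drop_first=True)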
Controlling the data type of the get_dummies resultant DF using the dtype option
OHE demo
Ordinal Encoding
Using Pandas
Using df['Ordinal_col_name'].cat.codes
• Identify ordinal columns manually or use a predefined dictionary.
• Convert them to categorical dtype with ordered categories.
• Apply ordinal encoding using .cat.codes.
• Education encoding:
  High School → 0, Bachelor → 1, Master → 2, PhD → 3
• Satisfaction encoding:
  Very Dissatisfied → 0, Dissatisfied → 1, Neutral → 2, Satisfied → 3, Very Satisfied → 4
• This approach ensures that the ordinal nature of the data is preserved.
Using df.cat.codes

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master'],
    'Satisfaction': ['Neutral', 'Satisfied', 'Very Satisfied', 'Dissatisfied', 'Very Dissatisfied', 'Neutral']
})
print(df)
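The encoding step itself appears only as a screenshot; a minimal sketch completing the bullets above, using the stated category orders:

# Convert to ordered categorical dtype, then encode with .cat.codes
edu_order = ['High School', 'Bachelor', 'Master', 'PhD']
sat_order = ['Very Dissatisfied', 'Dissatisfied', 'Neutral',
             'Satisfied', 'Very Satisfied']

df['Education'] = pd.Categorical(df['Education'],
                                 categories=edu_order, ordered=True)
df['Satisfaction'] = pd.Categorical(df['Satisfaction'],
                                    categories=sat_order, ordered=True)

df['Education_Encoded'] = df['Education'].cat.codes        # 0..3
df['Satisfaction_Encoded'] = df['Satisfaction'].cat.codes  # 0..4
print(df)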
from sklearn.preprocessing import OrdinalEncoder

Alternate ordinal-encoding method using the OrdinalEncoder class
OrdinalEncoder class
• Best for: features with an inherent order (e.g., Education Level, Satisfaction).
• This assigns integer values such as 0, 1, 2, 3, … based on a predefined order.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master',
                                 'PhD', 'Bachelor', 'Master']})

# Define ordinal categories
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]

# Apply ordinal encoding
encoder = OrdinalEncoder(categories=education_order)
df['Education_Encoded'] = encoder.fit_transform(df[['Education']])
print(df)

OrdinalEncoder demo
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=82, stratify=Y)
Option random_state
• The random_state parameter in Scikit-Learn ensures reproducibility
by controlling the randomness in various functions like
train_test_split(), KMeans(), RandomForestClassifier(), etc.
Why Use random_state?
• Reproducibility: Ensures that the results remain the same every time
you run the code.
• Consistency: Useful when comparing models to ensure they are
trained on the same Train/Test data split.
• Debugging: Helps track and debug issues by keeping the same
random selections.
Train/Test split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=82)