Question 3 DM

The document outlines the use of various preprocessing filters in Weka using the Iris dataset, including both supervised and unsupervised attribute and instance filters. It explains the purpose and effects of filters such as AttributeSelection, Resample, Center, Discretize, Normalize, and others, demonstrating their application and impact on the dataset. Key results include dimensionality reduction, handling class imbalance, and transforming data for improved model performance.

Uploaded by

s2024393005

Question 3:

Demonstrate the use of the following preprocessing filters in Weka, using any dataset (e.g., Iris).
(Submit a report showing the purpose and use of these filters).

Attribute Filters (supervised): AttributeSelection

Instance Filters (supervised): Resample

Attribute Filters (unsupervised): Center, Discretize, InterquartileRange, Normalize, PrincipalComponents, Standardize, ReplaceMissingValues

Instance Filters (unsupervised): RemoveDuplicates, RemoveMisclassified, Resample

We demonstrate the preprocessing filters in Weka using the Iris dataset.

Attribute filters

Supervised attribute filters use the class labels to guide attribute selection: they keep the attributes
that have the strongest relationship with the target attribute and remove the others.

Attribute selection

The AttributeSelection filter selects the most relevant attributes, those with the greatest effect on
the accuracy of the model. It reduces dimensionality, can improve model performance, and reduces
overfitting.

Demonstration

Using the Iris dataset, we notice that all attributes are important because they all have a strong
relationship with the target attribute. However, let's say we decide that "sepal width" is not relevant.
We can select this attribute and apply a filter to remove it, leaving us with only three input attributes.

Instance filters (supervised)

Resample:

The `Resample` filter in Weka is used to create a new dataset by sampling instances from the original
dataset. This filter is particularly useful for handling class imbalance. Class imbalance occurs when the
number of instances in one class is significantly higher or lower than the number of instances in other
classes.

Demonstration

After we apply the filter, the number of distinct values of "sepal length" decreases from 35 to 32, and
the number of unique values decreases from 9 to 5. Similarly, the distinct and unique values of the other
attributes (sepal width, petal length, and petal width) are also reduced. However, since our class
labels are not imbalanced, we do not need to perform resampling.
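
The drop in distinct values is what sampling with replacement tends to produce: some instances are drawn more than once and others not at all. The following is a minimal Python sketch of that effect, not Weka's implementation; the helper name `resample_with_replacement`, the seed, and the toy values are illustrative assumptions.

```python
import random

def resample_with_replacement(instances, sample_percent=100.0, seed=1):
    """Draw a random sample of the instances, with replacement."""
    rng = random.Random(seed)
    k = round(len(instances) * sample_percent / 100.0)
    return rng.choices(instances, k=k)

# Toy stand-in for one numeric attribute (not the real Iris column).
values = [4.3, 4.4, 4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 7.9]
sample = resample_with_replacement(values)

print(len(values), len(set(values)))   # size and distinct count before
print(len(sample), len(set(sample)))   # same size; distinct count cannot grow
```

Because every sampled value comes from the original data, the distinct count can only stay the same or shrink.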

Figure 1 Original Dataset

Figure 2 after Resampling

Attribute Filters (unsupervised):

Center:

The Center filter centers the data by subtracting the mean value of each attribute from every value of
that attribute, resulting in a zero mean.

Demonstration

When we apply the center filter in Weka, the mean of every attribute is subtracted from each value, so
the mean of the attributes becomes zero. This process centers the data but does not standardize it
because the standard deviation remains unchanged. Standardization involves both centering the data
(making the mean zero) and scaling it (making the standard deviation one), which is not done by the
center filter alone.

ORIGINAL VALUES

            SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE   7.9            4.4           6.9            2.5
MIN VALUE   4.3            2             1              0.1
MEAN        5.843          3.054         3.759          1.199
STD DEV     0.828          0.434         1.764          0.763

AFTER APPLYING FILTER

            SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE   2.057          1.346         3.141          1.301
MIN VALUE   -1.543         -1.054        -2.759         -1.099
MEAN        0              0             0              0
STD DEV     0.828          0.434         1.764          0.763
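
The effect of centering can be reproduced in a few lines. This is a minimal Python sketch under the assumption that centering means subtracting the attribute mean from each value; the function name `center` and the five sample values are illustrative and are not the full Iris column.

```python
from statistics import mean, stdev

def center(values):
    """Subtract the attribute mean from every value."""
    m = mean(values)
    return [v - m for v in values]

# Toy values for illustration only.
sepal_length = [4.3, 5.1, 5.8, 6.4, 7.9]
centered = center(sepal_length)

print(mean(centered))                         # approximately 0
print(stdev(sepal_length), stdev(centered))   # the spread is unchanged
```

The mean moves to zero while the standard deviation stays the same, which is why centering alone is not standardization.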

Discretization Filter

The Discretize filter in Weka converts numeric attributes into nominal (categorical) attributes by binning
the numeric values into discrete intervals. This process is known as discretization and is useful for
several reasons: compatibility with algorithms that require nominal inputs, handling non-linear
relationships, reducing sensitivity to outliers, and simplifying the data.

Demonstration

When we apply the discretizing filter in Weka, the continuous values of each attribute are converted
into discrete bins. This process makes the data easier to visualize. For example, instead of having a range
of continuous values for an attribute, we might see discrete categories or bins.

Figure 3 Original dataset visualization

Figure 4 After discretization
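
As a rough illustration of binning, here is a Python sketch of equal-width discretization. It assumes equal-width intervals, and the function name `equal_width_bins`, the bin count, and the sample values are hypothetical choices for this example rather than the settings used in the report.

```python
def equal_width_bins(values, num_bins=10):
    """Map each numeric value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]

petal_length = [1.0, 1.4, 1.5, 4.5, 5.1, 6.9]   # toy values
bins = equal_width_bins(petal_length, num_bins=3)
print(bins)
```

Each bin index would then be treated as one nominal category in place of the original numeric value.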

Interquartile Range

The InterquartileRange filter in Weka detects outliers and extreme values using the interquartile range
(IQR) of each numeric attribute. The IQR is a measure of statistical dispersion and is defined as the
range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data.
The filter adds new attributes to the dataset that flag instances lying far outside this range, which is
useful for various data analysis and preprocessing tasks, especially outlier detection, feature
engineering, and data exploration.

Demonstration

On applying the InterquartileRange filter, two additional attributes are added: one for outliers and one
for extreme values. Both show that the dataset contains no outliers and no extreme values.
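
The idea of flagging values by their distance from the quartiles can be sketched as follows. This is an illustrative Python version, not Weka's code: the function name `iqr_flags`, the factors 3.0 and 6.0, the quartile method, and the rule that a value is flagged as either an outlier or an extreme value (never both) are all assumptions made for this example.

```python
from statistics import quantiles

def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    """Return (is_outlier, is_extreme) for each value, based on IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    flags = []
    for v in values:
        dist = max(q1 - v, v - q3, 0.0)   # distance outside the quartile box
        is_extreme = dist > extreme_factor * iqr
        is_outlier = dist > outlier_factor * iqr and not is_extreme
        flags.append((is_outlier, is_extreme))
    return flags

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 100]   # toy values, not Iris
for value, (outlier, extreme) in zip(data, iqr_flags(data)):
    print(value, outlier, extreme)
```

On data without far-out values every flag is false, which matches the result reported for Iris.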

Normalize

The Normalize filter in Weka scales the values of numeric attributes into the range 0 to 1, so that
attributes measured on different scales become comparable.

Demonstration

When we apply the normalization filter in Weka, we observe changes in the maximum, minimum, mean,
and standard deviation values of the attributes. All the values are scaled between 0 and 1. This scaling
ensures that the maximum and minimum values of each attribute become the same (1 and 0,
respectively).

ORIGINAL VALUES

            SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE   7.9            4.4           6.9            2.5
MIN VALUE   4.3            2             1              0.1
MEAN        5.843          3.054         3.759          1.199
STD DEV     0.828          0.434         1.764          0.763

AFTER NORMALIZATION

            SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE   1              1             1              1
MIN VALUE   0              0             0              0
MEAN        0.429          0.439         0.468          0.458
STD DEV     0.23           0.181         0.299          0.318
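
The normalized statistics can be checked by hand with min-max scaling, (value - min) / (max - min). The Python sketch below applies that formula to the sepal length figures reported in the tables; the function name `min_max` is an illustrative choice, and the formula is assumed to be the one the filter applies with its default 0 to 1 range.

```python
def min_max(value, lo, hi):
    """Rescale a value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

# Sepal length statistics as reported in the original-values table.
lo, hi, mean_value, std_dev = 4.3, 7.9, 5.843, 0.828

print(min_max(lo, lo, hi), min_max(hi, lo, hi))   # the extremes map to 0 and 1
print(round(min_max(mean_value, lo, hi), 3))      # rescaled mean
print(round(std_dev / (hi - lo), 3))              # rescaled standard deviation
```

The standard deviation is only divided by the range, because shifting by the minimum does not change the spread.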

Principal Components

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a
dataset while retaining as much variability (information) as possible. It transforms the original attributes
into a new set of uncorrelated variables called principal components. These components are ordered by
the amount of variance they capture from the data, with the first principal component capturing the
most variance, the second capturing the next most, and so on.

Demonstration

On applying the PrincipalComponents filter, we observed that PCA transforms the 4-dimensional data
into 2 dimensions, forming PC1 and PC2.

PC1 = −0.581×petallength−0.566×petalwidth−0.522×sepallength+0.263×sepalwidth

This means that PC1 is a weighted sum of the original attributes, with the weights (coefficients)
indicating the contribution of each attribute to the principal component. Petal length, petal width,
and sepal length have negative weights, while sepal width has a positive weight.

PC2= 0.926×sepalwidth+0.372×sepallength+0.065×petalwidth+0.021×petallength

This shows that all the attributes have positive weights in PC2. Sepal width has a strong positive
contribution, sepal length has a moderate positive contribution, and petal width and petal length
make only small contributions.
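
One way to sanity-check the reported coefficients is to confirm that the two weight vectors have roughly unit length and are roughly orthogonal, as principal components should be. The Python sketch below does this and also scores one instance; the helper names and the sample instance are hypothetical, and treating the inputs as standardized attribute values is an assumption about the filter settings.

```python
import math

# Coefficients as reported above, keyed by attribute name.
pc1 = {"petallength": -0.581, "petalwidth": -0.566,
       "sepallength": -0.522, "sepalwidth": 0.263}
pc2 = {"petallength": 0.021, "petalwidth": 0.065,
       "sepallength": 0.372, "sepalwidth": 0.926}

def dot(a, b):
    """Dot product of two attribute-keyed vectors."""
    return sum(a[name] * b[name] for name in a)

def score(weights, instance):
    """Project one instance onto a component (a weighted sum)."""
    return dot(weights, instance)

print(math.sqrt(dot(pc1, pc1)), math.sqrt(dot(pc2, pc2)))   # both close to 1
print(dot(pc1, pc2))                                        # close to 0

# Hypothetical standardized instance, for illustration only.
x = {"petallength": 1.0, "petalwidth": 1.0, "sepallength": 1.0, "sepalwidth": -1.0}
print(score(pc1, x), score(pc2, x))
```

Small deviations from exactly 1 and 0 are expected because the coefficients are rounded to three decimals.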

Standardize

The Standardize filter in Weka scales each numeric attribute to have zero mean and unit variance.

Demonstration

On applying the filter, we observed that the values are transformed so that every attribute has mean 0
and standard deviation 1. The shape of each distribution is not changed, only its location and scale.
This makes the model we train less sensitive to differences in attribute scale and makes the data
easier to handle.

ORIGINAL VALUES

            SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE   7.9            4.4           6.9            2.5
MIN VALUE   4.3            2             1              0.1
MEAN        5.843          3.054         3.759          1.199
STD DEV     0.828          0.434         1.764          0.763

AFTER STANDARDIZATION

            SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE   2.484          3.104         1.78           1.705
MIN VALUE   -1.864         -2.431        -1.563         -1.44
MEAN        0              0             0              0
STD DEV     1              1             1              1
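
The standardized extremes follow from the z-score formula, (value - mean) / standard deviation. The Python sketch below recomputes the sepal length entries from the reported statistics; the function name `z_score` is an illustrative choice.

```python
def z_score(value, mean_value, std_dev):
    """Express a value as a number of standard deviations from the mean."""
    return (value - mean_value) / std_dev

# Sepal length statistics as reported in the original-values table.
mean_value, std_dev = 5.843, 0.828

print(round(z_score(7.9, mean_value, std_dev), 3))   # standardized maximum
print(round(z_score(4.3, mean_value, std_dev), 3))   # standardized minimum
print(z_score(mean_value, mean_value, std_dev))      # the mean maps to 0
```

Recomputing from rounded statistics can differ from the filter's output in the last decimal place for some attributes.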

Replace Missing Values

The ReplaceMissingValues filter replaces the missing values in the dataset: missing numeric values are
replaced with the attribute mean, and missing nominal values with the attribute mode.

Results

As our dataset does not contain any missing values, applying this filter has no effect.
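
For a dataset that does have gaps, the replacement can be sketched in Python as below. This is an illustration under stated assumptions: missing entries are represented as `None`, numeric columns are filled with the mean and nominal columns with the mode, and the function name `replace_missing` is hypothetical.

```python
from statistics import mean, mode

def replace_missing(column, numeric=True):
    """Fill None entries with the mean (numeric) or mode (nominal) of the rest."""
    present = [v for v in column if v is not None]
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in column]

print(replace_missing([5.0, None, 6.0, 7.0]))                   # gap gets the mean
print(replace_missing(["a", "b", None, "a"], numeric=False))    # gap gets the mode
print(replace_missing([5.1, 4.9, 4.7]))                         # nothing missing
```

The last call returns the column unchanged, mirroring the no-effect result on Iris.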

Instance filters (unsupervised)

Remove duplicates

The RemoveDuplicates filter removes all duplicate instances from the dataset.

Demonstration

When we applied the filter to remove duplicate instances, we observed that the number of instances
was reduced from 150 to 147, indicating that 3 duplicate instances were removed. This change affected
the values of the mean and standard deviation of each attribute, as the dataset's composition was
altered.
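
Duplicate removal amounts to keeping one copy of each identical row. The following Python sketch keeps the first occurrence and preserves order; the function name `remove_duplicates` and the three rows are illustrative assumptions, not the instances actually removed in the report.

```python
def remove_duplicates(instances):
    """Keep the first occurrence of each instance, preserving order."""
    seen = set()
    unique = []
    for row in instances:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

rows = [
    (4.9, 3.1, 1.5, 0.1, "Iris-setosa"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (4.9, 3.1, 1.5, 0.1, "Iris-setosa"),   # exact duplicate of the first row
]
deduped = remove_duplicates(rows)
print(len(rows), len(deduped))
```

Only rows that match on every attribute, including the class label, count as duplicates in this sketch.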

Remove Misclassified

This filter removes instances that are misclassified by a specified classifier. It is useful for cleaning the
dataset.

Demonstration

On applying the filter, only the Iris-setosa instances are left; all others are removed.
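
A result like this can arise when the classifier assigns the same class to every instance, so that every instance of the other classes counts as misclassified. The Python sketch below illustrates that mechanism only; which classifier was actually used is not stated in the report, and `always_setosa`, `remove_misclassified`, and the three sample rows are hypothetical.

```python
def remove_misclassified(instances, predict):
    """Keep only instances whose predicted class matches the actual class."""
    return [(x, label) for x, label in instances if predict(x) == label]

def always_setosa(_features):
    """Hypothetical classifier that ignores its input."""
    return "Iris-setosa"

data = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
]
kept = remove_misclassified(data, always_setosa)
print([label for _, label in kept])
```

With a classifier that separates the classes well, far fewer instances would be removed.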

Resample

The Resample filter is designed to create a new dataset by randomly selecting instances from the
original dataset. This process can be used to either reduce the size of the dataset (downsampling) or
increase it (upsampling).

Demonstration

When we applied the Resample filter to the Iris dataset, we observed that the number of instances for
the Iris-setosa and Iris-versicolor classes was reduced to 48 each, while the instances for the Iris-virginica
class increased to 54. This change resulted in an imbalance in the class labels.
