Question 3 DM

The document outlines the use of various preprocessing filters in Weka using the Iris dataset, including both supervised and unsupervised attribute and instance filters. It explains the purpose and effects of filters such as AttributeSelection, Resample, Center, Discretize, Normalize, and others, demonstrating their application and impact on the dataset. Key results include dimensionality reduction, handling class imbalance, and transforming data for improved model performance.

Uploaded by

s2024393005

Question 3:

Demonstrate the use of the following preprocessing filters in Weka, using any dataset (e.g., Iris).
(Submit a report showing the purpose and use of these filters).

Attribute Filters (supervised): AttributeSelection

Instance Filters (supervised): Resample

Attribute Filters (unsupervised): Center, Discretize, InterquartileRange, Normalize, PrincipalComponents, Standardize, ReplaceMissingValues

Instance Filters (unsupervised): RemoveDuplicates, RemoveMisclassified, Resample

We demonstrate the preprocessing filters in Weka using the Iris dataset.

Attribute filters

Supervised attribute filters use the class labels to guide attribute selection: they keep the attributes
that have the strongest relationship with the target attribute and remove the others.

Attribute selection

The AttributeSelection filter selects the most relevant attributes, i.e. those with the greatest effect on
the accuracy of the model. This reduces dimensionality and can improve model performance and reduce
overfitting.

Demonstration

Using the Iris dataset, we notice that all attributes are important because they all have a strong
relationship with the target attribute. However, let's say we decide that "sepal width" is not relevant.
We can select this attribute and apply a filter to remove it, leaving us with only three input attributes.
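
To make the idea of "relationship with the target attribute" concrete, here is a minimal sketch that ranks attributes by the absolute Pearson correlation with the class. This is a swapped-in illustration, not the evaluator Weka's AttributeSelection filter uses. The toy values, the numeric class encoding (0, 1, 2) and the names pearson and rank_attributes are all assumptions made for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_attributes(columns, target):
    """Return attribute names sorted by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy values (not the full Iris data); classes encoded as 0, 1, 2.
columns = {
    "sepal width": [3.5, 3.0, 3.2, 2.8, 3.0, 3.3],
    "petal length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
}
target = [0, 0, 1, 1, 2, 2]

ranking = rank_attributes(columns, target)
print(ranking)
```

On these toy values "petal length" ranks above "sepal width", which matches the intuition behind dropping the weaker attribute.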

Instance filters (supervised)

Resample:

The `Resample` filter in Weka is used to create a new dataset by sampling instances from the original
dataset. This filter is particularly useful for handling class imbalance. Class imbalance occurs when the
number of instances in one class is significantly higher or lower than the number of instances in other
classes.

Demonstration

When we apply the filter, the number of distinct values of "sepal length" decreases from 35 to 32, and
the number of unique values decreases from 9 to 5. Similarly, the distinct and unique values of the other
attributes (sepal width, petal length, and petal width) are also reduced. However, since our class
labels are not imbalanced, we do not need to perform resampling.

Figure 1 Original Dataset

Figure 2 after Resampling


Attribute filters (unsupervised):

Center:

It centers the data by subtracting the mean of each attribute from its values, so that every attribute
has zero mean.

Demonstration

When we apply the center filter in Weka, the mean of every attribute is subtracted from each value, so
the mean of the attributes becomes zero. This process centers the data but does not standardize it
because the standard deviation remains unchanged. Standardization involves both centering the data
(making the mean zero) and scaling it (making the standard deviation one), which is not done by the
center filter alone.

             SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE    7.9            4.4           6.9            2.5
MIN VALUE    4.3            2             1              0.1
MEAN         5.843          3.054         3.759          1.199
STD DEV      0.828          0.434         1.764          0.763

ORIGINAL VALUES

             SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE    2.057          1.346         3.141          1.301
MIN VALUE    -1.543         -1.054        -2.759         -1.099
MEAN         0              0             0              0
STD DEV      0.828          0.434         1.764          0.763

AFTER APPLYING FILTER
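
The centering arithmetic can be sketched in a few lines of plain Python. This is an illustration of the calculation, not Weka's implementation. The function name center and the sample values are hypothetical, and only the sepal length maximum (7.9) and mean (5.843) are taken from the tables above.

```python
from statistics import mean, pstdev

def center(values):
    """Subtract the mean from every value, so the result has mean zero."""
    m = mean(values)
    return [v - m for v in values]

# Hypothetical sample values for illustration.
sample = [5.1, 4.9, 4.7, 4.6, 5.0]
centered = center(sample)

# The spread is unchanged by centering; only the location moves.
spread_before = pstdev(sample)
spread_after = pstdev(centered)

# Check one table entry: centered max of sepal length = 7.9 - 5.843.
centered_max_sepal_length = round(7.9 - 5.843, 3)
print(centered_max_sepal_length)
```

The last value agrees with the 2.057 reported for sepal length after applying the filter.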

Discretization Filter

The Discretize filter in Weka converts numeric attributes into nominal (categorical) attributes by binning
the numeric values into discrete intervals. This process is known as discretization and is useful for
several reasons: compatibility with algorithms that require nominal attributes, handling non-linear
relationships, reducing sensitivity to outliers, and simplifying the data.

Demonstration

When we apply the Discretize filter in Weka, the continuous values of each attribute are converted
into discrete bins. This process makes the data easier to visualize. For example, instead of having a range
of continuous values for an attribute, we see discrete categories or bins.

Figure 3 Original dataset visualization

Figure 4 After discretization
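
Binning can be illustrated with a small equal-width sketch. It assumes equal-width intervals and a bin count of 3 chosen only for the example, and the report does not state which settings were used in Weka. The function name equal_width_bins and the sample values are hypothetical.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    # The maximum value would land on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Hypothetical petal-length-like values.
values = [1.0, 1.4, 3.0, 4.5, 6.9, 5.1]
bin_index = equal_width_bins(values, bins=3)
print(bin_index)
```

Each numeric value is replaced by a bin index, which plays the role of the nominal label produced by discretization.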

Interquartile Range

The InterquartileRange (IQR) filter in Weka adds new attributes to a dataset that flag outliers and
extreme values, detected from the interquartile range of each numeric attribute. The IQR is a measure of
statistical dispersion and is defined as the range between the first quartile (25th percentile) and the
third quartile (75th percentile) of the data. The added attributes provide information about values that
lie far outside the typical spread of the data, which is useful for outlier detection, feature
engineering and data exploration.

Demonstration

After applying the InterquartileRange filter, the dataset has two additional attributes: one flags
outliers and the other flags extreme values. Both show that there are no outliers or extreme values.
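
A minimal sketch of IQR-based flagging follows. It is an illustration of the idea rather than Weka's exact rule: the factors 3.0 and 6.0, the quartile method used by statistics.quantiles, the function name iqr_flags and the sample values are all assumptions for the example.

```python
from statistics import quantiles

def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    """Label each value 'normal', 'outlier' or 'extreme' by its distance from the quartile box."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    flags = []
    for v in values:
        # Distance below Q1 or above Q3; zero for values inside the box.
        dist = max(q1 - v, v - q3, 0.0)
        if dist > extreme_factor * iqr:
            flags.append("extreme")
        elif dist > outlier_factor * iqr:
            flags.append("outlier")
        else:
            flags.append("normal")
    return flags

# Hypothetical values: the last two lie far above the rest.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0, 100.0]
flags = iqr_flags(values)
print(flags)
```

With thresholds this wide, only values very far from the quartiles are flagged, so a compact dataset can plausibly produce no flags at all.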

Normalize

The Normalize filter in Weka rescales the values of numeric attributes into the range 0 to 1, so that
attributes measured on different scales become comparable.

Demonstration

When we apply the normalization filter in Weka, we observe changes in the maximum, minimum, mean,
and standard deviation values of the attributes. All the values are scaled between 0 and 1. This scaling
ensures that the maximum and minimum values of each attribute become the same (1 and 0,
respectively).

             SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE    7.9            4.4           6.9            2.5
MIN VALUE    4.3            2             1              0.1
MEAN         5.843          3.054         3.759          1.199
STD DEV      0.828          0.434         1.764          0.763

ORIGINAL VALUES

             SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE    1              1             1              1
MIN VALUE    0              0             0              0
MEAN         0.429          0.439         0.468          0.458
STD DEV      0.23           0.181         0.299          0.318

AFTER NORMALIZATION
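
The rescaling is the usual min-max formula, (x - min) / (max - min). The sketch below applies it to the sepal length minimum, mean and maximum from the tables above. It illustrates the arithmetic only, and the function name min_max is hypothetical.

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant attribute: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Sepal length minimum, mean and maximum from the tables above.
scaled = min_max([4.3, 5.843, 7.9])
scaled_mean = round(scaled[1], 3)
print(scaled_mean)
```

Because the transformation is linear, the mean is mapped by the same formula, which reproduces the 0.429 reported for sepal length.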

Principal Components

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a
dataset while retaining as much variability (information) as possible. It transforms the original attributes
into a new set of uncorrelated variables called principal components. These components are ordered by
the amount of variance they capture from the data, with the first principal component capturing the
most variance, the second capturing the next most, and so on.

Demonstration

After applying the PrincipalComponents filter, we observed that PCA transforms the 4-dimensional data
into 2 dimensions, PC1 and PC2.

PC1 = −0.581 × petallength − 0.566 × petalwidth − 0.522 × sepallength + 0.263 × sepalwidth

This means that PC1 is a weighted sum of the original attributes, with the weights (coefficients)
indicating the contribution of each attribute to the principal component. Petal length, petal width
and sepal length contribute negatively, while sepal width contributes positively.

PC2 = 0.926 × sepalwidth + 0.372 × sepallength + 0.065 × petalwidth + 0.021 × petallength

All attributes contribute positively to PC2. Sepal width has a strong positive contribution, sepal
length has a moderate positive contribution, and petal width and petal length have only a small
contribution.
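
The reported coefficients can be sanity-checked with a short sketch: principal component directions should have roughly unit length and be roughly orthogonal to each other. The coefficients come from the two equations above. The instance being projected is hypothetical, and it is an assumption that the coefficients apply to standardized attribute values.

```python
from math import sqrt

# Coefficients as reported in the PC1 and PC2 equations.
PC1 = {"petallength": -0.581, "petalwidth": -0.566, "sepallength": -0.522, "sepalwidth": 0.263}
PC2 = {"sepalwidth": 0.926, "sepallength": 0.372, "petalwidth": 0.065, "petallength": 0.021}

def dot(a, b):
    """Dot product of two attribute-keyed vectors."""
    return sum(a[name] * b[name] for name in a)

def project(instance, loadings):
    """Score of an instance on one component: weighted sum of its attribute values."""
    return dot(loadings, instance)

pc1_norm = sqrt(dot(PC1, PC1))
pc2_norm = sqrt(dot(PC2, PC2))
pc1_dot_pc2 = dot(PC1, PC2)

# Hypothetical instance, assumed to be already standardized.
z = {"sepallength": 1.0, "sepalwidth": 0.0, "petallength": 1.0, "petalwidth": 1.0}
scores = (project(z, PC1), project(z, PC2))
print(pc1_norm, pc2_norm, pc1_dot_pc2, scores)
```

Both norms come out close to 1 and the dot product close to 0, up to the three-decimal rounding of the coefficients.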

Standardize

The Standardize filter in Weka rescales each numeric attribute to have zero mean and unit variance.

Demonstration

After applying the filter, we observed that every attribute has mean 0 and standard deviation 1. The
shape of each distribution is unchanged; only its location and scale change. This makes a model
trained on the data less sensitive to differences in attribute scale and makes the data easier to
handle.

             SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE    7.9            4.4           6.9            2.5
MIN VALUE    4.3            2             1              0.1
MEAN         5.843          3.054         3.759          1.199
STD DEV      0.828          0.434         1.764          0.763

ORIGINAL VALUES

             SEPAL LENGTH   SEPAL WIDTH   PETAL LENGTH   PETAL WIDTH
MAX VALUE    2.484          3.104         1.78           1.705
MIN VALUE    -1.864         -2.431        -1.563         -1.44
MEAN         0              0             0              0
STD DEV      1              1             1              1

AFTER STANDARDIZATION
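
Standardization is the z-score, (x - mean) / standard deviation. The sketch below illustrates the arithmetic and checks one entry of the table above. The use of the sample standard deviation is an assumption, and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def z_score(value, attr_mean, attr_std):
    """Standardize one value given the attribute's mean and standard deviation."""
    return (value - attr_mean) / attr_std

# Hypothetical sample values.
z = standardize([2.0, 4.0, 6.0, 8.0])

# Check one table entry: sepal length maximum 7.9, mean 5.843, std dev 0.828.
z_max_sepal_length = round(z_score(7.9, 5.843, 0.828), 3)
print(z_max_sepal_length)
```

The result agrees with the 2.484 reported for the sepal length maximum after standardization.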

Replace Missing Values

The ReplaceMissingValues filter replaces the missing values in the dataset with the mean (for numeric
attributes) or the mode (for nominal attributes).

Results

Since our dataset does not contain any missing values, applying this filter has no effect.

Instance filters (unsupervised)

Remove duplicates

The RemoveDuplicates filter removes all duplicate instances from the dataset.

Demonstration

When we applied the filter to remove duplicate instances, we observed that the number of instances
was reduced from 150 to 147, indicating that 3 duplicate instances were removed. This change affected
the values of the mean and standard deviation of each attribute, as the dataset's composition was
altered.
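
Duplicate removal can be sketched as keeping the first occurrence of each identical row. This is an illustration of the idea, not Weka's code. The function name remove_duplicates and the rows are hypothetical.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # rows must be hashable to be compared as a whole
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical rows: the third row repeats the first one exactly.
rows = [
    (4.9, 3.1, 1.5, 0.1, "Iris-setosa"),
    (5.8, 2.7, 5.1, 1.9, "Iris-virginica"),
    (4.9, 3.1, 1.5, 0.1, "Iris-setosa"),
]
unique_rows = remove_duplicates(rows)
print(len(rows), len(unique_rows))
```

A row counts as a duplicate only when every attribute value, including the class, is identical.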

Remove Misclassified

This filter removes instances that are misclassified by a specified classifier. It is useful for cleaning the
dataset.

Demonstration

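After applying the filter, only the Iris-setosa instances remain; all others are removed. This is the
result one would expect if the filter's classifier predicts the same class (here Iris-setosa) for every
instance, as a majority-class baseline does.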
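
The following sketch shows how a one-class result can arise. It assumes the classifier is a majority-class baseline that predicts the same label for every instance; the report does not state which classifier was configured. The function names and the placeholder rows are hypothetical.

```python
from collections import Counter

def majority_class(labels):
    """Most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def remove_misclassified(rows, labels, predict):
    """Keep only the instances whose predicted label matches the true label."""
    return [(row, y) for row, y in zip(rows, labels) if predict(row) == y]

# Class layout of Iris: 50 instances per class, setosa first.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
rows = list(range(len(labels)))  # placeholder rows; attribute values are irrelevant here

baseline = majority_class(labels)
kept = remove_misclassified(rows, labels, lambda row: baseline)
print(baseline, len(kept))
```

With three equally sized classes the baseline settles on the first one, so every instance of the other two classes counts as misclassified and is dropped.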

Resample

The Resample filter is designed to create a new dataset by randomly selecting instances from the
original dataset. This process can be used to either reduce the size of the dataset (downsampling) or
increase it (upsampling).

Demonstration

When we applied the Resample filter to the Iris dataset, we observed that the number of instances for
the iris-setosa and iris-versicolor classes was reduced to 48 each, while the instances for the iris-virginica
class increased to 54. This change resulted in a slight imbalance in the class labels.
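
Random sampling with replacement explains why class counts drift away from 50 each. The sketch below assumes sampling with replacement and a sample size given as a percentage of the original. The function name resample, the seed and the percentage parameter are hypothetical, and the exact counts depend on the random seed.

```python
import random
from collections import Counter

def resample(rows, sample_percent=100.0, seed=1):
    """Draw a random sample with replacement, sized as a percentage of the input."""
    rng = random.Random(seed)
    size = round(len(rows) * sample_percent / 100.0)
    return rng.choices(rows, k=size)

# Class layout of Iris: 50 instances per class.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50

counts = Counter(resample(labels))
half = resample(labels, sample_percent=50.0)
print(counts, len(half))
```

The total stays at 150 for a 100% sample, but the per-class counts are free to vary, which is how an originally balanced dataset can end up slightly imbalanced.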
