Question 3:
Demonstrate the use of the following preprocessing filters in Weka, using any dataset (e.g., Iris).
(Submit a report showing the purpose and use of these filters).
Attribute Filters (supervised): AttributeSelection
Instance Filters (supervised): Resample
Attribute Filters (unsupervised): Center, Discretize, InterquartileRange, Normalize,
PrincipalComponents, Standardize, ReplaceMissingValues
Instance Filters (unsupervised): RemoveDuplicates, RemoveMisclassified, Resample
We demonstrate the preprocessing filters in Weka using the Iris dataset.
Attribute filters
Supervised attribute filters use the class labels to guide the selection of attributes: they keep the
attributes that have the strongest relationship with the target attribute and remove the others.
Attribute selection
The AttributeSelection filter selects the most relevant attributes, i.e. those with the greatest effect on
the accuracy of the model. This reduces dimensionality, can improve model performance, and reduces
overfitting.
Demonstration
Using the Iris dataset, we notice that all attributes are important because they all have a strong
relationship with the target attribute. However, let's say we decide that "sepal width" is not relevant.
We can select this attribute and apply a filter to remove it, leaving us with only three input attributes.
Instance filters (supervised)
Resample:
The `Resample` filter in Weka is used to create a new dataset by sampling instances from the original
dataset. This filter is particularly useful for handling class imbalance. Class imbalance occurs when the
number of instances in one class is significantly higher or lower than the number of instances in other
classes.
Demonstration
After applying the filter, the number of distinct values of sepal length decreases from 35 to 32, and
the number of unique values decreases from 9 to 5. The distinct and unique value counts of the other
attributes (sepal width, petal length, and petal width) are reduced in the same way, because sampling
repeats some instances and omits others. However, since the Iris classes are not imbalanced, resampling
is not needed for this dataset.
                                       Figure 1 Original Dataset
                                        Figure 2 after Resampling
Attribute filters (unsupervised):
Center:
The Center filter centers the data by subtracting the mean of each numeric attribute from its values,
so that each attribute has zero mean.
Demonstration
When we apply the center filter in Weka, the mean of every attribute is subtracted from each value, so
the mean of the attributes becomes zero. This process centers the data but does not standardize it
because the standard deviation remains unchanged. Standardization involves both centering the data
(making the mean zero) and scaling it (making the standard deviation one), which is not done by the
center filter alone.
                        SEPAL LENGTH        SEPAL WIDTH           PETAL LENGTH          PETAL WIDTH
MAX VALUE               7.9                 4.4                   6.9                   2.5
MIN VALUE               4.3                 2                     1                     0.1
MEAN                    5.843               3.054                 3.759                 1.199
STD DEV                 0.828               0.434                 1.764                 0.763
                                         ORIGINAL VALUES
                        SEPAL LENGTH        SEPAL WIDTH           PETAL LENGTH          PETAL WIDTH
MAX VALUE               2.057               1.346                 3.141                 1.301
MIN VALUE               -1.543              -1.054                -2.759                -1.099
MEAN                    0                   0                     0                     0
STD DEV                 0.828               0.434                 1.764                 0.763
                                           AFTER APPLYING FILTER
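The arithmetic behind centering can be sketched in a few lines of Python. This is only an illustration of the idea, not Weka's implementation; the function name `center` and the five sample values are assumptions made for the example and are not the full Iris column.

```python
from statistics import mean, pstdev

def center(values):
    """Subtract the mean from every value so the result has zero mean."""
    m = mean(values)
    return [v - m for v in values]

# Hypothetical sample of sepal-length values (not the full Iris column).
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
centered = center(sepal_length)

# The mean moves to (approximately) zero, the spread is unchanged.
print("mean after centering:", mean(centered))
print("std dev before:", pstdev(sepal_length))
print("std dev after: ", pstdev(centered))
```

The same subtraction explains the table above: for sepal length, 7.9 - 5.843 = 2.057 and 4.3 - 5.843 = -1.543, while the standard deviation stays at 0.828.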
Discretize
The Discretize filter in Weka converts numeric attributes into nominal (categorical) attributes by binning
the numeric values into discrete intervals. This process is known as discretization and is useful for
several reasons: compatibility with algorithms that require nominal inputs, handling non-linear
relationships, reducing sensitivity to outliers, and simplifying the data.
Demonstration
When we apply the discretizing filter in Weka, the continuous values of each attribute are converted
into discrete bins. This process makes the data easier to visualize. For example, instead of having a range
of continuous values for an attribute, we might see discrete categories or bins.
                                     Figure 3 Original dataset visualization
                                          Figure 4 After discretization
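Equal-width binning, the simplest form of discretization, can be sketched as follows. This is a minimal illustration and not Weka's code; the function name `equal_width_bins`, the choice of three bins, and the sample values are assumptions made for the example.

```python
def equal_width_bins(values, bins=10):
    """Map each numeric value to the index of an equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if width == 0:
            # All values identical: everything lands in the first bin.
            labels.append(0)
            continue
        # The maximum value would otherwise fall just past the last bin,
        # so the index is clamped to bins - 1.
        labels.append(min(int((v - lo) / width), bins - 1))
    return labels

# Hypothetical petal-length values, split into 3 bins for readability.
petal_length = [1.0, 1.4, 4.5, 6.9]
print(equal_width_bins(petal_length, bins=3))
```

Each returned index stands for one nominal category, which is why a histogram of a discretized attribute shows a small number of bars instead of a continuous range.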
Interquartile Range
The InterquartileRange filter in Weka detects outliers and extreme values using the interquartile range
(IQR) of each numeric attribute. The IQR is a measure of statistical dispersion and is defined as the
range between the first quartile (25th percentile) and the third quartile (75th percentile) of the data.
Rather than adding the IQR itself, the filter adds nominal attributes that flag instances lying far
outside this range. This information about the spread of the data is useful for various data analysis
and preprocessing tasks, especially outlier detection, feature engineering, and data exploration.
Demonstration
On applying the InterquartileRange filter, two additional attributes are added to the dataset: Outlier and
ExtremeValue. Both show that the dataset contains no outliers and no extreme values.
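The flagging rule can be sketched as below. This is an approximation under stated assumptions, not Weka's implementation: the factors 3.0 (outlier) and 6.0 (extreme value) are assumed to match the filter's defaults, the quartiles come from Python's `statistics.quantiles` (which may differ slightly from Weka's quartile calculation), and the function name `iqr_flags` and sample data are hypothetical.

```python
from statistics import quantiles

def iqr_flags(values, outlier_factor=3.0, extreme_factor=6.0):
    """Label each value 'no', 'outlier' or 'extreme' by its distance from the quartile box."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartile
    iqr = q3 - q1
    flags = []
    for v in values:
        # Distance of v outside the interval [q1, q3]; zero if v is inside it.
        dist = max(q1 - v, v - q3, 0.0)
        if dist > extreme_factor * iqr:
            flags.append("extreme")
        elif dist > outlier_factor * iqr:
            flags.append("outlier")
        else:
            flags.append("no")
    return flags

# Hypothetical data: nine ordinary values plus one far-away value.
print(iqr_flags([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
print(iqr_flags([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

With thresholds this wide, a compact dataset such as Iris produces no flagged instances, which matches the result reported above.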
Normalize
The Normalize filter in Weka rescales the values of each numeric attribute into the range 0 to 1, which
puts all attributes on a common scale.
Demonstration
When we apply the normalization filter in Weka, we observe changes in the maximum, minimum, mean,
and standard deviation values of the attributes. All the values are scaled between 0 and 1. This scaling
ensures that the maximum and minimum values of each attribute become the same (1 and 0,
respectively).
                      SEPAL LENGTH          SEPAL WIDTH           PETAL LENGTH           PETAL WIDTH
MAX VALUE             7.9                   4.4                   6.9                    2.5
MIN VALUE             4.3                   2                     1                      0.1
MEAN                  5.843                 3.054                 3.759                  1.199
STD DEV               0.828                 0.434                 1.764                  0.763
                                         ORIGINAL VALUES
                      SEPAL LENGTH          SEPAL WIDTH           PETAL LENGTH           PETAL WIDTH
MAX VALUE             1                     1                     1                      1
MIN VALUE             0                     0                     0                      0
MEAN                  0.429                 0.439                 0.468                  0.458
STD DEV               0.23                  0.181                 0.299                  0.318
                                      AFTER NORMALIZATION
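The values in the table above follow from the min-max formula (x - min) / (max - min). The short sketch below checks the sepal length column; the function name `min_max` is an assumption made for the example, and the code illustrates the arithmetic rather than Weka's implementation.

```python
def min_max(value, lo, hi):
    """Rescale value from the range [lo, hi] to the range [0, 1]."""
    return (value - lo) / (hi - lo)

# Sepal length statistics taken from the ORIGINAL VALUES table.
lo, hi, mean_value, std_dev = 4.3, 7.9, 5.843, 0.828

print("max  ->", min_max(hi, lo, hi))
print("min  ->", min_max(lo, lo, hi))
print("mean ->", round(min_max(mean_value, lo, hi), 3))
# The standard deviation is only divided by the range; it is not shifted.
print("std  ->", round(std_dev / (hi - lo), 2))
```

The mean and standard deviation computed this way agree with the 0.429 and 0.23 reported for sepal length after normalization.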
Principal Components
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a
dataset while retaining as much variability (information) as possible. It transforms the original attributes
into a new set of uncorrelated variables called principal components. These components are ordered by
the amount of variance they capture from the data, with the first principal component capturing the
most variance, the second capturing the next most, and so on.
Demonstration
On applying the PrincipalComponents filter, we observed that PCA transforms the 4-dimensional data into
2 dimensions, forming the components PC1 and PC2.
PC1 = −0.581×petallength − 0.566×petalwidth − 0.522×sepallength + 0.263×sepalwidth
This means that PC1 is a weighted sum of the original attributes, with the weights (coefficients)
indicating the contribution of each attribute to the principal component. Petal length, petal width, and
sepal length contribute negatively, while sepal width contributes positively. The overall sign of a
principal component is arbitrary, so the magnitudes and relative signs are what matter.
PC2 = 0.926×sepalwidth + 0.372×sepallength + 0.065×petalwidth + 0.021×petallength
Here all the attributes contribute positively. Sepal width has a strong positive contribution to PC2,
sepal length has a moderate positive contribution, and petal width and petal length have only a small
contribution.
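To make the weighted sums concrete, the sketch below projects one instance onto PC1 and PC2 using the coefficients reported above. The instance values are hypothetical and are assumed to be already standardized (the filter is assumed to standardize the attributes before the transformation); the names `PC1`, `PC2` and `project` are introduced only for this illustration.

```python
# Coefficients copied from the PC1 and PC2 expressions in the report.
PC1 = {"petallength": -0.581, "petalwidth": -0.566,
       "sepallength": -0.522, "sepalwidth": 0.263}
PC2 = {"sepalwidth": 0.926, "sepallength": 0.372,
       "petalwidth": 0.065, "petallength": 0.021}

def project(instance, weights):
    """Weighted sum of the attribute values: the score on one component."""
    return sum(weights[name] * instance[name] for name in weights)

# Hypothetical standardized instance (not taken from the Iris data).
instance = {"sepallength": 1.0, "sepalwidth": 0.0,
            "petallength": 1.0, "petalwidth": 1.0}

print("PC1 score:", round(project(instance, PC1), 3))
print("PC2 score:", round(project(instance, PC2), 3))

# Sanity checks on the reported coefficients: each component should have
# roughly unit length, and the two components should be roughly orthogonal.
norm1 = sum(w * w for w in PC1.values())
norm2 = sum(w * w for w in PC2.values())
dot = sum(PC1[name] * PC2[name] for name in PC1)
print(round(norm1, 3), round(norm2, 3), round(dot, 3))
```

The near-zero dot product reflects the fact that principal components are uncorrelated directions, as stated in the description of PCA.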
Standardize
The Standardize filter in Weka rescales each numeric attribute to have zero mean and unit variance.
Demonstration
On applying the filter, we observed that the values are transformed so that each attribute has mean 0
and standard deviation 1. The shape of each distribution is unchanged; only its location and scale
change. This makes the model we train less sensitive to differences in attribute scale and makes the
data easier to handle.
                     SEPAL LENGTH          SEPAL WIDTH           PETAL LENGTH          PETAL WIDTH
MAX VALUE            7.9                   4.4                   6.9                   2.5
MIN VALUE            4.3                   2                     1                     0.1
MEAN                 5.843                 3.054                 3.759                 1.199
STD DEV              0.828                 0.434                 1.764                 0.763
                                         ORIGINAL VALUES
                     SEPAL LENGTH          SEPAL WIDTH           PETAL LENGTH          PETAL WIDTH
MAX VALUE            2.484                 3.104                 1.78                  1.705
MIN VALUE            -1.864                -2.431                -1.563                -1.44
MEAN                 0                     0                     0                     0
STD DEV              1                     1                     1                     1
                                    AFTER STANDARDIZATION
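The standardized values in the table can be reproduced with the z-score formula (x - mean) / standard deviation. The sketch below checks the sepal length column; the function name `z_score` is an assumption made for the example, and the code shows only the arithmetic, not Weka's implementation.

```python
def z_score(value, mean_value, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean_value) / std_dev

# Sepal length statistics taken from the ORIGINAL VALUES table.
mean_value, std_dev = 5.843, 0.828

print("max  ->", round(z_score(7.9, mean_value, std_dev), 3))
print("min  ->", round(z_score(4.3, mean_value, std_dev), 3))
print("mean ->", z_score(mean_value, mean_value, std_dev))
```

Compared with Center, the only extra step is the division by the standard deviation, which is what brings the standard deviation of every attribute to 1.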
Replace Missing Values
The ReplaceMissingValues filter replaces the missing values in the dataset with the mean (for numeric
attributes) or the mode (for nominal attributes).
Results
As our dataset does not contain any missing values, applying this filter has no effect.
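Since Iris has no missing values, the effect of the filter is easier to see on a small made-up column. The sketch below fills gaps with the mean for numeric data and the mode for nominal data; the function name `replace_missing`, the use of `None` to mark a missing value, and the sample columns are all assumptions made for this illustration.

```python
from statistics import mean, mode

def replace_missing(column):
    """Fill None entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in column if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if v is None else v for v in column]

# Hypothetical columns with one missing value each.
print(replace_missing([1.0, None, 3.0]))
print(replace_missing(["a", None, "b", "a"]))
```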
Instance filters (unsupervised)
Remove duplicates
The RemoveDuplicates filter removes all duplicate instances from the dataset.
Demonstration
When we applied the filter to remove duplicate instances, we observed that the number of instances
was reduced from 150 to 147, indicating that 3 duplicate instances were removed. This change affected
the values of the mean and standard deviation of each attribute, as the dataset's composition was
altered.
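Duplicate removal amounts to keeping the first occurrence of each distinct row. The sketch below shows the idea on a few made-up rows; the function name `remove_duplicates` and the sample rows are assumptions for the example and are not taken from Weka.

```python
def remove_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical instances: the third row repeats the first.
rows = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
]
print(len(rows), "->", len(remove_duplicates(rows)))
```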
Remove Misclassified
This filter removes instances that are misclassified by a specified classifier. It is useful for cleaning the
dataset.
Demonstration
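On applying the filter, only the iris-setosa instances are left; all others are removed. This is the
expected outcome when the filter is run with a baseline classifier such as ZeroR, which predicts a single
class for every instance, so every instance of the other two classes counts as misclassified.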
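The result above can be reproduced with a small sketch, assuming the filter was run with a majority-class baseline such as ZeroR. The names `majority_class` and `remove_misclassified` are hypothetical, and the code is a simplification of the idea rather than Weka's implementation.

```python
from collections import Counter

def majority_class(labels):
    """Most frequent label; on a tie, the label that appears first wins."""
    return Counter(labels).most_common(1)[0][0]

def remove_misclassified(labels):
    """Keep only the instances that a majority-class baseline predicts correctly."""
    prediction = majority_class(labels)
    return [label for label in labels if label == prediction]

# Class column shaped like Iris: three classes with 50 instances each.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
kept = remove_misclassified(labels)
print(len(kept), set(kept))
```

Because the three classes are tied at 50 instances each, the baseline settles on the first class, and only those 50 instances survive. A stronger classifier would remove far fewer instances.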
Resample
The Resample filter is designed to create a new dataset by randomly selecting instances from the
original dataset. This process can be used to either reduce the size of the dataset (downsampling) or
increase it (upsampling)
Demonstration
When we applied the Resample filter to the Iris dataset, we observed that the number of instances for
the iris-setosa and iris-versicolor classes was reduced to 48 each, while the number of instances for the
iris-virginica class increased to 54. Because the unsupervised version of the filter samples instances
without regard to the class attribute, this change introduced a slight imbalance in the class labels.
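The drift in class counts can be illustrated with random sampling with replacement. The sketch below is not Weka's implementation; the function name `resample`, the seed value, and the use of Python's `random` module are assumptions, so the exact counts will differ from the 48/48/54 split reported above.

```python
import random
from collections import Counter

def resample(rows, size=None, seed=1):
    """Draw rows at random with replacement, ignoring the class label."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    k = len(rows) if size is None else size
    return rng.choices(rows, k=k)

# Class column shaped like Iris: three classes with 50 instances each.
labels = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
sample = resample(labels)

# The total stays at 150, but the per-class counts are no longer exactly 50.
print(len(sample), Counter(sample))
```

Passing a smaller `size` gives downsampling and a larger one gives upsampling, which corresponds to changing the sample size percentage in the filter's options.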